Breast cancer is not one disease, as most of us think. There are four main female breast cancer subtypes, categorized according to hormone receptor (ER and PR) and HER2 status:

- **Luminal A**: HR^{+}/HER2^{–}
- **Luminal B**: HR^{+}/HER2^{+}
- **HER2-enriched**: HR^{–}/HER2^{+}
- **Triple Negative**: HR^{–}/HER2^{–}

These biomarker proteins (HER2, ER, and PR) are encoded by three genes: *ERBB2* (Erb-B2 Receptor Tyrosine Kinase 2), *ESR1* (Estrogen Receptor 1), and *PGR* (Progesterone Receptor). Classifying breast cancers by tumor subtype could help improve treatment of the disease.

Here is an interesting review paper on the topic:

*Nature Reviews Disease Primers* **5**, Article number: 66 (2019)

In this project, we will examine whether we can subtype breast cancer tumors based on the gene expression profiles of these biomarker genes.

**Note:** The code in this project is written in R. Is there any difference between R and Python for machine learning? R is mainly used for statistical analysis, while Python provides a more general approach to data wrangling.

I divide the project into 8 steps:

**0. Prepare Data.**

Before we build any models, we need to define a problem/question and prepare the data, in particular by cleaning the raw data. Data cleaning is a critical aspect of data management: it involves reviewing all the data in a database and removing or updating information that is incomplete, incorrect, duplicated, or irrelevant.

I use two R packages for data retrieval from the TCGA database: TCGA2STAT and TCGAbiolinks.

RPKM values (Reads Per Kilobase of transcript, per Million mapped reads) are used as normalized expression levels across all the samples. The files are also available here.

You can see that I retrieved the gene expression profiles for the PAM50 genes (What is PAM50 for breast cancer?) rather than the three biomarkers, because the PAM50 genes are more informative and accurate for classifying breast tumors. To simplify the project, we reduce the attributes to 3.

In total, we have downloaded 1,081 BRCA samples with corresponding subtype annotations.

```
library("TCGA2STAT")
library("TCGAbiolinks")
options(timeout=10000)
# Retrieve RNA-Seq expression (RPKM and raw counts) plus clinical data for BRCA
rnaseq.RPKM.BRCA <- getTCGA(disease="BRCA", data.type="RNASeq2", type="RPKM", clinical = TRUE)
rnaseq.count.BRCA <- getTCGA(disease="BRCA", data.type="RNASeq2", type="count", clinical = TRUE)
# Retrieve clinical supplements and PAM50 subtype annotations
query_BRCA <- GDCquery(project = "TCGA-BRCA",
                       data.category = "Clinical",
                       data.type = "Clinical Supplement",
                       data.format = "BCR Biotab")
GDCdownload(query_BRCA)
clinical.BCRtab.all_BRCA <- GDCprepare(query_BRCA)
BRCA.subtype <- TCGAquery_subtype(tumor = "brca")
BRCA_Subtypes <- data.frame(Patients=BRCA.subtype$patient, Subtypes=BRCA.subtype$BRCA_Subtype_PAM50)
BRCA_Subtypes <- BRCA_Subtypes[BRCA_Subtypes$Subtypes!="NA",]
# Extract the expression of the PAM50 genes
BRCA_Expression <- rnaseq.RPKM.BRCA$dat
PAM50_Genes <- read.table("PAM50_Genes.txt", header = FALSE, col.names = "Genes")
PAM50_Expression <- BRCA_Expression[PAM50_Genes$Genes,]
# Keep one sample per patient by trimming the TCGA barcode suffixes
PAM50_Expression_unique <- PAM50_Expression[,!duplicated(gsub("-01A-.*|-01B-.*|-11A-.*|-11B-.*|-06A-.*|-06B-.*","",colnames(PAM50_Expression)))]
colnames(PAM50_Expression_unique) <- gsub("-01A-.*|-01B-.*|-11A-.*|-11B-.*|-06A-.*|-06B-.*","",colnames(PAM50_Expression_unique))
# Keep only samples with both expression data and subtype annotations
Overlap_Samples <- intersect(BRCA_Subtypes$Patients, colnames(PAM50_Expression_unique))
PAM50_Expression_clean <- PAM50_Expression_unique[,match(Overlap_Samples, colnames(PAM50_Expression_unique))]
PAM50_Subtype_clean <- BRCA_Subtypes[match(Overlap_Samples, BRCA_Subtypes$Patients),]
View(PAM50_Expression_clean)
write.table(PAM50_Expression_clean, file = "BRCA_PAM50_Expression.txt", sep = ",", col.names = TRUE, row.names = TRUE, quote = FALSE)
write.table(PAM50_Subtype_clean, file = "BRCA_Subtypes.txt", sep = ",", col.names = TRUE, row.names = TRUE, quote = FALSE)
```
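As a quick illustration of the barcode trimming above, the gsub() call cuts a full sample-level barcode down to its patient-level ID (the barcode below is an example-style one, not necessarily present in the data):

```r
barcode <- "TCGA-A1-A0SB-01A-11R-A144-07"  # example sample-level TCGA barcode
gsub("-01A-.*|-01B-.*|-11A-.*|-11B-.*|-06A-.*|-06B-.*", "", barcode)
# "TCGA-A1-A0SB": the patient-level ID
```

The suffixes matched by the pattern (01/06/11 plus vial letter) distinguish primary-tumor, metastatic, and normal samples; stripping them lets us deduplicate to one sample per patient.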

**1. Load Data.**

Next, we load the data we retrieved and extract the expression profiles of *ERBB2*, *ESR1*, and *PGR*. log1p() computes the natural logarithm of x + 1 for every element of the input; it is the inverse of expm1(), which computes exp(x) – 1.

```
# Load data
BRCA_PAM50_Expression<-read.table("BRCA_PAM50_Expression.txt",sep = ",",header = TRUE)
colnames(BRCA_PAM50_Expression)<-gsub("\\.","-",colnames(BRCA_PAM50_Expression))
BRCA_Subtypes<-read.table("BRCA_Subtypes.txt",sep = ",",header = TRUE)
# Make sure the samples are identical between the two data sets
all(BRCA_Subtypes$Patients==colnames(BRCA_PAM50_Expression))
# Extract the three biomarker genes and log-transform their expression
Subtype_Biomarkers<-c("ERBB2","ESR1","PGR")
Biomarkers_Expression<-BRCA_PAM50_Expression[Subtype_Biomarkers,]
Biomarkers_Expression<-as.data.frame(t(Biomarkers_Expression))
Biomarkers_Expression$Subtypes<-BRCA_Subtypes$Subtypes
Biomarkers_Expression[,1:3]<-log1p(Biomarkers_Expression[,1:3])
Biomarkers_Expression$Subtypes<-as.factor(Biomarkers_Expression$Subtypes)
```
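As a quick sanity check on the transform, log1p() and expm1() are exact inverses in base R:

```r
x <- c(0, 0.5, 100, 10000)   # example RPKM-like values
y <- log1p(x)                # natural log of (x + 1)
all.equal(expm1(y), x)       # TRUE: expm1() undoes log1p()
```

The +1 offset keeps zero RPKM values defined (log1p(0) is 0) while compressing the long right tail of the expression distribution.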

**2. Define Model.**

The `caret` package (short for **C**lassification **A**nd **RE**gression **T**raining) is a set of functions that attempt to streamline the process of creating predictive models. There are ~200 available models; we selected five commonly used ones:

- Linear Discriminant Analysis (lda)
- Classification And Regression Tree (cart)
- k-nearest neighbors (knn)
- Support vector machines (svm)
- Random Forest (rf)

The sample size of the breast cancer tumors is large enough to allow us to split the data into Training_Data (80%; 866 samples) and Validation_Data (20%; 215 samples).

```
library(caret)
# create a list of 80% of the rows in the original dataset for training
training_index <- createDataPartition(Biomarkers_Expression$Subtypes, p=0.80, list=FALSE)
# 80% for training
Training_Data<-Biomarkers_Expression[training_index,]
# 20% for validation
Validation_Data <- Biomarkers_Expression[-training_index,]
```

**3. Compile Model.**

We explored the Training_Data.

Next, 10-fold cross-validation was set up for error estimation.

**Note:** As there is never enough data to train a model, removing part of it for validation poses a risk of underfitting. By reducing the training data, we risk losing important patterns/trends in the data set, which in turn increases the error induced by bias. What we require is a method that provides ample data for training the model and also leaves ample data for validation. K-fold cross-validation does exactly that.
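As a minimal sketch of the idea (with made-up labels, not the BRCA data), caret's createFolds() splits the samples into 10 disjoint folds; each fold is held out once while the remaining nine train the model:

```r
library(caret)
set.seed(600)
# 100 toy samples with four made-up class labels
labels <- factor(rep(c("LumA", "LumB", "Her2", "Basal"), times = 25))
folds <- createFolds(labels, k = 10)    # 10 disjoint sets of held-out indices
length(folds)                           # 10
sum(lengths(folds)) == length(labels)   # TRUE: each sample held out exactly once
```

createFolds() also stratifies by class, so each fold roughly preserves the overall subtype proportions.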

```
# list types for each attribute
sapply(Training_Data, class)
# take a peek at the first six rows of the data
head(Training_Data)
# list the levels for the class
levels(Training_Data$Subtypes)
# summarize the class distribution
percentage <- prop.table(table(Training_Data$Subtypes)) * 100
cbind(freq=table(Training_Data$Subtypes), percentage=percentage)
# summarize attribute distributions
summary(Training_Data)
# split input and output
x <- Training_Data[,1:3]
y <- Training_Data[,4]
colors<-c("darkred","lightblue","lightgreen")
par(mfrow=c(1,3))
for(i in 1:3) {
boxplot(x[,i], main=colnames(Training_Data)[i],col = colors[i])
}
# barplot for class breakdown
plot(y,col =1:5)
# scatterplot matrix
featurePlot(x=x, y=y, plot="ellipse",no.legend = FALSE)
# box and whisker plots for each attribute
featurePlot(x=x, y=y, plot="box")
# density plots for each attribute by class value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
```

**4. Fit Model.**

The `caret` package has several functions that attempt to streamline the model building and evaluation process. The *train* function can be used to:

- evaluate, using resampling, the effect of model tuning parameters on performance
- choose the “optimal” model across these parameters
- estimate model performance from a training set

```
# Linear Discriminant Analysis (lda)
set.seed(600)
fit.lda <- train(Subtypes~., data=Training_Data, method="lda", metric=metric, trControl=control)
# Classification And Regression Tree (cart)
set.seed(600)
fit.cart <- train(Subtypes~., data=Training_Data, method="rpart", metric=metric, trControl=control)
# k-nearest neighbors (knn)
set.seed(600)
fit.knn <- train(Subtypes~., data=Training_Data, method="knn", metric=metric, trControl=control)
# Support vector machines (svm)
set.seed(600)
fit.svm <- train(Subtypes~., data=Training_Data, method="svmRadial", metric=metric, trControl=control)
# Random Forest (rf)
set.seed(600)
fit.rf <- train(Subtypes~., data=Training_Data, method="rf", metric=metric, trControl=control)
```

**5. Evaluate Model.**

The summary statistics show that svm outperforms the other methods.

```
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
dotplot(results)
# summarize Best Model
print(fit.svm)
```

**6. Make Predictions.**

We made predictions on the Validation_Data using the SVM model.

**Note:** Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total observations. One may think that high accuracy means the model is best. Accuracy is a great measure, but only for symmetric datasets where the numbers of false positives and false negatives are almost the same. Therefore, you have to look at other parameters to evaluate the performance of your model.

The F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. It is not as intuitive as accuracy, but F1 is usually more useful, especially with an uneven class distribution. Accuracy works best when false positives and false negatives have similar costs; when those costs differ substantially, it is better to look at both precision and recall.

```
# estimate skill of SVM on the validation data set
predictions <- predict(fit.svm, Validation_Data)
confusionMatrix(predictions, Validation_Data$Subtypes,mode = "everything")
```
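To make these definitions concrete, here is a toy binary calculation with made-up counts (not our BRCA results); confusionMatrix() with mode = "everything" reports the same per-class quantities for our multi-class problem:

```r
# Hypothetical counts for a single class
TP <- 40; FP <- 10; FN <- 20
precision <- TP / (TP + FP)                          # fraction of positive calls that are right
recall    <- TP / (TP + FN)                          # fraction of true positives recovered
F1 <- 2 * precision * recall / (precision + recall)  # harmonic mean of the two
round(c(precision = precision, recall = recall, F1 = F1), 3)
# precision = 0.800, recall = 0.667, F1 = 0.727
```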

**7. Save Models for the Future.**

`save.image("Machine_Learning_Project_3_-_Comparison_of_Five_Machine_Learning_Methods_in_Subtyping_Breast_Cancer_Tumors_Based_on_Gene_Expression_of_Biomarkers.RData")`

**Congrats!**

**All-in-One Codes:**

```
# Load data
BRCA_PAM50_Expression<-read.table("BRCA_PAM50_Expression.txt",sep = ",",header = TRUE)
colnames(BRCA_PAM50_Expression)<-gsub("\\.","-",colnames(BRCA_PAM50_Expression))
BRCA_Subtypes<-read.table("BRCA_Subtypes.txt",sep = ",",header = TRUE)
# Make sure the samples are identical between the two data sets
all(BRCA_Subtypes$Patients==colnames(BRCA_PAM50_Expression))
Subtype_Biomarkers<-c("ERBB2","ESR1","PGR")
Biomarkers_Expression<-BRCA_PAM50_Expression[Subtype_Biomarkers,]
Biomarkers_Expression<-as.data.frame(t(Biomarkers_Expression))
Biomarkers_Expression$Subtypes<-BRCA_Subtypes$Subtypes
Biomarkers_Expression[,1:3]<-log1p(Biomarkers_Expression[,1:3])
Biomarkers_Expression$Subtypes<-as.factor(Biomarkers_Expression$Subtypes)
library(caret)
# create a list of 80% of the rows in the original dataset for training
training_index <- createDataPartition(Biomarkers_Expression$Subtypes, p=0.80, list=FALSE)
# 80% for training
Training_Data<-Biomarkers_Expression[training_index,]
# 20% for validation
Validation_Data <- Biomarkers_Expression[-training_index,]
# list types for each attribute
sapply(Training_Data, class)
# take a peek at the first 5 rows of the data
head(Training_Data)
# list the levels for the class
levels(Training_Data$Subtypes)
# summarize the class distribution
percentage <- prop.table(table(Training_Data$Subtypes)) * 100
cbind(freq=table(Training_Data$Subtypes), percentage=percentage)
# summarize attribute distributions
summary(Training_Data)
# split input and output
x <- Training_Data[,1:3]
y <- Training_Data[,4]
colors<-c("darkred","lightblue","lightgreen")
par(mfrow=c(1,3))
for(i in 1:3) {
boxplot(x[,i], main=colnames(Training_Data)[i],col = colors[i])
}
# barplot for class breakdown
plot(y,col =1:5)
# scatterplot matrix
featurePlot(x=x, y=y, plot="ellipse",no.legend = FALSE)
# box and whisker plots for each attribute
featurePlot(x=x, y=y, plot="box")
# density plots for each attribute by class value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
# Linear Discriminant Analysis (lda)
set.seed(600)
fit.lda <- train(Subtypes~., data=Training_Data, method="lda", metric=metric, trControl=control)
# Classification And Regression Tree (cart)
set.seed(600)
fit.cart <- train(Subtypes~., data=Training_Data, method="rpart", metric=metric, trControl=control)
# k-nearest neighbors (knn)
set.seed(600)
fit.knn <- train(Subtypes~., data=Training_Data, method="knn", metric=metric, trControl=control)
# Support vector machines (svm)
set.seed(600)
fit.svm <- train(Subtypes~., data=Training_Data, method="svmRadial", metric=metric, trControl=control)
# Random Forest (rf)
set.seed(600)
fit.rf <- train(Subtypes~., data=Training_Data, method="rf", metric=metric, trControl=control)
# summarize accuracy of models
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
# compare accuracy of models
dotplot(results)
# summarize Best Model
print(fit.svm)
# estimate skill of SVM on the validation data set
predictions <- predict(fit.svm, Validation_Data)
confusionMatrix(predictions, Validation_Data$Subtypes,mode = "everything")
save.image("Machine_Learning_Project_3_-_Comparison_of_Five_Machine_Learning_Methods_in_Subtyping_Breast_Cancer_Tumors_Based_on_Gene_Expression_of_Biomarkers.RData")
```
