Machine Learning Project in Oncology 4 – Building Machine Learning Models for Subtyping Breast Cancer Tumors Based on Gene Expression of Biomarkers (Python Version)

In Project 3, we built machine learning models for subtyping breast cancer tumors based on the gene expression of biomarkers in R.

Machine Learning Project 3

In this project, we will carry out the same workflow in Python, divided into 8 steps:

0. Prepare Data.

We use the following datasets:

  1. BRCA_PAM50_Expression.txt: the gene expression profiles of the PAM50 genes across 1081 TCGA BRCA samples.
  2. BRCA_Subtypes.txt: the subtype annotations of the BRCA samples.

1. Load Data.

The data was loaded using the pandas package.

The data matrix looks like this:

The boxplot of the gene expression shows that these genes have different ranges.

Note: The data matrix will be normalized using sklearn.preprocessing.

The plots of the gene expression of 5 biomarkers show that these genes contain different information.

The percentages of the subtypes in the TCGA BRCA cohort agree with those of the total breast cancer population.
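The proportions can be checked directly from the annotation file (a minimal sketch; it reads the same BRCA_Subtypes.txt that is loaded in the code below):

# Sketch: subtype proportions in the TCGA BRCA cohort
import pandas as pd

annot = pd.read_csv('BRCA_Subtypes.txt', delimiter=',')
print(annot.Subtypes.value_counts(normalize=True).round(3))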

#importing the libraries
import matplotlib.pyplot as plt
import pandas as pd
import time
from sklearn.preprocessing import LabelEncoder
#from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
t0 = time.time()

# load the dataset

# a model to map rows of input variables (X) to an output variable (y),
# which we often summarize as y = f(X).

# Input Variables (X):
# Gene expression profiles, with rows as cancer samples and columns as genes

X = pd.read_csv('BRCA_PAM50_Expression.txt',delimiter=',')
X=X.transpose()
X.head(5)
color = {"boxes": "DarkGreen","whiskers": "DarkOrange","medians": "DarkBlue","caps": "Gray"}
X.plot.box(rot=90,fontsize='medium',color=color, sym="r+")
X.iloc[:,0:5].plot(subplots=True, figsize=(8, 8),rot=45,xlabel="Samples",ylabel="RPKM Values")
X.iloc[0:5,0:5].plot.barh()
# Scatterplot of two biomarker genes
plt.scatter(X.ERBB2,X.ESR1)
print("Breast Cancer data set dimensions : {}".format(X.shape))

# Output Variables (y):
# Class variable (five subtype labels, encoded as 0-4 below)
annot = pd.read_csv('BRCA_Subtypes.txt',delimiter=',')
y=annot.Subtypes
y.head(5)
type_sum=pd.DataFrame(y.value_counts())
type_sum.plot.pie(subplots=True, figsize=(15, 15))

#Encoding categorical data values
labelencoder_Y = LabelEncoder()
labelencoder_Y.fit(y.unique())
list(labelencoder_Y.classes_)
labelencoder_Y.transform(y.unique())
list(labelencoder_Y.inverse_transform([0,1,2,3,4]))
# {'Basal':0, 'Her2':1, 'LumA':2, 'LumB':3, 'Normal':4}
y = labelencoder_Y.fit_transform(y)

2. Define Model.

There are ~100 supervised learning models in the sklearn package. I selected 7 popular classification methods; a sketch collecting them in one place follows the links below.

Here is a paper for extended reading.

Scikit-learn cheat sheet: methods for classification & regression
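For convenience, the seven classifiers can be kept in a single dictionary and handled uniformly (a sketch, not part of the original step-by-step code; the constructor arguments match those used in Step 4):

# A sketch: the seven classifiers in one dictionary for uniform handling
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    'lr': LogisticRegression(random_state=42),
    'kNN': KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2),
    'lSVM': SVC(kernel='linear', random_state=42),
    'rbf': SVC(kernel='rbf', random_state=42),
    'nb': GaussianNB(),
    'dt': DecisionTreeClassifier(criterion='entropy', random_state=42),
    'rf': RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=42),
}

The fitting and evaluation code in Steps 4-6 could then loop over classifiers.items(); the explicit step-by-step version is kept below for readability.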

3. Compile Model.

The data was split into a training set (80%; 864 samples) and a test set (20%; 217 samples). The data was then scaled, since, as the boxplot showed, these attributes have different ranges.

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

4. Fit Model.

Model fit describes how well a machine learning model generalizes to data similar to that on which it was trained. A well-fitted model produces more accurate outcomes; an overfitted model matches the training data too closely, while an underfitted model does not match it closely enough.
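One simple diagnostic (a sketch; it assumes the classifier_rf model fitted later in this step and the train/test split from Step 3) is to compare training and test accuracy: a large gap suggests overfitting, while two low scores suggest underfitting.

# Sketch: compare training vs. test accuracy to spot over- or underfitting
# (assumes classifier_rf, X_train, Y_train, X_test, Y_test from Steps 3-4)
train_acc = classifier_rf.score(X_train, Y_train)
test_acc = classifier_rf.score(X_test, Y_test)
print("Train accuracy: {:.3f}, Test accuracy: {:.3f}".format(train_acc, test_acc))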

We selected 7 machine learning algorithms, each with a basic set of parameters that can be tuned to improve its accuracy. During fitting, you run an algorithm on data for which the target variable is known ("labeled" data) to produce a trained machine learning model.

Note: These models usually have different hyper-parameters.
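For example, the hyper-parameters of one model can be explored with a cross-validated grid search (a sketch using sklearn's GridSearchCV; the parameter grid below is illustrative, not tuned):

# Sketch: grid search over C and gamma for the RBF-kernel SVM
# (illustrative parameter grid; assumes X_train and Y_train from Step 3)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1]}
grid = GridSearchCV(SVC(kernel='rbf', random_state=42), param_grid, cv=5, scoring='f1_micro')
grid.fit(X_train, Y_train)
print("Best parameters:", grid.best_params_)
print("Best cross-validated micro-F1: {:.3f}".format(grid.best_score_))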


# Fit a logistic regression classifier to the training set
from sklearn.linear_model import LogisticRegression
classifier_lr = LogisticRegression(random_state = 42)
classifier_lr.fit(X_train, Y_train)

# Fit a k-nearest neighbors (kNN) classifier
from sklearn.neighbors import KNeighborsClassifier
classifier_knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier_knn.fit(X_train, Y_train)

# Fit a linear support vector machine (SVC with kernel='linear')
from sklearn.svm import SVC
classifier_lSVM = SVC(kernel = 'linear', random_state = 42)
classifier_lSVM.fit(X_train, Y_train)

# Fit a kernel SVM (SVC with kernel='rbf')
from sklearn.svm import SVC
classifier_rbf = SVC(kernel = 'rbf', random_state = 42)
classifier_rbf.fit(X_train, Y_train)

# Fit a Gaussian naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
classifier_nb = GaussianNB()
classifier_nb.fit(X_train, Y_train)

# Fit a decision tree classifier
from sklearn.tree import DecisionTreeClassifier
classifier_dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
classifier_dt.fit(X_train, Y_train)

# Fit a random forest classifier
from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
classifier_rf.fit(X_train, Y_train)

5. Evaluate Model and 6. Make Predictions.

As we can see, the data is imbalanced, so accuracy alone is not an appropriate metric. We adopted the F1 score for model evaluation, first with the "micro" average, which calculates metrics globally by counting the total true positives, false negatives, and false positives. (Note that for single-label multiclass data, the micro-averaged F1 coincides with accuracy, so the per-class scores below are the more informative view.)

Note that the individual labels have different F1-scores; for example, the F1-score for the Basal subtype in the random forest model is 0.91.
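A confusion matrix and classification report give the same per-class view in tabular form (a sketch; it uses the Y_pred_rf predictions computed just below, and the confusion_matrix import that is commented out at the top):

# Sketch: per-class breakdown for the random forest predictions
# (assumes Y_test and Y_pred_rf from the code below)
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(Y_test, Y_pred_rf))
print(classification_report(Y_test, Y_pred_rf,
                            target_names=['Basal','Her2','LumA','LumB','Normal']))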

Y_pred_lr = classifier_lr.predict(X_test)
Y_pred_knn = classifier_knn.predict(X_test)
Y_pred_lSVM = classifier_lSVM.predict(X_test)
Y_pred_rbf = classifier_rbf.predict(X_test)
Y_pred_nb = classifier_nb.predict(X_test)
Y_pred_dt = classifier_dt.predict(X_test)
Y_pred_rf = classifier_rf.predict(X_test)


# Micro-averaged F1 score for each method
F1_lr = f1_score(Y_test,Y_pred_lr,average='micro')
F1_knn = f1_score(Y_test,Y_pred_knn,average='micro')
F1_lSVM = f1_score(Y_test,Y_pred_lSVM,average='micro')
F1_rbf = f1_score(Y_test,Y_pred_rbf,average='micro')
F1_nb = f1_score(Y_test,Y_pred_nb,average='micro')
F1_dt = f1_score(Y_test,Y_pred_dt,average='micro')
F1_rf = f1_score(Y_test,Y_pred_rf,average='micro')
methods = ['lr','kNN','lSVM','rbf','nb','dt','rf']
F1_Scores = [F1_lr,F1_knn,F1_lSVM,F1_rbf,F1_nb,F1_dt,F1_rf]

# Per-class F1 scores (average=None returns one score per subtype)
F1_lr = f1_score(Y_test,Y_pred_lr,average=None)
F1_knn = f1_score(Y_test,Y_pred_knn,average=None)
F1_lSVM = f1_score(Y_test,Y_pred_lSVM,average=None)
F1_rbf = f1_score(Y_test,Y_pred_rbf,average=None)
F1_nb = f1_score(Y_test,Y_pred_nb,average=None)
F1_dt = f1_score(Y_test,Y_pred_dt,average=None)
F1_rf = f1_score(Y_test,Y_pred_rf,average=None)
F1_Scores = [F1_lr,F1_knn,F1_lSVM,F1_rbf,F1_nb,F1_dt,F1_rf]
F1_Scores_df = pd.DataFrame(F1_Scores)
F1_Scores_df.columns = ['Basal','Her2','LumA','LumB','Normal']
F1_Scores_df.index = ['lr','kNN','lSVM','rbf','nb','dt','rf']
F1_Scores_df.plot.bar(rot=90,fontsize='medium',xlabel="Algorithms",ylabel="F1-Scores")

t1 = time.time()
Time_spent=t1-t0
print("Time spent in Project:",Time_spent," Seconds")


7. Save Models for the Future.

I recommend Spyder as a Python IDE (integrated development environment). Of course, the "best" IDE mostly depends upon your preferences: everybody has their own, and will favor one IDE over another based upon how well the environment matches the developer's way of working.
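To actually save the fitted models for future use, a minimal sketch with joblib (the file names here are arbitrary) could look like this:

# A minimal sketch: persist a fitted model (and the scaler) with joblib.
# File names are arbitrary; assumes classifier_rf and sc from the steps above.
import joblib

joblib.dump(classifier_rf, 'brca_pam50_rf.joblib')
joblib.dump(sc, 'brca_pam50_scaler.joblib')  # new samples must be scaled the same way

# ... later, in a new session:
classifier_rf = joblib.load('brca_pam50_rf.joblib')
sc = joblib.load('brca_pam50_scaler.joblib')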

Congrats!

All-in-One Code:

#importing the libraries
import matplotlib.pyplot as plt
import pandas as pd
import time
from sklearn.preprocessing import LabelEncoder
#from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
t0 = time.time()

# load the dataset

# a model to map rows of input variables (X) to an output variable (y),
# which we often summarize as y = f(X).

# Input Variables (X):
# Gene expression profiles, with rows as cancer samples and columns as genes

X = pd.read_csv('BRCA_PAM50_Expression.txt',delimiter=',')
X=X.transpose()
X.head(5)
color = {"boxes": "DarkGreen","whiskers": "DarkOrange","medians": "DarkBlue","caps": "Gray"}
X.plot.box(rot=90,fontsize='medium',color=color, sym="r+")
X.iloc[:,0:5].plot(subplots=True, figsize=(8, 8),rot=45,xlabel="Samples",ylabel="RPKM Values")
X.iloc[0:5,0:5].plot.barh()
# Scatterplot of two biomarker genes
plt.scatter(X.ERBB2,X.ESR1)
print("Breast Cancer data set dimensions : {}".format(X.shape))

# Output Variables (y):
# Class variable (five subtype labels, encoded as 0-4 below)
annot = pd.read_csv('BRCA_Subtypes.txt',delimiter=',')
y=annot.Subtypes
y.head(5)
type_sum=pd.DataFrame(y.value_counts())
type_sum.plot.pie(subplots=True, figsize=(15, 15))

#Encoding categorical data values
labelencoder_Y = LabelEncoder()
labelencoder_Y.fit(y.unique())
list(labelencoder_Y.classes_)
labelencoder_Y.transform(y.unique())
list(labelencoder_Y.inverse_transform([0,1,2,3,4]))
# {'Basal':0, 'Her2':1, 'LumA':2, 'LumB':3, 'Normal':4}
y = labelencoder_Y.fit_transform(y)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit a logistic regression classifier to the training set
from sklearn.linear_model import LogisticRegression
classifier_lr = LogisticRegression(random_state = 42)
classifier_lr.fit(X_train, Y_train)

# Fit a k-nearest neighbors (kNN) classifier
from sklearn.neighbors import KNeighborsClassifier
classifier_knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier_knn.fit(X_train, Y_train)

# Fit a linear support vector machine (SVC with kernel='linear')
from sklearn.svm import SVC
classifier_lSVM = SVC(kernel = 'linear', random_state = 42)
classifier_lSVM.fit(X_train, Y_train)

# Fit a kernel SVM (SVC with kernel='rbf')
from sklearn.svm import SVC
classifier_rbf = SVC(kernel = 'rbf', random_state = 42)
classifier_rbf.fit(X_train, Y_train)

# Fit a Gaussian naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
classifier_nb = GaussianNB()
classifier_nb.fit(X_train, Y_train)

# Fit a decision tree classifier
from sklearn.tree import DecisionTreeClassifier
classifier_dt = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
classifier_dt.fit(X_train, Y_train)

# Fit a random forest classifier
from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
classifier_rf.fit(X_train, Y_train)

Y_pred_lr = classifier_lr.predict(X_test)
Y_pred_knn = classifier_knn.predict(X_test)
Y_pred_lSVM = classifier_lSVM.predict(X_test)
Y_pred_rbf = classifier_rbf.predict(X_test)
Y_pred_nb = classifier_nb.predict(X_test)
Y_pred_dt = classifier_dt.predict(X_test)
Y_pred_rf = classifier_rf.predict(X_test)


# Micro-averaged F1 score for each method
F1_lr = f1_score(Y_test,Y_pred_lr,average='micro')
F1_knn = f1_score(Y_test,Y_pred_knn,average='micro')
F1_lSVM = f1_score(Y_test,Y_pred_lSVM,average='micro')
F1_rbf = f1_score(Y_test,Y_pred_rbf,average='micro')
F1_nb = f1_score(Y_test,Y_pred_nb,average='micro')
F1_dt = f1_score(Y_test,Y_pred_dt,average='micro')
F1_rf = f1_score(Y_test,Y_pred_rf,average='micro')
methods = ['lr','kNN','lSVM','rbf','nb','dt','rf']
F1_Scores = [F1_lr,F1_knn,F1_lSVM,F1_rbf,F1_nb,F1_dt,F1_rf]

# Per-class F1 scores (average=None returns one score per subtype)
F1_lr = f1_score(Y_test,Y_pred_lr,average=None)
F1_knn = f1_score(Y_test,Y_pred_knn,average=None)
F1_lSVM = f1_score(Y_test,Y_pred_lSVM,average=None)
F1_rbf = f1_score(Y_test,Y_pred_rbf,average=None)
F1_nb = f1_score(Y_test,Y_pred_nb,average=None)
F1_dt = f1_score(Y_test,Y_pred_dt,average=None)
F1_rf = f1_score(Y_test,Y_pred_rf,average=None)
F1_Scores = [F1_lr,F1_knn,F1_lSVM,F1_rbf,F1_nb,F1_dt,F1_rf]
F1_Scores_df = pd.DataFrame(F1_Scores)
F1_Scores_df.columns = ['Basal','Her2','LumA','LumB','Normal']
F1_Scores_df.index = ['lr','kNN','lSVM','rbf','nb','dt','rf']
F1_Scores_df.plot.bar(rot=90,fontsize='medium',xlabel="Algorithms",ylabel="F1-Scores")

t1 = time.time()
Time_spent=t1-t0
print("Time spent in Project:",Time_spent," Seconds")
