Machine Learning Project in Oncology 1 – Deep Learning-based Identification of Prostate Cancer using TCGA RNA-seq

Prostate cancer (PRAD) is the most common non-skin cancer in American men; in the United States, about 1 in 8 men will be diagnosed with prostate cancer in his lifetime.

The challenge of classifying PRAD and normal tissues based on gene expression data has been tackled with diverse machine learning methods, such as self-organizing maps and deep learning [1].

In this project, I applied deep learning to efficiently find patterns in a large TCGA RNA-seq dataset that distinguish prostate cancer from normal tissue.

Let’s go through the project step by step (the complete code is tied together at the end).

Step 0. Data preparation

Before we rush into building models, we need to prepare clean data. That is, a quality control (QC) process is required to provide routine and consistent checks of the study design and the data themselves.

As for RNA-seq data in TCGA, an essential step is normalization, in which raw data are adjusted to account for factors that prevent direct comparison of expression measures. Here, I retrieved the reads per kilobase of transcript per million mapped reads (RPKM) values [2] for PRAD from TCGA.
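
For reference, RPKM divides each gene's read count by the gene length (in kilobases) and by the total number of mapped reads (in millions). Below is a minimal sketch of that computation, assuming a raw count matrix counts (samples as rows, genes as columns) and a gene_lengths_bp vector are available; both names are hypothetical and not part of this project's files.

import pandas as pd

# counts: DataFrame of raw read counts, samples as rows, genes as columns
# gene_lengths_bp: Series of gene lengths in base pairs, indexed by gene name
def rpkm(counts: pd.DataFrame, gene_lengths_bp: pd.Series) -> pd.DataFrame:
    total_reads_millions = counts.sum(axis=1) / 1e6   # per-sample library size, in millions
    lengths_kb = gene_lengths_bp / 1e3                # gene lengths, in kilobases
    # divide each row by its library size, then each column by its gene length
    return counts.div(total_reads_millions, axis=0).div(lengths_kb, axis=1)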

The RPKM expression profile (PRAD_TCGA_RNA_seq.txt) has genes as columns and samples as rows.

The sample annotation file (PRAD_TCGA_Types.txt) contains type information, including cancer (1) and normal (0). Both files are available from the Kaggle dataset at https://www.kaggle.com/datasets/zhaoyuqi616/machine-learning-project-1.

(Figure: RPKM matrix for PRAD.)

Step 1. Load Data.

There are many Python packages for importing data, such as NumPy and pandas. Here I use read_csv from pandas. X is the expression profile, while y is the cancer type (1/0).

import pandas as pd

# expression profile: samples as rows, genes as columns
X = pd.read_csv('PRAD_TCGA_RNA_seq.txt', delimiter=',')
# sample annotations: cancer (1) or normal (0)
y = pd.read_csv('PRAD_TCGA_Types.txt', delimiter=',')
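
Before moving on, a few quick sanity checks are worthwhile; a short sketch (the Types column name comes from the annotation file used later in this post):

print(X.shape, y.shape)          # samples x genes, and samples x labels
print(X.isnull().sum().sum())    # number of missing values in the expression matrix
print(y.Types.value_counts())    # class balance: cancer (1) vs normal (0)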

Step 2. Define Keras Model.

Keras is a powerful and easy-to-use free open source Python library for developing and evaluating deep learning models.

It wraps the efficient numerical computation libraries Theano and TensorFlow and allows you to define and train neural network models in just a few lines of code.

Models in Keras are defined as a sequence of layers. Here we create a Sequential model and add layers one at a time until we are happy with our network architecture.

How do we know the number of layers and their types?
There are heuristics that we can use and often the best network structure is found through a process of trial and error experimentation.
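
As a rough sketch of what that trial-and-error search could look like (the candidate layer sizes below are arbitrary examples, not recommendations; depending on your Keras version the validation accuracy key may be 'val_accuracy' or 'val_acc'):

from keras.models import Sequential
from keras.layers import Dense

# candidate hidden-layer configurations to compare (arbitrary examples)
candidates = [(500, 100), (1000, 500, 100), (2000, 1000, 500)]
for sizes in candidates:
    m = Sequential()
    m.add(Dense(sizes[0], input_dim=577, activation='relu'))
    for n in sizes[1:]:
        m.add(Dense(n, activation='relu'))
    m.add(Dense(1, activation='sigmoid'))
    m.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # hold out 20% of the samples for validation
    history = m.fit(X, y, epochs=100, batch_size=200, validation_split=0.2, verbose=0)
    print(sizes, 'validation accuracy: %.3f' % history.history['val_accuracy'][-1])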

This practice uses a fully-connected network structure with three hidden layers and a single output node:

  • The model expects rows of data with 577 variables (the number of genes; the input_dim=577 argument).
  • The first hidden layer has 1000 nodes and uses the relu activation function.
  • The second hidden layer has 500 nodes and uses the relu activation function.
  • The third hidden layer has 100 nodes and uses the relu activation function.
  • The output layer has one node and uses the sigmoid activation function.
from keras.models import Sequential
from keras.layers import Dense

# three hidden layers with relu activations and a sigmoid output for binary classification
model = Sequential()
model.add(Dense(1000, input_dim=577, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
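
To double-check the layer shapes and the number of trainable parameters, you can print a summary of the model (a standard Keras call, not part of the original listing):

model.summary()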

Step 3. Compile Keras Model.

Compiling the model uses the efficient numerical libraries under the covers (the so-called backend), such as Theano or TensorFlow. The backend automatically chooses the best way to represent the network for training and making predictions on your hardware, such as a CPU, a GPU, or even a distributed setup.

We must specify
1) the loss function used to evaluate a set of weights,
2) the optimizer used to search through different weights for the network,
3) and any optional metrics we would like to collect and report during training.

In this case, I will use cross entropy as the loss argument. This loss is for binary classification problems and is defined in Keras as "binary_crossentropy". You can find more about how to choose loss functions for your problem on the Machine Learning Mastery blog.

We will define the optimizer as the efficient stochastic gradient descent algorithm "adam". This is a popular version of gradient descent because it automatically tunes itself and gives good results on a wide range of problems. The Machine Learning Mastery blog also has a post about the Adam version of stochastic gradient descent.

Here, we collect and report the classification accuracy, defined via the metrics argument.

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
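
If you want finer control over the optimizer, or want to track additional metrics, you can pass objects instead of strings. A sketch, assuming the TensorFlow backend (the learning rate shown is simply the Keras default):

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC

model.compile(loss='binary_crossentropy',
              optimizer=Adam(learning_rate=0.001),
              metrics=['accuracy', AUC(name='auc')])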

Step 4. Fit Keras Model.

We have defined our model and compiled it, ready for efficient computation.

Training occurs over epochs and each epoch is split into batches.
Epoch: One pass through all of the rows in the training dataset.
Batch: One or more samples considered by the model within an epoch before weights are updated.
These configurations can be chosen experimentally by trial and error. We want to train the model enough so that it learns a good mapping of rows of input data to the output classification. The model will always have some error, but the amount of error will level out after some point for a given model configuration. This is called model convergence.

model.fit(X, y, epochs=1000, batch_size=200, verbose=0)
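
To see whether the model has converged, capture the History object returned by fit and inspect the per-epoch loss; a short sketch (depending on your Keras version the accuracy key may be 'accuracy' or 'acc'):

history = model.fit(X, y, epochs=1000, batch_size=200, verbose=0)
print(history.history['loss'][-5:])       # training loss over the last 5 epochs
print(history.history['accuracy'][-5:])   # training accuracy over the last 5 epochs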

Step 5. Evaluate Keras Model

Evaluation is a process during model development to check whether the model fits the given problem and data well. A Keras model provides an evaluate function for this purpose. Its main arguments are:

  • Test data
  • Test data labels
  • verbose – verbosity mode (0 = silent, 1 = progress bar)

Note: evaluation metrics are a core part of building an effective machine learning model, and they should be chosen based on your model and problem.

_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))
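
Accuracy alone can be misleading when the classes are imbalanced, so it is worth also computing a confusion matrix and ROC AUC. A sketch using scikit-learn (not part of the original script):

from sklearn.metrics import confusion_matrix, roc_auc_score

probs = model.predict(X).ravel()     # predicted probabilities of cancer
preds = (probs > 0.5).astype(int)    # thresholded class predictions
print(confusion_matrix(y.Types, preds))
print('ROC AUC: %.3f' % roc_auc_score(y.Types, probs))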

Step 6. Make Predictions

Prediction is the final step and the expected outcome of building the model. Keras provides a predict method to get predictions from the trained model.

The input data should be independent of the training dataset.

# predict_classes was removed in recent Keras versions,
# so threshold the predicted probabilities at 0.5 instead
predictions = (model.predict(X) > 0.5).astype(int)
for i in range(5):
    print('%s => %d (expected %d)' % (X.index[i], predictions[i][0], y.Types[i]))

Step 7. Save the model

The model we have created can be saved for future use.

# t0 was recorded with time.time() at the start of the script (see the all-in-one code)
t1 = time.time()
total = t1 - t0
print('The practice takes %s seconds' % total)
dill.dump_session('./PRAD_Classifier_from_TCGA_RNA_seq.pkl')
# to restore the session later:
# dill.load_session('./PRAD_Classifier_from_TCGA_RNA_seq.pkl')
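
If you only need the trained network rather than the whole Python session, the standard Keras way is to save and reload the model itself; a short sketch (the .h5 filename is just an example):

model.save('PRAD_Classifier_from_TCGA_RNA_seq.h5')
# later, in a new session:
# from keras.models import load_model
# model = load_model('PRAD_Classifier_from_TCGA_RNA_seq.h5')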

Step 8. Run the code

python Machine_Learning_Project_1_Deep_Learning-based_Identification_of_Prostate_Cancer_using_TCGA_RNA_seq.py

It took ~3 minutes on my Mac. The final accuracy and the elapsed time are printed to the screen.

Note again: to simplify the project, I did not split the data into training and test sets or include an independent dataset. In practice, please evaluate on a held-out test set rather than the training data, as sketched below.
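
A minimal sketch of such a split with scikit-learn (not part of the script above), assuming X and y are loaded as in Step 1:

from sklearn.model_selection import train_test_split

# hold out 20% of the samples, preserving the cancer/normal ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y.Types, random_state=42)
# rebuild and compile the model as in Steps 2-3, then train on the training set only
model.fit(X_train, y_train, epochs=1000, batch_size=200, verbose=0)
_, test_accuracy = model.evaluate(X_test, y_test)
print('Test accuracy: %.2f' % (test_accuracy * 100))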

Project Summary

  • Retrieve Data — Clean the data
  • Load Data — Select a proper method to load your data
  • Define Keras Model — Define a model suited to the problem
  • Compile Keras Model — Choose the proper loss function and evaluation metrics
  • Fit Keras Model — Make sure we train the model enough
  • Evaluate Keras Model — Check whether the model fits the problem well
  • Make Predictions — Always be cautious about the predictions
  • Save the Model — Keep the trained model for future use

All-in-One code:

# Install the required packages
# conda install -c anaconda keras
# conda install -c anaconda tensorflow
# conda install -c anaconda scipy
# pip install -U numpy
# pip install dill
#The steps in this tutorial are as follows:
# 1. Load Data.
# 2. Define Keras Model.
# 3. Compile Keras Model.
# 4. Fit Keras Model.
# 5. Evaluate Keras Model.
# 6. Make Predictions.
# 7. Save models for the future
###########################################################
# 0. Quality Control of the data
###########################################################
# Before loading the data, a critical step should be processed: Quality Control.
# The QC process is required to provide routine and consistent checks of the design and plan details.
# The details can be extended in another note or series.
###########################################################
# 1. Load Data.
###########################################################
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
import time
import dill
t0 = time.time()
# load the dataset
# a model to map rows of input variables (X) to an output variable (y),
# which we often summarize as y = f(X).
# Input Variables (X):
# Gene expression profiles, with rows as cancer samples and columns as genes
X = pd.read_csv('PRAD_TCGA_RNA_seq.txt',delimiter=',')
# Output Variables (y):
# Class variable (0 or 1)
y = pd.read_csv('PRAD_TCGA_Types.txt',delimiter=',')
###########################################################
# 2. Define Keras Model
###########################################################
model = Sequential()
model.add(Dense(1000, input_dim=577, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
###########################################################
# 3. Compile Keras Model
###########################################################
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
###########################################################
# 4. Fit Keras Model
###########################################################
model.fit(X, y, epochs=1000, batch_size=200, verbose=0)
###########################################################
#5. Evaluate Keras Model
###########################################################
# evaluate the keras model
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))
###########################################################
# 6. Make Predictions
###########################################################
# make class predictions with the model
# (predict_classes was removed in recent Keras versions, so threshold instead)
predictions = (model.predict(X) > 0.5).astype(int)
# summarize the first 5 cases
for i in range(5):
    print('%s => %d (expected %d)' % (X.index[i], predictions[i][0], y.Types[i]))
###########################################################
# 7. Save models
###########################################################
t1 = time.time()
total = t1-t0
print('The practice takes %s seconds' % total)
dill.dump_session('./PRAD_Classifier_from_TCGA_RNA_seq.pkl')
# to restore the session later:
# dill.load_session('./PRAD_Classifier_from_TCGA_RNA_seq.pkl')

References

  1. Tătaru OS, et al. Artificial Intelligence and Machine Learning in Prostate Cancer Patient Management—Current Trends and Future Perspectives. Diagnostics (Basel). 2021;11(2):354.
  2. Mortazavi A, Williams BA, McCue K, et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–8.
