Machine Learning Project in Oncology 2 – Developing A Multilayer Perceptron Neural Network for Breast Cancer Survival

According to Cancer.Net:

Breast cancer has now surpassed lung cancer as the most commonly diagnosed cancer worldwide. An estimated 2.3 million new cases were diagnosed in women across the world in 2020.

More women in the United States are diagnosed with breast cancer than with any other type of cancer, besides skin cancer. The disease accounts for about 1 in 3 of all new female cancers each year.

In 2022, an estimated 280,000+ women in the United States will be diagnosed with invasive breast cancer, and about 50,000 women will be diagnosed with non-invasive (in situ) breast cancer. Since the mid-2000s, the incidence of invasive breast cancer in women has increased by approximately half a percent each year.

Primary treatment for newly diagnosed breast cancer is surgery, followed by adjuvant therapies. After tumor removal, it is necessary to decide which adjuvant therapy can best prevent relapse and the formation of metastases. To this end, a series of clinical, histological, and molecular measurements is collected and evaluated by experts with the help of guidelines [1].

Survival for breast cancer is generally good, particularly when it is diagnosed early, largely thanks to screening, early diagnosis, and improved treatment.

Cancer Research UK statistics (https://www.cancerresearchuk.org/about-cancer/breast-cancer/survival/) show that most women (around 98%) diagnosed at stage 1 will survive their cancer for 5 years or more, while the survival rate drops to ~25% at stage 4. The type of cancer (ER+, PR+, HER2+, or triple-negative) and the grade of the cancer cells can also affect survival. The need for better prognosis and prediction of breast cancer has led to substantial research in developing survival models. For instance, Van’t Veer et al. [2] described a panel of 70 biomarkers for breast cancer predicting survival 5 years after breast cancer surgery.

In this project, we will work with the Haberman Breast Cancer Survival Dataset, which describes breast cancer patient data (4 attributes), including the outcome (whether the patient survived for five years or longer). The attributes are:

  1. Age of patient at time of operation (numerical)
  2. Patient’s year of operation (year – 1900, numerical)
  3. Number of positive axillary nodes detected (numerical)
  4. Survival status (class attribute)
    1 = the patient survived 5 years or longer
    2 = the patient died within 5 years

Using this dataset, we will explore a standard binary classification problem with a Multilayer Perceptron neural network.

Note: There are many other factors to consider, such as biomarker expression, epigenetic modifications, and hormone levels. Similar models can be built on those attributes.

I usually divide the project into 8 steps:

  • 0. Prepare Data.
  • 1. Load Data.
  • 2. Define Model.
  • 3. Compile Model.
  • 4. Fit Model.
  • 5. Evaluate Model.
  • 6. Make Predictions.
  • 7. Save Models for the Future.

Let the adventure begin.

0. Prepare Data.

This step is critical because the quality and quantity of the data you gather will directly determine how good your predictive model can be. Filtering, trimming, error removal, and quality control are essential for meaningful downstream analyses. Here, we directly retrieve the already-clean Haberman data.
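The Haberman file we retrieve below is already clean, but for a raw dataset a minimal quality-control pass with pandas might look like the following sketch (the toy table and its values are made up for illustration):

```python
from pandas import DataFrame

# a tiny stand-in table with one exact duplicate row and one missing value
df = DataFrame({'age':   [30, 30, 45, None],
                'year':  [64, 64, 63, 65],
                'nodes': [1, 1, 0, 2]})

print(df.isna().sum())     # count missing values per column
df = df.drop_duplicates()  # remove exact duplicate rows
df = df.dropna()           # drop rows with missing values
print(df.shape)            # two clean rows remain
```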

1. Load Data.

We use the pandas package, a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, to load the data (a complete tutorial can be found on the pandas website). The read_csv function accepts a URL directly.

After loading the data, we collect summary statistics.

##### The required packages in the project #####
from pandas import read_csv
from matplotlib import pyplot
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from numpy import mean
from numpy import std
import warnings
warnings.filterwarnings("ignore")
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/haberman.csv'
# load the dataset
columns = ['age', 'year', 'nodes', 'class']
# load the csv file as a data frame
dataframe = read_csv(url, header=None, names=columns)
print(dataframe.shape)
# summarize the class distribution
target = dataframe['class'].values
counter = Counter(target)
for k,v in counter.items():
	per = v / len(target) * 100
	print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))
# show summary statistics
print(dataframe.describe())
# plot histograms
dataframe.hist()
pyplot.show()

2. Define Model.

We define the Multilayer Perceptron (MLP) model using Keras. You may ask: why an MLP? Why not a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN)?

Multilayer Perceptron

MLPs are suitable for classification prediction problems where inputs are assigned a class or label. They are also suitable for regression prediction problems where a real-valued quantity is predicted given a set of inputs. Data is often provided in a tabular format, such as you would see in a CSV file or a spreadsheet.

Convolutional Neural Networks

CNNs were designed to map image data to an output variable. They have proven so effective that they are the go-to method for any type of prediction problem involving image data as an input.

Recurrent Neural Networks

RNNs were designed to work with sequence prediction problems. Sequence prediction problems come in many forms and are best described by the types of inputs and outputs supported. The models were traditionally difficult to train. The Long Short-Term Memory (LSTM) network is perhaps the most successful RNN because it overcomes the problems of training a recurrent network and in turn has been used on a wide range of applications.

As for the survival data, we cannot know in advance what model architecture or learning hyperparameters would be best, so we must experiment and discover what works well.

Before we evaluate models, it is a good idea to review the learning dynamics and tune the model architecture and learning configuration until learning is stable, and only then look at getting the most out of the model. We can do this with a simple train/test split of the data and plots of the learning curves. These will help us see whether we are overfitting or underfitting, so we can adapt the configuration accordingly.

Here we use one hidden layer with 50 nodes and one output layer (chosen arbitrarily). We use the ReLU activation function and “he_normal” weight initialization in the hidden layer, and a sigmoid activation in the output layer for binary classification.

# split into input and output columns
X, y = dataframe.values[:, :-1], dataframe.values[:, -1]
# ensure all input data are floating point values
X = X.astype('float32')
# encode class labels to integers
y = LabelEncoder().fit_transform(y)
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=3)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))

3. Compile Model.

We specify the training configuration (optimizer, loss, metrics).

Optimizers are algorithms or methods used to minimize an error (loss) function or to maximize performance. We adopt “adam” (Adaptive Moment Estimation) as the optimizer, one of the most popular gradient descent optimization algorithms; it computes adaptive learning rates for each parameter.

It stores both a decaying average of past gradients, similar to momentum, and a decaying average of past squared gradients, similar to RMSProp and Adadelta, thus combining the advantages of both methods. It is 1) easy to implement, 2) computationally efficient, and 3) light on memory.
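As a rough sketch of the update Adam performs per parameter (a simplified scalar version with default-style hyperparameters, not Keras’s actual implementation):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # decaying average of past gradients (the momentum-like term)
    m = beta1 * m + (1 - beta1) * grad
    # decaying average of past squared gradients (the RMSProp-like term)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias-correct the zero-initialized averages
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # adaptive per-parameter update
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# minimize f(x) = x**2 (gradient 2x) starting from x = 1.0
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(x)  # x has moved part of the way toward the minimum at 0
```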

Machines learn by means of a loss function, a method of evaluating how well a specific algorithm models the given data. We use cross-entropy loss (negative log likelihood), the most common choice for classification problems. Cross-entropy loss increases as the predicted probability diverges from the actual label.
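For binary labels, cross-entropy reduces to a simple formula; a minimal NumPy sketch (mirroring in spirit what Keras’s binary_crossentropy computes):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
# confident and correct predictions give a small loss ...
low = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.2]))
# ... confident but wrong predictions give a large loss
high = binary_cross_entropy(y_true, np.array([0.1, 0.9, 0.2, 0.8]))
print(low, high)
```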

Note: There is no one-size-fits-all loss function in machine learning.

A metric is a function that is used to judge the performance of your model. Metric functions are similar to loss functions, except that the results from evaluating a metric are not used when training the model.

# compile the model, tracking accuracy as a metric during training
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

4. Fit Model.

We then fit the model for 1000 training epochs (chosen arbitrarily) with a batch size of 16.

Training occurs over epochs, and each epoch is split into batches.

  • Epoch: one pass through all of the rows in the training dataset.
  • Batch: one or more samples considered by the model within an epoch before the weights are updated.

These configurations can be chosen experimentally by trial and error. We want to train the model enough so that it learns a good mapping of rows of input data to the output classification. The model will always have some error, but the amount of error will level out after some point for a given model configuration. This is called model convergence.
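As a quick sanity check on these terms, assuming the 50/50 split above leaves 153 of the 306 Haberman rows for training:

```python
import math

train_rows = 153   # half of the 306 Haberman rows
batch_size = 16
epochs = 1000

# batches per epoch: nine full batches of 16 plus one partial batch
batches_per_epoch = math.ceil(train_rows / batch_size)
# total number of weight updates over the whole run
weight_updates = batches_per_epoch * epochs
print(batches_per_epoch)  # 10
print(weight_updates)     # 10000
```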

history = model.fit(X_train, y_train, epochs=1000, batch_size=16, verbose=0, validation_data=(X_test,y_test))

5. Evaluate Model.

We will evaluate the model’s performance on the test dataset and report it as classification accuracy.
Note: Accuracy alone is not enough; there are many other ways to evaluate a machine learning model’s performance.

# predict test set
yhat = model.predict_classes(X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Cross Entropy')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()
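As the note above suggests, accuracy is only one view; scikit-learn’s confusion matrix and classification report give a fuller picture. A sketch with made-up labels and predictions (the real yhat depends on training):

```python
from sklearn.metrics import confusion_matrix, classification_report

# hypothetical true labels and predictions, for illustration only
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# per-class precision, recall, and F1
print(classification_report(y_true, y_pred))
```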

6. Make Predictions.

Prediction is the final step and the expected outcome of model generation. Keras provides the predict method to obtain predictions from a trained model.

The input data should be independent of the training dataset. For cancer survival data, we can download it from various resources, such as TCGA and TARGET.

yhat = model.predict_classes(X_test)

7. Save Models for the Future.

We finally save the model for sharing or future prediction using dill.

dill: a utility for serialization of Python objects. dill extends Python’s pickle module, serializing and de-serializing the majority of the built-in Python types. Serialization is the process of converting an object to a byte stream; the inverse converts a byte stream back to a Python object hierarchy. It can be used to store Python objects in a file, but the primary usage is to send Python objects across the network as a byte stream.

As mentioned above, the pickle module itself can also be used in Python to serialize and de-serialize any object structure, and it offers a simple way to save a variable to a file.
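A minimal pickle round-trip, with made-up scores, illustrates the idea (dill.dump_session below works the same way, but captures the entire interpreter session):

```python
import os
import pickle
import tempfile

scores = {'fold_1': 0.75, 'fold_2': 0.72}

# serialize the object to a byte stream on disk ...
path = os.path.join(tempfile.gettempdir(), 'scores.pkl')
with open(path, 'wb') as f:
    pickle.dump(scores, f)

# ... and rebuild the object from that byte stream
with open(path, 'rb') as f:
    restored = pickle.load(f)
print(restored == scores)  # True
```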

import dill
dill.dump_session('./MLP_BRCA_Haberman_Survival.pkl')
#to restore session:
#dill.load_session('./MLP_BRCA_Haberman_Survival.pkl')

8. Run the Code.

Let’s test whether the code works.

Note: Please use tensorflow<=2.5, since model.predict_classes was removed in TensorFlow 2.6 (alternatively, replace those calls with (model.predict(X_test) > 0.5).astype('int32')).

python Machine_Learning_Project_2_Develop_A_Multilayer_Perceptron_Model_for_Predicting_Breast_Cancer_Survival.py

Congrats!

All Codes in One:

# load the haberman dataset and summarize the shape
from pandas import read_csv
from matplotlib import pyplot
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from numpy import mean
from numpy import std
import dill
import warnings
warnings.filterwarnings("ignore")

# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/haberman.csv'
# load the dataset

columns = ['age', 'year', 'nodes', 'class']
# load the csv file as a data frame
dataframe = read_csv(url, header=None, names=columns)
print(dataframe.shape)
# summarize the class distribution
target = dataframe['class'].values
counter = Counter(target)
for k,v in counter.items():
	per = v / len(target) * 100
	print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

# show summary statistics
print(dataframe.describe())
# plot histograms
dataframe.hist()
pyplot.show()
# Multilayer Perceptron (MLP) model

# ensure all data are floating point values
X, y = dataframe.values[:, :-1], dataframe.values[:, -1]

X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# prepare 10-fold stratified cross validation
kfold = StratifiedKFold(10, random_state=1, shuffle=True)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
	# split data
	X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
	# determine the number of input features
	n_features = X.shape[1]
	# define model
	model = Sequential()
	model.add(Dense(100, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
	model.add(Dense(10, activation='relu'))
	model.add(Dense(1, activation='sigmoid'))
	# compile the model
	model.compile(optimizer='adam', loss='binary_crossentropy')
	# fit the model
	model.fit(X_train, y_train, epochs=1000, batch_size=16, verbose=0)
	# predict test set
	yhat = model.predict_classes(X_test)
	# evaluate predictions
	score = accuracy_score(y_test, yhat)
	print('>%.3f' % score)
	scores.append(score)
# summarize all scores
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
# Save the working directory
#dill.dump_session('./MLP_BRCA_Haberman_Survival.pkl')
#to restore session:
#dill.load_session('./MLP_BRCA_Haberman_Survival.pkl')

References

  1. Pellegrini, M. Accurate prediction of breast cancer survival through coherent voting networks with gene expression profiling. Sci. Rep. 11, 14645 (2021).
  2. Van’t Veer, L. J. et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002).
