Machine Learning Project in Oncology 5 – Developing Convolutional Neural Networks for Classifying Breast Cancer Based on Histopathological Images

There are already some excellent blogs and papers on the topic (See [1]-[4] in References).

The project was divided into 6 steps:

0. Raise a Question and Prepare Data.

Breast cancer is a malignant tumor that grows in or around the breast tissue, mainly in the milk ducts and glands. A tumor usually starts as a lump or calcium deposit that develops as a result of abnormal cell growth. Most breast lumps are benign, but some can be premalignant (may become cancer) or malignant. Early detection and proper treatment can significantly reduce cancer-related deaths. Various imaging techniques are used to assist diagnosis, including mammography, magnetic resonance imaging (MRI), ultrasonography, and thermography [5]. Histopathological analysis of stained tissue sections by an experienced pathologist remains the gold standard for the most reliable diagnosis. However, the accuracy of manual analysis of histopathological slides varies from 65% to 98%, depending on the experience and knowledge of the pathologist.

  • Human error may lead to inappropriate diagnoses; the literature reports 25%–26% discordance between pathologists in differentiating between malignant and benign neoplasms.
  • Manual analysis is also laborious and time consuming, and the resulting diagnostic delays can ultimately prove fatal.

Machine learning has emerged as one of the most powerful tools in the healthcare industry for image analysis and for the early detection and classification of breast cancer.

In this project, I adopted the Breast Cancer Histopathological Image Classification (BreakHis) dataset, which is composed of 9,109 microscopic images of breast tumor tissue collected from 82 patients at different magnification factors (40X, 100X, 200X, and 400X).

The dataset can be downloaded from here. The downloaded folder was organized as follows:

Breast_Cancer
├── test
│   ├── benign
│   └── malignant
└── train
    ├── benign
    └── malignant

Note: The folder can be split into this structure following the blog post How to split folders with files (e.g. images) into training, validation and test (dataset) folders, or with a short script like the sketch below.
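If the raw download is not already split, a plain-Python 80/20 split is enough; the sketch below is a minimal example and assumes a hypothetical raw layout with one folder per class (BreakHis_raw/benign and BreakHis_raw/malignant).

# Hedged sketch: split each class folder into train/ and test/ (80/20).
# The source folder names are hypothetical; adjust them to your local layout.
import os, random, shutil
random.seed(123)
for label in ('benign', 'malignant'):
    files = os.listdir(os.path.join('BreakHis_raw', label))
    random.shuffle(files)
    n_train = int(0.8 * len(files))
    for split, subset in (('train', files[:n_train]), ('test', files[n_train:])):
        dest = os.path.join('Breast_Cancer', split, label)
        os.makedirs(dest, exist_ok=True)
        for f in subset:
            shutil.copy(os.path.join('BreakHis_raw', label, f), dest)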

1. Load the Data.

The BreakHis data was split into train (80%; 6,327 images) and test (20%; 1,582 images) sets. I defined a function that uses np.random to visualize a random sample of the images.

The benign images:

Benign images from BreakHis.

The malignant images:

Malignant images from BreakHis.
# Change work directory
import os
os.chdir("./Project5/Breast_Cancer/")
# The folder should be structured as
# Breast_Cancer
# ├── test
# │   ├── benign
# │   └── malignant
# └── train
#     ├── benign
#     └── malignant
# Calculate the time
import time
t0 = time.time()
###############################################################################
def plot_image(folder):
    from matplotlib import pyplot
    from matplotlib.image import imread
    import numpy as np
    # list the image files in the dataset folder
    images = os.listdir(path=folder)
    # draw nine random indices
    # (np.random.randint samples from low (inclusive) to high (exclusive))
    np.random.seed(123)
    image_idx = np.random.randint(len(images), size=9)
    # plot the nine randomly selected images in a 3x3 grid
    for i in range(9):
        # define subplot
        pyplot.subplot(330 + 1 + i)
        # load image pixels
        filename = os.path.join(folder, images[image_idx[i]])
        image = imread(filename)
        # plot raw pixel data
        pyplot.imshow(image)
    # show the figure
    pyplot.show()
###############################################################################

# plot images from benign breast cancer tumors
plot_image('train/benign/')
# plot images from malignant breast cancer tumors
plot_image('train/malignant/')

2. Define, Compile, and Fit Model.

Many Python packages have been developed for image processing, such as OpenCV (cv2) and scikit-image.

I used the Keras ImageDataGenerator class, which provides a quick and easy way to augment images. It supports a host of augmentation techniques, such as standardization, rotation, shifts, flips, brightness changes, and many more.

The main benefit of the Keras ImageDataGenerator class is that it is designed for real-time data augmentation, that is, it generates augmented images on the fly while the model is training.

Note: No tutorial is better than the official ImageDataGenerator API docs.

Next, I defined VGG models. What are the VGG Models?

  • VGG models are a type of CNN architecture proposed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group (VGG) at Oxford University, which achieved remarkable results in the ImageNet Challenge.
  • They experimented with six models with different numbers of trainable layers; based on the number of weight layers, the two most popular are VGG16 and VGG19.
  • The architecture involves stacking convolutional layers with small 3×3 filters followed by a max pooling layer. Together, these layers form a block, and these blocks can be repeated, with the number of filters in each block increasing with the depth of the network, such as 32, 64, 128, 256 for the first four blocks of the model (a minimal sketch of this block pattern follows the list).
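For illustration, here is a minimal sketch of that block pattern. The helper below is hypothetical and separate from the project code; it simply shows how 3×3 convolutions and 2×2 max pooling stack into repeatable blocks with growing filter counts.

# Hedged sketch of the VGG block pattern (hypothetical helper, not project code):
# each block stacks 3x3 convolutions and ends with 2x2 max pooling, and the
# number of filters grows with depth (32, 64, 128, ...).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D

def vgg_block(model, n_filters, n_convs):
    # add n_convs 3x3 convolution layers followed by one 2x2 max-pooling layer
    for _ in range(n_convs):
        model.add(Conv2D(n_filters, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    return model

demo = Sequential()
demo.add(Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(200, 200, 3)))
demo.add(MaxPooling2D((2, 2)))
demo = vgg_block(demo, 64, 2)
demo = vgg_block(demo, 128, 2)
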
# tensorflow.keras.preprocessing.image.ImageDataGenerator
# generates batches of tensor image data with real-time data augmentation.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# create the data generator:
# training photos will be augmented with small (10%) random horizontal and
# vertical shifts, random rotations and zooms, and random horizontal and
# vertical flips that mirror the photo.
datagen = ImageDataGenerator(rescale=1./255,        # scale pixel values to the range 0-1
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             zoom_range=2,          # range for random zoom
                             rotation_range=90,
                             horizontal_flip=True,  # randomly flip images horizontally
                             vertical_flip=True)    # randomly flip images vertically

###############################################################################
# define a three-block VGG-style model
def define_model():
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D
    from tensorflow.keras.layers import MaxPooling2D
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.layers import Flatten
    from tensorflow.keras.layers import Dropout
    from tensorflow.keras.optimizers import SGD
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model
###############################################################################

# define model
model = define_model()

train_it = datagen.flow_from_directory('./train/',
	class_mode='binary', batch_size=64, target_size=(200, 200))
test_it = datagen.flow_from_directory('./test/',
	class_mode='binary', batch_size=64, target_size=(200, 200))
history = model.fit(train_it, steps_per_epoch=len(train_it),
  	validation_data=test_it, validation_steps=len(test_it), epochs=20, verbose=0)
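Note that the same augmenting generator is applied to the test images above. A common alternative, sketched here as an assumption rather than the original setup, is a separate rescale-only generator for evaluation; shuffle=False also keeps predictions aligned with the labels.

# Hedged sketch: a non-augmenting, non-shuffled generator for evaluation only.
eval_datagen = ImageDataGenerator(rescale=1./255)
eval_it = eval_datagen.flow_from_directory('./test/', class_mode='binary',
    batch_size=64, target_size=(200, 200), shuffle=False)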

As mentioned above, VGG16 is a convolutional neural network (CNN) architecture that achieved top results in the ILSVRC (ImageNet) competition in 2014. It is still considered one of the best vision model architectures to date. The 16 in VGG16 refers to its 16 layers with trainable weights.

A basic architecture of VGG16:

Architecture of VGG16 (from neurohive.io)

→ 2 x convolution layer of 64 channels of 3×3 kernel and same padding

→ 1 x maxpool layer of 2×2 pool size and stride 2×2

→ 2 x convolution layer of 128 channels of 3×3 kernel and same padding

→ 1 x maxpool layer of 2×2 pool size and stride 2×2

→ 3 x convolution layer of 256 channels of 3×3 kernel and same padding

→ 1 x maxpool layer of 2×2 pool size and stride 2×2

→ 3 x convolution layer of 512 channels of 3×3 kernel and same padding

→ 1 x maxpool layer of 2×2 pool size and stride 2×2

→ 3 x convolution layer of 512 channels of 3×3 kernel and same padding

→ 1 x maxpool layer of 2×2 pool size and stride 2×2

→ 1 x Dense layer of 4096 units

→ 1 x Dense layer of 4096 units

→ 1 x Dense Softmax layer of 1000 units (one per ImageNet class; a binary task uses 2 units or a single sigmoid output)

Keras provides a convenient VGG16 function (tensorflow.keras.applications.VGG16) for this model.

# define the "VGG16" model (note: this block reuses the same three-block
# architecture as define_model(); a transfer-learning sketch follows it)
def define_VGG16_model():
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D
    from tensorflow.keras.layers import MaxPooling2D
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.layers import Flatten
    from tensorflow.keras.layers import Dropout
    from tensorflow.keras.optimizers import SGD
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model
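Note that the block above reuses the same three-block architecture as define_model(). For reference, here is a minimal sketch of how the pretrained VGG16 from tensorflow.keras.applications could be used as a frozen feature extractor; this is an assumption for illustration, not the code behind the results reported below.

# Hedged sketch: transfer learning with the pretrained VGG16 (illustrative only,
# not the model used for the reported results). The convolutional base is frozen
# and a new binary classification head is trained on top.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.optimizers import SGD

base = VGG16(include_top=False, weights='imagenet', input_shape=(200, 200, 3))
base.trainable = False  # freeze the pretrained convolutional layers
x = Flatten()(base.output)
x = Dense(128, activation='relu', kernel_initializer='he_uniform')(x)
output = Dense(1, activation='sigmoid')(x)
tl_model = Model(inputs=base.input, outputs=output)
tl_model.compile(optimizer=SGD(learning_rate=0.001, momentum=0.9),
                 loss='binary_crossentropy', metrics=['accuracy'])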

3. Evaluate Model.

The model was evaluated using the accuracy metric, and the diagnostic learning curves were plotted with the function defined below.

Plot diagnostic learning curves
# evaluate model
_, acc = model.evaluate(test_it, steps=len(test_it), verbose=0)
print('The accuracy of VGG3 model is: %.3f' % (acc * 100.0))
###############################################################################
# plot diagnostic learning curves
def summarize_diagnostics(history, outplot_file):
    from matplotlib import pyplot
    # plot loss
    pyplot.subplot(211)
    pyplot.title('Cross Entropy Loss')
    pyplot.plot(history.history['loss'], color='blue', label='train')
    pyplot.plot(history.history['val_loss'], color='orange', label='test')
    pyplot.legend()
    # plot accuracy
    pyplot.subplot(212)
    pyplot.title('Classification Accuracy')
    pyplot.plot(history.history['accuracy'], color='blue', label='train')
    pyplot.plot(history.history['val_accuracy'], color='orange', label='test')
    pyplot.legend()
    # avoid overlapping titles and save the plot to file
    pyplot.tight_layout()
    pyplot.savefig(outplot_file + '_plot.png')
    pyplot.close()
###############################################################################

summarize_diagnostics(history,'VGG3')
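Accuracy alone can mask class-specific errors. The sketch below, which assumes scikit-learn is installed and reuses the ImageDataGenerator import from above, adds a confusion matrix and per-class report computed on a non-shuffled, rescale-only test generator.

# Hedged sketch (assumes scikit-learn): confusion matrix and per-class report
# on a non-shuffled, rescale-only test generator so that predictions stay
# aligned with the true labels.
from sklearn.metrics import confusion_matrix, classification_report
cm_it = ImageDataGenerator(rescale=1./255).flow_from_directory('./test/',
    class_mode='binary', batch_size=64, target_size=(200, 200), shuffle=False)
probs = model.predict(cm_it, steps=len(cm_it), verbose=0)
preds = (probs.ravel() > 0.5).astype('int')
print(confusion_matrix(cm_it.classes, preds))
print(classification_report(cm_it.classes, preds, target_names=list(cm_it.class_indices)))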

4. Make Predictions.

The BreakHis data was used for the CNN models. Two possible extensions:

  1. The data can also be split into train, validation, and test sets.
  2. There are many other valuable breast histopathology image datasets [4], and the trained model can be used for further predictions; a minimal prediction sketch follows this list.
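A minimal prediction sketch for a single image is shown below; the file path is hypothetical, and either trained model can be substituted.

# Hedged sketch: predict the class of one image (hypothetical file path).
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
img = load_img('test/malignant/example.png', target_size=(200, 200))
x = img_to_array(img) / 255.0  # same rescaling as the training generator
x = np.expand_dims(x, axis=0)  # add a batch dimension
prob = model.predict(x)[0][0]
print('malignant' if prob > 0.5 else 'benign', prob)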

5. Save Models.

The model was saved as an HDF5 (Hierarchical Data Format version 5) file (.h5). The h5py package is a Pythonic interface to the HDF5 binary data format.

model_VGG16.save('VGG16_model.h5')
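The saved model can later be restored with load_model, for example:

# Reload the saved model (requires the h5py package for the HDF5 format).
from tensorflow.keras.models import load_model
model_VGG16 = load_model('VGG16_model.h5')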

The project takes about 3 to 4 hours to run. The accuracy of the VGG16 model is 0.84.

Congrats!

All-in-One Code:

# Change work directory
import os
os.chdir("./Project5/Breast_Cancer/")
# The folder should be structured as
# Breast_Cancer
# ├── test
# │   ├── benign
# │   └── malignant
# └── train
#     ├── benign
#     └── malignant
# Calculate the time
import time
t0 = time.time()
###############################################################################
def plot_image(folder):
    from matplotlib import pyplot
    from matplotlib.image import imread
    import numpy as np
    # list the image files in the dataset folder
    images = os.listdir(path=folder)
    # draw nine random indices
    # (np.random.randint samples from low (inclusive) to high (exclusive))
    np.random.seed(123)
    image_idx = np.random.randint(len(images), size=9)
    # plot the nine randomly selected images in a 3x3 grid
    for i in range(9):
        # define subplot
        pyplot.subplot(330 + 1 + i)
        # load image pixels
        filename = os.path.join(folder, images[image_idx[i]])
        image = imread(filename)
        # plot raw pixel data
        pyplot.imshow(image)
    # show the figure
    pyplot.show()
###############################################################################

# plot images from benign breast cancer tumors
plot_image('train/benign/')
# plot images from malignant breast cancer tumors
plot_image('train/malignant/')


# tensorflow.keras.preprocessing.image.ImageDataGenerator
# generates batches of tensor image data with real-time data augmentation.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# create the data generator:
# training photos will be augmented with small (10%) random horizontal and
# vertical shifts, random rotations and zooms, and random horizontal and
# vertical flips that mirror the photo.
datagen = ImageDataGenerator(rescale=1./255,        # scale pixel values to the range 0-1
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             zoom_range=2,          # range for random zoom
                             rotation_range=90,
                             horizontal_flip=True,  # randomly flip images horizontally
                             vertical_flip=True)    # randomly flip images vertically

###############################################################################
# define a three-block VGG-style model
def define_model():
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D
    from tensorflow.keras.layers import MaxPooling2D
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.layers import Flatten
    from tensorflow.keras.layers import Dropout
    from tensorflow.keras.optimizers import SGD
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model
###############################################################################

# define model
model = define_model()

train_it = datagen.flow_from_directory('./train/',
	class_mode='binary', batch_size=64, target_size=(200, 200))
test_it = datagen.flow_from_directory('./test/',
	class_mode='binary', batch_size=64, target_size=(200, 200))
history = model.fit(train_it, steps_per_epoch=len(train_it),
  	validation_data=test_it, validation_steps=len(test_it), epochs=20, verbose=0)
# evaluate model
_, acc = model.evaluate(test_it, steps=len(test_it), verbose=0)
print('The accuracy of VGG3 model is: %.3f' % (acc * 100.0))

###############################################################################
# plot diagnostic learning curves
def summarize_diagnostics(history, outplot_file):
    from matplotlib import pyplot
    # plot loss
    pyplot.subplot(211)
    pyplot.title('Cross Entropy Loss')
    pyplot.plot(history.history['loss'], color='blue', label='train')
    pyplot.plot(history.history['val_loss'], color='orange', label='test')
    pyplot.legend()
    # plot accuracy
    pyplot.subplot(212)
    pyplot.title('Classification Accuracy')
    pyplot.plot(history.history['accuracy'], color='blue', label='train')
    pyplot.plot(history.history['val_accuracy'], color='orange', label='test')
    pyplot.legend()
    # avoid overlapping titles and save the plot to file
    pyplot.tight_layout()
    pyplot.savefig(outplot_file + '_plot.png')
    pyplot.close()
###############################################################################

summarize_diagnostics(history,'VGG3')
# save model
model.save('VGG3_model.h5')

###############################################################################
# define the "VGG16" model (same three-block architecture as above)
def define_VGG16_model():
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D
    from tensorflow.keras.layers import MaxPooling2D
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.layers import Flatten
    from tensorflow.keras.layers import Dropout
    from tensorflow.keras.optimizers import SGD
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model
###############################################################################
# define model
model_VGG16 = define_VGG16_model()

train_it = datagen.flow_from_directory('./train/',
	class_mode='binary', batch_size=64, target_size=(200, 200))
test_it = datagen.flow_from_directory('./test/',
	class_mode='binary', batch_size=64, target_size=(200, 200))
history_VGG16 = model_VGG16.fit(train_it, steps_per_epoch=len(train_it),
  	validation_data=test_it, validation_steps=len(test_it), epochs=20, verbose=0)
# evaluate model
_, acc = model_VGG16.evaluate(test_it, steps=len(test_it), verbose=0)
summarize_diagnostics(history_VGG16,'VGG16')
# save model
model_VGG16.save('VGG16_model.h5')

print('The accuracy of VGG16 model is: %.3f' % (acc * 100.0))
t1 = time.time()
total_time = t1-t0
print('The practice takes %s seconds' % total_time)

References

  1. Convolutional Neural Network for Breast Cancer Classification
  2. Breast-cancer-classification
  3. A Comprehensive Survey on Deep-Learning-Based Breast Cancer Diagnosis
  4. A Comprehensive Review for Breast Histopathology Image Analysis Using Classical and Deep Neural Networks
  5. https://www.sciencedirect.com/science/article/pii/B9780128214725000028
