Abnormality Detection in Mammography Using Deep Learning
The skills demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
The code I developed can be found on GitHub, and the trained CNN models can be downloaded from the following links:
Introduction
Breast cancer is the second leading cause of cancer death among American women. The average risk that a woman in the United States will develop breast cancer sometime in her life is approximately 12.4% [1]. Screening x-ray mammography has been adopted worldwide to help detect cancer in its early stages, and as a result mortality has been reduced by 20-40% [2]. In recent years, the prevalence of digital mammogram images has made it possible to apply deep learning methods to cancer detection [3]. Advances in deep neural networks enable automatic learning from large-scale image data sets and detection of abnormalities in mammography [4, 5].
Considering the benefits of deep learning in image classification problems (e.g., automatic feature extraction from raw data), I developed a deep Convolutional Neural Network (CNN) trained to read mammography images and classify them into the following five classes:
- Normal
- Benign Calcification
- Benign Mass
- Malignant Calcification
- Malignant Mass
The subsequent sections describe the data source, data preprocessing, labeling, patch extraction, and model development and evaluation.
Data Source
I obtained mammography images from the DDSM and CBIS-DDSM databases. The DDSM (Digital Database for Screening Mammography) is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is a subset of the DDSM curated by a trained mammographer [6].
Both DDSM and CBIS-DDSM include two image views, CC (craniocaudal, top view) and MLO (mediolateral oblique, side view), as shown in Figure 1. Because the CBIS-DDSM database contains only abnormal cases, normal cases were collected from the DDSM database. Overall, a total of 4,091 mammography images were collected and used for the CNN development.
Figure 1. Images of (a) MLO (side view) and (b) CC (top view)
Data Preprocessing
Rename File Name
Because all the files obtained from the CBIS-DDSM database have the same name (i.e., 000000.dcm), I had to give each file a distinct name. To that end, I wrote a Python script that renames each file using its folder and sub-folder names, which encode the patient ID, breast side (i.e., left vs. right), and image view (i.e., CC vs. MLO).
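A minimal sketch of such a renaming script is shown below; the root path is hypothetical, and the exact folder layout of a CBIS-DDSM download may differ from what this assumes.

```python
import os

# Hypothetical path to the downloaded CBIS-DDSM data.
root_dir = "CBIS-DDSM"

# Walk the directory tree and rename each 000000.dcm using its folder and
# sub-folder names, which encode patient ID, breast side, and image view.
for dirpath, _, filenames in os.walk(root_dir):
    for filename in filenames:
        if filename.lower().endswith(".dcm"):
            folder_tag = os.path.relpath(dirpath, root_dir).replace(os.sep, "_")
            os.rename(os.path.join(dirpath, filename),
                      os.path.join(dirpath, folder_tag + ".dcm"))
```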
File Format Conversion
The original file formats of the DDSM and CBIS-DDSM images are LJPEG (Lossless JPEG) and DICOM (Digital Imaging and Communications in Medicine), respectively. Since these formats can only be handled by specific software, I converted all images to PNG format using MicroDicom and the scripts from GitHub [7].
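For the DICOM files, a conversion along these lines can also be done directly in Python; this is a sketch using pydicom and OpenCV rather than the MicroDicom workflow actually used, and the file names are hypothetical.

```python
import cv2
import numpy as np
import pydicom

def dicom_to_png(dcm_path: str, png_path: str) -> None:
    """Read a DICOM file and save its pixel data as an 8-bit PNG."""
    pixels = pydicom.dcmread(dcm_path).pixel_array.astype(np.float32)
    # Rescale intensities to 0-255 before casting to 8-bit.
    lo, hi = float(pixels.min()), float(pixels.max())
    pixels = (pixels - lo) / max(hi - lo, 1e-8) * 255.0
    cv2.imwrite(png_path, pixels.astype(np.uint8))

dicom_to_png("P_00001_LEFT_CC.dcm", "P_00001_LEFT_CC.png")  # hypothetical names
```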
Artifacts Removal and Image Enhancement
As illustrated in Figure 2, the raw mammography images (Figure 2-(a)) contain artifacts, which could be a major issue in the CNN development. To remove them, I created a mask image (Figure 2-(b)) for each raw image by selecting the largest object in a binary image and filling the white gaps (i.e., artifacts) in the background. I used Otsu's segmentation method to separate the breast area from the background. The boundary of the breast was then smoothed using the OpenCV morphologyEx method (Figure 2-(c)).
Figure 2. Image Contrast Increase and Artifacts Removal: (a) original image, (b) mask image, (c) processed image
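A minimal sketch of this masking step is shown below, assuming an 8-bit grayscale input; the morphology kernel size is an assumption, not the exact value used in the project.

```python
import cv2
import numpy as np

def remove_artifacts(img: np.ndarray) -> np.ndarray:
    """Keep only the largest object (the breast) and black out everything else."""
    # Otsu's method picks a threshold that separates breast from background.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Label connected components and keep the largest one (skipping label 0,
    # the background) as the breast mask.
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    mask = np.uint8(labels == largest) * 255
    # Smooth the breast boundary with a morphological opening.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (25, 25))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.bitwise_and(img, img, mask=mask)
```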
After completing the preprocessing tasks, I stored all the images as 8-bit unsigned integers ranging from 0 to 255, which were then normalized to a pixel intensity range between 0 and 1.
Labeling
The CBIS-DDSM database provides data description CSV files that include pixel-wise annotations for the regions of interest (ROI), abnormality type (e.g., mass vs. calcification), pathology (e.g., benign vs. malignant), and other metadata, as shown in Figure 3-(a).
Figure 3-(a). Data Label
In the pathology column, 'BENIGN_WITHOUT_CALLBACK' was converted to 'BENIGN'. An Image_Name column was created from the patient ID, breast side, and image view, and then set as the index column, as shown in Figure 3-(b) below.
Figure 3-(b). New Data Label
After that, each label was encoded as an integer category (0-4). In the end, each category vector (i.e., an integer) was converted to a binary class matrix using the Keras 'to_categorical' method.
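A sketch of this encoding step is shown below; the exact label strings and integer mapping are assumptions, chosen to be consistent with the class ordering noted in the Model Evaluation section.

```python
import pandas as pd
from tensorflow.keras.utils import to_categorical

# Assumed mapping from class name to integer code (0-4).
label_map = {"NORMAL": 0, "BENIGN_CALCIFICATION": 1, "BENIGN_MASS": 2,
             "MALIGNANT_CALCIFICATION": 3, "MALIGNANT_MASS": 4}

# Toy labels standing in for the pathology/abnormality columns.
labels = pd.Series(["NORMAL", "BENIGN_MASS", "MALIGNANT_CALCIFICATION"])

# Integer-encode, then convert to a binary (one-hot) class matrix.
y = to_categorical(labels.map(label_map).values, num_classes=5)
```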
Patch Extraction
Considering the size of the data sets and the available computing power, I decided to develop a patch classifier rather than a whole-image classifier. To this end, image patches for the normal and abnormal images were extracted in two different ways:
- Patches for the normal images were randomly extracted from within the breast area
- Patches for the abnormal images were created by sampling from the center and around the center of the ROI
As shown in Figure 4, the size and location of the ROI in an abnormal image were first identified from the ROI mask image (note that the ROI mask images are included in the CBIS-DDSM data set). Patches were then extracted from the corresponding location in the original image. When the ROI was larger than 256×256, multiple patches were extracted with a stride of 128; a sketch of this logic follows Figure 4.
Figure 4. Example of Original, ROI mask, and Patch Images
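Below is a minimal sketch of the extraction logic; the centering and boundary handling are assumptions about details the text does not spell out.

```python
import numpy as np

PATCH, STRIDE = 256, 128

def extract_roi_patches(image: np.ndarray, roi_mask: np.ndarray) -> list:
    """Extract 256x256 patches from the image at the location given by the ROI mask."""
    ys, xs = np.nonzero(roi_mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    if y1 - y0 <= PATCH and x1 - x0 <= PATCH:
        # Small ROI: a single patch centered on the ROI.
        cy, cx = (y0 + y1) // 2, (x0 + x1) // 2
        y = int(np.clip(cy - PATCH // 2, 0, image.shape[0] - PATCH))
        x = int(np.clip(cx - PATCH // 2, 0, image.shape[1] - PATCH))
        return [image[y:y + PATCH, x:x + PATCH]]
    # Large ROI: slide a 256x256 window over the bounding box with stride 128.
    patches = []
    for y in range(y0, max(y1 - PATCH + 1, y0 + 1), STRIDE):
        for x in range(x0, max(x1 - PATCH + 1, x0 + 1), STRIDE):
            yc = min(y, image.shape[0] - PATCH)
            xc = min(x, image.shape[1] - PATCH)
            patches.append(image[yc:yc + PATCH, xc:xc + PATCH])
    return patches
```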
Overall, I extracted a total of 50,718 patches, 85% of which were normal and 15% abnormal (i.e., either benign or malignant). This split was intended to reflect real-world conditions, where the mean abnormal interpretation rate is about 12% [8].
The extracted patches were split into training and test sets (i.e., 80/20), and 50% of the test patches were further set aside as a validation set; a sketch of this split follows Figure 5. Examples of extracted abnormal patches are shown in Figure 5.
Figure 5. Examples of extracted patch images
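The split can be reproduced along these lines (a sketch; the random seed and stratification are assumptions, and the arrays are dummy stand-ins):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins: X is the patch array, y the integer class labels.
X = np.zeros((1000, 256, 256, 1), dtype=np.float32)
y = np.random.randint(0, 5, size=1000)

# 80/20 train/test split, then half of the test set becomes the validation set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, test_size=0.50, stratify=y_test, random_state=42)
```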
Deep CNN Development
Architecture
I designed a baseline model with a VGG (Visual Geometry Group)-type structure, which consists of blocks of two convolutional layers with small 3×3 filters followed by a max pooling layer. The final model has four such blocks, each ending with a batch normalization layer followed by a max pooling layer and a dropout layer. Each convolutional layer has 3×3 filters, ReLU activation, and the he_uniform kernel initializer with 'same' padding, ensuring the output feature maps have the same width and height as the input. The architecture of the developed CNN is shown in Figure 6.
Figure 6. Architecture of Developed CNN
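A minimal Keras sketch consistent with this description is shown below; the filter counts, dense layer width, and dropout rates are assumptions, since Figure 6 is not reproduced here.

```python
from tensorflow.keras import layers, models, regularizers

def build_model(n_classes: int = 5) -> models.Sequential:
    """VGG-style patch classifier: four blocks of two 3x3 conv layers each."""
    model = models.Sequential([layers.Input(shape=(256, 256, 1))])
    for filters in (32, 64, 128, 256):  # assumed filter counts per block
        for _ in range(2):
            model.add(layers.Conv2D(filters, (3, 3), padding="same",
                                    activation="relu",
                                    kernel_initializer="he_uniform",
                                    kernel_regularizer=regularizers.l2(1e-4)))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu", kernel_initializer="he_uniform"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model
```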
Model Improvement and Training Procedure
The CNN model in Figure 6 was developed over 7 steps. The interim models were trained and evaluated with the training, validation, and test data sets. The number of epochs for model training was initially 50 and was later increased to 100. I selected Adam as the optimizer and set the batch size to 32. Model training involved tuning hyperparameters such as beta_1 and beta_2 for the optimizer, the dropout rate, and the learning rate.
The training and validation accuracy and loss of the interim models are shown in Figure 7. Overall, the accuracy of the baseline model on the test data was more than 80%, but significant overfitting also occurred. To address this, I added a dropout layer to each block and/or applied a kernel regularizer in the convolutional layers.
Figure 7. Model Training and Improvement Procedure
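Continuing the sketches above, the training setup looks roughly like this; the learning rate and beta values shown are placeholders for the tuned values.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

# One-hot encode the integer labels from the split above.
y_train_oh = to_categorical(y_train, num_classes=5)
y_val_oh = to_categorical(y_val, num_classes=5)

# Adam with tunable beta_1/beta_2, batch size 32, up to 100 epochs.
model = build_model(n_classes=5)
model.compile(optimizer=Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999),
              loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X_train, y_train_oh, batch_size=32, epochs=100,
                    validation_data=(X_val, y_val_oh))
```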
Computational Environment
The model training in this project was carried out on a Windows 10 computer equipped with an NVIDIA RTX 2080 Super GPU (8 GB). The CNN model was developed with TensorFlow 2.0 and Keras 2.3.0.
Model Evaluation
The accuracy of the developed model on the test data was 90.7%. However, accuracy is not a proper evaluation metric for this project because the number of samples per class is highly imbalanced: with imbalanced classes, it is easy to get high accuracy without actually making useful predictions. Thus, a confusion matrix was computed to understand the classification results per class (see Figure 8). Note that 0, 1, 2, 3, and 4 represent Normal, Benign Calcification, Benign Mass, Malignant Calcification, and Malignant Mass, respectively.
Figure 8. Confusion Matrix Results
Precision and recall were then computed for each class, and the results are summarized in Figure 9. While the precision and recall of class 0 (i.e., Normal) are 97.2% and 99.8%, respectively, those for the other classes are considerably lower. The weighted averages of precision and recall were 89.8% and 90.7%, respectively.
Figure 9. Precision, Recall, and F1-Score
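The per-class numbers in Figures 8 and 9 can be produced along these lines (continuing the sketches above):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predicted class = argmax over the softmax outputs.
y_pred = np.argmax(model.predict(X_test), axis=1)
print(confusion_matrix(y_test, y_pred))                 # Figure 8
print(classification_report(y_test, y_pred, digits=3))  # per-class precision/recall
```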
Considering the data imbalance, I re-trained the multi-class classification model with balanced class weights, computed with scikit-learn's 'class_weight' utility. The computed weights are shown below:
| Class  | 0     | 1     | 2     | 3     | 4     |
|--------|-------|-------|-------|-------|-------|
| Weight | 0.250 | 2.877 | 4.116 | 5.031 | 4.735 |
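The weights were obtained roughly as follows (a sketch; the values in the table come from the actual training label distribution):

```python
import numpy as np
from sklearn.utils import class_weight

# Balanced weights: inversely proportional to class frequency in y_train.
weights = class_weight.compute_class_weight(
    class_weight="balanced", classes=np.unique(y_train), y=y_train)
class_weights = dict(enumerate(weights))

# Re-train so that errors on the rare abnormal classes cost more.
model.fit(X_train, y_train_oh, batch_size=32, epochs=100,
          validation_data=(X_val, y_val_oh), class_weight=class_weights)
```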
The precision and recall results calculated with the re-trained model are summarized in Figure 10. While the recall of class 3 (i.e., Malignant Calcification) increased, the precision and recall of the other classes slightly decreased. Overall, no noticeable improvement was obtained even after adding the class weights.
Figure 10. Precision, Recall, and F1-Score with Class Weight
Figure 11 shows the Precision-Recall (PR) curve along with the F1 curve for each class.
Figure 11. Precision-Recall (PR) Curves with F1 Score
Binary Classification
The developed CNN was further trained for binary classification (i.e., Normal vs. Abnormal). The number of epochs for model training was 100, and the other parameters remained the same as in the multi-class classification. The confusion matrix and normalized confusion matrix are shown in Figure 12.
Figure 12. Confusion Matrix Results for Binary Classification
The corresponding precision and recall for detecting abnormalities were also calculated, and the results are shown below. The binary classification model achieved precision and recall values far better than those obtained with the multi-class classification model. Note that recall is a more important measure than precision for rare cancer detection, because a false negative (i.e., a missed cancer) is far more costly than a false positive.
- Precision = 808/(808+13) = 0.984
- Recall = 808/(808 + 98) = 0.892
Figure 13 shows the Precision-Recall curve for the binary classification.
Figure 13. Precision-Recall (PR) Curves with F1 Score - Binary Classification
Examples of Predictions
We can use the developed CNN to make predictions on new images. Figure 14 shows examples of image predictions: correct prediction labels are blue, incorrect prediction labels are red, and the number gives the percentage (confidence) for the predicted label.
Figure 14. Example of Image Predictions
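A single-patch prediction looks like this (continuing the sketches above; `X_test[:1]` stands in for a preprocessed patch):

```python
import numpy as np

class_names = ["Normal", "Benign Calcification", "Benign Mass",
               "Malignant Calcification", "Malignant Mass"]

# Predict one patch; the softmax output gives a probability per class.
probs = model.predict(X_test[:1])[0]
print(f"{class_names[int(np.argmax(probs))]}: {100 * probs.max():.1f}%")
```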
Conclusions and Future Studies
Throughout this capstone project, I developed two Convolutional Neural Network (CNN) models for mammography image classification, both trained on highly imbalanced data sets. The first model (multi-class classification) was trained to classify images into five classes: Normal, Benign Calcification, Benign Mass, Malignant Calcification, and Malignant Mass. The second model (binary classification) was trained to distinguish normal from abnormal cases. Notable findings of this project are summarized below:
- The accuracy achieved by the multi-class classification model was 90.7%, but accuracy is not a proper performance measure under imbalanced data conditions.
- The precision and recall for the abnormal classes (i.e., Benign Calcification, Benign Mass, Malignant Calcification, and Malignant Mass) in the multi-class classification model were considerably lower than the overall accuracy. The recall values for the abnormal classes were 68.4%, 50.5%, 35.8%, and 47.1%, respectively, while the precision values were 68.8%, 48.5%, 56.7%, and 57.1%, respectively. The weighted averages of precision and recall were 89.8% and 90.7%, respectively.
- The precision and recall for detecting abnormalities (i.e., binary classification) were 98.4% and 89.2%, respectively.
This project can be enhanced by investigating ways to increase the precision and recall of the multi-class classification model. An immediate extension is to examine model performance after adding more blocks/layers to the existing CNN and further tuning the hyperparameters. I will also examine the data imbalance issue with both over-sampling and under-sampling techniques. Additionally, I plan to improve the developed CNN by integrating it with a whole-image classifier.
References
- American Cancer Society. Breast Cancer Facts & Figures 2017-2018. Atlanta: American Cancer Society, Inc. 2017
- Lotter, William, et al. "Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach." arXiv preprint arXiv:1912.11027 (2019).
- Abdelhafiz, Dina, et al. "Deep convolutional neural networks for mammography: advances, challenges and applications." BMC bioinformatics 20.11 (2019): 281.
- Nelson, Heidi D., et al. "Factors associated with rates of false-positive and false-negative results from digital mammography screening: an analysis of registry data." Annals of internal medicine 164.4 (2016): 226-235.
- Xi, Pengcheng, Chang Shu, and Rafik Goubran. "Abnormality detection in mammography using deep convolutional neural networks." 2018 IEEE International Symposium on Medical Measurements and Applications (MeMeA). IEEE, 2018.
- Lee, Rebecca Sawyer, Francisco Gimenez, Assaf Hoogi, and Daniel Rubin. Curated Breast Imaging Subset of DDSM [Dataset]. The Cancer Imaging Archive (2016). DOI: 10.7937/K9/TCIA.2016.7O02S9CY
- DDSM Utility, https://github.com/trane293/DDSMUtility
- Lehman, Constance D., et al. "National performance benchmarks for modern screening digital mammography: update from the Breast Cancer Surveillance Consortium." Radiology 283.1 (2017): 49-58.
- Shen, Li, et al. "Deep learning to improve breast cancer detection on screening mammography." Scientific reports 9.1 (2019): 1-12.