Abnormality Detection in Mammography using Deep Learning

Posted on Feb 5, 2020

The code developed for this project is available on GitHub, and the trained CNN models can be downloaded from the following links:

  1. Multi-class classification model
  2. Binary classification model

Introduction

Breast cancer is the second leading cause of cancer death among American women. The average risk of a woman in the United States developing breast cancer sometime in her life is approximately 12.4% [1]. Screening x-ray mammography has been adopted worldwide to help detect cancer in its early stages. As a result, we've seen a 20-40% mortality reduction [2]. In recent years, the prevalence of digital mammogram images has made it possible to apply deep learning methods to cancer detection [3]. Advances in deep neural networks enable automatic learning from large-scale image data sets and the detection of abnormalities in mammography [4, 5].

Considering the benefits of deep learning for image classification problems (e.g., automatic feature extraction from raw data), I developed a deep Convolutional Neural Network (CNN) that is trained to read mammography images and classify them into the following five classes:

  • Normal
  • Benign Calcification
  • Benign Mass
  • Malignant Calcification
  • Malignant Mass

The following sections describe the data source, data preprocessing, labeling, ROI extraction, data augmentation, and model development and evaluation.

 

Data Source

I obtained mammography images from the DDSM and CBIS-DDSM databases. The DDSM (Digital Database for Screening Mammography) is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is a subset of the DDSM database curated by a trained mammographer.

Both DDSM and CBIS-DDSM include two different image views - CC (craniocaudal - Top View) and MLO (mediolateral oblique - Side View) as shown in Figure 1. As the CBIS-DDSM database only contains abnormal cases, normal cases were collected from the DDSM database. Overall, a total of 4,091 mammography images were collected and used for the CNN development.

(a) MLO - Side view    (b) CC - Top view

Figure 1. Images of MLO and CC views

 

Data Preprocessing

File Renaming

Because all the files obtained from the CBIS-DDSM database have the same name (i.e., 000000.dcm), I had to rename each file so that each one would have a distinct name. To that end, I wrote a Python script that renames each file using the names of its folder and sub-folders, which include the patient ID, breast side (i.e., left vs. right), and image view (i.e., CC vs. MLO).
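The renaming step could look roughly like the sketch below. This is not the original script: the folder layout, the function names, and the choice to copy into a separate output folder are all assumptions made for illustration.

```python
from pathlib import Path
from typing import List
import shutil


def build_unique_name(dicom_path: Path, root: Path) -> str:
    """Join the folder names between `root` and the file into one file name."""
    # Hypothetical layout: root/<case folder>/<sub-folder>/000000.dcm
    folder_parts = dicom_path.relative_to(root).parts[:-1]
    return "_".join(folder_parts) + dicom_path.suffix


def copy_with_unique_names(root: Path, out_dir: Path) -> List[str]:
    """Copy every .dcm file under `root` into `out_dir` under a distinct name."""
    out_dir.mkdir(parents=True, exist_ok=True)
    new_names = []
    # sorted() materializes the file list before any copy is made
    for dicom_path in sorted(root.rglob("*.dcm")):
        new_name = build_unique_name(dicom_path, root)
        shutil.copy2(dicom_path, out_dir / new_name)
        new_names.append(new_name)
    return new_names
```

Copying rather than renaming in place keeps the downloaded data untouched, so the step can be rerun safely.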

File Format Conversion

The original file formats of the DDSM and CBIS-DDSM images are LJPEG (i.e., Lossless JPEG) and DICOM (i.e., Digital Imaging and Communications in Medicine), respectively. Since these formats can be read only with specialized software, I converted all the images to PNG format using MicroDicom and scripts from GitHub.

Artifacts Removal and Image Enhancement

As illustrated in Figure 2, the raw mammography images (see Figure 2-(a)) contain artifacts that could be a major issue in the CNN development. To remove the artifacts, I created a mask image (Figure 2-(b)) for each raw image by selecting the largest object in a binary image and filling the white gaps (i.e., artifacts) in the background. I used the Otsu segmentation method to separate the breast area from the background area for the artifact removal. Then, the boundary of the breast image was smoothed using the OpenCV morphologyEx method (see Figure 2-(c)).

(a) original image    (b) mask image    (c) processed image

Figure 2. Image Contrast Increase and Artifacts Removal
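To make the masking idea concrete, here is a minimal NumPy-only sketch of Otsu thresholding followed by selecting the largest object. It is an illustration under assumptions, not the code used in the project: the function names are hypothetical, 4-connectivity is an arbitrary choice, and the boundary smoothing step is left out.

```python
from collections import deque

import numpy as np


def otsu_threshold(image: np.ndarray) -> int:
    """Return the 8-bit threshold that maximizes the between-class variance."""
    hist = np.bincount(image.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                   # weight of the dark class
    mu = np.cumsum(prob * np.arange(256))     # cumulative mean intensity
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.nan                # skip splits with an empty class
    sigma_b = (mu[-1] * omega - mu) ** 2 / denom
    return int(np.nanargmax(sigma_b))


def largest_object_mask(binary: np.ndarray) -> np.ndarray:
    """Keep only the largest 4-connected foreground region of a boolean image."""
    height, width = binary.shape
    visited = np.zeros((height, width), dtype=bool)
    best = []
    for r in range(height):
        for c in range(width):
            if not binary[r, c] or visited[r, c]:
                continue
            region, queue = [], deque([(r, c)])
            visited[r, c] = True
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny, nx] and not visited[ny, nx]):
                        visited[ny, nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    mask = np.zeros((height, width), dtype=bool)
    if best:
        rows, cols = zip(*best)
        mask[list(rows), list(cols)] = True
    return mask


def remove_artifacts(image: np.ndarray) -> np.ndarray:
    """Zero out everything outside the largest bright object (the breast)."""
    binary = image > otsu_threshold(image)
    return np.where(largest_object_mask(binary), image, 0).astype(image.dtype)
```

The pure-Python flood fill is far too slow for full-size mammograms. In practice, OpenCV's cv2.threshold with the THRESH_OTSU flag and cv2.connectedComponentsWithStats cover the same two steps much faster.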

After completion of the preprocessing task, I stored all the images as 8-bit unsigned integers ranging from 0 to 255, which were then normalized to have the pixel intensity range between 0 and 1.

 

Labeling

The CBIS-DDSM database provides the data description CSV files that include pixel-wise annotations for the regions of interest (ROI), abnormality type (e.g., mass vs. calcification), pathology (e.g., benign vs. malignant), etc. as shown in Figure 3-(a).


Figure 3-(a). Data Label

In the pathology column, 'BENIGN_WITHOUT_CALLBACK' was converted to  'BENIGN'. The Image_Name column was created with patient ID, breast side, and image view, and then set as the index column as shown in Figure 3-(b) below.

Figure 3-(b). New Data Label

After that, each label was encoded into one of the categories shown below.

Number  Category
0       Normal
1       Benign Calcification
2       Benign Mass
3       Malignant Calcification
4       Malignant Mass

 

Finally, the vector of integer class labels was converted to a binary class matrix (i.e., one-hot encoding) using the Keras 'to_categorical' method.
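The labeling steps can be sketched as follows. The helper names are hypothetical, the pathology and abnormality strings are assumed to look like the values in the CBIS-DDSM CSV files, and the one-hot step is written in plain NumPy as a stand-in that yields the same kind of matrix as the Keras 'to_categorical' method.

```python
import numpy as np

# Order matches the integer codes in the category table above.
CATEGORIES = [
    "Normal",
    "Benign Calcification",
    "Benign Mass",
    "Malignant Calcification",
    "Malignant Mass",
]


def encode_label(pathology, abnormality_type):
    """Map a (pathology, abnormality type) pair to an integer class code."""
    if pathology is None:  # assumption: normal patches carry no pathology entry
        return 0
    pathology = pathology.upper().replace("BENIGN_WITHOUT_CALLBACK", "BENIGN")
    name = f"{pathology.title()} {abnormality_type.title()}"
    return CATEGORIES.index(name)


def one_hot(labels, num_classes=len(CATEGORIES)):
    """Convert integer class codes to a binary class matrix."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype=np.float32)[labels]
```

For example, one_hot([0, 3]) returns a 2×5 matrix whose second row has its single 1 in column 3 (Malignant Calcification).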

 

Patch Extraction 

Considering the size of the data sets and the available computing power, I decided to develop a patch classifier rather than a whole image classifier. For this purpose, image patches were extracted from the normal and abnormal images in two different ways:

  1. Patches for the normal images were randomly extracted from within the breast image area
  2. Patches for the abnormal images were created by sampling from the center and around the center of ROI area

As illustrated in Figure 4, the size and location of the ROI in an abnormal image were first identified from the ROI mask image (note that the ROI mask images are included in the CBIS-DDSM data set). Patches were then extracted from the corresponding location in the original image. When the size of the ROI was greater than 256×256, multiple patches were extracted with a stride of 128.

Figure 4. Example of Original, ROI mask, and Patch Images
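A rough sketch of ROI-based patch extraction is given below. It assumes a patch size of 256×256 (suggested by the 256×256 threshold mentioned above, but not stated explicitly), a single-channel image at least 256 pixels on each side, and hypothetical function names. The extra sampling around the ROI center is omitted.

```python
import numpy as np

PATCH = 256   # assumed patch side length in pixels
STRIDE = 128  # stride used when the ROI is larger than one patch


def _starts(lo, hi, limit):
    """Top-left offsets along one axis for an ROI spanning [lo, hi)."""
    if hi - lo <= PATCH:
        starts = [(lo + hi) // 2 - PATCH // 2]    # center the patch on the ROI
    else:
        # slide over the ROI; a remainder shorter than the stride is not covered
        starts = list(range(lo, hi - PATCH + 1, STRIDE))
    # clamp so that every patch stays inside the image
    return [min(max(s, 0), limit - PATCH) for s in starts]


def extract_roi_patches(image, roi_mask):
    """Return a list of PATCH x PATCH crops covering the ROI in `roi_mask`."""
    rows = np.flatnonzero(roi_mask.any(axis=1))
    cols = np.flatnonzero(roi_mask.any(axis=0))
    if rows.size == 0:                            # empty mask, nothing to extract
        return []
    top, bottom = int(rows[0]), int(rows[-1]) + 1
    left, right = int(cols[0]), int(cols[-1]) + 1
    height, width = image.shape
    return [
        image[y:y + PATCH, x:x + PATCH]
        for y in _starts(top, bottom, height)
        for x in _starts(left, right, width)
    ]
```

With these assumptions, an ROI of 400×200 pixels yields two patches (two offsets along the long side, one centered offset along the short side).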

Overall, I extracted a total of 50,718 patches, of which 85% were normal and 15% were abnormal (i.e., either benign or malignant). This ratio was intended to reflect real-world conditions, in which the mean abnormal interpretation rate is about 12% [8].

The extracted patches were split into training and test data sets (80/20). I then set aside 50% of the test set as a validation set, which gives an overall 80/10/10 split. Examples of extracted abnormal patches are shown in Figure 5.
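The split arithmetic can be illustrated with a small standard-library sketch. The function name and the fixed random seed are assumptions, and the original split may have been stratified by class, which this sketch does not do.

```python
import random


def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 80/10/10 into train/val/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(0.8 * n_samples)
    held_out = indices[n_train:]          # the initial 20% test portion
    n_test = len(held_out) // 2           # half of it stays as the test set
    train = indices[:n_train]
    test = held_out[:n_test]
    val = held_out[n_test:]
    return train, val, test


train, val, test = split_indices(50718)
print(len(train), len(val), len(test))
```

For the 50,718 patches this gives 40,574 training, 5,072 validation, and 5,072 test samples. A helper such as scikit-learn's train_test_split with its stratify argument would additionally keep the class ratio the same in each subset.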

 

Figure 5. Examples of extracted patch images

 

Deep CNN Development

Architecture

I designed a baseline model with a VGG (Visual Geometry Group) type structure, in which each block consists of two convolutional layers with small 3×3 filters followed by a max pooling layer. The final model has four such blocks, and each block also has a batch normalization layer followed by a max pooling layer and a dropout layer. Each convolutional layer uses 3×3 filters, ReLU activation, the he_uniform kernel initializer, and same padding, which ensures that the output feature maps have the same width and height as the input. The architecture of the developed CNN is shown in Figure 6.

Figure 6. Architecture of Developed CNN
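To see how the spatial size of the feature maps changes through the four blocks, the short sketch below traces it layer by layer. The 256×256 input size and the 2×2 max pooling with stride 2 are assumptions, since the text does not state them, and the function names are hypothetical.

```python
import math


def conv_output_size(size, kernel=3, stride=1, padding="same"):
    """Spatial output size of a convolution or pooling layer along one axis."""
    if padding == "same":
        return math.ceil(size / stride)       # unchanged when stride is 1
    return (size - kernel) // stride + 1      # "valid": no padding


def trace_feature_map_sizes(input_size=256, n_blocks=4, pool=2):
    """Return the feature-map side length at the input and after each block."""
    sizes = [input_size]
    for _ in range(n_blocks):
        size = sizes[-1]
        for _ in range(2):                    # two 3x3 convolutions, same padding
            size = conv_output_size(size, kernel=3, stride=1, padding="same")
        # max pooling behaves like a "valid" window of size `pool` and stride `pool`
        size = conv_output_size(size, kernel=pool, stride=pool, padding="valid")
        sizes.append(size)
    return sizes


print(trace_feature_map_sizes())
```

Under these assumptions the side length halves after every block (256, 128, 64, 32, 16), while the convolutions themselves leave it unchanged. With "valid" padding, each 3×3 convolution would instead shrink the map by two pixels per side length.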

 

Model Improvement and Training Procedure

The CNN model in Figure 6 was developed in seven steps. The interim models were trained and evaluated with the training, validation, and test data sets. The number of training epochs was initially 50 and was later increased to 100. I selected Adam as the optimizer and set the batch size to 32. Model training involved tuning hyperparameters such as beta_1 and beta_2 for the optimizer, the dropout rate, and the learning rate. The training and validation accuracy and loss of the interim models are shown in Figure 7. Overall, the accuracy of the baseline model on the test data was more than 80%, but significant overfitting also occurred. To address this, I added a dropout layer to each block and/or applied a kernel regularizer to the convolutional layers.

(a) 1 VGG Block
(b) 2 VGG Blocks
(c) 3 VGG Blocks
(d) 3 VGG Blocks with Dropout
(e) 3 VGG Blocks with Dropout and Batch Normalization
(f) 3 VGG Blocks with Dropout, Batch Normalization and Kernel Regularizer
(g) 4 VGG Blocks with Dropout and Batch Normalization

Figure 7. Model Training and Improvement Procedure

 

Computational Environment

The model training in this project was carried out on a Windows 10 computer equipped with an NVIDIA RTX 2080 Super GPU with 8 GB of memory. The CNN model was developed with TensorFlow 2.0 and Keras 2.3.0.

 

Model Evaluation

The developed model achieved an accuracy of 90.7% on the test data. However, accuracy is not an appropriate evaluation metric in this project because the number of samples per class is highly imbalanced. With imbalanced classes, it's easy to get a high accuracy without actually making useful predictions. Thus, a confusion matrix was computed to examine the classification results per class (see Figure 8). Note that 0, 1, 2, 3, and 4 represent Normal, Benign Calcification, Benign Mass, Malignant Calcification, and Malignant Mass, respectively.

Figure 8. Confusion Matrix Results

Precision and recall were then computed for each class, and the results are summarized in Figure 9. While the precision and recall of class 0 (i.e., Normal) are 97.2% and 99.8%, respectively, the precision and recall for the other classes are relatively lower. However, the weighted average of precision and the weighted average of recall were 89.8% and 90.7%, respectively.

 

Figure 9. Precision, Recall, and F1-Score
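The per-class precision and recall, and their weighted averages, follow directly from the confusion matrix. The sketch below shows the computation on a small made-up matrix. It assumes rows are true classes and columns are predicted classes (the scikit-learn convention), and the function name is hypothetical.

```python
import numpy as np


def per_class_metrics(confusion):
    """Compute precision and recall per class plus support-weighted averages."""
    confusion = np.asarray(confusion, dtype=float)
    true_positives = np.diag(confusion)
    support = confusion.sum(axis=1)        # number of true samples per class
    predicted = confusion.sum(axis=0)      # number of predictions per class
    precision = np.divide(true_positives, predicted,
                          out=np.zeros_like(true_positives), where=predicted > 0)
    recall = np.divide(true_positives, support,
                       out=np.zeros_like(true_positives), where=support > 0)
    weights = support / support.sum()
    return {
        "precision": precision,
        "recall": recall,
        "weighted_precision": float(weights @ precision),
        "weighted_recall": float(weights @ recall),
        "accuracy": float(true_positives.sum() / confusion.sum()),
    }


# Toy 3-class matrix, not the project's results.
toy = [[8, 1, 1],
       [2, 6, 2],
       [0, 1, 4]]
metrics = per_class_metrics(toy)
print(metrics["recall"], metrics["weighted_recall"], metrics["accuracy"])
```

One consequence worth noting: the support-weighted average of recall is always equal to the overall accuracy, which is why both are 90.7% here. The weighted averages are dominated by the large Normal class, so they say little about the abnormal classes.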

Considering the data imbalance, I re-trained the multi-class classification model with balanced class weights. The weights were computed with the scikit-learn 'class_weight' module. The computed weights are shown below:

Class   0      1      2      3      4
Weight  0.250  2.877  4.116  5.031  4.735
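For reference, the "balanced" heuristic in scikit-learn sets each class weight to n_samples / (n_classes × class count). The standard-library sketch below reproduces that formula on made-up labels. The function name is hypothetical, and the toy class counts are not the project's data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy labels: class 0 makes up 80 of 100 samples, the other four classes 5 each.
toy_labels = [0] * 80 + [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5
print(balanced_class_weights(toy_labels))
```

Rare classes receive large weights and the dominant class a weight below 1, so each class contributes roughly equally to the loss. A dictionary of this form can be passed to the class_weight argument of the Keras fit method.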

 

The precision and recall obtained with the re-trained model are summarized in Figure 10. While the recall of class 3 (i.e., Malignant Calcification) increased, the precision and recall of the other classes decreased slightly. Overall, adding the class weights did not produce a noticeable improvement.

Figure 10. Precision, Recall, and F1-Score with Class Weight

 

Figure 11 shows the Precision-Recall (PR) curve as well as the F1 curve for each class.

Figure 11. Precision-Recall (PR) Curves with F1 Score

 

Binary Classification

The developed CNN was further trained for binary classification (i.e., Normal vs. Abnormal). The number of epochs for the model training was 100, and the other parameters remained the same as for the multi-class classification model. The confusion matrix and normalized confusion matrix are shown in Figure 12.

Figure 12. Confusion Matrix Results for Binary Classification

The corresponding precision and recall for detecting abnormalities were also calculated, and the results are shown below. The binary classification model achieved high precision and recall values, which are far better than those obtained with the multi-class classification model. It should be noted that recall is a more important measure than precision for rare cancer detection, because recall accounts for false negatives, and a missed cancer is a critical failure in cancer detection.

  • Precision = 808/(808 + 13) = 0.984
  • Recall = 808/(808 + 98) = 0.892
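The arithmetic above can be checked with a few lines of Python. The counts are read from the formulas as 808 true positives, 13 false positives, and 98 false negatives. The F1 score is added here for completeness, and the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted abnormal that is truly abnormal
    recall = tp / (tp + fn)      # share of truly abnormal that was detected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=808, fp=13, fn=98)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.984 recall=0.892 f1=0.936
```

The 98 false negatives are the abnormal patches the model missed, which is why recall (0.892) lags behind precision (0.984).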

 

Figure 13 shows the Precision-Recall curve for the binary classification.

Figure 13. Precision-Recall (PR) Curves with F1 Score - Binary Classification

 

Examples of Predictions

We can use the developed CNN to make predictions about images. Figure 14 shows examples of image predictions. Correct prediction labels are blue and incorrect prediction labels are red. The number gives the model's confidence, as a percentage, in the predicted label.

Figure 14. Example of Image Predictions 

 

Conclusions and Future Studies

In this capstone project, I developed two Convolutional Neural Network (CNN) models for mammography image classification. Both models were developed with highly imbalanced data sets. The first model (i.e., multi-class classification) was trained to classify the images into five classes: Normal, Benign Calcification, Benign Mass, Malignant Calcification, and Malignant Mass. The other model (i.e., binary classification) was trained to distinguish normal from abnormal cases. Notable findings of this project are summarized below:

  1. The multi-class classification model achieved an accuracy of 90.7%, but accuracy is not an appropriate performance measure when the data are imbalanced.
  2. The precision and recall for the abnormal classes (i.e., Benign Calcification, Benign Mass, Malignant Calcification, and Malignant Mass) in the multi-class classification model were lower than the overall accuracy. The recall values for these classes were 68.4%, 50.5%, 35.8%, and 47.1%, respectively, while the precision values were 68.8%, 48.5%, 56.7%, and 57.1%, respectively. However, the weighted averages of precision and recall were 89.8% and 90.7%, respectively.
  3. The precision and recall values for detecting abnormalities (i.e., binary classification) were 98.4% and 89.2%, respectively.

Future work will investigate ways to increase the precision and recall values of the multi-class classification model. An immediate extension of this project is to investigate the model performance after adding blocks/layers to the existing CNN model and tuning the hyperparameters. In the meantime, I will examine the data imbalance issue with both over-sampling and under-sampling techniques. Additionally, I will improve the developed CNN model by integrating it with a whole image classifier.

 

References

  1. American Cancer Society. Breast Cancer Facts & Figures 2017-2018. Atlanta: American Cancer Society, Inc. 2017
  2. Lotter, William, et al. "Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach." arXiv preprint arXiv:1912.11027 (2019).
  3. Abdelhafiz, Dina, et al. "Deep convolutional neural networks for mammography: advances, challenges and applications." BMC bioinformatics 20.11 (2019): 281.
  4. Nelson, Heidi D., et al. "Factors associated with rates of false-positive and false-negative results from digital mammography screening: an analysis of registry data." Annals of internal medicine 164.4 (2016): 226-235.
  5. Xi, Pengcheng, Chang Shu, and Rafik Goubran. "Abnormality detection in mammography using deep convolutional neural networks." 2018 IEEE International Symposium on Medical Measurements and Applications (MeMeA). IEEE, 2018.
  6. Rebecca Sawyer Lee, Francisco Gimenez, Assaf Hoogi, Daniel Rubin (2016). Curated Breast Imaging Subset of DDSM [Dataset]. The Cancer Imaging Archive. DOI: 10.7937/K9/TCIA.2016.7O02S9CY
  7. https://github.com/trane293/DDSMUtility
  8. Lehman, Constance D., et al. "National performance benchmarks for modern screening digital mammography: update from the Breast Cancer Surveillance Consortium." Radiology 283.1 (2017): 49-58.
  9. Shen, Li, et al. "Deep learning to improve breast cancer detection on screening mammography." Scientific reports 9.1 (2019): 1-12.

About Author

Chris (Kitae) Kim

Self-motivated data scientist with hands-on experience in substantial data handling, processing, and analysis. Skilled in machine learning, image classification, data visualization, and statistical inference for problem solving and decision making.