Identifying Chip Fail Patterns with CNNs
Introduction
Many industries monitor their production yields to find areas of improvement and inform their business decisions, and that certainly holds true for the semiconductor chip manufacturing industry. Manufacturing chips entails continuous processing of a wafer (shown below on the left), which contains thousands of chips.
Each chip on these wafers is electrically measured to ensure that it meets the performance standards for a given purpose (i.e. technology application). These measurements can sometimes reveal patterns of faulty chips. A common fail mode, denoted the "Donut" pattern in the "Yield Patterns" illustration on the upper right, is an example of that. (There are also parametric patterns, which we will not be using in this project.)
Once these patterns occur, it is the job of the engineering team to understand what caused the chips to fail in such a pattern and implement process improvements to avert the issue in the future. This process of understanding what causes such patterns is known as "root cause" or "drilldown" analysis. The more examples one has of the fail pattern, the easier it is to find the root cause and solve the problem.
The drawback of that form of analysis, though, is that it is a manual and time-consuming process. The engineering team may have to dig through thousands of yield patterns in order to group them into their respective categories. Automating the grouping of yield patterns through image recognition techniques would drastically reduce the time and effort required to understand the root cause and implement process improvements. CNNs have proven to be excellent image classifiers [9], and my work focuses on applying CNN-based image classification to yield patterns.
Prior Work
This application of CNNs to the semiconductor manufacturing industry is not new, as it dates back to 2005 [8] and has been published by various groups. There are also many companies interested in this application, amongst them:
- Infineon [1]
- Toshiba [6]
- STM [8]
- ONSEMI (the blog author's current company)
- Intel (via personal and professional conversations)
It's clear that addressing this issue would provide a great advantage in improving the processing of chips and increasing a company's return on investment (ROI).
Of particular interest to this work are two papers [2, 6] that used a publicly available dataset named "WM-811K", the same dataset we will use in this work.
Data set:
The dataset used in this work is the publicly available "WM-811K", so named because it contains roughly 811,000 wafer maps from a real-world semiconductor fab, each encoded in the following manner:
0: Background
1: Good Chip
2: Bad Chip
The dataset has been used in prior publications as well [2, 6], showcasing its usability and relevance to this type of application. Readers in the semiconductor industry may find the plot below relevant: it shows that the data comprises roughly 32,500 lots*.
*A lot is a grouping of 25 wafers that are typically processed together within the fab.
Some of the principal patterns which occur in the dataset are shown below and consist of eight different categories: Center, Donut, Edge-Loc (Edge-Localized), Edge-Ring, Loc (Localized), Random, Scratch and Near-full.
The goal is to categorize any given wafer map into one of these categories.
EDA and Preprocessing:
The raw dataset consists of six columns:
- WaferMap: The actual encoded wafer maps, a very important column
- DieSize: Not relevant and not used
- LotName: Not relevant and not used
- WaferIndex: Not relevant and not used
- TrainTestLabel: Not relevant and not used
- FailureType: The labels for the failure patterns, essentially your "y-label" array
Only two of the columns are useful; the rest are not used in the analysis.
A few pre-processing steps were needed for this dataset as well:
- Corrected column name: "trianTestLabel" -> "trainTestLabel"
- Changed column data types:
- FailureType -> String
- trainTestLabel -> String
- lotName -> String
- dieSize -> int32
- Added column "failureNum" to encode the "failureType" column as:
- 'Center': 0
- 'Donut': 1
- 'Edge-Loc': 2
- 'Edge-Ring': 3
- 'Loc': 4
- 'Random': 5
- 'Scratch': 6
- 'Near-full': 7
- 'none': 8
- '[]': 9
- Added "waferMapDim" column which calculates the dimensions of each wafer map.
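The pre-processing steps above can be sketched in pandas. The column names follow the dataset; the two rows below are made-up stand-ins for illustration:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the raw WM-811K columns.
df = pd.DataFrame({
    "waferMap": [[[0, 1], [2, 1]], [[0, 2], [1, 1]]],
    "trianTestLabel": ["Training", "Test"],   # note the original typo
    "failureType": ["Center", "Donut"],
    "lotName": ["lot1", "lot2"],
    "dieSize": [1683.0, 1683.0],
})

# Correct the misspelled column name.
df = df.rename(columns={"trianTestLabel": "trainTestLabel"})

# Cast columns to the types used in the analysis.
df["failureType"] = df["failureType"].astype(str)
df["trainTestLabel"] = df["trainTestLabel"].astype(str)
df["lotName"] = df["lotName"].astype(str)
df["dieSize"] = df["dieSize"].astype("int32")

# Encode the failure labels numerically.
failure_codes = {
    "Center": 0, "Donut": 1, "Edge-Loc": 2, "Edge-Ring": 3,
    "Loc": 4, "Random": 5, "Scratch": 6, "Near-full": 7,
    "none": 8, "[]": 9,
}
df["failureNum"] = df["failureType"].map(failure_codes)

# Record each map's (rows, cols) dimensions.
df["waferMapDim"] = df["waferMap"].apply(lambda m: (len(m), len(m[0])))
```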
The distribution of the failure patterns proved to be very uneven. A great deal of the data is unlabeled (~79%) and will not be used in this exercise. A large percentage of the labeled data has no visible pattern (~18%) and will also not be used. This leaves only ~3% of the data useful for training and testing purposes, as shown in the pie chart below. Furthermore, within that 3%, the Scratch, Random and Donut types have the smallest number of maps. This is problematic mainly for the Scratch type, the hardest to detect. [2]
Resizing Wafer Maps:
Wafer maps come in a variety of sizes within the dataset (some examples are shown below), ranging from the smallest (6, 21) to the largest (300, 202). In total there are 632 different wafer map sizes.
This large variation in map sizes is an issue when training our CNN, since it is standard to resize input images to a common size prior to training in order to optimize learning. Resizing was accomplished with the CV2 library, which offers various interpolation methods for scaling images. Of the available methods (shown below), Nearest Neighbors gave the best results for our wafer maps.
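The reason nearest-neighbor interpolation suits these maps is that it preserves the discrete {0, 1, 2} codes, whereas bilinear or bicubic scaling would blend them into meaningless fractional values. A minimal NumPy sketch of the idea (`cv2.resize(..., interpolation=cv2.INTER_NEAREST)` performs the equivalent operation):

```python
import numpy as np

def resize_nearest(wafer_map: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbor resize: each output pixel copies its nearest
    source pixel, so no new values are invented."""
    in_h, in_w = wafer_map.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return wafer_map[rows[:, None], cols]

# A tiny wafer map: 0 = background, 1 = good chip, 2 = bad chip.
wafer_map = np.array([[0, 1],
                      [2, 1]], dtype=np.uint8)
resized = resize_nearest(wafer_map, 4, 4)
# Only the original codes {0, 1, 2} appear in the output.
```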
Augmenting Maps
To minimize overfitting, the input wafer maps (i.e. images) were also "augmented" by randomly rotating each image by one of four rotation choices: ±90° and ±180°. Some examples are shown below:
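This rotation augmentation can be sketched with `np.rot90`, since the rotation set maps to 1, 2 or 3 quarter-turns (a +180° and a -180° rotation coincide):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(wafer_map: np.ndarray) -> np.ndarray:
    """Rotate a map by +90, -90 or 180 degrees, chosen at random.
    k counts counter-clockwise quarter-turns: 1 = +90, 2 = 180, 3 = -90."""
    k = rng.choice([1, 2, 3])
    return np.rot90(wafer_map, k=k)

wm = np.array([[0, 1],
               [2, 1]], dtype=np.uint8)
aug = augment(wm)  # same chips, new orientation
```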
One hot encoding
In order to properly train a CNN for image recognition purposes, it is necessary to separate the image into its principal components. For standard color images, this usually entails separating each image into its RGB components. The images used here don't have standard RGB color coding; instead they are pre-encoded with the following:
0: Background
1: Good Chips
2: Bad Chips
To process these images in the CNN, we first separate each image into these three components (shown below) and then feed the "one hot encoded" resulting image into the CNN.
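A minimal sketch of that three-channel split, analogous to separating RGB planes:

```python
import numpy as np

def one_hot_map(wafer_map: np.ndarray) -> np.ndarray:
    """Split a {0, 1, 2}-coded map into three binary channels:
    background, good chips, bad chips."""
    return np.stack([(wafer_map == code).astype(np.float32)
                     for code in (0, 1, 2)], axis=-1)

wm = np.array([[0, 1],
               [2, 1]])
encoded = one_hot_map(wm)  # shape (2, 2, 3): one plane per code
```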
Custom Generators and RAM limitations:
Our code was implemented in Google Colab. Since we are using 25,000 images, each encoded into 3 layers, for a total of ~75,000 image layers fed into our CNN, we would have needed a substantial amount of RAM (>60 GB) to train it, but we were limited to the 15 GB given by default in the non-subscription version of Colab. To bypass this limitation, we used generators that performed all the pre-processing on the wafer maps before feeding them to the CNN as input batches. Our custom generators performed the following on each image: resizing, augmentation and encoding. Generators only "generate" a result temporarily in RAM, and the output is removed from memory once it's used (i.e. fed into the CNN). Consequently, this approach saved a substantial amount of memory, and our CNN training used less than 5 GB of RAM.
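A simplified sketch of such a generator. The `resize` and `augment` steps are trivial placeholders here (the real ones are described above); the point is the batching pattern, which keeps only one fully encoded batch in RAM at a time:

```python
import numpy as np

# Placeholder preprocessing steps; the real pipeline resizes, rotates
# and one-hot encodes each map as described above.
def resize(m):
    return m

def augment(m):
    return m

def encode(m):
    return np.stack([(m == c) for c in (0, 1, 2)], axis=-1).astype(np.float32)

def wafer_batch_generator(maps, labels, batch_size, rng=None):
    """Yield preprocessed (X, y) batches on the fly; each batch is
    discarded from memory after it is consumed by the training loop."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(maps)
    while True:  # loop forever, reshuffling each epoch (Keras-style)
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            X = np.stack([encode(augment(resize(maps[i]))) for i in batch])
            yield X, labels[batch]

maps = np.zeros((10, 6, 6), dtype=np.uint8)   # ten dummy wafer maps
labels = np.arange(10)
gen = wafer_batch_generator(maps, labels, batch_size=4,
                            rng=np.random.default_rng(0))
X, y = next(gen)  # X: (4, 6, 6, 3), y: (4,)
```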
CNN Architecture:
We use a standard architecture popularized by one of the first CNNs [9]. The main building block, outlined in red dotted lines below, consists of a convolution layer with 3x3 kernels, followed by a ReLU activation function (not shown in the diagram); the output of the non-linearity is then batch-normalized and max-pooled with a 2x2 filter. This block is repeated three times, for a total of three hidden layers in our CNN.
The output of the last hidden layer is then flattened and passed through two fully connected layers of 64 and 32 units, respectively. Finally, the output of the last fully connected layer is passed through a softmax activation function to give the relative probabilities for each of the eight categories. Recall that we are only using eight different fail patterns in our CNN training. They are:
- 'Center': 0
- 'Donut': 1
- 'Edge-Loc': 2
- 'Edge-Ring': 3
- 'Loc': 4
- 'Random': 5
- 'Scratch': 6
- 'Near-full': 7
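The architecture described above can be sketched in Keras. Note that the per-block filter counts (16/32/64) are assumptions for illustration, since the text specifies only the kernel, pooling and fully-connected sizes:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(64, 64, 3), n_classes=8):
    model = models.Sequential([layers.Input(shape=input_shape)])
    # Three repetitions of the building block:
    # Conv(3x3) -> ReLU -> BatchNorm -> MaxPool(2x2)
    for filters in (16, 32, 64):  # assumed filter counts
        model.add(layers.Conv2D(filters, (3, 3), activation="relu",
                                padding="same"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))
    # Flatten, then the two fully connected layers, then softmax over
    # the eight fail-pattern categories.
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(32, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model

model = build_model()
```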
Image size dependence:
It was found during training that the ideal input image size for our CNN was ~64x64. The training loss curves for three different input sizes are shown below. For smaller images (32x32), we tend to see underfitting. The images are smaller, but the kernel size has not changed (3x3), so each kernel operates on a magnified view of the image and likely misses finer details. This is typically due to a lack of "capacity" in the model. To address it, we could shrink the kernel size, add more convolutional layers to extract more features (i.e. details), or apply other methods to increase capacity.
For larger images (128x128) we likely see overfitting, which is typically due to a lack of "generalization" in the model, i.e. too much capacity. The model is likely fitting the noise and simply memorizing the training data, and cannot adjust to data outside the training set. This was a bit counter-intuitive: we expected the CNN to improve with larger images, since the small 3x3 kernels can extract more features from each image. However, increasing the input size increases the size of the fully connected (FC) layers, which are the most prone to overfitting. Going from 32x32 to 128x128, we go from ~42.6k trainable parameters to ~829k trainable parameters! Thus, the FC layers are likely the limitation leading to overfitting here. To overcome that, we could apply dropout (in the FC layers only), decrease the number of FC layers (2 -> 1), or use any of the other methods that minimize overfitting in regular dense neural nets.
The above analysis shows that our architecture is best suited to 64x64 images, the size we chose for our analysis. The 64x64 loss curve is much more conventional, with a clear deviation at ~9-10 epochs, so we chose 10 epochs for our model.
Results:
After 10 epochs, our model has a training accuracy of ~94% and a test accuracy of ~90%.
The most interesting breakdown is how it performs on each type of spatial feature it encountered. Summarized below, we can see that the bottom three performers are the Localized, Donut and Scratch patterns. We expected Localized and Scratch to be difficult, but it's not obvious why Donut is performing so poorly.
When inspecting the confusion matrix, we see that the majority of the Donut patterns are being confused with Loc and Center.
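For readers unfamiliar with the tool, a confusion matrix tallies true labels against predictions, so off-diagonal cells expose exactly which classes get mixed up. A minimal sketch (the predictions below are made up for illustration, using the label encoding from earlier):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical results: two Donuts (1) mistaken for Center (0) and Loc (4).
y_true = np.array([1, 1, 1, 0, 4])
y_pred = np.array([0, 4, 1, 0, 4])
cm = confusion_matrix(y_true, y_pred, n_classes=8)
# Row 1 (Donut) now shows one count each under Center, Donut and Loc.
```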
Inspecting the actual maps, we see three types of errors:
- Wrong Labels: The dataset inherently contains mislabeled data; all datasets do, and there's little we can do besides manually relabeling all the data.
- Mistaking for Loc and Center: We see a lot of "partial donuts" labeled as Donuts, which makes them easy to confuse with the Loc and Center patterns, so we will need more training to address this. Potential solutions include:
- Attention mechanisms
- Data Augmentation for donuts
- Generate more donut patterns artificially to train the CNN better on "Partial Donuts"
- Flagrant Mistakes: Genuine mistakes that expose weaknesses of our CNN. Potential solutions are:
- AttentionΒ mechanisms
- Using smaller kernels, larger images, deeper networks
- Transfer learning
Conclusion and Future Work:
Our work shows that we can attain results similar to [2] even with a simplified CNN architecture that is not very deep. This is significant in that we can only improve from here and aid the chip manufacturing industry in quick root-cause problem solving. We can further improve by applying the following:
- Attention mechanisms
- Transfer learning
- Optimized deeper networks
We would also like to understand how other approaches, such as GANs and transformers, compare.
Finally, we would like to deploy our model into an app one day so that it is usable by end customers and aids device engineers in their day-to-day work.
References
[1] Data Mining and Support Vector Regression Machine Learning in Semiconductor Manufacturing to Improve Virtual Metrology, Benjamin Lenz et al., IEEE, 2023 (Recent)
[2] Improved Wafer Map Inspection Using Attention Mechanism and Cosine Normalization, Qiao Xu et al., Machines, 2022 (*)
[3] Machine learning for semiconductors, Duan-Yang Liu et al., Chip, 2022 (Review)
[4] A Novel Framework for Semiconductor Manufacturing Final Test Yield Classification Using Machine Learning Techniques, DAN JIANG et al., IEEE, 2020
[5] Unsupervised Wafermap Patterns Clustering via Variational Autoencoders, Peter Tulala et al., IEEE, 2018
[6] A Comprehensive Big-Data-Based Monitoring System for Yield Enhancement in Semiconductor Manufacturing, Kouta Nakata, IEEE, 2017. (*)
[7] Data mining for yield enhancement in semiconductor manufacturing and an empirical study, Chen-Fu Chien et al., 2007
[8] Unsupervised Spatial Pattern Classification of Electrical Failures in Semiconductor Manufacturing, G. De Nicolao et al.,Pattern Recognition Letters, 2005
[9] Gradient-based learning applied to document recognition, LeCun, Y., et al., Proceedings of the IEEE, 86 (11): 2278-2324, 1998
(*): These papers use the open-source dataset WM-811K