Bosch Production Line Performance

Posted on Sep 22, 2016

In August 2016 Bosch, one of the world's leading manufacturing companies, launched a competition on Kaggle addressing the occurrence of defective parts in its assembly lines. This post focuses on the machine learning pipeline built for the competition, and on how to preprocess the large dataset for a traditional machine learning workflow.


The manufacturing industry relies on continuous optimization to ensure that quality and safety standards are respected while pushing production volume. Being able to predict if and when a given part will fail the standards is an essential part of such optimization, as it leverages the massive amount of data already recorded on the production line without affecting the process. This is particularly relevant in the "assembly" phase, since it accounts for 50% to 70% of the manufacturing cost. Bosch, among other companies, records data at every step along its assembly lines in order to build the capability to apply advanced analytics and improve the manufacturing process.

Figure 1. Breakdown of the manufacturing costs

The assembly line

A quick check of the dataset header allows us to draw a sketch of the assembly line used for this dataset.

Figure 2. Assembly Line

The assembly line is divided into 4 lines (segments) and 52 workstations. Each workstation performs a variable number of tests and measurements on a given part, accounting for 4,264 features in total. Different products do not necessarily share the same path along the assembly line, nor does there seem to be a common starting or final workstation. Each of the 1,183,747 parts recorded in the dataset follows one of 4,700 unique paths. As shown in fig. 2, one observation can be interpreted as a series of cells (yellow boxes) where the object is processed. Conversely, features may be described by their popularity (the number of rows/parts for which the feature exists) and their defective rate, defined as the percentage of parts measured at a given feature that fail the quality test (see fig. 2). It is interesting to notice that features with a high defective rate (>0.6%) are clustered around specific areas, mostly in lines 0 and 1.
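These two per-feature statistics are easy to compute with pandas. The sketch below uses a toy stand-in for the numeric table (the real one has 1,183,747 rows and 968 feature columns); the column names and values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the numeric table; NaN means the part was not measured here.
num = pd.DataFrame({
    "L0_S0_F0": [0.1, None, 0.3, 0.2, None],
    "L0_S0_F2": [None, 0.5, None, 0.4, 0.6],
    "Response": [0, 1, 0, 0, 1],
})

features = [c for c in num.columns if c != "Response"]
# Popularity: share of parts actually measured at this feature.
popularity = num[features].notna().mean()
# Defective rate: share of the measured parts that fail the quality test.
defective_rate = num[features].apply(
    lambda col: num.loc[col.notna(), "Response"].mean()
)
print(popularity.to_dict())
print(defective_rate.to_dict())
```

On the full dataset these two series are what the clustering of high-defective-rate features in lines 0 and 1 is read off from.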

Exploratory Analysis

Both the training set and the testing set provided by Bosch were split into three separate tables: one containing numerical values, one categorical values, and one timestamps. Each of these tables is roughly 2.8 GB, summing to 7.7 GB for the training data and the same for the testing data. With data of this size, it is extremely important to understand the data before testing any machine learning technique.

The first important thing is to understand the naming schema of these tables. According to the description document, L indicates the assembly line number, S the workstation number, and F an anonymous feature measured at that station. For example, L0_S0_F1 is Feature 1 measured at Station 0 on Assembly Line 0. However, the special code D is used differently: a column named D(n) records the timestamp at which feature F(n - 1) was measured. For instance, L0_S0_D10 stores the timestamp for L0_S0_F9. Why were columns named in this strange way?

By loading just the first few rows of each dataset, it is possible to check all the column names at once and discover whether there is any pattern in the structure. As it turns out, the three tables in the training set (and likewise in the testing set) have the same number of observations, which suggests that the data was separated by column from a single table; the original table can therefore be restored by simply binding the three tables together, without any advanced joining procedure. This also answers the question about the D codes -- the last number in each column name is simply a column counter, not a feature ID. Each timestamp column sits next to its corresponding F column, which explains why D(n) columns describe F(n - 1) columns.
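The column-wise bind can be sketched as follows, with tiny stand-ins for the three tables (the Id values and feature names here are illustrative; the real files are ~2.8 GB each and would be read with `pd.read_csv`):

```python
import pandas as pd

# Toy stand-ins for the three column-split tables.
num  = pd.DataFrame({"Id": [4, 6], "L0_S0_F0": [0.03, 0.09], "Response": [0, 0]})
cat  = pd.DataFrame({"Id": [4, 6], "L0_S1_F25": ["T1", None]})
date = pd.DataFrame({"Id": [4, 6], "L0_S0_D1": [82.24, 1313.12]})

# All three tables share the Id column in the same row order, so the original
# wide table is restored by a column-wise bind -- no join logic needed.
full = pd.concat(
    [num.set_index("Id"), cat.set_index("Id"), date.set_index("Id")], axis=1
)
print(full.columns.tolist())
```

Because the rows are already aligned on Id, `pd.concat(axis=1)` is enough; a merge would only add overhead.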


Figure 3. Breakdown of Original Dataset


Figure 4. The Order of Original Dataset

Next, there is massive missingness within the dataset. To be specific, only 5% of numerical values, 1% of categorical values, and 7% of timestamps are NOT NULL. This is quite understandable -- each part only goes through a certain number of stations and is never touched by most of the others, so the missingness is not at random. If a proper transformation can squeeze those empty cells out of the table, the physical file size can be reduced significantly, making it possible to apply machine learning directly to the entire dataset.
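The per-table fill rate is a one-liner to measure. A minimal sketch on synthetic data with roughly the same sparsity as the numeric table:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy sparse table: each part visits only a few stations, so most cells are NaN.
values = rng.normal(size=(1000, 50))
values[rng.random((1000, 50)) < 0.95] = np.nan   # ~95% missing, as in the numeric table
num = pd.DataFrame(values)

cell_fill = num.notna().to_numpy().mean()        # share of cells holding a value
print(f"{cell_fill:.1%} of cells are non-null")
```

Running the same computation on each of the three real tables yields the 5% / 1% / 7% figures above.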


Figure 5. The Missingness in Bosch Dataset

Dimension Reduction and Feature Engineering

What causes a defective part? What is the likelihood of detecting an error on the assembly line? It is reasonable to assume that this likelihood increases with the number of steps required to produce a part. Similarly, the higher the number of measurements, the longer the time required to complete the part/product.

Using this simple assumption, we can produce (at least) three new features related to the "process" rather than to individual measurements. This is particularly relevant for the timestamps (Date dataset), where the non-null features of a given row show only very few (around 3) unique values. By calculating the time lapse (TMAX - TMIN) per part, the Date dataset can effectively be reduced from 1,156 columns to 1. The other features, namely the number of steps (non-null entries) per row, can be calculated for both the numerical and categorical datasets.
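Both process features fall out of simple row-wise reductions. A sketch on a toy slice of the Date table (column names illustrative; NaN where a part skipped a station):

```python
import pandas as pd

# Toy slice of the Date table.
date = pd.DataFrame({
    "L0_S0_D1":     [82.24, None,   1618.70],
    "L0_S1_D26":    [82.26, 505.50, None],
    "L3_S50_D4254": [87.33, 505.51, 1620.00],
})

proc = pd.DataFrame({
    # Time the part spent in the line: last timestamp minus first.
    "time_lapse": date.max(axis=1) - date.min(axis=1),
    # Number of measurement steps actually performed on the part.
    "n_steps":    date.notna().sum(axis=1),
})
print(proc)
```

`time_lapse` replaces all 1,156 date columns; `n_steps` computed the same way on the numerical and categorical tables gives the other two process features.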

A second major gain in dimension reduction can be obtained on the categorical dataset by noticing the large number of duplicated columns (1,913), probably referring to the same features measured at different stations. Furthermore, the categorical features take only 93 unique values. Rather than encoding the features (preserving the original feature set), we chose to look at the appearance of each categorical value. Combining these transformations, the original 2,140 features shrink to 93 dummy variables.
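A minimal sketch of both steps, on a toy categorical table (one duplicated column and a three-value vocabulary standing in for the real 1,913 duplicates and 93 values):

```python
import pandas as pd

# Toy categorical table; column names and values are illustrative.
cat = pd.DataFrame({
    "L0_S1_F25":   ["T1", None, "T2", None],
    "L0_S2_F60":   ["T1", None, "T2", None],   # exact duplicate of F25
    "L1_S24_F675": [None, "T8", None, "T1"],
})

cat = cat.T.drop_duplicates().T                # drop duplicated columns

# One dummy per categorical value: does value v appear anywhere in this row?
values = sorted(v for v in pd.unique(cat.to_numpy().ravel()) if pd.notna(v))
dummies = pd.DataFrame(
    {f"has_{v}": cat.eq(v).any(axis=1).astype(int) for v in values}
)
print(dummies)
```

The resulting dummy table has one column per unique value, which is how 2,140 columns collapse to 93 on the real data.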

Finally, the sparsity of both the numerical and the transformed categorical datasets can be exploited by storing the data in the libSVM sparse format. Overall, we reduced the data by a factor of 5, from 7.7 GB to 1.7 GB.
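The libSVM format stores only the non-zero entries of each row as `index:value` pairs, which is where the size reduction comes from. A small sketch using scikit-learn's writer (writing to an in-memory buffer instead of a file):

```python
import numpy as np
from io import BytesIO
from sklearn.datasets import dump_svmlight_file

# Dense matrix with the missing entries already filled as 0.
X = np.array([[0.0, 0.5, 0.0],
              [1.2, 0.0, 0.0]])
y = np.array([0, 1])

buf = BytesIO()
dump_svmlight_file(X, y, buf)      # one "label index:value ..." line per row
text = buf.getvalue().decode()
print(text)
```

With ~95% of cells empty, each row shrinks to a handful of pairs instead of thousands of columns.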


After feature engineering, the dataset is ready to be fed into the machine learning pipeline. Due to the high correlation among variables, the large number of observations, and the non-random missingness patterns of the data, tree models are expected to perform better in this scenario, because they are capable of picking up correlations among variables during training. Nevertheless, a logistic regression model was also trained, and its performance serves as the baseline against which the other models are measured. The metric used to evaluate each model is the Matthews Correlation Coefficient (MCC), which values true positives and true negatives equally, and ranges from -1 (perfectly incorrect) to 1 (perfectly correct).
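The MCC is available directly in scikit-learn; a small worked example (the confusion counts here are made up to illustrate the formula):

```python
from sklearn.metrics import matthews_corrcoef

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1]   # TP=3, TN=3, FP=1, FN=1
mcc = matthews_corrcoef(y_true, y_pred)
print(mcc)                           # (9 - 1) / sqrt(4*4*4*4) = 0.5
```

Note that predicting the majority class everywhere yields an MCC of 0, which is why the metric is well suited to this heavily imbalanced defect-detection problem.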

Since logistic regression cannot handle missing values, imputation is needed. However, imputing the full dataset is computationally expensive, because every missing value must be filled and then processed during training. Therefore, only the numerical dataset was used for this model. In addition, L1 regularization (as in the Lasso) was used to narrow down the most important features. As a result, only 22 of the 968 variables were kept in the final model, which achieved an MCC score of 0.14 on the test set. This is already a great improvement over blindly assuming that no observation is defective (response == 0).
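The feature-selection effect of the L1 penalty can be sketched on synthetic, already-imputed data (the coefficients, sample size, and C value below are illustrative, not the competition settings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
# Only 3 of the 20 features actually drive this synthetic response.
y = (2*X[:, 0] - 3*X[:, 1] + X[:, 2]
     + rng.normal(scale=0.5, size=500) > 0).astype(int)

# The L1 penalty drives most coefficients to exactly zero, acting as an
# embedded feature selector; smaller C means stronger regularization.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(clf.coef_[0])
print(f"{kept.size} of 20 features kept: {kept}")
```

On the real numerical table the same mechanism is what reduced 968 candidate variables to the 22 kept in the final model.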


Figure 6. Variables Included in the Logistic Regression With L1 Regularization 

Next, a tree model was selected to better adapt to the missingness of the data, as well as to achieve a better MCC score. Since Random Forest is extremely computationally intensive and cannot handle missing values, XGBoost was selected for its high computational efficiency and its ability to deal with missing values automatically. The only hyper-parameter modified for this model was the learning rate, which was set to 1 in order to get fast convergence. Five-fold cross-validation on the training set shows that this basic model achieves an MCC score of 0.24, which is a huge improvement!


Figure 7. Most Important Features in Final XGBoost Model

About Authors

Jonathan Liu

Through years of self-learning on programming and machine learning, Jonathan has discovered his interests and passion in Data Science. With his B.B.A. in accounting, M.S. in Business Analytics, and two years of experience as operation analyst, he is...

Diego De Lazzari

Researcher, developer and data scientist. Diego De Lazzari is an applied physicist with a rather diverse background. He spent 8 years in applied research, developing computational models in the field of Plasma Physics (Nuclear Fusion) and Geophysics. As...
