Top 9% Open Kaggle Competition - Santander Products Recommendation

Lydia Kan, Wen Li and Yisong Tao
Posted on Dec 23, 2016

Introduction

To support clients across a range of financial decisions, Santander Bank offers its customers personalized product recommendations from time to time. Under the current system, not all customers receive recommendations that suit them. To better meet individual needs and ensure customer satisfaction, this challenge seeks to improve the recommendation system by predicting which products existing customers will use in the next month based on their past behavior. With a more precise recommendation system, the bank can increase its sales, and the right products can also help customers make the most of their financial plans.

Data Size

The training data set is about 2.3 GB and has 13,647,409 observations. The test data set has 929,615 observations.

Input Features

Columns 1 to 24 are the input features: 21 categorical features and 3 continuous features. They describe each customer's demographics and status with the bank. The observations are in time series format: the data contain each customer's information from January 2015 to May 2016.

Output Features

Columns 25 to 48 are the output features, which contain the product purchase information for each customer from January 2015 to May 2016. Each column stands for one product, and there are 24 products in total. The goal is to predict which products customers are going to purchase in June 2016. Since a customer can purchase several products at once, this is a multi-label classification problem.

Evaluation

$$\mathrm{MAP@7} = \frac{1}{|U|} \sum_{u=1}^{|U|} \frac{1}{\min(m, 7)} \sum_{k=1}^{\min(n, 7)} P(k)$$

The competition scores submissions with Mean Average Precision @ 7 (MAP@7). In the formula, |U| is the number of users in two time points, P(k) is the precision at cutoff k, n is the number of predicted products, and m is the number of added products for the given user at that time point. The @7 means the evaluation only takes the top 7 predicted products into account, no matter how many products are in the prediction. If a customer does not purchase any product, the precision is defined to be 0.
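The metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the competition's official scorer: it follows the commonly used convention in which P(k) only contributes at positions where the k-th prediction is a hit, and the function names and toy product names are our own.

```python
def average_precision_at_k(actual, predicted, k=7):
    """Average precision for one customer, cut off at k predictions."""
    if not actual:
        # No added products: the precision is defined to be 0.
        return 0.0
    actual_set = set(actual)
    seen = set()
    hits = 0
    score = 0.0
    for i, item in enumerate(predicted[:k]):
        # Count each correct product once, at the rank where it appears.
        if item in actual_set and item not in seen:
            seen.add(item)
            hits += 1
            score += hits / (i + 1)  # precision at cutoff i + 1
    return score / min(len(actual_set), k)


def map_at_k(actual_lists, predicted_lists, k=7):
    """Mean of the per-customer average precisions."""
    pairs = list(zip(actual_lists, predicted_lists))
    if not pairs:
        return 0.0
    return sum(average_precision_at_k(a, p, k) for a, p in pairs) / len(pairs)


# Toy example with hypothetical product names.
actual = [["payroll", "pension"], []]
predicted = [["payroll", "credit_card", "pension"], ["payroll"]]
score = map_at_k(actual, predicted)
```

Because customers who add nothing contribute 0 and most customers add nothing in a given month, scores on this metric are small in absolute terms, which is why tiny differences between submissions matter.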

 

Workflow

To build a strong model on this complex data within two and a half weeks as a team, we followed the steps shown below.

[Figure: workflow]

  • Data Cleaning and Exploratory Data Analysis (EDA): Because of the many missing values and the time series format of the data, data cleaning and EDA were carried out simultaneously. The insights gained from cleaning and EDA were also the foundation and inspiration for the later feature engineering.
  • Feature Engineering and Model Training: Because of the evaluation method, even the seventh decimal place matters for the score. Feature engineering is the part that can have the most impact on the model, so it played a main role in this project. Instead of building all the features first and then training models, the two stages moved forward together: the more models we trained, the more insights we found, and the more engineered features we added to improve the result.
  • Ensemble and Model Stacking: After getting results from different single models, we tried different combinations for ensembling and stacking to see whether they could improve the result.

Data Cleaning

Through numeric EDA, we discovered that 24 features in the data set contained missing values. Besides missing values within columns, the data also had missing observations, meaning that some customers have no data for certain months within the overall time range.

After deeper investigation, 5 features were dropped before imputation because more than 95% of their values were missing or because they repeated information found in other features.
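The two checks described above, missing values per column and missing months per customer, could look roughly like this in pandas. The column names and toy data are hypothetical, and the 95% threshold is taken from the text; this is a sketch, not the code used in the project.

```python
import pandas as pd


def missing_report(df, threshold=0.95):
    """Return the missing fraction per column and the columns above the threshold."""
    frac = df.isna().mean()
    to_drop = frac[frac > threshold].index.tolist()
    return frac, to_drop


def missing_months(df, customer_col, month_col):
    """Map each customer to the months (present in the data overall) they have no row for."""
    all_months = set(df[month_col].unique())
    gaps = {}
    for cust, months in df.groupby(customer_col)[month_col]:
        absent = all_months - set(months)
        if absent:
            gaps[cust] = sorted(absent)
    return gaps


# Toy data: customer 2 has no row for 2015-02, and "spouse_flag" is entirely missing.
toy = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2],
    "month": ["2015-01", "2015-02", "2015-03", "2015-01", "2015-03"],
    "age": [30, 30, 30, None, 45],
    "spouse_flag": [None] * 5,
})
frac, to_drop = missing_report(toy)
gaps = missing_months(toy, "customer", "month")
```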

 

Imputation

[Figure: imputation]

About four imputation strategies were applied to this data set. For the features in the 'Unknown' column, the missing values were all labeled 'unknown'. These features are mostly customer demographic information, so to avoid making any assumption, labeling them 'unknown' was the only option. For the features in the 'Common Type' column, missing values were imputed with the median of each feature, because these features either describe the customer's relationship with the bank or are continuous variables.

The features in the 'Others' column were imputed by a couple of different methods. Apart from the 'age' feature, the missing values of the remaining features were filled in by treating those observations as new customers. From the EDA, we discovered that these missing values all came from the same observations, and that within those observations all account activity was under 6 months old, which is also the benchmark for being a new customer. For the 'age' feature, the mean after scaling was used for imputation in order to avoid skewness in the data.

Last but not least, two of the products had missing values. Because the evaluation penalizes false negatives, we assumed that those products had not been purchased yet.
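Three of these strategies map directly onto pandas `fillna` calls. The sketch below uses hypothetical column names and toy values; which real columns belong to which group is not stated in the text, so the groups are passed in as arguments.

```python
import pandas as pd


def impute(df, unknown_cols, median_cols, product_cols):
    """Apply the label, median, and not-purchased imputation rules to a copy of df."""
    out = df.copy()
    for col in unknown_cols:
        # Demographic features: make no assumption, just label the gap.
        out[col] = out[col].fillna("unknown")
    for col in median_cols:
        # Numeric features: fill with the column median.
        out[col] = out[col].fillna(out[col].median())
    for col in product_cols:
        # Product flags: assume the product has not been purchased yet.
        out[col] = out[col].fillna(0).astype(int)
    return out


toy = pd.DataFrame({
    "sex": ["H", None, "V"],
    "income": [100.0, None, 300.0],
    "payroll": [1.0, None, 0.0],
})
clean = impute(toy, unknown_cols=["sex"], median_cols=["income"], product_cols=["payroll"])
```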

 

EDA

First, we looked at how product ownership related to customers' demographic information in May 2016. No matter which customer segment we looked at, the 'current cash account' was the dominant product.

[Figures 1, 2, 3]

Since the data set is in time series format, it was important to look at the trend in the number of customers. As the graph indicates, a large number of new customers appeared in July 2015, and the number kept growing for 4 to 5 months.

[Figure 4]

When we looked at how many products each customer owned in May 2016, we discovered that some customers no longer owned any product. Most customers owned 1 to 2 products.

[Figure 5]

The following graphs show which products are most popular among customers who own 1, 2, or 3 products.

[Figure 6]

Besides the relationship between products and customers, we also investigated how product sales changed over time. Taking these two products as examples, the first graph indicates that there was almost no selling activity for the last 6 months, while the second shows a product that sold steadily over time.

[Figure 7]

In addition, we looked at the income distribution by city. The graph shows that income varies across cities. We later used this information as part of feature engineering.

[Figure 8]

Feature Engineering

Feature engineering is key in this project. Based on the results of model training, the scores improved each time new useful features were added to the models. In the following paragraphs, we discuss our feature engineering process.

[Figure 10]

In this project, we went through several rounds of feature engineering, which can be divided into 4 stages. In the first stage, the input and output features were encoded from letters to numbers, and only the original features in the data set were used in model training, 22 features in total. Because the data set is too large to fit within a laptop's memory, only one month's data was used as the training set. To find the month that gives the best prediction, we tried three directions of month selection: using the previous month to predict the current month, using a month from last year to predict the same month of the current year, and using one month to predict the situation three months later. After trying all the combinations, the pattern between months was not clear, and the MAP@7 scores were not good.

Then, using the same month combinations and adding the previous month's product information as input features (46 input features in total), the performance of the models improved. We also found that, for this data set, the best way to recommend new products is to train on the same month of the previous year. From then on, the data from June 2015 was used as the training set to give recommendations for June 2016.
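Adding the previous month's product columns as inputs amounts to a self-join on customer and month. Here is a minimal pandas sketch, assuming months are stored as consecutive integers (1 for January 2015, 2 for February 2015, and so on); the column names and the `_prev` suffix are our own choices for illustration.

```python
import pandas as pd


def add_previous_month(df, product_cols, customer_col="customer", month_col="month"):
    """Attach each customer's product flags from month t - 1 to the row for month t."""
    prev = df[[customer_col, month_col] + product_cols].copy()
    # Shift the month index forward so that month t - 1 lines up with month t.
    prev[month_col] = prev[month_col] + 1
    prev = prev.rename(columns={c: c + "_prev" for c in product_cols})
    # Left join: customers with no row in the previous month get NaN.
    return df.merge(prev, on=[customer_col, month_col], how="left")


toy = pd.DataFrame({
    "customer": [1, 1, 2],
    "month": [1, 2, 2],
    "payroll": [0, 1, 1],
})
with_lag = add_previous_month(toy, ["payroll"])
```

The NaN values produced for customers with no previous-month row are exactly the missing observations mentioned in the data cleaning section, so they need the same kind of imputation decision.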

 

In the following steps, we added or dropped features based on time series analysis, k-means clustering, and EDA.

Our purpose in model training is to make the model sensitive to newly added products. A machine learning model cannot be expected to work out on its own what we want it to do, so we need to give it that information directly. Therefore, we created a change feature: the current month's product information minus the previous month's product information. This change feature has two levels, "1" and "0": "1" represents a newly added product and "0" represents every other status. The new features were selected based on the time series results.

 

The change features had a positive effect on the model. To improve it further, we split the change features back into three levels, '-1', '0', and '1', representing closing an account, no change, and opening a new account, respectively. At the same time, 5 products were dropped from the prediction list because the bank no longer sells them. Because the Kaggle scoring penalizes false negatives more heavily, and to avoid missing any prediction, class weights were added for the output features based on product popularity, giving more weight to popular products.
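Both versions of the change feature come from the same subtraction of 0/1 ownership flags. A small NumPy sketch follows; the function name and the `levels` argument are ours, for illustration only.

```python
import numpy as np


def change_features(prev, curr, levels=3):
    """Current-month minus previous-month ownership flags.

    levels=3 keeps -1 (closed), 0 (no change), 1 (newly added).
    levels=2 keeps only 1 (newly added) versus 0 (everything else).
    """
    diff = np.asarray(curr, dtype=int) - np.asarray(prev, dtype=int)
    if levels == 2:
        return (diff == 1).astype(int)
    return diff


# One customer, three products: the first is added, the third is closed.
prev_month = [[0, 1, 1]]
curr_month = [[1, 1, 0]]
three_level = change_features(prev_month, curr_month)
two_level = change_features(prev_month, curr_month, levels=2)
```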

 

The month selection methods used in the first two rounds of feature engineering amount to a manual search for time effects between months. Combining what we learned from that manual search with time series results based on the three-level change features, we added product information from more months as new features. Here is an example of how the information for a certain product in a certain month was chosen based on time series.

[Figure 9]

This is the change in pension accounts over time. The ADF test shows that this time series is stationary, the result is statistically significant, and the lag number is 4. Based on this, since June 2015 is the training set, the product change information from February 2015 was added as a new feature. However, this data set has its oddities: in months 13 and 14 the chart shows a sharp decrease followed by a sharp increase. At that time the bank had about 50,000 pension accounts; over 10,000 of them were closed in month 13, and almost 20,000 new pension accounts were opened in the next month. This also shows that adding change features to the model at random is not appropriate, because it is hard to predict erratic changes in the time series.

More feature engineering took place during model training, and the details are discussed in the following sections.

Models Training

Our baseline recommends products based on their popularity at the end of May 2016. Multiple algorithms were tested over the rounds of feature selection, including XGBoost, Naive Bayes, Random Forest, Neural Networks, AdaBoost, and collaborative filtering. In the last two rounds of feature selection, XGBoost and Random Forest stood out and outperformed the other models.
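A popularity baseline of this kind can be written with the standard library alone. In the sketch below, skipping products a customer already owns is our assumption (the text does not spell it out), and the product names are hypothetical.

```python
from collections import Counter


def popularity_baseline(owned_by_customer, top_k=7):
    """Recommend the most widely owned products that each customer does not hold yet."""
    counts = Counter(p for owned in owned_by_customer.values() for p in owned)
    ranked = [product for product, _ in counts.most_common()]
    return {
        cust: [p for p in ranked if p not in owned][:top_k]
        for cust, owned in owned_by_customer.items()
    }


# Ownership snapshot for three toy customers.
snapshot = {
    1: {"current_account"},
    2: {"current_account", "payroll"},
    3: {"current_account", "payroll", "pension"},
}
recs = popularity_baseline(snapshot)
```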

 

With the following changes, namely adding 5 previous months of account history and a marriage index (a combination of age, sex, and income), and removing city and 5 rare products, our XGBoost model scored 0.02996 on the Kaggle leaderboard. The Random Forest model scored 0.02946, the second best among our single models.

Ensemble Models

Multiple ensemble strategies were tested in our modeling process. Voting helped improve our results in earlier rounds, when we had diverse models of similar quality. Later, when we had only two high-quality but correlated models, voting stopped helping. Stacking strategies were also attempted at later stages using the XGBoost and Random Forest models. However, because the models were highly correlated, neither ensemble strategy produced a higher score.
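The text does not say which voting scheme was used. One common variant, shown here purely as an assumption, is soft voting: average the per-product probabilities from each model, then rank products by the averaged score and keep the top 7. The helper names and toy numbers are ours.

```python
import numpy as np


def soft_vote(prob_matrices, weights=None):
    """Weighted average of (customers x products) probability matrices."""
    stacked = np.stack([np.asarray(p, dtype=float) for p in prob_matrices])
    return np.average(stacked, axis=0, weights=weights)


def top_k(probs, product_names, k=7):
    """Names of the k highest-scoring products for each customer."""
    order = np.argsort(-probs, axis=1)[:, :k]
    return [[product_names[j] for j in row] for row in order]


# Two toy models scoring three products for one customer.
model_a = [[0.9, 0.1, 0.5]]
model_b = [[0.1, 0.3, 0.7]]
blended = soft_vote([model_a, model_b])
picks = top_k(blended, ["payroll", "pension", "credit_card"], k=2)
```

When two models rank products almost identically, the averaged scores produce nearly the same ordering as either model alone, which is consistent with voting no longer helping once only two correlated models remained.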

[Figure 11: Ensemble Models - Voting]

[Figure: Ensemble Models - Stacking Strategy 1]

[Figure: Ensemble Models - Stacking Strategy 2]

Insights and Findings

The key to building a good model in this competition was to use June 2015 as the training set, because 5 correlated main account types (nom_pens, nomina, recibo, reca and cno) show seasonal changes. Using June 2015 as the training month together with account history from January 2015 to May 2015 enabled us to capture the time series aspect of the data set. Removing 5 rare products (aval, ahor, viv, deme, deco) also improved our models: this change alone raised our Kaggle leaderboard score by 4%.

At the end of the competition, we were working on a new strategy to improve our model. The new customers who joined after June 2015 showed different purchasing behavior from the old customers. We could use their data from July 2015, which wasn't in our training set, to build separate models for them. Although the "new-customer only" model on its own did not improve the predictions for new customers (Kaggle LB score ~ 0.0297), combining its predictions with those from our XGBoost model trained on old customers could provide better predictions.

[Figure: Strategy to Improve New Customer Prediction]

Final Result

Our best model at the end of the Santander Product Recommendation Kaggle competition was the XGBoost model with the engineered features described above. It scored 0.0299626 on the public leaderboard and 0.0302852 on the private leaderboard, which put us in the top 9% of all participating teams.

[Figure: Kaggle Leader Board Standing, top 9% of 1806 participating teams]
