Data Study to Predict NBA Player Positions

Posted on Aug 15, 2016
The skills the author demonstrated here can be learned through the Data Science with Machine Learning bootcamp at NYC Data Science Academy.


The goal of this project is to predict an NBA player's position based on his statistics.


Decision Tree & Random Forest Classification


The five positions traditionally employed by NBA teams are point guard, shooting guard, small forward, power forward, and center. Although the NBA has moved toward more position-less basketball, clear distinctions between positions still show up in player statistics. In this project I attempt to classify players into one of the five positions based on player statistics scraped from statscrunch, using decision trees and random forests to make the predictions.


1. Center

The center is usually the tallest player on the team. He should be able to post up offensively, receiving the ball with his back to the basket and using pivot moves to hit a variety of short jumpers, hook shots, and dunks. He must also know how to find the open player in the paint, grab rebounds, and block shots.

Average Height: 6’11.25″

Average Weight: 257 lbs


2. Point Guard

The point guard is usually the shortest player on the team. He should be the team's best passer and ball handler, not primarily a shooter. His traditional role is to push the ball up-court and get the offense moving. He should either take the ball to the basket or remain near the top of the key, ready to retreat on defense. Point guards are known for getting the most assists and steals.

Average Height: 6’2″

Average Weight: 189 lbs


3. Shooting Guard

The shooting guard is usually a little taller than the point guard but shorter than the small forward. He is generally considered the team's best shooter. A good shooting guard comes off screens set by taller teammates prepared to shoot, pass, or drive to the basket. He must also be versatile enough to handle some of the point guard's ball-handling duties.

Average Height: 6’5.25″

Average Weight: 209 lbs


4. Small Forward

The small forward is the all-purpose player on offense: strong and aggressive, tall enough to mix it up inside but agile enough to handle the ball and shoot well. He must be able to score both from the perimeter and from inside.

Average Height: 6’7.75″

Average Weight: 225 lbs



5. Power Forward

Finally, the power forward is a strong, tough rebounder, but also athletic enough to move with some quickness around the lane on offense and defense. He is expected to score when given the opportunity on the baseline, much like a center, but usually has a range of up to 15 feet all around the basket.

Average Height: 6’9.5″

Average Weight: 246 lbs



My analysis sets out to predict a player's position based on 15 player statistics. The 15 variables I use are:

  1. Points per Game
  2. True Shooting Score
  3. Offensive Rebounds
  4. Defensive Rebounds
  5. Total Rebounds
  6. Assists
  7. Blocks
  8. Turnovers
  9. Team Play Usage
  10. Offensive Ratings
  11. Defensive Ratings
  12. Offensive Win Shares
  13. Defensive Win Shares
  14. Win Shares
  15. Steals
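As an aside, the "True Shooting Score" in this list refers to true shooting percentage (TS%), a standard efficiency measure combining field goals, three-pointers, and free throws. A minimal sketch of the computation, with made-up sample numbers:

```python
def true_shooting_pct(pts, fga, fta):
    """True shooting percentage: points per two true shooting attempts.
    A true shooting attempt is field goal attempts plus 0.44 times
    free throw attempts (the standard weighting)."""
    return pts / (2 * (fga + 0.44 * fta))

# Example: 25 points on 18 field goal attempts and 7 free throw attempts.
ts = true_shooting_pct(25, 18, 7)  # roughly 0.593
```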

Data Visualization

To start the analysis, let's look at how some of these variables are distributed across positions. The distributions show which variables will best differentiate the positions, and give an indication of the main splitting points the decision tree will use. This visualization shows how the 15 variables are distributed across the 5 positions.

We can clearly see that centers and power forwards dominate the rebound and block statistics. This might cause issues in my decision trees, since the more similar two positions are, the more difficult it will be to distinguish between them. The clearest distinction appears when we look at assists: point guards have by far the most, which leads me to believe that this will be the easiest position to predict consistently.
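The kind of position-level comparison described above boils down to a group-by average over players. A minimal pure-Python sketch, using invented assist numbers rather than the scraped data:

```python
from collections import defaultdict

# Hypothetical (made-up) per-game assist numbers for a few players.
players = [
    ("PG", 8.1), ("PG", 9.4), ("PG", 7.2),
    ("C",  1.2), ("C",  1.8),
    ("SG", 3.5), ("SG", 4.1),
]

# Group assist numbers by position.
by_position = defaultdict(list)
for position, assists in players:
    by_position[position].append(assists)

# Average assists per position -- point guards clearly stand out.
avg_assists = {pos: sum(v) / len(v) for pos, v in by_position.items()}
```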


Summary Statistics

[Figure: summary statistics by position]

Heatmap - by Player

[Figure: heatmap of player statistics, by player]


Heatmap - by Position

[Figure: heatmap of player statistics, by position]



Prediction Using Decision Trees and Random Forest Classification

1. Decision Tree - Good Fit

Decision trees are like a game of twenty questions: each question depends on the answers to the previous ones, and at the end the tree makes a guess based on all the answers. Decision trees are simple to understand and interpret, and can be easily visualized.

To make predictions with the tree, we start at the top (the “root” node) and ask questions, traveling left or right based on the answer (left for true, right for false). At each step we reach a new node with a new question. In this particular tree, if total rebounds (TRB) are greater than 13 (the exact cutoff varies with each fit) and steals are fewer than 0.95, the player is predicted to be a center. Very easy to visualize and interpret indeed.
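The path just described can be written out directly as nested conditionals. A sketch of only those first two splits; the labels on the other branches are placeholders, since the full tree is not reproduced here:

```python
def predict_position(trb, stl):
    """Walk the first two splits of the tree described above.
    Only the Center leaf comes from the post; the other branch
    labels are hypothetical stand-ins for the rest of the tree."""
    if trb > 13:            # root split: total rebounds
        if stl < 0.95:      # second split: steals
            return "Center"
        return "Other"      # hypothetical: remaining subtree not shown
    return "Other"          # hypothetical: remaining subtree not shown

prediction = predict_position(trb=14.0, stl=0.5)  # "Center"
```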



Decision Tree

Results :

[Figure: decision tree prediction results]

The results above show how well my tree performed. Centers were predicted correctly 56% of the time, and power forwards 41% of the time; we see a lot of overlap between these two categories, which is not surprising given that the two positions share very similar distributions of the independent variables. Also unsurprisingly, point guards were predicted correctly 80% of the time. I suspect some of the mispredictions will be corrected when we implement a random forest.


2. Decision Tree - Over Fit


[Figure: over-fit decision tree and its prediction results]


Over-fitting is technically defined as a model that performs better on the training set than a simpler model but does worse on unseen data, as we see above. Correct predictions of centers decreased by 13%, power forwards by 2%, point guards by 18%, and small forwards by 23%. Predictions for shooting guards increased by 18%, but overall the over-fit model performed substantially worse. We went too far and grew the decision tree into massively complex rules that do not generalize to unseen players. It is important to find the sweet spot in the bias-variance trade-off.
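The failure mode above can be illustrated in miniature without any tree library. In this toy sketch (entirely made-up data), an "over-grown" model that memorizes every training example is perfect on the training set but loses to a single-split model on fresh data, because it has memorized the noise:

```python
import random

random.seed(0)

# Toy data: the "true" label is x > 0.5, but 20% of labels are flipped,
# mimicking the noise an over-grown tree ends up memorizing.
def make_data(n):
    xs = [random.random() for _ in range(n)]
    ys = [(x > 0.5) != (random.random() < 0.2) for x in xs]
    return xs, ys

train_x, train_y = make_data(500)
test_x, test_y = make_data(500)

# "Over-fit" model: memorize every training example (a fully grown tree).
memory = dict(zip(train_x, train_y))
def overfit_predict(x):
    return memory.get(x, False)  # unseen inputs fall back to a default

# "Simple" model: the single split x > 0.5 (a depth-one tree).
def simple_predict(x):
    return x > 0.5

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

# The memorizer is perfect on the training set but near chance on unseen
# data, while the simple split generalizes at roughly the noise ceiling.
```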


3. Random Forest

The random forest takes the decision tree to a whole new level by combining many trees into an ensemble, and it adds some randomness into the equation.

  • One type of randomness involves the set of examples used to build the trees. Instead of training with all of the training examples, we use only a random subset. This is called “bootstrap aggregating,” and is done to stabilize the predictions and reduce over-fitting.
  • The other type of randomness involves the set of features used to build each tree. Different trees will consider different features when asking questions. Random forests have low bias, and by adding more trees, we reduce variance, and thus over-fitting.

Before I go into the results, I'll first explain how the forest is trained. Let T be the number of trees. For each of the T trees, we draw a random bootstrap sample of the data; on average, about two-thirds of the observations are used to train each tree. At each node of the tree, m predictor variables are selected at random, and whichever of them gives the best split is used for a binary split at that node. At the next node, another m variables are chosen at random from all the predictors, and the process repeats.
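The "about 2/3rds" figure is a property of the bootstrap itself: a sample of size N drawn with replacement contains, on average, roughly 1 − 1/e ≈ 63% of the distinct observations, leaving the rest "out-of-bag" for error estimation. A quick check with an arbitrary data size:

```python
import random

random.seed(42)

N = 10_000
# Bootstrap sample: N draws with replacement from N observations.
bootstrap_indices = [random.randrange(N) for _ in range(N)]

# Fraction of distinct observations that made it into the sample.
in_bag_frac = len(set(bootstrap_indices)) / N   # close to 1 - 1/e = 0.632

# The remaining ~1/3 are "out-of-bag" and can be used to estimate error.
out_of_bag_frac = 1 - in_bag_frac
```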


In my case study, I decided to stick to tuning only two parameters - mtry and ntree. These two parameters usually have the biggest impact on the final accuracy. ntree represents the number of trees I will build, and mtry represents the number of variables randomly sampled as candidates at each split.

Tuning Parameters

The first parameter I tuned was the number of trees. I decided to use 180 trees; the plot below shows that the out-of-bag error is lowest around this number.


[Figure: out-of-bag error as a function of the number of trees]


For the mtry parameter, I decided to randomly sample 9 variables at each split. I based this decision on the following plot, which shows that the out-of-bag error is minimized at 9.

[Figure: out-of-bag error as a function of mtry]
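The parameter names ntree and mtry suggest R's randomForest package. For readers working in Python, the equivalent knobs in scikit-learn are n_estimators and max_features. A hedged sketch on a tiny synthetic dataset (the actual study used 15 features and mtry = 9; here the toy data has only 3 features, so max_features is scaled down):

```python
import random
from sklearn.ensemble import RandomForestClassifier

random.seed(1)
# Tiny synthetic dataset: the label depends on the first feature only.
X = [[random.random() for _ in range(3)] for _ in range(120)]
y = [int(row[0] > 0.5) for row in X]

# ntree -> n_estimators, mtry -> max_features; oob_score=True exposes
# the out-of-bag error estimate used for tuning in the post.
rf = RandomForestClassifier(
    n_estimators=180,   # the value chosen in the post via the OOB plot
    max_features=2,     # stand-in for mtry (the post used 9 of 15)
    oob_score=True,
    random_state=1,
).fit(X, y)

oob_error = 1 - rf.oob_score_
```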



[Figure: random forest prediction results]


We can see that the random forest performed better than the decision tree. It predicted point guards significantly better, and we also saw gains for centers and small forwards, with only slight decreases for power forwards and shooting guards. My data set was rather small, so the random forest would most likely perform much better with more data. As a next step, I plan on scraping data from 50 seasons and running a similar analysis to see how my models perform with more data.


This analysis set out to predict player positions from a whole array of player statistics. I made use of a decision tree, and then extended that tree to a whole forest. The random forest performed better, but still left a lot of room for improvement.


About Author

Jurgen De Jager

Jurgen’s fascination with analytics and its applications specifically within data science, led to his decision some time ago that this is the career path he wants to pursue post graduation. In anticipation of this, he has worked extensively...

