
Machine Learning for Lotteries

Stephen Penrice
Posted on Dec 7, 2015

Contributed by Stephen Penrice. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his fourth class project (due in the 8th week of the program).

Lucky Numbers Part 2: Machine Learning for Understanding Lottery Players' Preferences


Introduction

In a quick unscientific poll at a recent NYC Data Science Academy meetup, most people indicated that they have played the lottery at one time or another, but of those who have played, only a few indicated that they choose their own numbers. But was my audience at the meetup a representative sample of lottery players? Probably not, given the quantitative skills one would assume for people who choose to spend an evening listening to data science presentations. The goal of this project is to understand the selection behavior of lottery players as a whole. In particular I want to answer the following questions:

  • Are there certain number combinations that are selected unusually often by lottery players?
  • In games where winners share a fixed pool of prize money (i.e. parimutuel games), are the expected prize amounts appreciably lower for players who choose popular combinations? In other words, are customers who loyally play "their" numbers getting smaller payouts than occasional players who play random numbers selected by the lottery terminal?

While data on player selections are not publicly available, all lotteries in the United States publish their winning numbers and the amount awarded for each prize level. In this project I looked at data from six different games.

  • Florida Fantasy 5
  • Pennsylvania Cash 5
  • New Jersey Cash 5
  • North Carolina Cash 5
  • Texas Cash 5
  • Oregon Megabucks

I encourage you to visit some of these sites and look at past winning numbers to get a feel for the variation in prize amounts. Since these are all parimutuel games, the variation in prizes corresponds to variation in the percentage of players who won. So either the Law of Large Numbers does not apply to lotteries, or there is a non-random aspect of player selection. I hope to convince you that the latter is true.

Here's some quick background for readers who are not familiar with lotteries. In the games I studied, the lottery draws 5 or 6 distinct numbers from a set of about 40 integers, and the order in which the numbers are drawn has no effect on prize amounts. For example, New Jersey Cash 5 draws 5 numbers from 1 to 43. The set from which the numbers are selected is called a "matrix" (not to be confused with the mathematical object with the same name). The Cash 5 games have several hundred thousand possible outcomes, and the Oregon game has about 12 million outcomes. The odds of winning the prizes I discuss range from about 1 in 100 to 1 in 1,000. The target quantity for each model is the prize amount that the lottery will pay to each winner given a set of drawn numbers.
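
As a quick sanity check on these figures, base R's choose() reproduces the New Jersey Cash 5 counts directly (a sketch, not code from the original post):

# New Jersey Cash 5: 5 numbers drawn from 1 to 43.
n <- 43
k <- 5

choose(n, k)   # 962,598 possible outcomes

# Probability that a fixed selection matches exactly 3 of the 5 drawn numbers:
# 3 of the player's 5 numbers are drawn, and the other 2 drawn numbers come
# from the 38 numbers the player did not pick.
p3 <- choose(k, 3) * choose(n - k, 2) / choose(n, k)
p3             # ~0.0073, i.e. roughly 1 in 137, consistent with the odds above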

Analysis Organization and Infrastructure

The first challenge I faced in this project was keeping the various components organized and reasonably uniform. The need for organization arose in part from my plan to produce separate models not only for each game but also for different prize levels within the games (e.g., the prize for matching three numbers and the prize for matching four numbers for each "Cash 5" game). Moreover, there are idiosyncrasies in the data that are unrelated to player selections but that still affect the prize payouts:

  • New Jersey expanded its "matrix" in September of 2014 and changed its prize money allocation accordingly. I had to limit my data set to draws for which the current prize scheme applies.
  • In the Texas Cash 5, the prize for matching 4 numbers depends on whether any players won the prize for matching 5, because when no one wins the top prize the jackpot money is divided among the second prize winners rather than being carried over to the next draw's jackpot. I created separate models for each of these cases.
  • In Florida's Fantasy 5 game, the second prizes are increased when there is no top prize winner, and there is also a cap of $555 on the second prizes. Whenever there is no top prize winner and a parimutuel payout of the second prizes would be more than $555, the excess money is added to the third prize pool. I dealt with this simply by limiting my data to draws where there was a jackpot winner, which is the vast majority of cases.

This resulted in a total of 12 analyses.

  1. FL Fantasy 5, 3-match prize, draws with at least 1 jackpot winner
  2. FL Fantasy 5, 4-match prize, draws with at least 1 jackpot winner
  3. NJ Cash 5, 3-match prize
  4. NJ Cash 5, 4-match prize
  5. PA Cash 5, 3-match prize
  6. PA Cash 5, 4-match prize
  7. NC Cash 5, 3-match prize
  8. NC Cash 5, 4-match prize
  9. TX Cash 5, 3-match prize
  10. TX Cash 5, 4-match prize, no jackpot winner
  11. TX Cash 5, 4-match prize, at least one jackpot winner
  12. OR Megabucks, 4-match prize

My goal for handling the special considerations discussed above was to find a way to make a note of them in exactly one place and enable my analyses to reflect these idiosyncrasies without ever explicitly coding them in R. My solution was to build a PostgreSQL database that holds both the raw data and the information that my R code needed to pull the correct data for each analysis. The structure is summarized in the following diagram:

[System diagram: the raw data tables, the Games table, and the Analyses table in the PostgreSQL database]

The โ€œDataโ€ box in the upper portion of the diagram represents the tables holding the data I had scraped from the various lottery websites, with a table for each of the games. The โ€œGamesโ€ box is a table that holds the key information for each game: how many numbers are selected, the size of the matrix, the earliest drawing date that should be should be included in the analyses, and the name of the table that holds the data for that game. The โ€œAnalysesโ€ box represents a table that contains the necessary information about each analysis: the id for the game in the previous table, the prize that is being analyzed, and any filters that need to be included when querying the data tables. This structure enables R to retrieve the data it needs for a given analysis by using just the id from the analysis table, and after pulling the data it is ready to calculate features for each draw.

I kept a uniform feature structure across all games and analyses. In order to discuss these features generally, let's say we're drawing $k$ distinct numbers from the set $\{1, 2, \ldots, n\}$. The most basic features are the numbers selected, $x_1, x_2, \ldots, x_k$, where $x_1 < x_2 < \cdots < x_k$. I also derived various features from these numbers. In order to have a summary of the magnitudes of the numbers drawn, I calculated the sum $S = x_1 + x_2 + \cdots + x_k$. I also wanted to model the possibility that players choose numbers from a small range, so I included $R = x_k - x_1$, the difference between the largest and smallest numbers drawn. In order to test the effect of evenly spaced numbers, I used the standard deviation of the gaps between consecutive numbers, i.e. the standard deviation of $x_2 - x_1, x_3 - x_2, \ldots, x_k - x_{k-1}$. Finally, in order to capture aspects of the numbers that are related to players' preferences, superstitions, etc., I included flags $f_1, \ldots, f_n$, where $f_i = 1$ if $i$ was drawn and 0 otherwise.
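
As an illustration (not the author's original code), this feature set can be computed in R for a single draw as follows:

# Feature vector for one draw: the sorted numbers, their sum, the range,
# the standard deviation of consecutive gaps, and the n indicator flags.
make_features <- function(draw, n) {
  x <- sort(draw)                            # x_1 < x_2 < ... < x_k
  k <- length(x)
  flags <- as.integer(seq_len(n) %in% x)     # f_i = 1 if number i was drawn
  c(x,
    sum    = sum(x),
    range  = x[k] - x[1],
    gap_sd = sd(diff(x)),                    # gaps x_2 - x_1, ..., x_k - x_{k-1}
    setNames(flags, paste0("f", seq_len(n))))
}

# Example: one New Jersey Cash 5 draw (5 numbers from 1 to 43).
make_features(c(3, 14, 25, 36, 41), n = 43)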

The other potential source of complication in this project was the variety of machine learning models I wanted to apply to all of my analyses:
- regression
- elastic net
- k nearest neighbors
- random forests
- boosting applied to random forests with trees of depth up to 3
Fortunately, the R package "caret" ("classification and regression training") uses standardized functions to make it easy to tune and train a variety of models.

Once I had everything standardized, training the models was straightforward. I cut off the data at July 31, 2015 so that I would have a set of recent data that had been untouched by any training, validation, or model selection processes. I split the training/test sets in 75/25 proportions and used root mean squared error on the test set as the criterion for final model selection. I used 5-fold cross-validation to tune the models, and I generally used caret's default grids for the possible tuning parameters.
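
In caret, that whole workflow condenses to a few calls. A sketch, assuming the features and target sit in a data frame df with prize as the target column:

library(caret)

set.seed(42)
in_train <- createDataPartition(df$prize, p = 0.75, list = FALSE)
training <- df[in_train, ]
testing  <- df[-in_train, ]

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation

# The same call pattern covers every model family; only 'method' changes,
# e.g. "lm", "glmnet", "knn", "rf", "gbm".
fit <- train(prize ~ ., data = training, method = "rf", trControl = ctrl)

# Root mean squared error on the held-out 25% drives final model selection.
pred <- predict(fit, newdata = testing)
RMSE(pred, testing$prize)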

Now letโ€™s look at the results of the best models to emerge from this process.

Model Performance

Here are the summaries of the models' performance for draws from August 1, 2015 onward. (Remember, these draws were not used to train or select the models.) For the analyses that looked at the 3-match prizes, I rounded the predictions to the nearest $1.00 or $0.50 (depending on the granularity of the game's actual prizes) and reported the results in a confusion matrix. I also show the Mean Absolute Percent Error, or MAPE. The errors are consistently within about 5% of the actual values.

## fl_fantasy_5 prize3 
## MAPE: 0.0361
##        actual
## predict 7 8 8.5 9 9.5 10 10.5 11 11.5 12 12.5
##    7    1 0   0 0   0  0    0  0    0  0    0
##    8    0 1   1 0   0  0    0  0    0  0    0
##    8.5  0 2   5 1   0  0    0  0    0  0    0
##    9    0 0   1 3   4  0    0  0    0  0    0
##    9.5  0 0   0 3   2  1    0  0    0  0    0
##    10   0 0   1 1   3  5    3  0    0  0    0
##    10.5 0 0   0 0   0  5    4  0    0  0    0
##    11   0 0   0 0   0  2    4  2    4  0    0
##    11.5 0 0   0 0   0  0    0  1    0  5    0
##    12   0 0   0 0   0  0    0  0    0  1    1
## nj_cash_5 prize3 
## MAPE: 0.0526
##        actual
## predict 9 10 11 12 13 14 15 16 17 18 19 20
##      10 2  0  1  0  0  0  0  0  0  0  0  0
##      11 0  2  3  0  0  0  0  0  0  0  0  0
##      12 0  1  1  3  4  1  0  0  0  0  0  0
##      13 0  0  0  0  3  1  1  0  0  0  0  0
##      14 0  0  0  0  2  5  2  1  0  0  0  0
##      15 0  0  0  0  0  7  6  4  0  0  0  0
##      16 0  0  0  0  0  0  4  5  2  0  0  0
##      17 0  0  0  0  0  0  0  2  5  3  1  0
##      18 0  0  0  0  0  0  0  1  6  1  1  1
##      19 0  0  0  0  0  0  0  0  1  2  2  1
##      20 0  0  0  0  0  0  0  1  0  2  1  0
## pa_cash_5 prize3 
## MAPE: 0.0419
##        actual
## predict 6.5 7 7.5 8 8.5 9 9.5 10 10.5 11 11.5 12 12.5 13 13.5 14 14.5 15
##    8      1 1   3 1   0 0   0  0    0  0    0  0    0  0    0  0    0  0
##    8.5    0 0   0 0   2 1   0  0    0  0    0  0    0  0    0  0    0  0
##    9      0 0   0 1   0 0   0  1    0  0    0  0    0  0    0  0    0  0
##    9.5    0 0   0 0   0 1   1  0    0  0    0  0    0  0    0  0    0  0
##    10     0 0   0 0   0 1   5  6    0  0    0  0    0  0    0  0    0  0
##    10.5   0 0   0 0   0 0   1  2    8  3    0  0    0  0    0  0    0  0
##    11     0 0   0 0   0 0   0  0    0  1    0  0    0  0    0  0    0  0
##    11.5   0 0   0 0   0 0   0  0    1  1    6  3    0  0    0  0    0  0
##    12     0 0   0 0   0 0   0  0    0  0    2  4    2  0    0  0    0  0
##    12.5   0 0   0 0   0 0   0  0    0  0    1  1    1  3    0  0    0  0
##    13     0 0   0 0   0 0   0  0    0  0    0  2    1  1    1  1    0  0
##    13.5   0 0   0 0   0 0   0  0    0  0    0  0    0  2    0  1    3  0
##    14     0 0   0 0   0 0   0  0    0  0    0  0    0  0    2  2    0  0
##    14.5   0 0   0 0   0 0   0  0    0  0    0  0    0  2    0  0    0  0
##    15     0 0   0 0   0 0   0  0    0  0    0  0    0  0    1  0    2  1
##        actual
## predict 15.5 16 16.5 17
##    8       0  0    0  0
##    8.5     0  0    0  0
##    9       0  0    0  0
##    9.5     0  0    0  0
##    10      0  0    0  0
##    10.5    0  0    0  0
##    11      0  0    0  0
##    11.5    0  0    0  0
##    12      0  0    0  0
##    12.5    0  0    0  0
##    13      0  0    0  0
##    13.5    0  0    0  0
##    14      0  0    0  0
##    14.5    1  1    0  0
##    15      0  1    1  1
## nc_cash_5 prize3 
## MAPE: 0.0235
##        actual
## predict  3  4  5  6  7
##       3  2  1  0  0  0
##       4  1 24  2  0  0
##       5  0  3 49  1  0
##       6  0  0  0  3  1
## tx_cash_5 prize3 
## MAPE: 0.0515
##        actual
## predict  7  8  9 10 11 12 13
##      7   2  0  0  0  0  0  0
##      8   0  1  1  0  0  0  0
##      9   0  4  8  3  0  0  0
##      10  0  0  4  9  3  0  0
##      11  0  0  1  9 17  9  2
##      12  0  0  0  0  2  3  1

For the 4-match analyses, I made scatterplots rather than confusion matrices. The percent errors tend to be in the 10% to 15% range. I believe the lower accuracy is due to the fact that there are fewer winners in these cases so the prize amounts are more influenced by random variation.

## fl_fantasy_5 prize4 
## MAPE: 0.0865

[Scatterplot: predicted vs. actual 4-match prizes, FL Fantasy 5]

## nj_cash_5 prize4 
## MAPE: 0.1264

[Scatterplot: predicted vs. actual 4-match prizes, NJ Cash 5]

## pa_cash_5 prize4 
## MAPE: 0.1237

[Scatterplot: predicted vs. actual 4-match prizes, PA Cash 5]

## nc_cash_5 prize4 
## MAPE: 0.1487

[Scatterplot: predicted vs. actual 4-match prizes, NC Cash 5]

## tx_cash_5 prize4 
## MAPE: 0.1393

[Scatterplot: predicted vs. actual 4-match prizes, TX Cash 5]

## tx_cash_5 prize4 
## MAPE: 0.1596

[Scatterplot: predicted vs. actual 4-match prizes, TX Cash 5]

## or_megabucks prize4 
## MAPE: 0.0605

[Scatterplot: predicted vs. actual 4-match prizes, OR Megabucks]

Applying the Models

These are accurate models, but they have one shortcoming: THEY DON'T ANSWER OUR QUESTIONS!!! They simply answer the question, "For a given combination that has been drawn, what is the predicted prize amount that the lottery will pay to the winners?" One could argue that the models say something about which combinations are most popular, because low prize payouts correspond to popular combinations. But they aren't much help in understanding our second question: how much of an impact is there on a given selection's expected prize? For example, if we want to know the expected 3-match prize for a given selection $s$, we have to apply the relevant model to every possible draw that matches $s$ in exactly 3 places, and there are several thousand such draws. It is possible that when we average over all winning combinations, there is not much difference in the expected prize amount. So we need to do multiple applications of the models, and we need to do so efficiently.

More formally, given a model $M$ for estimating the expected prize for $m$ matches, the following expression gives the expected prize amount for a given selection $s$:

$$E(s) = \frac{1}{|D_m(s)|} \sum_{d \in D_m(s)} M(d)$$

where

$$D_m(s) = \{\text{draws } d : |d \cap s| = m\}$$

In general, $|D_m(s)|$ is large: $|D_m(s)| = \binom{k}{m}\binom{n-k}{k-m}$ when the game selects $k$ numbers from $\{1, \ldots, n\}$. For example, in the 3-match analysis of New Jersey Cash 5, $|D_3(s)| = \binom{5}{3}\binom{38}{2} = 7{,}030$. Since there are $\binom{n}{k}$ selections $s$ to evaluate (962,598 in the New Jersey example), we need to make the model calculations as efficient as possible. One tactic is to precompute the model on all $\binom{n}{k}$ combinations and simply look up these values when evaluating $E(s)$. And the list of precomputed values will be most efficient if it is in lexicographic order, because then there is a fast algorithm for finding the position of a given combination on the list using just its elements.
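
For concreteness, here is one standard version of that ranking algorithm in R (a sketch; the original implementation may differ):

# 1-based position of a sorted k-subset x of 1..n in the lexicographic
# listing of all such subsets: count the combinations that come before it.
combo_rank <- function(x, n) {
  k <- length(x)
  rank <- 1
  prev <- 0
  for (i in seq_len(k)) {
    lo <- prev + 1
    if (x[i] > lo) {
      for (j in lo:(x[i] - 1)) {
        # combinations sharing the prefix x[1..i-1] but with j in position i
        rank <- rank + choose(n - j, k - i)
      }
    }
    prev <- x[i]
  }
  rank
}

combo_rank(1:5, 43)     # 1: the first combination in the list
combo_rank(39:43, 43)   # 962598: the last of the choose(43, 5) combinations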

Unfortunately, I found that even with these efficiencies, it takes about 0.8 seconds to calculate $E(s)$ for a single selection $s$. At this rate it would take 8 to 9 days to evaluate all the expected 3-match prizes for New Jersey Cash 5, and that's just one of my twelve analyses! So I needed to find a faster implementation.

I was willing to sacrifice some accuracy in order to speed up the calculations of $E(s)$, and it occurred to me that using a linear function might be helpful. If $M$ has the form

$$M(d) = \beta_0 + \sum_j \beta_j g_j(d)$$

then

$$E(s) = \beta_0 + \sum_j \beta_j \bar{g}_j(s)$$

where $\bar{g}_j(s)$ is the average of $g_j(d)$ over all $d$ in $D_m(s)$. This will not necessarily speed up the calculations, because we still need to average over all of $D_m(s)$. But it does help when we do a regression on the flags $f_1, \ldots, f_n$. If

$$M(d) = \beta_0 + \sum_{i=1}^{n} \beta_i f_i(d)$$

then

$$E(s) = \beta_0 + \sum_{i=1}^{n} \beta_i \bar{f}_i(s)$$

where

$$\bar{f}_i(s) = \begin{cases} m/k & \text{if } i \in s \\ (k-m)/(n-k) & \text{if } i \notin s \end{cases}$$

which can be evaluated very quickly: all 12 of my analyses ran in about one hour. (See the Appendix for a proof that $\bar{f}_i(s)$ takes these two values.)
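
Under these assumptions, evaluating $E(s)$ reduces to a handful of arithmetic operations. A minimal R sketch, assuming beta0 and beta hold the fitted intercept and flag coefficients:

# Expected m-match prize for selection s in a choose-k-of-n game, using the
# two-value average flags derived above.
expected_prize <- function(s, beta0, beta, n, m) {
  k <- length(s)
  fbar <- rep((k - m) / (n - k), n)   # average of f_i over D_m(s) when i is not in s
  fbar[s] <- m / k                    # ... and when i is in s
  beta0 + sum(beta * fbar)
}

# e.g. for the most popular New Jersey Cash 5 combination found below:
# expected_prize(c(3, 5, 7, 8, 12), beta0, beta, n = 43, m = 3)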

So we are finally in a position to find the selections for each game that have the 10 lowest expected prize amounts. Here are the results for each 3-match analysis.

FL Fantasy 5:

##       n1 n2 n3 n4 n5 avgprize
##  [1,]  3  5  7  9 11 7.943926
##  [2,]  5  7  9 10 11 7.955978
##  [3,]  3  7  9 10 11 7.962647
##  [4,]  5  7  8  9 11 7.963678
##  [5,]  3  7  8  9 11 7.970347
##  [6,]  5  7  9 11 12 7.973561
##  [7,]  3  7  9 11 12 7.980230
##  [8,]  7  8  9 10 11 7.982399
##  [9,]  7  9 10 11 12 7.992283
## [10,]  7  8  9 11 12 7.999983

New Jersey Cash 5:

##       n1 n2 n3 n4 n5 avgprize
##  [1,]  3  5  7  8 12 10.96077
##  [2,]  3  5  7  9 12 10.98105
##  [3,]  3  5  7  8  9 10.99249
##  [4,]  5  7  8  9 12 10.99942
##  [5,]  3  7  8  9 12 11.02162
##  [6,]  3  5  7 11 12 11.02523
##  [7,]  3  5  7  8 11 11.03667
##  [8,]  5  7  8 11 12 11.04359
##  [9,]  3  5  7  9 11 11.05695
## [10,]  5  7  9 11 12 11.06387

Pennsylvania Cash 5:

##       n1 n2 n3 n4 n5 avgprize
##  [1,]  5  7  9 11 12 7.803290
##  [2,]  3  5  7 11 12 7.818146
##  [3,]  5  7  8 11 12 7.830267
##  [4,]  5  7 10 11 12 7.862501
##  [5,]  3  5  7  9 11 7.866244
##  [6,]  5  7  8  9 11 7.878364
##  [7,]  3  5  7  8 11 7.893220
##  [8,]  5  7  9 10 11 7.910598
##  [9,]  3  5  7 10 11 7.925455
## [10,]  3  5  7  9 12 7.937129

North Carolina Cash 5:

##       n1 n2 n3 n4 n5 avgprize
##  [1,]  5  7  8  9 11 3.576551
##  [2,]  3  5  7  9 11 3.589897
##  [3,]  3  7  8  9 11 3.600528
##  [4,]  3  5  7  8 11 3.611937
##  [5,]  5  7  9 11 12 3.612730
##  [6,]  7  8  9 11 12 3.623360
##  [7,]  5  7  8 11 12 3.634770
##  [8,]  3  7  9 11 12 3.636706
##  [9,]  3  5  8  9 11 3.638946
## [10,]  5  7  9 10 11 3.641547

Texas Cash 5:

##       n1 n2 n3 n4 n5 avgprize
##  [1,]  3  5  7  9 11 7.859660
##  [2,]  5  7  8  9 11 7.879023
##  [3,]  5  7  9 10 11 7.889834
##  [4,]  3  7  8  9 11 7.904373
##  [5,]  3  7  9 10 11 7.915184
##  [6,]  3  5  7  8  9 7.920539
##  [7,]  3  5  7  8 11 7.926993
##  [8,]  5  7  9 11 12 7.929643
##  [9,]  3  5  7  9 10 7.931350
## [10,]  7  8  9 10 11 7.934547

The level of agreement across the different data sets is truly remarkable. The numbers are small, none greater than 12, but 2, 4, and 6 do not appear on any of the lists. Meanwhile, 7 and 11 appear in almost every combination.

There is still the question of how much players are disadvantaged when they choose these popular combinations. To quantify this, we can look at the smallest expected prizes (already shown above), the average expected prize, and the largest expected prize. Here are the results for the 3-match analyses.

##           Game Minimum Average Maximum
## 1 FL Fantasy 5    7.94    9.97   12.27
## 2    NJ Cash 5   10.96   15.08   21.55
## 3    PA Cash 5    7.80   11.55   16.55
## 4    NC Cash 5    3.58    4.66    5.96
## 5    TX Cash 5    7.86   10.14   12.67

We should also scale these numbers to the probability of winning a 3-match prize. This also allows for an apples-to-apples comparison across games and puts the differences on the same scale as the expected prize payout, typically about $0.50.

##           Game Minimum Average Maximum
## 1 FL Fantasy 5  0.0979  0.1230  0.1513
## 2    NJ Cash 5  0.0800  0.1101  0.1574
## 3    PA Cash 5  0.0570  0.0844  0.1209
## 4    NC Cash 5  0.0349  0.0454  0.0581
## 5    TX Cash 5  0.0894  0.1154  0.1442
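
As a quick consistency check, the New Jersey row of this table can be reproduced from the unscaled one by multiplying each figure by the 3-match win probability:

p3 <- choose(5, 3) * choose(38, 2) / choose(43, 5)   # ~0.0073 for NJ Cash 5
round(c(10.96, 15.08, 21.55) * p3, 4)                # 0.0800 0.1101 0.1574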

So we can see that the difference in expected payouts between the most and least popular selections is often around 10% of the total expected prize payout, and this is not even considering the 4-match prizes, or the fact that players who hit the jackpot with a popular combination have a high likelihood of having to share that prize. Whether it was a conscious design choice or not, it would seem that parimutuel lotteries give greater reinforcement to their casual players, i.e. the ones who don't select their own numbers.

Take-aways

Aside from the insights into the lottery, I think there are a few data science lessons to be learned here:
- Sometimes data that directly addresses your questions is not available, so you need to look for data that speaks to a related question.
- If you do end up using data in an indirect manner, keep the original question in mind and remember to transform your results back to the original context.
- If your most accurate models canโ€™t be implemented in a way that answers your question, try a less accurate model that can.

Appendix

To prove that $\bar{f}_i(s)$ takes the values claimed above, we need to show two things:
- If $i \in s$, then $m/k$ of the sets in $D_m(s)$ contain $i$.
- If $i \notin s$, then $(k-m)/(n-k)$ of the sets in $D_m(s)$ contain $i$.

Any set in $D_m(s)$ consists of an $m$-element subset of $s$ and a $(k-m)$-element subset of the complement of $s$. In the case where $i \in s$, we only need to find the fraction of $m$-element subsets of $s$ that contain $i$. There are $\binom{k-1}{m-1}$ such sets, because that is the number of ways we can choose the elements other than $i$. So the fraction that contain $i$ is

$$\frac{\binom{k-1}{m-1}}{\binom{k}{m}} = \frac{m}{k}$$

Similarly, in the case where $i \notin s$, we only need to find the fraction of $(k-m)$-element subsets of the complement of $s$ that contain $i$. There are $\binom{n-k-1}{k-m-1}$ such sets, because that is the number of ways we can choose the elements other than $i$. So the fraction that contain $i$ is

$$\frac{\binom{n-k-1}{k-m-1}}{\binom{n-k}{k-m}} = \frac{k-m}{n-k}$$
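
For readers who prefer a computational check to a combinatorial argument, these fractions are easy to verify by brute force in R for small parameters, say n = 10, k = 4, m = 2, s = {1, 2, 3, 4}:

n <- 10; k <- 4; m <- 2
s <- 1:4

draws <- combn(n, k)                                   # all k-subsets of 1..n
match_counts <- colSums(matrix(draws %in% s, nrow = k))
D <- draws[, match_counts == m]                        # the sets in D_m(s)

mean(apply(D, 2, function(d) 1 %in% d))   # 0.5,      i.e. m/k
mean(apply(D, 2, function(d) 5 %in% d))   # 0.333..., i.e. (k-m)/(n-k)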

About Author

Stephen Penrice

After starting his career as a Ph.D. in pure mathematics, Stephen has worked continuously to grow his technical proficiency in order to take on more and more challenges with an applied focus. His latest work in the finance...

