Scraping Lottery Data

Posted on Dec 7, 2015

Contributed by Stephen Penrice. He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp from Sept 23 to Dec 18, 2015. This post is based on his third class project (due in the 6th week of the program).

In a lottery game, the numbers that the lottery selects are random, but the numbers that players choose to play are not. To the best of my knowledge, data on player selections are not publicly available. However, lotteries do publish data on the numbers they draw and the amounts of the prizes they award. In games where prizes are parimutuel, that is, where a certain percentage of sales is divided equally among the winners, one can infer the popularity of the numbers drawn from the prize amounts: popular numbers result in smaller prizes because there are more winners splitting the prize money. The primary component of this project is scraping a variety of lottery websites, using a variety of techniques, to gather data for an analysis that relates prize amounts to the numbers drawn. Ultimately, I would like to build machine learning models that predict prize amounts as a function of the numbers drawn. Here, however, I simply present some visualizations and do some hypothesis tests to investigate whether there is a relationship between prize amounts and the sum of the numbers drawn.
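To make the inference concrete, here is a minimal sketch of the parimutuel arithmetic. The sales figure, pool share, and winner counts are hypothetical numbers chosen for illustration, not data from any lottery, and the function name is my own.

```python
def parimutuel_prize(sales, pool_share, winners):
    """Prize per winner when a fixed share of sales is split equally.

    All inputs used below are hypothetical illustration values.
    """
    if winners == 0:
        return 0.0  # no winners; a real game may roll the pool over instead
    return sales * pool_share / winners

# Same sales and pool share, but the drawn numbers differ in popularity.
unpopular = parimutuel_prize(sales=1_000_000, pool_share=0.25, winners=2)
popular = parimutuel_prize(sales=1_000_000, pool_share=0.25, winners=10)
print(unpopular, popular)
```

With everything else held fixed, five times as many winners means a prize one fifth the size, which is why the prize amount carries information about how many players picked the drawn numbers.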

Scraping Strategies

In this project a single observation is a lottery drawing, with the data comprising a date, the numbers drawn by the lottery, the number of winners at each prize level, and the prize amount at each level. In order to get all of these data components, one has to visit a separate page for each drawing. Beautiful Soup can easily scrape each of these pages, so the primary challenge was visiting each page within a site in an automated fashion.
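As a rough sketch of what one observation looks like once it is flattened into a CSV row, the record below mirrors the components listed above. The class name, field names, and the sample values are assumptions made for illustration only.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class Drawing:
    draw_date: date
    numbers: List[int]      # numbers drawn by the lottery
    winners: List[int]      # number of winners at each prize level
    prizes: List[float]     # prize amount at each prize level

    def to_row(self):
        # One flat row: date first, then numbers, winners, and prizes
        return [self.draw_date.isoformat()] + self.numbers + self.winners + self.prizes

# Hypothetical values, not scraped data
example = Drawing(date(2015, 10, 13), [3, 11, 20, 25, 33],
                  [2, 310, 9500], [100000.0, 105.5, 10.0])

buffer = io.StringIO()
csv.writer(buffer).writerow(example.to_row())
print(buffer.getvalue().strip())
```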

Since I was accessing several different websites, I had to employ several different strategies. In increasing order of complexity they were: encoding dates into URLs, using Selenium to click a link, and using Selenium to fill in a form.

Encoding Dates into a URL

Florida’s Fantasy 5 game is a typical example of a website well suited to this strategy. A typical results page looks like this.

While it is possible to access individual pages using menus, visiting one of these pages reveals that the URLs have a particular format that encodes the game name and the date of the drawing. For example,

http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=FANTASY5&singleDateIn=10%2F13%2F2015&fromDateIn=&toDateIn=&n1In=&n2In=&n3In=&n4In=&n5In=&submitForm=Submit

is the URL for the page that displays the data for the Fantasy 5 drawing that occurred on October 13, 2015, the key portion of the address being the string

10%2F13%2F2015
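The `%2F` in that string is simply a percent-encoded `/`, so the date is `10/13/2015` in month/day/year order. As a small sketch (the helper name `encode_date` is my own), the standard library can produce the same string:

```python
from datetime import date
from urllib.parse import quote

def encode_date(d):
    # safe='' forces '/' to be escaped as %2F (by default quote leaves '/' alone)
    return quote(d.strftime('%m/%d/%Y'), safe='')

print(encode_date(date(2015, 10, 13)))  # 10%2F13%2F2015
```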

The following code uses the datetime library to create a date object that it uses to iterate through a specified range of dates, creating a URL string for each one that can be used to access a page which is then processed using Beautiful Soup.

from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

def encodeDate(dateob):
    answer = dateob.strftime('%m') + '%2F'
    answer = answer + dateob.strftime('%d') + '%2F'
    answer = answer + dateob.strftime('%Y') + '&submitForm=Submit'
    return answer

fl5 = open('fl_fant_5.csv','w')
fl5.write(','.join(['drawdate','n1','n2','n3','n4','n5','winners5','winners4','winners3','prize5','prize4','prize3'])+'\n')
url_stem = 'http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=FANTASY5&singleDateIn='
start_date = date(2007,1,1)
end_date = date(2015,10,26)
current = start_date
while current < end_date:
    url = url_stem + encodeDate(current)
    page = requests.get(url).text
    bsPage = BeautifulSoup(page)
    numbers = bsPage.find_all("div",class_="winningNumbers")
    temp = numbers[0].get_text()
    draws = re.split('[-\n]',temp)
    draws = draws[1:6]
    winners = bsPage.find_all("td",class_="column2")
    winners = [tag.get_text().replace(',','') for tag in winners[:-1]]
    prizes = bsPage.find_all("td", class_="column3 columnLast")
    prizes = [tag.get_text().replace('$','').replace(',','') for tag in prizes[:-1]]
    fl5.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    print(current.strftime('%Y-%m-%d'))  # progress indicator; this form runs on Python 2 and 3
    current = current + timedelta(1)
    
fl5.close()
print('done')
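The slice `draws[1:6]` in the code above is easier to follow once you see what `re.split` returns. The sample string below is an assumption about the text inside the `winningNumbers` div (a leading newline, then numbers separated by hyphens); the real page text may differ.

```python
import re

# Hypothetical text, standing in for numbers[0].get_text()
temp = '\n03-11-20-25-33\n'

parts = re.split('[-\n]', temp)
# The leading newline produces an empty first element,
# so the five drawn numbers sit at positions 1 through 5.
draws = parts[1:6]
print(parts)
print(draws)
```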

The code for Florida’s Lucky Money game is very similar. The only meaningful difference is that Lucky Money draws happen on Tuesdays and Fridays only, so the code checks the day of the week before building the URL in order to avoid getting an error caused by trying to access a non-existent page.

from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

def encodeDate(dateob):
    answer = dateob.strftime('%m') + '%2F'
    answer = answer + dateob.strftime('%d') + '%2F'
    answer = answer + dateob.strftime('%Y') + '&submitForm=Submit'
    return answer

fllm = open('fl_lucky_money.csv','w')
fllm.write(','.join(['drawdate','n1','n2','n3','n4','luckyball','win41','win40','win31','win30','win21','win11','win20','prize41','prize40','prize31','prize30','prize21','prize11','prize20'])+'\n')
url_stem = 'http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=LUCKYMONEY&singleDateIn='
start_date = date(2014,7,4)
end_date = date(2015,10,24)
current = start_date
while current < end_date:
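    # Lucky Money is drawn only on Tuesdays ('2') and Fridays ('5'); skip other days.
    # Using "if ... continue" re-checks end_date before every request.
    if current.strftime('%w') not in ['2','5']:
        current = current + timedelta(1)
        continue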
    url = url_stem + encodeDate(current)
    page = requests.get(url).text
    bsPage = BeautifulSoup(page)
    numbers = bsPage.find_all("div",class_="winningNumbers")
    temp = numbers[0].get_text()
    draws = re.split('[-\n]',temp)
    draws = draws[1:6]
    winners = bsPage.find_all("td",class_="column2")
    winners = [tag.get_text().replace(',','') for tag in winners[:-1]]
    prizes = bsPage.find_all("td", class_="column3 columnLast")
    prizes = [tag.get_text().replace('$','').replace(',','') for tag in prizes[:-1]]
    fllm.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    print(current.strftime('%Y-%m-%d'))  # progress indicator; this form runs on Python 2 and 3
    current = current + timedelta(1)

fllm.close()
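The day-of-week filter can also be pulled out into a small generator, which keeps the scraping loop free of date bookkeeping. This is a sketch rather than the code used for the project; the name `draw_dates` is my own. Note that it uses `date.weekday()`, which numbers Monday as 0 (so Tuesday is 1 and Friday is 4), whereas `strftime('%w')` numbers Sunday as 0.

```python
from datetime import date, timedelta

def draw_dates(start, end, weekdays):
    """Yield each date in [start, end) whose weekday() is in `weekdays`."""
    current = start
    while current < end:
        if current.weekday() in weekdays:
            yield current
        current += timedelta(days=1)

# Tuesdays (1) and Fridays (4) in a short sample range
lucky_money_days = list(draw_dates(date(2014, 7, 4), date(2014, 7, 12), {1, 4}))
print(lucky_money_days)
```

Because the end date is checked on every step, the loop cannot step past it while searching for the next draw day.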

North Carolina’s Cash 5 game requires the same strategy. The structure of the code is the same as the Fantasy 5 code, with the differences coming from the differences in the page structures and tags. A sample data page can be found here.

from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

ncc5 = open('nc_cash_5.csv','w')
ncc5.write(','.join(['drawdate','n1','n2','n3','n4','n5','winners5','winners4','winners3','prize5','prize4','prize3'])+'\n')
url_stem = 'http://www.nc-educationlottery.org/cash5_payout.aspx?drawDate='
start_date = date(2006,10,27)
end_date = date(2015,10,27)
current = start_date
p = re.compile('[,$]')
while current < end_date:
    print(current.strftime('%Y-%m-%d'))  # progress indicator; this form runs on Python 2 and 3
    url = url_stem + current.strftime('%m/%d/%Y')
    page = requests.get(url).text
    bsPage = BeautifulSoup(page)
    
    draws = []
    draws.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Num1")[0].get_text()))
    draws.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Num2")[0].get_text()))
    draws.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Num3")[0].get_text()))
    draws.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Num4")[0].get_text()))
    draws.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Num5")[0].get_text()))
    
    winners = []
    winners.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match5")[0].get_text()))
    winners.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match4")[0].get_text()))
    winners.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match3")[0].get_text())) 
    
    prizes = []
    prizes.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match5Prize")[0].get_text()))
    prizes.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match4Prize")[0].get_text()))
    prizes.append(p.sub('',bsPage.find_all("span",id="ctl00_MainContent_lblCash5Match3Prize")[0].get_text()))
    if prizes[0] == 'Rollover':
        prizes[0] = '0'
    ncc5.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    current = current + timedelta(1)
    
ncc5.close()
print('finished')

Using Selenium to Fill in a Form

Past results from the Oregon Lottery website can be accessed only by using a form on the results page. Once again, Selenium is up to the challenge. As in the Florida Lucky Money case, the code iterates through a range of dates and checks for a valid day of the week (Monday, Wednesday, or Saturday). Here, however, Selenium enters the date into the form in two places, “Start Date” and “End Date.” (Using the same date in both parts of the form simplifies both the iteration and the Beautiful Soup processing.) Then Selenium clicks the submit button.

While testing this I noticed that the code sometimes repeats results from a previous selection, most likely because the new page fails to load fast enough. The code deals with this issue in two ways. First, the sleep function from the time module pauses the code for 30 seconds, greatly reducing the likelihood of the problem occurring. As an extra safety measure, the code also checks that the date on the page matches the one entered into the form before writing the results to a file. If the dates don’t match, the desired date, i.e. the one Selenium entered on the form, is written to an error log.

from selenium import webdriver
from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re
from time import sleep

ormb = open('or_megabucks.csv','a')
ormb_err = open('or_megabucks_errors.csv','w')
ormb.write(','.join(['drawdate','n1','n2','n3','n4','n5','n6','winners6','winners5','winners4','prize6','prize5','prize4'])+'\n')
start_date = date(2012,10,30)
end_date = date(2015,10,29)
current = start_date

driver = webdriver.Firefox()
driver.get('http://www.oregonlottery.org/games/draw-games/megabucks/past-results')

while current < end_date:
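    # Megabucks is drawn on Mondays ('1'), Wednesdays ('3'), and Saturdays ('6'); skip other days.
    # Using "if ... continue" re-checks end_date before each form submission.
    if current.strftime('%w') not in ['1','3','6']:
        current = current + timedelta(1)
        continue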
    driver.find_element_by_id("FromDate").clear()
    driver.find_element_by_id("ToDate").clear()
    driver.find_element_by_id("FromDate").send_keys(current.strftime('%m/%d/%Y'))
    driver.find_element_by_id("ToDate").send_keys(current.strftime('%m/%d/%Y'))
    driver.find_element_by_css_selector(".viewResultsButton").click()
    sleep(30)
    soup = BeautifulSoup(driver.page_source)
    test1 = soup.find_all("td")
    numbers = [test1[i].get_text() for i in range(2,8)] 
    test2 = soup.find_all("strong")
    winners = [test2[1].get_text().replace(',','')]
    prizes = [test2[0].get_text().replace('$','').replace(',','')]
    for i in range(0,2):
        winners.append(test2[4*i+3].get_text().replace(',',''))
        prizes.append(test2[4*i+2].get_text().replace('$','').replace(',','')) 
    testdate = test1[0].get_text().split('/')
    testdate = date(int(testdate[2]),int(testdate[0]),int(testdate[1]))
    if current.strftime('%Y-%m-%d') == testdate.strftime('%Y-%m-%d'):
        ormb.write(','.join([testdate.strftime('%Y-%m-%d')] + numbers + winners + prizes)+'\n')
    else:
        ormb_err.write(current.strftime('%Y-%m-%d') + '\n')
    
    current = current + timedelta(1)

ormb.close()
ormb_err.close()
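A fixed 30-second pause is simple but slow. One alternative is to poll until the date shown on the page matches the requested date, giving up after a timeout. The sketch below is pure Python with a stand-in for the browser, so the helper names (`wait_until`, `page_shows_date`) and the fake page are assumptions for illustration; in real use the condition would read the date cell from the Selenium driver, and Selenium also ships its own `WebDriverWait` class for this kind of wait.

```python
import time
from datetime import date

def wait_until(condition, timeout=30.0, interval=0.5):
    """Call `condition` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

def page_shows_date(cell_text, wanted):
    """True if 'MM/DD/YYYY' text taken from the page matches the requested date."""
    try:
        month, day, year = (int(part) for part in cell_text.strip().split('/'))
    except ValueError:
        return False  # text was not a date in the expected format
    return (year, month, day) == (wanted.year, wanted.month, wanted.day)

# Stand-in for the browser: the old result stays visible for two polls.
fake_page = iter(['10/27/2015', '10/27/2015', '10/28/2015'])
ok = wait_until(lambda: page_shows_date(next(fake_page), date(2015, 10, 28)),
                timeout=5.0, interval=0.01)
print(ok)
```

If `wait_until` returns False, the requested date can be written to the error log exactly as the original code does.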

Visualizations

Any number of visualizations of the scraped data are possible, but here let’s focus on a type of plot that not only suggests an association between the numbers drawn and the prize amounts but also motivates a statistical test to be performed later.

The summary statistic that we will use is simply the sum of the numbers drawn. The plots will show histograms of this sum for two sets of drawings: those where the prize amounts were less than the 25th percentile for all draws (labelled “Small Prizes”) and those where the prize amounts were greater than the 75th percentile for all draws (labelled “Large Prizes”).
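The grouping described above can be sketched with the standard library alone. The drawings below are made-up values used only to show the mechanics, so they say nothing about any real game; the variable names are my own, and I assume the top prize is the amount being compared.

```python
from statistics import quantiles

# Hypothetical drawings: (numbers drawn, top prize amount). Not real lottery data.
drawings = [
    ([1, 5, 9, 12, 20], 21000.0),
    ([3, 7, 11, 19, 25], 35000.0),
    ([14, 22, 30, 33, 36], 98000.0),
    ([2, 4, 6, 8, 10], 15000.0),
    ([17, 24, 29, 31, 35], 120000.0),
    ([10, 15, 21, 27, 32], 60000.0),
    ([8, 13, 18, 26, 34], 52000.0),
    ([6, 16, 23, 28, 36], 75000.0),
]

prizes = [prize for _, prize in drawings]
q1, _median, q3 = quantiles(prizes, n=4)  # 25th, 50th, and 75th percentile cut points

# Sum of the drawn numbers for each group of drawings
small_prize_sums = [sum(nums) for nums, prize in drawings if prize < q1]
large_prize_sums = [sum(nums) for nums, prize in drawings if prize > q3]
print(small_prize_sums, large_prize_sums)
```

Each list would then feed one histogram, one for “Small Prizes” and one for “Large Prizes.”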

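North Carolina Cash 5

[Figure: histograms of the sum of the numbers drawn, Small Prizes vs. Large Prizes, North Carolina Cash 5]

Oregon Megabucks

[Figure: histograms of the sum of the numbers drawn, Small Prizes vs. Large Prizes, Oregon Megabucks]

Tennessee Cash

[Figure: histograms of the sum of the numbers drawn, Small Prizes vs. Large Prizes, Tennessee Cash]

Florida Lucky Money

[Figure: histograms of the sum of the numbers drawn, Small Prizes vs. Large Prizes, Florida Lucky Money]

Florida Fantasy 5

[Figure: histograms of the sum of the numbers drawn, Small Prizes vs. Large Prizes, Florida Fantasy 5]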

Conclusion

The visualizations presented here provide multiple examples of parimutuel lotteries where there seems to be a relationship between the numbers drawn and the prize amounts. Therefore the project of predicting prize amounts from the drawn numbers is likely to produce useful results, and the sum of the drawn numbers appears to be a good starting point.

About Author

Stephen Penrice

After starting his career as a Ph.D. in pure mathematics, Stephen has worked continuously to grow his technical proficiency in order to take on more and more challenges with an applied focus. His latest work in the finance...