Scraping Lottery Data

Stephen Penrice

Posted on Dec 7, 2015

Contributed by Stephen Penrice. He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp from Sept 23 to Dec 18, 2015. This post is based on his third class project (due in the 6th week of the program).

In a lottery game, the numbers that the lottery selects are random, but the numbers that players choose to play are not. To the best of my knowledge, data on player selections are not publicly available. However, lotteries do publish data on the numbers they draw and the amounts of the prizes they award. In games where prizes are parimutuel, that is, where a fixed percentage of sales is divided equally among the winners, one can infer the popularity of the numbers drawn from the prize amounts: popular numbers result in smaller prizes because more winners split the prize money. The primary component of this project is scraping several lottery websites, using a variety of techniques, to gather data for an analysis that relates prize amounts to the numbers drawn. Ultimately, I would like to build machine learning models that predict prize amounts as a function of the numbers drawn. Here, however, I simply present some visualizations and do some hypothesis tests to investigate whether there is a relationship between prize amounts and the sum of the numbers drawn.
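
To make the parimutuel arithmetic concrete, here is a minimal sketch. The function name `parimutuel_prize`, the sales figure, the payout share, and the winner counts are all made up for illustration; they are not taken from any lottery's rules or data.

```python
def parimutuel_prize(sales, payout_share, winners):
    """Return the prize per winner when a fixed share of sales is split equally."""
    if winners == 0:
        return None  # no winners at this level, so nothing is split
    pool = sales * payout_share
    return pool / winners

# Hypothetical drawing: $200,000 in sales, 25% of sales goes to top-prize winners.
unpopular = parimutuel_prize(200000, 0.25, 1)  # one winner takes the whole pool
popular = parimutuel_prize(200000, 0.25, 4)    # four winners share the same pool
print(unpopular, popular)
```

The same pool divided among more winners gives a smaller prize, which is why a small prize hints that the numbers drawn were popular with players.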

Scraping Strategies

In this project a single observation is a lottery drawing, with the data comprising a date, the numbers drawn by the lottery, the number of winners at each prize level, and the prize amount at each level. In order to get all of these data components, one has to visit a separate page for each drawing. Beautiful Soup can easily scrape each of these pages, so the primary challenge was visiting each page within a site in an automated fashion.

Since I was accessing several different websites, I had to employ several different strategies. In increasing order of complexity they were: encoding dates into URLs, using Selenium to click a link, and using Selenium to fill in a form.

Encoding Dates into a URL

Florida’s Fantasy 5 game is a typical example of a website well suited to this strategy. Each results page shows the numbers drawn, the number of winners, and the prize amounts for a single drawing.

While it is possible to access individual pages using menus, visiting one of these pages reveals that the URLs have a particular format that encodes the game name and the date of the drawing. For example,

http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=FANTASY5&singleDateIn=10%2F13%2F2015&fromDateIn=&toDateIn=&n1In=&n2In=&n3In=&n4In=&n5In=&submitForm=Submit

is the URL for the page that displays the data for the Fantasy 5 drawing that occurred on October 13, 2015, the key portion of the address being the string

10%2F13%2F2015

The following code uses the datetime library to iterate through a specified range of dates. For each date it builds a URL string, fetches the corresponding page, and processes the page with Beautiful Soup.

from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

def encodeDate(dateob):
    # Build mm%2Fdd%2Fyyyy ('%2F' is the URL encoding of '/')
    answer = dateob.strftime('%m') + '%2F'
    answer = answer + dateob.strftime('%d') + '%2F'
    answer = answer + dateob.strftime('%Y') + '&submitForm=Submit'
    return answer

fl5 = open('fl_fant_5.csv','w')
fl5.write(','.join(['drawdate','n1','n2','n3','n4','n5','winners5','winners4','winners3','prize5','prize4','prize3'])+'\n')
url_stem = 'http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=FANTASY5&singleDateIn='
start_date = date(2007,1,1)
end_date = date(2015,10,26)
current = start_date
while current < end_date:
    url = url_stem + encodeDate(current)
    page = requests.get(url).text
    bsPage = BeautifulSoup(page, 'html.parser')
    # The numbers drawn are separated by hyphens and newlines
    numbers = bsPage.find_all("div",class_="winningNumbers")
    temp = numbers[0].get_text()
    draws = re.split('[-\n]',temp)
    draws = draws[1:6]
    # Winner counts and prize amounts; the last table row is dropped
    winners = bsPage.find_all("td",class_="column2")
    winners = [tag.get_text().replace(',','') for tag in winners[:-1]]
    prizes = bsPage.find_all("td", class_="column3 columnLast")
    prizes = [tag.get_text().replace('$','').replace(',','') for tag in prizes[:-1]]
    fl5.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    print(current.strftime('%Y-%m-%d'))
    current = current + timedelta(1)

fl5.close()
print('done')
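
The `%2F` separators written by hand above can also be produced by the standard library. The following is a minimal sketch, assuming Python 3 and `urllib.parse.quote`. The helper names `encode_date` and `daterange` are introduced here for illustration and are not part of the original script.

```python
from datetime import date, timedelta
from urllib.parse import quote

def encode_date(d):
    """Percent-encode a date written as mm/dd/yyyy."""
    # safe='' makes quote() encode '/' as well, which it leaves alone by default
    return quote(d.strftime('%m/%d/%Y'), safe='')

def daterange(start, end):
    """Yield each date from start up to, but not including, end."""
    current = start
    while current < end:
        yield current
        current += timedelta(days=1)

for d in daterange(date(2015, 10, 13), date(2015, 10, 16)):
    print(encode_date(d))
```

Separating the date iteration from the encoding keeps the scraping loop itself short, and the same two helpers could serve any game whose URL carries a date.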

The code for Florida’s Lucky Money game is very similar. The only meaningful difference is that Lucky Money draws happen on Tuesdays and Fridays only, so the code checks the day of the week before building the URL in order to avoid getting an error caused by trying to access a non-existent page.

from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

def encodeDate(dateob):
    # Build mm%2Fdd%2Fyyyy ('%2F' is the URL encoding of '/')
    answer = dateob.strftime('%m') + '%2F'
    answer = answer + dateob.strftime('%d') + '%2F'
    answer = answer + dateob.strftime('%Y') + '&submitForm=Submit'
    return answer

fllm = open('fl_lucky_money.csv','w')
fllm.write(','.join(['drawdate','n1','n2','n3','n4','luckyball','win41','win40','win31','win30','win21','win11','win20','prize41','prize40','prize31','prize30','prize21','prize11','prize20'])+'\n')
url_stem = 'http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=LUCKYMONEY&singleDateIn='
start_date = date(2014,7,4)
end_date = date(2015,10,24)
current = start_date
while current < end_date:
    # Lucky Money is drawn only on Tuesdays ('2') and Fridays ('5'); skip other days
    if current.strftime('%w') not in ['2','5']:
        current = current + timedelta(1)
        continue
    url = url_stem + encodeDate(current)
    page = requests.get(url).text
    bsPage = BeautifulSoup(page, 'html.parser')
    numbers = bsPage.find_all("div",class_="winningNumbers")
    temp = numbers[0].get_text()
    draws = re.split('[-\n]',temp)
    # Four numbers plus the Lucky Ball
    draws = draws[1:6]
    winners = bsPage.find_all("td",class_="column2")
    winners = [tag.get_text().replace(',','') for tag in winners[:-1]]
    prizes = bsPage.find_all("td", class_="column3 columnLast")
    prizes = [tag.get_text().replace('$','').replace(',','') for tag in prizes[:-1]]
    fllm.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    print(current.strftime('%Y-%m-%d'))
    current = current + timedelta(1)

fllm.close()
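
The day-of-week check can be pulled out into a small generator, which makes it easy to reuse for games with other schedules. This is a sketch only; `draw_dates` is a name introduced here, and the date range is chosen arbitrarily for the demonstration.

```python
from datetime import date, timedelta

def draw_dates(start, end, weekdays):
    """Yield dates in [start, end) whose strftime('%w') code is in weekdays."""
    current = start
    while current < end:
        if current.strftime('%w') in weekdays:
            yield current
        current += timedelta(days=1)

# '%w' numbers the days from Sunday = '0', so Tuesday is '2' and Friday is '5'.
lucky_money_days = list(draw_dates(date(2015, 10, 12), date(2015, 10, 24), {'2', '5'}))
print([d.strftime('%Y-%m-%d') for d in lucky_money_days])
```

Because the generator never yields a date at or beyond `end`, the scraping loop cannot run past the end of the range while skipping non-draw days.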

North Carolina’s Cash 5 game calls for the same strategy. The code has the same structure as the Fantasy 5 code; the only differences come from the structure and tags of the pages. A sample data page can be found here.

from datetime import timedelta, date
import requests
from bs4 import BeautifulSoup
import re

ncc5 = open('nc_cash_5.csv','w')
ncc5.write(','.join(['drawdate','n1','n2','n3','n4','n5','winners5','winners4','winners3','prize5','prize4','prize3'])+'\n')
url_stem = 'http://www.nc-educationlottery.org/cash5_payout.aspx?drawDate='
start_date = date(2006,10,27)
end_date = date(2015,10,27)
current = start_date
p = re.compile('[,$]')

def spanText(bsPage, span_id):
    # Text of the span with the given id, with commas and dollar signs removed
    return p.sub('', bsPage.find_all("span", id=span_id)[0].get_text())

while current < end_date:
    print(current.strftime('%Y-%m-%d'))
    url = url_stem + current.strftime('%m/%d/%Y')
    page = requests.get(url).text
    bsPage = BeautifulSoup(page, 'html.parser')

    # Each value sits in its own span, identified by a fixed id
    draws = [spanText(bsPage, 'ctl00_MainContent_lblCash5Num%d' % i) for i in range(1,6)]
    winners = [spanText(bsPage, 'ctl00_MainContent_lblCash5Match%d' % i) for i in (5,4,3)]
    prizes = [spanText(bsPage, 'ctl00_MainContent_lblCash5Match%dPrize' % i) for i in (5,4,3)]

    # When nobody matches all five numbers the top prize reads 'Rollover'
    if prizes[0] == 'Rollover':
        prizes[0] = '0'
    ncc5.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes)+'\n')
    current = current + timedelta(1)

ncc5.close()
print('finished')
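
The cleanup applied to each scraped field can be tried out on its own, without fetching any pages. In this sketch `clean_amount` is a name introduced here, the sample strings are invented, and the 'Rollover' substitution is applied to any field for simplicity, whereas the scraper above applies it only to the top prize.

```python
import re

STRIP_CHARS = re.compile('[,$]')

def clean_amount(text):
    """Remove '$' and ',' from a field and map 'Rollover' to '0'."""
    cleaned = STRIP_CHARS.sub('', text.strip())
    return '0' if cleaned == 'Rollover' else cleaned

# Invented sample values, not scraped data
for raw in ['$123,456', '1,024', 'Rollover']:
    print(raw, '->', clean_amount(raw))
```

Writing plain digits to the CSV means the prize and winner columns can later be read as numbers without further parsing.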

Using Selenium to Click a Link

Take a look at the Tennessee Cash website here.

There are two types of links that are of interest here. First there are the “details” links to the right. I chose to deal with these by having Beautiful Soup read the URLs encoded in the tags and use them to access each page. A more challenging problem is using the “Next Page” link at the bottom of the page to access the next set of 40 “details” links. For this I used the Selenium package. (Read the documentation here.) Fortunately, the link has an id that remains the same no matter how many times we click, so the code is straightforward.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import requests
from datetime import date
from time import sleep


def GetTnCashData(url):
    # Read winner counts and prize amounts for the 7 prize levels from a "details" page
    page = requests.get(url).text
    bsPage = BeautifulSoup(page, 'html.parser')
    temp = bsPage.find_all("td",class_="SmallBlackText")
    winners = []
    prizes = []
    for i in range(1,8):
        winners.append(temp[3*i+1].get_text())
        prizes.append(temp[3*i+2].get_text().replace('$','').replace(',',''))
    return winners + prizes

def cleanDate(strdate):
    # Convert m/d/yyyy to yyyy-mm-dd
    temp = strdate.split('/')
    return date(int(temp[2]),int(temp[0]),int(temp[1])).strftime('%Y-%m-%d')

tnc = open('tn_cash.csv','w')
tnc.write(','.join(['drawdate','n1','n2','n3','n4','n5','cashball','win51','win50','win41','win40','win31','win30','win21','prize51','prize50','prize41','prize40','prize31','prize30','prize21'])+'\n')
driver = webdriver.Firefox()
driver.get('https://www.tnlottery.com/winningnumbers/TennesseeCashlist.aspx?TCShowall=y#TennesseeCashball')
html = driver.page_source
nextLink = "navTennesseeCashNextPage"
soup = BeautifulSoup(html, 'html.parser')
for pg in range(0,20):
    # Centered cells come in groups of three per drawing: date, numbers, "details" link
    temp = soup.find_all("td",align="center")
    top = (len(temp)-4)//3 + 1
    print(pg, len(temp))
    for i in range(1,top):
        drawDate = [cleanDate(temp[3*i].get_text())]
        NumsDrawn = temp[3*i+1].get_text().replace('-',' ').split(' ')
        # The drawing id sits between '=' and the closing quote in the link's href
        drawID = temp[3*i+2].a.get('href')
        drawID = drawID[drawID.index('=')+1:]
        drawID = drawID[:drawID.index("'")]
        drawData = GetTnCashData('https://www.tnlottery.com/winningnumbers/TennesseeCashdetails_popup.aspx?id='+drawID)
        tnc.write(','.join(drawDate + NumsDrawn + drawData) + '\n')
    # Move to the next set of results and wait for it to load
    driver.find_element(By.ID, nextLink).click()
    sleep(30)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

tnc.close()
driver.quit()
print('Done')

Note that this code builds each data point from two different sources: the date and numbers drawn are read from the main page while the winner counts and prize amounts are read from the pop-up window you see when you click a “details” link.
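
The step that links the two sources is pulling the drawing id out of each “details” link. Here is a minimal sketch of that slicing, using the same rule as the code above (the text between the first `=` and the next single quote). The function name `extract_draw_id` and the sample href are hypothetical; the real link format on the site may differ.

```python
def extract_draw_id(href):
    """Return the text between the first '=' and the next single quote."""
    after_equals = href[href.index('=') + 1:]
    return after_equals[:after_equals.index("'")]

# Hypothetical href, invented for illustration only
sample_href = "javascript:openDetails('TennesseeCashdetails_popup.aspx?id=12345')"
draw_id = extract_draw_id(sample_href)
print(draw_id)
```

Note that `str.index` raises a `ValueError` when the character is missing, so a link in an unexpected format stops the scraper rather than silently producing a bad row.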

Using Selenium to Fill in a Form

Past results from the Oregon Lottery website can be accessed only by using a form on the results page. Once again, Selenium is up to the challenge. As in the Florida and North Carolina cases, the code iterates through a range of dates, and as in the Lucky Money code it checks for a valid day of the week (Monday, Wednesday, or Saturday). However, here Selenium enters the date into the form in two places, “Start Date” and “End Date.” (Using the same date in both parts of the form simplifies both the iteration and the Beautiful Soup processing.) Then Selenium clicks the submit button.

While testing this I noticed that sometimes the code repeats results from a previous selection, most likely because the new page fails to load fast enough. The code deals with this issue in two ways. First, the sleep function from the time module pauses the code for 30 seconds, greatly reducing the likelihood of the problem occurring. As an extra safety measure, the code also checks that the date on the page matches the one entered into the form before writing the results to a file. If the dates don’t match, the desired date, i.e. the one Selenium entered on the form, is written to an error log.

from selenium import webdriver
from selenium.webdriver.common.by import By
from datetime import timedelta, date
from bs4 import BeautifulSoup
from time import sleep

# Append mode: rows from earlier runs are kept (the header line is written again on each run)
ormb = open('or_megabucks.csv','a')
ormb_err = open('or_megabucks_errors.csv','w')
ormb.write(','.join(['drawdate','n1','n2','n3','n4','n5','n6','winners6','winners5','winners4','prize6','prize5','prize4'])+'\n')
start_date = date(2012,10,30)
end_date = date(2015,10,29)
current = start_date

driver = webdriver.Firefox()
driver.get('http://www.oregonlottery.org/games/draw-games/megabucks/past-results')

while current < end_date:
    # Megabucks is drawn on Mondays ('1'), Wednesdays ('3'), and Saturdays ('6'); skip other days
    if current.strftime('%w') not in ['1','3','6']:
        current = current + timedelta(1)
        continue
    # Enter the same date as start and end so the page shows a single drawing
    driver.find_element(By.ID, "FromDate").clear()
    driver.find_element(By.ID, "ToDate").clear()
    driver.find_element(By.ID, "FromDate").send_keys(current.strftime('%m/%d/%Y'))
    driver.find_element(By.ID, "ToDate").send_keys(current.strftime('%m/%d/%Y'))
    driver.find_element(By.CSS_SELECTOR, ".viewResultsButton").click()
    sleep(30)  # give the results time to load
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    test1 = soup.find_all("td")
    numbers = [test1[i].get_text() for i in range(2,8)]
    test2 = soup.find_all("strong")
    winners = [test2[1].get_text().replace(',','')]
    prizes = [test2[0].get_text().replace('$','').replace(',','')]
    for i in range(0,2):
        winners.append(test2[4*i+3].get_text().replace(',',''))
        prizes.append(test2[4*i+2].get_text().replace('$','').replace(',',''))
    # Write the row only if the page shows the date that was requested
    testdate = test1[0].get_text().split('/')
    testdate = date(int(testdate[2]),int(testdate[0]),int(testdate[1]))
    if current.strftime('%Y-%m-%d') == testdate.strftime('%Y-%m-%d'):
        ormb.write(','.join([testdate.strftime('%Y-%m-%d')] + numbers + winners + prizes)+'\n')
    else:
        ormb_err.write(current.strftime('%Y-%m-%d') + '\n')

    current = current + timedelta(1)

ormb.close()
ormb_err.close()
driver.quit()
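
The date check that guards against stale pages can be illustrated without a browser. In this sketch the names `parse_page_date` and `route_result` are introduced for illustration, lists stand in for the two output files, and the rows are made up.

```python
from datetime import date

def parse_page_date(text):
    """Convert an mm/dd/yyyy string into a date object."""
    month, day, year = (int(part) for part in text.strip().split('/'))
    return date(year, month, day)

def route_result(requested, page_date_text, row, results, errors):
    """Keep the row only if the page shows the requested date; otherwise log the date."""
    if parse_page_date(page_date_text) == requested:
        results.append([requested.isoformat()] + row)
    else:
        errors.append(requested.isoformat())

results, errors = [], []
# Made-up rows: the second call simulates a stale page still showing the previous drawing.
route_result(date(2015, 10, 26), '10/26/2015', ['1', '2', '3'], results, errors)
route_result(date(2015, 10, 28), '10/26/2015', ['1', '2', '3'], results, errors)
print(results, errors)
```

Logging the requested date rather than the date on the page means the error log is a ready-made list of drawings to fetch again.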

Visualizations

Any number of visualizations of the scraped data are possible, but here let’s focus on a type of plot that not only suggests an association between the numbers drawn and the prize amounts but also motivates a statistical test to be performed later.

The summary statistic that we will use is simply the sum of the numbers drawn. The plots will show histograms of this sum for two sets of drawings: those where the prize amounts were less than the 25th percentile for all draws (labelled “Small Prizes”) and those where the prize amounts were greater than the 75th percentile for all draws (labelled “Large Prizes”).
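
The split into “Small Prizes” and “Large Prizes” can be sketched as follows, using `statistics.quantiles` from the Python 3.8+ standard library. The drawings below are invented purely to show the mechanics; they are not scraped data and say nothing about any real game. Only one prize level is used, which is an assumption made here for simplicity.

```python
from statistics import quantiles

# Made-up (numbers drawn, prize amount) pairs, for illustration only
draws = [
    ([1, 2, 3, 4, 5], 20000.0),
    ([3, 7, 11, 19, 23], 45000.0),
    ([5, 12, 18, 25, 30], 60000.0),
    ([8, 14, 21, 27, 33], 75000.0),
    ([10, 17, 24, 31, 36], 90000.0),
    ([15, 22, 28, 34, 35], 110000.0),
    ([20, 26, 29, 32, 36], 130000.0),
    ([30, 31, 33, 35, 36], 150000.0),
]

prizes = [prize for _, prize in draws]
q1, _, q3 = quantiles(prizes, n=4)  # 25th, 50th, and 75th percentiles

# Sum of the numbers drawn, for the drawings in each tail of the prize distribution
small_sums = [sum(nums) for nums, prize in draws if prize < q1]
large_sums = [sum(nums) for nums, prize in draws if prize > q3]
print('Small Prizes sums:', small_sums)
print('Large Prizes sums:', large_sums)
```

The two lists are what each pair of histograms would be drawn from.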

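[Figures: histograms of the sum of the numbers drawn, “Small Prizes” vs. “Large Prizes,” for North Carolina Cash 5, Oregon Megabucks, Tennessee Cash, Florida Lucky Money, and Florida Fantasy 5]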

Conclusion

The visualizations presented here provide multiple examples of parimutuel lotteries where there seems to be a relationship between the numbers drawn and the prize amounts. Predicting prize amounts from the numbers drawn therefore looks like a feasible project, and the sum of the numbers drawn appears to be a reasonable starting point.

About Author

Stephen Penrice

After starting his career as a Ph.D. in pure mathematics, Stephen has worked continuously to grow his technical proficiency in order to take on more and more challenges with an applied focus. His latest work in the finance...
