Scraping Lottery Data
Contributed by Stephen Penrice. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his third class project (due in the 6th week of the program).
Lucky Numbers Part 1: Web Scraping and Preliminary Analysis
Stephen Penrice
In a lottery game, the numbers that the lottery selects are random, but the numbers that players choose to play are not. To the best of my knowledge, data on player selections are not publicly available. However, lotteries do publish data on the numbers they draw and the amounts of the prizes they award. In games where prizes are parimutuel, that is, where a certain percentage of sales is divided equally among the winners, one can infer the popularity of the numbers drawn from the prize amounts: popular numbers result in smaller prizes because there are more winners splitting the prize money. The primary component of this project is scraping a variety of lottery websites using a variety of techniques in order to gather data for an analysis that relates prize amounts to the numbers drawn. Ultimately, I would like to build machine learning models that predict prize amounts as a function of the numbers drawn. However, here I simply present some visualizations and do some hypothesis tests to investigate whether there is a relationship between prize amounts and the sum of the numbers drawn.
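To make the parimutuel reasoning concrete, here is a minimal sketch with made-up numbers. The sales figure, the 50% prize-pool share, and the function names are illustrative assumptions, not values from any actual lottery.

```python
def parimutuel_prize(pool, winners):
    """Prize per winner when a fixed pool is split equally."""
    return pool / winners


def implied_winners(pool, prize):
    """Invert the split: estimate how many winners shared the pool."""
    return pool / prize


# Hypothetical drawing: $200,000 in sales, 50% of sales allocated to the top prize
pool = 200_000 * 0.50

popular = parimutuel_prize(pool, winners=8)    # many players chose these numbers
unpopular = parimutuel_prize(pool, winners=2)  # few players chose these numbers

print(popular, unpopular)
```

With the pool held fixed, the only thing that changes the prize is the number of winners, which is why a small prize signals a popular combination.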
Scraping Strategies
In this project a single observation is a lottery drawing, with the data comprising a date, the numbers drawn by the lottery, the number of winners at each prize level, and the prize amount at each level. In order to get all of these data components, one has to visit a separate page for each drawing. Beautiful Soup can easily scrape each of these pages, so the primary challenge was visiting each page within a site in an automated fashion.
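One way to picture a single observation is as a small record type. The sketch below is illustrative only: the `Drawing` class and its field names are hypothetical and do not appear in the scraping code that follows, which writes comma-joined strings directly. The sample values are invented.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Drawing:
    drawdate: date
    numbers: List[int]    # numbers drawn by the lottery
    winners: List[int]    # winner counts, highest prize level first
    prizes: List[float]   # prize amounts, same order as winners

    def to_row(self):
        # Flatten into one CSV row: date, numbers, winners, prizes
        return [self.drawdate.isoformat()] + self.numbers + self.winners + self.prizes


# Made-up values, for illustration only
d = Drawing(date(2015, 10, 13), [3, 11, 19, 27, 35], [2, 310, 9500], [100000.0, 105.5, 10.0])

buf = io.StringIO()
csv.writer(buf).writerow(d.to_row())
print(buf.getvalue())
```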
Since I was accessing several different websites, I had to employ several different strategies. In increasing order of complexity they were: encoding dates into URLs, using Selenium to click a link, and using Selenium to fill in a form.
Encoding Dates into a URL
Florida’s Fantasy 5 game is a typical example of a website well suited to this strategy. A typical results page looks like this.
While it is possible to access individual pages using menus, visiting one of these pages reveals that the URLs have a particular format that encodes the game name and the date of the drawing. For example,
http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=FANTASY5&singleDateIn=10%2F13%2F2015&fromDateIn=&toDateIn=&n1In=&n2In=&n3In=&n4In=&n5In=&submitForm=Submit
is the URL for the page that displays the data for the Fantasy 5 drawing that occurred on October 13, 2015, the key portion of the address being the string
10%2F13%2F2015
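The `%2F` sequences are simply the percent-encoded form of the `/` characters in the date `10/13/2015`. As a minimal sketch (the helper name `encode_draw_date` is hypothetical), the standard library's `urllib.parse.quote` produces the same string:

```python
from datetime import date
from urllib.parse import quote


def encode_draw_date(d):
    # safe='' forces '/' to be percent-encoded as %2F
    return quote(d.strftime('%m/%d/%Y'), safe='')


print(encode_draw_date(date(2015, 10, 13)))
```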
The following code uses the datetime library to create a date object that it uses to iterate through a specified range of dates, creating a URL string for each one that can be used to access a page which is then processed using Beautiful Soup.
```python
from datetime import timedelta, date
import re

import requests
from bs4 import BeautifulSoup


def encodeDate(dateob):
    # Build the URL-encoded date (MM%2FDD%2FYYYY) plus the trailing form field
    answer = dateob.strftime('%m') + '%2F'
    answer = answer + dateob.strftime('%d') + '%2F'
    answer = answer + dateob.strftime('%Y') + '&submitForm=Submit'
    return answer


fl5 = open('fl_fant_5.csv', 'w')
fl5.write(','.join(['drawdate', 'n1', 'n2', 'n3', 'n4', 'n5', 'winners5', 'winners4', 'winners3', 'prize5', 'prize4', 'prize3']) + '\n')
url_stem = 'http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=FANTASY5&singleDateIn='
start_date = date(2007, 1, 1)
end_date = date(2015, 10, 26)
current = start_date
while current < end_date:
    url = url_stem + encodeDate(current)
    page = requests.get(url).text
    bsPage = BeautifulSoup(page, 'html.parser')
    # The five numbers drawn, separated by hyphens and newlines
    numbers = bsPage.find_all("div", class_="winningNumbers")
    temp = numbers[0].get_text()
    draws = re.split('[-\n]', temp)
    draws = draws[1:6]
    # Winner counts and prize amounts; the last row of each column is dropped
    winners = bsPage.find_all("td", class_="column2")
    winners = [tag.get_text().replace(',', '') for tag in winners[:-1]]
    prizes = bsPage.find_all("td", class_="column3 columnLast")
    prizes = [tag.get_text().replace('$', '').replace(',', '') for tag in prizes[:-1]]
    fl5.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes) + '\n')
    print(current.strftime('%Y-%m-%d'))
    current = current + timedelta(1)
fl5.close()
print('done')
```
The code for Florida’s Lucky Money game is very similar. The only meaningful difference is that Lucky Money draws happen on Tuesdays and Fridays only, so the code checks the day of the week before building the URL in order to avoid getting an error caused by trying to access a non-existent page.
```python
from datetime import timedelta, date
import re

import requests
from bs4 import BeautifulSoup


def encodeDate(dateob):
    # Build the URL-encoded date (MM%2FDD%2FYYYY) plus the trailing form field
    answer = dateob.strftime('%m') + '%2F'
    answer = answer + dateob.strftime('%d') + '%2F'
    answer = answer + dateob.strftime('%Y') + '&submitForm=Submit'
    return answer


fllm = open('fl_lucky_money.csv', 'w')
fllm.write(','.join(['drawdate', 'n1', 'n2', 'n3', 'n4', 'luckyball', 'win41', 'win40', 'win31', 'win30', 'win21', 'win11', 'win20', 'prize41', 'prize40', 'prize31', 'prize30', 'prize21', 'prize11', 'prize20']) + '\n')
url_stem = 'http://www.flalottery.com/site/winningNumberSearch?searchTypeIn=date&gameNameIn=LUCKYMONEY&singleDateIn='
start_date = date(2014, 7, 4)
end_date = date(2015, 10, 24)
current = start_date
while current < end_date:
    # Skip ahead to the next Tuesday ('2') or Friday ('5')
    while current.strftime('%w') not in ['2', '5']:
        current = current + timedelta(1)
    url = url_stem + encodeDate(current)
    page = requests.get(url).text
    bsPage = BeautifulSoup(page, 'html.parser')
    # Four numbers plus the Lucky Ball
    numbers = bsPage.find_all("div", class_="winningNumbers")
    temp = numbers[0].get_text()
    draws = re.split('[-\n]', temp)
    draws = draws[1:6]
    winners = bsPage.find_all("td", class_="column2")
    winners = [tag.get_text().replace(',', '') for tag in winners[:-1]]
    prizes = bsPage.find_all("td", class_="column3 columnLast")
    prizes = [tag.get_text().replace('$', '').replace(',', '') for tag in prizes[:-1]]
    fllm.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes) + '\n')
    print(current.strftime('%Y-%m-%d'))
    current = current + timedelta(1)
fllm.close()
```
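The day-of-week filter can also be separated from the scraping loop. The sketch below is an alternative formulation, not the code used above: `draw_dates` is a hypothetical generator that yields only the dates falling on the given weekdays, using `date.weekday()` (Monday is 0, so Tuesday is 1 and Friday is 4).

```python
from datetime import date, timedelta


def draw_dates(start, end, weekdays):
    """Yield each date in [start, end) whose weekday() is in weekdays."""
    current = start
    while current < end:
        if current.weekday() in weekdays:
            yield current
        current += timedelta(days=1)


# Tuesdays (1) and Fridays (4) in one sample week
for d in draw_dates(date(2015, 10, 12), date(2015, 10, 19), {1, 4}):
    print(d.isoformat())
```

Because the range check and the weekday check happen in the same loop, this version never yields a date on or after `end`.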
North Carolina’s Cash 5 game requires the same strategy. The structure of the code is the same as the Fantasy 5 code, with the differences coming from the differences in the page structures and tags. A sample data page can be found here.
```python
from datetime import timedelta, date
import re

import requests
from bs4 import BeautifulSoup

ncc5 = open('nc_cash_5.csv', 'w')
ncc5.write(','.join(['drawdate', 'n1', 'n2', 'n3', 'n4', 'n5', 'winners5', 'winners4', 'winners3', 'prize5', 'prize4', 'prize3']) + '\n')
url_stem = 'http://www.nc-educationlottery.org/cash5_payout.aspx?drawDate='
start_date = date(2006, 10, 27)
end_date = date(2015, 10, 27)
current = start_date
p = re.compile('[,$]')  # strips commas and dollar signs
while current < end_date:
    print(current.strftime('%Y-%m-%d'))
    url = url_stem + current.strftime('%m/%d/%Y')
    page = requests.get(url).text
    bsPage = BeautifulSoup(page, 'html.parser')
    # Each value lives in a span with a fixed id
    draws = []
    draws.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Num1")[0].get_text()))
    draws.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Num2")[0].get_text()))
    draws.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Num3")[0].get_text()))
    draws.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Num4")[0].get_text()))
    draws.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Num5")[0].get_text()))
    winners = []
    winners.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match5")[0].get_text()))
    winners.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match4")[0].get_text()))
    winners.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match3")[0].get_text()))
    prizes = []
    prizes.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match5Prize")[0].get_text()))
    prizes.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match4Prize")[0].get_text()))
    prizes.append(p.sub('', bsPage.find_all("span", id="ctl00_MainContent_lblCash5Match3Prize")[0].get_text()))
    # A jackpot with no winner is reported as 'Rollover'; record it as 0
    if prizes[0] == 'Rollover':
        prizes[0] = '0'
    ncc5.write(','.join([current.strftime('%Y-%m-%d')] + draws + winners + prizes) + '\n')
    current = current + timedelta(1)
ncc5.close()
print('finished')
```
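The cleaning steps in the North Carolina scraper (stripping `$` and `,`, and treating a `Rollover` top prize as zero) can be collected into one helper. This is a sketch: `clean_amount` is a hypothetical name, the extra whitespace stripping is my addition, and the sample strings are invented rather than taken from the site.

```python
import re

_strip = re.compile('[,$]')


def clean_amount(text):
    """Remove '$' and ',' and map a 'Rollover' entry to '0'."""
    value = _strip.sub('', text).strip()  # .strip() is an added safeguard
    return '0' if value == 'Rollover' else value


print(clean_amount('$123,456'))
print(clean_amount('Rollover'))
```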
Using Selenium to Click a Link
Take a look at the Tennessee Cash website here.
There are two types of links that are of interest here. First there are the “details” links to the right. I chose to deal with these by having Beautiful Soup read the URLs encoded in the tags and use them to access each page. A more challenging problem is to use the “Next Page” link at the bottom of the page to access the next set of 40 “details” links. For this I used the Selenium package. (Read the documentation here.) Fortunately, the link has an id that remains the same no matter how many times we click, so the code is straightforward.
```python
from datetime import date
from time import sleep

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By


def GetTnCashData(url):
    # Read winner counts and prize amounts from a "details" pop-up page
    page = requests.get(url).text
    bsPage = BeautifulSoup(page, 'html.parser')
    temp = bsPage.find_all("td", class_="SmallBlackText")
    winners = []
    prizes = []
    for i in range(1, 8):
        winners.append(temp[3*i+1].get_text())
        prizes.append(temp[3*i+2].get_text().replace('$', '').replace(',', ''))
    return winners + prizes


def cleanDate(strdate):
    # Convert month/day/year to YYYY-MM-DD
    temp = strdate.split('/')
    return date(int(temp[2]), int(temp[0]), int(temp[1])).strftime('%Y-%m-%d')


tnc = open('tn_cash.csv', 'w')
tnc.write(','.join(['drawdate', 'n1', 'n2', 'n3', 'n4', 'n5', 'cashball', 'win51', 'win50', 'win41', 'win40', 'win31', 'win30', 'win21', 'prize51', 'prize50', 'prize41', 'prize40', 'prize31', 'prize30', 'prize21']) + '\n')
driver = webdriver.Firefox()
driver.get('https://www.tnlottery.com/winningnumbers/TennesseeCashlist.aspx?TCShowall=y#TennesseeCashball')
html = driver.page_source
nextLink = "navTennesseeCashNextPage"
soup = BeautifulSoup(html, 'html.parser')
for pg in range(0, 20):
    temp = soup.find_all("td", align="center")
    top = (len(temp)-4)//3 + 1  # integer division so range() gets an int
    print(pg, len(temp))
    for i in range(1, top):
        drawDate = [cleanDate(temp[3*i].get_text())]
        NumsDrawn = temp[3*i+1].get_text().replace('-', ' ').split(' ')
        # Pull the draw id out of the "details" link
        drawID = temp[3*i+2].a.get('href')
        drawID = drawID[drawID.index('=')+1:]
        drawID = drawID[:drawID.index("'")]
        drawData = GetTnCashData('https://www.tnlottery.com/winningnumbers/TennesseeCashdetails_popup.aspx?id=' + drawID)
        tnc.write(','.join(drawDate + NumsDrawn + drawData) + '\n')
    # Click "Next Page" and wait for the new list to load
    driver.find_element(By.ID, nextLink).click()
    sleep(30)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
tnc.close()
print('Done')
```
Note that this code builds each data point from two different sources: the date and numbers drawn are read from the main page while the winner counts and prize amounts are read from the pop-up window you see when you click a “details” link.
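The `cleanDate` helper converts a month/day/year string to ISO format by splitting on `/`. An equivalent sketch using `datetime.strptime` is shown here; the name `clean_date` is hypothetical, and it assumes the site's dates always follow the month/day/four-digit-year pattern.

```python
from datetime import datetime


def clean_date(strdate):
    # Parse month/day/year, then re-emit as YYYY-MM-DD
    return datetime.strptime(strdate, '%m/%d/%Y').strftime('%Y-%m-%d')


print(clean_date('10/13/2015'))
```

Unlike the manual split, `strptime` raises a `ValueError` on a malformed date, which makes a changed page layout easier to notice.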
Using Selenium to Fill in a Form
Past results from the Oregon Lottery website can be accessed only by using a form on the results page. Once again, Selenium is up to the challenge. As in the Florida and North Carolina cases, the code iterates through a range of dates, and as in the Lucky Money case it checks for a valid day of the week (Monday, Wednesday, or Saturday). However, here Selenium enters the date into the form in two places, “Start Date” and “End Date.” (Using the same date in both parts of the form simplifies both the iteration and the Beautiful Soup processing.) Then Selenium clicks the submit button.
While testing this I noticed that the code sometimes repeats results from a previous selection, most likely because the new page fails to load fast enough. The code deals with this issue in two ways. First, the sleep function from the time module pauses the code for 30 seconds, greatly reducing the likelihood of the problem occurring. As an extra safety measure, the code also checks that the date on the page matches the one entered into the form before writing the results to a file. If the dates don’t match, the desired date, i.e. the one Selenium entered on the form, is written to an error log.
```python
from datetime import timedelta, date
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

ormb = open('or_megabucks.csv', 'a')
ormb_err = open('or_megabucks_errors.csv', 'w')
ormb.write(','.join(['drawdate', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'winners6', 'winners5', 'winners4', 'prize6', 'prize5', 'prize4']) + '\n')
start_date = date(2012, 10, 30)
end_date = date(2015, 10, 29)
current = start_date
driver = webdriver.Firefox()
driver.get('http://www.oregonlottery.org/games/draw-games/megabucks/past-results')
while current < end_date:
    # Skip ahead to the next Monday ('1'), Wednesday ('3'), or Saturday ('6')
    while current.strftime('%w') not in ['1', '3', '6']:
        current = current + timedelta(1)
    # Enter the same date as both start and end of the search range
    driver.find_element(By.ID, "FromDate").clear()
    driver.find_element(By.ID, "ToDate").clear()
    driver.find_element(By.ID, "FromDate").send_keys(current.strftime('%m/%d/%Y'))
    driver.find_element(By.ID, "ToDate").send_keys(current.strftime('%m/%d/%Y'))
    driver.find_element(By.CSS_SELECTOR, ".viewResultsButton").click()
    sleep(30)  # give the results time to load
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    test1 = soup.find_all("td")
    numbers = [test1[i].get_text() for i in range(2, 8)]
    test2 = soup.find_all("strong")
    winners = [test2[1].get_text().replace(',', '')]
    prizes = [test2[0].get_text().replace('$', '').replace(',', '')]
    for i in range(0, 2):
        winners.append(test2[4*i+3].get_text().replace(',', ''))
        prizes.append(test2[4*i+2].get_text().replace('$', '').replace(',', ''))
    # Confirm the page shows the date that was requested before saving
    testdate = test1[0].get_text().split('/')
    testdate = date(int(testdate[2]), int(testdate[0]), int(testdate[1]))
    if current.strftime('%Y-%m-%d') == testdate.strftime('%Y-%m-%d'):
        ormb.write(','.join([testdate.strftime('%Y-%m-%d')] + numbers + winners + prizes) + '\n')
    else:
        ormb_err.write(current.strftime('%Y-%m-%d') + '\n')
    current = current + timedelta(1)
ormb.close()
ormb_err.close()
```
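A fixed 30-second pause is simple but slow. One alternative, sketched here rather than taken from the code above, is to poll until the page shows the requested date and give up after a timeout. `wait_until` is a hypothetical helper; in the scraper, the `condition` argument would be a function that re-reads `driver.page_source` and compares the displayed date with `current`. The stand-in condition below just becomes true on its third call.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.5):
    """Call condition() repeatedly until it returns True or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "the page now shows the requested date": true on the third check
checks = {'count': 0}


def page_is_ready():
    checks['count'] += 1
    return checks['count'] >= 3


print(wait_until(page_is_ready, timeout=5.0, interval=0.01))
```

If `wait_until` returns `False`, the date can be written to the error log, just as the existing code does on a date mismatch.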
Visualizations
Any number of visualizations of the scraped data are possible, but here let’s focus on a type of plot that not only suggests an association between the numbers drawn and the prize amounts but also motivates a statistical test to be performed later.
The summary statistic that we will use is simply the sum of the numbers drawn. The plots will show histograms of this sum for two sets of drawings: those where the prize amounts were less than the 25th percentile for all draws (labelled “Small Prizes”) and those where the prize amounts were greater than the 75th percentile for all draws (labelled “Large Prizes”).
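The grouping behind these histograms can be sketched as follows. This is not the code used to produce the plots: the `split_by_prize` helper is hypothetical, the sample drawings are invented, and `statistics.quantiles` (with its default method) stands in for whatever percentile calculation was actually used.

```python
from statistics import quantiles


def split_by_prize(drawings):
    """drawings: list of (numbers, prize) pairs.

    Return the sums of the numbers drawn for the small-prize group
    (below the 25th percentile) and the large-prize group (above the 75th).
    """
    prizes = [prize for _, prize in drawings]
    q1, _, q3 = quantiles(prizes, n=4)  # 25th, 50th, 75th percentile cut points
    small = [sum(nums) for nums, prize in drawings if prize < q1]
    large = [sum(nums) for nums, prize in drawings if prize > q3]
    return small, large


# Invented data: (numbers drawn, top prize)
sample = [
    ([1, 2, 3, 4, 5], 10.0),
    ([2, 7, 11, 19, 23], 20.0),
    ([5, 9, 14, 22, 30], 30.0),
    ([8, 15, 21, 28, 33], 40.0),
    ([10, 18, 25, 31, 34], 50.0),
    ([12, 20, 27, 32, 35], 60.0),
    ([22, 29, 33, 35, 36], 70.0),
]
small, large = split_by_prize(sample)
print(small, large)
```

The two returned lists are what would be passed to a histogram function, one per panel.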
Conclusion
The visualizations presented here provide multiple examples of parimutuel lotteries where there seems to be a relationship between the numbers drawn and the prize amounts. Therefore the project of predicting prize amounts from the drawn numbers is likely to produce some results, and using the sum of the drawn numbers appears to be a great starting point.