The third tab of each NBA Game Recap page contains the play-by-play for the game. However, an NBA season consists of 1,230 regular-season games, so we need an automated method of finding the game pages. Ideally, we'd like to be able to scrape a single day at a time, as this lends itself to regular daily updates. The NBA has a daily schedule page with links to that day's game recaps, and the URL pattern for a given date is easy to determine: for example, the games for February 26, 2016 are listed at http://www.nba.com/gameline/20160226/. So, our workflow will be to find the schedule page for a given date, extract the game recap links for each game, and then follow those links to scrape the play-by-play for each game.
Scrapy
Initially, I chose to use scrapy mostly because it supports proper selectors (both CSS and XPath) for navigating HTML documents. However, the design of the framework also lends itself to executing our intended workflow efficiently: scrapy lets us queue up additional pages for scraping and fetches them in parallel, so as we parse the schedule page, we can queue up each game recap page for scraping.
Our initial parse method is quite simple:
def parse(self, response):
    for href in response.css("a.recapAnc::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_game_recap)
Scrapy uses Python generators: a parse method yields objects to the framework for further processing. In this case, we're finding each game recap link within the response and yielding a scrapy.Request, telling scrapy to scrape that link using the specified callback.
Parsing the game recap is a bit more complicated:
def parse_game_recap(self, response):
    away = None
    home = None
    quarter = None
    # There's some useful information in the url, so we extract it.
    # This probably should have been a single regex, but it doesn't matter much.
    game_id = re.search('([A-Z]+)', response.url).group(1)
    pbp_item = PlayByPlay()  # We'll see scrapy Items shortly.
    # Find the play-by-play table and iterate over its rows.
    for index, row in enumerate(response.xpath('//div[@id="nbaGIPBP"]//tr')):
        # If we get a row with team names, extract them.
        if int(row.xpath('@class="nbaGIPBPTeams"').extract_first()) == 1:
            (away, home) = [x.strip() for x in row.xpath('td/text()').extract()]
        else:
            # Otherwise, build up the PlayByPlay item with the data in the row.
            pbp_item['quarter'] = quarter
            pbp_item['game_id'] = game_id
            pbp_item['index'] = index
            for field in row.xpath('td'):
                field_class = str(field.xpath('@class').extract_first())
                if field_class == 'nbaGIPbPTblHdr':
                    name = row.xpath('td/a/@name')
                    if len(name) > 0:
                        quarter = name.extract_first()
                        pbp_item['quarter'] = quarter
                elif len(field.xpath('@id')) > 0:
                    # Sometimes we'll have rows that don't fit the structure of the
                    # PlayByPlay item. We store them in a GameEvent item.
                    event_item = GameEvent()
                    event_item['type'] = field.xpath('@id').extract_first()
                    event_item['text'] = field.xpath('div/text()').extract_first()
                    event_item['quarter'] = quarter
                    event_item['game_id'] = game_id
                    event_item['index'] = index
                    # We can yield Items too, for processing by scrapy's pipelines,
                    # which we'll learn about later.
                    yield event_item
                else:
                    text = field.xpath('text()').extract_first().strip()
                    if len(text) == 0:
                        continue
                    elif field_class in ('nbaGIPbPLft', 'nbaGIPbPLftScore'):
                        pbp_item['team'] = away
                        pbp_item['text'] = text
                    elif field_class in ('nbaGIPbPRgt', 'nbaGIPbPRgtScore'):
                        pbp_item['team'] = home
                        pbp_item['text'] = text
                    elif field_class == 'nbaGIPbPMid':
                        pbp_item['clock'] = text
                    elif field_class == 'nbaGIPbPMidScore':
                        pbp_item['clock'] = text
                        pbp_item['score'] = field.xpath('text()').extract()[1].strip()
                    else:
                        raise ValueError("Unknown class: %s" % field_class)
            if 'clock' in pbp_item:
                # Yield the PlayByPlay item we've been working on and start a new one.
                yield pbp_item
                pbp_item = PlayByPlay()
We see here how a scrapy parse method can yield not just scrapy Request objects, but also Item objects.
Here is one of our basic scrapy items at this stage:
class PlayByPlay(scrapy.Item):
    game_id = scrapy.Field()
    quarter = scrapy.Field()
    period = scrapy.Field()
    clock = scrapy.Field()
    score = scrapy.Field()
    team = scrapy.Field()
    text = scrapy.Field()
    index = scrapy.Field()
Dates
We still haven't told scrapy which page to parse. Let's do that now. Here's how we initialize our Spider:
import scrapy
import re
import time

from scraping.items import PlayByPlay, GameEvent


class NbaSpider(scrapy.Spider):
    name = "nba"
    allowed_domains = ["nba.com"]

    # __init__ allows us to specify custom arguments that can be passed to
    # scrapy with the -a option, in this case 'scrape_date'.
    def __init__(self, scrape_date=None, *args, **kwargs):
        super(NbaSpider, self).__init__(*args, **kwargs)
        # If no scrape_date is specified, default to yesterday (subtracting a
        # day of seconds before formatting handles month boundaries correctly).
        if scrape_date is None:
            scrape_date = time.strftime('%Y%m%d', time.localtime(time.time() - 86400))
        # Here's where we define the starting URL.
        self.start_urls = ['http://www.nba.com/gameline/%s/' % scrape_date]

    def parse(self, response):
        ...
Now we can scrape a day of data like this: scrapy crawl nba -a scrape_date=20160226
Pipelines
Our basic scraper/crawler can now pull down the play-by-play for a given date, but we can't yet do anything with it. Scrapy's pipelines allow us to work with our data. First, we'll store it somewhere.
MongoDB
MongoDB is a schema-less NoSQL database with an easy-to-use, JavaScript-based query syntax. It lends itself to situations where we wish to engage in open-ended exploration of the data. It also saved me all the work of creating schemas for my database.
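For example, once some data is loaded, a quick exploratory query with pymongo might look like this (a sketch: the game_id value is illustrative, and 'items' is the pipeline's default database name, as we'll see below):

from pymongo import MongoClient

# Connect to the database the pipeline below writes to (default name 'items').
db = MongoClient()['items']
# Print every fourth-quarter play from a hypothetical game, in order.
for play in db['PlayByPlay'].find({'game_id': 'CHIPHI', 'quarter': 'Q4'}).sort('index'):
    print(play['clock'], play.get('team'), play['text'])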
My MongoDB pipeline is very similar to the example in the Scrapy item pipeline documentation, except that since our application has multiple Item types, we select the MongoDB collection based upon the class name. I've also elected to replace documents in the case of duplicates. To identify duplicates, we've added an index_fields method to each of our Item types.
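Concretely, the heart of that pipeline is a single call (shown in context in the full pipelines.py listing below): the Item's class name selects the collection, index_fields() identifies the document, and the upsert flag replaces any duplicate:

    def process_item(self, item, spider):
        # One collection per Item type; replace any existing document matching
        # the item's index_fields (the final True means upsert).
        self.db[item.__class__.__name__].replace_one(item.index_fields(), dict(item), True)
        return item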
Parsing the Text
Storing the raw play-by-play text only gets us halfway: the text itself encodes the actual events (shots, rebounds, fouls, substitutions, and so on). The TextProcessor pipeline parses it with one regular expression per event type. Its process_item method tries each pattern in turn against the front of the remaining text, appending an event dictionary for every match and trimming the matched prefix, until the text is exhausted (or nothing matches, which raises an error). The full set of patterns appears in the pipelines.py listing below; the last few look like this:

class TextProcessor(object):
    # ... (see pipelines.py below for the full set of patterns) ...
    TECHNICAL_RE = re.compile(r'(.+?) Technical (- )?([A-Z]+)? ?')
    DOUBLE_TECH_RE = re.compile(r'Double Technical - (.+?), (.+?) ')
    DOUBLE_FOUL_RE = re.compile(r'Foul : (Double Personal) - (.+?) , (.+?) ')
    EJECTION_RE = re.compile(r'(.+?) Ejection:(First Flagrant Type 2|Second Technical|Other)')

    # pts, tov, fta, pf, blk, reb, blka, ftm, fg3a, pfd, ast, fg3m, fgm, dreb, fga, stl, oreb
    def process_item(self, item, spider):
        text = item.get('text', None)
        if text:
            item['events'] = []
        while text:
            l = len(text)
            m = self.SHOT_RE.match(text)
            if m:
                event = {'player': m.group(1), 'fga': 1, 'type': m.group(2)}
                if '3pt' in m.group(2):
                    event['fg3a'] = 1
                    if m.group(5) == 'Made':
                        event['fg3m'] = 1
                        event['fgm'] = 1
                        event['pts'] = 3
                else:
                    if m.group(5) == 'Made':
                        event['fgm'] = 1
                        event['pts'] = 2
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.REBOUND_RE.match(text)
            if m:
                event = {'player': m.group(1), 'reb': 1}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.DEFENSE_RE.match(text)
            if m:
                event = {'player': m.group(2)}
                if m.group(1) == 'Block':
                    # Credit the blocked attempt to the shot event we just parsed.
                    item['events'][-1]['blka'] = 1
                    event['blk'] = 1
                else:
                    event['stl'] = 1
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.ASSIST_RE.match(text)
            if m:
                event = {'player': m.group(1), 'ast': 1}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TIMEOUT_RE.match(text)
            if m:
                event = {'timeout': m.group(1)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TURNOVER_RE.match(text)
            if m:
                event = {'player': m.group(1), 'tov': 1, 'note': m.group(2)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TEAM_TURNOVER_RE.match(text)
            if m:
                event = {'turnover': m.group(1)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TEAM_REBOUND_RE.match(text)
            if m:
                item['events'].append({'rebound': 'team'})
                text = text[m.end():].strip()
            # TODO: Are all of these actual personal fouls?
            m = self.FOUL_RE.match(text)
            if m:
                event = {'player': m.group(1), 'pf': 1, 'note': m.group(2)}
                if m.group(4):
                    event['type'] = m.group(4)
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.DOUBLE_FOUL_RE.match(text)
            if m:
                item['events'].append({'player': m.group(2), 'pf': 1, 'note': m.group(1), 'against': m.group(3)})
                item['events'].append({'player': m.group(3), 'pf': 1, 'note': m.group(1), 'against': m.group(2)})
                text = text[m.end():].strip()
            m = self.JUMP_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'jump': 'home'})
                item['events'].append({'player': m.group(2), 'jump': 'away'})
                if m.group(3):
                    item['events'].append({'player': m.group(4), 'jump': 'possession'})
                text = text[m.end():].strip()
            m = self.VIOLATION_RE.match(text)
            if m:
                event = {'player': m.group(1), 'violation': m.group(2)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.FREE_THROW_RE.match(text)
            if m:
                event = {'player': m.group(1), 'fta': 1, 'attempt': m.group(3), 'of': m.group(4)}
                if m.group(5) is None:
                    event['pts'] = 1
                    event['ftm'] = 1
                if m.group(2):
                    event['special'] = m.group(2)
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TECHNICAL_FT_RE.match(text)
            if m:
                event = {'player': m.group(1), 'fta': 1, 'special': 'Technical'}
                if m.group(2) is None:
                    event['pts'] = 1
                    event['ftm'] = 1
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.SUB_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'sub': 'out'})
                item['events'].append({'player': m.group(2), 'sub': 'in'})
                text = text[m.end():].strip()
            m = self.TEAM_VIOLATION_RE.match(text)
            if m:
                item['events'].append({'violation': m.group(1)})
                text = text[m.end():].strip()
            m = self.CLOCK_RE.match(text)
            if m:
                item['clock'] = m.group(1)
                text = text[m.end():].strip()
            m = self.TEAM_RE.match(text)
            if m:
                item['team_abbreviation'] = m.group(1)
                text = text[m.end():].strip()
            m = self.TECHNICAL_RE.match(text)
            if m:
                if m.group(3):
                    item['events'].append({'team': m.group(3), 'technical': m.group(1)})
                else:
                    item['events'].append({'player': m.group(1), 'technical': True})
                text = text[m.end():].strip()
            m = self.DOUBLE_TECH_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'technical': True})
                item['events'].append({'player': m.group(2), 'technical': True})
                text = text[m.end():].strip()
            m = self.EJECTION_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'ejection': True, 'note': m.group(2)})
                text = text[m.end():].strip()
            if len(text) == l:
                raise ValueError('Could not parse text: %s' % text)
            if len(text) == 0:
                text = None
        return item
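To make this concrete, here is a minimal, self-contained sketch of the same match-append-trim loop, using abbreviated stand-ins for SHOT_RE and ASSIST_RE and a made-up play string (real recap text may differ in details such as trailing spaces):

import re

# Abbreviated stand-ins for the patterns above; the input line is
# hypothetical, purely to illustrate the match-append-trim loop.
SHOT_RE = re.compile(r'(.+?) ((Driving |Jump |Layup )+)[Ss]hot: (Made|Missed) ?')
ASSIST_RE = re.compile(r'Assist: (.+?) ')

text = 'Duncan Jump Shot: Made Assist: Parker '
events = []

m = SHOT_RE.match(text)
if m:
    event = {'player': m.group(1), 'fga': 1, 'type': m.group(2).strip()}
    if m.group(4) == 'Made':
        event['fgm'] = 1
        event['pts'] = 2
    events.append(event)
    text = text[m.end():]

m = ASSIST_RE.match(text)
if m:
    events.append({'player': m.group(1), 'ast': 1})

print(events)
# [{'player': 'Duncan', 'fga': 1, 'type': 'Jump', 'fgm': 1, 'pts': 2},
#  {'player': 'Parker', 'ast': 1}]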
Problem: Who is Playing?
While the play-by-play data includes substitutions, it doesn't tell us who started each quarter, so we don't know who was on the floor at any given point in time. However, by cross-referencing against the per-day, per-quarter lineup data from stats.nba.com, we should be able to figure this out.
First, we need to modify our Spider to fetch the lineup data:
def parse(self, response):
    for href in (response.css("a.recapAnc::attr('href')") +
                 response.css("div.nbaFnlMnRecapDiv > a::attr('href')")):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_game_recap)
    # Create Requests for lineup data for 4 quarters, plus 10 possible overtimes.
    for period in range(1, 15):
        url = self.lineup_pattern % (self.date, self.date, period, self.season)
        yield scrapy.Request(url, callback=self.parse_lineups)
# Although the lineup data comes from a JSON API, we can still integrate it into our crawler.
def parse_lineups(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    headers = dict([(i, str(j.lower())) for i, j in
                    enumerate(jsonresponse['resultSets'][0]['headers'])])
    for row in jsonresponse['resultSets'][0]['rowSet']:
        item = Lineup()
        item['date'] = self.scrape_date
        item['period'] = int(re.search(r'Period=(\d+)', response.url).group(1))
        for index, value in enumerate(row):
            field = headers[index]
            item[field] = value
        yield item
Within the time frame of this project, I didn't get as far as putting the lineup data together with the play-by-play data, but the basic idea would be to simulate each quarter starting from each of the lineups used in that quarter, keeping the candidate starting lineup that produces no inconsistencies with the observed substitutions.
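A minimal sketch of that reconciliation idea, assuming each quarter has already been reduced to its candidate five-man lineups and an ordered list of substitutions (the function and its inputs are hypothetical helpers, not part of the scraper):

def find_starting_lineup(candidate_lineups, substitutions):
    # candidate_lineups: iterable of 5-player frozensets seen in this quarter.
    # substitutions: ordered (player_out, player_in) pairs from the play-by-play.
    for lineup in candidate_lineups:
        on_floor = set(lineup)
        consistent = True
        for player_out, player_in in substitutions:
            # A substitution is only consistent if the outgoing player is on
            # the floor and the incoming player is not.
            if player_out not in on_floor or player_in in on_floor:
                consistent = False
                break
            on_floor.remove(player_out)
            on_floor.add(player_in)
        if consistent:
            return lineup
    return None  # no candidate explains the observed substitutions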
Putting it all Together
spiders/nba_spider.py
import scrapy
import re
import time
import json

from scraping.items import PlayByPlay, GameEvent, Lineup

# This is the API for play-by-play...
# http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500513&RangeType=2&Season=2015-16&SeasonType=Regular+Season&StartPeriod=1&StartRange=0
class NbaSpider(scrapy.Spider):
    name = "nba"
    allowed_domains = ["nba.com"]
    lineup_pattern = 'http://stats.nba.com/stats/leaguedashlineups?Conference=&DateFrom=%s&DateTo=%s&Division=&GameID=&GameSegment=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=%d&PlusMinus=N&Rank=N&Season=%s&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&TeamID=0&VsConference=&VsDivision='

    def __init__(self, scrape_date=None, *args, **kwargs):
        super(NbaSpider, self).__init__(*args, **kwargs)
        if scrape_date is None:
            # Default to yesterday (subtracting a day of seconds handles
            # month boundaries correctly).
            scrape_date = time.strftime('%Y%m%d', time.localtime(time.time() - 86400))
        match = re.search(r'(\d{4})(\d{2})(\d{2})', scrape_date)
        year = int(match.group(1))
        month = int(match.group(2))
        day = int(match.group(3))
        # URL-encoded MM/DD/YYYY for the stats API (%2F is an escaped '/').
        self.date = '%02d%%2F%02d%%2F%04d' % (month, day, year)
        # Season label, e.g. '2015-16'; seasons roll over in August.
        self.season = '%04d-%02d' % ((year, (year + 1) % 100) if month > 7 else (year - 1, year % 100))
        self.scrape_date = scrape_date
        self.start_urls = ['http://www.nba.com/gameline/%s/' % scrape_date]
    def parse(self, response):
        for href in (response.css("a.recapAnc::attr('href')") +
                     response.css("div.nbaFnlMnRecapDiv > a::attr('href')")):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_game_recap)
        for period in range(1, 15):
            url = self.lineup_pattern % (self.date, self.date, period, self.season)
            yield scrapy.Request(url, callback=self.parse_lineups)

    def parse_game_recap(self, response):
        away = None
        home = None
        quarter = None
        date = re.search(r'(\d+)', response.url).group(1)
        game_id = re.search('([A-Z]+)', response.url).group(1)
        pbp_item = PlayByPlay()
        for index, row in enumerate(response.xpath('//div[@id="nbaGIPBP"]//tr')):
            if int(row.xpath('@class="nbaGIPBPTeams"').extract_first()) == 1:
                (away, home) = [x.strip() for x in row.xpath('td/text()').extract()]
            else:
                pbp_item['quarter'] = quarter
                pbp_item['game_id'] = game_id
                pbp_item['index'] = index
                pbp_item['date'] = date
                for field in row.xpath('td'):
                    field_class = str(field.xpath('@class').extract_first())
                    if field_class == 'nbaGIPbPTblHdr':
                        name = row.xpath('td/a/@name')
                        if len(name) > 0:
                            quarter = name.extract_first()
                            pbp_item['quarter'] = quarter
                    elif len(field.xpath('@id')) > 0:
                        event_item = GameEvent()
                        event_item['type'] = field.xpath('@id').extract_first()
                        event_item['text'] = field.xpath('div/text()').extract_first()
                        event_item['quarter'] = quarter
                        event_item['game_id'] = game_id
                        event_item['date'] = date
                        event_item['index'] = index
                        yield event_item
                    else:
                        text = field.xpath('text()').extract_first().strip()
                        if len(text) == 0:
                            continue
                        elif field_class in ('nbaGIPbPLft', 'nbaGIPbPLftScore'):
                            pbp_item['team'] = away
                            pbp_item['text'] = text
                        elif field_class in ('nbaGIPbPRgt', 'nbaGIPbPRgtScore'):
                            pbp_item['team'] = home
                            pbp_item['text'] = text
                        elif field_class == 'nbaGIPbPMid':
                            pbp_item['clock'] = text
                        elif field_class == 'nbaGIPbPMidScore':
                            pbp_item['clock'] = text
                            pbp_item['score'] = field.xpath('text()').extract()[1].strip()
                        else:
                            raise ValueError("Unknown class: %s" % field_class)
                if 'clock' in pbp_item:
                    yield pbp_item
                    pbp_item = PlayByPlay()

    def parse_lineups(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        headers = dict([(i, str(j.lower())) for i, j in
                        enumerate(jsonresponse['resultSets'][0]['headers'])])
        for row in jsonresponse['resultSets'][0]['rowSet']:
            item = Lineup()
            item['date'] = self.scrape_date
            item['period'] = int(re.search(r'Period=(\d+)', response.url).group(1))
            for index, value in enumerate(row):
                field = headers[index]
                item[field] = value
            yield item
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class PlayByPlay(scrapy.Item):
    game_id = scrapy.Field()
    quarter = scrapy.Field()
    period = scrapy.Field()
    clock = scrapy.Field()
    score = scrapy.Field()
    team = scrapy.Field()
    text = scrapy.Field()
    index = scrapy.Field()
    date = scrapy.Field()
    events = scrapy.Field()
    seconds = scrapy.Field()
    team_abbreviation = scrapy.Field()

    def index_fields(self):
        return {
            'game_id': self['game_id'],
            'index': self['index'],
            'quarter': self['quarter'],
            'date': self['date']
        }


class GameEvent(scrapy.Item):
    type = scrapy.Field()
    text = scrapy.Field()
    quarter = scrapy.Field()
    period = scrapy.Field()
    game_id = scrapy.Field()
    index = scrapy.Field()
    date = scrapy.Field()
    events = scrapy.Field()
    clock = scrapy.Field()
    seconds = scrapy.Field()
    team_abbreviation = scrapy.Field()

    def index_fields(self):
        return {
            'game_id': self['game_id'],
            'index': self['index'],
            'quarter': self['quarter'],
            'date': self['date']
        }


class Lineup(scrapy.Item):
    group_set = scrapy.Field()
    group_id = scrapy.Field()
    group_name = scrapy.Field()
    team_id = scrapy.Field()
    team_abbreviation = scrapy.Field()
    gp = scrapy.Field()
    w = scrapy.Field()
    l = scrapy.Field()
    w_pct = scrapy.Field()
    min = scrapy.Field()
    fgm = scrapy.Field()
    fga = scrapy.Field()
    fg_pct = scrapy.Field()
    fg3m = scrapy.Field()
    fg3a = scrapy.Field()
    fg3_pct = scrapy.Field()
    ftm = scrapy.Field()
    fta = scrapy.Field()
    ft_pct = scrapy.Field()
    oreb = scrapy.Field()
    dreb = scrapy.Field()
    reb = scrapy.Field()
    ast = scrapy.Field()
    tov = scrapy.Field()
    stl = scrapy.Field()
    blk = scrapy.Field()
    blka = scrapy.Field()
    pf = scrapy.Field()
    pfd = scrapy.Field()
    pts = scrapy.Field()
    plus_minus = scrapy.Field()
    period = scrapy.Field()
    date = scrapy.Field()

    def index_fields(self):
        return {
            'group_id': self['group_id'],
            'team_id': self['team_id'],
            'date': self['date'],
            'period': self['period']
        }
pipelines.py
# -*- coding: utf-8 -*-
import pymongo
import re

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ScrapingPipeline(object):
    def process_item(self, item, spider):
        return item


class QuarterProcessor(object):
    # Convert the scraped quarter label (e.g. 'Q3' or 'OT2') into a numeric period.
    def process_item(self, item, spider):
        if 'quarter' in item:
            m = re.match(r'(Q|OT|H)(\d+)', item['quarter'])
            if m.group(1) in ('Q', 'H'):
                item['period'] = int(m.group(2))
            elif m.group(1) == 'OT':
                item['period'] = int(m.group(2)) + 4
            else:
                raise ValueError("Can't process quarter: %s" % item['quarter'])
        return item


class ClockProcessor(object):
    # Convert the 'MM:SS' game clock into seconds.
    def process_item(self, item, spider):
        if 'clock' in item:
            (minutes, seconds) = item['clock'].split(':')
            item['seconds'] = float(minutes) * 60.0 + float(seconds)
        return item
class TextProcessor(object):
    SHOT_RE = re.compile(r'(.+?) (((Tip|Alley Oop|Cutting|Dunk|Pullup|Turnaround|Running|Driving|Hook|Jump|3pt|Layup|Fadeaway|Bank|No) ?)+) [Ss]hot: (Made|Missed)( )?')
    REBOUND_RE = re.compile(r'(.+?) Rebound ')
    TEAM_REBOUND_RE = re.compile(r'Team Rebound')
    DEFENSE_RE = re.compile(r'(Block|Steal): ?(.+?) ')
    ASSIST_RE = re.compile(r'Assist: (.+?) ')
    TIMEOUT_RE = re.compile(r'Team Timeout : (Short|Regular|No Timeout|Official)')
    TURNOVER_RE = re.compile(r'(.+?) Turnover : ((Out of Bounds|Poss)? ?(- )?(Punched Ball|5 Second|Out Of Bounds|Basket from Below|Illegal Screen|No|Swinging Elbows|Double Dribble|Illegal Assist|Inbound|Palming|Kicked Ball|Jump Ball|Lane|Backcourt|Offensive Goaltending|Discontinue Dribble|Lost Ball|Foul|Bad Pass|Traveling|Step Out of Bounds|3 Second|Offensive Foul|Player Out of Bounds)( Violation)?( Turnover)?) ')
    TEAM_TURNOVER_RE = re.compile(r'Team Turnover : ((8 Second Violation|5 Sec Inbound|Backcourt|Shot Clock|Offensive Goaltending|3 Second)( Violation)?( Turnover)?)')
    FOUL_RE = re.compile(r'(.+?) Foul: (Clear Path|Flagrant|Away From Play|Personal Take|Inbound|Loose Ball|Offensive|Offensive Charge|Personal|Shooting|Personal Block|Shooting Block|Defense 3 Second)( Type (\d+))? ( )? ')
    JUMP_RE = re.compile(r'Jump Ball (.+?) vs (.+)( )?')
    VIOLATION_RE = re.compile(r'(.+?) Violation:(Defensive Goaltending|Kicked Ball|Lane|Jump Ball|Double Lane)( )?')
    FREE_THROW_RE = re.compile(r'(.+?) Free Throw (Flagrant|Clear Path)? ?(\d) of (\d) (Missed)? ?()?')
    TECHNICAL_FT_RE = re.compile(r'(.+?) Free Throw Technical (Missed)? ?()?')
    SUB_RE = re.compile(r'(.+?) Substitution replaced by (.+?)$')
    TEAM_VIOLATION_RE = re.compile(r'Team Violation : (Delay Of Game) ')
    # NOTE: the next two patterns are reconstructions (the originals were lost);
    # based on usage below, group(1) must capture the game clock (e.g. '10:42')
    # and the team abbreviation (e.g. 'GSW') respectively.
    CLOCK_RE = re.compile(r'\((\d{1,2}:\d{2}(\.\d+)?)\)')
    TEAM_RE = re.compile(r'\[([A-Z]+)( \d+-\d+)?\]')
    TECHNICAL_RE = re.compile(r'(.+?) Technical (- )?([A-Z]+)? ?')
    DOUBLE_TECH_RE = re.compile(r'Double Technical - (.+?), (.+?) ')
    DOUBLE_FOUL_RE = re.compile(r'Foul : (Double Personal) - (.+?) , (.+?) ')
    EJECTION_RE = re.compile(r'(.+?) Ejection:(First Flagrant Type 2|Second Technical|Other)')

    # pts, tov, fta, pf, blk, reb, blka, ftm, fg3a, pfd, ast, fg3m, fgm, dreb, fga, stl, oreb
    def process_item(self, item, spider):
        text = item.get('text', None)
        if text:
            item['events'] = []
        while text:
            l = len(text)
            m = self.SHOT_RE.match(text)
            if m:
                event = {'player': m.group(1), 'fga': 1, 'type': m.group(2)}
                if '3pt' in m.group(2):
                    event['fg3a'] = 1
                    if m.group(5) == 'Made':
                        event['fg3m'] = 1
                        event['fgm'] = 1
                        event['pts'] = 3
                else:
                    if m.group(5) == 'Made':
                        event['fgm'] = 1
                        event['pts'] = 2
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.REBOUND_RE.match(text)
            if m:
                event = {'player': m.group(1), 'reb': 1}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.DEFENSE_RE.match(text)
            if m:
                event = {'player': m.group(2)}
                if m.group(1) == 'Block':
                    # Credit the blocked attempt to the shot event we just parsed.
                    item['events'][-1]['blka'] = 1
                    event['blk'] = 1
                else:
                    event['stl'] = 1
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.ASSIST_RE.match(text)
            if m:
                event = {'player': m.group(1), 'ast': 1}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TIMEOUT_RE.match(text)
            if m:
                event = {'timeout': m.group(1)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TURNOVER_RE.match(text)
            if m:
                event = {'player': m.group(1), 'tov': 1, 'note': m.group(2)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TEAM_TURNOVER_RE.match(text)
            if m:
                event = {'turnover': m.group(1)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TEAM_REBOUND_RE.match(text)
            if m:
                item['events'].append({'rebound': 'team'})
                text = text[m.end():].strip()
            # TODO: Are all of these actual personal fouls?
            m = self.FOUL_RE.match(text)
            if m:
                event = {'player': m.group(1), 'pf': 1, 'note': m.group(2)}
                if m.group(4):
                    event['type'] = m.group(4)
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.DOUBLE_FOUL_RE.match(text)
            if m:
                item['events'].append({'player': m.group(2), 'pf': 1, 'note': m.group(1), 'against': m.group(3)})
                item['events'].append({'player': m.group(3), 'pf': 1, 'note': m.group(1), 'against': m.group(2)})
                text = text[m.end():].strip()
            m = self.JUMP_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'jump': 'home'})
                item['events'].append({'player': m.group(2), 'jump': 'away'})
                if m.group(3):
                    item['events'].append({'player': m.group(4), 'jump': 'possession'})
                text = text[m.end():].strip()
            m = self.VIOLATION_RE.match(text)
            if m:
                event = {'player': m.group(1), 'violation': m.group(2)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.FREE_THROW_RE.match(text)
            if m:
                event = {'player': m.group(1), 'fta': 1, 'attempt': m.group(3), 'of': m.group(4)}
                if m.group(5) is None:
                    event['pts'] = 1
                    event['ftm'] = 1
                if m.group(2):
                    event['special'] = m.group(2)
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TECHNICAL_FT_RE.match(text)
            if m:
                event = {'player': m.group(1), 'fta': 1, 'special': 'Technical'}
                if m.group(2) is None:
                    event['pts'] = 1
                    event['ftm'] = 1
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.SUB_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'sub': 'out'})
                item['events'].append({'player': m.group(2), 'sub': 'in'})
                text = text[m.end():].strip()
            m = self.TEAM_VIOLATION_RE.match(text)
            if m:
                item['events'].append({'violation': m.group(1)})
                text = text[m.end():].strip()
            m = self.CLOCK_RE.match(text)
            if m:
                item['clock'] = m.group(1)
                text = text[m.end():].strip()
            m = self.TEAM_RE.match(text)
            if m:
                item['team_abbreviation'] = m.group(1)
                text = text[m.end():].strip()
            m = self.TECHNICAL_RE.match(text)
            if m:
                if m.group(3):
                    item['events'].append({'team': m.group(3), 'technical': m.group(1)})
                else:
                    item['events'].append({'player': m.group(1), 'technical': True})
                text = text[m.end():].strip()
            m = self.DOUBLE_TECH_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'technical': True})
                item['events'].append({'player': m.group(2), 'technical': True})
                text = text[m.end():].strip()
            m = self.EJECTION_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'ejection': True, 'note': m.group(2)})
                text = text[m.end():].strip()
            if len(text) == l:
                raise ValueError('Could not parse text: %s' % text)
            if len(text) == 0:
                text = None
        return item
# TODO: figure out offensive/defensive rebounds... we need to know teams for that
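# A hypothetical helper sketching the TODO above: once each rebound event also
# carries team information, classification is simple, offensive when the
# rebounder's team matches the shooting team and defensive otherwise.
def classify_rebound(shooting_team, rebounding_team):
    return 'oreb' if rebounding_team == shooting_team else 'dreb'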
class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[item.__class__.__name__].replace_one(item.index_fields(), dict(item), True)
        return item
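Finally, a small driver script backfills an entire season by invoking the spider once for every date: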
#!/usr/bin/env python
import sys
import os

season = int(sys.argv[1])
for year in (season, season + 1):
    months = range(9, 13) if season == year else range(1, 8)
    for month in months:
        for day in range(1, 32):
            os.system('scrapy crawl nba -a scrape_date=%04d%02d%02d' % (year, month, day))
Next Steps
Moving forward, I'll probably switch from scraping the play-by-play pages to using the stats.nba.com API. I'm optimistic that much of the text-parsing code will still be applicable, though I have observed some differences between the API text and the text on the recap pages.
Once that switch is made, I'll need to integrate the play-by-play and lineup data. This will provide me with a data set where for every play I have both what happened and who was on the floor (offense and defense). This opens up a lot of possibilities.
The ultimate goal is to predict the probabilities of various outcomes for a given lineup. However, this data can also be used to answer a lot of other questions. For example, a recent ESPN article looked at the impact of exhaustion on team performance; with this data set, we can investigate that at the lineup level, seeing how a lineup's performance changes with the minutes its players have logged.
About Author
Tom Walsh
Tom Walsh (M.Sc. Computer Science, University of Toronto) developed a desire to get deeper into the data while leading a team of developers at BSports building Scouting Information Systems for Major League Baseball teams. A course on Basketball...