Scraping NBA Play-by-Play Data with Scrapy & MongoDB
In my previous projects I worked with data on NBA lineups from stats.nba.com, first exploring some of the relationships between player performance and lineup performance, and then building an interactive tool to allow for further exploration. For this project, I wanted to get more granular and work with NBA play-by-play data. Scraping the data turned out to be fairly trivial (although it did take about half a week to scrape one season), but transforming it into a useful state was a challenge.
Schedules
The 3rd tab of each NBA Game Recap page contains the play-by-play for the game. However, an NBA season consists of 1,230 regular season games, so we need an automated method of finding the game pages. Ideally, we'd like to be able to scrape a single day at a time, as this lends itself to regular daily updates. The NBA has a daily schedule page with links to that day's game recaps, and the url pattern for a given date is easy to determine. So, our workflow will be to find the schedule page for a given date, extract the game recap links for each game, and then follow those links to scrape the play-by-play for each game.
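For example, the schedule for February 26, 2016 lives at http://www.nba.com/gameline/20160226/ — the same date-stamped pattern the spider's start_urls will use below.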
Scrapy
Initially, I chose scrapy mostly because it supports proper selectors (both CSS and XPath) for navigating HTML documents. However, the design of the framework also lends itself to executing our intended workflow efficiently: scrapy lets us queue up additional pages for scraping and fetches them in parallel, so as we parse the schedule page, we can queue up each game recap page for scraping.
Our initial parse method is quite simple:
def parse(self, response):
    for href in response.css("a.recapAnc::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_game_recap)
Scrapy uses python generators to yield objects to the framework for further processing. In this case, we're finding each game recap link within the response and yielding a scrapy.Request, telling scrapy to scrape that link using the specified callback.
Parsing the game recap is a bit more complicated:
def parse_game_recap(self, response):
    away = None
    home = None
    quarter = None
    # There's some useful information in the url, so we extract it.
    # This probably should have been a single regex, but it doesn't matter much.
    game_id = re.search('([A-Z]+)', response.url).group(1)
    pbp_item = PlayByPlay()  # We'll see scrapy Items shortly.
    # Find the play by play table and iterate its rows
    for index, row in enumerate(response.xpath('//div[@id="nbaGIPBP"]//tr')):
        # If we get a row with team names, extract them.
        if int(row.xpath('@class="nbaGIPBPTeams"').extract_first()) == 1:
            (away, home) = [x.strip() for x in row.xpath('td/text()').extract()]
        else:
            # otherwise, build up the PlayByPlay item with the data in the row.
            pbp_item['quarter'] = quarter
            pbp_item['game_id'] = game_id
            pbp_item['index'] = index
            for field in row.xpath('td'):
                field_class = str(field.xpath('@class').extract_first())
                if field_class == 'nbaGIPbPTblHdr':
                    name = row.xpath('td/a/@name')
                    if len(name) > 0:
                        quarter = row.xpath('td/a/@name').extract_first()
                        pbp_item['quarter'] = quarter
                elif len(field.xpath('@id')) > 0:
                    # Sometimes we'll have rows that don't fit the structure of the
                    # PlayByPlay item. We store them in a GameEvent item.
                    event_item = GameEvent()
                    event_item['type'] = field.xpath('@id').extract_first()
                    event_item['text'] = field.xpath('div/text()').extract_first()
                    event_item['quarter'] = quarter
                    event_item['game_id'] = game_id
                    event_item['index'] = index
                    # We can yield Items too, for processing by scrapy's pipelines,
                    # which we'll learn about later.
                    yield event_item
                else:
                    text = field.xpath('text()').extract_first().strip()
                    if len(text) == 0:
                        continue
                    else:
                        if field_class == 'nbaGIPbPLft' or field_class == 'nbaGIPbPLftScore':
                            pbp_item['team'] = away
                            pbp_item['text'] = text
                        elif field_class == 'nbaGIPbPRgt' or field_class == 'nbaGIPbPRgtScore':
                            pbp_item['team'] = home
                            pbp_item['text'] = text
                        elif field_class == 'nbaGIPbPMid':
                            pbp_item['clock'] = text
                        elif field_class == 'nbaGIPbPMidScore':
                            pbp_item['clock'] = text
                            pbp_item['score'] = field.xpath('text()').extract()[1].strip()
                        else:
                            raise ValueError("Unknown class: %s" % field_class)
            if 'clock' in pbp_item:
                # Yield the PlayByPlay item we've been working on and create a new one.
                yield pbp_item
                pbp_item = PlayByPlay()
We see here how a scrapy parse method can yield not just scrapy Request objects, but also Item objects.
Here is one of our basic scrapy items at this stage:
class PlayByPlay(scrapy.Item):
    game_id = scrapy.Field()
    quarter = scrapy.Field()
    period = scrapy.Field()
    clock = scrapy.Field()
    score = scrapy.Field()
    team = scrapy.Field()
    text = scrapy.Field()
    index = scrapy.Field()
Dates
We still haven't told scrapy which page to parse. Let's do that now. Here's how we initialize our Spider:
import scrapy
import re
import datetime

from scraping.items import PlayByPlay, GameEvent


class NbaSpider(scrapy.Spider):
    name = "nba"
    allowed_domains = ["nba.com"]

    # __init__ allows us to specify custom arguments that can be passed to scrapy
    # with the -a option, in this case 'scrape_date'
    def __init__(self, scrape_date=None, *args, **kwargs):
        super(NbaSpider, self).__init__(*args, **kwargs)
        # if no scrape_date is specified, default to yesterday
        if scrape_date is None:
            yesterday = datetime.date.today() - datetime.timedelta(days=1)
            scrape_date = yesterday.strftime('%Y%m%d')
        # Here's where we define the starting URL
        self.start_urls = ['http://www.nba.com/gameline/%s/' % scrape_date]

    def parse(self, response):
        ...
Now we can scrape a day of data like this: scrapy crawl nba -a scrape_date=20160226
Pipelines
Our basic scraper/crawler can now pull down the play-by-play for a given date, but we can't yet do anything with it. Scrapy's item pipelines allow us to work with our data. First, we'll store it somewhere.
MongoDB
MongoDB is a schema-less NoSQL database with an easy-to-use, JavaScript-based query syntax. It lends itself to situations where we wish to engage in open-ended exploration of the data. It also saved me all the work of creating schemas for my database.
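As a rough illustration, a quick exploratory query from Python via pymongo might look something like the sketch below. It assumes the MONGO_URI and MONGO_DATABASE values from the settings shown later (one collection per Item class), and the date value is hypothetical, since the exact stored string depends on what the recap URLs contain.

# a quick look at the scraped plays, assuming database 'nba' at localhost:27017
import pymongo

client = pymongo.MongoClient('localhost:27017')
db = client['nba']

# each Item class gets its own collection, named after the class
for play in db['PlayByPlay'].find({'date': '20160226', 'text': {'$regex': 'Steal'}}).limit(5):
    print('%s %s: %s' % (play.get('clock'), play.get('team'), play['text']))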
My MongoDB pipeline is very similar to the example in the scrapy item pipeline documentation, except that since our application has multiple Item types, we select our MongoDB collection based upon the class name. I've also elected to replace existing documents in the case of duplicates. To identify duplicates, we've added an index_fields method to each of our Item types.
import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert=True: replace any existing document matching index_fields(), otherwise insert
        self.db[item.__class__.__name__].replace_one(item.index_fields(), dict(item), True)
        return item
All of our Item types have index_fields methods. This one is from PlayByPlay:
def index_fields(self):
    return {
        'game_id': self['game_id'],
        'index': self['index'],
        'quarter': self['quarter'],
        'date': self['date']
    }
We need to configure our MongoPipeline to be invoked on each Item:
ITEM_PIPELINES = {
    'scraping.pipelines.MongoPipeline': 300
}
Parsing Play-by-Play Data
Now comes the tough part. We need to parse the play-by-play strings to extract the underlying data. Here are some sample strings:
"Harden Driving Layup Shot: Missed Block: Faried (2 BLK)", "Ellis Running Layup Shot: Made (19 PTS)", "Vucevic Layup Shot: Missed Block: Withey (2 BLK)", "Holiday 3pt Shot: Made (10 PTS) Assist: Gordon (1 AST)", "Kaman Foul: Offensive (2 PF) (S Foster)", "Parsons 3pt Shot: Made (7 PTS) Assist: Nowitzki (1 AST)", "McLemore Turnover : Out of Bounds - Bad Pass Turnover (1 TO)", "Okafor Turnaround Jump Shot: Missed Block: Adams (3 BLK)", "Carroll Driving Floating Bank Jump Shot: Made (7 PTS)", "Kaman Turnover : Foul (3 TO)", "Millsap Turnover : Lost Ball (4 TO) Steal:Johnson (2 ST)", "Williams Foul: Personal (1 PF) (B Adams)", "Faried Dunk Shot: Made (10 PTS) Assist: Nelson (2 AST)", "Young Layup Shot: Made (12 PTS) Assist: Jack (8 AST)", "Withey Dunk Shot: Made (8 PTS) Assist: Neto (1 AST)", "Holiday Pullup Jump shot: Made (12 PTS)", "Mozgov Turnover : Lost Ball (1 TO) Steal:Calderon (2 ST)", "Clarkson 3pt Shot: Made (13 PTS) Assist: Russell (4 AST)", "Harden Step Back Jump shot: Made (15 PTS)", "McConnell Driving Reverse Layup Shot: Made (6 PTS)", "DeRozan Driving Reverse Layup Shot: Made (5 PTS) Assist: Lowry (4 AST)", "Afflalo Pullup Jump shot: Made (10 PTS) Assist: Calderon (2 AST)", "Hibbert Foul: Defense 3 Second (3 PF) (S Twardoski)", "Johnson Turnover : Bad Pass (1 TO) Steal:Butler (1 ST)", "Asik Turnover : Lost Ball (2 TO) Steal:Lowry (2 ST)", "Jump Ball Crowder vs Bazemore (Sullinger gains possession)"
To make sense of this, I used a disgusting mess of regular expressions:
class TextProcessor(object):
    # e.g. "Ellis Running Layup Shot: Made (19 PTS)"
    SHOT_RE = re.compile('(.+?) (((Tip|Alley Oop|Cutting|Dunk|Pullup|Turnaround|Running|Driving|Hook|Jump|3pt|Layup|Fadeaway|Bank|No) ?)+) [Ss]hot: (Made|Missed)( \(\d+ PTS\))?')
    # e.g. "Howard Rebound (Off:1 Def:5)"
    REBOUND_RE = re.compile('(.+?) Rebound \(Off:\d+ Def:\d+\)')
    TEAM_REBOUND_RE = re.compile('Team Rebound')
    # e.g. "Block: Faried (2 BLK)" or "Steal:Johnson (2 ST)"
    DEFENSE_RE = re.compile('(Block|Steal): ?(.+?) \(\d+ (BLK|ST)\)')
    # e.g. "Assist: Gordon (1 AST)"
    ASSIST_RE = re.compile('Assist: (.+?) \(\d+ AST\)')
    TIMEOUT_RE = re.compile('Team Timeout : (Short|Regular|No Timeout|Official)')
    # e.g. "McLemore Turnover : Out of Bounds - Bad Pass Turnover (1 TO)"
    TURNOVER_RE = re.compile('(.+?) Turnover : ((Out of Bounds|Poss)? ?(- )?(Punched Ball|5 Second|Out Of Bounds|Basket from Below|Illegal Screen|No|Swinging Elbows|Double Dribble|Illegal Assist|Inbound|Palming|Kicked Ball|Jump Ball|Lane|Backcourt|Offensive Goaltending|Discontinue Dribble|Lost Ball|Foul|Bad Pass|Traveling|Step Out of Bounds|3 Second|Offensive Foul|Player Out of Bounds)( Violation)?( Turnover)?) \(\d+ TO\)')
    TEAM_TURNOVER_RE = re.compile('Team Turnover : ((8 Second Violation|5 Sec Inbound|Backcourt|Shot Clock|Offensive Goaltending|3 Second)( Violation)?( Turnover)?)')
    # e.g. "Kaman Foul: Offensive (2 PF) (S Foster)"
    FOUL_RE = re.compile('(.+?) Foul: (Clear Path|Flagrant|Away From Play|Personal Take|Inbound|Loose Ball|Offensive|Offensive Charge|Personal|Shooting|Personal Block|Shooting Block|Defense 3 Second)( Type (\d+))? \(\d+ PF\)( \(.+\))?')
    # e.g. "Jump Ball Crowder vs Bazemore (Sullinger gains possession)"
    JUMP_RE = re.compile('Jump Ball (.+?) vs (.+?)( \((.+) gains possession\))?$')
    VIOLATION_RE = re.compile('(.+?) Violation:(Defensive Goaltending|Kicked Ball|Lane|Jump Ball|Double Lane)( \(.+\))?')
    FREE_THROW_RE = re.compile('(.+?) Free Throw (Flagrant|Clear Path)? ?(\d) of (\d) (Missed)? ?(\(\d+ PTS\))? ?')
    TECHNICAL_FT_RE = re.compile('(.+?) Free Throw Technical (Missed)? ?(\(\d+ PTS\))?')
    SUB_RE = re.compile('(.+?) Substitution replaced by (.+?)$')
    TEAM_VIOLATION_RE = re.compile('Team Violation : (Delay Of Game)( \(.+\))?')
    # assumed: a bare game-clock string such as "10:23"
    CLOCK_RE = re.compile('(\d{1,2}:\d{2}(\.\d+)?)$')
    # assumed: a bare team abbreviation such as "GSW"
    TEAM_RE = re.compile('([A-Z]{2,3})$')
    TECHNICAL_RE = re.compile('(.+?) Technical (- )?([A-Z]+)? ?( \(.+\))?')
    DOUBLE_TECH_RE = re.compile('Double Technical - (.+?), (.+?)( \(.+\))?$')
    DOUBLE_FOUL_RE = re.compile('Foul : (Double Personal) - (.+?) \(\d+ PF\), (.+?) \(\d+ PF\)')
    EJECTION_RE = re.compile('(.+?) Ejection:(First Flagrant Type 2|Second Technical|Other)')

    # pts, tov, fta, pf, blk, reb, blka, ftm, fg3a, pfd, ast, fg3m, fgm, dreb, fga, stl, oreb
    def process_item(self, item, spider):
        text = item.get('text', None)
        if text:
            item['events'] = []
            while text:
                l = len(text)
                m = self.SHOT_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'fga': 1, 'type': m.group(2)}
                    if '3pt' in m.group(2):
                        event['fg3a'] = 1
                        if m.group(5) == 'Made':
                            event['fg3m'] = 1
                            event['fgm'] = 1
                            event['pts'] = 3
                    else:
                        if m.group(5) == 'Made':
                            event['fgm'] = 1
                            event['pts'] = 2
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.REBOUND_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'reb': 1}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.DEFENSE_RE.match(text)
                if m:
                    event = {'player': m.group(2)}
                    if m.group(1) == 'Block':
                        item['events'][-1]['blka'] = 1
                        event['blk'] = 1
                    else:
                        event['stl'] = 1
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.ASSIST_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'ast': 1}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.TIMEOUT_RE.match(text)
                if m:
                    event = {'timeout': m.group(1)}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.TURNOVER_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'tov': 1, 'note': m.group(2)}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.TEAM_TURNOVER_RE.match(text)
                if m:
                    event = {'turnover': m.group(1)}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.TEAM_REBOUND_RE.match(text)
                if m:
                    item['events'].append({'rebound': 'team'})
                    text = text[m.end():].strip()
                m = self.FOUL_RE.match(text)  # TODO: Are all of these actual personal fouls?
                if m:
                    event = {'player': m.group(1), 'pf': 1, 'note': m.group(2)}
                    if m.group(4):
                        event['type'] = m.group(4)
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.DOUBLE_FOUL_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(2), 'pf': 1, 'note': m.group(1), 'against': m.group(3)})
                    item['events'].append({'player': m.group(3), 'pf': 1, 'note': m.group(1), 'against': m.group(2)})
                    text = text[m.end():].strip()
                m = self.JUMP_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'jump': 'home'})
                    item['events'].append({'player': m.group(2), 'jump': 'away'})
                    if m.group(3):
                        item['events'].append({'player': m.group(4), 'jump': 'possession'})
                    text = text[m.end():].strip()
                m = self.VIOLATION_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'violation': m.group(2)}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.FREE_THROW_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'fta': 1, 'attempt': m.group(3), 'of': m.group(4)}
                    if m.group(5) is None:
                        event['pts'] = 1
                        event['ftm'] = 1
                    if m.group(2):
                        event['special'] = m.group(2)
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.TECHNICAL_FT_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'fta': 1, 'special': 'Technical'}
                    if m.group(2) is None:
                        event['pts'] = 1
                        event['ftm'] = 1
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.SUB_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'sub': 'out'})
                    item['events'].append({'player': m.group(2), 'sub': 'in'})
                    text = text[m.end():].strip()
                m = self.TEAM_VIOLATION_RE.match(text)
                if m:
                    item['events'].append({'violation': m.group(1)})
                    text = text[m.end():].strip()
                m = self.CLOCK_RE.match(text)
                if m:
                    item['clock'] = m.group(1)
                    text = text[m.end():].strip()
                m = self.TEAM_RE.match(text)
                if m:
                    item['team_abbreviation'] = m.group(1)
                    text = text[m.end():].strip()
                m = self.TECHNICAL_RE.match(text)
                if m:
                    if m.group(3):
                        item['events'].append({'team': m.group(3), 'technical': m.group(1)})
                    else:
                        item['events'].append({'player': m.group(1), 'technical': True})
                    text = text[m.end():].strip()
                m = self.DOUBLE_TECH_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'technical': True})
                    item['events'].append({'player': m.group(2), 'technical': True})
                    text = text[m.end():].strip()
                m = self.EJECTION_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'ejection': True, 'note': m.group(2)})
                    text = text[m.end():].strip()
                if len(text) == l:
                    raise ValueError('Could not parse text: %s' % text)
                if len(text) == 0:
                    text = None
        return item
Problem: Who is Playing?
While the play-by-play data includes substitutions, it doesn't tell us who started each quarter. This means we don't know who was on the floor at any given point in time. However, by cross-referencing against the per-day, per-quarter lineup data, we should be able to figure this out.
First, we need to modify our Spider to fetch the lineup data:
def parse(self, response): for href in response.css("a.recapAnc::attr('href')") + response.css("div.nbaFnlMnRecapDiv > a::attr('href')"): url = response.urljoin(href.extract()) yield scrapy.Request(url, callback=self.parse_game_recap) # Create Requests for lineup data for 4 quarters, plus 10 possible overtimes for period in range(1,15): url = self.lineup_pattern % (self.date, self.date, period, self.season) yield scrapy.Request(url, callback=self.parse_lineups) # Although the lineup data is a json API, we can still integrate it into our crawler def parse_lineups(self, response): jsonresponse = json.loads(response.body_as_unicode()) headers = dict([(i, str(j.lower())) for i, j in enumerate(jsonresponse['resultSets'][0]['headers'])]) for row in jsonresponse['resultSets'][0]['rowSet']: item = Lineup() item['date'] = self.scrape_date item['period'] = int(re.search('Period=(\d+)', response.url).group(1)) for index, value in enumerate(row): field = headers[index] item[field] = value yield item
Within the time-frame of this project, I didn't get as far as putting the lineup data together with the play-by-play data, but the basic idea would be to simulate each quarter starting with each of the lineups used in that quarter, and find the starting lineup that results in no inconsistencies in the data.
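I didn't build this piece, but a minimal sketch of that simulation might look something like the following (the function and variable names are hypothetical, and a real implementation would also need to reconcile player-name formats between the two data sources):

# try each five-man lineup observed in the quarter as the starting five, and keep
# the ones that survive the quarter's substitutions without contradiction
def plausible_starters(observed_lineups, sub_events):
    """observed_lineups: list of 5-player frozensets from the Lineup data.
    sub_events: ordered (player_out, player_in) pairs from the play-by-play."""
    plausible = []
    for start in observed_lineups:
        on_floor = set(start)
        ok = True
        for player_out, player_in in sub_events:
            if player_out not in on_floor or player_in in on_floor:
                ok = False  # this starting lineup is inconsistent with the subs
                break
            on_floor.remove(player_out)
            on_floor.add(player_in)
        if ok:
            plausible.append(start)
    return plausible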
Putting it all Together
spiders/nba_spider.py
import scrapy
import re
import datetime
import json

from scraping.items import PlayByPlay, GameEvent, Lineup

# This is the API for play-by-play...
# http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500513&RangeType=2&Season=2015-16&SeasonType=Regular+Season&StartPeriod=1&StartRange=0


class NbaSpider(scrapy.Spider):
    name = "nba"
    allowed_domains = ["nba.com"]
    lineup_pattern = 'http://stats.nba.com/stats/leaguedashlineups?Conference=&DateFrom=%s&DateTo=%s&Division=&GameID=&GameSegment=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=%d&PlusMinus=N&Rank=N&Season=%s&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&TeamID=0&VsConference=&VsDivision='

    def __init__(self, scrape_date=None, *args, **kwargs):
        super(NbaSpider, self).__init__(*args, **kwargs)
        if scrape_date is None:
            # default to yesterday
            yesterday = datetime.date.today() - datetime.timedelta(days=1)
            scrape_date = yesterday.strftime('%Y%m%d')
        match = re.search('(\d{4})(\d{2})(\d{2})', scrape_date)
        year = int(match.group(1))
        month = int(match.group(2))
        day = int(match.group(3))
        self.date = '%02d%%2F%02d%%2F%04d' % (month, day, year)
        self.season = '%04d-%02d' % ((year, (year + 1) % 100) if month > 7 else (year - 1, year % 100))
        self.scrape_date = scrape_date
        self.start_urls = ['http://www.nba.com/gameline/%s/' % scrape_date]

    def parse(self, response):
        for href in response.css("a.recapAnc::attr('href')") + response.css("div.nbaFnlMnRecapDiv > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_game_recap)
        for period in range(1, 15):
            url = self.lineup_pattern % (self.date, self.date, period, self.season)
            yield scrapy.Request(url, callback=self.parse_lineups)

    def parse_game_recap(self, response):
        away = None
        home = None
        quarter = None
        date = re.search('(\d+)', response.url).group(1)
        game_id = re.search('([A-Z]+)', response.url).group(1)
        pbp_item = PlayByPlay()
        for index, row in enumerate(response.xpath('//div[@id="nbaGIPBP"]//tr')):
            if int(row.xpath('@class="nbaGIPBPTeams"').extract_first()) == 1:
                (away, home) = [x.strip() for x in row.xpath('td/text()').extract()]
            else:
                pbp_item['quarter'] = quarter
                pbp_item['game_id'] = game_id
                pbp_item['index'] = index
                pbp_item['date'] = date
                for field in row.xpath('td'):
                    field_class = str(field.xpath('@class').extract_first())
                    if field_class == 'nbaGIPbPTblHdr':
                        name = row.xpath('td/a/@name')
                        if len(name) > 0:
                            quarter = row.xpath('td/a/@name').extract_first()
                            pbp_item['quarter'] = quarter
                    elif len(field.xpath('@id')) > 0:
                        event_item = GameEvent()
                        event_item['type'] = field.xpath('@id').extract_first()
                        event_item['text'] = field.xpath('div/text()').extract_first()
                        event_item['quarter'] = quarter
                        event_item['game_id'] = game_id
                        event_item['date'] = date
                        event_item['index'] = index
                        yield event_item
                    else:
                        text = field.xpath('text()').extract_first().strip()
                        if len(text) == 0:
                            continue
                        else:
                            if field_class == 'nbaGIPbPLft' or field_class == 'nbaGIPbPLftScore':
                                pbp_item['team'] = away
                                pbp_item['text'] = text
                            elif field_class == 'nbaGIPbPRgt' or field_class == 'nbaGIPbPRgtScore':
                                pbp_item['team'] = home
                                pbp_item['text'] = text
                            elif field_class == 'nbaGIPbPMid':
                                pbp_item['clock'] = text
                            elif field_class == 'nbaGIPbPMidScore':
                                pbp_item['clock'] = text
                                pbp_item['score'] = field.xpath('text()').extract()[1].strip()
                            else:
                                raise ValueError("Unknown class: %s" % field_class)
                if 'clock' in pbp_item:
                    yield pbp_item
                    pbp_item = PlayByPlay()

    def parse_lineups(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        headers = dict([(i, str(j.lower())) for i, j in enumerate(jsonresponse['resultSets'][0]['headers'])])
        for row in jsonresponse['resultSets'][0]['rowSet']:
            item = Lineup()
            item['date'] = self.scrape_date
            item['period'] = int(re.search('Period=(\d+)', response.url).group(1))
            for index, value in enumerate(row):
                field = headers[index]
                item[field] = value
            yield item
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class PlayByPlay(scrapy.Item):
    game_id = scrapy.Field()
    quarter = scrapy.Field()
    period = scrapy.Field()
    clock = scrapy.Field()
    score = scrapy.Field()
    team = scrapy.Field()
    text = scrapy.Field()
    index = scrapy.Field()
    date = scrapy.Field()
    events = scrapy.Field()
    seconds = scrapy.Field()
    team_abbreviation = scrapy.Field()

    def index_fields(self):
        return {
            'game_id': self['game_id'],
            'index': self['index'],
            'quarter': self['quarter'],
            'date': self['date']
        }


class GameEvent(scrapy.Item):
    type = scrapy.Field()
    text = scrapy.Field()
    quarter = scrapy.Field()
    period = scrapy.Field()
    game_id = scrapy.Field()
    index = scrapy.Field()
    date = scrapy.Field()
    events = scrapy.Field()
    clock = scrapy.Field()
    seconds = scrapy.Field()
    team_abbreviation = scrapy.Field()

    def index_fields(self):
        return {
            'game_id': self['game_id'],
            'index': self['index'],
            'quarter': self['quarter'],
            'date': self['date']
        }


class Lineup(scrapy.Item):
    group_set = scrapy.Field()
    group_id = scrapy.Field()
    group_name = scrapy.Field()
    team_id = scrapy.Field()
    team_abbreviation = scrapy.Field()
    gp = scrapy.Field()
    w = scrapy.Field()
    l = scrapy.Field()
    w_pct = scrapy.Field()
    min = scrapy.Field()
    fgm = scrapy.Field()
    fga = scrapy.Field()
    fg_pct = scrapy.Field()
    fg3m = scrapy.Field()
    fg3a = scrapy.Field()
    fg3_pct = scrapy.Field()
    ftm = scrapy.Field()
    fta = scrapy.Field()
    ft_pct = scrapy.Field()
    oreb = scrapy.Field()
    dreb = scrapy.Field()
    reb = scrapy.Field()
    ast = scrapy.Field()
    tov = scrapy.Field()
    stl = scrapy.Field()
    blk = scrapy.Field()
    blka = scrapy.Field()
    pf = scrapy.Field()
    pfd = scrapy.Field()
    pts = scrapy.Field()
    plus_minus = scrapy.Field()
    period = scrapy.Field()
    date = scrapy.Field()

    def index_fields(self):
        return {
            'group_id': self['group_id'],
            'team_id': self['team_id'],
            'date': self['date'],
            'period': self['period']
        }
pipelines.py
# -*- coding: utf-8 -*-
import pymongo
import re

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class ScrapingPipeline(object):
    def process_item(self, item, spider):
        return item


class QuarterProcessor(object):
    def process_item(self, item, spider):
        if 'quarter' in item:
            m = re.match('(Q|OT|H)(\d+)', item['quarter'])
            if m.group(1) in ('Q', 'H'):
                item['period'] = int(m.group(2))
            elif m.group(1) == 'OT':
                item['period'] = int(m.group(2)) + 4
            else:
                raise ValueError("Can't process quarter: %s" % item['quarter'])
        return item


class ClockProcessor(object):
    def process_item(self, item, spider):
        if 'clock' in item:
            (minutes, seconds) = item['clock'].split(':')
            item['seconds'] = float(minutes) * 60.0 + float(seconds)
        return item


class TextProcessor(object):
    # e.g. "Ellis Running Layup Shot: Made (19 PTS)"
    SHOT_RE = re.compile('(.+?) (((Tip|Alley Oop|Cutting|Dunk|Pullup|Turnaround|Running|Driving|Hook|Jump|3pt|Layup|Fadeaway|Bank|No) ?)+) [Ss]hot: (Made|Missed)( \(\d+ PTS\))?')
    # e.g. "Howard Rebound (Off:1 Def:5)"
    REBOUND_RE = re.compile('(.+?) Rebound \(Off:\d+ Def:\d+\)')
    TEAM_REBOUND_RE = re.compile('Team Rebound')
    # e.g. "Block: Faried (2 BLK)" or "Steal:Johnson (2 ST)"
    DEFENSE_RE = re.compile('(Block|Steal): ?(.+?) \(\d+ (BLK|ST)\)')
    # e.g. "Assist: Gordon (1 AST)"
    ASSIST_RE = re.compile('Assist: (.+?) \(\d+ AST\)')
    TIMEOUT_RE = re.compile('Team Timeout : (Short|Regular|No Timeout|Official)')
    # e.g. "McLemore Turnover : Out of Bounds - Bad Pass Turnover (1 TO)"
    TURNOVER_RE = re.compile('(.+?) Turnover : ((Out of Bounds|Poss)? ?(- )?(Punched Ball|5 Second|Out Of Bounds|Basket from Below|Illegal Screen|No|Swinging Elbows|Double Dribble|Illegal Assist|Inbound|Palming|Kicked Ball|Jump Ball|Lane|Backcourt|Offensive Goaltending|Discontinue Dribble|Lost Ball|Foul|Bad Pass|Traveling|Step Out of Bounds|3 Second|Offensive Foul|Player Out of Bounds)( Violation)?( Turnover)?) \(\d+ TO\)')
    TEAM_TURNOVER_RE = re.compile('Team Turnover : ((8 Second Violation|5 Sec Inbound|Backcourt|Shot Clock|Offensive Goaltending|3 Second)( Violation)?( Turnover)?)')
    # e.g. "Kaman Foul: Offensive (2 PF) (S Foster)"
    FOUL_RE = re.compile('(.+?) Foul: (Clear Path|Flagrant|Away From Play|Personal Take|Inbound|Loose Ball|Offensive|Offensive Charge|Personal|Shooting|Personal Block|Shooting Block|Defense 3 Second)( Type (\d+))? \(\d+ PF\)( \(.+\))?')
    # e.g. "Jump Ball Crowder vs Bazemore (Sullinger gains possession)"
    JUMP_RE = re.compile('Jump Ball (.+?) vs (.+?)( \((.+) gains possession\))?$')
    VIOLATION_RE = re.compile('(.+?) Violation:(Defensive Goaltending|Kicked Ball|Lane|Jump Ball|Double Lane)( \(.+\))?')
    FREE_THROW_RE = re.compile('(.+?) Free Throw (Flagrant|Clear Path)? ?(\d) of (\d) (Missed)? ?(\(\d+ PTS\))? ?')
    TECHNICAL_FT_RE = re.compile('(.+?) Free Throw Technical (Missed)? ?(\(\d+ PTS\))?')
    SUB_RE = re.compile('(.+?) Substitution replaced by (.+?)$')
    TEAM_VIOLATION_RE = re.compile('Team Violation : (Delay Of Game)( \(.+\))?')
    # assumed: a bare game-clock string such as "10:23"
    CLOCK_RE = re.compile('(\d{1,2}:\d{2}(\.\d+)?)$')
    # assumed: a bare team abbreviation such as "GSW"
    TEAM_RE = re.compile('([A-Z]{2,3})$')
    TECHNICAL_RE = re.compile('(.+?) Technical (- )?([A-Z]+)? ?( \(.+\))?')
    DOUBLE_TECH_RE = re.compile('Double Technical - (.+?), (.+?)( \(.+\))?$')
    DOUBLE_FOUL_RE = re.compile('Foul : (Double Personal) - (.+?) \(\d+ PF\), (.+?) \(\d+ PF\)')
    EJECTION_RE = re.compile('(.+?) Ejection:(First Flagrant Type 2|Second Technical|Other)')

    # pts, tov, fta, pf, blk, reb, blka, ftm, fg3a, pfd, ast, fg3m, fgm, dreb, fga, stl, oreb
    def process_item(self, item, spider):
        text = item.get('text', None)
        if text:
            item['events'] = []
            while text:
                l = len(text)
                m = self.SHOT_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'fga': 1, 'type': m.group(2)}
                    if '3pt' in m.group(2):
                        event['fg3a'] = 1
                        if m.group(5) == 'Made':
                            event['fg3m'] = 1
                            event['fgm'] = 1
                            event['pts'] = 3
                    else:
                        if m.group(5) == 'Made':
                            event['fgm'] = 1
                            event['pts'] = 2
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.REBOUND_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'reb': 1}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.DEFENSE_RE.match(text)
                if m:
                    event = {'player': m.group(2)}
                    if m.group(1) == 'Block':
                        item['events'][-1]['blka'] = 1
                        event['blk'] = 1
                    else:
                        event['stl'] = 1
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.ASSIST_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'ast': 1}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.TIMEOUT_RE.match(text)
                if m:
                    event = {'timeout': m.group(1)}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.TURNOVER_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'tov': 1, 'note': m.group(2)}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.TEAM_TURNOVER_RE.match(text)
                if m:
                    event = {'turnover': m.group(1)}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.TEAM_REBOUND_RE.match(text)
                if m:
                    item['events'].append({'rebound': 'team'})
                    text = text[m.end():].strip()
                m = self.FOUL_RE.match(text)  # TODO: Are all of these actual personal fouls?
                if m:
                    event = {'player': m.group(1), 'pf': 1, 'note': m.group(2)}
                    if m.group(4):
                        event['type'] = m.group(4)
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.DOUBLE_FOUL_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(2), 'pf': 1, 'note': m.group(1), 'against': m.group(3)})
                    item['events'].append({'player': m.group(3), 'pf': 1, 'note': m.group(1), 'against': m.group(2)})
                    text = text[m.end():].strip()
                m = self.JUMP_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'jump': 'home'})
                    item['events'].append({'player': m.group(2), 'jump': 'away'})
                    if m.group(3):
                        item['events'].append({'player': m.group(4), 'jump': 'possession'})
                    text = text[m.end():].strip()
                m = self.VIOLATION_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'violation': m.group(2)}
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.FREE_THROW_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'fta': 1, 'attempt': m.group(3), 'of': m.group(4)}
                    if m.group(5) is None:
                        event['pts'] = 1
                        event['ftm'] = 1
                    if m.group(2):
                        event['special'] = m.group(2)
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.TECHNICAL_FT_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'fta': 1, 'special': 'Technical'}
                    if m.group(2) is None:
                        event['pts'] = 1
                        event['ftm'] = 1
                    item['events'].append(event)
                    text = text[m.end():].strip()
                m = self.SUB_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'sub': 'out'})
                    item['events'].append({'player': m.group(2), 'sub': 'in'})
                    text = text[m.end():].strip()
                m = self.TEAM_VIOLATION_RE.match(text)
                if m:
                    item['events'].append({'violation': m.group(1)})
                    text = text[m.end():].strip()
                m = self.CLOCK_RE.match(text)
                if m:
                    item['clock'] = m.group(1)
                    text = text[m.end():].strip()
                m = self.TEAM_RE.match(text)
                if m:
                    item['team_abbreviation'] = m.group(1)
                    text = text[m.end():].strip()
                m = self.TECHNICAL_RE.match(text)
                if m:
                    if m.group(3):
                        item['events'].append({'team': m.group(3), 'technical': m.group(1)})
                    else:
                        item['events'].append({'player': m.group(1), 'technical': True})
                    text = text[m.end():].strip()
                m = self.DOUBLE_TECH_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'technical': True})
                    item['events'].append({'player': m.group(2), 'technical': True})
                    text = text[m.end():].strip()
                m = self.EJECTION_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'ejection': True, 'note': m.group(2)})
                    text = text[m.end():].strip()
                if len(text) == l:
                    raise ValueError('Could not parse text: %s' % text)
                if len(text) == 0:
                    text = None
        return item


# TODO: figure out offensive/defensive rebounds... we need to know teams for that
class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert=True: replace any existing document matching index_fields(), otherwise insert
        self.db[item.__class__.__name__].replace_one(item.index_fields(), dict(item), True)
        return item
settings.py
BOT_NAME = 'scraping'

SPIDER_MODULES = ['scraping.spiders']
NEWSPIDER_MODULE = 'scraping.spiders'

MONGO_URI = 'localhost:27017'
MONGO_DATABASE = 'nba'

ITEM_PIPELINES = {
    'scraping.pipelines.QuarterProcessor': 100,
    'scraping.pipelines.ClockProcessor': 102,
    'scraping.pipelines.TextProcessor': 101,
    'scraping.pipelines.MongoPipeline': 300
}
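The numbers are priorities: scrapy runs each item through the pipelines in ascending order, so here that's QuarterProcessor, then TextProcessor, then ClockProcessor, and finally MongoPipeline.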
scrape_season.py
#!/usr/bin/env python
import sys
import os

season = int(sys.argv[1])
for year in (season, season + 1):
    months = range(9, 13) if season == year else range(1, 8)
    for month in months:
        for day in range(1, 32):
            os.system('scrapy crawl nba -a scrape_date=%04d%02d%02d' % (year, month, day))
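Running, say, python scrape_season.py 2015 then crawls every calendar date from September 2015 through July 2016 for the 2015-16 season; dates with no games (or non-existent dates like February 30th) should simply produce nothing to follow.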
Next Steps
Moving forward, I'll probably switch from scraping the play-by-play pages to using the API. I'm optimistic that much of the text-parsing code will still be applicable, although I have observed some differences between the API text and the text on the recap pages.
Once that switch is made, I'll need to integrate the play-by-play and lineup data. This will provide me with a data set where for every play I have both what happened and who was on the floor (offense and defense). This opens up a lot of possibilities.
The ultimate goal is to predict the probabilities of various outcomes for a given lineup. However, this data can also be used to answer a lot of other questions. For example, a recent ESPN article looked at the impact of exhaustion on team performance. With this data set, we can investigate this at the lineup level, seeing how lineup-level performance is affected by minutes played.