Scraping NBA Play-by-Play Data with Scrapy & MongoDB

Tom Walsh
Posted on Feb 29, 2016

In my previous projects I worked with data on NBA lineups from stats.nba.com, first exploring some of the relationships between player performance and lineup performance, and then building an interactive tool to allow for further exploration. For this project, I wished to get more granular, working with NBA play-by-play data. Scraping the data turned out to be fairly trivial (although it did take about half a week to scrape one season), but it was a challenge to transform it into a useful state.

Schedules

The 3rd tab of each NBA Game Recap page contains the play-by-play for the game. However, an NBA season consists of 1,230 regular season games, so we need an automated method of finding the game pages. Ideally, we'd like to be able to scrape a single day at a time, as this lends itself to regular daily updates. The NBA has a daily schedule page with links to that day's game recaps, and the URL pattern for a given date is easy to determine. So, our workflow will be to find the schedule page for a given date, extract the game recap link for each game, and then follow those links to scrape the play-by-play for each game.
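
As a rough sketch of the first step, the schedule URL for a given date can be built with a small helper. The URL pattern is the one the spider uses later in this post; the helper name schedule_url is made up for illustration.

```python
from datetime import date

# URL pattern of the daily schedule page (the same pattern the spider uses).
SCHEDULE_PATTERN = 'http://www.nba.com/gameline/%s/'

def schedule_url(day):
    """Return the schedule page URL for a datetime.date (hypothetical helper)."""
    return SCHEDULE_PATTERN % day.strftime('%Y%m%d')

print(schedule_url(date(2016, 2, 26)))
```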

Scrapy

Initially, I chose to use scrapy mostly because it supports proper selectors (both CSS and XPath) for navigating HTML documents. However, the design of the framework also lends itself to efficiently executing our intended workflow. This is because scrapy allows us to queue up additional pages for scraping, and will scrape those pages in parallel, so as we parse the schedule page, we can queue up each game recap page for scraping.

Our initial parse method is quite simple:

def parse(self, response):
    for href in response.css("a.recapAnc::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_game_recap)

Scrapy uses python generators to yield objects to the framework for further processing. In this case, we're finding each game recap link within the response and yielding a scrapy.Request, telling scrapy to scrape that link using the specified callback.
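
To make the yield-based flow concrete, here's a toy, single-threaded version of the idea. It is not how scrapy is implemented (scrapy schedules requests concurrently); the Request class, the crawl function and the fake pages below are all made up for illustration.

```python
from collections import deque

class Request(object):
    """Toy stand-in for scrapy.Request: a URL plus the callback that parses it."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

def parse_schedule(page):
    # Yield one Request per (made-up) game link found on the schedule page.
    for link in page['links']:
        yield Request(link, parse_game)

def parse_game(page):
    # Yield a plain dict as the "item" scraped from a game page.
    yield {'game': page['url']}

def crawl(start_url, callback, fetch):
    """Drain a queue of Requests, collecting items and queueing new Requests."""
    queue = deque([Request(start_url, callback)])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(fetch(request.url)):
            if isinstance(result, Request):
                queue.append(result)   # more pages to scrape
            else:
                items.append(result)   # scraped data
    return items

# Fake "web": the schedule page links to two game pages.
PAGES = {
    'schedule': {'url': 'schedule', 'links': ['game1', 'game2']},
    'game1': {'url': 'game1', 'links': []},
    'game2': {'url': 'game2', 'links': []},
}

items = crawl('schedule', parse_schedule, PAGES.get)
print(items)
```

The parse functions never call each other directly; they only yield, and the loop decides what happens to each yielded object.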

Parsing the game recap is a bit more complicated:

def parse_game_recap(self, response):
    away = None
    home = None
    quarter = None
    # There's some useful information in the url, so we extract it.
    # This probably should have been a single regex, but it doesn't matter much.
    game_id = re.search('([A-Z]+)', response.url).group(1)
    pbp_item = PlayByPlay() # We'll see scrapy Items shortly.
        
    # Find the play by play table and iterate its rows
    for index, row in enumerate(response.xpath('//div[@id="nbaGIPBP"]//tr')):
        # If we get a row with team names, extract them.
        if int(row.xpath('@class="nbaGIPBPTeams"').extract_first()) == 1:
            (away, home) = [x.strip() for x in row.xpath('td/text()').extract()]
        else:
            # otherwise, build up the PlayByPlay item with the data in the row.
            pbp_item['quarter'] = quarter
            pbp_item['game_id'] = game_id
            pbp_item['index'] = index
            for field in row.xpath('td'):
                field_class = str(field.xpath('@class').extract_first())
                if field_class == 'nbaGIPbPTblHdr':
                    name = row.xpath('td/a/@name')
                    if len(name) > 0:
                        quarter = row.xpath('td/a/@name').extract_first()
                        pbp_item['quarter'] = quarter
                elif len(field.xpath('@id')) > 0:
                    # Sometimes we'll have rows that don't fit the structure of the
                    # PlayByPlay item.  We store them in a GameEvent item.
                    event_item = GameEvent()
                    event_item['type'] = field.xpath('@id').extract_first()
                    event_item['text'] = field.xpath('div/text()').extract_first()
                    event_item['quarter'] = quarter
                    event_item['game_id'] = game_id
                    event_item['index'] = index
                    # We can also yield items, for processing by scrapy's pipelines,
                    # which we'll learn about later.
                    yield event_item
                else:
                    text = field.xpath('text()').extract_first().strip()
                    if len(text) == 0:
                        continue
                    else:
                        if field_class == 'nbaGIPbPLft' or field_class == 'nbaGIPbPLftScore':
                            pbp_item['team'] = away
                            pbp_item['text'] = text
                        elif field_class == 'nbaGIPbPRgt' or field_class == 'nbaGIPbPRgtScore':
                            pbp_item['team'] = home
                            pbp_item['text'] = text
                        elif field_class == 'nbaGIPbPMid':
                            pbp_item['clock'] = text
                        elif field_class == 'nbaGIPbPMidScore':
                            pbp_item['clock'] = text
                            pbp_item['score'] = field.xpath('text()').extract()[1].strip()
                        else:
                            raise ValueError("Unknown class: %s" % field_class)
            if 'clock' in pbp_item:
                # Yield the PlayByPlay item we've been working on and create a new one.
                yield pbp_item
                pbp_item = PlayByPlay()

We see here how a scrapy parse method can return not just scrapy Request objects, but also Item objects.

Here is one of our basic scrapy items at this stage:

class PlayByPlay(scrapy.Item):
    game_id = scrapy.Field()
    quarter = scrapy.Field()
    period = scrapy.Field()
    clock = scrapy.Field()
    score = scrapy.Field()
    team = scrapy.Field()
    text = scrapy.Field()
    index = scrapy.Field()

Dates

We still haven't told scrapy which page to parse. Let's do that now. Here's how we initialize our Spider:

import scrapy
import re
import time

from scraping.items import PlayByPlay, GameEvent

class NbaSpider(scrapy.Spider):
    name = "nba"
    allowed_domains = ["nba.com"]

    # __init__ allows us to specify custom arguments that can be passed to scrapy with the -a option
    # in this case, 'scrape_date'
    def __init__(self, scrape_date=None, *args, **kwargs):
        super(NbaSpider, self).__init__(*args, **kwargs)

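        # if no scrape_date is specified, default to yesterday
        # (subtracting 1 from the YYYYMMDD integer would break on the 1st of a month)
        if scrape_date is None:
            scrape_date = time.strftime('%Y%m%d', time.localtime(time.time() - 24 * 60 * 60))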

        # Here's where we define the starting URL
        self.start_urls = ['http://www.nba.com/gameline/%s/' % scrape_date]

    def parse(self, response):
       ...

Now we can scrape a day of data like this: scrapy crawl nba -a scrape_date=20160226
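
Computing "yesterday" needs real date arithmetic, since subtracting 1 from a number like 20160301 does not give a valid date. Here's a minimal sketch using the standard datetime module; the helper name default_scrape_date is an assumption, not part of the spider.

```python
from datetime import date, timedelta

def default_scrape_date(today=None):
    """Return yesterday's date as a YYYYMMDD string (hypothetical helper)."""
    if today is None:
        today = date.today()
    return (today - timedelta(days=1)).strftime('%Y%m%d')

# The day before March 1st, 2016 is February 29th (2016 is a leap year).
print(default_scrape_date(date(2016, 3, 1)))  # 20160229
```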

Pipelines

Our basic scraper/crawler can now pull down the play-by-play for a given date, but we can't yet do anything with it. Scrapy's pipelines allow us to work with our data. First, we'll store it somewhere.

MongoDB

MongoDB is a schema-less NoSQL database with an easy-to-use JavaScript-based query syntax. It lends itself to situations where we wish to engage in open-ended exploration of the data. It also saved me all the work of creating schemas for my database.

My MongoDB pipeline is very similar to the example in the Scrapy documentation, except that, since our application has multiple Item types, we select the MongoDB collection based on the item's class name. I've also elected to replace existing documents in the case of duplicates. To identify duplicates, I've added an index_fields method to each of our Item types.

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[item.__class__.__name__].replace_one(item.index_fields(), dict(item), True)
        return item

All of our Item types have index_fields methods. This one is from PlayByPlay:

def index_fields(self):
    return {
        'game_id': self['game_id'],
        'index': self['index'],
        'quarter': self['quarter'],
        'date': self['date']
     }
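
To illustrate how the index fields drive the replace-or-insert behaviour, here's an in-memory stand-in for a collection. It only mimics the filter / replacement / upsert arguments of pymongo's replace_one; the FakeCollection class and the sample values (including the game id) are made up.

```python
class FakeCollection(object):
    """In-memory stand-in for a MongoDB collection (illustration only)."""
    def __init__(self):
        self.docs = []

    def replace_one(self, query, replacement, upsert=False):
        # Replace the first document whose fields match the query...
        for i, doc in enumerate(self.docs):
            if all(doc.get(k) == v for k, v in query.items()):
                self.docs[i] = dict(replacement)
                return
        # ...or insert the replacement when nothing matched and upsert is set.
        if upsert:
            self.docs.append(dict(replacement))

plays = FakeCollection()
key = {'game_id': 'CHAATL', 'index': 12, 'quarter': 'Q1', 'date': '20160226'}

# Scraping the same play twice leaves a single, updated document.
plays.replace_one(key, dict(key, text='first scrape'), True)
plays.replace_one(key, dict(key, text='second scrape'), True)
print(len(plays.docs), plays.docs[0]['text'])
```

This is why re-running the scraper for a date that was already scraped doesn't pile up duplicate documents.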

We need to configure our MongoPipeline to be invoked on each Item:

ITEM_PIPELINES = {
    'scraping.pipelines.MongoPipeline': 300
}

Parsing Play-by-Play Data

Now comes the tough part. We need to parse the play-by-play strings to extract the underlying data. Here are some sample strings:

   "Harden Driving Layup Shot: Missed Block: Faried (2 BLK)",
    "Ellis Running Layup Shot: Made (19 PTS)",
    "Vucevic Layup Shot: Missed Block: Withey (2 BLK)",
    "Holiday 3pt Shot: Made (10 PTS) Assist: Gordon (1 AST)",
    "Kaman Foul: Offensive (2 PF) (S Foster)",
    "Parsons 3pt Shot: Made (7 PTS) Assist: Nowitzki (1 AST)",
    "McLemore Turnover : Out of Bounds - Bad Pass Turnover (1 TO)",
    "Okafor Turnaround Jump Shot: Missed Block: Adams (3 BLK)",
    "Carroll Driving Floating Bank Jump Shot: Made (7 PTS)",
    "Kaman Turnover : Foul (3 TO)",
    "Millsap Turnover : Lost Ball (4 TO) Steal:Johnson (2 ST)",
    "Williams Foul: Personal (1 PF) (B Adams)",
    "Faried Dunk Shot: Made (10 PTS) Assist: Nelson (2 AST)",
    "Young Layup Shot: Made (12 PTS) Assist: Jack (8 AST)",
    "Withey Dunk Shot: Made (8 PTS) Assist: Neto (1 AST)",
    "Holiday Pullup Jump shot: Made (12 PTS)",
    "Mozgov Turnover : Lost Ball (1 TO) Steal:Calderon (2 ST)",
    "Clarkson 3pt Shot: Made (13 PTS) Assist: Russell (4 AST)",
    "Harden Step Back Jump shot: Made (15 PTS)",
    "McConnell Driving Reverse Layup Shot: Made (6 PTS)",
    "DeRozan Driving Reverse Layup Shot: Made (5 PTS) Assist: Lowry (4 AST)",
    "Afflalo Pullup Jump shot: Made (10 PTS) Assist: Calderon (2 AST)",
    "Hibbert Foul: Defense 3 Second (3 PF) (S Twardoski)",
    "Johnson Turnover : Bad Pass (1 TO) Steal:Butler (1 ST)",
    "Asik Turnover : Lost Ball (2 TO) Steal:Lowry (2 ST)",
    "Jump Ball Crowder vs Bazemore (Sullinger gains possession)"

To make sense of this, I used a disgusting mess of regular expressions:

class TextProcessor(object):
    SHOT_RE = re.compile('(.+?) (((Tip|Alley Oop|Cutting|Dunk|Pullup|Turnaround|Running|Driving|Hook|Jump|3pt|Layup|Fadeaway|Bank|No) ?)+) [Ss]hot: (Made|Missed)( )?')
    REBOUND_RE = re.compile('(.+?) Rebound ')
    TEAM_REBOUND_RE = re.compile('Team Rebound')
    DEFENSE_RE = re.compile('(Block|Steal): ?(.+?) ')
    ASSIST_RE = re.compile('Assist: (.+?) ')
    TIMEOUT_RE = re.compile('Team Timeout : (Short|Regular|No Timeout|Official)')
    TURNOVER_RE = re.compile('(.+?) Turnover : ((Out of Bounds|Poss)? ?(- )?(Punched Ball|5 Second|Out Of Bounds|Basket from Below|Illegal Screen|No|Swinging Elbows|Double Dribble|Illegal Assist|Inbound|Palming|Kicked Ball|Jump Ball|Lane|Backcourt|Offensive Goaltending|Discontinue Dribble|Lost Ball|Foul|Bad Pass|Traveling|Step Out of Bounds|3 Second|Offensive Foul|Player Out of Bounds)( Violation)?( Turnover)?) ')
    TEAM_TURNOVER_RE = re.compile('Team Turnover : ((8 Second Violation|5 Sec Inbound|Backcourt|Shot Clock|Offensive Goaltending|3 Second)( Violation)?( Turnover)?)')
    FOUL_RE = re.compile('(.+?) Foul: (Clear Path|Flagrant|Away From Play|Personal Take|Inbound|Loose Ball|Offensive|Offensive Charge|Personal|Shooting|Personal Block|Shooting Block|Defense 3 Second)( Type (\d+))? ( )? ')
    JUMP_RE = re.compile('Jump Ball (.+?) vs (.+)( )?')
    VIOLATION_RE = re.compile('(.+?) Violation:(Defensive Goaltending|Kicked Ball|Lane|Jump Ball|Double Lane)( )?')
    FREE_THROW_RE = re.compile('(.+?) Free Throw (Flagrant|Clear Path)? ?(\d) of (\d) (Missed)? ?()?')
    TECHNICAL_FT_RE = re.compile('(.+?) Free Throw Technical (Missed)? ?()?')
    SUB_RE = re.compile('(.+?) Substitution replaced by (.+?)$')
    TEAM_VIOLATION_RE = re.compile('Team Violation : (Delay Of Game) ')
    CLOCK_RE = re.compile('')  # (pattern missing) should capture the clock as group 1
    TEAM_RE = re.compile('')  # (pattern missing) should capture the team abbreviation as group 1
    TECHNICAL_RE = re.compile('(.+?) Technical (- )?([A-Z]+)? ?')
    DOUBLE_TECH_RE = re.compile('Double Technical - (.+?), (.+?) ')
    DOUBLE_FOUL_RE = re.compile('Foul : (Double Personal) - (.+?) , (.+?) ')
    EJECTION_RE = re.compile('(.+?) Ejection:(First Flagrant Type 2|Second Technical|Other)')

    # pts, tov, fta, pf, blk, reb, blka, ftm, fg3a, pfd, ast, fg3m, fgm, dreb, fga, stl, oreb

    def process_item(self, item, spider):
        text = item.get('text', None)
        if text:
            item['events'] = []
        # Repeatedly strip recognised events off the front of the text.
        while text:
            l = len(text)
            m = self.SHOT_RE.match(text)
            if m:
                event = {'player': m.group(1), 'fga': 1, 'type': m.group(2)}
                if '3pt' in m.group(2):
                    event['fg3a'] = 1
                    if m.group(5) == 'Made':
                        event['fg3m'] = 1
                        event['fgm'] = 1
                        event['pts'] = 3
                else:
                    if m.group(5) == 'Made':
                        # A made 2-pointer is not a made 3-pointer, so no fg3m here.
                        event['fgm'] = 1
                        event['pts'] = 2
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.REBOUND_RE.match(text)
            if m:
                event = {'player': m.group(1), 'reb': 1}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.DEFENSE_RE.match(text)
            if m:
                event = {'player': m.group(2)}
                if m.group(1) == 'Block':
                    # The previous event is the shot that was blocked.
                    item['events'][-1]['blka'] = 1
                    event['blk'] = 1
                else:
                    event['stl'] = 1
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.ASSIST_RE.match(text)
            if m:
                event = {'player': m.group(1), 'ast': 1}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TIMEOUT_RE.match(text)
            if m:
                event = {'timeout': m.group(1)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TURNOVER_RE.match(text)
            if m:
                event = {'player': m.group(1), 'tov': 1, 'note': m.group(2)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TEAM_TURNOVER_RE.match(text)
            if m:
                event = {'turnover': m.group(1)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TEAM_REBOUND_RE.match(text)
            if m:
                item['events'].append({'rebound': 'team'})
                text = text[m.end():].strip()
            m = self.FOUL_RE.match(text)
            # TODO: Are all of these actual personal fouls?
            if m:
                event = {'player': m.group(1), 'pf': 1, 'note': m.group(2)}
                if m.group(4):
                    event['type'] = m.group(4)
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.DOUBLE_FOUL_RE.match(text)
            if m:
                item['events'].append({'player': m.group(2), 'pf': 1, 'note': m.group(1), 'against': m.group(3)})
                item['events'].append({'player': m.group(3), 'pf': 1, 'note': m.group(1), 'against': m.group(2)})
                text = text[m.end():].strip()
            m = self.JUMP_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'jump': 'home'})
                item['events'].append({'player': m.group(2), 'jump': 'away'})
                if m.group(3):
                    item['events'].append({'player': m.group(4), 'jump': 'possession'})
                text = text[m.end():].strip()
            m = self.VIOLATION_RE.match(text)
            if m:
                event = {'player': m.group(1), 'violation': m.group(2)}
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.FREE_THROW_RE.match(text)
            if m:
                event = {'player': m.group(1), 'fta': 1, 'attempt': m.group(3), 'of': m.group(4)}
                if m.group(5) is None:
                    event['pts'] = 1
                    event['ftm'] = 1
                if m.group(2):
                    event['special'] = m.group(2)
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.TECHNICAL_FT_RE.match(text)
            if m:
                # Only count the free throw as made when it was not missed.
                event = {'player': m.group(1), 'fta': 1, 'special': 'Technical'}
                if m.group(2) is None:
                    event['pts'] = 1
                    event['ftm'] = 1
                item['events'].append(event)
                text = text[m.end():].strip()
            m = self.SUB_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'sub': 'out'})
                item['events'].append({'player': m.group(2), 'sub': 'in'})
                text = text[m.end():].strip()
            m = self.TEAM_VIOLATION_RE.match(text)
            if m:
                item['events'].append({'violation': m.group(1)})
                text = text[m.end():].strip()
            m = self.CLOCK_RE.match(text)
            if m:
                item['clock'] = m.group(1)
                text = text[m.end():].strip()
            m = self.TEAM_RE.match(text)
            if m:
                item['team_abbreviation'] = m.group(1)
                text = text[m.end():].strip()
            m = self.TECHNICAL_RE.match(text)
            if m:
                if m.group(3):
                    item['events'].append({'team': m.group(3), 'technical': m.group(1)})
                else:
                    item['events'].append({'player': m.group(1), 'technical': True})
                text = text[m.end():].strip()
            m = self.DOUBLE_TECH_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'technical': True})
                item['events'].append({'player': m.group(2), 'technical': True})
                text = text[m.end():].strip()
            m = self.EJECTION_RE.match(text)
            if m:
                item['events'].append({'player': m.group(1), 'ejection': True, 'note': m.group(2)})
                text = text[m.end():].strip()
            # If no pattern consumed anything, we'd loop forever, so fail loudly.
            if len(text) == l:
                raise ValueError('Could not parse text: %s' % text)
            if len(text) == 0:
                text = None
        return item
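
To see the consume-from-the-front approach in isolation, here's a stripped-down sketch that only understands shots, assists and blocks. The three patterns are simplified stand-ins written for this example, not the ones from TextProcessor, and parse_play is a made-up name.

```python
import re

# Simplified patterns covering only made/missed shots, assists and blocks.
SHOT_RE = re.compile(r'(\w+) (.+?) [Ss]hot: (Made|Missed)(?: \(\d+ PTS\))?')
ASSIST_RE = re.compile(r'Assist: (\w+) \(\d+ AST\)')
BLOCK_RE = re.compile(r'Block: (\w+) \(\d+ BLK\)')

def parse_play(text):
    """Consume the text from the front, emitting one event per match."""
    events = []
    while text:
        for kind, pattern in (('shot', SHOT_RE), ('ast', ASSIST_RE), ('blk', BLOCK_RE)):
            m = pattern.match(text)
            if m:
                break
        else:
            # No pattern matched the front of the remaining text.
            raise ValueError('Could not parse text: %s' % text)
        if kind == 'shot':
            events.append({'player': m.group(1), 'type': m.group(2),
                           'made': m.group(3) == 'Made'})
        else:
            events.append({'player': m.group(1), kind: 1})
        # Drop what was just matched and continue with the rest.
        text = text[m.end():].strip()
    return events

print(parse_play("Holiday 3pt Shot: Made (10 PTS) Assist: Gordon (1 AST)"))
print(parse_play("Harden Driving Layup Shot: Missed Block: Faried (2 BLK)"))
```

Each string can describe several events, which is why the parser keeps matching until nothing is left rather than expecting one pattern per string.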

Problem: Who is Playing?

While the play-by-play data includes substitutions, it doesn't tell us who started each quarter. This means we don't know who was on the floor at any given point in time. However, by cross-referencing against the per-day, per-quarter lineup data, we should be able to figure this out.

First, we need to modify our Spider to fetch the lineup data:

def parse(self, response):
    for href in response.css("a.recapAnc::attr('href')") + response.css("div.nbaFnlMnRecapDiv > a::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_game_recap)
    # Create Requests for lineup data for 4 quarters, plus 10 possible overtimes
    for period in range(1,15):
        url = self.lineup_pattern % (self.date, self.date, period, self.season)
        yield scrapy.Request(url, callback=self.parse_lineups)

# Although the lineup data is a json API, we can still integrate it into our crawler
def parse_lineups(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    headers = dict([(i, str(j.lower())) for i, j in enumerate(jsonresponse['resultSets'][0]['headers'])])
    for row in jsonresponse['resultSets'][0]['rowSet']:
        item = Lineup()
        item['date'] = self.scrape_date
        item['period'] = int(re.search('Period=(\d+)', response.url).group(1))
        for index, value in enumerate(row):
            field = headers[index]
            item[field] = value
        yield item
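
The header/row pairing in parse_lineups can be tried out on its own. The payload below is made up (the ids, abbreviations and numbers are placeholders); only its shape, a result set with parallel headers and rowSet lists, follows the code above.

```python
import json

# Made-up payload in the same shape as the lineup response
# (one result set with parallel 'headers' and 'rowSet' lists).
payload = json.loads('''
{"resultSets": [{
    "headers": ["GROUP_ID", "TEAM_ABBREVIATION", "MIN", "PTS"],
    "rowSet": [["101 - 102 - 103 - 104 - 105", "AAA", 7.5, 18],
               ["201 - 202 - 203 - 204 - 205", "BBB", 7.5, 15]]
}]}
''')

def rows_as_dicts(result_set):
    """Pair each row's values with the lower-cased header names."""
    headers = [h.lower() for h in result_set['headers']]
    return [dict(zip(headers, row)) for row in result_set['rowSet']]

lineups = rows_as_dicts(payload['resultSets'][0])
print(lineups[0]['team_abbreviation'], lineups[0]['pts'])
```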

Within the time-frame of this project, I didn't get as far as putting the lineups data together with the play-by-play data, but the basic idea would be to simulate each quarter starting with each of the lineups used in that quarter, finding the starting lineup that results in no inconsistencies in the data.
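
Here's a minimal sketch of what that simulation could look like for a single team. It's an untested idea rather than working project code: the event tuples, function names and player letters are all invented, and real data would need extra care (for example, a benched player can still pick up a technical foul).

```python
def is_consistent(starting_five, events):
    """Replay a quarter's events and check they fit the starting lineup.

    Each event is ('play', player) or ('sub', player_out, player_in).
    """
    on_floor = set(starting_five)
    for event in events:
        if event[0] == 'play':
            if event[1] not in on_floor:
                return False      # someone acted while on the bench
        else:
            _, player_out, player_in = event
            if player_out not in on_floor or player_in in on_floor:
                return False      # impossible substitution
            on_floor.remove(player_out)
            on_floor.add(player_in)
    return True

def find_starters(candidate_lineups, events):
    """Return the candidate lineups that produce no inconsistencies."""
    return [lineup for lineup in candidate_lineups if is_consistent(lineup, events)]

# Made-up single-team example: players are just letters.
candidates = [('A', 'B', 'C', 'D', 'E'), ('A', 'B', 'C', 'D', 'F'), ('A', 'B', 'C', 'E', 'F')]
events = [('play', 'A'), ('play', 'E'), ('sub', 'E', 'F'), ('play', 'F')]
print(find_starters(candidates, events))
```

If more than one candidate survives, the quarter's events simply don't contain enough information to tell them apart.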

Putting it all Together

spiders/nba_spider.py

import scrapy
import re
import time
import json

from scraping.items import PlayByPlay, GameEvent, Lineup

# This is the API for play-by-play...
# http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500513&RangeType=2&Season=2015-16&SeasonType=Regular+Season&StartPeriod=1&StartRange=0

class NbaSpider(scrapy.Spider):
    name = "nba"
    allowed_domains = ["nba.com"]

    lineup_pattern = 'http://stats.nba.com/stats/leaguedashlineups?Conference=&DateFrom=%s&DateTo=%s&Division=&GameID=&GameSegment=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=%d&PlusMinus=N&Rank=N&Season=%s&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&TeamID=0&VsConference=&VsDivision='

    def __init__(self, scrape_date=None, *args, **kwargs):
        super(NbaSpider, self).__init__(*args, **kwargs)
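        if scrape_date is None:
            # Default to yesterday (integer subtraction would break on the 1st of a month).
            scrape_date = time.strftime('%Y%m%d', time.localtime(time.time() - 24 * 60 * 60))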
        match = re.search('(\d{4})(\d{2})(\d{2})', scrape_date)
        year = int(match.group(1))
        month = int(match.group(2))
        day = int(match.group(3))
        self.date = '%02d%%2F%02d%%2F%04d' % (month, day, year)
        self.season = '%04d-%02d' % ((year, (year+1) % 100) if month > 7 else (year-1, year % 100))
        self.scrape_date = scrape_date
        self.start_urls = ['http://www.nba.com/gameline/%s/' % scrape_date]

    def parse(self, response):
        for href in response.css("a.recapAnc::attr('href')") + response.css("div.nbaFnlMnRecapDiv > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_game_recap)
        for period in range(1,15):
            url = self.lineup_pattern % (self.date, self.date, period, self.season)
            yield scrapy.Request(url, callback=self.parse_lineups)


    def parse_game_recap(self, response):
        away = None
        home = None
        quarter = None
        date = re.search('(\d+)', response.url).group(1)
        game_id = re.search('([A-Z]+)', response.url).group(1)
        pbp_item = PlayByPlay()
        for index, row in enumerate(response.xpath('//div[@id="nbaGIPBP"]//tr')):
            if int(row.xpath('@class="nbaGIPBPTeams"').extract_first()) == 1:
                (away, home) = [x.strip() for x in row.xpath('td/text()').extract()]
            else:
                pbp_item['quarter'] = quarter
                pbp_item['game_id'] = game_id
                pbp_item['index'] = index
                pbp_item['date'] = date
                for field in row.xpath('td'):
                    field_class = str(field.xpath('@class').extract_first())
                    if field_class == 'nbaGIPbPTblHdr':
                        name = row.xpath('td/a/@name')
                        if len(name) > 0:
                            quarter = row.xpath('td/a/@name').extract_first()
                            pbp_item['quarter'] = quarter
                    elif len(field.xpath('@id')) > 0:
                        event_item = GameEvent()
                        event_item['type'] = field.xpath('@id').extract_first()
                        event_item['text'] = field.xpath('div/text()').extract_first()
                        event_item['quarter'] = quarter
                        event_item['game_id'] = game_id
                        event_item['date'] = date
                        event_item['index'] = index
                        yield event_item
                    else:
                        text = field.xpath('text()').extract_first().strip()
                        if len(text) == 0:
                            continue
                        else:
                            if field_class == 'nbaGIPbPLft' or field_class == 'nbaGIPbPLftScore':
                                pbp_item['team'] = away
                                pbp_item['text'] = text
                            elif field_class == 'nbaGIPbPRgt' or field_class == 'nbaGIPbPRgtScore':
                                pbp_item['team'] = home
                                pbp_item['text'] = text
                            elif field_class == 'nbaGIPbPMid':
                                pbp_item['clock'] = text
                            elif field_class == 'nbaGIPbPMidScore':
                                pbp_item['clock'] = text
                                pbp_item['score'] = field.xpath('text()').extract()[1].strip()
                            else:
                                raise ValueError("Unknown class: %s" % field_class)
                if 'clock' in pbp_item:
                    yield pbp_item
                    pbp_item = PlayByPlay()

    def parse_lineups(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        headers = dict([(i, str(j.lower())) for i, j in enumerate(jsonresponse['resultSets'][0]['headers'])])
        for row in jsonresponse['resultSets'][0]['rowSet']:
            item = Lineup()
            item['date'] = self.scrape_date
            item['period'] = int(re.search('Period=(\d+)', response.url).group(1))
            for index, value in enumerate(row):
                field = headers[index]
                item[field] = value
            yield item
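
The season and date strings built in __init__ are easy to get subtly wrong, so here they are pulled out as standalone functions that can be checked by hand. The function names are made up; the formatting rules are the ones in the spider above.

```python
def season_for(year, month):
    """Return the season label (e.g. '2015-16') containing a given month.

    Mirrors the spider's rule: months after July belong to the season that
    starts that year, earlier months to the season that started the year before.
    """
    start = year if month > 7 else year - 1
    return '%04d-%02d' % (start, (start + 1) % 100)

def api_date(year, month, day):
    """Format a date as MM/DD/YYYY with the slashes percent-encoded."""
    return '%02d%%2F%02d%%2F%04d' % (month, day, year)

print(season_for(2016, 2))      # 2015-16
print(season_for(2015, 11))     # 2015-16
print(api_date(2016, 2, 26))    # 02%2F26%2F2016
```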

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class PlayByPlay(scrapy.Item):
    game_id = scrapy.Field()
    quarter = scrapy.Field()
    period = scrapy.Field()
    clock = scrapy.Field()
    score = scrapy.Field()
    team = scrapy.Field()
    text = scrapy.Field()
    index = scrapy.Field()
    date = scrapy.Field()
    events = scrapy.Field()
    seconds = scrapy.Field()
    team_abbreviation = scrapy.Field()

    def index_fields(self):
        return {
            'game_id': self['game_id'],
            'index': self['index'],
            'quarter': self['quarter'],
            'date': self['date']
         }


class GameEvent(scrapy.Item):
    type = scrapy.Field()
    text = scrapy.Field()
    quarter = scrapy.Field()
    period = scrapy.Field()
    game_id = scrapy.Field()
    index = scrapy.Field()
    date = scrapy.Field()
    events = scrapy.Field()
    clock = scrapy.Field()
    seconds = scrapy.Field()
    team_abbreviation = scrapy.Field()

    def index_fields(self):
        return {
            'game_id': self['game_id'],
            'index': self['index'],
            'quarter': self['quarter'],
            'date': self['date']
         }


class Lineup(scrapy.Item):
    group_set = scrapy.Field()
    group_id = scrapy.Field()
    group_name = scrapy.Field()
    team_id = scrapy.Field()
    team_abbreviation = scrapy.Field()
    gp = scrapy.Field()
    w = scrapy.Field()
    l = scrapy.Field()
    w_pct = scrapy.Field()
    min = scrapy.Field()
    fgm = scrapy.Field()
    fga = scrapy.Field()
    fg_pct = scrapy.Field()
    fg3m = scrapy.Field()
    fg3a = scrapy.Field()
    fg3_pct = scrapy.Field()
    ftm = scrapy.Field()
    fta = scrapy.Field()
    ft_pct = scrapy.Field()
    oreb = scrapy.Field()
    dreb = scrapy.Field()
    reb = scrapy.Field()
    ast = scrapy.Field()
    tov = scrapy.Field()
    stl = scrapy.Field()
    blk = scrapy.Field()
    blka = scrapy.Field()
    pf = scrapy.Field()
    pfd = scrapy.Field()
    pts = scrapy.Field()
    plus_minus = scrapy.Field()
    period = scrapy.Field()
    date = scrapy.Field()

    def index_fields(self):
        return {
            'group_id': self['group_id'],
            'team_id': self['team_id'],
            'date': self['date'],
            'period': self['period']
         }

pipelines.py

# -*- coding: utf-8 -*-

import pymongo
import re
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class ScrapingPipeline(object):
    def process_item(self, item, spider):
        return item

class QuarterProcessor(object):
    def process_item(self, item, spider):
        # Skip items without a quarter (the field may be present but still None).
        if item.get('quarter'):
            m = re.match('(Q|OT|H)(\d+)', item['quarter'])
            if m is None:
                raise ValueError("Can't process quarter: %s" % item['quarter'])
            if m.group(1) in ('Q', 'H'):
                item['period'] = int(m.group(2))
            else:
                # 'OT': overtime periods are numbered after the four quarters
                item['period'] = int(m.group(2)) + 4
        return item

class ClockProcessor(object):
    def process_item(self, item, spider):
        if 'clock' in item:
            (minutes, seconds) = item['clock'].split(':')
            item['seconds'] = float(minutes) * 60.0 + float(seconds)
        return item

class TextProcessor(object):
    SHOT_RE = re.compile('(.+?) (((Tip|Alley Oop|Cutting|Dunk|Pullup|Turnaround|Running|Driving|Hook|Jump|3pt|Layup|Fadeaway|Bank|No) ?)+) [Ss]hot: (Made|Missed)( )?')
    REBOUND_RE = re.compile('(.+?) Rebound ')
    TEAM_REBOUND_RE = re.compile('Team Rebound')
    DEFENSE_RE = re.compile('(Block|Steal): ?(.+?) ')
    ASSIST_RE = re.compile('Assist: (.+?) ')
    TIMEOUT_RE = re.compile('Team Timeout : (Short|Regular|No Timeout|Official)')
    TURNOVER_RE = re.compile('(.+?) Turnover : ((Out of Bounds|Poss)? ?(- )?(Punched Ball|5 Second|Out Of Bounds|Basket from Below|Illegal Screen|No|Swinging Elbows|Double Dribble|Illegal Assist|Inbound|Palming|Kicked Ball|Jump Ball|Lane|Backcourt|Offensive Goaltending|Discontinue Dribble|Lost Ball|Foul|Bad Pass|Traveling|Step Out of Bounds|3 Second|Offensive Foul|Player Out of Bounds)( Violation)?( Turnover)?) ')
    TEAM_TURNOVER_RE = re.compile('Team Turnover : ((8 Second Violation|5 Sec Inbound|Backcourt|Shot Clock|Offensive Goaltending|3 Second)( Violation)?( Turnover)?)')
    FOUL_RE = re.compile('(.+?) Foul: (Clear Path|Flagrant|Away From Play|Personal Take|Inbound|Loose Ball|Offensive|Offensive Charge|Personal|Shooting|Personal Block|Shooting Block|Defense 3 Second)( Type (\d+))? ( )? ')
    JUMP_RE = re.compile('Jump Ball (.+?) vs (.+)( )?')
    VIOLATION_RE = re.compile('(.+?) Violation:(Defensive Goaltending|Kicked Ball|Lane|Jump Ball|Double Lane)( )?')
    FREE_THROW_RE = re.compile('(.+?) Free Throw (Flagrant|Clear Path)? ?(\d) of (\d) (Missed)? ?()?')
    TECHNICAL_FT_RE = re.compile('(.+?) Free Throw Technical (Missed)? ?()?')
    SUB_RE = re.compile('(.+?) Substitution replaced by (.+?)$')
    TEAM_VIOLATION_RE = re.compile('Team Violation : (Delay Of Game) ')
    CLOCK_RE = re.compile('')
    TEAM_RE = re.compile('

')
    TECHNICAL_RE = re.compile('(.+?) Technical (- )?([A-Z]+)? ?')
    DOUBLE_TECH_RE = re.compile('Double Technical - (.+?), (.+?) ')
    DOUBLE_FOUL_RE = re.compile('Foul : (Double Personal) - (.+?) , (.+?) ')
    EJECTION_RE = re.compile('(.+?) Ejection:(First Flagrant Type 2|Second Technical|Other)')

    # pts, tov, fta, pf, blk, reb, blka, ftm, fg3a, pfd, ast, fg3m, fgm, dreb, fga, stl, oreb
    def process_item(self, item, spider):
        text = item.get('text', None)
        if text:
            item['events'] = []
            # Consume the text from the front, one recognized fragment at a time.
            while text:
                start_len = len(text)

                # Field goal attempts (made or missed)
                m = self.SHOT_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'fga': 1, 'type': m.group(2)}
                    if '3pt' in m.group(2):
                        event['fg3a'] = 1
                        if m.group(5) == 'Made':
                            event['fg3m'] = 1
                            event['fgm'] = 1
                            event['pts'] = 3
                    else:
                        if m.group(5) == 'Made':
                            event['fgm'] = 1
                            event['pts'] = 2
                    item['events'].append(event)
                    text = text[m.end():].strip()

                m = self.REBOUND_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'reb': 1}
                    item['events'].append(event)
                    text = text[m.end():].strip()

                # Blocks and steals; a block is also credited against the previous event
                m = self.DEFENSE_RE.match(text)
                if m:
                    event = {'player': m.group(2)}
                    if m.group(1) == 'Block':
                        item['events'][-1]['blka'] = 1
                        event['blk'] = 1
                    else:
                        event['stl'] = 1
                    item['events'].append(event)
                    text = text[m.end():].strip()

                m = self.ASSIST_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'ast': 1}
                    item['events'].append(event)
                    text = text[m.end():].strip()

                m = self.TIMEOUT_RE.match(text)
                if m:
                    event = {'timeout': m.group(1)}
                    item['events'].append(event)
                    text = text[m.end():].strip()

                m = self.TURNOVER_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'tov': 1, 'note': m.group(2)}
                    item['events'].append(event)
                    text = text[m.end():].strip()

                m = self.TEAM_TURNOVER_RE.match(text)
                if m:
                    event = {'turnover': m.group(1)}
                    item['events'].append(event)
                    text = text[m.end():].strip()

                m = self.TEAM_REBOUND_RE.match(text)
                if m:
                    item['events'].append({'rebound': 'team'})
                    text = text[m.end():].strip()

                m = self.FOUL_RE.match(text)
                # TODO: Are all of these actual personal fouls?
                if m:
                    event = {'player': m.group(1), 'pf': 1, 'note': m.group(2)}
                    if m.group(4):
                        event['type'] = m.group(4)
                    item['events'].append(event)
                    text = text[m.end():].strip()

                # A double foul produces one event per player
                m = self.DOUBLE_FOUL_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(2), 'pf': 1, 'note': m.group(1), 'against': m.group(3)})
                    item['events'].append({'player': m.group(3), 'pf': 1, 'note': m.group(1), 'against': m.group(2)})
                    text = text[m.end():].strip()

                m = self.JUMP_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'jump': 'home'})
                    item['events'].append({'player': m.group(2), 'jump': 'away'})
                    if m.group(3):
                        item['events'].append({'player': m.group(4), 'jump': 'possession'})
                    text = text[m.end():].strip()

                m = self.VIOLATION_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'violation': m.group(2)}
                    item['events'].append(event)
                    text = text[m.end():].strip()

                # Free throws count as made unless the "missed" group matched
                m = self.FREE_THROW_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'fta': 1, 'attempt': m.group(3), 'of': m.group(4)}
                    if m.group(5) is None:
                        event['pts'] = 1
                        event['ftm'] = 1
                    if m.group(2):
                        event['special'] = m.group(2)
                    item['events'].append(event)
                    text = text[m.end():].strip()

                m = self.TECHNICAL_FT_RE.match(text)
                if m:
                    event = {'player': m.group(1), 'fta': 1, 'special': 'Technical'}
                    if m.group(2) is None:
                        event['pts'] = 1
                        event['ftm'] = 1
                    item['events'].append(event)
                    text = text[m.end():].strip()

                m = self.SUB_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'sub': 'out'})
                    item['events'].append({'player': m.group(2), 'sub': 'in'})
                    text = text[m.end():].strip()

                m = self.TEAM_VIOLATION_RE.match(text)
                if m:
                    item['events'].append({'violation': m.group(1)})
                    text = text[m.end():].strip()

                m = self.CLOCK_RE.match(text)
                if m:
                    item['clock'] = m.group(1)
                    text = text[m.end():].strip()

                m = self.TEAM_RE.match(text)
                if m:
                    item['team_abbreviation'] = m.group(1)
                    text = text[m.end():].strip()

                # Group 3 is a team abbreviation; without it, the technical is on a player
                m = self.TECHNICAL_RE.match(text)
                if m:
                    if m.group(3):
                        item['events'].append({'team': m.group(3), 'technical': m.group(1)})
                    else:
                        item['events'].append({'player': m.group(1), 'technical': True})
                    text = text[m.end():].strip()

                m = self.DOUBLE_TECH_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'technical': True})
                    item['events'].append({'player': m.group(2), 'technical': True})
                    text = text[m.end():].strip()

                m = self.EJECTION_RE.match(text)
                if m:
                    item['events'].append({'player': m.group(1), 'ejection': True, 'note': m.group(2)})
                    text = text[m.end():].strip()

                # If no pattern consumed anything on this pass, the text is unparseable
                if len(text) == start_len:
                    raise ValueError('Could not parse text: %s' % text)
                if len(text) == 0:
                    text = None
        return item

    #TODO, figure out offensive/defensive rebounds... we need to know teams for that


class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert on the item's index fields so re-scraping a day replaces rather than duplicates
        self.db[item.__class__.__name__].replace_one(item.index_fields(), dict(item), upsert=True)
        return item
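
The text-processing pipeline above works by repeatedly matching a regular expression at the front of the remaining text, recording an event, and trimming the matched portion, raising an error if a full pass makes no progress. A minimal, table-driven sketch of the same idea follows. The two patterns, the `PATTERNS` table, and the `parse_events` function are hypothetical simplifications for illustration, not the expressions used in the pipeline, and unlike the original this variant restarts from the first pattern after every match.

```python
import re

# Hypothetical, simplified patterns for illustration only. Each entry pairs a
# compiled regex with a function that turns its match into an event dict.
PATTERNS = [
    (re.compile(r'(\d{1,2}:\d{2})'), lambda m: {'clock': m.group(1)}),
    (re.compile(r'(.+?) Rebound'), lambda m: {'player': m.group(1), 'reb': 1}),
]


def parse_events(text):
    """Consume `text` from the front, one recognized fragment at a time."""
    events = []
    text = text.strip()
    while text:
        for pattern, build_event in PATTERNS:
            m = pattern.match(text)  # match() only matches at the start
            if m:
                events.append(build_event(m))
                text = text[m.end():].strip()
                break
        else:
            # No pattern matched the front of the text, so we cannot progress.
            raise ValueError('Could not parse text: %s' % text)
    return events


if __name__ == '__main__':
    print(parse_events('10:42 Duncan Rebound'))
```

Pattern order matters in this sketch: the clock pattern is listed first so that the lazy `(.+?)` in the rebound pattern does not swallow the clock as part of a player name.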

settings.py

BOT_NAME = 'scraping'

SPIDER_MODULES = ['scraping.spiders']
NEWSPIDER_MODULE = 'scraping.spiders'

MONGO_URI = 'localhost:27017'
MONGO_DATABASE = 'nba'

ITEM_PIPELINES = {
    'scraping.pipelines.QuarterProcessor': 100,
    'scraping.pipelines.ClockProcessor': 102,
    'scraping.pipelines.TextProcessor': 101,
    'scraping.pipelines.MongoPipeline': 300
}
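
Scrapy passes each item through the enabled pipelines in ascending order of these integer values, so the order in which the entries are written in the dictionary does not matter. Here TextProcessor (101) runs before ClockProcessor (102), presumably so that ClockProcessor can work with the `clock` value that TextProcessor extracts. The short sketch below makes the effective order explicit; `execution_order` is a hypothetical helper written for this illustration, not part of Scrapy.

```python
ITEM_PIPELINES = {
    'scraping.pipelines.QuarterProcessor': 100,
    'scraping.pipelines.ClockProcessor': 102,
    'scraping.pipelines.TextProcessor': 101,
    'scraping.pipelines.MongoPipeline': 300,
}


def execution_order(pipelines):
    """Return pipeline paths sorted by ascending order value (hypothetical helper)."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in execution_order(ITEM_PIPELINES):
    print(path)
```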

scrape_season.py

#!/usr/bin/env python

import sys
import os

season = int(sys.argv[1])

for year in (season, season+1):
    months = range(9, 13) if season == year else range(1, 8)
    for month in months:
        for day in range(1, 32):
            os.system('scrapy crawl nba -a scrape_date=%04d%02d%02d' % (year, month, day))
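
The script above tries days 1 through 31 of every month, so it also launches crawls for dates that do not exist, such as February 30. A possible variant, sketched below, walks real calendar dates with `datetime` instead. The `season_dates` and `crawl_commands` names are hypothetical, and the sketch only builds the command strings; running them (for example with `os.system`, as above) is left to the caller.

```python
import datetime


def season_dates(season):
    """Yield each calendar date from Sep 1 of `season` through Jul 31 of the next year."""
    day = datetime.date(season, 9, 1)
    end = datetime.date(season + 1, 7, 31)
    while day <= end:
        yield day
        day += datetime.timedelta(days=1)


def crawl_commands(season):
    """Build one crawl command per valid date, using the same YYYYMMDD argument format."""
    return ['scrapy crawl nba -a scrape_date=%s' % d.strftime('%Y%m%d')
            for d in season_dates(season)]


if __name__ == '__main__':
    for command in crawl_commands(2015):
        print(command)
```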

Next Steps

Moving forward, I'll probably switch from scraping the play-by-play data to using the API. I'm optimistic that much of the text-parsing code will still apply, although I have observed some differences between the API text and the text on the recap pages.

Once that switch is made, I'll need to integrate the play-by-play and lineup data. This will provide me with a data set where for every play I have both what happened and who was on the floor (offense and defense). This opens up a lot of possibilities.

The ultimate goal is to predict the probabilities of various outcomes for a given lineup. However, this data can also be used to answer many other questions. For example, a recent ESPN article looked at the impact of exhaustion on team performance. With this data set, we can investigate the same question at the lineup level, seeing how a lineup's performance changes with the minutes played.

About Author

Tom Walsh

Tom Walsh (M.Sc. Computer Science, University of Toronto) developed a desire to get deeper into the data while leading a team of developers at BSports building Scouting Information Systems for Major League Baseball teams. A course on Basketball...