Play Data, Play Ball! - Exploring Baseball Data with R

Avatar
Posted on Dec 31, 2014
  • Demo day: Dec 9, 2014. No photos available.
  • Contributed by Summit Suen & Wayne Chen.
  • They took the CORP-R 002 class with Vivian Zhang (Taiwan Open data and data science / 臺北國際 OPEN DATA 培訓) in Oct, 2014. This post was based on their final project submission.

  • What comes up with you while talking about SPORTS DATA ANALYSIS?!


    Hot, Hot, Hot: Big Data & Sports (1/2)


    Hot, Hot, Hot: Big Data & Sports (2/2)


    In fact, works analyzing baseball data with statistical techniques has already exist: Sabermatrics


    There are plenty of excellent packages that could help us analyzing baseball data.

    A simple demo with r-packages: dplyr & lahman

    library(Lahman)
    library(dplyr)
    totalRS <- Teams %>% select(yearID, R, G) %>% mutate(AvgRperG = R/G) %>% group_by(yearID) %>% summarise(sum(AvgRperG))
    names(totalRS) <- c("yearID", "RUN")
    head(totalRS)
    ## Source: local data frame [6 x 2]
    ## 
    ##   yearID      RUN
    ## 1   1871 93.12897
    ## 2   1872 95.21474
    ## 3   1873 73.15998
    ## 4   1874 58.55903
    ## 5   1875 70.08774
    ## 6   1876 47.01267
    

    And one could easily observe the trend of scoring in MLB history via lahman + ggplot2

    library(ggplot2)
    ggplot(data = totalRS, aes(x = yearID, y = RUN)) + stat_smooth() + geom_line()
    ## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
    

    plot of chunk unnamed-chunk-2


    Even admire the career numbers of Derek Jeter! LOL

    library(Lahman)
    library(dplyr)
    head(filter(Batting, playerID == "jeterde01"))
    ##    playerID yearID stint teamID lgID   G G_batting  AB   R   H X2B X3B HR
    ## 1 jeterde01   1995     1    NYA   AL  15        15  48   5  12   4   1  0
    ## 2 jeterde01   1996     1    NYA   AL 157       157 582 104 183  25   6 10
    ## 3 jeterde01   1997     1    NYA   AL 159       159 654 116 190  31   7 10
    ## 4 jeterde01   1998     1    NYA   AL 149       149 626 127 203  25   8 19
    ## 5 jeterde01   1999     1    NYA   AL 158       158 627 134 219  37   9 24
    ## 6 jeterde01   2000     1    NYA   AL 148       148 593 119 201  31   4 15
    ##   RBI SB CS BB  SO IBB HBP SH SF GIDP G_old
    ## 1   7  0  0  3  11   0   0  0  0    0    15
    ## 2  78 14  7 48 102   1   9  6  9   13   157
    ## 3  70 23 12 74 125   0  10  8  2   14   159
    ## 4  84 30  6 57 119   1   5  3  3   13   149
    ## 5 102 19  8 91 116   5  12  3  6   12   158
    ## 6  73 22  4 68  99   4  12  3  3   14   148
    

    Here comes another awesome package: pitchRx
    Which offers APIs accessing daily MLB logs and 2D/3D animations.

    install.packages("pitchRx")
    library(pitchRx)


    OpenWAR is a open project for calculating WAR - Wins Above Replacement
    WAR offers an estimate to answer the question, “If this player got injured and their team had to replace them with a freely available minor leaguer or a AAAA player from their bench, how much value would the team be losing?”

    install.packages("Sxslt", repos = "http://www.omegahat.org/R", type = "source")
    require(devtools)
    install_github("openWAR", "beanumber")


    After the amazing tour of useful packages, we need to write our OWN crawler/analyzer!!!


    Data Source


    Crawler in R (1/2)

    ## RSelenium + phantomJS
    library(XML)
    library(RSelenium)
    
    pJS <- phantom()
    Sys.sleep(5)
    remDr <- remoteDriver(browserName = 'phantomjs')
    remDr$open()
    
    url <- 'http://www.cpbl.com.tw/stats_hr.aspx'
    remDr$navigate(url)
    
    hr_tables <- list()

    Crawler in R (2/2)

    year <- seq(1990, 2014)
    for (i in 1:length(year)) {
      option <- remDr$findElement(using = 'xpath', paste0("//*/option[@value = '",year[i],"']"))
      option$clickElement()
      src <- option$getPageSource()
      hr_tables[i] <- readHTMLTable(src[[1]], stringAsFactors = FALSE, encoding = "utf8")
    }
    
    # View(hr_tables)
    
    remDr$close()
    pJS$stop()

    Let's PLAY DATA!

    load("hr_tables.RData")
    
    head(hr_tables[[25]])
    ##   NUMBER YEAR      GID      DATE STADIUM   BATTER  BATTERTEAM  PITCHER
    ## 1      # YEAR GAME NO.      DATE STADIUM   PLAYER PLAYER TEAM  PITCHER
    ## 2      1 2014        2 2014/3/23    天母 林 威 助    中信兄弟 費 古 洛
    ## 3      2 2014        3 2014/3/23  澄清湖 詹 智 堯      Lamigo 林 正 豐
    ## 4      3 2014        4 2014/3/25    新莊 張 志 豪    中信兄弟 林 晨 樺
    ## 5      4 2014        5 2014/3/26  嘉義市 林 泓 育      Lamigo 林 岳 平
    ## 6      5 2014        6 2014/3/27    新莊 張 志 豪    中信兄弟 黃 勝 雄
    ##    PITCHERTEAM RBI REMARK
    ## 1 PITCHER TEAM RBI REMARK
    ## 2 統一7-ELEVEn   1       
    ## 3         義大   1       
    ## 4         義大   1       
    ## 5 統一7-ELEVEn   1       
    ## 6         義大   1
    

    Draw some pictures (1/2)

    par(family = 'Heiti TC Light')
    m <- ggplot(hr_table, aes(x = as.numeric(GID)))
    m + geom_density(aes(fill = factor(BatterTeam), alpha = 0.01)) + 
        labs(title = paste0("HR Distribution of ", year[i]," Season 中華職棒", year[i], "賽季逐場全壘打分布")) + 
        theme(text = element_text(family="Heiti TC Light"), plot.title = element_text(size = 26), legend.text = element_text(size = 20))

    plot of chunk unnamed-chunk-10


    Keep going - further analysis

    load("sc_tables.RData")
    
    # 賽伯計量學的畢達哥拉斯定理
    win_pc <- function(RS, RA) {
      perc <- RS^2 / (RS^2 + RA^2) 
      return(perc)
    }


    Draw some pictures (2/2)

    levels(win_prob$team)
    ## [1] "統一7-ELEVEn" "義大"         "中信兄弟"     "Lamigo"
    
    # fig <- ggplot(data = win_prob, aes(x = team, y = real, fill = factor(half)))
    # fig + geom_bar(stat = "identity", position="dodge")
    ggplot(data = win_prob, aes(x = expd, y = real, color = factor(team))) + stat_smooth() + geom_point() + coord_fixed()
    ## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
    

    plot of chunk unnamed-chunk-13


    An example demonstrating the relationship between "Real Value" v.s. "Market Value"

    example



    Advertising

    About Author

    Related Articles

    Leave a Comment

    No comments found.

    View Posts by Categories


    Our Recent Popular Posts


    View Posts by Tags

    #python #trainwithnycdsa 2019 airbnb Alex Baransky alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus API Application artist aws beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep Bundles California Cancer Research capstone Career Career Day citibike clustering Coding Course Demo Course Report D3.js data Data Analyst data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization Deep Learning Demo Day Discount dplyr employer networking feature engineering Finance Financial Data Science Flask gbm Get Hired ggplot2 googleVis Hadoop higgs boson Hiring hiring partner events Hiring Partners Industry Experts Instructor Blog Instructor Interview Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter lasso regression Lead Data Scienctist Lead Data Scientist leaflet linear regression Logistic Regression machine learning Maps matplotlib Medical Research Meet the team meetup Networking neural network Neural networks New Courses nlp NYC NYC Data Science nyc data science academy NYC Open Data NYCDSA NYCDSA Alumni Online Online Bootcamp Open Data painter pandas Part-time Portfolio Development prediction Prework Programming PwC python python machine learning python scrapy python web scraping python webscraping Python Workshop R R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn Selenium sentiment analysis Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau team TensorFlow Testimonial tf-idf Top Data Science Bootcamp twitter visualization web scraping Weekend Course What to expect word cloud word2vec XGBoost yelp