Fastest subway in NYC (Data updates every 10 seconds!)

Posted on Jul 24, 2015

Open the R Shiny App in a new window here!

Play with the App here:

Author and Project:

- Author: XAVIER CAPDEPON

- Xavier was a student of Data Science Bootcamp #2 (B002) - Data Science, Data Mining and Machine Learning - from June 1st to August 24th, 2015. Teachers: Andrew, Bryan, Jason, Vivian.

- This post is based on his final "R Shiny App" project submission.

 

Project context:

Over the last year, my sister-in-law mentioned several times that, in her experience, the subway lines did not run with the same punctuality from the Bronx to Lower Manhattan during the early-morning peak. It surprised me, and I kept her statement in the back of my mind.

She usually takes the 1 or 4 train at its terminal in the Bronx and rides it every day to Lower Manhattan and the Financial District. In her experience, the 4 train (the apple-green 4-5-6 line) had more delays and was often slower than the 1 train (the tomato-red 1-2-3 line) early in the morning, but at the end of the afternoon the 4 train was slightly faster.

Curiously, on the east side of Manhattan, the 4 train runs express between the 149th Street and Brooklyn Bridge stations (22 stations from the Bronx to the Financial District), whereas the 1 train only runs local (33 stations).

As we started learning R and Shiny, I believed that these trains, running at different paces and facing different delays, were the right subject on which to learn R and the Shiny package, and that the analysis could potentially resolve my skepticism towards my sister-in-law's statements.

 

Expectations and Challenges:

Given my first, short experience in the public transportation field 10 years ago as a junior research engineer, I had high expectations regarding the quality of the data, since the technologies for data collection, transmission, and storage have changed a lot over the years. Naively, I thought that the challenges would only be on the coding side.

Naturally, I was wrong, and I faced a series of challenges with the collection of the data, with the data itself, and with the coding, which I am going to explain in this post.

During the project, Jason, our teacher, told me once that I needed to be careful in terms of planning and should re-scale the project if difficulties were encountered, given the delivery deadline. Several times, I wondered whether the project was doable within the one-week timeline.

This app project encompasses all of the following technical challenges:

- real-time API feed
- data cleaning and interpretation
- time series
- R package conflicts
- Shiny interactivity and design
- data file size and processing time
- Python communication with R

 

Disorganized data sources:

Surprisingly, the first challenge was getting the data from the www.mta.info website. Finding the right page, and being able to return to it later, is actually a challenge given the confusing website architecture and the numerous secondary websites and links.

The following link provides all the static feeds for the buses, subways, and trains managed by the MTA, after signing a disclaimer.

The real-time data feed for the subway is available here. The following file contains the explanations of the real-time feed, but it appears that most of the fields are actually empty. As of today, real-time data are only available for lines 1-2-3, 4-5-6, and the S shuttle between Grand Central and Times Square in Manhattan. I found it surprising to provide real-time data for a line that goes back and forth between two close terminals at a really high frequency...

 

Redrawing the New York City Map using Public Transportation Data

In order to get more familiar with R, I started playing with the static data feeds from the MTA mentioned above. Each file contains different data about the public transportation network: the stops, schedules, colors and descriptions, path coordinates... It was really confusing for the first few hours, but after getting familiar with the data sets, they are quite easy to work with. The files are built on the GTFS format defined by Google, so the same knowledge can be reused for any other city publishing the same format. My original idea was to add features to draw customized parts of New York City with Shiny, with added layers of information, but it wasn't doable given the size of the files involved and the time needed to draw the pictures with ggplot2.


Fig 1.: Subway and bus paths in NYC.
The brightest white corresponds to several buses sharing the same path.
The picture was generated with RStudio and the ggplot2 library, based on the MTA static data feeds.

The code to draw the pictures of the NYC transportation network has been exported outside of the Shiny app files and is available here.
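For the curious reader, here is a minimal sketch of the same technique in Python (the language used for the data collection below) rather than in the R/ggplot2 actually used for Fig 1; the shapes.txt file and its columns come from the GTFS specification, while the gtfs/ folder is a placeholder for wherever the static feed was unzipped:

import pandas as pd
import matplotlib.pyplot as plt

# shapes.txt (GTFS spec) lists the ordered lat/lon points of every route path
shapes = pd.read_csv("gtfs/shapes.txt")  # hypothetical local path
shapes = shapes.sort_values(["shape_id", "shape_pt_sequence"])

# overlapping semi-transparent white lines appear brighter,
# reproducing the "brightest white" effect of Fig 1
fig, ax = plt.subplots(figsize=(8, 10))
ax.set_facecolor("black")
for shape_id, pts in shapes.groupby("shape_id"):
    ax.plot(pts["shape_pt_lon"], pts["shape_pt_lat"],
            color="white", alpha=0.1, linewidth=0.5)
ax.set_aspect("equal")
plt.show()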

Fastest subway in New York City

Building the historical data:

In order to study the behavior of the subways, I needed to collect the data from the real-time feed over the longest possible period of time.

However, the feed from the MTA is actually not a txt file like the static data but a GTFS-realtime feed (a Google-defined format based on protocol buffers).

It became clear very quickly that R doesn't yet have a package/library designed to read GTFS-realtime feeds, but Python has a specific package called google.transit, available here. Please note that the easiest way to install it is through the console commands:

# Using easy_install
easy_install --upgrade gtfs-realtime-bindings

# Using pip
pip install --upgrade gtfs-realtime-bindings

In order to download the information, I wrote a Python script using the google.transit package, combined with web scraping of the delays using the Beautiful Soup package (more on that later in this post). The code is available here. Essentially, the script reads the GTFS feed and extracts different pieces of information such as the expected arrival time, train ID, system time, station, and alerts, as described below:

import datetime
import urllib  # Python 2; the feed request below uses urllib.urlopen
from google.transit import gtfs_realtime_pb2

### save-to-file definition
def savefile(listitems):
    # 'a' to append the line to the txt file
    with open('mta.txt', 'a') as f:
        f.write(listitems + '\n')

### final code to download and clean the real-time feed
def loopdwlmta(x):

    # system time for reference
    tmpSys = datetime.datetime.strftime(datetime.datetime.now(), '%Y-%m-%d %H:%M:%S')

    ## load the list of lines with announced delays (cf. Code 3)
    listlineperturbated = listofperturbations(urllist)

    ## send request to the mta.info API using my own key
    feed = gtfs_realtime_pb2.FeedMessage()
    response = urllib.urlopen('http://datamine.mta.info/mta_esi.php?key=5a8f6f4375e9f4bdea9b0c86afeaf911')
    feed.ParseFromString(response.read())

    ## loop over the downloaded content
    for entity in feed.entity:
        if entity.HasField('trip_update'):

            ## trip update: 3 data characteristics
            TripIdi = entity.trip_update.trip.trip_id
            StartDatei = entity.trip_update.trip.start_date
            Routei = entity.trip_update.trip.route_id
            if Routei in listlineperturbated:
                Delaysi = "delays"
            else:
                Delaysi = ""

            ## trip alert: 1 data characteristic
            if entity.HasField('alert'):
                Alerti = entity.alert.header_text
            else:
                Alerti = ''

            ## record the first predicted arrival time(s)
            if len(entity.trip_update.stop_time_update) != 0:
                # record only the first stop_time_update (raise the 1 to keep more)
                p3trip = min(len(entity.trip_update.stop_time_update), 1)

                j = 0
                while j < p3trip:
                    ArrivalTime = entity.trip_update.stop_time_update[j].arrival.time
                    stop_id = entity.trip_update.stop_time_update[j].stop_id
                    station_nth = j

                    tripcomplete = [tmpSys, TripIdi, StartDatei, Routei, Alerti,
                                    station_nth, stop_id, ArrivalTime, Delaysi]
                    tripcomplstr = ','.join([str(i) for i in tripcomplete])
                    savefile(tripcomplstr)

                    j += 1
Code 1.: the script above reads and filters data from the real-time GTFS feed issued by mta.info.

A loop ran for the remaining time of the project (about 6 days) on my home computer, downloading the data every five minutes and storing it in a text file.

The text output file was a quick solution, given the time frame of the project, that could serve as an easy junction between R and Python, since the two programs were run on different computers.

A future improvement would be to create and update a SQL database saved on disk, since SQL is better suited to the size of the data and to later data manipulation (a minimal sketch is given below).

import time
while True:
    try:
        loopdwlmta(1)
    except Exception:
        # ignore transient download errors; still wait before retrying
        pass
    time.sleep(300)

Code 2.: a simple script to launch the real-time download (cf. Code 1) at a regular time interval.

After some trial and error, I set the interval to 5 minutes (300 s). In hindsight, I think the interval for downloading the feed should be at most 90 seconds, given my experience and the analysis of the data. It is my understanding that 90 seconds also corresponds to the time after which the MTA stops updating the expected feed for a train stalled between stations.
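Coming back to the database improvement mentioned above, here is a minimal sketch using Python's built-in sqlite3 module; the database file and table name are hypothetical, and the columns mirror the fields written to mta.txt by savefile():

import sqlite3

conn = sqlite3.connect("mta.db")  # hypothetical database file
conn.execute("""CREATE TABLE IF NOT EXISTS feed (
    tmpSys TEXT, TripId TEXT, StartDate TEXT, Route TEXT, alert TEXT,
    station_nth INTEGER, stop_id TEXT, ArrivalTime INTEGER, Delays TEXT)""")

def saverow(tripcomplete):
    # drop-in replacement for savefile(): one database row per trip update
    conn.execute("INSERT INTO feed VALUES (?,?,?,?,?,?,?,?,?)", tripcomplete)
    conn.commit()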

The Python code captured data over six days, including a weekend.


Pict. 2: screenshot of the "Fastest Subway Line Analysis" App window, showing the user widgets and a map of the studied lines 1 and 4.

 

Data output surprises and follow-up strategy:

The output data is relatively clean once it is filtered by the Python script, but it surprised me for several reasons:

1. Expected time versus real-time position:
The MTA only provides the expected arrival time of each train at the next station, not its real-time position, so all the data are on an expected-arrival-time basis. (Correction: there may be a way to detect a train standing in a station by catching a "depart time" that falls before an "arrival time" in the MTA real-time feed.)

The only way to mitigate the problem and calculate a close estimate of the travel time for each train is to download the expected arrival time at the next station as often as possible for each train, and to capture the maximum arrival time seen at a particular station for each train.
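To make this concrete, here is a minimal pandas sketch, assuming the mta.txt column layout used by the R helpers further below; for each train and station, the latest (maximum) expected arrival time seen across the repeated downloads is kept as the estimate of the actual arrival:

import pandas as pd

cols = ['tmpSys', 'TripId', 'StartDate', 'Route', 'alert',
        'j', 'stop_id', 'ArrivalTime', 'Delays']
feed = pd.read_csv('mta.txt', header=None, names=cols)

# for each (train, station) pair, keep the last expected arrival time seen
best_arrival = (feed.groupby(['TripId', 'StartDate', 'stop_id'])['ArrivalTime']
                    .max()
                    .reset_index())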

2. Empty "Alert" field and unrealistic train delays:
Curiously, the real-time feed contains a field called "Alert" attached to every train ID - which my code captures - but unfortunately the field never seems to be used.

Given this finding, I quickly decided to scrape the delays from the Service Status section of the mta.info website. I will come back to this later in the post, as it constitutes a real concern.

3. Ghost trains that never move at the terminal:

The MTA real-time feed contains the train IDs of trains that never depart, or that stay on hold for up to an hour and finally disappear without departing. Given the configuration of the terminal at 242nd Street (I live nearby), these ghost trains cannot all be standing near the passenger platform and cannot all be real trains. These outliers are a concern because they distort the analysis.

Originally, I expected to calculate the travel time from the departure terminal to the last station in the target area. Given this finding, I restricted the analysis to the segment from the station after the departure terminal to the last station, in order to exclude the trains that never moved or stayed on hold at the terminal.

4. Process to calculate travel time versus departure time:

In order to calculate the travel time versus the departure time, the code follows these steps:
a. the code captures the train IDs passing the second station after the terminal;
b. given (a), the code captures the same train IDs passing the second-to-last station in the arrival area. Since the feed was downloaded every five minutes, I chose to capture the train IDs at the second-to-last or third-to-last station in order to get more data points, given the short timeline of a single week of historical data.

I implemented several additional features, such as downloading several expected arrival times per train, but the MTA would omit data without notice, and I had to reshuffle my project several times.

 

Non-quantifiable Delays:

Since the alert field of the real-time MTA feed was always empty, I implemented a script, running in parallel with the real-time download, to capture the delays on the subway from the Service Status section of mta.info.

# Delays on mta.info -- web scraping of the Service Status pages:

import requests
from bs4 import BeautifulSoup

# definition of the extracting function
def extract_url_content(urlv, headersv):
    # request the url with the given headers
    req = requests.get(urlv, headers=headersv)
    demande = req.text
    # check the request status code
    statut_demande = req.status_code
    if statut_demande // 100 in [4, 5]:
        return 'error on requests with error: ', statut_demande
    return BeautifulSoup(demande)

## delays on the MTA lines are published here:

urllist = [
    'http://www.mta.info/status/subway/123',
    'http://www.mta.info/status/subway/456',
    'http://www.mta.info/status/subway/7',
    'http://www.mta.info/status/subway/ACE',
    'http://www.mta.info/status/subway/BDFM',
    'http://www.mta.info/status/subway/JZ',
    'http://www.mta.info/status/subway/L',
    'http://www.mta.info/status/subway/NQR',
    'http://www.mta.info/status/subway/S'
]

## summary of the loop
def listofperturbations(listOfLines):

    ## web scraping of mta.info using the list of urls above
    linesperturbated = []
    for url in listOfLines:
        headers = None
        urlbody = extract_url_content(url, headers)

        if urlbody.find(class_="TitleDelay") != None:
            Delaystext = urlbody.find(class_="TitleDelay").get_text()
            if Delaystext == "Delays":
                lineaffected001 = urlbody.findAll(id='status_display')
                lineaffected002 = [i.findAll('img') for i in lineaffected001]
                for j in range(len(lineaffected002)):
                    for i in lineaffected002[j]:
                        try:
                            # the first character of the img "alt" text is the line name
                            linesperturbated.extend(i['alt'][:1])
                        except KeyError:
                            continue
    return linesperturbated

Code 3.: script to capture the delays on every subway line from the mta.info Service Status pages.

The output didn't satisfy me, and at first I barely knew what to do with the information. There are several problems with how the delays are announced that make them non-quantifiable as such:

1. Delays are mentioned on the website in different manners, often in non-structured sentences not directly readable by a script, using a mix of words and pictures to represent the lines. A typical hypothetical example: "delays on the [icon(2)] trains running on the same line as the [icon(4)] due to an incident in the Bronx". The code should only consider the 2 trains, but we don't know whether both directions are concerned or which train IDs in the Bronx are affected. Additionally, the structure of the sentence changes depending on the agent writing it, and the website is not properly structured, so the classes and tags are mixed and it is difficult to extract anything.

2. Since the delays are not tagged with train IDs, it is really difficult to know which trains are concerned. It is unlikely, for example, that the trains near a terminal are affected by an incident in the middle of the line. As a first step, I chose to tag all the trains running at the time of the declared delay, whatever their positions were.

3. Also, since the delays aren't tagged with train IDs and are a completely independent piece of information, I chose to write the script so that, if a delay was announced at least once during the trip of a particular train ID, the code reports a delay for this train ID regardless of the delay's duration.
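As a sketch of that rule, continuing from the feed DataFrame of the earlier pandas example (the unique train ID is built the same way as in the R helpers below, by concatenating TripId and StartDate):

# a train is flagged as delayed if at least one of its records carries the tag,
# regardless of how long the delay actually lasted
feed['TrainId'] = feed['TripId'].astype(str) + feed['StartDate'].astype(str)
delayed_ids = set(feed.loc[feed['Delays'] == 'delays', 'TrainId'])
feed['DelayedTrip'] = feed['TrainId'].isin(delayed_ids)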

Analysis:

Once the outputs of the Python script, including the delays, had been stored in a text file for several days, the code in the Shiny app helpers file reads and interprets the data. The code is available here and here: see the graph_trip_time_subway function.

Please note that, for the purpose of migrating the App to the server, the data output was saved in an RDS file so that the App runs more smoothly, and several preprocessing steps were frozen (commented out below).

##2. Load the file for historical data 
## based on scrap of real time MTA API
library(dplyr)
library(lubridate)
library(chron)

# ## Load historical file of real time data records built using Python.
# mtahist = read.csv("data/mta.txt",header=FALSE,stringsAsFactors = FALSE)
# names(mtahist) <- c('tmpSys','TripId','StartDate','Route','alert','j','stop_id','ArrivalTime','Delays')
# 
# ## Datafile Setup:
# ## Select only first record for each trainId for each time stamp: j==0
# ## Add direction N or S 
# ## Capture and convert time from date stamps
# ## Add col of unique trainID
# mtahist001 = filter(mtahist,j==0)
# mtahist002 = mutate(mtahist001,Direction=substr(mtahist001$stop_id,4,5)) %>%
#   mutate(.,TrainId = paste(TripId,StartDate,sep=""))
# mtahist004 = mutate(mtahist002,ArrivalTime=ifelse(ArrivalTime==0,
#                                                   as.numeric(ymd_hms(mtahist002$tmpSys, tz = "America/New_York")),
#                                                   ArrivalTime))
# mtahist003 = mutate(mtahist004,TmpSys=substr(tmpSys,12,20))
# mtahist003$TmpSys <- chron(times. = mtahist003$TmpSys)
# saveRDS(mtahist003, "mtaHistoricalFeed.rds")
## file was saved in rds format for better results on shiny server
mtafile_cleaned = readRDS("data/mtaHistoricalFeed.rds")
graph_trip_time_subway<-function(line,timeinf,timesup,datafile,delays,direction){
  ## Select line and directions:
#   line_input = line
#   direct_input = c('S')
  
  
  
  library(dplyr)
  library(lubridate)
  library(chron)
  
  ## Given the timeframe of the project, the calculation is simplified:
  ## A. Given the unexpected behaviour happening at the terminal, the train time is captured at the first station after the terminal
  ## B. Given that the historical file is updated every 5 min, the train arrival time is captured at the last two stations to get more significant data points
  ## => B. the update interval needs to be fixed in the future.
  
  ## Select start and arrival station:
  ## we are limiting the study to line 1 and 4 for purpose of this exercise:
  ##
  if (direction=='S'){
    station_line = c("103S","139S","140S","402S","419S","420S")
    starting_station = c("103S","402S")
  } else {
    station_line = c("101N","103N","139N","401N","402N","420N")
    starting_station = c("139N","420N")
  }
  
  #starting_station = paste(ss, direct_input, sep="")
  mtahist010 = filter(datafile,
                      Route %in% line,
                      stop_id %in% starting_station )
  
  mathist011 = filter(mtahist010,timesup>chron::hours(mtahist010$TmpSys),
                                 timeinf<chron::hours(mtahist010$TmpSys))

  ## isolate the IDs of the trains departing between the two times
  mathist013 = distinct(arrange(mathist011,desc(ArrivalTime)), TrainId)

  
  ## in the large historical file, isolate the IDs from the previous step
  mathist015 = filter(datafile, TrainId %in% mathist013$TrainId)

  mathist016 = filter(mathist015,stop_id %in% station_line)%>%
    group_by(.,TrainId) %>%
    mutate(.,tempsroulage=(max(ArrivalTime)-min(ArrivalTime))/60)
  mathist017=  arrange(mathist016,desc(tempsroulage))
  mathist018=  distinct(arrange(mathist016,desc(tempsroulage)), TrainId)
  mathist019 = (filter(mathist018,tempsroulage>1))
  #mean(mathist019$tempsroulage)

  #color for graph
  if (length(line)==1){
    if (line==1){
      couleurline = c("#EE352E")
    } else{
      couleurline = "#00933C"
    }
  } else {
    couleurline = c("#EE352E","#00933C")
  }
  print(line)
  print(direction)
  print(couleurline)
  Sys.setenv(TZ='GMT')
  library(ggplot2)
  if (delays==FALSE){
    g = ggplot(mathist019, aes(TmpSys,tempsroulage, group=Route,color=Route)) + 
      geom_point(alpha=.7) +
      coord_cartesian(ylim=c(0,100)) +
      geom_smooth(method="auto") +
      geom_rug() +
      scale_color_manual(values=couleurline)+
      xlab("Depart Time") +
      ylab("Travel Time (in minutes)") +
      ggtitle("MTA Subway Travel Time") +
      scale_x_chron(format="%H:%M")
  } else {
    g = ggplot(mathist019, aes(TmpSys,tempsroulage)) + 
      geom_point(aes(group=Route,color=Route),alpha=.7) +
      coord_cartesian(ylim=c(0,100)) +
      geom_smooth(aes(group=Route,color=Route),method="auto") +
      geom_rug(aes(group=Route,color=Route)) +
      geom_point(aes(group=Delays,size=Delays,color=Route)) +
      scale_color_manual(values=couleurline)+
      xlab("Depart Time") +
      ylab("Travel Time (in minutes)") +
      ggtitle("MTA Subway Travel Time") +
      scale_x_chron(format="%H:%M")
  }
  return(g)
}

Codes 5 and 6.: the script above filters and graphs the output saved in the txt file by the Python script described earlier.

Given the line, the direction, the departure time range chosen by the App user, and whether or not to show the delays, the script:

a. cleans and filters the inputs;
b. builds a unique train ID by concatenating the date with the unique daily trip ID;
c. decomposes the station ID in order to extract the direction;
d. converts the time strings into readable times (this step gave me a really hard time for three days because the various R packages involved, including ggplot2, conflicted over the time zone reference);
e. given the departure time range, filters the train IDs that pass the second station after the terminal, then looks for those that pass the last two stations of the target arrival area, as stated in my hypothesis;
f. calculates the travel time for each train under the conditions described in (e);
g. graphs all the travel times as a function of the departure time using ggplot2.


Pict. 3: screenshot of the Analysis page of the Shiny App.
The user chooses options to interact with the graph of the subway analysis.

 

Time Analysis Conclusion:

My sister-in-law's observations are not completely confirmed by the data, but we can affirm the following:

It is correct that there is a peak in travel time early in the morning and at the end of the afternoon for both the 1 and the 4 trains in the south direction, but the peaks are less pronounced in the north direction.

With the data collected over this short period, we can observe that taking the 4 express train is about 5 to 10 minutes faster than taking the 1 train.

We can also observe that the 4 train incurs more delay alerts than the 1 train, and that some late trains (such as the 1 train) incur delays without any delay alert being issued by the MTA status service.

Finally, the analysis would benefit from a data set covering a longer period with a higher update frequency, and should differentiate weekdays from weekend days.

 

Real Time Map:

The real-time map was added in the last hours before the deadline; it is really just a simplified assembly of the different techniques I used to draw the maps and build the travel time analysis described above.

Originally, a simple text file, generated every 30 s by a Python script similar to the one described earlier (cf. Codes 1 and 2), was read by Shiny every time the file was updated, using the function:

fileData <- reactiveFileReader(1000, session, 'data/mtaRealTime.txt',read.csv)

This option was chosen after unsuccessful attempts at launching Python from R: for example, the latest version of R doesn't yet support the rPython package, and I had to find a quick solution that would work.

 

In order to publish the App on the server, the R script was modified to launch the Python script every time an R Shiny session is started by a user, using:

fileData <- system("/usr/bin/python mtaRealTime.py", intern=TRUE)

The data is kept in the console's memory, and Shiny then reads it every 10 seconds using the reactive command:

autoInvalidate <- reactiveTimer(10000, session)

Pict. 4: screenshot of the Real Time Map page of the App.

 

Possible improvements:

- Reduce the refresh interval for downloading the historical data to less than 90 s.
- Create and update a SQL database to store the historical data instead of a text file.
- Separate the data collected on weekends from the data collected on weekdays.
- Study the speed between every pair of stations on lines 1 and 4.
- Real-time map: add features such as the time to arrival of the next train, and trains standing at the current station.

 

Thanks:

Many thanks to Andrew, Bryan, and Jason, my teachers, for their help, advice, and recommendations during the project, and to Claire and Shu, my classmates, for their explanations about SQL databases and their HTML design advice, respectively.
