NYC Data Science Academy| Blog

Transportunities

Jake Lehrhoff
Posted on Nov 17, 2015

Contributed by Jake Lehrhoff. Jake took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his second class project (due the 4th week of the program).

The Situation

If you ask a room of New Yorkers to raise their hand if they experienced a transportation woe in the last week, you can expect just about every hand to go up, angrily or wearily. And anyone who doesn't raise their hand is likely to have a quiet moment of joy, recognizing the miracle of seven straight days of unperturbed subway rides and easily accessible taxis.

NEW YORK, NY - MARCH 10: Commuters wait to board a New York City subway car at Grand Central Terminal during evening rush hour on March 10, 2014 in New York City. New statistics revealed that public transit ridership is at its highest since 1956. (Photo by Andrew Burton/Getty Images)


But of course New Yorkers have transportation woes. In just 305 square miles there are 8.5 million of us and, speaking as someone who commutes to midtown, about as many tourists, give or take. The enormity of NYC's transportation needs makes it ripe for data science's helping hand.

Solutions Make for New Problems

The last decade has seen an explosion of transportation opportunities beyond the traditional options of subway, bus, taxi, or our own two feet, most notably, Uber and Citibike. And the success of these newcomers is not just due to their unique offerings or that the transportation space was ripe for disruption: these data-conscious companies look to improve the transportation experience by understanding how we, the customers, engage with their product.

As businesses that provide the same fundamental service, Uber and Citibike face many of the same problems. Bikes have to be placed where riders want to start their trips, just as Uber drivers have to be in the vicinity of their fares. Uber and Citibike only exist as solutions to our transportation problems if they are convenient, efficient, and cost effective. If there are no available bikes at the nearest bank or if there is a 15-minute wait for an Uber, customers may feel better served by more traditional options.

Uber has even greater problems, as drivers aren't employees but "contractors." Uber drivers have the latitude to work when and where they like, completely at their own discretion. So what guarantees that "contractors" decide to work where the demand is?

An Investigation

All it takes to avoid these problems is data and wherewithal. With six months of pickups from April to September, 2014, we can paint a picture of customer engagement and propose best practices for maximizing effectiveness and efficiency.

Changing ridership is a function of more than just time and location. By adding weather data from the National Oceanic and Atmospheric Administration, we can answer all sorts of fascinating questions. Do Union Square residents still ride Citibikes when it drizzles? Are fewer Ubers hailed around midtown on warm days? Are late morning commuters more likely to take an Uber if it's pouring? Suddenly, it's not simply a question of when and where we ride, but why.

The App

For a 360° view of the data, I've developed a Shiny app that offers both basic exploratory data analysis and a tool to visualize Uber and Citibike pickups with high specificity. Open the app and test out the scenarios below.


The "Plots" tab

 


A look at monthly ridership shows a steady upward trend among Uber customers and, for Citibike, a steep increase in the spring followed by a level summer. More data is needed to interpret these trends with confidence: whether Uber's upward trajectory reflects continued market gain, and whether Citibike's pattern reflects seasonal use.

The fourth tab, "Heat Map," offers greater control to your investigation of Uber and Citibike ridership.

Consider this scenario: You're an Uber driver. It's 6am on a weekday. You're awake because commuters are awake. Also, you left the bedroom window open to enjoy the fresh spring air, but now a tree-full of birds are celebrating life like a symphony of car alarms. But more importantly, commuters. Off to work. Where will you find the most pickups? What part of town do you want to make your way back to after each drop off? What parts are Uber deserts?

Let's take a look. Set the app to "Uber," "May," "6-7am," "Weekdays," "No Rain," and "Any."


Wow! The Upper East Side is busy!

Now let's say you let yourself sleep in (you deserve it). By 9am, where has the densest ridership moved? What happens as you keep driving into the afternoon?

We can answer similar questions with the Citibike data. It's still May, an exciting month for Citibikers because the weather improves with each passing day. Does Citibike need to be more careful with its redistribution effort on warm afternoons? Are different areas of town affected?


On the left, any weather, and on the right, hot days

Overall, Union Square is a busy area for spring Citibike pickups, but warm days see more riders around Fulton Street and the West Village.

Play around with all of the options to discover what combination of time and weather paints the most unique customer engagement.

Future Directions

Really, this is the tip of the iceberg. Both of these companies face the fundamental problem of migration. If an Uber driver brings a commuter from the Upper East Side to downtown, that is one fewer driver in the UES to pick up the next downtown commuter. Eventually, the system will be overwhelmed and an area may become an Uber "desert." The same is true in a much more startling fashion for Citibike. Late commuters may have to hunt for a bike, or else, rely on alternative transportation. Equally frustrating, what happens when there's nowhere to park your bike? Visualizing pickups alone can't solve these problems.

Additionally, this investigation did not take into account Uber wait time. Identifying times and locations where Uber riders have to wait an unreasonable amount of time for a ride (and thus might choose to grab a cab or hop on the subway) would offer significant value to Uber as a company.

Conclusions

Uber and Citibike have thrived since their launches because they make our lives easier. But there is still room for improvement. Uber's main hiring draw--that contractors can work when and where they like--is admirable, but that doesn't mean that it necessarily serves the customers. The simple action of pinging a driver before a rainstorm, urging them out the door to make some quick cash, may help flood the streets with cars when eager riders are huddled under awnings, hoping against logic and experience that a car will miraculously accept their request.

All transportation options have their downsides. Walking takes too long. Cabs are expensive. The subway is crowded and smelly and breaks down and is filled with angry people. Uber and Citibike aren't perfect either, but they can use data science to improve their product tremendously.

The Code

All of the data and code can be found in my public github repository. Enjoy!

Gathering Data

There are three sources of data: Uber, Citibike, and weather. The Uber data comes courtesy of fivethirtyeight.com's github repository, as they had previously requested proprietary Uber data and were granted six months of pickups. The Citibike data is openly available through their website--they are generously open-source. Weather data was granted by the National Oceanic and Atmospheric Administration, amazingly, within 5 hours of the request.

Munging Data

"Clean" data is not necessarily ready for analysis. R doesn't automatically know that something that looks like a date is a date. Similarly, vast datasets have to be pared down to the relevant data. timeDate and dplyr were invaluable tools in these tasks.

Weather

First, the weather data had to be simplified, as it contained readings from all the weather stations in New York City. For the purpose of these analyses, one would suffice. I chose the Central Park observatory and applied the following filter function:

library(dplyr)
library(timeDate)

weather <- read.csv('data/weather.csv')
weather <- filter(weather, STATION_NAME == "NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US")

 

Now that my thousands of rows were down to a couple hundred, the columns had to be selected. I chose to limit the scope to precipitation, minimum temperature, and maximum temperature. Finally, the date variable can be formatted as actual dates.

weather <- select(weather, DATE, PRCP, TMAX, TMIN)
weather$DATE <- as.POSIXct(as.character(weather$DATE), format = "%Y%m%d")
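
As a minimal sketch of why the as.character() step matters, assume the DATE column arrives as integers such as 20140401 (an assumption based on the "%Y%m%d" format above; the toy values below are made up, not real readings). The helper name parse_noaa_date is hypothetical, and the example uses base R with a fixed UTC time zone so it behaves the same on any machine.

```r
# Toy stand-in for the weather table; values are illustrative only.
weather_toy <- data.frame(
  DATE = c(20140401L, 20140402L, 20140403L),
  PRCP = c(0, 15, 120),
  TMAX = c(150, 172, 161),
  TMIN = c(60, 83, 72)
)

# Dates stored as integers must become text before they can be parsed
# against a format string.
parse_noaa_date <- function(x) {
  as.POSIXct(as.character(x), format = "%Y%m%d", tz = "UTC")
}

weather_toy$DATE <- parse_noaa_date(weather_toy$DATE)
str(weather_toy$DATE)
```

Pinning the time zone is a design choice for the sketch: date keys built in different time zones can silently fail to match in a later join.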

Uber and Citibike

The Uber and Citibike data came in large files, one per month. All munging was done on each file individually to avoid corrupting a large file and losing large swaths of time and effort. Thankfully, the single most important variable--the location of the pickup--required no attention beyond normalizing the column names between data sets. The real munging concerned creating the variables that I was interested in visualizing.

The below function turns the date column into a time object, creates a column specifying if a given pickup occurred on the weekend, creates a time column parsed from the date, and joins the weather table. For flexibility, the function takes time format as an argument so it can be applied to both datasets. Two examples of this function applied are included below.

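timefunc <- function(df, column, format){
  # Parse the raw timestamp strings using the dataset-specific format
  df$date <- as.POSIXct(column, format = format)
  # Flag pickups that fall on a weekend (isWeekend comes from timeDate)
  df$weekend <- sapply(df$date, isWeekend)
  # Keep the time of day as "HH:MM" text
  df$time <- format(df$date, format="%H:%M")
  # Truncate to midnight so the key matches weather$DATE
  df$DATE <- as.POSIXct(format(df$date, format="%Y-%m-%d"))
  # Attach that day's weather readings
  df <- inner_join(df, weather, by="DATE")
  return(df)
}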

uber4 <- timefunc(uber4, uber4$Date.Time, "%m/%d/%Y %H:%M:%S")
citi4 <- timefunc(citi4, citi4$starttime, "%Y-%m-%d %H:%M:%S")
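
The steps inside timefunc can be tried on a few rows without the full datasets. The sketch below is a base R stand-in, not the code used in the app: it swaps timeDate's isWeekend for a weekday check and dplyr's inner_join for merge(), and the pickups, the weather values, and the name add_time_columns are all made up for illustration.

```r
# Hypothetical pickups; timestamps and coordinates are invented.
pickups <- data.frame(
  Date.Time = c("4/4/2014 08:15:00", "4/5/2014 23:40:00", "4/6/2014 11:05:00"),
  Lat = c(40.769, 40.727, 40.741),
  Lon = c(-73.955, -73.991, -74.004),
  stringsAsFactors = FALSE
)
weather_toy <- data.frame(
  DATE = as.POSIXct(c("2014-04-04", "2014-04-05", "2014-04-06"), tz = "UTC"),
  PRCP = c(0, 0, 33)
)

add_time_columns <- function(df, column, fmt) {
  stamp <- as.POSIXct(column, format = fmt, tz = "UTC")
  df$date <- stamp
  # wday runs from 0 (Sunday) to 6 (Saturday)
  df$weekend <- as.POSIXlt(stamp)$wday %in% c(0, 6)
  df$time <- format(stamp, "%H:%M")
  # Truncate to midnight so the key matches the daily weather table
  df$DATE <- as.POSIXct(format(stamp, "%Y-%m-%d"), tz = "UTC")
  merge(df, weather_toy, by = "DATE")  # keeps only matching days
}

result <- add_time_columns(pickups, pickups$Date.Time, "%m/%d/%Y %H:%M:%S")
print(result[, c("DATE", "time", "weekend", "PRCP")])
```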

Global Summaries

For the plots on the third tab of the app, I needed a way to look at global ridership. For this, I created a function called "makeridership." It collapses a month of pickups into daily ride counts, carries along each day's precipitation and temperature, and appends the result to the running ridership table. Then the important columns are added: weekday vs. weekend, drizzly days, stormy days, hot days, and month.

makeridership <- function(dataset, ridership){
  summarise(group_by(dataset, DATE),
            count=n(),
            PRCP=mean(PRCP),
            TMAX=mean(TMAX),
            TMIN=mean(TMIN)) %>%
    mutate(., weekend = sapply(.$DATE, isWeekend)) %>%
    rbind(ridership, .)
}

uberridership <- makeridership(uber5, uberridership)
# ...
citiridership <- makeridership(citi5, citiridership)
# ...

citiridership <- mutate(citiridership, drizzle = PRCP > 0 & PRCP <100,
                        downpour = PRCP >=100, hot = TMAX>267, ride="Citi")
uberridership <- mutate(uberridership, drizzle = PRCP > 0 & PRCP <100,
                        downpour = PRCP >=100, hot = TMAX>267, ride="Uber")
ridership1 <- rbind(uberridership, citiridership)
ridership1$DATElt <- as.POSIXlt(ridership1$DATE)
ridership1$mon <- ridership1$DATElt$mon+1
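
The cutoffs of 100 and 267 are easier to read once units are considered. The post does not state the units, so the sketch below assumes the NOAA export uses tenths of a millimeter for PRCP and tenths of a degree Celsius for TMAX. The helper names are hypothetical and the input values are invented.

```r
# Assumed units: PRCP in tenths of mm, TMAX in tenths of degrees Celsius.
tenths_mm_to_inches <- function(x) x / 10 / 25.4
tenths_c_to_f <- function(x) (x / 10) * 9 / 5 + 32

# Same thresholds as the mutate() calls above, applied to plain vectors.
classify_day <- function(prcp, tmax) {
  data.frame(
    drizzle  = prcp > 0 & prcp < 100,
    downpour = prcp >= 100,
    hot      = tmax > 267
  )
}

round(tenths_mm_to_inches(100), 2)  # downpour cutoff in inches
round(tenths_c_to_f(267), 1)        # hot-day cutoff in Fahrenheit
flags <- classify_day(prcp = c(0, 15, 120), tmax = c(250, 270, 300))
print(flags)
```

Under that assumption, a "downpour" is a day with at least 10 mm (roughly 0.39 inches) of precipitation, and a "hot" day is one whose high exceeds 26.7 °C, or about 80 °F.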

Finally, with cleaned data, we can summarize the number of rides taken on each type of day. We are ready to graph global data.

ridership1 <- select(ridership1, count, ride, mon)
ridership1 <- group_by(ridership1, ride, mon) %>% summarise(., sum(count))
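
For readers who want to check the monthly roll-up without dplyr, here is a small base R sketch using aggregate() on invented counts. It also names the total column explicitly; in the dplyr version above, the unnamed summary ends up in a column literally called `sum(count)`, which writing `summarise(total = sum(count))` would avoid.

```r
# Invented daily counts, already tagged with ride type and month number.
daily <- data.frame(
  ride  = c("Uber", "Uber", "Uber", "Citi", "Citi"),
  mon   = c(4, 4, 5, 4, 5),
  count = c(100, 150, 300, 80, 120)
)

# Sum rides per service per month, then give the result a clear name.
monthly <- aggregate(count ~ ride + mon, data = daily, FUN = sum)
names(monthly)[names(monthly) == "count"] <- "total"
monthly <- monthly[order(monthly$ride, monthly$mon), ]
print(monthly)
```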

Map Object

The only other work before we move on to the Shiny app is to create the map object that we will lay the heat map over. ggmap's "get_map" function allows specification of any location in the world. For a central view of Manhattan, I've chosen Columbus Circle.

map <- get_map(location = "Columbus Circle", zoom = 12, 
               source = "stamen", scale = c(1200,1200), maptype = "toner") 
map1 <- ggmap(map, extent="normal", fullpage = TRUE)

Heat Map Function

The heatmap is created by calling "mapfunc," which layers density contours over the base map created above. That function lives in another R file called "helpers." It draws a stat_density2d plot using the longitude and latitude from the monthly datasets.

mapfunc <- function(df){ 
   m <- map1 + geom_density2d(data=df, aes(x=Lon, y=Lat), 
                              color = "grey40") + 
        stat_density2d(data=df, aes(x=Lon, y=Lat, fill=..level.., 
                                    alpha=..level..), 
                       size = 1, bins = 16, geom='polygon') + 
        scale_fill_gradient(low = "green", high = "red") + 
        scale_alpha(range = c(0.00, 0.25), guide = FALSE) + 
        theme(legend.position = "none", axis.title = element_blank(), 
              text = element_text(size = 12))
   print(m)
}
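
To see what the density layer is estimating, the sketch below computes a two-dimensional kernel density directly with kde2d from the MASS package, which is the estimator ggplot2's stat_density2d builds on. The coordinates are synthetic (a tight cluster plus scattered background points), so the result says nothing about real ridership; it only shows how a hot spot emerges from raw longitude and latitude pairs.

```r
library(MASS)  # a recommended package included with standard R installs

set.seed(42)
# Synthetic pickups: a dense cluster plus scattered background points.
lon <- c(rnorm(400, mean = -73.96, sd = 0.004), runif(100, -74.02, -73.93))
lat <- c(rnorm(400, mean = 40.77, sd = 0.004), runif(100, 40.70, 40.82))

# Estimate density on a 50 x 50 grid; z[i, j] is the density at x[i], y[j].
dens <- kde2d(lon, lat, n = 50)

# Locate the grid cell with the highest estimated density.
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
peak_lon <- dens$x[peak[1]]
peak_lat <- dens$y[peak[2]]
cat(sprintf("Densest cell near lon %.3f, lat %.3f\n", peak_lon, peak_lat))
```

The filled polygons in the heat map are contour bands of a surface like dens$z, with the "bins" argument controlling how many bands are drawn.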

Shiny

Shiny apps have two main components, the ui or "user interface" and the server. Think of these pieces as the pretty things you tap on your phone and whatever on earth is happening inside there that makes those pretty things work. Below is the code for a few of the selectors on the "Heat Map" page. These selectors don't actually filter anything. They are merely buttons, like the keys on your keyboard. However, when activated, they talk to other code within the system which executes a command of sorts. For example, in the ui there are "radioButtons" to select Uber or Citibike, a "selectInput" for the month, and a "sliderInput" for the time of day.

UI

radioButtons("transportation", label = h3("Pick your ride"), 
             choices = list("Uber", "Citibike"), 
             selected = "Uber"), 

selectInput("month", label = h3("Pick a month"), 
            choices = list("April","May","June","July","August","September"), 
                           selected = "April"), 

sliderInput("time", label = h3("Time of Day"), min = 0, max = 24, 
            value = c(0,24)), 

actionButton("goButton","Map me!")

The Server

On the server, each of those selectors functions as a filter. First, the transportation button selects the dataset, "ubersample" or "citisample."

ride <- reactive({ switch(input$transportation,
 "Uber" = ubersample,
 "Citibike" = citisample) })

 

Then, our dataset--now called ride()--is pushed through a series of filters. The month filter searches the column in our dataset named "mon" and selects only rows whose value matches the chosen month (4 for April). The time filter has more moving parts, but functions similarly: the "time" column of our dataset is searched and rows are selected where time is greater than or equal to the minimum input time and less than or equal to the maximum input time. Finally, when you click the "goButton," our filtered data is pushed through the plotting function, and the app is populated with your specific map.

ridefilter2 <- reactive({
 # Month filter
 if (input$month == "April") {
 ridefilter <- ride() %>% filter(., mon == "4")
 }
 # ...
 
 # Time filter
 mintime <- input$time[1]
 maxtime <- input$time[2]
 ridefilter <- ridefilter %>% filter(., time >= mintime, time <= maxtime)
 # ...

 # Return the filtered rows as the value of the reactive
 ridefilter
})

 heatmap <- eventReactive(input$goButton, {
 mapfunc(ridefilter2())
 })
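
The month and time filters described above can be tried outside Shiny. The sketch below is one possible approach under two assumptions: that "mon" holds month numbers and that "time" holds "HH:MM" text, as the output of timefunc suggests. Because comparing such text directly against the slider's numeric hours would be a text comparison in R, the sketch converts to decimal hours first. The function names and sample rows are hypothetical.

```r
# Hypothetical sample of pickups with the columns the filters rely on.
rides <- data.frame(
  mon  = c(4, 4, 5, 5, 5),
  time = c("06:30", "14:10", "06:05", "09:45", "23:59"),
  stringsAsFactors = FALSE
)

# Convert "HH:MM" text to decimal hours, e.g. "06:30" becomes 6.5.
to_hours <- function(hhmm) {
  parts <- strsplit(hhmm, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60,
         numeric(1))
}

# month_name mirrors input$month; time_range mirrors input$time.
filter_rides <- function(df, month_name, time_range) {
  month_number <- match(month_name, month.name)  # "May" maps to 5
  hours <- to_hours(df$time)
  keep <- df$mon == month_number &
    hours >= time_range[1] & hours <= time_range[2]
  df[keep, , drop = FALSE]
}

may_morning <- filter_rides(rides, "May", c(6, 7))
print(may_morning)
```

Looking up the month with match() against the built-in month.name vector replaces a chain of one if block per month with a single line.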

So, it's not a single function that puts the heat map on the page, but the interaction between the UI and the server--selecting data sets, filtering rows, and compiling the remaining data into a beautiful image.



 

About Author

Jake Lehrhoff

Jake Lehrhoff is a man of many hats. Currently, he is an analyst at Spotify, helping to improve an already amazing product. Previously, he spent six years teaching middle school English and chairing the department at a school...