Tokyo Scrappy Venues: Tokyo Gig Guide Web Scraping Project

Posted on Apr 28, 2019

Introduction and Motivations

For decades, the Tokyo area has been renowned for its unique, heterogeneous, and dynamic musical landscape. Such a reputation was one of the factors that attracted me to move to Japan, where I lived and worked for 3.5 years. 

That said, the region is vast...and the public tends to be fragmented--i.e., loyal to particular venues, artists, and/or micro-genres. As a result, there are few bridges among local musical communities, and one loses a sense of the broader picture of the tendencies and developments comprising the Tokyo scene.

Snapshot of map indicating concentrations of venues per neighborhood in central Tokyo. Interactive map here.

The Tokyo Gig Guide Archive

In an effort to gain insights into these tendencies and developments, I decided to scrape the Tokyo Gig Guide, an event listing to which artists, curators, festival producers, etc., can contribute information regarding upcoming music events in the Tokyo area. The above link leads to an archive of ca. 21,000 events occurring between 2008 and 2019. The listing includes data from ca. 700 "live houses"--ranging from major multi-day festival sites to tiny out-of-the-way bars--distributed over ca. 200 neighborhoods.

As is depicted in the figures below, the main pages convey event title, venue, and genre categories. Clicking on each event hyperlink leads to a page specifying date (in standardized year-month-day format), start-time, venue address and area, closest train station, and (in some cases) advance and/or door ticket price, as well as venue and access map URLs. 

Tokyo Gig Guide archive main page.


Event date, time, and door price information.


As above, but with advance ticket price indicated.


Venue website, address, neighborhood, access map URL, and closest train station information.

Given the quantity of pages to scrape (419 main pages, each displaying 50 events--ca. 21,370 pages in total, counting both main and individual event pages), Scrapy was employed to extract site data, which was exported to three CSV files. One contained information for recent events (occurring from 2017-2019), another was reserved for historical data (ca. 2008-2010), and a third for everything in between.

Primary Research Questions

Once the data was extracted and compiled, the following questions guided the subsequent analysis phase of the project: 

  1. What have been the most popular tags (genre categories) in recent years, compared to a decade ago?
  2. Which venues and neighborhoods of the Tokyo metro area have been the most active recently and historically?
  3. Which months are typically the densest with respect to number of events? In which year(s) have the greatest number of festivals been presented?
  4. Which neighborhoods, venues, and genres have featured the most expensive events? What have been typical price ranges for tickets?

Data Subsets and Tables 

As was mentioned above, the data was segmented into three main tables, corresponding to the first 100, last 100, and middle 219 main pages of the archive, respectively. These three tables were concatenated for the purposes of time series and ticket price analysis. For comparisons between recent and historical neighborhood, venue, and genre values, it was most practical to retain separate data frames for the 2017-2019 and 2008-2010 observations.
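The page arithmetic behind this segmentation can be checked with a short sketch. The function name `split_listing_pages` and the 1-based page numbering are assumptions made for illustration; this is not the project's actual code.

```python
def split_listing_pages(total_pages=419, head=100, tail=100):
    """Split 1-based listing-page numbers into three contiguous batches."""
    if head + tail > total_pages:
        raise ValueError("head and tail batches would overlap")
    first = range(1, head + 1)
    middle = range(head + 1, total_pages - tail + 1)
    last = range(total_pages - tail + 1, total_pages + 1)
    return first, middle, last


first, middle, last = split_listing_pages()
print(len(first), len(middle), len(last))  # 100 219 100

# Upper bound on pages to fetch: one page per event plus the listing pages.
# The final listing page may hold fewer than 50 events.
events_per_page = 50
total_requests = 419 * events_per_page + 419
print(total_requests)  # 21369
```

The batch sizes sum to the 419 main pages, and the request total agrees with the figure of ca. 21,370 pages quoted earlier.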

Because longitude and latitude coordinates of neighborhoods and venues were not included in the Tokyo Gig Guide dataset, the table containing all events was merged with another containing geographic and postal data for Japan (downloaded from this site). This was by no means the optimal solution, as the geographic/postal dataset was not comprehensive with respect to Tokyo area neighborhoods. However, in the interest of visualizing the relative concentrations and spread of venues and events across the region, the combined table proved to be satisfactory (see interactive map links above and in EDA Part 2 section below).
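A merge of this kind amounts to a left join on the neighborhood name. The sketch below uses plain dictionaries rather than data frames. The field names (`area`, `lat`, `lon`), the sample rows, and the rough coordinates are all hypothetical and serve only to show how unmatched neighborhoods surface.

```python
def attach_coordinates(events, geo_lookup):
    """Left join: keep every event, adding coordinates where the area is known."""
    merged = []
    for event in events:
        lat, lon = geo_lookup.get(event["area"], (None, None))
        merged.append({**event, "lat": lat, "lon": lon})
    return merged


# Hypothetical sample rows; coordinates are approximate and illustrative only.
events = [
    {"title": "Show A", "area": "Shibuya"},
    {"title": "Show B", "area": "Koenji"},
    {"title": "Show C", "area": "Unlisted-cho"},
]
geo_lookup = {"Shibuya": (35.66, 139.70), "Koenji": (35.705, 139.65)}

merged = attach_coordinates(events, geo_lookup)
unmatched = sorted({row["area"] for row in merged if row["lat"] is None})
print(unmatched)  # ['Unlisted-cho']
```

Listing the unmatched areas makes the coverage gap in the geographic table explicit, so that events are not silently dropped from a map.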

In addition, a data frame consisting exclusively of festivals was extracted from the Tokyo Gig Guide data and written to a CSV for the purposes of time series and ticket price analysis, as well as to serve as a general reference beyond the immediate context of this project. 

EDA Part 1: Geographic Area and Venue Rankings/Relationships for Recent vs. Historic Events 

Here are the ten neighborhoods offering the greatest number of events from 2017-2019 and from 2008-2010, respectively: 

For both periods, Shibuya is by far the most active area...but in the historical data, it is more heavily weighted relative to other areas.

The notorious "Shibuya Scramble" intersection in central Tokyo.

Similar comparison, but for venues: 

In both graphs, it is evident that one venue dominates all others in terms of event quantity (U-hA in Koenji in the recent data, O-Nest in Shibuya in the 2008-2010 data). However, with the exceptions of Super Deluxe and Fever, there are no shared venues between the 2008-2010 and 2017-2019 plots.

Is it the case that, for active areas such as Shibuya and Koenji, there are significant concentrations of venues, or do a few venues in these neighborhoods frequently post to the Tokyo Gig Guide? 

The above bar plots make clear that areas such as Shibuya, Shimokitazawa, and Koenji have been host to a significant number of live houses, of which a few have posted frequently to the Tokyo Gig Guide. 

EDA Part 2: Genre Category 

Event distribution per venue in central Tokyo. Interactive map here.

Data Preprocessing 

In many instances, multiple genre categories have been assigned to a particular event. Therefore, it was necessary to reshape the tables in question, such that only one genre would be represented in a given row. 
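The reshaping described here is a wide-to-long "explode" operation. A minimal sketch follows, assuming that genres arrive as a single comma-separated string per event. That storage format and the field names are assumptions, not details confirmed by the dataset.

```python
def one_genre_per_row(rows, sep=","):
    """Emit one output row per (event, genre) pair."""
    long_rows = []
    for row in rows:
        for genre in row["genres"].split(sep):
            genre = genre.strip()
            if genre:  # skip empty fragments such as a trailing separator
                long_rows.append({"title": row["title"], "genre": genre})
    return long_rows


# Hypothetical events for illustration.
events = [
    {"title": "Show A", "genres": "Improvised, Noise"},
    {"title": "Show B", "genres": "Indie"},
]
long_rows = one_genre_per_row(events)
print(len(long_rows))  # 3
```

After this step an event with several tags counts once per tag, so genre totals can exceed the number of events.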

2017-2019 vs. 2008-2010 Comparisons

As is illustrated below, the distributions of frequently occurring genres for 2017-2019 vs. 2008-2010 differ significantly. (This is likely due in part to live house lifespans: smaller venues in particular may have closed or opened within the period encapsulated by the archive dataset.)

Whereas "Improvised" leads for 2017-2019, "Indie" is the most popular tag for the historical data. For the former, "Indie" falls at place number 10. Similarly, "Improvised" is ranked number 9 for the 2008-2010 period.  

The "Indie-Improvised Cross-Fade" 

When did the ranking for the "Indie" tag begin to decline and the "Improvised" tag increase? According to the line graph below, there was a "cross-fade" in 2012-2013, resulting in a permanent shift (thus far): 
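One way to locate such a crossover is to count tag occurrences per year and find the first year in which the rising tag outnumbers the falling one. The sketch below runs on invented toy counts, not on the scraped data, and the function names are hypothetical.

```python
from collections import Counter


def yearly_tag_counts(rows, tag):
    """Count occurrences of `tag` per year in (year, genre) pairs."""
    return Counter(year for year, genre in rows if genre == tag)


def first_crossover_year(rows, rising, falling):
    """Return the first year in which `rising` strictly outnumbers `falling`."""
    up = yearly_tag_counts(rows, rising)
    down = yearly_tag_counts(rows, falling)
    for year in sorted(set(up) | set(down)):
        if up[year] > down[year]:
            return year
    return None


# Invented toy data, chosen only to exercise the function.
toy_rows = (
    [(2011, "Indie")] * 3 + [(2011, "Improvised")] * 1
    + [(2012, "Indie")] * 2 + [(2012, "Improvised")] * 2
    + [(2013, "Indie")] * 1 + [(2013, "Improvised")] * 3
)
print(first_crossover_year(toy_rows, "Improvised", "Indie"))  # 2013
```

Whether a shift is "permanent" would still need checking against all later years, which this helper does not do.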

Venue Event vs. Genre Distributions

As is indicated in the scatterplot below, the number of events per venue has a strong positive correlation with the number of genres represented by each venue for both the 2008-2010 (r ≈ 0.912) and 2017-2019 (r ≈ 0.814) data (blue dots and red dots, respectively).  (That said, the correlation is not as strong as I had initially assumed.)  
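For reference, the r values quoted above are Pearson correlation coefficients. The following sketch computes one from scratch on invented per-venue counts. The input lists are hypothetical and do not reproduce the reported values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented counts: events per venue and distinct genres per venue.
events_per_venue = [10, 20, 30, 40]
genres_per_venue = [2, 4, 6, 8]
r = pearson_r(events_per_venue, genres_per_venue)
print(round(r, 3))  # 1.0
```

Note that a venue's genre count cannot exceed its tag count, so some positive correlation with event count is expected by construction.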

Aside: Genre Word Clouds 

To further illustrate the differences, word clouds were generated for recent and historical data subsets (top and bottom, respectively):  

EDA Part 3: Time Series 

The following graph indicates total numbers of events per month (tallied over the span of the entire dataset): 

Clearly, October and November are top-ranking in this regard, while January and August are lowest-ranking. As major holidays in Japan during which people tend to return to their hometowns occur in January and August (New Year's and Obon, respectively), it is to be expected that these months have been the least active. 

But how have these monthly activity levels varied from one year to the next? As is conveyed in the box plot below, the IQRs (interquartile ranges) for all months except May and August are relatively wide. Therefore, net monthly event total is not a reliable metric in isolation.
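The per-month IQR can be sketched as follows: count events per (year, month), then take the spread of those yearly counts for each calendar month. The dates are invented, and the use of `statistics.quantiles` with `method="inclusive"` is one choice of quartile convention among several. It is not necessarily the one behind the box plot.

```python
from collections import Counter, defaultdict
from statistics import quantiles


def monthly_iqr(dates):
    """Map month ('01'..'12') to the IQR of its yearly event counts.

    `dates` is an iterable of 'YYYY-MM-DD' strings. Months observed in
    fewer than two years are skipped, and a year with zero events in a
    month contributes no count at all (a limitation of this sketch).
    """
    per_year_month = Counter((d[:4], d[5:7]) for d in dates)
    counts_by_month = defaultdict(list)
    for (_year, month), count in per_year_month.items():
        counts_by_month[month].append(count)
    iqr = {}
    for month, counts in counts_by_month.items():
        if len(counts) >= 2:
            q1, _q2, q3 = quantiles(counts, n=4, method="inclusive")
            iqr[month] = q3 - q1
    return iqr


# Invented dates: August is stable across years, October is not.
toy_dates = [
    "2017-08-05", "2017-08-06", "2018-08-05", "2018-08-06",
    "2017-10-01", "2018-10-01", "2018-10-02", "2018-10-03",
]
iqr = monthly_iqr(toy_dates)
print(iqr["08"] < iqr["10"])  # True
```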

The figures below depict monthly event distributions by area and genre (respectively) for the 2017-2019 period. 

It appears that Shibuya has been most active in September, and the "Improvised" genre most prominent in October. 

What about the number of festivals per year listed in the Tokyo Gig Guide?

According to the above histogram, there was a peak in 2012, as opposed to an equal distribution across all documented years. 

EDA Part 4: Ticket Price Analyses 

Data Cleaning/Preprocessing

Unlike the area and genre values, there was no standard format for ticket price data. A given entry might contain a number exclusively (e.g., 5000), possibly with an intervening comma (e.g., 11,000). But in most cases, additional text was included (e.g., "3000 yen + 1 free drink" or "donation plus ¥500"). Free events were usually denoted as "free", or some variant thereof (e.g., "Free!"). Furthermore, price was not a required field, resulting in numerous missing values.

As such, it was necessary to extract numerical values from the texts, to replace "free" indications with "0", and to decide how to deal with NaN values (they were ultimately replaced with the column mean).
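The cleaning steps above can be sketched with a regular expression, using the example strings quoted earlier. The function names and the rule of taking the first number found are assumptions made for illustration. The project's actual parsing rules may differ.

```python
import re
from statistics import mean

# A digit followed by any run of digits and thousands separators.
_NUMBER = re.compile(r"\d[\d,]*")


def parse_price(text):
    """Return the price in yen, 0 for free events, or None if unknown."""
    if text is None:
        return None
    match = _NUMBER.search(text)
    if match:
        # Numbers win over the word "free" (e.g. "3000 yen + 1 free drink").
        return int(match.group().replace(",", ""))
    if "free" in text.lower():
        return 0
    return None


def fill_missing_with_mean(prices):
    """Replace None entries with the mean of the known prices."""
    known = [p for p in prices if p is not None]
    if not known:
        return list(prices)
    fill = mean(known)
    return [fill if p is None else p for p in prices]


raw_prices = ["5000", "11,000", "3000 yen + 1 free drink",
              "donation plus ¥500", "Free!", None]
parsed = [parse_price(p) for p in raw_prices]
print(parsed)  # [5000, 11000, 3000, 500, 0, None]
filled = fill_missing_with_mean(parsed)
```

Taking the first number is a simplification: an entry such as "1 drink + 2000" would be misread as 1, so a real pipeline would need further rules. Mean imputation also pulls the distribution toward its center, which is worth keeping in mind when reading the box plots.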

In addition, ticket price was represented by two variables: advance price and door price.  Due to the fact that, in many cases, only door price was indicated (or vice versa),  for each event, the maximum value of these variables was utilized as the reference price when performing calculations and analyses. 
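Taking the larger of the two prices while tolerating a missing value on either side can be expressed as below. The function name is hypothetical, and the sketch assumes missing prices are represented as `None` at this stage, i.e., before any mean imputation.

```python
def reference_price(advance, door):
    """Return the larger of the two prices, ignoring missing (None) values."""
    candidates = [p for p in (advance, door) if p is not None]
    return max(candidates) if candidates else None


print(reference_price(2500, 3000))  # 3000
print(reference_price(None, 2000))  # 2000
print(reference_price(None, None))  # None
```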

(N.b.: 100 JPY ≈ 1 USD.) 

At first blush, it appears that events in Kudanshita, Ebisu, and Shibuya assigned the "Rock," "Pop," "Indie," "Festival," "Mixed Genre," and "Electronic" genre categories have had substantially higher ticket prices than those in other areas and genres. However, it is worth investigating the data further to gain an understanding as to what is actually influencing these results:

There are six outlier shows with ticket prices in the range of 34,000-100,000 JPY (ca. 340-1000 USD). These include performances by celebrity acts (e.g., Paul McCartney and Björk). Two further listings are in the range of 2200-2500 JPY. All other events fall under ¥20,000. When the outliers (with prices greater than ¥20,000) are filtered out, the box plots for ticket price range by neighborhood and genre are as follows: 
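The effect of such a cutoff on summary statistics can be seen in a small sketch. The prices below are invented, and `summarize` is a hypothetical helper. Only the ¥20,000 threshold comes from the text.

```python
from statistics import mean, median


def summarize(prices, cap=None):
    """Summarize prices, optionally dropping values greater than `cap`."""
    kept = [p for p in prices if cap is None or p <= cap]
    return {"n": len(kept), "mean": mean(kept), "median": median(kept)}


# Invented prices in yen, with one celebrity-scale outlier.
toy_prices = [0, 2000, 2500, 3000, 3500, 100000]

with_outliers = summarize(toy_prices)
without_outliers = summarize(toy_prices, cap=20000)
print(with_outliers["mean"], with_outliers["median"])        # 18500 2750.0
print(without_outliers["mean"], without_outliers["median"])  # 2200 2500
```

In this toy case a single extreme value moves the mean far more than the median, which is why medians and box plots are the more informative view of such data.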

"Festival" has by far the widest IQR, but is not strictly speaking a unified musical genre. The median prices by area and genre do not vary significantly. 

The mean ticket price for all events (including outliers) is ca. 3,211 JPY. The median is ca. 2,990 JPY.


Conclusions

From the above analyses, it is evident that the rankings of genre tags have shifted over the past decade. The most significant shift occurred in 2012-2013, at which time the genre "Indie" declined in appearances, and "Improvised" increased. By contrast, there have been few changes with respect to the most active areas (neighborhoods) in Tokyo...but the respective distributions of events per area and the most active venues in those areas are significantly different.

There is a strong positive correlation between number of events and number of genres represented by a given venue.

October and November have been the most active months. However, based upon each month's IQR with respect to number of events, there are sizable differences from one year to the next. The IQRs for May and August are the narrowest.

In 2012, the peak number of festivals was recorded.

The most expensive posted event was a Paul McCartney concert in Kudanshita (¥100,000 = ca. 1000 USD). The average ticket price for the entire dataset was ¥3,211 (ca. 32 USD). Ticket prices for most events fell under ¥20,000, and there seemed to be no strong correlation between area or genre and price once outliers were filtered out. Caveat: given the quantity of missing ticket price data, it is challenging to draw solid conclusions in this regard.

Future Work 

Salient future objectives include: 

  1. integrating Tokyo Gig Guide dataset with other related event listing data; 
  2. performing rigorous natural language processing analysis of event titles; 
  3. extending the date range to a ca. 50 year period; 
  4. investigating machine learning applications, such as ticket price prediction or genre classification models. 

External Links 

For code, visualizations, and other supporting material, please visit the project GitHub repository.

About Author

Alexander Sigman

With a unique background in music composition + technology, cognitive science, and data science and extensive experience in machine learning R & D and software engineering, Alex Sigman has a passion for adding value to data, gaining actionable...