Tokyo Scrappy Venues: Tokyo Gig Guide Web Scraping Project
Project GitHub | LinkedIn: Niki Moritz Hao-Wei Matthew Oren
Introduction and Motivations
For decades, the Tokyo area has been renowned for its unique, heterogeneous, and dynamic musical landscape. Such a reputation was one of the factors that attracted me to move to Japan, where I lived and worked for 3.5 years.
That said, the region is vast, and the public tends to be fragmented, i.e., loyal to particular venues, artists, and/or micro-genres. As a result, there are few bridges among local musical communities, and one loses a sense of the broader picture of the tendencies and developments comprising the Tokyo scene.
In an effort to gain insights into these tendencies and developments, I decided to scrape the Tokyo Gig Guide, an event listing to which artists, curators, festival producers, etc., can contribute information regarding upcoming music events in the Tokyo area. The above link leads to an archive of ca. 21,000 events occurring between 2008 and 2019. The listing includes data from ca. 700 "live houses"--ranging from major multi-day festival sites to tiny out-of-the-way bars--distributed over ca. 200 neighborhoods.
As is depicted in the figures below, the main pages convey event title, venue, and genre categories. Clicking on each event hyperlink leads to a page specifying date (in standardized year-month-day format), start-time, venue address and area, closest train station, and (in some cases) advance and/or door ticket price, as well as venue and access map URLs.
Given the quantity of pages to scrape (419 main pages, each displaying 50 events, plus one detail page per event: ca. 21,370 pages in total), Scrapy was employed to extract site data, which was exported to three CSV files. One contained information for recent events (occurring between 2017 and 2019), another was reserved for historical data (ca. 2008-2010), and a third for everything in between.
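The page total quoted above can be checked with simple arithmetic. This is a minimal sketch: the constant names are illustrative, and 50 events per main page is treated as an upper bound, since the final main page may not be full.

```python
# Figures quoted in the text; the constant names are illustrative.
MAIN_PAGES = 419
EVENTS_PER_MAIN_PAGE = 50

# One detail page per listed event (an upper bound if the last main page is not full).
event_pages = MAIN_PAGES * EVENTS_PER_MAIN_PAGE

# Main listing pages plus event detail pages gives the number of pages to request.
total_pages = MAIN_PAGES + event_pages

print(event_pages, total_pages)
```

The result (21,369) agrees with the "ca. 21,370" figure, and the event-page count is consistent with an archive of ca. 21,000 events.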
Primary Research Questions
Once the data was extracted and compiled, the following questions guided the subsequent analysis phase of the project:
- What have been the most popular tags (genre categories) in recent years, compared to a decade ago?
- Which venues and neighborhoods of the Tokyo metro area have been the most active recently and historically?
- Which months are typically the densest with respect to number of events? In which year(s) have the greatest number of festivals been presented?
- Which neighborhoods, venues, and genres have featured the most expensive events? What have been typical price ranges for tickets?
Data Subsets and Tables
As mentioned above, the data was segmented into three main tables, corresponding to the first 100, last 100, and middle 219 main pages, respectively. These three tables were concatenated for the purposes of time series and ticket price analysis. For comparisons between recent and historical neighborhood, venue, and genre values, it was most practical to retain separate data frames for the 2017-2019 and 2008-2010 observations.
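The three-way segmentation can be sketched as a split of the main-page numbers. This is only an illustration under stated assumptions: the function name `split_pages` is hypothetical, and pages are assumed to be numbered 1 through 419. Which end of the archive holds the most recent events depends on the site's ordering and is not assumed here.

```python
def split_pages(total_pages, head=100, tail=100):
    """Split page numbers 1..total_pages into first, middle, and last batches."""
    pages = list(range(1, total_pages + 1))
    first = pages[:head]
    middle = pages[head:total_pages - tail]
    last = pages[total_pages - tail:]
    return first, middle, last


first, middle, last = split_pages(419)
print(len(first), len(middle), len(last))

# Concatenating the batches restores the full page range, mirroring the
# concatenation of the three tables for the time series analysis.
all_pages = first + middle + last
```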
Because longitude and latitude coordinates of neighborhoods and venues were not included in the Tokyo Gig Guide dataset, the table containing all events was merged with another containing geographic and postal data for Japan (downloaded from this site). This was by no means the optimal solution, as the geographic/postal dataset was not comprehensive with respect to Tokyo area neighborhoods. However, in the interest of visualizing the relative concentrations and spread of venues and events across the region, the combined table proved to be satisfactory (see interactive map links above and in EDA Part 2 section below).
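The merge described above behaves like a left join on neighborhood name: every event is kept, and neighborhoods missing from the geographic table simply end up without coordinates. The sketch below illustrates the idea with plain dictionaries. The field names, the sample events, and the rounded coordinates are all placeholders, not values from the project's tables.

```python
# Hypothetical event rows; "area" stands in for the neighborhood column.
events = [
    {"title": "Show A", "area": "Shibuya"},
    {"title": "Show B", "area": "Koenji"},
    {"title": "Show C", "area": "Unlisted Area"},
]

# Placeholder lookup table with approximate, rounded coordinates.
geo = {
    "Shibuya": (35.66, 139.70),
    "Koenji": (35.71, 139.65),
}


def attach_coordinates(events, geo):
    """Left-join events to coordinates, keeping events with no match."""
    merged = []
    for event in events:
        lat, lon = geo.get(event["area"], (None, None))
        merged.append({**event, "lat": lat, "lon": lon})
    return merged


merged = attach_coordinates(events, geo)

# Events whose neighborhood was not found in the geographic table.
unmatched = [row["title"] for row in merged if row["lat"] is None]
print(unmatched)
```

Listing the unmatched rows makes the coverage gap of the geographic dataset visible before any map is drawn.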
In addition, a data frame consisting exclusively of festivals was extracted from the Tokyo Gig Guide data and written to a CSV for the purposes of time series and ticket price analysis, as well as to serve as a general reference beyond the immediate context of this project.
EDA Part 1: Geographic Area and Venue Rankings/Relationships for Recent vs. Historic Events
Here are the ten neighborhoods offering the greatest number of events from 2017-2019 and from 2008-2010, respectively:
For both periods, Shibuya has been by far the most active area, but in the historical data it outweighs the other areas even more heavily.
Similar comparison, but for venues:
In both graphs, it is evident that one venue dominates all others in terms of event quantity (U-hA in Koenji in the recent data, O-Nest in Shibuya in the 2008-2010 data). Moreover, with the exceptions of Super Deluxe and Fever, no venues are shared between the 2008-2010 and 2017-2019 plots.
Is it the case that, for active areas such as Shibuya and Koenji, there are significant concentrations of venues, or do a few venues in these neighborhoods frequently post to the Tokyo Gig Guide?
The above bar plots make clear that areas such as Shibuya, Shimokitazawa, and Koenji have been host to a significant number of live houses, of which a few have posted frequently to the Tokyo Gig Guide.
EDA Part 2: Genre Category
In many instances, multiple genre categories have been assigned to a particular event. Therefore, it was necessary to reshape the tables in question, such that only one genre would be represented in a given row.
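The reshaping step amounts to "exploding" each multi-genre event into one row per genre. Here is a minimal sketch, assuming the genres arrive as a single comma-separated string. The separator, the field names, and the function name `explode_genres` are assumptions for illustration.

```python
# Hypothetical rows with several genres packed into one field.
events = [
    {"title": "Show A", "genres": "Indie, Rock"},
    {"title": "Show B", "genres": "Improvised"},
]


def explode_genres(events, sep=","):
    """Return one row per (event, genre) pair."""
    rows = []
    for event in events:
        for genre in event["genres"].split(sep):
            genre = genre.strip()
            if genre:  # skip empty fragments such as a trailing separator
                rows.append({"title": event["title"], "genre": genre})
    return rows


long_rows = explode_genres(events)
for row in long_rows:
    print(row["title"], row["genre"])
```

Note that after this reshaping an event with several tags is counted once per tag, so genre totals exceed the number of events.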
2017-2019 vs. 2008-2010 Comparisons
As is illustrated below, the distributions of frequently occurring genres for 2017-2019 vs. 2008-2010 differ significantly. (This is likely due in part to live house lifespans: smaller venues in particular may have closed or opened within the period encapsulated by the archive dataset.)
Whereas "Improvised" leads for 2017-2019, "Indie" is the most popular tag for the historical data. For the former, "Indie" falls at place number 10. Similarly, "Improvised" is ranked number 9 for the 2008-2010 period.
The "Indie-Improvised Cross-Fade"
When did the ranking for the "Indie" tag begin to decline and the "Improvised" tag increase? According to the line graph below, there was a "cross-fade" in 2012-2013, resulting in a permanent shift (thus far):
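Separately from the graph, the crossover year can be located by counting tag occurrences per year and finding the first year in which one tag overtakes the other. The sketch below uses a handful of made-up (year, genre) rows purely to demonstrate the logic. The data and the function names are hypothetical and do not reproduce the project's counts.

```python
from collections import Counter

# Made-up (year, genre) rows for illustration only.
rows = [
    (2011, "Indie"), (2011, "Indie"), (2011, "Improvised"),
    (2012, "Indie"), (2012, "Improvised"),
    (2013, "Improvised"), (2013, "Improvised"), (2013, "Indie"),
]


def yearly_counts(rows, tag):
    """Count how often a tag appears in each year."""
    return Counter(year for year, genre in rows if genre == tag)


def first_crossover(rows, rising, falling):
    """Return the first year in which `rising` strictly outnumbers `falling`."""
    up = yearly_counts(rows, rising)
    down = yearly_counts(rows, falling)
    for year in sorted(set(up) | set(down)):
        if up[year] > down[year]:
            return year
    return None


crossover = first_crossover(rows, "Improvised", "Indie")
print(crossover)
```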
Venue Event vs. Genre Distributions
As is indicated in the scatterplot below, the number of events per venue has a strong positive correlation with the number of genres represented by each venue for both the 2008-2010 (r ≈ 0.912) and 2017-2019 (r ≈ 0.814) data (blue dots and red dots, respectively). (That said, the correlation is not as strong as I had initially assumed.)
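For reference, the r values above are Pearson correlation coefficients, which can be computed from per-venue event and genre counts as sketched below. The two input lists are invented for illustration, and `pearson_r` is a hand-rolled helper rather than the function used in the project.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented per-venue totals: number of events and number of distinct genres.
events_per_venue = [5, 12, 30, 48, 90]
genres_per_venue = [2, 4, 6, 9, 11]

r = pearson_r(events_per_venue, genres_per_venue)
print(round(r, 3))
```

Because the number of distinct genres at a venue cannot exceed its number of events, some positive correlation is built in, which is worth keeping in mind when interpreting the coefficients.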
Aside: Genre Word Clouds
To further illustrate the differences, word clouds were generated for recent and historical data subsets (top and bottom, respectively):
EDA Part 3: Time Series
The following graph indicates total numbers of events per month (tallied over the span of the entire dataset):
Clearly, October and November are top-ranking in this regard, while January and August are lowest-ranking. As major holidays in Japan during which people tend to return to their hometowns occur in January and August (New Year's and Obon, respectively), it is to be expected that these months have been the least active.
But how have these monthly activity levels varied from one year to the next? As is conveyed in the box plot below, the IQRs (inter-quartile ranges) for all months except May and August are relatively wide. Therefore, net monthly event total is not a reliable metric in isolation.
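The per-month IQR can be computed by grouping yearly totals by calendar month and taking the difference between the third and first quartiles. The following sketch uses invented totals for two months only. Quartile conventions differ between libraries, so values from `statistics.quantiles` may not match a plotting library's box plot exactly.

```python
from collections import defaultdict
from statistics import quantiles

# Invented (year, month, number of events) records for illustration.
monthly_totals = [
    (2016, 5, 140), (2017, 5, 150), (2018, 5, 145), (2019, 5, 155),
    (2016, 10, 180), (2017, 10, 260), (2018, 10, 210), (2019, 10, 320),
]


def iqr_by_month(monthly_totals):
    """Return {month: IQR of that month's event totals across years}."""
    by_month = defaultdict(list)
    for _year, month, total in monthly_totals:
        by_month[month].append(total)

    result = {}
    for month, totals in by_month.items():
        q1, _median, q3 = quantiles(totals, n=4)
        result[month] = q3 - q1
    return result


iqrs = iqr_by_month(monthly_totals)
print(iqrs)
```

A narrow IQR (May in this toy data) means the month behaves similarly from year to year, whereas a wide one signals that a single total hides considerable variation.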
The figures below depict monthly event distributions by area and genre (respectively) for the 2017-2019 period.
It appears that Shibuya has been most active in September, and the "Improvised" genre most prominent in October.
What about the number of festivals per year listed in the Tokyo Gig Guide?
According to the above histogram, there was a peak in 2012, as opposed to an equal distribution across all documented years.
EDA Part 4: Ticket Price Analyses
Unlike the area and genre values, there was no standard format for ticket price data. A given entry might contain a number exclusively (e.g., 5000), possibly with an intervening comma (e.g., 11,000). But in most cases, additional text was included (e.g., "3000 yen + 1 free drink" or "donation plus ¥500"). Free events were usually denoted as "free", or some variant thereof (e.g., "Free!"). Furthermore, price was not a required field, resulting in numerous missing values.
As such, it was necessary to extract numerical values from the texts, to replace "free" indications with "0", and to decide how to deal with NaN values (they were ultimately replaced with the column mean).
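These cleaning steps can be sketched with a regular expression and a mean fill. This is a simplified illustration rather than the project's actual cleaning code: `parse_price` and `impute_mean` are hypothetical helpers, and the parser takes the first number it finds, which would misread an entry such as "1 drink + 3000 yen".

```python
import re
from statistics import mean


def parse_price(text):
    """Return a price in yen, 0 for free events, or None if nothing is parseable."""
    if text is None:
        return None
    cleaned = text.strip().lower()
    # Look for a number first, so "3000 yen + 1 free drink" is not treated as free.
    match = re.search(r"\d[\d,]*", cleaned)
    if match:
        return int(match.group().replace(",", ""))
    if "free" in cleaned:
        return 0
    return None


def impute_mean(prices):
    """Replace missing prices with the mean of the known ones."""
    known = [p for p in prices if p is not None]
    if not known:
        return list(prices)
    fill = mean(known)
    return [fill if p is None else p for p in prices]


raw = ["5000", "11,000", "3000 yen + 1 free drink", "donation plus ¥500", "Free!", None]
parsed = [parse_price(entry) for entry in raw]
filled = impute_mean(parsed)
print(parsed)
print(filled)
```

Mean imputation keeps the overall average unchanged but compresses the spread, so distribution-based comparisons on heavily imputed columns should be read with caution.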
In addition, ticket price was represented by two variables: advance price and door price. Due to the fact that, in many cases, only door price was indicated (or vice versa), for each event, the maximum value of these variables was utilized as the reference price when performing calculations and analyses.
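The reference price rule can be expressed as a small function that ignores whichever value is missing. This is a minimal sketch: the name `reference_price` is illustrative, and missing prices are assumed to be represented as `None`.

```python
def reference_price(advance, door):
    """Return the larger of the known prices, or None if both are missing."""
    known = [price for price in (advance, door) if price is not None]
    return max(known) if known else None


print(reference_price(2500, 3000))  # both prices known
print(reference_price(None, 2000))  # door price only
print(reference_price(1500, None))  # advance price only
```

Taking the maximum means the reference price generally reflects the door price when both are listed, since door tickets are usually the more expensive of the two.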
(N.b.: 100 JPY ≈ 1 USD.)
At first blush, it appears that events in Kudanshita, Ebisu, and Shibuya, and events assigned "Rock," "Pop," "Indie," "Festival," "Mixed Genre", and "Electronic" genre categories, have had substantially higher ticket prices than those in other areas and genres. However, it is worth investigating the data further to gain an understanding as to what is actually influencing these results:
There are six outlier shows with ticket prices in the range of 34,000-100,000 JPY (ca. 340-1000 USD). These include performances by celebrity acts (e.g., Paul McCartney and Björk). Two further listings are in the range of 2200-2500 JPY. All other events fall under ¥20,000. When the outliers (with prices greater than ¥20,000) are filtered out, the box plots for ticket price range by neighborhood and genre are as follows:
"Festival" has by far the widest IQR, but is not strictly speaking a unified musical genre. The median prices by area and genre do not vary significantly.
The mean ticket price for all events (including outliers) is ca. 3,211 JPY. The median is ca. 2,990 JPY.
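The effect of the ¥20,000 cutoff on these summary statistics can be illustrated as follows. The price list is invented for demonstration and does not come from the dataset. Only the threshold is taken from the text.

```python
from statistics import mean, median

OUTLIER_THRESHOLD = 20_000  # yen, the cutoff used in the text

# Invented reference prices, including two celebrity-act outliers.
prices = [0, 1500, 2500, 3000, 3500, 34000, 100000]

kept = [p for p in prices if p <= OUTLIER_THRESHOLD]
outliers = [p for p in prices if p > OUTLIER_THRESHOLD]

print("with outliers:   ", round(mean(prices)), median(prices))
print("without outliers:", round(mean(kept)), median(kept))
```

In this toy example the mean drops sharply once the outliers are removed while the median barely moves, which is why the median is the more robust summary for skewed price data.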
Conclusions
From the above analyses, it is evident that the rankings of genre tags have shifted over the past decade. The most significant shift occurred in 2012-2013, at which time the genre "Indie" declined in appearances, and "Improvised" increased. By contrast, there have been few changes with respect to most active areas (neighborhoods) in Tokyo...but the respective distributions of events per area and the most active venues in those areas are significantly different.
There is a strong positive correlation between number of events and number of genres represented by a given venue.
October and November have been the most active months. However, based upon each month's IQR with respect to number of events, there are sizable differences from one year to the next. The IQRs for May and August are the narrowest.
The number of festivals listed peaked in 2012.
The most expensive posted event was a Paul McCartney concert in Kudanshita (¥100,000 = ca. 1000 USD). The average ticket price for the entire dataset was ¥3,211 (ca. 32 USD). Ticket prices for most events fell under ¥20,000, and there seemed to be no strong correlation between area or genre and price once outliers were filtered out. Caveat: given the quantity of missing ticket price data, it is challenging to draw solid conclusions in this regard.
Salient future objectives include:
- integrating Tokyo Gig Guide dataset with other related event listing data;
- performing rigorous natural language processing analysis of event titles;
- extending the date range to a ca. 50-year period;
- investigating machine learning applications, such as ticket price prediction or genre classification models.
For code, visualizations, and other supporting material, please visit the project GitHub repository.
Author's LinkedIn profile: https://www.linkedin.com/in/alexander-sigman-6718b414/