A Traveler’s Guide to Broadway Musicals

Posted on Aug 16, 2018

Motivation: Broadway is one of the signatures of New York City. Statistics show that 13.8 million people attended a Broadway show during the 2017–2018 season (https://www.broadwayleague.com/press/press-releases/2017-2018-broadway-end-of-season-statistics/), roughly 1.6 times the population of NYC. The same statistics show that tourists accounted for ~60% of that attendance. Since tourists make up such a large share of the Broadway audience, it would be interesting to find out what they think of the shows. Are there any patterns, and can we use this information to guide future tourists? To explore these questions, I studied the reviews of some of the most popular Broadway musicals on Tripadvisor. Tripadvisor might not be the most comprehensive or professional website for Broadway reviews, but it is an ideal place to study travelers' real opinions of those shows, as local people do not normally post their reviews there. If you are traveling to NYC and considering taking in a Broadway show, those reviews might be helpful.



(1) Methodologies:

I used the Scrapy package for Python to do the web scraping. I chose ~10 of the most popular musicals on Broadway and collected the reviews along with some user information. For shows with over 5,000 reviews, I only grabbed half of the review items. I ended up scraping ~20,000 review items in total, which are the starting point for all of my analysis.


(2) Analysis

i. The reviewers

I first plotted the distributions of the reviewers' review counts and of the votes they received (a vote indicates that a reader found the review helpful). Since most of the reviewers are ordinary travelers, the voting is not affected by biases such as reputation. As we can see in the charts, most people published fewer than ten reviews and received very few endorsements. Only a small portion of them are active review writers. I defined a metric called review quality (number of helpful votes received / number of reviews written) to roughly quantify the impact of a particular reviewer. I then divided the reviewers into three groups (low, medium and high quality) and computed the mean of the ratings each group gave to the selected shows.

A clear trend came to light. There may be two reasons for it: either people tend to find reviews that contain some criticism more trustworthy, or people who comment actively tend to be on the picky side.
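The review quality metric and the grouping step can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the post: the field names, the sample records, and the two cutoffs separating low, medium and high quality are all assumptions, since the post does not state how the groups were split.

```python
from statistics import mean

# Made-up sample records; the field names are assumptions, not the scraped schema.
reviews = [
    {"user": "a", "helpful_votes": 0, "review_count": 3, "rating": 5},
    {"user": "b", "helpful_votes": 1, "review_count": 4, "rating": 5},
    {"user": "c", "helpful_votes": 10, "review_count": 10, "rating": 4},
    {"user": "d", "helpful_votes": 30, "review_count": 10, "rating": 3},
    {"user": "e", "helpful_votes": 8, "review_count": 4, "rating": 4},
]


def review_quality(helpful_votes, review_count):
    """Helpful votes received per review written (0.0 if nothing was written)."""
    return helpful_votes / review_count if review_count else 0.0


def quality_group(quality, low_cut=0.5, high_cut=1.5):
    """Label a reviewer using assumed cutoffs (hypothetical values)."""
    if quality < low_cut:
        return "low"
    if quality < high_cut:
        return "medium"
    return "high"


def mean_rating_by_group(records):
    """Average the ratings given by reviewers in each quality group."""
    ratings = {}
    for r in records:
        q = review_quality(r["helpful_votes"], r["review_count"])
        ratings.setdefault(quality_group(q), []).append(r["rating"])
    return {group: mean(values) for group, values in ratings.items()}


print(mean_rating_by_group(reviews))
```

With real data, the cutoffs could instead be chosen from the observed distribution of review quality, for example at its terciles.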




ii. Seasonal fluctuations:

Then I looked at the number of reviews by month. Clear patterns can be observed in the bar graph. After the holiday season, the number of reviews drops sharply in February. If we assume the number of reviews is correlated with attendance, this indicates that tourist attendance at Broadway shows hits bottom in February. It gradually picks up in the spring and finally peaks in July, as NYC is a popular destination for travelers during their summer vacations. Attendance in the second half of the year remains good, with some random fluctuations.



When looking at this graph, you may wonder whether there is a similar pattern in tourists' overall experience (e.g., how satisfied they were). Is there an optimal time to experience a Broadway show? I looked at the ratings in detail and found that the rating distributions do not fluctuate much across the year. This is good news for tourists: you can go to Broadway any time of the year and enjoy the same experience from the shows.
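Both seasonal views, the number of reviews per month and the rating distribution within each month, come down to simple counting. The sketch below assumes each review item carries a date and a star rating; the records and field names are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import date

# Made-up records; in practice these would come from the scraped review items.
reviews = [
    {"date": date(2018, 1, 5), "rating": 5},
    {"date": date(2018, 2, 11), "rating": 4},
    {"date": date(2018, 7, 3), "rating": 5},
    {"date": date(2018, 7, 19), "rating": 5},
    {"date": date(2017, 7, 30), "rating": 3},
]


def reviews_per_month(records):
    """Count reviews by calendar month (1-12), pooling all years together."""
    return Counter(r["date"].month for r in records)


def rating_share_by_month(records):
    """For each month, return the fraction of reviews at each star rating."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["date"].month][r["rating"]] += 1
    return {
        month: {stars: n / sum(c.values()) for stars, n in c.items()}
        for month, c in counts.items()
    }


print(reviews_per_month(reviews))
print(rating_share_by_month(reviews))
```

Comparing shares rather than raw counts keeps busy months such as July from dominating the comparison of rating distributions.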



You may have noticed that the overall ratings are pretty high. This is indeed one problem with the data: it suffers from strong "survivorship bias." These popular Broadway shows are the best of the best. Competition on Broadway stages is fierce. Only 20% of the new productions each year break even. Far fewer achieve success and survive to the next season and beyond. Therefore, the most popular shows must be outstanding in many ways, so that people traveling in from across the country or across the world are willing to spend their money and their vacation time to see them.

Another question I sought to answer was: do these reviews just contain words of praise without any insight?

Not really. I will show you some analysis of the reviews, simply using word clouds.

iii. Analysis of the reviews

We begin by running a word cloud for all the reviews.
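A word cloud is essentially a picture of word frequencies, so the numbers behind it can be sketched with the standard library alone. The stop-word list and the sample reviews here are assumptions made for illustration; a real word cloud tool ships a much longer stop-word list and handles the drawing.

```python
import re
from collections import Counter

# A tiny assumed stop-word list, for illustration only.
STOP_WORDS = {
    "the", "a", "an", "and", "was", "is", "it", "of", "to", "in",
    "were", "we", "i", "but", "so", "for", "our",
}


def word_frequencies(texts, stop_words=STOP_WORDS):
    """Count lower-cased words across all texts, skipping stop words and very short words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stop_words and len(w) > 2)
    return counts


# Made-up review texts.
sample_reviews = [
    "The story was moving and the cast was superb.",
    "Great music, great cast, and a story we will remember.",
]

# The most frequent words are the ones a word cloud would draw largest.
print(word_frequencies(sample_reviews).most_common(5))
```

Running the same function on the reviews of a single show, rather than on all reviews at once, gives the per-show key words.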


From this word cloud we can see a lot of key words, such as “performance,” “song,” “story,” “cast,” and so on. But it is not easy to see any pattern in it. Sometimes too much information is not all that informative. So we need to take a different approach. How about looking at individual shows?

We can do so with the musical Come from Away, a new production that landed on Broadway last March and has the highest rating on TripAdvisor. Based on true stories that happened in a small Canadian town, far away from the United States, in the week following 9/11, the musical makes us believe in humanity over hate in the darkest hours.



From this word cloud, we can see that “story” is the biggest key word in the reviews, indicating that what audiences appreciated most was the story of the show. “Music” and “cast” received a lot of attention too. The storytelling is its best part. With no props beyond a few tables and chairs and no elaborate stage sets or costumes, a dozen people vividly convey a warm story.

Next (I am not following the order of the overall ratings), let’s look at the longest-running musical currently on Broadway, The Phantom of the Opera. The word cloud shows that audience members are most fascinated by the music of the show. The show’s songs drew countless people into the world of musicals, including me. Besides music, surprisingly, people mentioned “seat” many times. I think it is probably because the Majestic Theater is a bigger theater. Where you sit really matters to the experience, so people tend to keep talking about it. In contrast to Come from Away, “story” is no longer among the key words; “cast” received less attention too. For this show, music outshines everything else.



The next word cloud comes from The Lion King. This is where things start to get more interesting.



Although the music of The Lion King is outstanding and the story is well known to everyone, those are not what people focus on. Costumes are the most commented-on component. I think this is indeed the key to its success, because the music and story are nothing new to audiences. But the spectacular costumes give the audience sitting in the theater, especially the kids (“kids” is another key word in the word cloud), a totally different experience from watching a movie. Besides costumes, we notice that “ticket” receives considerable attention, probably because its tickets are normally quite pricey.

Now let us move to Hamilton, the biggest Broadway hit in recent years. What do people talk about most for this show? If you think people most often talk about the music, history, story or even rap, you are wrong. Actually, “ticket” is the word that gets the most attention. In my opinion, Hamilton is truly a work of genius, but when people pay more attention to the tickets than to the show itself, it is not a good sign. Besides tickets, of course, audiences do like the music, the story, the cast and the performance, and all of them are highlighted in the word cloud.



Now let us have a look at the four word clouds together.



We can easily see that the four shows have different key words in their reviews. Thanks to the diversity of Broadway shows, theater lovers can always find something they like on stage. Diverse as they are, good music is the bottom line for good musicals, so you will find that “music” is a significant element in all of them.

Furthermore, if the show is in a bigger theater, people tend to mention “seat” more often, since it is a critical factor. Similarly, the more expensive the tickets are, the more often people talk about them. In the case of Come from Away, which is playing in a smaller theater with lower ticket prices, you do not see “seat” or “ticket” in the word cloud. That indicates to me that for that musical, people can focus more on the show itself, in contrast to The Phantom of the Opera, in which “seat” is of central concern, and Hamilton, in which “ticket” dominates everything else.

We have covered some overall patterns in the reviews. How about criticisms? We do see some low ratings in the previous bar graphs. I looked into the negative reviews (those with ratings of 1 or 2) of some shows. Here is one example:

The key words in low-rating reviews of Hamilton.


We can see that, besides the complaints about the ticket price, “understudy” was mentioned quite a few times. I do not think it is necessarily because the understudy did a bad job. It is natural for people to get upset when they do not see their favorite actors or actresses on stage. But when you factor in the very high price they paid, the disappointment grows to the point of exaggeration. I checked out some bad reviews and found that people often emphasize that they paid a fortune for the show but ended up watching understudies. So if you really care, do some homework on the cast schedules.
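Selecting the negative reviews and checking how often a term comes up can be sketched as follows. The records are made up, the field names are assumptions, and matching on the stem "understud" (so that both "understudy" and "understudies" count) is a choice made for this illustration.

```python
import re

# Made-up review items for one show.
reviews = [
    {"rating": 1, "text": "Paid a fortune and got two understudies."},
    {"rating": 2, "text": "Tickets were overpriced."},
    {"rating": 5, "text": "Worth every penny, even with an understudy."},
    {"rating": 1, "text": "The understudy was fine but I wanted the lead."},
]


def low_rating_reviews(records, max_rating=2):
    """Keep only the reviews rated 1 or 2 (the post's definition of negative)."""
    return [r for r in records if r["rating"] <= max_rating]


def mention_share(records, stem):
    """Fraction of reviews containing a word that starts with `stem` (case-insensitive)."""
    if not records:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(stem), re.IGNORECASE)
    hits = sum(1 for r in records if pattern.search(r["text"]))
    return hits / len(records)


negative = low_rating_reviews(reviews)
print(len(negative), mention_share(negative, "understud"), mention_share(negative, "ticket"))
```

Comparing the same share between negative reviews and all reviews would show whether a word is specific to complaints or simply common everywhere.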

In summary, what is the takeaway from this little study?

  1. Broadway shows are highly diverse in topics and features. You might not like every one of them, but there must be something for your particular taste. Do some homework before purchasing tickets. Also, if you want to see a particular actor or actress, check his or her schedule.
  2. If you come from away to NYC and have already spent a fortune on the flight and other things, my suggestion is not to try to save too much on the tickets. For a lot of theaters and shows, different seats will give you a totally different experience. You don’t want to be one of the people who end up posting "I should have bought a better seat" on TripAdvisor.

About Author

Zhenggang Xu

Zhenggang is currently a data science fellow at NYC Data Science Academy. He received his education in computational chemistry and worked in deep water exploration for a few years. He believes in numbers since computations have helped him...