Predicting Meetup Members

Posted on Jul 11, 2016
Contributed by Taraqur Rahman. He attended the NYC Data Science Academy 12-week full-time Data Science Bootcamp, which ran from April 11th to July 1st, 2016. This post is based on his final class project, the Capstone (due in the 12th week of the program).

Meetup is a social networking site that brings together people with similar interests. Individuals, organizations, and companies can create groups where people can join and participate in the group activities.

 

Using Meetup’s APIs, I requested group data for about 2,200 meetups per state. From prior experience, I have learned to build the skeleton code on a small dataset first and to move to a bigger dataset only once I can confirm the code works. The dataset I worked with covers five cities: New York, Los Angeles, Chicago, Houston, and Phoenix. According to top5ofanything.com, these were the five largest cities in 2010.

 

The question I had in mind was: if I were to create a meetup, how could I maximize its number of members? In other words, if I were to start a meetup, I would like to know which features would give me the greatest number of members.

Raw_Members

This histogram shows that the most common number of members is between 0 and 500.

 

I am starting off with a graph that shows the number of members for all five cities (above). It shows that almost all meetups have fewer than 5,000 members, though a few have more than that. The highest number of members per city is shown below.

 

Top Meetup for Each City

Max_Table

This table shows the top 5 meetups in each city.

 

I filtered out a few outliers (members > 3000) and drew a boxplot to get a better sense of how the members are distributed across the cities (below). New York seems to have the highest median (about 575 members). This makes sense since New York has the largest population.

 

BoxPlot_State
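The filter and the per-city median behind the boxplot can be sketched in a few lines. This is only an illustration in plain Python, not the original project code: the rows below are made-up sample values, and only the 3000-member cutoff comes from the text.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (city, member count) rows; real rows would come from the Meetup API.
groups = [
    ("New York", 575), ("New York", 120), ("New York", 4200),
    ("Chicago", 300), ("Chicago", 90), ("Chicago", 3500),
]

MAX_MEMBERS = 3000  # outlier cutoff mentioned in the text

# Keep only meetups at or below the cutoff, grouped by city.
members_by_city = defaultdict(list)
for city, members in groups:
    if members <= MAX_MEMBERS:
        members_by_city[city].append(members)

# Median membership per city, the middle line of each box in a boxplot.
median_by_city = {city: median(vals) for city, vals in members_by_city.items()}
print(median_by_city)
```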

The next graph shows the number of members per category. The category with the most members is writing, followed by tech. Eight meetups have no category available. Since only eight values are missing, I decided to apply Simple Random Imputation, where I randomly select a category and use it to fill in the missing value.

 

Members by Category

This shows the number of members by category. The most popular category is writing, then tech. The least popular are arts-culture and book-clubs.
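A minimal sketch of simple random imputation for the missing categories, assuming missing values are stored as `None` and that replacements are drawn from the categories actually observed in the data. The function name and sample values are hypothetical.

```python
import random

def random_impute(values, seed=0):
    """Fill each missing entry (None) with a randomly chosen observed value."""
    rng = random.Random(seed)  # fixed seed so the fill is reproducible
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]

# Hypothetical category column with two missing entries.
categories = ["tech", None, "writing", "tech", None, "film"]
filled = random_impute(categories)
print(filled)
```

Drawing from the observed values means frequent categories are picked more often, which keeps the overall category distribution roughly unchanged when only a handful of values are missing.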

 

Next I looked at the ratings. I compared ratings and members in a joint plot (below) to see if there is any relationship. The graph shows a few ratings of 0. Looking into this further, a rating is 0 because people have not rated the meetup yet, not because people rated it 0. Therefore I needed to do something about it: leaving these values at 0 would skew the results, since a 0 is not truly a rating.

 

Raw_jointplot

This shows barely any correlation between the number of members of each meetup and the rating of each meetup.

 

I graphed the categories vs. ratings. This was interesting because the average rating is very similar across categories. That means I can use Mean Value Imputation, where I compute the average of the ratings that are present and use it to fill in the missing ones.

Ratings by Category

The average ratings for each category are basically the same.
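Mean value imputation for the unrated meetups could look like the sketch below. It assumes, as described above, that a rating of 0 marks a meetup nobody has rated yet; the function name and sample ratings are made up for illustration.

```python
from statistics import mean

def impute_unrated(ratings, missing=0):
    """Replace placeholder ratings with the mean of the real ratings."""
    observed = [r for r in ratings if r != missing]
    fill = mean(observed)  # the average ignores the placeholder zeros
    return [fill if r == missing else r for r in ratings]

# Hypothetical ratings; the zeros stand for "not rated yet".
ratings = [4.5, 0, 5.0, 4.0, 0]
filled = impute_unrated(ratings)
print(filled)
```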

 

Now I believe I am ready to fit models to see if I can predict how many members a group will have based on its features. Right away I can say that I do not have a lot of data. The dataset I requested had only 22 variables, so I decided to perform feature engineering to get more.

 

The first variable I added was the age of the meetup. My thinking was that the older the meetup, the more time it has had to accumulate members. The raw dataset did not have the age. It had the creation date in epoch time (seconds elapsed since January 1, 1970, UTC). I converted that into a regular date format and took the difference between the creation year and 2015 to get the age.

I also added a variable that gives the number of words in the description, to see whether the number of words matters. If it does, then we can write a detailed description when creating a meetup. If not, then we can write two or three sentences that get to the point.

Another variable I added was the number of topics listed. The topics are basically the subcategories of the meetup. For instance, if the category of a meetup is film, the topics would be acting, screenwriting, film industry, etc. I thought that the more topics a meetup lists, the more people it will reach.
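The three engineered variables can be sketched as follows. The field names (`created`, `description`, `topics`) and the sample record are assumptions made for illustration, and the timestamp is assumed to be in seconds as described above; if the API returns milliseconds, divide by 1000 first.

```python
from datetime import datetime, timezone

REFERENCE_YEAR = 2015  # reference year used in the text

def engineer_features(group):
    """Derive age, description length, and topic count for one meetup record."""
    created = datetime.fromtimestamp(group["created"], tz=timezone.utc)
    return {
        "age_years": REFERENCE_YEAR - created.year,
        "description_words": len(group["description"].split()),
        "topic_count": len(group["topics"]),
    }

# Hypothetical record; field names are illustrative, not the actual API schema.
group = {
    "created": 1262304000,  # 2010-01-01 00:00:00 UTC, in seconds
    "description": "A group for people who love film and screenwriting",
    "topics": ["acting", "screenwriting", "film industry"],
}
features = engineer_features(group)
print(features)
```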

 

I also dummified a few columns. One column gives the join mode, which has three values: open, closed, and approval. Open means that anybody can join the meetup. Closed means no one else can join it. From skimming through the data, a closed join mode means that the meetup is no longer active. Approval means that you have to ask the founder for an invite. This will definitely affect the total number of members. Another variable I dummified was visibility, which is basically who can see the meetup: members only or anybody. In my opinion, if people cannot see a meetup, it will most likely have fewer members.
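Dummifying (one-hot encoding) a column such as the join mode can be sketched in plain Python as below. The helper name, column name, and rows are hypothetical; with pandas, `pandas.get_dummies` does the same job.

```python
def one_hot(rows, column, levels):
    """Replace a categorical column with one 0/1 indicator column per level."""
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = int(row[column] == level)
        encoded.append(new_row)
    return encoded

# Hypothetical rows using the three join modes described in the text.
rows = [
    {"members": 120, "join_mode": "open"},
    {"members": 40, "join_mode": "approval"},
    {"members": 15, "join_mode": "closed"},
]
encoded = one_hot(rows, "join_mode", ["open", "closed", "approval"])
print(encoded[0])
```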

 

After creating these extra variables, I ran a random forest model on the data. Not surprisingly, the accuracy was very low, about 30%. From this model I extracted the importance of each feature.

Feature Importance

This bar graph shows what features (in percentage) are most important in determining the number of members.

 

Feats

These are the top 13 features that influenced my model. The two biggest ones are latitude and longitude.

The low accuracy is not surprising to me because I believe I did not create or use enough data. I only used 5 cities, which comes out to about 8800 rows and 22 columns. I worked on a small dataset because I wanted to get my code working first and set a benchmark for the accuracy of the model.
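The model fit and feature ranking described above can be sketched with scikit-learn. This is an illustration on synthetic data, not the original code: the feature names and the target are made up, and the text does not say which metric the 30% refers to (for a regressor, `score` returns R²).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
feature_names = ["lat", "lon", "age_years", "topic_count"]  # hypothetical names
X = rng.normal(size=(n, len(feature_names)))
# Synthetic target that depends mostly on the first column, for illustration only.
y = 3.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)  # R^2 on the held-out rows
# Importances sum to 1; sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.1%}")
```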

 

Another issue with the dataset was that I was predicting a number, the member count, from data that was mostly categorical. It was an issue, but one I was willing to work with. I read about methods for this specific case, and one that I came across was word2vec, which maps each word to a numeric vector. Using this method, I would like to turn each word into numeric values that I can use to better predict the number of members. I could also use Naïve Bayes to do some sentiment analysis. I would also like to apply more models to find which one performs best, and maybe ensemble them.

 

Another thing I could have done, which might solve my regression problem, is to bin the number of members into classes. For instance, meetups with 100-200 members would be in class B. That way the regression problem would turn into a classification problem. But that would be too easy…
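Binning the member count into classes could look like the sketch below. Only "100-200 members is class B" comes from the text; the other bin edges and labels are assumptions chosen for illustration.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels: A is below 100, B is 100-199,
# C is 200-499, and D is 500 and above.
EDGES = [100, 200, 500]
LABELS = ["A", "B", "C", "D"]

def member_class(members):
    """Map a member count to its class label."""
    return LABELS[bisect_right(EDGES, members)]

print([member_class(m) for m in (50, 150, 300, 1000)])
```

A classifier would then be trained on these labels instead of the raw member counts.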

About Author

Taraqur Rahman

During his career as a Sales Associate, Taraqur analyzed data to help support both the sales and marketing teams. Seeing through his own eyes how much data can influence decisions, Taraqur joined NYCDSA as a data scientist in...