Predicting Meetup Members
Meetup is a social networking site that brings together people with similar interests. Individuals, organizations, and companies can create groups where people can join and participate in the group activities.
Using Meetup's APIs I was able to get group requests for about 2200 meetups per state. From prior experience, I learned to use a small dataset first to create the skeleton code. Once I can confirm the code works, I can use a bigger dataset. The dataset I worked with contains five cities: New York, Los Angeles, Chicago, Houston, and Phoenix. According to top5ofanything.com, these were the five largest U.S. cities in 2010.
The question I had in mind was: if I were to create a meetup, how could I maximize my number of members? In other words, if I were to start a meetup, I would like to know which features would give me the greatest number of members.
I am starting off with a graph that shows the number of members for all five cities (above). It shows that almost all meetups have fewer than 5000 members, although a few have more than that. The highest number of members per city is shown below.
Top Meetup for Each City
I filtered out a few of the outliers (members > 3000) and drew a boxplot to get a better sense of how the members were distributed across the cities (below). New York seems to have the highest median number of members (about 575). This makes sense since New York has the largest population.
The next graph shows the number of members per category. The category that gets the most members is writing, followed by tech. Eight meetups have no category available. Since there are only eight missing values, I decided to apply Simple Random Imputation, where I randomly select a category and use it in place of the missing value.
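The filtering and the per-city median can be sketched in a few lines of pandas. This is only an illustration: the column names `city` and `members` and the toy rows are assumptions, not the actual dataset.

```python
import pandas as pd

# Toy rows; the real data has thousands of meetups per city.
df = pd.DataFrame({
    "city": ["New York", "New York", "New York", "Chicago", "Chicago", "Chicago"],
    "members": [400, 600, 9000, 200, 300, 3500],
})

# Drop the outliers (members > 3000) before summarizing.
filtered = df[df["members"] <= 3000]

# Median number of members per city, the line in the middle of each box.
median_by_city = filtered.groupby("city")["members"].median()
print(median_by_city)
```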
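Here is a minimal sketch of simple random imputation, assuming the data is in a pandas DataFrame with a `category` column (the column name, the helper `random_impute`, the seed, and the toy rows are all made up for illustration). It draws each replacement from the categories that were actually observed, so common categories are picked more often; drawing uniformly from the list of category names would be an equally valid reading of the method.

```python
import numpy as np
import pandas as pd

def random_impute(series: pd.Series, seed: int = 0) -> pd.Series:
    """Fill missing values by sampling from the observed values."""
    rng = np.random.default_rng(seed)
    filled = series.copy()
    missing = filled.isna()
    observed = filled[~missing].to_numpy()
    # Draw one observed value per missing entry (with replacement).
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

# Toy data; the real category names and counts will differ.
df = pd.DataFrame({"category": ["tech", "writing", None, "tech", None]})
df["category"] = random_impute(df["category"])
print(df)
```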

This shows the number of members based on the category. The most popular category is writing, then tech. The least popular are arts-culture and book-clubs.
Next I looked at the ratings. I compared the ratings and members to see if there is any relationship using a joint plot (below). Based on the graph, there seem to be a few 0 ratings. Looking into this more closely, a rating is 0 because people have not rated the meetup yet, not because people rated it 0. Therefore I needed to do something about it. Leaving these values at 0 would skew the results, since a 0 is not truly a rating.

This shows barely any correlation between the number of members of each meetup and the rating of each meetup.
I graphed the categories vs. ratings. This was interesting because the average rating of each category seems to be very similar to the others. Because of this I can use Mean Value Imputation, where I compute the average of the non-missing ratings and use it in place of each missing one.
Now I believe I am ready to implement models to see if I can predict how many members a group will have based on the features. Right away I can say that I do not have a lot of data. The dataset I requested had only 22 variables. So I decided to perform feature engineering to get more variables.
The first variable I added was the age of the meetup. My thinking was that the older the meetup, the more time it has had to accumulate members. The raw dataset did not have the age. It had the creation date in epoch time (seconds since January 1, 1970 UTC). I converted that into a regular date and subtracted the creation year from 2015 to get the age. I also added a variable that gives the number of words in the description. The reason I did this is to see if the number of words matters. If it does, then we can write a detailed description when creating a meetup. If not, then we can write two or three sentences that get to the point. Another variable I added was the number of topics listed. The topics are basically the subcategories of the meetup. For instance, if the category of a meetup is film, the topics would be acting, screenwriting, film industry, etc. I thought that the more topics there are, the more people the meetup will reach.
I also dummified a few columns. One column gave the join mode, which has three types: open, closed, and approval. Open means that anybody can join the meetup. Closed means no one else can join it. Based on skimming through the data, a closed join mode means that the meetup is no longer active. Approval means that you have to ask the founder for an invite. This will definitely affect the total number of members. Another variable I dummified was the visibility. Visibility is basically who can see the meetup: members only or anybody. If people cannot see a meetup then most likely it will have fewer members, in my opinion.
After creating these extra variables I ran a random forest model on the data. Not to my surprise, I got a very low accuracy of about 30%. Based on these results I looked at the importance of each feature.
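The mean value imputation can be sketched as follows, assuming a pandas DataFrame with a `rating` column (the column name and the toy values are assumptions). A rating of 0 is first marked as missing, then filled with the mean of the ratings that were actually given.

```python
import numpy as np
import pandas as pd

# Toy ratings; 0.0 stands for "nobody has rated this meetup yet".
df = pd.DataFrame({"rating": [4.5, 0.0, 5.0, 4.0, 0.0]})

# Treat a rating of 0 as missing rather than as a real score.
rated = df["rating"].replace(0, np.nan)

# Replace each missing rating with the mean of the observed ratings.
df["rating"] = rated.fillna(rated.mean())
print(df)
```

Converting the zeros to missing values first matters: otherwise they would pull the mean down and the imputed value would be too low.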
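The three engineered variables could be computed roughly like this. It is a sketch under assumptions: the record is a plain dict with hypothetical keys `created`, `description`, and `topics`, and `created` is in seconds as described above (if the API returns milliseconds instead, divide by 1000 first). The function name `engineer_features` is also made up.

```python
from datetime import datetime, timezone

REFERENCE_YEAR = 2015  # year the age is measured against

def engineer_features(group: dict) -> dict:
    """Derive age, description length, and topic count for one meetup."""
    # Assumes `created` is in seconds since 1970-01-01 UTC.
    created = datetime.fromtimestamp(group["created"], tz=timezone.utc)
    return {
        "age_years": REFERENCE_YEAR - created.year,
        # Word count: split the description on whitespace.
        "description_words": len(group.get("description", "").split()),
        "topic_count": len(group.get("topics", [])),
    }

# Toy record; the values are invented for illustration.
example = {
    "created": 1262304000,  # 2010-01-01 00:00:00 UTC
    "description": "A friendly group for aspiring screenwriters.",
    "topics": ["acting", "screenwriting", "film industry"],
}
features = engineer_features(example)
print(features)
```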
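Dummifying can be done with pandas `get_dummies`, which turns each category value into its own 0/1 column. In this sketch the column names `join_mode` and `visibility` and the visibility values `public` and `members` are assumptions; only the three join modes come from the data described above.

```python
import pandas as pd

# Toy rows; column names and visibility values are hypothetical.
df = pd.DataFrame({
    "join_mode": ["open", "approval", "closed", "open"],
    "visibility": ["public", "members", "public", "public"],
})

# One 0/1 column per distinct value, e.g. join_mode_open, visibility_public.
dummies = pd.get_dummies(df, columns=["join_mode", "visibility"], dtype=int)
print(dummies)
```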
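For reference, here is a rough sketch of fitting a random forest and reading off feature importances with scikit-learn. Several things are assumptions: the library choice, treating the problem as regression scored with R² on a held-out split, the feature names, and above all the data, which is synthetic and generated only so the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the feature names are hypothetical.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age_years": rng.integers(0, 10, n),
    "description_words": rng.integers(10, 500, n),
    "topic_count": rng.integers(1, 15, n),
})
# Made-up target: members grow with age, plus noise.
y = 100 * X["age_years"] + rng.normal(0, 50, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# R^2 on data the model has not seen during training.
r2 = model.score(X_test, y_test)

# Importances sum to 1; larger means the feature was used for more useful splits.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(r2)
print(importances)
```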

This bar graph shows what features (in percentage) are most important in determining the number of members.
The low accuracy is not surprising to me because I believe I did not create or use enough data. I only used 5 cities, which comes out to about 8800 rows and 22 columns. I worked on a small dataset because I wanted to get my code working first and set a benchmark for myself with the accuracy of the model.
Another issue with the dataset was that most of the data I was using to predict the number of members was categorical. It was an issue, but something I was willing to work with. I read about methods to use for this specific case, and one that I came across was word2vec. This method converts words into numeric vectors. Using it, I would like to turn each word into a numeric representation so I can better predict the number of members. I can also use Naïve Bayes to do some sentiment analysis. I would also like to apply more models to find out which one works best, and maybe ensemble them.
Another thing I could have done, which might solve my regression problem, is to group the number of members into classes. For instance, meetups with 100-200 members would be in class B. That way the regression problem would turn into a classification problem. But that would be too easy…
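For completeness, turning the member counts into classes could look like this with pandas `cut`. Only the 100-200 range for class B comes from the example above; the other bin edges and the labels are made up.

```python
import pandas as pd

# Toy member counts.
members = pd.Series([50, 150, 320, 1200, 4800])

# Bin edges and labels are hypothetical, apart from 100-200 being class B.
bins = [0, 100, 200, 500, 1000, float("inf")]
labels = ["A", "B", "C", "D", "E"]

# right=False makes each bin include its lower edge: [100, 200) is class B.
member_class = pd.cut(members, bins=bins, labels=labels, right=False)
print(member_class)
```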