Bonsai: Optimizing Forum Queries and How to Save It
Project GitHub | LinkedIn: Niki Moritz Hao-Wei Matthew Oren
The skills we demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Recently, I've been trying my hand at a new hobby: bonsai, the art of cultivating trees in small pots. After some initial success in the humid summer months, I soon found myself staring at the wilted leaves of a small, dying tree. Realizing that I might need some help, I went looking for advice on how to revive it.
A healthy bonsai!
I turned to Bonsai Empire, a large forum with over 6,000 postings, to see what fellow enthusiasts had been discussing. I quickly noticed a large disparity in the number of responses each post received, so I scraped the multi-layered forum with a Scrapy spider to investigate which topics were getting the most responses.
With the scraped data in hand, I reduced every word in each post's topic to its stem, so that words sharing a root (e.g., "choose" and "choosing") count as the same word, and removed uninformative "stopwords" such as "your" and "is". The result is a list of meaningful topic stems, each paired with the number of responses its posts received.
The ten stems with the most total responses are as follows:
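To make the stemming and stopword step concrete, here is a minimal sketch in Python. It is not the code used for this project: `crude_stem` is a toy stand-in that only strips a few suffixes, `STOPWORDS` is a deliberately tiny hypothetical list, and filtering stopwords before stemming is my own assumption about the order. In practice a real stemmer (for example NLTK's `PorterStemmer().stem`) could be passed in through the `stem` parameter.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "is", "my", "the", "to", "your"}


def crude_stem(word):
    """Toy stand-in for a real stemmer: strip one common suffix if enough remains."""
    for suffix in ("ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def topic_stems(title, stem=crude_stem, stopwords=STOPWORDS):
    """Lowercase a topic title, drop stopwords, and stem the remaining words."""
    words = re.findall(r"[a-z]+", title.lower())
    return [stem(w) for w in words if w not in stopwords]


print(crude_stem("choose"), crude_stem("choosing"))     # choos choos
print(topic_stems("Is your juniper choosing to die?"))  # ['juniper', 'choos', 'die']
```

Both "choose" and "choosing" collapse to the same stem, which is the whole point of the step: responses to either wording get credited to one topic.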
STEM | TOTAL RESPONSES |
---|---|
bonsai | 1324 |
tree | 749 |
help | 620 |
new | 366 |
junip | 282 |
ficu | 239 |
thi | 219 |
need | 218 |
elm | 214 |
leav | 213 |
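A ranking like the one above can be produced by summing responses per stem. The following is a rough sketch under assumptions of my own: each post is represented as a `(stems, responses)` pair, a stem is counted once per post even if it appears twice in the title, and the sample numbers are made up purely for illustration.

```python
from collections import Counter


def total_responses_by_stem(posts):
    """Sum the response counts of every post whose topic contains each stem."""
    totals = Counter()
    for stems, responses in posts:
        for stem in set(stems):  # count each stem once per post (assumption)
            totals[stem] += responses
    return totals


# Made-up sample data, not values from the scraped forum.
posts = [
    (["bonsai", "help"], 12),
    (["junip", "help"], 5),
    (["bonsai", "tree"], 3),
]

totals = total_responses_by_stem(posts)
print(totals.most_common(2))  # [('help', 17), ('bonsai', 15)]
```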
This list is, of course, biased toward stems that appear in the most posts, so we look to the average number of responses per post for better insight, and obtain the following list:
STEM | AVERAGE RESPONSES |
---|---|
introduc | 83.4 |
wisconsin | 55.0 |
nonsai | 54.0 |
recommendations | 48.0 |
corkscrew | 48.0 |
challenge | 46.0 |
competit | 46.0 |
halp | 44.0 |
concretec | 44.0 |
aggress | 44.0 |
However, the whole-number values for Average Responses suggest that most of these averages come from a single post each, so we filter the list for stems that appear in at least two posts, and obtain the list below:
STEM | AVERAGE RESPONSES |
---|---|
introduc | 83.4 |
competit | 46.0 |
gnarl | 41.5 |
alps | 40.0 |
walmart | 36.25 |
heaven | 33.0 |
guid | 30.0 |
fusion | 29.75 |
monster | 29.5 |
pics | 29.0 |
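The averaging and the two-post filter can be sketched together. As before, this is an illustration rather than the project's code: the `(stems, responses)` input shape, the `min_posts` parameter name, and the sample numbers are all assumptions of mine.

```python
from collections import defaultdict


def average_responses_by_stem(posts, min_posts=1):
    """Average responses per stem, keeping stems seen in at least min_posts posts."""
    per_stem = defaultdict(list)
    for stems, responses in posts:
        for stem in set(stems):  # count each stem once per post (assumption)
            per_stem[stem].append(responses)
    averages = {
        stem: sum(counts) / len(counts)
        for stem, counts in per_stem.items()
        if len(counts) >= min_posts
    }
    # Highest average first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data, not values from the scraped forum.
posts = [
    (["introduc"], 90),
    (["introduc", "new"], 70),
    (["wisconsin"], 55),
    (["new", "tree"], 4),
]

print(average_responses_by_stem(posts))
# [('introduc', 80.0), ('wisconsin', 55.0), ('new', 37.0), ('tree', 4.0)]
print(average_responses_by_stem(posts, min_posts=2))
# [('introduc', 80.0), ('new', 37.0)]
```

With `min_posts=2`, stems that owe their rank to a single popular post drop out, which is the same effect seen between the two average-response tables.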
It should be noted that many of the stems above are relatively infrequent, so their high response counts may simply be anomalies. For the highest likelihood of a response, one should write a topic that includes words from the top of both the total-response and average-response lists.