Github Scraper: A tool for Examining the Machine Learning

Posted on Feb 10, 2020
The skills I demoed here can be learned by taking the Data Science with Machine Learning bootcamp at NYC Data Science Academy.

GitHub is one of the most popular code hosting platforms in use today, with over 100 million projects available to users. That makes it one of the best places to gauge the current state of computer science. My Github Analyzer application scrapes thousands of machine learning projects to determine which machine learning libraries are most commonly used and to analyze various statistics about machine learning projects as a whole.

 

Data Collection

The Python library Scrapy was used to get data from Github. The web scraping consisted of two spiders:

  1. GithubLinksSpider.py - Use search terms to get project links
  2. GithubProjectsSpider.py - Use project links to get data for each project

Each search term on Github allows the user to view 100 pages of information with 10 project links per page. I used 5 search terms classified by language to find projects:

  1. Python
  2. Jupyter Notebook
  3. Java
  4. JavaScript
  5. All languages

Each search term produced 1,000 links (100 pages of 10 links), for 5,000 links in total. After removing duplicates, about 4,235 unique projects remained. Collecting the project links was difficult because GitHub's servers timed out requests unless the request rate was kept low. Fortunately, scraping the individual project pages was much more forgiving in that regard.
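As a rough illustration of the link-collection step, the sketch below builds one search URL per language and page, and then removes duplicate project links. It is not the code from GithubLinksSpider.py: the query string "machine learning", the URL parameters (`q`, `l`, `type`, `p`), and both function names are assumptions made for the example.

```python
from urllib.parse import urlencode

# Assumed language filters, mirroring the five search terms listed above.
# The empty string stands for the "All languages" search.
LANGUAGES = ["Python", "Jupyter Notebook", "Java", "JavaScript", ""]
MAX_PAGES = 100  # a search term exposes at most 100 result pages


def build_search_urls(query="machine learning"):
    """Return one search-result URL per (language, page) pair."""
    urls = []
    for language in LANGUAGES:
        for page in range(1, MAX_PAGES + 1):
            # Parameter names are an assumption about the search URL format.
            params = {"q": query, "type": "Repositories", "p": page}
            if language:
                params["l"] = language
            urls.append("https://github.com/search?" + urlencode(params))
    return urls


def unique_links(links):
    """Drop duplicate project links while keeping first-seen order.

    Links are normalized (trailing slash removed, lowercased) so that the
    same repository reached through two searches is counted once.
    """
    normalized = (link.rstrip("/").lower() for link in links)
    return list(dict.fromkeys(normalized))
```

The five searches overlap (the "All languages" search returns projects that also show up in the language-specific ones), which is why the deduplicated total is lower than 5,000. To keep the request rate low in Scrapy, settings such as `DOWNLOAD_DELAY` or `AUTOTHROTTLE_ENABLED` are one option; the values the original spiders used are not stated here.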

 

Results

The following results were obtained by searching the README of each project for references to each library, including its common aliases.
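The README search can be sketched roughly as follows. This is a minimal illustration, not the analysis code from the project: the alias table is a small hypothetical sample, and the function names are invented for the example.

```python
import re
from collections import Counter

# Hypothetical alias table: canonical library name -> spellings to search for.
LIBRARY_ALIASES = {
    "scikit-learn": ["scikit-learn", "sklearn"],
    "tensorflow": ["tensorflow"],
    "pytorch": ["pytorch", "torch"],
}


def compile_patterns(aliases):
    """Build one case-insensitive regex per library."""
    patterns = {}
    for library, names in aliases.items():
        alternatives = "|".join(re.escape(name) for name in names)
        # The lookarounds stop "torch" from matching inside "torchvision".
        patterns[library] = re.compile(
            rf"(?<![\w-])(?:{alternatives})(?![\w-])", re.IGNORECASE
        )
    return patterns


def count_projects_per_library(readmes, aliases=LIBRARY_ALIASES):
    """Count how many READMEs mention each library at least once."""
    patterns = compile_patterns(aliases)
    counts = Counter()
    for text in readmes:
        for library, pattern in patterns.items():
            if pattern.search(text):
                counts[library] += 1
    return counts
```

Counting each project at most once per library keeps a README that repeats a library name many times from inflating that library's popularity.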

 

    

This plot shows the popularity of each library. Most libraries are rarely used, while the top 5 libraries account for by far the most usage.

 




These two plots show the total number of commits across the scraped projects for each library. The libraries with the most commits are very different from what their popularity alone would suggest. The histogram shows that most libraries have under 10,000 commits, although there are significant outliers that may skew the bar plot.
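One way to see how much an outlier can distort a per-library total is to compare the sum of commits with the median. The sketch below assumes the scraped data is available as (library, commit count) pairs; that layout and the function names are assumptions, and any numbers passed in are purely illustrative.

```python
from collections import defaultdict
from statistics import median


def commits_by_library(projects):
    """Group commit counts by library.

    `projects` is an iterable of (library, commit_count) pairs. A project
    that references several libraries appears once per library.
    """
    grouped = defaultdict(list)
    for library, commits in projects:
        grouped[library].append(commits)
    return grouped


def summarize(grouped):
    """Return the total and median commit count for each library."""
    return {
        library: {"total": sum(counts), "median": median(counts)}
        for library, counts in grouped.items()
    }
```

A library whose total is far above what its median would suggest is probably dominated by one or two very large projects.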



This plot demonstrates that the libraries whose projects have the most stars are Pyevolve and NuPIC, while the rest have very few stars. 




This plot groups the projects by license. It reveals that the vast majority of projects have no license, although MIT, Apache, and GPL each appear in a few projects.

 


This plot shows the relationship between commits and releases. Many projects have 0 or 1 releases, but among projects with more than 1 release, there seems to be a positive correlation between commits and releases.
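The visual impression of a positive correlation could be checked numerically. The sketch below computes a Pearson correlation coefficient over only those projects with more than one release; the (commits, releases) pair layout and the function names are assumptions, and this calculation was not part of the original analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


def correlation_for_released_projects(projects, min_releases=2):
    """Correlate commits with releases, ignoring projects with 0 or 1 releases.

    `projects` is an iterable of (commits, releases) pairs.
    """
    kept = [(c, r) for c, r in projects if r >= min_releases]
    if len(kept) < 2:
        raise ValueError("not enough projects with multiple releases")
    commits, releases = zip(*kept)
    return pearson(commits, releases)
```

A value close to 1 would support the positive relationship seen in the plot, while a value near 0 would suggest the pattern is mostly visual.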

 

Conclusion

The common pattern in the data is that the vast majority of the scraped projects are small and have no significant number of commits, stars, or other indicators of influence. The same seems to hold for libraries: a few are used by most projects, while the rest see very little use. As a whole, GitHub seems to have a few big projects that attract most of the user activity, while the rest are small and inconsequential.

 

Future work

One of the difficulties of using Scrapy on GitHub is that GitHub renders some page elements with JavaScript, such as the number of contributors and the date of each project. I attempted to use scrapy-splash to render these elements, but it didn't work. If I were to redo this project, I would use Selenium instead of Scrapy, since Selenium drives a real browser and therefore executes JavaScript.

Another feature I would add is scraping the commit history of each project, in order to see how machine learning projects have changed over time and which contributors are the most active.

About Author

Seth Jackson

Seth Jackson is an expert in logic, economics, and philosophy with over 10 years of experience writing software. After getting a BA in Computer Science, he completed NYCDSA's Data Science program in order to obtain insights from data...