GitHub Profiler: A Tool for Repository Evaluation

Posted on Mar 11, 2017

GitHub hosts over 84 million repositories, a number that continues to grow rapidly. Software developers must consider a number of important factors as they decide whether to use -- or contribute to -- a project hosted on the site. GitHub Profiler provides a number of indicators that can help with such decisions.

With the wealth of public repositories it hosts, GitHub often makes it easy to find many libraries that appear to be well-suited to the task at hand. However, when selecting a library to use in constructing software, a developer needs to consider some factors that may not be immediately obvious. How active is the repository and how many people are contributing to it, by committing code and by commenting on issues? Do the developer or developers address issues in a timely manner? Is the documentation easy to read? How relevant is the focus of the repository to the needs that the developer seeks to fill? These can all be important considerations. If a library is not actively maintained or is poorly documented, relying on it could prove risky.

GitHub Profiler acquires data about the repository that the user selects from a number of sources, primarily the GitHub API but also web scraping of GitHub and other sites. It then analyzes that data to provide indicators of the activity of and participation in the repository, the readability of its documentation, and its subject matter. Through the metrics, graphs, and keywords it provides, GitHub Profiler constructs a succinct portrait of a repository to help a developer make a quick determination. The indicators can contribute to the decision-making process, but they cannot, by themselves, provide all the information that a developer needs.

The first three screenshots below illustrate panels that GitHub Profiler provides, using the visualize_ML repository by GitHub user ayush1997 as an example. First, the summary information panel immediately identifies Python as the programming language of the repository. It then lists several topics that GitHub users have deemed relevant descriptions of the repository, such as machine learning, data analysis, and visualization. The topics are acquired through web scraping of GitHub using the Python libraries Requests and Beautiful Soup. Finally, the summary information panel shows the star count, in this case 144. The star count is the number of GitHub users who have marked the repository with a star to bookmark it (an indication of interest) and to show appreciation of the project.


Next, GitHub Profiler provides a graph showing the number of unique committers and commenters on issues on a monthly basis for the selected repository. The example below quickly shows that there has never been more than one committer each month, and no one has committed code to the repository in recent months. This serves as a warning that the repository may be abandoned or at least is not undergoing regular development. The graph also shows that two people submitted comments on issues in August 2016, but no one commented in any other months. These are also indications that, despite its star count, the repository has not achieved significant participation.
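
The monthly counts behind a graph like this one can be sketched in a few lines of Python. This is only an illustration, not GitHub Profiler's actual code: it assumes the commit or comment records have already been retrieved and reduced to (login, ISO 8601 timestamp) pairs, and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def unique_participants_by_month(events):
    """Count distinct logins per calendar month.

    `events` is an iterable of (login, timestamp) pairs, where the
    timestamp is an ISO 8601 string such as "2016-08-14T09:30:00Z".
    This input shape is an assumption made for the sketch.
    """
    logins_by_month = defaultdict(set)
    for login, timestamp in events:
        month = timestamp[:7]  # "YYYY-MM" prefix of the ISO timestamp
        logins_by_month[month].add(login)
    return {month: len(logins) for month, logins in sorted(logins_by_month.items())}


# Hypothetical sample data, not taken from any real repository.
sample_comments = [
    ("alice", "2016-08-02T10:00:00Z"),
    ("bob", "2016-08-14T09:30:00Z"),
    ("alice", "2016-08-20T16:45:00Z"),
    ("carol", "2016-09-01T08:00:00Z"),
]
monthly_commenters = unique_participants_by_month(sample_comments)
```

Collecting logins in a set per month means that a person who comments or commits several times in one month is still counted once, which is what distinguishes a participation count from a raw activity count.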


The repository text metrics panel shown below provides indicators based on analysis of the text in the description and the readme file of the repository. Polarity is a measurement of how positive or negative the verbiage is, with -1 meaning completely negative, 0 meaning neutral, and 1 meaning completely positive. Below, the description is identified as neutral on this scale, while the readme is seen as slightly positive.  The subjectivity measurement attempts to characterize the description and readme based on how objective or subjective they are. Its scale is from 0, meaning completely objective, to 1, meaning completely subjective. By this measure, the description of the visualize_ML repository is seen as objective, while its readme is moderately subjective. Both the polarity and subjectivity calculations are performed using the TextBlob library.


The text metrics shown above also include estimates of the readability of the text in the description and the readme file. These estimates, which are computed using the Textstat library, feature the Flesch Reading Ease Score, a common measure of readability on a scale from 0 as most confusing to 100 as easiest to read. A composite grade level calculation is also provided using several means of determining the readability of text. In the case of the visualize_ML repository, the description is found to be quite difficult to read, and harder to read than the readme file.
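
For readers curious about what the Flesch Reading Ease Score measures, the sketch below applies the standard formula, 206.835 - 1.015 x (words per sentence) - 84.6 x (syllables per word). It is not the Textstat implementation: the syllable counter is a deliberately crude assumption (one syllable per group of consecutive vowels), and the function names are hypothetical, so its scores will differ somewhat from those shown in the panel.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Apply the standard Flesch Reading Ease formula to `text`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Short words in a short sentence score high; long words score low.
simple = flesch_reading_ease("The cat sat on the mat.")
dense = flesch_reading_ease(
    "Comprehensive documentation facilitates organizational interoperability."
)
```

Because the raw formula is not clamped, very simple text can score above 100 and very dense text can score below 0. A terse repository description packed with long technical terms can therefore rate as harder to read than a longer readme.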

The following screenshots show three more panels that GitHub Profiler offers, using another repository, machine-learning by cognoma, as an example. The maintenance metrics panel gives vital measurements to potential users of a library regarding how a repository is maintained. As noted below, the measurements are based on the calculations made by the site and scraped from that site. For cognoma's machine-learning project, most issues remain open and issues take an average of three months to be resolved.


A graph of the monthly number of commits provides a quick overview of the frequency of changes to the code of the repository. For cognoma's machine-learning repository, there have been two to eight commits in most recent months, but no commits in November 2016 or in the current month.
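
A gap such as the one in November 2016 only shows up in a graph if months without commits are given an explicit zero. The following is a minimal sketch of that step, assuming the commit dates are already available as ISO 8601 strings; the function name, parameters, and sample dates are hypothetical and do not come from the cognoma repository.

```python
from collections import Counter


def commits_per_month(commit_dates, first_month, last_month):
    """Count commits per month, including months with none.

    `commit_dates` holds ISO 8601 date strings; `first_month` and
    `last_month` are "YYYY-MM" strings bounding the report.
    """
    counts = Counter(date[:7] for date in commit_dates)
    year, month = map(int, first_month.split("-"))
    end_year, end_month = map(int, last_month.split("-"))
    series = []
    while (year, month) <= (end_year, end_month):
        key = f"{year:04d}-{month:02d}"
        series.append((key, counts.get(key, 0)))  # 0 for silent months
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return series


# Hypothetical commit dates for illustration only.
sample_dates = [
    "2016-10-03", "2016-10-21", "2016-12-05",
    "2017-01-10", "2017-01-11", "2017-01-30",
]
monthly_commits = commits_per_month(sample_dates, "2016-10", "2017-02")
```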


The graph below depicts the polarity of comments on issues for the selected repository over time. It also shows the frequency of comments over time; in the example below, comments appear to be much more frequent in the early months of the graph. The polarity of the comments seems to be overwhelmingly in the neutral to positive range, with a relatively small number of downward spikes into negative territory.


Finally, the remaining panel in GitHub Profiler shows key terms detected in the description and the readme file. The example below is from a profile of the pattern repository by the user clips (the Computational Linguistics & Psycholinguistics Research Center). Using TextBlob to detect noun phrases, GitHub Profiler identifies a few key words and phrases that characterize the pattern project from its description, such as Python, natural language processing, and machine learning. Its search for key words and phrases in the repository's readme file lists many of the project's contributors as well as relevant terms such as sentiment analysis and part-of-speech taggers, along with phrases related to the installation process. This gives the correct impression that the project's readme file is devoted largely to installation instructions and a listing of contributors, while more comprehensive documentation is found elsewhere.


GitHub Profiler could be extended further through additional types of indicators and graphs that could help developers to evaluate the utility of a repository quickly. It could also benefit from refinement of some of its existing methods. For example, the key terms identified in a readme file might be more useful if they were organized into categories or otherwise evaluated. Finally, the Profiler could be extended to enable head-to-head comparisons of repositories, allowing a developer to gain a quick indication of which may be better suited to his or her needs.


About Author


Evan Frisch

Evan Frisch has more than a decade and a half of experience using technology and data to achieve results for organizations in the private, public, and non-profit sectors. Evan received his undergraduate degree with honors from Yale University,...