NYC Data Science Academy| Blog

GitHub Profiler: A Tool for Repository Evaluation

Evan Frisch
Posted on Mar 11, 2017

GitHub hosts over 84 million repositories, a number that continues to grow rapidly. Software developers must consider a number of important factors as they decide whether to use -- or contribute to -- a project hosted on the site. GitHub Profiler provides a number of indicators that can help with such decisions.

With the wealth of public repositories it hosts, GitHub often makes it easy to find many libraries that appear to be well-suited to the task at hand. However, when selecting a library to use in constructing software, a developer needs to consider some factors that may not be immediately obvious. How active is the repository and how many people are contributing to it, by committing code and by commenting on issues? Do the developer or developers address issues in a timely manner? Is the documentation easy to read? How relevant is the focus of the repository to the needs that the developer seeks to fill? These can all be important considerations. If a library is not actively maintained or is poorly documented, relying on it could prove risky.

GitHub Profiler acquires data from a number of sources, primarily through the GitHub API but also by scraping GitHub and other sites, to research the repository that the user selects. It then analyzes that data to provide indicators of the repository's activity and participation, the readability of its documentation, and its subject matter. Through the metrics, graphs, and keywords it provides, GitHub Profiler constructs a succinct portrait of a repository to help a developer make a quick determination. The indicators can contribute to the decision-making process, but they cannot, by themselves, provide all the information that a developer needs.
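As a rough illustration of the API side of this data acquisition, the sketch below requests a repository's metadata from the public GitHub REST API and keeps a few fields. It is not GitHub Profiler's own code: the function names, the `User-Agent` string, and the choice of fields are assumptions made for the example, and unauthenticated requests are subject to GitHub's rate limits.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_repository(owner, name):
    """Request a repository's metadata from the GitHub REST API (network call)."""
    request = urllib.request.Request(
        f"{API_ROOT}/repos/{owner}/{name}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "repo-profile-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_repository(payload):
    """Keep only the fields that a summary view might display."""
    return {
        "name": payload.get("full_name"),
        "language": payload.get("language"),
        "stars": payload.get("stargazers_count", 0),
        # The API returns null when a repository has no description.
        "description": payload.get("description") or "",
    }
```

Calling `summarize_repository(fetch_repository("ayush1997", "visualize_ML"))` would return the language, star count, and description for the example repository discussed in this post, provided the repository is still available.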

The first three screenshots below illustrate panels that GitHub Profiler provides, using as an example the visualize_ML repository by GitHub user ayush1997. First, the summary information panel immediately identifies Python as the programming language of the repository. It then lists several topics that GitHub users have deemed relevant descriptions of the repository, such as machine learning, data analysis, and visualization. The topics are acquired by scraping GitHub with the Python libraries Requests and Beautiful Soup. Finally, the summary information panel shows the star count, in this case 144. The star count indicates the number of GitHub users who have marked the repository with a star, either to bookmark it (an indication of interest) or to show appreciation of the project.

[Screenshot: Summary Information]

Next, GitHub Profiler provides a graph showing the number of unique committers and commenters on issues on a monthly basis for the selected repository. The example below quickly shows that there has never been more than one committer each month, and no one has committed code to the repository in recent months. This serves as a warning that the repository may be abandoned or at least is not undergoing regular development. The graph also shows that two people submitted comments on issues in August 2016, but no one commented in any other months. These are also indications that, despite its star count, the repository has not achieved significant participation.

[Screenshot: Monthly Number of Unique Contributors]
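The monthly counts behind a graph like this can be sketched in a few lines. The example below assumes the commit or comment events have already been collected as pairs of a user login and an ISO 8601 timestamp in the format the GitHub API returns; the function name and the sample data are hypothetical and only illustrate the grouping step.

```python
from collections import defaultdict
from datetime import datetime


def monthly_unique_people(events):
    """Count distinct logins per calendar month.

    `events` is an iterable of (login, timestamp) pairs, where the timestamp
    looks like "2016-08-02T10:00:00Z".
    """
    people_by_month = defaultdict(set)
    for login, timestamp in events:
        when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
        # A set per month ensures each person is counted once per month.
        people_by_month[when.strftime("%Y-%m")].add(login)
    return {month: len(people) for month, people in sorted(people_by_month.items())}


# Hypothetical sample: two commenters in August, one in September.
sample_events = [
    ("alice", "2016-08-02T10:00:00Z"),
    ("bob", "2016-08-15T09:30:00Z"),
    ("alice", "2016-08-20T12:00:00Z"),
    ("alice", "2016-09-01T08:00:00Z"),
]
monthly_counts = monthly_unique_people(sample_events)
```

Running the same function once on commit authors and once on issue commenters gives the two series that such a graph compares.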

The repository text metrics panel shown below provides indicators based on analysis of the text in the description and the readme file of the repository. Polarity is a measurement of how positive or negative the verbiage is, with -1 meaning completely negative, 0 meaning neutral, and 1 meaning completely positive. Below, the description is identified as neutral on this scale, while the readme is seen as slightly positive.  The subjectivity measurement attempts to characterize the description and readme based on how objective or subjective they are. Its scale is from 0, meaning completely objective, to 1, meaning completely subjective. By this measure, the description of the visualize_ML repository is seen as objective, while its readme is moderately subjective. Both the polarity and subjectivity calculations are performed using the TextBlob library.

[Screenshot: Repository Text Metrics]
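A minimal sketch of the polarity and subjectivity calculation is shown here, assuming the TextBlob package is installed. `TextBlob(text).sentiment` returns a polarity in the range -1 to 1 and a subjectivity in the range 0 to 1. The `describe_polarity` helper and its tolerance are hypothetical additions for illustration and are not part of TextBlob or of GitHub Profiler.

```python
def text_metrics(text):
    """Return TextBlob's polarity and subjectivity scores for `text`."""
    # Imported inside the function so this module still loads
    # on a machine where the third-party TextBlob package is absent.
    from textblob import TextBlob

    sentiment = TextBlob(text).sentiment
    return {
        "polarity": sentiment.polarity,          # -1 (negative) to 1 (positive)
        "subjectivity": sentiment.subjectivity,  # 0 (objective) to 1 (subjective)
    }


def describe_polarity(polarity, tolerance=0.05):
    """Hypothetical helper: turn a polarity score into a coarse label."""
    if polarity > tolerance:
        return "positive"
    if polarity < -tolerance:
        return "negative"
    return "neutral"
```

Applying `text_metrics` separately to a repository's description and to its readme text yields the two pairs of scores that a panel like this one reports.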

The text metrics shown above also include estimates of the readability of the text in the description and the readme file. These estimates, which are computed using the Textstat library, feature the Flesch Reading Ease Score, a common measure of readability on a scale from 0 (most confusing) to 100 (easiest to read). A composite grade level, which combines several methods of estimating readability, is also provided. In the case of the visualize_ML repository, the description is found to be quite difficult to read, and harder to read than the readme file.
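To show what the Flesch Reading Ease Score measures, the sketch below applies the standard formula, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word), using only the standard library. It is not Textstat's implementation: the syllable counter is a deliberately crude assumption (it counts runs of vowels), so its scores will differ from the ones Textstat reports. Note that very simple text can score above 100 and very dense text below 0.

```python
import re


def count_syllables(word):
    """Crude syllable estimate: count runs of vowels, with a floor of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Apply the Flesch Reading Ease formula to `text`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy_score = flesch_reading_ease("The cat sat on the mat.")
hard_score = flesch_reading_ease(
    "Comprehensive documentation facilitates interoperability."
)
```

Short sentences built from short words score high, while long, polysyllabic sentences score low, which is why a terse but jargon-heavy repository description can rate as harder to read than a longer readme.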

The following screenshots show three more panels that GitHub Profiler offers, using another repository, machine-learning by cognoma, as an example. The maintenance metrics panel gives vital measurements to potential users of a library regarding how a repository is maintained. As noted below, the measurements are based on the calculations made by the site IsItMaintained.com and scraped from that site. For cognoma's machine-learning project, most issues remain open and issues take an average of three months to be resolved.

[Screenshot: Maintenance Metrics]
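GitHub Profiler takes these figures from IsItMaintained.com rather than computing them itself. Purely to illustrate what such measurements mean, the sketch below derives two comparable numbers from issue records that carry `created_at` and `closed_at` timestamps, as the GitHub API provides. The function names and sample data are hypothetical, and the calculation is an assumption that may not match the method IsItMaintained.com uses.

```python
from datetime import datetime


def parse_time(timestamp):
    """Parse a GitHub-style timestamp such as "2016-01-11T00:00:00Z"."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")


def maintenance_metrics(issues):
    """Return the share of issues still open and the mean days to close."""
    closed = [issue for issue in issues if issue.get("closed_at")]
    open_share = (len(issues) - len(closed)) / len(issues) if issues else 0.0
    days_to_close = [
        (parse_time(issue["closed_at"]) - parse_time(issue["created_at"])).total_seconds()
        / 86400
        for issue in closed
    ]
    mean_days = sum(days_to_close) / len(days_to_close) if days_to_close else None
    return {"open_share": open_share, "mean_days_to_close": mean_days}


# Hypothetical sample: two issues closed after 10 and 20 days, two still open.
sample_issues = [
    {"created_at": "2016-01-01T00:00:00Z", "closed_at": "2016-01-11T00:00:00Z"},
    {"created_at": "2016-02-01T00:00:00Z", "closed_at": "2016-02-21T00:00:00Z"},
    {"created_at": "2016-03-01T00:00:00Z", "closed_at": None},
    {"created_at": "2016-03-05T00:00:00Z", "closed_at": None},
]
sample_metrics = maintenance_metrics(sample_issues)
```

A high open share combined with a long time to close is the pattern that should give a prospective user pause.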

A graph of the monthly number of commits provides a quick overview of the frequency of changes to the code of the repository. For cognoma's machine-learning repository, there have been two to eight commits in most recent months, but no commits in November 2016 or in the current month.

[Screenshot: Monthly Number of Commits]

The graph below depicts the polarity of comments on issues for the selected repository over time. It also shows the frequency of comments over time; in the example below, comments appear to be much more frequent in the early months of the graph. The polarity of the comments seems to be overwhelmingly in the neutral to positive range, with a relatively small number of downward spikes into negative territory.

[Screenshot: Issue Comment Polarity]

Finally, the remaining panel in GitHub Profiler shows key terms detected in the description and the readme file. The example below is from a profile of the pattern repository by the user clips (the Computational Linguistics & Psycholinguistics Research Center). Using TextBlob to detect noun phrases, GitHub Profiler identifies a few key words and phrases that characterize the pattern project from its description, such as Python, natural language processing, and machine learning. Its search for key words and phrases in the repository's readme file lists many of the project's contributors as well as relevant terms such as sentiment analysis and part-of-speech taggers, along with phrases relevant to the installation process. This gives the correct impression that the project's readme file is devoted largely to installation instructions and a listing of contributors, while more comprehensive documentation is found elsewhere.

[Screenshot: Repository Keywords]

GitHub Profiler could be extended further through additional types of indicators and graphs that could help developers to evaluate the utility of a repository quickly. It could also benefit from refinement of some of its existing methods. For example, the key terms identified in a readme file might be more useful if they were organized into categories or otherwise evaluated. Finally, the Profiler could be extended to enable head-to-head comparisons of repositories, allowing a developer to gain a quick indication of which may be better suited to his or her needs.

 

About Author

Evan Frisch

Evan Frisch has more than a decade and a half of experience using technology and data to achieve results for organizations in the private, public, and non-profit sectors. Evan received his undergraduate degree with honors from Yale University,...