NYC Data Science Academy| Blog

Exploring Polycystic Ovarian Syndrome (PCOS) Symptoms with Data

Sarah Beth Powell
Posted on May 24, 2023

According to the Endocrine Society, polycystic ovarian syndrome (PCOS) is one of the most common causes of female infertility, affecting as many as 5 million American women of childbearing age. Women affected by PCOS may produce higher than normal amounts of male hormones, which may impact their overall health, even past their childbearing years. Symptoms can differ from woman to woman, which makes the condition difficult to diagnose. In this exploratory data analysis project, I explored three specific questions:

  1. Because symptoms vary from woman to woman, are there any features that are correlated with PCOS?
  2. Do non-PCOS patients exhibit similar symptoms to those who are diagnosed with PCOS?
  3. Which symptoms do PCOS patients exhibit most frequently?

Data

The data used for this project includes all physical and clinical parameters for a group of patients, collected from ten different hospitals across Kerala, India. The original data set and notebook can be found on Kaggle. The data consists of one Comma Separated Values (CSV) file and one Excel file.

  • PCOS_Data_without_infertility contains 45 columns (representing various physical and clinical parameters) and 541 rows (representing different patients identified by a Patient File Number). 
  • PCOS_infertility contains 6 columns (representing various physical and clinical parameters) and 541 rows (representing different patients identified by a Patient File Number).

The two files contain the same data, except that one holds only 6 of the full 45 columns. The Kaggle author did not specify why two files were provided. Many of the features in the dataset require domain knowledge in the medical field. Please refer to the Jupyter Notebook published on GitHub for a data dictionary that explains the feature set.

Initial Inspection & Data Preparation

Before exploring the questions of interest noted above, I inspected the data to get a sense of its general structure. During this inspection, I completed the following tasks:

  1. Load the data. 
  2. Describe the data. 
  3. Inspect the data for missing values.
  4. Make initial observations about the data for subsequent steps such as data cleaning and pre-processing. 
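The inspection steps above can be sketched with pandas as follows. This is a minimal illustration and not the notebook's actual code: the small data frame and its column names are hypothetical stand-ins for the Kaggle files, which would normally be loaded with pd.read_csv or pd.read_excel.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the patient table; column names and values are hypothetical.
# With the real files, this step would be a pd.read_csv(...) or pd.read_excel(...) call.
df = pd.DataFrame({
    "Patient_File_No": [1, 2, 3, 4],
    "Age": [28, 36, 33, 37],
    "Weight_kg": [44.6, 65.0, np.nan, 65.0],
    "PCOS": [0, 0, 1, 0],
})

shape = df.shape                       # (number of rows, number of columns)
summary = df.describe()                # count, mean, std and quartiles per numeric column
missing_per_column = df.isna().sum()   # number of missing values in each column

print(shape)
print(summary)
print(missing_per_column)
```

Columns with a non-zero entry in missing_per_column are the ones to look at during cleaning.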

After the initial inspection was completed, I cleaned the data by finding and removing any missing values and duplicate values. Missing values were removed rather than imputed because removing them had minimal impact on the overall analysis and dataset. I also removed unnecessary white space from the column names for easier data manipulation. In addition to cleaning the data, I processed it to calculate the correct values for the Body Mass Index (BMI), FSH/LH, and Waist:Hip Ratio columns. I converted binary columns from strings to Boolean data types for easier analysis and dropped columns that were ill-defined by the data source (Sl. No, Cycle (R/I), Fast Food (Y/N), Marriage Status (Yrs)).
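A rough sketch of these cleaning and processing steps is shown below. Apart from the four dropped column names quoted above, every column name and value is a hypothetical placeholder, and the order of operations is one reasonable choice rather than the notebook's actual sequence. The ill-defined columns are dropped first here so that the row counter does not hide duplicate rows.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with stray spaces in column names, one missing value
# and one duplicated patient row (rows 2 and 3 differ only in "Sl. No").
raw = pd.DataFrame({
    "Sl. No": [1, 2, 3, 4, 5],
    " Weight_kg ": [60.0, 72.0, 72.0, np.nan, 55.0],
    "Height_cm ": [160.0, 170.0, 170.0, 165.0, 150.0],
    "Waist_in": [30.0, 36.0, 36.0, 32.0, 28.0],
    " Hip_in": [36.0, 40.0, 40.0, 38.0, 35.0],
    "FSH": [6.0, 4.0, 4.0, 5.0, 7.0],
    "LH": [3.0, 8.0, 8.0, 2.5, 3.5],
    "Weight_Gain": [0, 1, 1, 0, 0],
    "PCOS": [0, 1, 1, 0, 0],
})

ILL_DEFINED = ["Sl. No", "Cycle (R/I)", "Fast Food (Y/N)", "Marriage Status (Yrs)"]
BINARY_COLUMNS = ["Weight_Gain", "PCOS"]  # hypothetical names


def clean(df):
    out = df.copy()
    out.columns = out.columns.str.strip()                # remove white space in names
    out = out.drop(columns=ILL_DEFINED, errors="ignore") # drop ill-defined columns
    out = out.dropna().drop_duplicates().reset_index(drop=True)
    # Recompute derived columns from their source measurements.
    out["BMI"] = out["Weight_kg"] / (out["Height_cm"] / 100) ** 2
    out["FSH_LH"] = out["FSH"] / out["LH"]
    out["Waist_Hip_Ratio"] = out["Waist_in"] / out["Hip_in"]
    # Store binary flags as Booleans.
    for col in BINARY_COLUMNS:
        out[col] = out[col].astype(bool)
    return out


cleaned = clean(raw)
print(cleaned)
```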

Exploratory Data Analysis

Now that the data has been cleaned and processed, we are ready to conduct our Exploratory Data Analysis with our three original questions in mind:

  1. Because symptoms vary from woman to woman, are there any features that are correlated with PCOS?
  2. Do non-PCOS patients exhibit similar symptoms to those who are diagnosed with PCOS?
  3. Which symptoms do PCOS patients exhibit most frequently?

Starting with the first question, I chose to plot a correlation matrix to see if any of the features were linearly correlated with other features in the data set. I used the Python package Seaborn to visualize the correlation matrix as a heatmap; the matrix itself uses Pearson’s correlation coefficient, which is the default in pandas. Pearson’s correlation coefficient is a statistical measure of the strength of the linear relationship between two variables, and it ranges from -1 to +1. It is important to note that correlation does not necessarily imply causation. A strong positive correlation (a coefficient near +1) between a symptom and PCOS does not mean that the symptom causes PCOS.
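To make the coefficient concrete, the sketch below computes a Pearson correlation matrix with pandas and checks one entry against the textbook formula (covariance divided by the product of the standard deviations). The data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real data set has far more rows and columns.
df = pd.DataFrame({
    "Follicle_No_R": [3, 5, 15, 2, 12, 4],
    "Weight_Gain": [0, 0, 1, 0, 1, 1],
    "Age": [28, 36, 33, 37, 25, 31],
    "PCOS": [0, 0, 1, 0, 1, 0],
})

# DataFrame.corr uses method="pearson" unless told otherwise.
corr = df.corr()


def pearson(x, y):
    """Pearson's r computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())


r_manual = pearson(df["Follicle_No_R"], df["PCOS"])
print(corr.round(2))
print(round(r_manual, 3))
```

Passing corr to seaborn.heatmap draws the matrix as a heatmap.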

Because the correlation matrix is so large, it is difficult to see which features are correlated with one another. Even so, from the heatmap above we can make a couple of observations that answer our first question of interest:

  • Follicles in the left and right ovaries and symptoms such as skin darkening, hair growth and weight gain are highly correlated with PCOS.
  • We can validate these correlation values against medical research published thus far. PCOS patients are more likely to have a larger number of follicles that likely will not mature, which prevents the patient from ovulating; ovulation is what would otherwise reduce the number of follicles found in the ovary.
  • Skin darkening, hair growth and weight gain are all symptoms that are most frequent in PCOS patients due to overproduction of male hormones.

After generating the correlation matrix, I set out to see whether any of the features with a Pearson’s correlation coefficient of 0.7 or higher showed significance in terms of a PCOS diagnosis, using inferential statistics and hypothesis testing. The 0.7 threshold was chosen arbitrarily, though it is high enough to indicate a strong positive correlation with the PCOS dependent variable. Twenty-seven of the 45 features in the cleaned data frame met this threshold. The first step was to conduct Analysis of Variance (ANOVA) on each of these features against the PCOS feature. ANOVA is a statistical test developed by the statistician Ronald Fisher that is used to analyze whether two or more population means are equal or different.
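One way to apply such a threshold is sketched below, assuming the cleaned data sits in a pandas data frame with a PCOS column. The toy values and column names are hypothetical, and this shows the idea rather than the notebook's exact selection code.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned data frame.
df = pd.DataFrame({
    "Follicle_No_R": [3, 5, 15, 2, 12, 4],
    "Weight_Gain": [0, 0, 1, 0, 1, 1],
    "Age": [28, 36, 33, 37, 25, 31],
    "PCOS": [0, 0, 1, 0, 1, 0],
})

target = "PCOS"
threshold = 0.7

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = df.corr()[target].drop(target)

# Keep features at or above the threshold. Wrapping corr_with_target in .abs()
# would also keep strong negative correlations.
selected = corr_with_target[corr_with_target >= threshold].index.tolist()

print(corr_with_target.round(3))
print(selected)
```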

For the ANOVA on the 27 features that met the 0.7 threshold, I set up my hypothesis test by defining the null and alternative hypotheses, the confidence level and the significance level. The null hypothesis was that the mean of each variable is the same for patients with PCOS as for patients without PCOS. I set my significance level to 0.05, which corresponds to a 95% confidence level and accepts a 5% chance of rejecting a null hypothesis that is actually true.

The p-value is the probability, assuming the null hypothesis is true, of observing a difference in means as extreme as or more extreme than the one we observed. If that probability is higher than the significance level, alpha = 0.05, then we fail to reject the null hypothesis.

  • If the p-value is greater than alpha, we would retain the null hypothesis, meaning we do not have sufficient statistical evidence to conclude that the variable we are observing is associated with a PCOS diagnosis.
  • If the p-value is lower than alpha, we would reject the null hypothesis in favor of the alternative, meaning we have enough statistical evidence to conclude that the variable is associated with a PCOS diagnosis and is, therefore, significant.
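The decision rule above can be sketched with SciPy's one-way ANOVA function, scipy.stats.f_oneway. The source does not say which library the tests were run with, so treat that choice, along with the toy data and column names, as assumptions.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data: five PCOS patients followed by five non-PCOS patients.
df = pd.DataFrame({
    "PCOS": [True] * 5 + [False] * 5,
    "Follicle_No_R": [15, 12, 14, 11, 13, 3, 5, 2, 4, 6],
    "Age": [28, 33, 30, 35, 31, 29, 34, 30, 36, 31],
})

alpha = 0.05  # significance level


def anova_p_values(data, target, features):
    """Run a one-way ANOVA per feature, grouping rows by the target column."""
    p_values = {}
    for feature in features:
        groups = [group[feature].to_numpy() for _, group in data.groupby(target)]
        p_values[feature] = f_oneway(*groups).pvalue
    return pd.Series(p_values).sort_values()


p_values = anova_p_values(df, "PCOS", ["Follicle_No_R", "Age"])

# Reject the null hypothesis only where the p-value is below alpha.
significant = p_values[p_values < alpha].index.tolist()

print(p_values)
print(significant)
```

In this toy data the follicle counts differ sharply between the two groups while the ages barely differ, so only the follicle feature is flagged as significant.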

The bar chart below shows the p-value for each feature from the ANOVA tests that were conducted. The orange line marks the significance level of 0.05. If a feature falls to the left of the line, its p-value is less than 0.05, indicating significance.

From the inferential statistics analysis, evidence suggests that the following features differ between patients diagnosed with PCOS and non-PCOS patients:

  • Foll_No_R
  • Cycle_Length
  • Anti_Mull_Horm
  • Waist_in
  • Age
  • Prolactin
  • Follicle_Stim_Horm
  • Waist_Hip_Ratio

If we were a patient or provider, we could start our examination with these characteristics to see if PCOS was a cause of infertility. 

After the first question was answered and thoroughly analyzed, I moved on to answering the second question:

  • Do non-PCOS patients exhibit similar symptoms to those who are diagnosed with PCOS?

A simple categorical aggregation helps us answer this question:

According to this table, acne and hair loss were the most frequent symptoms exhibited by non-PCOS patients. Other than these two symptoms, most non-PCOS patients did not exhibit any other PCOS related symptoms in this data set. 

Now that we know that acne and hair loss are the most frequent symptoms exhibited by non-PCOS patients, let’s take a look at the most frequent symptoms in PCOS patients to look for similarities or differences. 

From the table above, it looks like the most frequent symptoms exhibited by PCOS patients are acne, skin darkening and weight gain. After doing this simple categorical variable aggregation, we can conclude that the only frequent symptom that PCOS and non-PCOS patients have in common is acne, which may be caused by a variety of other conditions, not just PCOS.
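A categorical aggregation of this kind can be sketched with a pandas groupby over Boolean symptom columns. The data and column names below are hypothetical, so the counts do not reproduce the tables discussed above.

```python
import pandas as pd

# Hypothetical toy data: three PCOS patients and four non-PCOS patients.
df = pd.DataFrame({
    "PCOS": [True, True, True, False, False, False, False],
    "Acne": [True, True, False, True, True, False, False],
    "Hair_Loss": [False, True, False, True, False, True, False],
    "Skin_Darkening": [True, True, False, False, False, False, False],
    "Weight_Gain": [True, False, True, False, False, False, True],
})

symptoms = ["Acne", "Hair_Loss", "Skin_Darkening", "Weight_Gain"]

# Readable group labels instead of True/False.
group = df["PCOS"].map({True: "PCOS", False: "non-PCOS"})

counts = df.groupby(group)[symptoms].sum()    # patients reporting each symptom
shares = df.groupby(group)[symptoms].mean()   # fraction of each group reporting it

print(counts)
print(shares.round(2))
```

Comparing shares rather than raw counts keeps the two groups comparable when they differ in size.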

Conclusion

In conclusion, PCOS symptoms exist on a spectrum. They can vary from woman to woman, and even within the same woman, depending on the timing of her cycle or even her age. In this exploratory data analysis, I found that follicle count, cycle length, anti-Müllerian hormone (AMH), prolactin (PRL), follicle-stimulating hormone (FSH), age and waist measurements are the strongest candidate indicators of PCOS. The most frequent symptoms exhibited by non-PCOS patients were acne and hair loss. The most frequent symptoms exhibited by PCOS patients were acne, weight gain and skin darkening.

It is of vital importance that the provider and patient work together to narrow down symptoms, run diagnostic tests, and continuously monitor to treat PCOS. Further research and even machine learning techniques could be developed to help treat and diagnose such variable conditions in patients. 
Please note that additional visualizations, such as distributions of features, are provided in PCOS_Analysis.ipynb on GitHub.

About Author

Sarah Beth Powell

I'm a proven project manager, with curiosity in data science and solving problems through statistics, math and coding. I have over 8 years of experience ranging from people analytics in human resources to assortment optimization in retail. With...

