Exploring Polycystic Ovarian Syndrome (PCOS) Symptoms with Data

Posted on May 24, 2023

According to the Endocrine Society, polycystic ovarian syndrome (PCOS) is one of the most common causes of female infertility, affecting as many as 5 million American women of childbearing age. Women affected by PCOS may produce higher-than-normal amounts of male hormones, which may impact their overall health, even past their childbearing years. Symptoms can differ for every woman, which makes the condition very difficult to diagnose. In this exploratory data analysis project, I explored three specific questions:

  1. Because symptoms vary from woman to woman, are there any features that are correlated with PCOS?
  2. Do non-PCOS patients exhibit similar symptoms to those who are diagnosed with PCOS?
  3. Which symptoms do PCOS patients exhibit most frequently?


The data used for this project includes all physical and clinical parameters from a group of patients, collected from ten different hospitals across Kerala, India. The original data set and notebook can be found on Kaggle. The data consists of one Comma Separated Values (CSV) file and one Excel file.

  • PCOS_Data_without_infertility contains 45 columns (representing various physical and clinical parameters) and 541 rows (representing different patients identified by a Patient File Number). 
  • PCOS_infertility contains 6 columns (representing various physical and clinical parameters) and 541 rows (representing different patients identified by a Patient File Number).

The two files contain the same data, except that one holds only 6 of the full 45 columns. The Kaggle author did not specify why the two files were provided. Many of the features in the dataset require domain knowledge in the medical field. Please refer to the data dictionary in the Jupyter Notebook published on GitHub to better understand the feature set.

Initial Inspection & Data Preparation

Before I explored the questions of interest noted above, I inspected the data to get a sense of the data’s general construct. During this inspection, I completed the following tasks:

  1. Load the data. 
  2. Describe the data. 
  3. Inspect the data for missing values.
  4. Make initial observations about the data for subsequent steps such as data cleaning and pre-processing. 
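The inspection steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the small data frame and its column names are hypothetical stand-ins for the Kaggle files, which in practice would be loaded with `pd.read_csv` or `pd.read_excel`.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data (assumed column names and values).
# In practice the frame would come from pd.read_csv(...) or pd.read_excel(...).
raw = pd.DataFrame(
    {
        "Patient File No.": [1, 2, 3, 4],
        " Age (yrs)": [28, 36, 33, 37],
        "Weight (Kg)": [44.6, 65.0, None, 65.0],
        "PCOS (Y/N)": [0, 0, 1, 0],
    }
)


def inspect(df: pd.DataFrame) -> pd.Series:
    """Print a quick overview and return the number of missing values per column."""
    print(df.shape)       # (rows, columns)
    print(df.dtypes)      # data type of each column
    print(df.describe())  # summary statistics for the numeric columns
    missing = df.isna().sum()
    print(missing[missing > 0])  # only the columns that contain missing values
    return missing


missing_counts = inspect(raw)
```

The returned series makes it easy to decide, column by column, whether missing values are rare enough to drop.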

After the initial inspection was completed, I cleaned the data by finding and removing any missing values and duplicate values. Missing values were removed rather than imputed due to their minimal impact on the overall analysis and dataset. I also removed any unnecessary white space in the column names for easier data manipulation. In addition to cleaning the data, I processed it to calculate the correct values for the Body Mass Index (BMI), FSH/LH, and Waist:Hip Ratio columns. I converted binary columns to Boolean data types rather than strings for easier analysis and dropped other columns that were ill-defined in the data source (Sl. No, Cycle (R/I), Fast Food (Y/N), Marriage Status (Yrs)).
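A minimal sketch of these cleaning steps, assuming pandas, might look like the following. The sample rows and column names are hypothetical and only illustrate the kinds of problems described; the BMI formula used is the standard weight in kilograms divided by height in meters squared.

```python
import pandas as pd

# Hypothetical sample with stray whitespace, a duplicate row, a missing value
# and an unreliable BMI column (all values are made up for illustration).
raw = pd.DataFrame(
    {
        " Age (yrs)": [28, 36, 36, 33],
        "Weight (Kg) ": [44.6, 65.0, 65.0, 68.8],
        "Height(Cm) ": [152.0, 161.5, 161.5, None],
        "BMI": [19.3, 0.0, 0.0, 25.3],
        "Weight gain(Y/N)": [0, 0, 0, 1],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Strip column names, drop missing and duplicate rows, recompute BMI."""
    out = df.copy()
    out.columns = out.columns.str.strip()  # remove stray whitespace in names
    out = out.dropna().drop_duplicates().reset_index(drop=True)
    height_m = out["Height(Cm)"] / 100     # centimeters to meters
    out["BMI"] = (out["Weight (Kg)"] / height_m**2).round(1)
    out["Weight gain(Y/N)"] = out["Weight gain(Y/N)"].astype(bool)
    return out


cleaned = clean(raw)
print(cleaned)
```

Recomputing derived columns such as BMI from their source columns avoids carrying over incorrect values from the original file.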

Exploratory Data Analysis

Now that the data has been cleaned and processed, we are ready to conduct our Exploratory Data Analysis with our three original questions in mind:

  1. Because symptoms vary from woman to woman, are there any features that are correlated with PCOS?
  2. Do non-PCOS patients exhibit similar symptoms to those who are diagnosed with PCOS?
  3. Which symptoms do PCOS patients exhibit most frequently?

Starting with the first question, I chose to plot a correlation matrix to see if any of the features were linearly correlated with other features in the data set. I used the Python package Seaborn to visualize the correlation matrix, which is based on Pearson's correlation coefficient (the default method when a correlation matrix is computed with pandas). Pearson's correlation coefficient is a statistical measure of the strength of the linear relationship between two variables and ranges from -1 to +1. It is important to note that correlation does not necessarily imply causation: even if a symptom shows a strong positive correlation with PCOS (close to +1), this does not mean that the symptom causes PCOS.

Because the correlation matrix is so large, it is difficult to see which features are correlated with one another. Still, from the heatmap above, we can make a few observations that answer our first question of interest:

  • Follicles in the left and right ovaries and symptoms such as skin darkening, hair growth and weight gain are highly correlated with PCOS. 
  • These correlation values are consistent with medical research published thus far. PCOS patients are more likely to have a larger number of follicles that will not mature, which prevents ovulation, the process that would otherwise reduce the number of follicles found in the ovary.
  • Skin darkening, hair growth and weight gain are all frequent symptoms in PCOS patients due to overproduction of hormones typically associated with males.
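The correlation matrix discussed above can be reproduced in outline as follows. This is a sketch under assumptions: the data frame is a small hypothetical sample, the column names are illustrative, and the `plot_heatmap` helper is not from the original notebook. It is defined but not called, so the snippet runs without Seaborn installed.

```python
import pandas as pd

# Hypothetical cleaned sample; names and values are illustrative only.
df = pd.DataFrame(
    {
        "PCOS": [0, 0, 0, 1, 1, 1],
        "Foll_No_R": [3, 5, 4, 12, 15, 10],
        "Age": [36, 33, 37, 25, 30, 28],
    }
)

# DataFrame.corr() uses method="pearson" unless told otherwise.
corr = df.corr()


def plot_heatmap(corr_matrix: pd.DataFrame):
    """Draw the matrix as an annotated heatmap (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.tight_layout()
    return ax


# Reading one column of the matrix is often easier than scanning the full grid.
print(corr["PCOS"].sort_values(ascending=False))
```

Calling `plot_heatmap(corr)` would render the full grid; for a matrix with dozens of features, sorting a single column as in the last line is usually the quicker way to spot the strongest relationships.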

After the correlation matrix was generated, I set out to see whether any of the features with a Pearson's correlation coefficient of 0.7 or higher were significant with respect to a PCOS diagnosis, using inferential statistics and hypothesis testing. The threshold of 0.7 was chosen arbitrarily, though it is high enough to indicate a strong positive correlation with the PCOS dependent variable. Twenty-seven of the 45 features in the cleaned data frame were chosen based on this threshold. The first step was to conduct an Analysis of Variance (ANOVA) for each of these features against the PCOS feature. ANOVA is a statistical test, developed by the statistician Ronald Fisher, that is used to analyze whether two or more population means are equal or different.
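As a rough illustration of the threshold step, the sketch below keeps only the features whose Pearson correlation with the PCOS column is 0.7 or higher. The sample data and column names are hypothetical, and the exact selection code in the notebook may differ.

```python
import pandas as pd

THRESHOLD = 0.7  # the arbitrary cut-off described above

# Hypothetical cleaned sample; names and values are illustrative only.
df = pd.DataFrame(
    {
        "PCOS": [0, 0, 0, 1, 1, 1],
        "Foll_No_R": [3, 5, 4, 12, 15, 10],
        "Age": [36, 33, 37, 25, 30, 28],
        "Pulse": [72, 74, 72, 74, 72, 74],
    }
)

# Pearson correlation of every feature with the PCOS column.
corr_with_pcos = df.drop(columns="PCOS").corrwith(df["PCOS"])

# Keep the features at or above the threshold.
selected = corr_with_pcos[corr_with_pcos >= THRESHOLD].index.tolist()
print(selected)
```

Filtering on `corr_with_pcos.abs()` instead would also retain strongly negative correlations, which a positive-only threshold discards.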

After conducting ANOVA on the 27 features that met the 0.7 threshold, I set up my hypothesis test, which included defining the null and alternative hypotheses, the confidence level and the significance level. The null hypothesis was that the mean of each variable is the same for patients with PCOS and patients without PCOS; the alternative hypothesis was that the means differ. I set my significance level to 0.05, which corresponds to a 95% confidence level and accepts a 5% chance of rejecting the null hypothesis when it is actually true.

Assuming the null hypothesis is true, if the probability of observing a difference in means as extreme as or more extreme than the one we observed (the p-value) is higher than the significance level, alpha = 0.05, then we fail to reject the null hypothesis.

  • If the p-value is greater than alpha, we would retain the null hypothesis, meaning we do not have sufficient statistical evidence to conclude that the variable we are observing differs between PCOS and non-PCOS patients.
  • If the p-value is lower than alpha, we would reject the null hypothesis in favor of the alternative, meaning that we have enough statistical evidence to conclude that the variable differs between PCOS and non-PCOS patients and is, therefore, significant.
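The decision rule above can be sketched with SciPy's one-way ANOVA, `scipy.stats.f_oneway`. The data frame, its column names and the `anova_p_value` helper are hypothetical and only illustrate the procedure; they are not taken from the original notebook.

```python
import pandas as pd
from scipy.stats import f_oneway

ALPHA = 0.05  # significance level

# Hypothetical cleaned sample; names and values are illustrative only.
df = pd.DataFrame(
    {
        "PCOS": [False] * 5 + [True] * 5,
        "Foll_No_R": [3, 5, 4, 6, 4, 12, 15, 10, 13, 11],
        "Pulse": [72, 74, 72, 74, 73, 74, 72, 73, 72, 74],
    }
)


def anova_p_value(data: pd.DataFrame, feature: str, group: str = "PCOS") -> float:
    """One-way ANOVA p-value comparing a feature across the two diagnosis groups."""
    with_pcos = data.loc[data[group], feature]
    without_pcos = data.loc[~data[group], feature]
    return float(f_oneway(with_pcos, without_pcos).pvalue)


p_values = {name: anova_p_value(df, name) for name in ["Foll_No_R", "Pulse"]}

for name, p in p_values.items():
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decision}")
```

With only two groups, one-way ANOVA is equivalent to a two-sample t-test that assumes equal variances, so either test would lead to the same decision here.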

The bar chart below shows the p-value of each feature from its ANOVA test. The orange line marks the significance level of 0.05. If a feature's bar ends to the left of the line, the feature's p-value is less than 0.05, indicating significance.

From the inferential statistics analysis, evidence suggests that the following features differ between PCOS-diagnosed patients and non-PCOS patients:

  • Foll_No_R
  • Cycle_Length
  • Anti_Mull_Horm
  • Waist_in
  • Age
  • Prolactin
  • Follicle_Stim_Horm
  • Waist_Hip_Ratio

If we were a patient or provider, we could start our examination with these characteristics to see if PCOS was a cause of infertility. 

After the first question was answered and thoroughly analyzed, I moved on to answering the second question:

  • Do non-PCOS patients exhibit similar symptoms to those who are diagnosed with PCOS?

A simple categorical aggregation helps us answer this question:

According to this table, acne and hair loss were the most frequent symptoms exhibited by non-PCOS patients. Other than these two symptoms, most non-PCOS patients did not exhibit any other PCOS related symptoms in this data set. 

Now that we know that acne and hair loss are the most frequent symptoms exhibited by non-PCOS patients, let’s take a look at the most frequent symptoms in PCOS patients to look for similarities or differences. 

From the table above, it looks like the most frequent symptoms exhibited by PCOS patients are acne, skin darkening and weight gain. After doing this simple categorical variable aggregation, we can conclude that the only frequent symptom that PCOS and non-PCOS patients have in common is acne, which may be caused by a variety of other conditions, not just PCOS.
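The categorical aggregation behind both tables can be sketched with a pandas `groupby`. The values below are made up for illustration and are not the study's data; the column names are also assumptions.

```python
import pandas as pd

# Hypothetical Boolean symptom flags; values are invented for illustration.
df = pd.DataFrame(
    {
        "PCOS": [False, False, False, False, True, True, True, True],
        "Acne": [True, True, False, False, True, True, True, False],
        "Hair_Loss": [True, False, True, False, False, True, False, False],
        "Skin_Darkening": [False, False, False, False, True, True, False, True],
        "Weight_Gain": [False, True, False, False, True, False, True, True],
    }
)

symptom_cols = ["Acne", "Hair_Loss", "Skin_Darkening", "Weight_Gain"]

# Readable group labels instead of True/False.
group = df["PCOS"].map({True: "PCOS", False: "non-PCOS"}).rename("Group")

# Number of patients reporting each symptom, split by diagnosis.
counts = df.groupby(group)[symptom_cols].sum()

# Share of patients in each group reporting each symptom.
shares = df.groupby(group)[symptom_cols].mean()

print(counts)
print(shares.loc["PCOS"].sort_values(ascending=False))
```

Comparing shares rather than raw counts keeps the two groups comparable when they contain different numbers of patients.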


In conclusion, PCOS symptoms exist on a spectrum. They can vary from woman to woman, and even within the same woman, depending on the timing of her cycle or her age. In this exploratory data analysis, I found that follicle count, cycle length, AMH, PRL, FSH, age and waist measurements are the strongest candidate indicators of PCOS. The most frequent symptoms exhibited by non-PCOS patients were acne and hair loss. The most frequent symptoms exhibited by PCOS patients were acne, weight gain and skin darkening.

It is of vital importance that the provider and patient work together to narrow down symptoms, run diagnostic tests, and monitor continuously in order to treat PCOS. Further research and even machine learning techniques could be developed to help treat and diagnose such variable conditions in patients.

Please note that additional visualizations such as distributions of features are provided in PCOS_Analysis.ipynb on GitHub.

About Author

Sarah Beth Powell

I'm a proven project manager, with curiosity in data science and solving problems through statistics, math and coding. I have over 8 years of experience ranging from people analytics in human resources to assortment optimization in retail. With...