Exploring Polycystic Ovarian Syndrome (PCOS) Symptoms with Data
According to the Endocrine Society, polycystic ovarian syndrome (PCOS) is one of the most common causes of female infertility, affecting as many as 5 million American women of childbearing age. Women affected by PCOS may produce higher than normal amounts of male hormones, which may impact their overall health, even past their childbearing years. Symptoms can differ from woman to woman, which makes the condition very difficult to diagnose. In this exploratory data analysis project, I explored three specific questions:
- Because symptoms vary from woman to woman, are there any features that are correlated with PCOS?
- Do non-PCOS patients exhibit similar symptoms to those who are diagnosed with PCOS?
- Which symptoms do PCOS patients exhibit most frequently?
Data
The data used for this project includes all physical and clinical parameters from a group of patients, collected from ten different hospitals across Kerala, India. The original data set and notebook can be found on Kaggle. The data consists of one Comma Separated Values (CSV) file and one Excel file.
- PCOS_Data_without_infertility contains 45 columns (representing various physical and clinical parameters) and 541 rows (representing different patients identified by a Patient File Number).
- PCOS_infertility contains 6 columns (representing various physical and clinical parameters) and 541 rows (representing different patients identified by a Patient File Number).
The data in these two files is identical, except that one file contains only 6 of the columns while the other contains the full 45. The Kaggle author did not specify why the two files were provided. Many of the features in the dataset require domain knowledge in the medical field. For a better understanding of the feature set, please refer to the data dictionary in the Jupyter Notebook published on GitHub.
Initial Inspection & Data Preparation
Before I explored the questions of interest noted above, I inspected the data to get a sense of its general structure. During this inspection, I completed the following tasks:
- Load the data.
- Describe the data.
- Inspect the data for missing values.
- Make initial observations about the data for subsequent steps such as data cleaning and pre-processing.
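The inspection steps above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the small frame is synthetic, the `PCOS` label column name is an assumption, and the `inspect` helper is hypothetical. With the real files, the frame would instead come from `pd.read_csv` or `pd.read_excel`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data set (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(20, 45, size=8).astype(float),
    "Foll_No_R": rng.integers(0, 20, size=8).astype(float),
    "PCOS": [0, 1, 0, 0, 1, 0, 1, 0],  # assumed name for the diagnosis column
})
df.loc[2, "Age"] = np.nan  # inject one missing value for illustration


def inspect(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and missing share."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean().round(3),
    })


print(df.shape)       # number of rows and columns
print(df.describe())  # summary statistics for the numeric columns
summary = inspect(df)
print(summary)        # where the missing values are
```

A per-column summary like this makes it easy to judge whether dropping rows with missing values would remove a meaningful share of the data.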
After the initial inspection was completed, I cleaned the data by finding and removing any missing values and duplicate values. Missing values were removed rather than imputed due to their minimal impact on the overall analysis and dataset. I also removed any unnecessary white space in the column names for easier data manipulation. In addition to cleaning the data, I recalculated the values of the Body Mass Index (BMI), FSH/LH, and Waist:Hip Ratio columns so that they were correct. I converted binary columns from strings to Boolean data types for easier analysis and dropped columns that were ill-defined in the data source (Sl. No, Cycle (R/I), Fast Food (Y/N), Marriage Status (Yrs)).
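A minimal sketch of these cleaning steps follows. It assumes weight in kilograms, height in centimeters, and binary flags stored as the strings "0" and "1"; the column names, units, and the `clean` helper are illustrative assumptions rather than the dataset's confirmed schema. The FSH/LH ratio could be recomputed in the same way as the two ratios shown.

```python
import pandas as pd

# Tiny hand-made frame; names, units, and values are assumptions for illustration.
raw = pd.DataFrame({
    " Weight_kg ": [60.0, 72.0, 72.0, None],
    "Height_cm ": [160.0, 170.0, 170.0, 165.0],
    "Waist_in": [30.0, 34.0, 34.0, 32.0],
    "Hip_in": [38.0, 40.0, 40.0, 39.0],
    "Weight_Gain": ["1", "0", "0", "1"],  # binary flag stored as a string
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    # Remove stray white space around column names.
    out.columns = out.columns.str.strip()
    # Drop rows with missing values, then exact duplicate rows.
    out = out.dropna().drop_duplicates().reset_index(drop=True)
    # Recompute derived columns from their source measurements.
    out["BMI"] = out["Weight_kg"] / (out["Height_cm"] / 100) ** 2
    out["Waist_Hip_Ratio"] = out["Waist_in"] / out["Hip_in"]
    # Convert the string flag to a Boolean column.
    out["Weight_Gain"] = out["Weight_Gain"].astype(int).astype(bool)
    return out


cleaned = clean(raw)
print(cleaned)
```

Working on a copy keeps the raw frame untouched, which makes it easy to compare row counts before and after cleaning.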
Exploratory Data Analysis
Now that the data has been cleaned and processed, we are ready to conduct our Exploratory Data Analysis with our three original questions in mind:
- Because symptoms vary from woman to woman, are there any features that are correlated with PCOS?
- Do non-PCOS patients exhibit similar symptoms to those who are diagnosed with PCOS?
- Which symptoms do PCOS patients exhibit most frequently?
Starting with the first question, I chose to plot a correlation matrix to see if any of the features were linearly correlated with other features in the data set. I used the Python package Seaborn to visualize the correlation matrix, which was computed with Pearson's correlation coefficient. Pearson's correlation coefficient is a statistical measure that expresses the strength of the linear relationship between two variables and can range from -1 to +1. It is important to note that correlation does not necessarily imply causation. Even if a symptom shows a strong positive correlation with PCOS (a coefficient close to +1), this does not necessarily mean that the symptom causes PCOS.
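As a rough sketch of this step, the correlation matrix can be computed with pandas (`DataFrame.corr` uses Pearson's coefficient by default) and then drawn with Seaborn's `heatmap`. The data below is synthetic, the `PCOS` column name is an assumption, and `plot_heatmap` is a hypothetical helper that is defined but not called here.

```python
import numpy as np
import pandas as pd

# Synthetic data: follicle count is built to depend on the label, age is not.
rng = np.random.default_rng(42)
n = 200
pcos = rng.integers(0, 2, size=n)  # assumed 0/1 diagnosis column
df = pd.DataFrame({
    "PCOS": pcos,
    "Foll_No_R": 6 + 6 * pcos + rng.normal(0, 2, size=n),
    "Age": rng.normal(31, 5, size=n),
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr(method="pearson")

# Correlation of every feature with the diagnosis, strongest first.
with_target = corr["PCOS"].drop("PCOS").sort_values(ascending=False)
print(with_target)


def plot_heatmap(matrix: pd.DataFrame) -> None:
    """Draw the matrix as an annotated heatmap (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(matrix, vmin=-1, vmax=1, center=0, cmap="coolwarm", annot=True)
    plt.show()
```

Sorting the single `PCOS` column of the matrix is often easier to read than the full heatmap when the data set has dozens of features.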
Because the correlation matrix is so large, it is difficult to see which features are correlated with which other features. Still, from the heatmap above, we can make a couple of observations that answer our first question of interest:
- Follicles in the left and right ovaries and symptoms such as skin darkening, hair growth and weight gain are highly correlated with PCOS.
- We can validate these correlation values with medical research published thus far. PCOS patients are more likely to have a larger number of follicles that likely will not mature and therefore will prevent the patient from ovulating which reduces the number of follicles found in the ovary.
- Skin darkening, hair growth and weight gain are all frequent symptoms in PCOS patients, attributed to the overproduction of male hormones.
After the correlation matrix was generated, to advance the analysis, I used inferential statistics and hypothesis testing to see whether any of the features with a Pearson's correlation coefficient of 0.7 or higher were significant with respect to a PCOS diagnosis. The 0.7 threshold was chosen arbitrarily, though it was high enough to indicate a strong positive correlation with the PCOS dependent variable. The first step was to conduct Analysis of Variance (ANOVA) between the PCOS feature and each feature that met the threshold. ANOVA is a statistical test developed by the statistician Ronald Fisher that is used to analyze whether two or more population means are equal or different. Twenty-seven of the 45 features in the cleaned data frame met this threshold.
After conducting ANOVA on the 27 features that met the 0.7 threshold, I set up my hypothesis test, which included defining the null and alternative hypotheses, the confidence level and the significance level. The null hypothesis was that, for each variable, the mean among patients with PCOS is equal to the mean among patients without PCOS. I set my significance level to 0.05, which corresponds to a 95% confidence level and accepts a 5% chance of rejecting a null hypothesis that is actually true.
The p-value is the probability, assuming the null hypothesis is true, of observing a difference in means at least as extreme as the one we observed. It is compared with the significance level, alpha = 0.05:
- If the p-value is greater than alpha, we fail to reject the null hypothesis, meaning we do not have enough statistical evidence to conclude that the variable we are observing is associated with a PCOS diagnosis.
- If the p-value is lower than alpha, we reject the null hypothesis in favor of the alternative, meaning that we have enough statistical evidence to conclude that the variable is associated with a PCOS diagnosis and is, therefore, significant.
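The test and decision rule described above can be sketched with `scipy.stats.f_oneway`, which performs a one-way ANOVA across groups. The data is synthetic, the `PCOS` column name is assumed, and `anova_p_value` is a hypothetical helper; the notebook's own implementation may differ.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

ALPHA = 0.05  # significance level used in the text

# Synthetic data: follicle count differs by group, age does not by construction.
rng = np.random.default_rng(7)
n = 300
pcos = rng.integers(0, 2, size=n).astype(bool)  # assumed diagnosis column
df = pd.DataFrame({
    "PCOS": pcos,
    "Foll_No_R": np.where(pcos, 12.0, 6.0) + rng.normal(0, 3, size=n),
    "Age": rng.normal(31, 5, size=n),
})


def anova_p_value(frame: pd.DataFrame, feature: str, target: str = "PCOS") -> float:
    """One-way ANOVA p-value for `feature` across the groups defined by `target`."""
    groups = [group[feature].to_numpy() for _, group in frame.groupby(target)]
    return float(f_oneway(*groups).pvalue)


p_values = {name: anova_p_value(df, name) for name in ["Foll_No_R", "Age"]}
for name, p in p_values.items():
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{name}: p = {p:.4g} -> {decision}")
```

With only two groups (PCOS and non-PCOS), a one-way ANOVA is equivalent to a two-sample t-test that assumes equal variances, so either test would lead to the same decision.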
The bar chart below shows the p-values of each PCOS feature from each ANOVA test that was conducted. The orange line shows the significance level of 0.05. If the feature is to the left of the line, this indicates that the feature’s p-value is less than 0.05, indicating significance.
From the inferential statistics analysis, evidence suggests that the following features differ between patients diagnosed with PCOS and non-PCOS patients:
- Foll_No_R
- Cycle_Length
- Anti_Mull_Horm
- Waist_in
- Age
- Prolactin
- Follicle_Stim_Horm
- Waist_Hip_Ratio
If we were a patient or provider, we could start our examination with these characteristics to see if PCOS was a cause of infertility.
After the first question was answered and thoroughly analyzed, I moved on to the second question:
- Do non-PCOS patients exhibit similar symptoms to those who are diagnosed with PCOS?
To answer this question, I used a simple categorical aggregation:
According to this table, acne and hair loss were the most frequent symptoms exhibited by non-PCOS patients. Other than these two symptoms, most non-PCOS patients did not exhibit any other PCOS related symptoms in this data set.
Now that we know that acne and hair loss are the most frequent symptoms exhibited by non-PCOS patients, let’s take a look at the most frequent symptoms in PCOS patients to look for similarities or differences.
From the table above, it looks like the most frequent symptoms exhibited by PCOS patients are acne, skin darkening and weight gain. After this simple categorical variable aggregation, we can conclude that the only frequent symptom that PCOS and non-PCOS patients have in common is acne, which may be caused by a variety of other conditions, not just PCOS.
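The categorical aggregation used for both groups can be sketched as a pandas `groupby` over Boolean symptom columns. The rows below are invented for illustration and do not reproduce the study's counts; the symptom and `PCOS` column names are assumptions.

```python
import pandas as pd

# Invented rows: 1 = PCOS, 0 = non-PCOS; True means the symptom is present.
df = pd.DataFrame({
    "PCOS":           [1, 1, 1, 0, 0, 0, 0, 0],
    "Acne":           [True, True, False, True, True, False, False, False],
    "Skin_Darkening": [True, True, False, False, False, False, False, False],
    "Weight_Gain":    [True, False, True, False, False, True, False, False],
    "Hair_Loss":      [False, True, False, True, False, False, False, False],
})

symptoms = [column for column in df.columns if column != "PCOS"]

# Number of patients in each group who exhibit each symptom.
counts = df.groupby("PCOS")[symptoms].sum()

# Share of each group exhibiting each symptom (mean of a Boolean column).
share = df.groupby("PCOS")[symptoms].mean().round(2)

print(counts)
print(share.T)  # transposed so that symptoms are rows and groups are columns
```

Comparing shares rather than raw counts matters when the two groups differ in size, since a symptom can be more common in absolute terms in the larger group while being rarer proportionally.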
Conclusion
In conclusion, PCOS symptoms exist on a spectrum. They can vary from woman to woman, and even within the same woman, depending on the timing of her cycle or even her age. In this exploratory data analysis, I found that follicle count, cycle length, AMH, PRL, FSH, age and waist measurements are the strongest potential indicators of PCOS. The most frequent symptoms exhibited by non-PCOS patients were acne and hair loss. The most frequent symptoms exhibited by PCOS patients were acne, weight gain and skin darkening.
It is of vital importance that the provider and patient work together to narrow down symptoms, run diagnostic tests, and continuously monitor to treat PCOS. Further research and even machine learning techniques could be developed to help treat and diagnose such variable conditions in patients.
Please note that additional visualizations, such as distributions of features, are provided in PCOS_Analysis.ipynb on GitHub.