
Heart Cures: Predicting Congestive Heart Failure

Eric Meyers, Haoyun ZHANG and Maite Herrero Gorostiza
Posted on Oct 15, 2019

Introduction & Motivations

What if humans could battle disease before it even occurred?

Indeed, disease prediction is one of the foremost machine learning innovations coming out of the medical field right now. But there are several challenges along the way. Tying medical records for individuals across doctor and hospital visits in a holistic, standardized, and compliant way is difficult enough. On top of that, there is the difficulty of leveraging the technologies available for accurate record and image recognition on those medical records and tying in centuries of medical research to reach conclusive decisions.

Separately managing predictive outcomes with patients in a practical and cost-efficient manner is no easy feat. Each one of these goals can take a team or company a decade to achieve, though we consider them interrelated and inevitable, as the health and medical fields are ripe for innovation.

With this project, we took the first steps into the convoluted medical record space, extracting and aggregating patient-level data. Our goal was to create a predictive model that can accurately identify patients with the presence of Congestive Heart Failure, one of the main causes of mortality in the developed world. 

It must be noted that this is just the beginning of creating a machine-learning-based engine that could accurately predict the inherited presence and/or future development of hundreds of diseases. The motivation for this work is to give doctors and healthcare professionals a low-cost tool that can spot the early onset of diseases while giving patients more proactive means to prevent or delay critical illnesses.

Our project shows that: 

  • There is a critical need for standardization and centralization of medical records
  • Modeling for disease prediction requires expertise in medical research and data science
  • Conflicts exist between current health policies and scientific research that must be resolved with collaboration and technological tools to clear obstacles for innovation in the medical field

Why Healthcare? 

The medical and healthcare industries are ripe for innovation. The U.S. healthcare industry was estimated at $14B in 2019 and projected to grow to $50.5B by 2024, a CAGR of 28.3%. At the same time, patient registries and healthcare costs are rising in an increasingly aging society. Given the circumstances, medical professionals must work intelligently and efficiently to keep costs low while improving patient care.

We believe matching the domain expertise of medical professionals with the advances in machine learning can help the medical industry evolve and rapidly adapt to this reality.  

Why Congestive Heart Failure?

The CDC cites that:

"About 5.7M adults in the United States have heart failure, one in nine deaths in 2009 included heart failure as contributing cause, about half of people who develop heart failure die within five years of diagnosis, and heart failure costs the nation an estimated $30.7B each year. This total includes the cost of healthcare services, medications to treat heart failure, and missed days of work."

Moreover, congestive heart failure is notoriously difficult and expensive to diagnose as most symptoms overlap with many other common conditions, such as diabetes, coronary artery disease, stroke, and high blood pressure. The most common symptoms are difficulty breathing, general tiredness, rapid weight gain/loss, swelling stomach and limbs, and sometimes heart attack.

The main test indicators for identification are chest X-rays, electrocardiograms, and laboratory blood tests. Within lab blood tests, the indicators to which we have access for this project, there are 6 common blood gas, 12 blood chemistry, and 12 hematology measurements.

Data Source

Our data source for this project was the MIMIC-III database hosted by PhysioNet. "MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising de-identified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012." The database was developed by the MIT Lab for Computational Physiology.

It consists of 26 tables holding patients' (anonymized) demographics, vital signs, laboratory tests, medications, diagnoses, and more. Although this database is publicly accessible, gaining access requires an eight-module training course ensuring users adequately comply with HIPAA, U.S. health record privacy policies, human subject training, and specimen research training standards.

Extracting a usable data set for disease prediction required intimate knowledge of the content hosted on each table, the relationships (primary keys/foreign keys) and types of relationships (one-to-one vs. one-to-many) of entities across the database, as well as the definitions associated with medical standards such as ICD-9 codes relating to diagnosed conditions or blood test measurement types.

We pivoted and aggregated blood test measurements grouped by individual patients, joined the diagnosis tables, and merged demographic information from the admissions and patients tables. As a result, we extracted a dataset of roughly 44,000 patients, our "observations," each with 115 attributes. Fifteen of these attributes were demographic, i.e., numeric (age) and categorical (ethnicity, gender, language spoken, etc.).

The other hundred attributes came from a reduced set of 25 blood test measurements (the most common of about 700 possible), each summarized by its minimum, maximum, mean, and standard deviation per patient (25 measurements x 4 statistics each = 100 features).

Once we tied the presence of congestive heart failure to these patients, we had our binary target variable, designating '1' for patients having congestive heart failure and '0' for those without. Altogether, 9,829 (25.5%) of the adults in our data set had CHF, giving us a baseline accuracy of 74.5% assuming the majority class.
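
As a rough illustration of this extraction, here is a minimal pandas sketch of the pivot/aggregate/join logic and the majority-class baseline. The table and column names (subject_id, measurement, icd9_code, etc.) are illustrative stand-ins, not the actual MIMIC-III schema:

    import pandas as pd

    # Toy stand-in for the lab events table; the real MIMIC-III schema differs.
    labs = pd.DataFrame({
        "subject_id":  [1, 1, 1, 2, 2],
        "measurement": ["Glucose", "Glucose", "Hematocrit", "Glucose", "Hematocrit"],
        "value":       [98.0, 141.0, 44.1, 87.0, 39.5],
    })

    # Pivot/aggregate: one row per patient, four summary statistics per measurement.
    features = labs.pivot_table(index="subject_id", columns="measurement",
                                values="value", aggfunc=["min", "max", "mean", "std"])
    features.columns = [f"{stat}_{meas}" for stat, meas in features.columns]

    # Binary target: 1 if any diagnosis row carries a CHF ICD-9 code (428.x).
    diagnoses = pd.DataFrame({"subject_id": [1, 2], "icd9_code": ["4280", "25000"]})
    chf = (diagnoses["icd9_code"].str.startswith("428")
           .groupby(diagnoses["subject_id"]).max().astype(int))

    df = features.join(chf.rename("target"))

    # Majority-class baseline accuracy: predict "no CHF" for everyone.
    print("baseline accuracy:", 1 - df["target"].mean())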

Data Cleaning

One aspect of HIPAA compliance and data privatization that presented a challenge was identifying patients' ages from the dataset. To comply with HIPAA, patients under 15 and over 85 years of age must be entirely anonymized and their age records obscured within those age ranges. Obviously, this presents a notable risk to our model, and an example where public policy and innovation are in conflict.

The database documentation mentioned that all date figures (i.e., "Date of Birth," "Date of Death," all admission dates, etc.) were shifted into the future to ensure proper anonymization of patient records; however, the derived age of each patient was preserved. Age could be accurately derived, calculated as "shifted date of first admission minus shifted DOB," with the exceptions of patients under 15 (where these two dates are the same, so age is calculated as 0) and patients over 85 (where these two dates are calculated as 300 years apart).

These exceptions are likely due to legal barriers set for requiring proper consent. We were informed that patients over 85 averaged 91.5 years old, so all such patients were imputed as '92' years of age. While this direction allowed a method of meaningful imputation, the age ranges below 15 and above 85 presented notable risks to our model.
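
A minimal sketch of this age logic, assuming illustrative column names (dob, admittime) and the roughly 300-year shift documented for patients over 85:

    import numpy as np
    import pandas as pd

    admits = pd.DataFrame({
        "dob":       pd.to_datetime(["2130-05-01", "2101-07-19", "1850-03-02"]),
        "admittime": pd.to_datetime(["2165-05-01", "2101-07-19", "2150-03-02"]),
    })

    # Derived age survives the date shifting: first admission minus date of birth.
    age = (admits["admittime"] - admits["dob"]).dt.days / 365.25

    # Patients over 85 appear ~300 years apart; impute the documented average of ~92.
    age = np.where(age > 100, 92.0, age)
    print(age)   # patients under 15 compute to 0 and remain a known model risk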

Once we calculated age from the date fields, we proceeded to remove variables from our dataset that introduced bias. For example, "Expiration Flag" and "Date of Death" both represent a bias toward death (an unfortunately common result of CHF), so these fields were removed to prevent this bias in our model. "Date of Birth" and "Date of Admission" were randomized to future dates, so these features were removed as unnecessary once patients' age was extracted from them as a separate field.

Certain hematology blood measurements, such as Hematocrit (ratio of the volume of red blood cells to the total volume of blood), Mean Corpuscular Hemoglobin Concentration (MCHC: the average concentration of hemoglobin in your red blood cells), and Red Cell Distribution Width (RDW: the range in the volume and size of your red blood cells) are blood tests calculated as mean or percentage measurements.

Therefore, these measurements could not be correctly aggregated into mean or standard deviation calculations and had to be removed as features, though their valid maxima and minima aggregate measurements were retained for modeling.

The final cleaning procedure was recategorizing our demographic fields. The language, ethnicity, and religion fields, unsurprisingly, had long tails that would introduce unnecessary sparsity in our modeling phase. For that reason, most of these categories were cut off and reduced to "Other" where deemed too sparse to stand on their own.

'Marital Status' had numerous categories, so we combined similar values such as "Divorced," "Widowed," and "Separated" into a combined "Post-Marriage" category to ensure more even sampling, under the assumption that all three life events create a similarly stressful lifestyle for an individual and both metaphorically and physically impact the heart.
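
As a minimal illustration (the column names, category labels, and count cutoff here are hypothetical), the recategorization might look like this in pandas:

    import pandas as pd

    demo = pd.DataFrame({
        "marital_status": ["MARRIED", "DIVORCED", "WIDOWED", "SEPARATED", "SINGLE"],
        "language":       ["ENGL", "SPAN", "CANT", "AMHA", "ENGL"],
    })

    # Collapse similar life events into one "POST-MARRIAGE" category, per the assumption above.
    demo["marital_status"] = demo["marital_status"].replace(
        {"DIVORCED": "POST-MARRIAGE", "WIDOWED": "POST-MARRIAGE", "SEPARATED": "POST-MARRIAGE"})

    # Collapse long-tail categories (here, languages seen fewer than 2 times) into "OTHER".
    counts = demo["language"].value_counts()
    demo.loc[demo["language"].isin(counts[counts < 2].index), "language"] = "OTHER"
    print(demo)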

Missingness & Imputation

Fortunately, all demographic (age, ethnicity, marital status, etc.) data were present, so no imputation was required for those fields. All blood measurements had some degree of missing values; most with only a few hundred to a few thousand blood measurement values missing (less than 10% of patients). However, five blood measurements (20 of the 100 fields) had about 10,000 values missing (roughly 25% of patients).

Some measurement values were labeled as missing but actually represented abnormal values (such as '>10', '<1', etc.); in these cases, the missing value was imputed with the boundary value. Knowing that our modeling phase would use logistic regression and gradient descent, which require no missing values, as well as tree-based classification models that can still make decisions with missingness, we opted to split our dataset into two versions here. All cleaning decisions made up until this point would be used for tree-based models.

The dataset for logistic regression and gradient descent additionally required that these blood measurement values be imputed.
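
A small sketch of the boundary-value parsing (the helper name is ours, and the sample values are made up):

    import numpy as np
    import pandas as pd

    raw = pd.Series(["7.2", ">10", "<1", None, "3.4"])

    def to_numeric_with_boundaries(v):
        if pd.isna(v):
            return np.nan        # genuinely missing; imputed later
        v = str(v)
        if v[0] in "<>":
            return float(v[1:])  # '>10' -> 10.0, '<1' -> 1.0 (boundary value)
        return float(v)

    print(raw.map(to_numeric_with_boundaries))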

For our logistic/gradient descent data set, we took special care to understand why blood measurements might be missing in order to identify the most appropriate imputation method, knowing this would certainly affect our model accuracy down the road. After careful consideration and research, we found that most doctors in the U.S. would opt to "overtest" and run analyses on any relevant measurement for a sick individual to "play it safe" and ensure that they cannot be held liable for misdiagnosis or malpractice.

In other healthcare systems, the opposite would be likely. Therefore, we concluded that missing blood measurements are missing at random (MAR), not missing completely at random (MCAR), with the assumption that a patient is presumed to have normal, healthy values if they are not tested. This means that we would rely on medical research for what defines a normal, healthy minimum, maximum, mean, and standard deviation for each blood measurement to impute missing values, instead of using the hospital population's values, which would likely skew toward those of ill individuals.

With additional research, we found that four blood measurements (Creatinine, Hematocrit, Hemoglobin, and Red Blood Cell count) even had different "normal" levels for adult males vs. adult females. We therefore imputed all missing values as those of normal, healthy individuals, imputing by the respective gender-associated values where known differences occurred. Once our data set was extracted, cleaned, and imputed, we were ready to model.
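
A minimal sketch of gender-aware imputation, assuming made-up reference values and column names (the real normal ranges came from our medical research):

    import pandas as pd

    # Hypothetical sex-specific "normal" reference values for one feature.
    normal_hemoglobin_mean = {"M": 15.0, "F": 13.5}

    df = pd.DataFrame({"gender": ["M", "F", "F"],
                       "hemoglobin_mean": [14.2, None, 12.9]})

    # Fill each patient's missing value with the healthy reference for their gender.
    fill = df["gender"].map(normal_hemoglobin_mean)
    df["hemoglobin_mean"] = df["hemoglobin_mean"].fillna(fill)
    print(df)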

Modeling

Our first important consideration was that we should optimize our models toward sensitivity (or recall) over accuracy. In a confusion matrix, used for evaluating classification models, accuracy is the percentage of all observations, both positive and negative (having or not having CHF), that are predicted correctly. Sensitivity, on the other hand, is the percentage of positive observations (patients having CHF) that are predicted correctly.

We consider that, in the field of disease prediction, it is more important to build a sensitive model because we are more concerned with predicting when patients have CHF, and less concerned with when they do not. In other words, as a medical professional and patient, it is better to predict someone has CHF when they do not, and run additional tests to find out they really don't, than to predict they don't have CHF when they really do. If they are undiagnosed and untreated, they will likely suffer a great deal more than if they are subjected to an additional test.
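
For concreteness, a quick sketch of the two metrics on a toy confusion matrix (scikit-learn's recall_score is exactly the sensitivity described above):

    from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = has CHF
    y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("accuracy:   ", accuracy_score(y_true, y_pred))  # (tp + tn) / all = 6/8
    print("sensitivity:", recall_score(y_true, y_pred))    # tp / (tp + fn) = 2/3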

Our first model was regularized logistic regression. In logistic regression, we use an s-shaped sigmoid curve to assess the probability that an individual will have CHF ('1') or will not ('0'). As in linear regression, we fit feature coefficients that represent the degree to which each blood measurement or categorical feature affects a patient's likelihood of having CHF; with logistic regression, however, the coefficients sit in the exponent within the denominator of the sigmoid function rather than defining a straight line.

Linear Regression vs. Logistic Regression

Using the sigmoid function, the inverse of the logit (log-odds) function, enables us to bound our result between '0' and '1,' allowing us to classify our target. We then identify the feature coefficients and model hyperparameters that minimize inaccurate predictions by minimizing "log loss" as our loss function.
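
A tiny numpy sketch of the sigmoid, with the coefficients in the exponent of the denominator, and the log loss it is fit against (the scores and labels are invented):

    import numpy as np

    def sigmoid(z):
        # z = b0 + b1*x1 + ... + bn*xn; output is bounded in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def log_loss(y, p):
        # the loss minimized when fitting the coefficients
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    y = np.array([1, 0, 1])
    p = sigmoid(np.array([2.1, -1.3, 0.4]))
    print(p, log_loss(y, p))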

To ensure we could do this correctly, we had two quick data preparations. First, we needed to "dummify" our categorical (non-numeric) variables (i.e., Religion, Ethnicity, etc.), which allows the model to understand the presence of a non-numeric value in numerical terms. Dummification creates a separate column for the presence of each value, where '1' represents 'yes' and '0' represents 'no.'
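
In pandas this is a one-liner (the column and category names are illustrative):

    import pandas as pd

    demo = pd.DataFrame({"ethnicity": ["WHITE", "ASIAN", "OTHER", "WHITE"]})
    # One 0/1 indicator column per category value.
    print(pd.get_dummies(demo, columns=["ethnicity"]))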

Once all features were in numerical terms, we then had to scale them. Min-max scaling takes the relative magnitude of variables out of the equation by fixing the range of each variable between 0 and 1 while maintaining its distribution. Consider a blood measurement that has one thousand times the effect of another at predicting CHF but is only one thousandth the scale of the other.

Without scaling, the model would not be able to identify the degree to which the micro fluctuations of the first variable could predict the presence of CHF due to its relatively small magnitude. Min-max scaling makes the ranges of these two variables the same, enabling our model to identify fluctuations of any variable equally.
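
A quick sketch with scikit-learn's MinMaxScaler on two features of very different scales (the numbers are invented):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Column 1 is three orders of magnitude smaller than column 2.
    X = np.array([[0.001, 1200.0],
                  [0.004,  900.0],
                  [0.002, 1500.0]])
    # Both columns now span [0, 1], so fluctuations are comparable.
    print(MinMaxScaler().fit_transform(X))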

With our data ready for logistic regression, we had to tune our model for the best sensitivity. Before tuning, we ran a base model with no hyperparameters as a baseline for comparison: accuracy was 80.1% and sensitivity was 42.3%. The three most important parameters we had to tune were "C" (our penalization term), the scoring method (i.e., sensitivity, accuracy, etc.), and the cut-off value (the point on the sigmoid curve we choose to decide whether a patient is positive or negative).

Penalization allowed us to "shrink" our feature coefficients to prevent overfitting and to remove multicollinearity from our model. Optimizing toward sensitivity simply required that we choose a "ROC-AUC" scoring method instead of the default "Accuracy."

ROC-AUC Curve

To find the optimal, most sensitive C value, we used grid search with 5-fold cross-validation and found that our optimal model used L1 regularization with a C penalty of 500. This improved our accuracy by 2pp and sensitivity by 1pp. Our last optimization method was to shift the cut-off value of the sigmoid curve. Logistic regression defaults to a 50-50 split as the cut-off point, which makes sense when 'Yes' and 'No' are equally likely, like a fair coin flip.

However, our dataset is imbalanced, with about 25% of patients having CHF; so lowering our cut-off point so that we predict 'Yes' at probabilities below 50% enables our sensitivity to rise. With a little help from statistics, we identified that shifting the probability threshold to 26.3% shot our sensitivity up to 75.0% (a 31.7pp increase) while only reducing accuracy by 0.5pp (to 79.6%)!
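
A hedged sketch of this tuning, using synthetic data in place of our patient set (the grid values and the 0.263 threshold mirror the discussion above; everything else is illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Synthetic stand-in: ~75/25 class imbalance like our CHF data.
    X, y = make_classification(n_samples=2000, weights=[0.75], random_state=0)

    grid = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
        param_grid={"C": [0.1, 1, 10, 100, 500]},
        scoring="roc_auc",   # optimize toward sensitivity via ROC-AUC, not accuracy
        cv=5,
    ).fit(X, y)

    # Shift the decision threshold from the default 0.5 down to 0.263.
    proba = grid.predict_proba(X)[:, 1]
    y_pred = (proba >= 0.263).astype(int)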

One important benefit of Lasso (L1) regularization is that its penalization method shrinks coefficients that do not explain significant variance in the model to 0, so the magnitude of the remaining coefficients gives us a ranking of feature importance (extreme negative/positive being the strongest).

Top 10 Features in Logistic Regression

Interpreting our model results, red blood cell count (standard deviation and max) and MCH (standard deviation) blood tests ranked consistently strongest in feature importance. This makes sense because MCH measures the hemoglobin, the oxygen-carrying protein, in red blood cells traveling from the lungs to the heart and brain. A wildly fluctuating range of hemoglobin with a consistently low count of red blood cells creates an inefficiency of oxygen retention in the blood, which then causes the heart to inflame and fail to pump blood to the brain and body properly.

One last optimization we tried with our logistic regression model was rebalancing our dataset with NearMiss, which essentially under-samples the majority class ('No' CHF) and creates a 50/50 split between 'Yes' and 'No.' This pushed our maximum sensitivity up to 79.7% with a cut-off threshold of 45%, L1 regularization, and a C penalization of 210.
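
NearMiss under-sampling is available in the imbalanced-learn package; a minimal sketch on synthetic data:

    from collections import Counter
    from imblearn.under_sampling import NearMiss
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=2000, weights=[0.75], random_state=0)
    X_res, y_res = NearMiss().fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))   # majority class cut down to a 50/50 split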

Our next model was Stochastic Gradient Descent (SGD), which descends upon the minimum of its classification loss function. SGD, as opposed to standard gradient descent, descends upon the minimum by shuffling through individual samples instead of taking step-wise descents over the full dataset. Much of the tuning in gradient descent is similar to logistic regression; however, penalization works more like it does in linear regression (with an alpha value that increases along with the applied penalization), and a loss function must be provided.

Through grid search cross-validation and an adjusted cut-off probability threshold, our optimal model achieved a maximum sensitivity of 72.9% with L1 regularization, virtually no penalty, and a 'log' loss, which the logistic form of SGD requires. This was about 6.8pp below our optimal logistic model's sensitivity.
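
A sketch with scikit-learn's SGDClassifier (the near-zero alpha mirrors "virtually no penalty"; note that newer scikit-learn versions name the logistic loss "log_loss" where older ones used "log"):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=2000, weights=[0.75], random_state=0)

    # Logistic loss makes predict_proba available, so the cut-off can be shifted too.
    sgd = SGDClassifier(loss="log_loss", penalty="l1", alpha=1e-6,
                        max_iter=1000, random_state=0).fit(X, y)
    proba = sgd.predict_proba(X)[:, 1]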

For tree-based classifier models, we first built a Random Forest Classifier, which grows an ensemble of decision trees on random subsets of features, splitting on the 'gini' criterion. After 5-fold grid search cross-validation, the optimal random forest model gave us 85.5% accuracy and 85.4% sensitivity, a 5.7pp sensitivity increase compared with our optimal logistic model.
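
A minimal scikit-learn sketch (the hyperparameter grid here is invented for illustration; ours came from the project's tuning):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=2000, weights=[0.75], random_state=0)

    rf = GridSearchCV(
        RandomForestClassifier(criterion="gini", random_state=0),
        param_grid={"n_estimators": [200, 500], "max_depth": [None, 10]},
        scoring="roc_auc",
        cv=5,
    ).fit(X, y)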

We then built a Gradient Boosting (GBoost) Classifier and an eXtreme Gradient Boosting (XGBoost) Classifier, which in general use decision trees as weak learners and sequentially fit new trees to the residuals to minimize the loss function. Similarly, we used 5-fold grid search cross-validation to identify our optimal hyperparameters.

The optimal models both used deviance as the loss function and Friedman Mean Square Error as the criterion. For our GBoost Classifier, the accuracy was 85.9% and the sensitivity was 85.8%, a 0.4pp increase compared with the Random Forest model. Our XGBoost Classifier gave us 85.8% accuracy and 86.1% sensitivity, the highest among all of our models.
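
A sketch of the two boosted models (deviance/log loss and Friedman MSE are scikit-learn's defaults for GradientBoostingClassifier; the xgboost package is assumed installed):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from xgboost import XGBClassifier   # assumes the xgboost package is available

    X, y = make_classification(n_samples=2000, weights=[0.75], random_state=0)

    gb  = GradientBoostingClassifier(criterion="friedman_mse", random_state=0).fit(X, y)
    xgb = XGBClassifier(eval_metric="logloss", random_state=0).fit(X, y)

    # Feature importances back the interpretation discussed below.
    print(xgb.feature_importances_)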

Feature Importance of XGBoost Model

The Feature Importance Score of XGBoost shows the five most important features are maximum value of glucose, maximum value of PTT (partial thromboplastin time), maximum value of urea nitrogen, maximum platelet counts, and the maximum value of PO2 (partial pressure of oxygen), which all corresponded to our medical research as extreme values that indicate CHF in a patient.

Risks

Given certain assumptions we made throughout our project, we are aware of potential risks in our model. If we were to turn this prediction engine into a product and use it as a real tool for doctors, we would perform additional research on each of the points below and possibly alter or split our model to take those results into account:

  • Imputing assumed healthy values for unknown blood measurements (logistic models)
  • People with โ€œextremeโ€ blood measurement values (pregnant women, anemia, etc.)
  • Compound conditions (i.e. diabetic and had a stroke but without CHF)
  • Legal pressure on doctors for misdiagnosis or malpractice
  • Age obscuration due to HIPAA compliance will inevitably affect the model's predictions

Conclusion

Our optimal performance model was XGBoost, with 85.8% accuracy and 86.1% sensitivity. An added benefit of XGBoost is its feature importance output, which both connects our work to medical research and is insightful in drawing out which attributes of a patient are most decisive in recognizing Congestive Heart Failure. Lastly, the decision-making methodology of all tree-based models tolerates missing data and our age assumptions, which helped us avoid some of the risks inherent in our data set.

One thing we lose with XGBoost is interpretability for non-data scientists, as the model's decision-making involves deeper statistical knowledge and randomized boosting methods not easily understood by most patients or medical staff. This may become an important risk to consider when rolling this product into hospitals and seeking legal compliance, as both parties would rightfully scrutinize such a product.

One success of the logistic models is their interpretability. The decision-making criteria for penalization and coefficients have a direct relationship to model results and feature importance, both of which also connect clearly back to medical research. A clear downside to logistic models is the imputation required and the assumptions made about an individual's unknown health status. Additionally, these models were neither as accurate nor as sensitive as our tree-based models.

Drawing insights from all models enables us to make an important connection between medical research and machine learning, confirming that the medical field is ripe for this sort of innovation. Pushing the boundaries of prediction can enable early detection of disease, which can help prevent or delay the onset of illness and extend life expectancy.

Future Work

In the future, we would like to enhance our project by taking some of the assumptions/risks laid out above into account. We would seek to build separate models, or add facets to our existing model, given the considerations below:

  • Childrenโ€™s models (age)
  • Elderly patients' models (age)
  • Holistic analysis (X-rays, electrocardiograms, chart events)
  • Other hospitals in the U.S. and abroad (models built on different assumptions)
  • Multiple-disease prediction engine

 

Thank you so much for reading about our capstone project! To see our project work and presentation, check out the GitHub repository for this project here.

About Authors

Eric Meyers

Eric is a data science leader with a passion for predictive modeling, machine learning, and developing healthy and impactful teams. Seeking a Data Science manager or principal role where thoughtful technical innovation is at the forefront of product...

Haoyun ZHANG


Maite Herrero Gorostiza

Maite is currently pursuing an MPA at Columbia while working at NYCEDC - the city's official economic development corporation. An advocate for the use of data science tools for public policy affairs, Maite has previously worked at CAF...
