Data Study on Cancer Genes and their Mutations

Posted on Feb 5, 2017

Introductory Biology

DNA carries the information an organism needs to function. It is composed of the nucleotides adenine (A), guanine (G), cytosine (C) and thymine (T), and the order of these nucleotides determines the structure and function of the proteins that make up our body.

Nucleotides are read in groups of three (ex: ATG), and each of these groups, called codons, corresponds to a specific amino acid. These amino acids are then linked together to form proteins. The genes in our body are portions of DNA that code for different proteins, and our body depends on functional, balanced proteins for homeostasis. When a protein functions incorrectly it can lead to a variety of problems, especially if that protein plays a crucial role in the cell. These problems can result from mutations.


A mutation is a change in the nucleotide sequence of a gene, and there are many different kinds. A substitution mutation replaces nucleotides with other nucleotides, for instance changing the sequence ACG to ATG. This can change the amino acid sequence of a protein and therefore its function.

Insertion and deletion mutations are intuitive: they involve the insertion or deletion of nucleotides from a sequence. A frameshift mutation involves a change in reading frame. Since nucleotides are read in groups of three, adding or deleting a number of nucleotides that is not a multiple of three shifts every group after it. The sequence ACG TGC ACA becomes CGT GCA CA if we remove the first nucleotide, altering not just one amino acid but every one further down the line.
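The reading-frame idea can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `codons` is made up for this example and is not part of the original project.

```python
def codons(sequence):
    """Split a nucleotide string into consecutive groups of three."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]


original = "ACGTGCACA"
shifted = original[1:]  # delete the first nucleotide

print(codons(original))
print(codons(shifted))
```

Every group in the second list differs from the first, which is why a single deleted nucleotide can change the whole downstream protein.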

Some mutations are harmless, and the protein still functions normally. However, certain mutations lead to harmful changes in a protein. If that protein is involved in cell growth or proliferation, the result can be cancer. Depending on the type of mutation and its location, a variety of cancers can result. But how do we view the relationships between these variables?

Data and Cleaning

I found my dataset in a comments section on Kaggle. It contained over a million different mutations, along with the name of the gene, location, nucleotides, protein alterations, and the corresponding types of cancer. I wanted to see the relationships between all of these variables, but that was difficult given the notation and the fact that they were mostly categorical variables. I had to separate the nucleotide changes from the location, along with other text such as 'del' or 'ins' designating that nucleotides were deleted or inserted at a given location.

Finally, using a significant amount of tidyr, I had columns containing the nucleotides deleted, inserted and changed, along with a column for location.

Cleaning examples:
"C984T" ----> "C>T" and "984"
"643_insT" ----> "T" and "643"
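The original cleaning was done with tidyr in R and is not shown here. As a rough Python sketch of the same idea, the two notations above could be split with regular expressions. The patterns are assumptions inferred from the two examples only (the 'del' form is assumed to mirror 'ins'), and `parse_mutation` is a hypothetical helper name.

```python
import re

# Assumed patterns, inferred from the two examples in the text.
SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGT])$")   # e.g. C984T
INDEL = re.compile(r"^(\d+)_(ins|del)([ACGT]+)$")       # e.g. 643_insT


def parse_mutation(notation):
    """Split a mutation string into kind, nucleotide change, and location.

    Returns None when the string matches neither assumed pattern.
    """
    match = SUBSTITUTION.match(notation)
    if match:
        ref, pos, alt = match.groups()
        return {"kind": "sub", "change": f"{ref}>{alt}", "location": int(pos)}
    match = INDEL.match(notation)
    if match:
        pos, kind, bases = match.groups()
        return {"kind": kind, "change": bases, "location": int(pos)}
    return None


print(parse_mutation("C984T"))
print(parse_mutation("643_insT"))
```

Returning None for unrecognized strings makes it easy to count how many rows the assumed patterns fail to cover before trusting the cleaned columns.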

Mutation Types and Cancers

Using a simple shaded histogram, we can view how mutation type relates to different cancers, and how different genes relate to mutation types and cancers. Since there were over 30,000 unique genes, I wanted to focus on only one at a time, the idea being that a scientist could view the genes they're researching. By selecting a gene name on the left, you can view the number of mutations it has for each cancer.
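The counts behind such a chart amount to filtering on one gene and tallying by cancer and mutation type. A minimal Python sketch follows; the rows, gene names, and the helper `counts_for_gene` are invented for illustration and do not come from the dataset or the app's code.

```python
from collections import Counter

# Toy rows of (gene, mutation kind, cancer type); invented for illustration.
rows = [
    ("GENE_A", "sub", "lung"),
    ("GENE_A", "del", "lung"),
    ("GENE_A", "sub", "breast"),
    ("GENE_B", "ins", "lung"),
]


def counts_for_gene(rows, gene):
    """Count (cancer, mutation kind) pairs for a single gene."""
    return Counter((cancer, kind) for g, kind, cancer in rows if g == gene)


gene_a = counts_for_gene(rows, "GENE_A")
for (cancer, kind), n in sorted(gene_a.items()):
    print(cancer, kind, n)
```

Each cancer becomes one bar, and the mutation kinds within it become the shaded segments.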


Gene Map

However, I also wanted a more visual representation of the data. I decided to simply draw a line spanning the distance between the minimum and maximum mutation locations. Then, by plotting the other mutation locations as ticks on this line, you could "see" the gene. By mapping the colors of the ticks to the specific nucleotides being altered, you can view common nucleotide changes.
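The layout step of such a map can be sketched in Python as scaling each location onto the span between the smallest and largest observed location. The function name, the color table, and the sample values below are all assumptions made for illustration, not the app's actual code.

```python
def tick_positions(locations):
    """Scale mutation locations to [0, 1] along the observed min-max span."""
    lo, hi = min(locations), max(locations)
    span = hi - lo
    if span == 0:
        # All mutations share one location, so place every tick at the start.
        return [0.0 for _ in locations]
    return [(loc - lo) / span for loc in locations]


# Hypothetical color per nucleotide change; unknown changes fall back to grey.
COLORS = {"C>T": "red", "G>A": "blue"}

# Toy (location, change) pairs, invented for illustration.
mutations = [(100, "C>T"), (150, "G>A"), (300, "A>G")]

positions = tick_positions([loc for loc, _ in mutations])
ticks = [(pos, COLORS.get(change, "grey"))
         for pos, (_, change) in zip(positions, mutations)]
print(ticks)
```

Because the line runs only from the lowest to the highest observed mutation, the first and last ticks always sit at its two ends.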

Below the map you can see a quick blurb about the gene/protein and its relevant pathway. I added these by hand for only two of the genes, as it would've been excessive to do it for hundreds or thousands. However, it gives a nice picture of my future goal for this app to be used as a research tool, even if it's beyond the scope of my current capability.

Additionally, using the selection on the left you can view mutations by nucleotide change, deletions or insertions.


DNA Repair Mechanisms

While the visuals are all well and good, I wanted to tie these back to research significance, if possible. Unfortunately, given limited time it was difficult, but if you go over to the 'DNA Repair Mechanisms' tab you can see an image summarizing certain mutations, their cause and how the body deals with them.


The idea was to create a resource for scientists or students to use in genetics research, and to create a link between clinical science and molecular science, two areas which do not always go hand in hand. Instead of telling one specific story, this app contains a myriad of them, and you can use it to find the one you're looking for. Genetics research and cancer are both large fields with a wide variety of information. My goal was to make that information more accessible and visual.

Future Goals

I would like to continue to add to this app. An obvious flaw was that I lacked gene lengths, so the 'gene map' is only as large as the difference between its maximum and minimum mutations. Additionally, the nucleotide information can be overwhelming, so a filter would make this information more accessible.

Overall, more clinical significance would be beneficial. If clinicians were actually to make use of this, I would have to append more medical information to each gene. If molecular biologists were to make use of it, I would want to add more information on the molecular mechanisms for each protein and gene. The scope of the project could feasibly be endless, but it would be worthwhile as another resource in the fight against cancer.




About Author

Kyle Gallatin

Kyle Gallatin graduated from Quinnipiac University with a biology degree in 2015. Following, he continued on for his Master's in Molecular and Cellular Biology, received in 2016. Cultivating high level skills in data science through his analytical work...