Cancer Genes and their Mutations

Kyle Gallatin
Posted on Feb 5, 2017

Introductory Biology

DNA contains the information for a successfully functioning organism. It is composed of the nucleotides adenine (A), guanine (G), cytosine (C) and thymine (T), and the order of these nucleotides determines the structure and function of the proteins that make up our body. Nucleotides are read in groups of three (e.g., ATG) called codons, and each codon corresponds to a specific amino acid (or a stop signal). These amino acids are then linked together to form proteins. The genes in our body are portions of DNA that code for different proteins, and our body depends on functional, balanced proteins for homeostasis. When proteins function incorrectly, it can lead to a variety of problems, especially if the protein plays a crucial role in a cell. These problems can result from mutations.
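As a rough illustration of the codon reading described above, the Python sketch below splits a DNA sequence into groups of three and looks each group up in a small table. The function names and the six-entry table are illustrative choices and not part of the original app, which was built in R. The amino acid assignments follow the standard genetic code.

```python
# Partial codon table (standard genetic code, written as DNA codons).
# Only the few codons that appear in this post are included.
CODON_TABLE = {
    "ATG": "Met",
    "ACG": "Thr",
    "ACA": "Thr",
    "TGC": "Cys",
    "CGT": "Arg",
    "GCA": "Ala",
}


def split_codons(seq):
    """Split a DNA string into consecutive groups of three nucleotides."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(seq):
    """Map each complete codon to an amino acid; '?' marks codons missing from the table."""
    return [CODON_TABLE.get(codon, "?") for codon in split_codons(seq) if len(codon) == 3]


print(split_codons("ATGACGTGC"))  # ['ATG', 'ACG', 'TGC']
print(translate("ATGACGTGC"))     # ['Met', 'Thr', 'Cys']
```

A full table would have 64 entries. The short one here is only meant to make the "groups of three" idea concrete.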

A mutation is a change in the nucleotide sequence of a gene, and there are many different kinds. A substitution mutation replaces one or more nucleotides with other nucleotides, for instance changing the sequence ACG to ATG. This can change the amino acid sequence of a protein and therefore its function. Insertion and deletion mutations are intuitive: they involve the insertion or deletion of nucleotides in a sequence. A frameshift mutation involves a change in reading frame. Since nucleotides are read in groups of three, adding or deleting a number of nucleotides that is not a multiple of three shifts every group that follows. The sequence ACG TGC ACA becomes CGT GCA CA if we remove the first nucleotide, altering not just one amino acid but every one further down the line.
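To make the difference between a substitution and a frameshift concrete, here is a minimal Python sketch that applies both to the example sequence and prints the resulting groups of three. The helper names (`substitute`, `delete`, `split_codons`) are hypothetical and exist only for this illustration.

```python
def split_codons(seq):
    """Split a DNA string into consecutive groups of three nucleotides."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def substitute(seq, index, new_base):
    """Return seq with the nucleotide at 0-based index replaced by new_base."""
    return seq[:index] + new_base + seq[index + 1:]


def delete(seq, index, length=1):
    """Return seq with `length` nucleotides removed starting at 0-based index."""
    return seq[:index] + seq[index + length:]


original = "ACGTGCACA"

# Substitution: only the first group changes (ACG -> ATG).
print(split_codons(substitute(original, 1, "T")))  # ['ATG', 'TGC', 'ACA']

# Deleting the first nucleotide shifts the reading frame for every group.
print(split_codons(delete(original, 0)))           # ['CGT', 'GCA', 'CA']
```

Note that the substitution leaves the downstream groups untouched, while the single deletion changes all of them and leaves an incomplete group at the end.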

Some mutations are harmless, and the protein can still function normally. However, certain mutations lead to harmful changes in a protein. If that protein is involved in cell growth or proliferation, the result can be cancer. Depending on the type of mutation and its location, a variety of cancers can result. But how do we view the relationships between these variables?

Data and Cleaning

https://www.kaggle.com/c/introducing-kaggle-scripts/forums/t/15139/is-there-a-way-of-adding-data?forumMessageId=83956#post83956

I found my dataset in a comments section on Kaggle. It contained over a million different mutations, along with the name of the gene, the location, the nucleotides, the protein alterations, and the corresponding types of cancer. I wanted to see the relationships between all of these variables, but that was difficult given the notation and the fact that they were mostly categorical variables. I had to separate the nucleotide changes from the location, along with other text such as 'del' or 'ins' designating that nucleotides were deleted or inserted at a given location. Finally, after a significant amount of tidyr, I had columns containing the nucleotides deleted, inserted and changed, along with a column for the location.

Cleaning examples:
"C984T" ----> "C>T" and "984"
"643_insT" ----> "T" and "643"
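The original cleaning was done with tidyr in R. As a stand-in, the Python sketch below shows one way the two notations above could be split with regular expressions. The patterns, the `parse_mutation` name, and the returned field names are assumptions made for illustration. In particular, the deletion form (e.g. "643_delT") is assumed to mirror the insertion form, since the post does not show one.

```python
import re

# Assumed formats, based only on the two examples in the post:
#   substitution: <ref base><location><new base>, e.g. "C984T"
#   indel:        <location>_<ins|del><bases>,    e.g. "643_insT"
SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGT])$")
INDEL = re.compile(r"^(\d+)_(ins|del)([ACGT]+)$")


def parse_mutation(code):
    """Split a mutation string into its type, nucleotide change, and location."""
    match = SUBSTITUTION.match(code)
    if match:
        ref, location, alt = match.groups()
        return {"type": "sub", "change": f"{ref}>{alt}", "location": int(location)}
    match = INDEL.match(code)
    if match:
        location, kind, bases = match.groups()
        return {"type": kind, "change": bases, "location": int(location)}
    return None  # notation not covered by this sketch


print(parse_mutation("C984T"))     # {'type': 'sub', 'change': 'C>T', 'location': 984}
print(parse_mutation("643_insT"))  # {'type': 'ins', 'change': 'T', 'location': 643}
```

Returning `None` for unrecognized strings makes it easy to count how many rows a given pattern fails to cover, which matters with over a million records in mixed notation.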

Mutation Types and Cancers

Using a simple shaded histogram, we can view how mutation type relates to different cancers, and how different genes relate to mutation types and cancers. Since there were over 30,000 unique genes, I wanted to focus on only one at a time, the idea being that a scientist could view the genes they're researching. By selecting a gene name on the left, you can view the number of mutations it has for each cancer.
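The counts behind a histogram like this amount to a filter followed by a tally. A minimal Python sketch is below. The records and gene names are made up purely for illustration and do not come from the dataset, and the function names are hypothetical.

```python
from collections import Counter

# Made-up records for illustration only: (gene, cancer type, mutation type).
records = [
    ("GENE_A", "lung", "substitution"),
    ("GENE_A", "lung", "deletion"),
    ("GENE_A", "breast", "substitution"),
    ("GENE_B", "lung", "insertion"),
]


def mutations_per_cancer(rows, gene):
    """Count mutations for one gene, grouped by cancer type (the bar heights)."""
    return Counter(cancer for g, cancer, _ in rows if g == gene)


def mutations_per_cancer_and_type(rows, gene):
    """Count mutations for one gene by (cancer type, mutation type), i.e. the shading."""
    return Counter((cancer, mutation) for g, cancer, mutation in rows if g == gene)


print(mutations_per_cancer(records, "GENE_A"))
print(mutations_per_cancer_and_type(records, "GENE_A"))
```

Filtering to a single gene first keeps the tally small even when the full table has more than a million rows.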

[Screenshot: histogram of mutation counts by cancer type for the selected gene]

However, I also wanted a more visual representation of the data. I decided to simply draw a line spanning the distance between the minimum and maximum mutation locations. Then, by plotting the other mutation locations as ticks on this line, you could "see" the gene. By mapping the colors of the ticks to the specific nucleotides being altered, you can view common nucleotide changes.

[Screenshot: gene map with mutation locations drawn as colored ticks]
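The placement of ticks on such a line can be sketched as a simple rescaling of each mutation location between the minimum and maximum. The Python below is an illustration under that assumption, not the app's actual plotting code. The color mapping and function names are hypothetical.

```python
# Hypothetical color mapping from nucleotide change to tick color.
COLORS = {"C>T": "red", "A>G": "blue"}


def tick_positions(locations):
    """Scale mutation locations to [0, 1] along a line from the min to the max location."""
    lo, hi = min(locations), max(locations)
    span = hi - lo
    if span == 0:
        # All mutations share one location, so every tick sits at the start.
        return [0.0 for _ in locations]
    return [(loc - lo) / span for loc in locations]


def gene_map(mutations):
    """Turn (location, change) pairs into (relative position, color) ticks."""
    positions = tick_positions([loc for loc, _ in mutations])
    return [(pos, COLORS.get(change, "gray"))
            for pos, (_, change) in zip(positions, mutations)]


print(tick_positions([100, 150, 300]))          # [0.0, 0.25, 1.0]
print(gene_map([(100, "C>T"), (300, "G>A")]))   # [(0.0, 'red'), (1.0, 'gray')]
```

Because the line runs only from the lowest to the highest observed mutation, the map shows the mutated span of the gene and not its true length.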

Below the map you can see a quick blurb about the gene/protein and its associated pathway. I added these by hand for only two of the genes, as it would've been excessive to do it for hundreds or thousands. However, it gives a nice picture of my future goal for this app to be used as a research tool, even if it's beyond the scope of my current capability.

Additionally, using the selection on the left you can view mutations by nucleotide change, deletions or insertions.

[Screenshot: gene map filtered by nucleotide change, deletion, or insertion]

While the visuals are all well and good, I wanted to tie them back to research significance where possible. Given limited time this was difficult, but if you go over to the 'DNA Repair Mechanisms' tab you can see an image summarizing certain mutations, their causes, and how the body deals with them.

[Screenshot: 'DNA Repair Mechanisms' tab]

The idea was to create a resource for scientists or students to use in genetics research, and to create a link between clinical science and molecular science, two areas which do not always go hand in hand. Instead of telling one specific story, this app contains a myriad of them, and you can use it to find the one you're looking for. Genetics research and cancer are both large fields with a wide variety of information, and my goal was to make that information more accessible and visual.

Future Goals

I would like to continue to add to this app. An obvious flaw is that I lacked gene lengths, so the 'gene map' is only as long as the difference between a gene's maximum and minimum mutation locations. Additionally, the nucleotide information can be overwhelming, so a filter would make it more accessible. Overall, more clinical significance would be beneficial. If clinicians were actually to make use of this, I would have to append more medical information to each gene. If molecular biologists were to make use of it, I would want to add more information on the molecular mechanisms for each protein and gene. The scope of the project could feasibly be endless, but it would be worthwhile as another resource in the fight against cancer.

 

https://gist.github.com/kylegallatin/012f67e623eac0cec0bd8dfd5ac406e6

https://gist.github.com/kylegallatin/ed452d053fc5b4ab206ab8de54bfb2f5

About Author

Kyle Gallatin

Kyle Gallatin graduated from Quinnipiac University with a biology degree in 2015. He then continued on for his Master's in Molecular and Cellular Biology, which he received in 2016. Cultivating high level skills in data science through his analytical work...