Data Study on Cancer Genes and their Mutations
DNA contains the information an organism needs to function. It is composed of the nucleotides adenine (A), guanine (G), cytosine (C) and thymine (T), and the order of those nucleotides determines the structure and function of the proteins that make up our body.
Nucleotides are read in groups of three (ex: ATG), and each of these groups, called a codon, corresponds to a specific amino acid. These amino acids are then linked together to form proteins. The genes in our body are portions of DNA that code for different proteins, and our body depends on functional, balanced proteins for homeostasis. When a protein functions incorrectly, it can lead to a variety of problems, especially if that protein plays a crucial role in a cell. These problems can result from mutations.
A mutation is a change in gene structure, that is, a change in nucleotides. There are many different kinds of mutations. A substitution mutation replaces nucleotides with other nucleotides.
For instance, the sequence ACG might change to ATG. This can change the amino acid sequence of a protein and therefore its function. Insertion and deletion mutations are intuitive: they involve inserting nucleotides into a sequence or deleting nucleotides from it. A frameshift mutation involves a change in reading frame. Since nucleotides are read in groups of three, adding or deleting nucleotides can alter an entire sequence. The sequence ACG TGC ACA becomes CGT GCA CA if we remove the first nucleotide, altering not just one amino acid but every one further down the line.
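To make the reading-frame idea concrete, here is a minimal Python sketch that splits a sequence into groups of three and looks each group up in a small table. The function names and the six-entry table are illustrative assumptions; the table holds only the standard genetic code entries needed for the sequences mentioned above, not the full code.

```python
# Partial codon table: only the codons used in this example (standard genetic code).
CODON_TABLE = {
    "ACG": "Thr", "ATG": "Met", "TGC": "Cys",
    "ACA": "Thr", "CGT": "Arg", "GCA": "Ala",
}

def codons(seq):
    """Split a DNA string into consecutive groups of three.
    A trailing group shorter than three nucleotides is kept as-is."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def translate(seq):
    """Map each complete codon to an amino acid; unknown codons become '?'."""
    return [CODON_TABLE.get(c, "?") for c in codons(seq) if len(c) == 3]

original = "ACGTGCACA"
shifted = original[1:]  # delete the first nucleotide

print(codons(original), translate(original))
print(codons(shifted), translate(shifted))
```

Deleting a single nucleotide changes every group after it, which is why a frameshift tends to be more disruptive than a single substitution.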
Some mutations are harmless, and the protein can still function normally. However, certain mutations lead to harmful changes in a protein. If that protein is involved in cell growth or proliferation, the result can be cancer. Depending on the type of mutation and its location, a variety of cancers can result. But how do we view the relationships between these variables?
Data and Cleaning
I found my dataset in a comments section on Kaggle. It contained over a million different mutations, along with the name of the gene, location, nucleotides, protein alterations, and the corresponding types of cancer. I wanted to see the relationships between all of these variables, but that was difficult given the notation and the fact that most of the variables were categorical. I had to separate the nucleotide changes from the location, along with other text such as 'del' or 'ins' designating that nucleotides were deleted or inserted at a given location.
Finally, after a significant amount of tidyr work, I had columns containing the nucleotides deleted, inserted and changed, along with a column for location.
Cleaning examples: "C984T" ----> "C>T" and "984" | "643_insT" ----> "T" and "643"
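The original cleaning was done with tidyr in R, and that code is not shown here. As a rough illustration of the same idea, here is a Python sketch using regular expressions. The two patterns are assumptions inferred from the two examples above (the 'del' form is assumed by analogy with 'ins'); the real dataset may well contain notations this does not cover.

```python
import re

# Assumed notation, inferred from the examples "C984T" and "643_insT".
SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGT])$")
INDEL = re.compile(r"^(\d+)_(ins|del)([ACGT]+)$")

def parse_mutation(text):
    """Split a mutation string into kind, nucleotide change and location.
    Returns None when the string matches neither assumed pattern."""
    m = SUBSTITUTION.match(text)
    if m:
        ref, loc, alt = m.groups()
        return {"kind": "sub", "change": f"{ref}>{alt}", "location": int(loc)}
    m = INDEL.match(text)
    if m:
        loc, kind, bases = m.groups()
        return {"kind": kind, "change": bases, "location": int(loc)}
    return None

print(parse_mutation("C984T"))
print(parse_mutation("643_insT"))
```

Returning None for unrecognized strings makes it easy to count how many rows the patterns miss before trusting the cleaned columns.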
Mutation Types and Cancers
Using a simple shaded histogram, we can view how mutation type is associated with different cancers, and how different genes are associated with mutation types and cancers. Since there were over 30,000 unique genes, I wanted to focus on only one at a time, the idea being that a scientist could view the genes they're researching. By selecting a gene name on the left, you can view the number of mutations it has for each cancer.
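The counting behind a chart like this can be sketched in a few lines of Python. This is not the app's code: the record layout, the field names and the placeholder rows below are all hypothetical, chosen only to show the filter-then-count step.

```python
from collections import Counter

# Placeholder rows for illustration only; not taken from the real dataset.
records = [
    {"gene": "GENE_A", "cancer": "cancer_1", "mutation_type": "substitution"},
    {"gene": "GENE_A", "cancer": "cancer_1", "mutation_type": "deletion"},
    {"gene": "GENE_A", "cancer": "cancer_2", "mutation_type": "substitution"},
    {"gene": "GENE_B", "cancer": "cancer_1", "mutation_type": "insertion"},
]

def counts_for_gene(rows, gene):
    """Count mutations for one gene, keyed by (cancer, mutation type).
    Each key corresponds to one shaded segment of one bar."""
    return Counter(
        (r["cancer"], r["mutation_type"]) for r in rows if r["gene"] == gene
    )

counts = counts_for_gene(records, "GENE_A")
for (cancer, mutation_type), n in sorted(counts.items()):
    print(cancer, mutation_type, n)
```

Filtering to a single gene first keeps the chart readable, which matters when the full dataset has tens of thousands of genes.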
However, I also wanted a more visual representation of the data. I decided to simply draw a line spanning the distance between the minimum and maximum mutation locations. Then, by plotting the other mutation locations as ticks on this line, you could "see" the gene. By mapping the colors of the ticks to the specific nucleotides being altered, you can view common nucleotide changes.
Below the map you can see a quick blurb about the gene/protein and its relevant pathway. I added these by hand for only two of the genes, as it would have been excessive to do it for hundreds or thousands. However, it gives a nice picture of my future goal for this app to be used as a research tool, even if it's beyond the scope of my current capability.
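The geometry of this map is easy to sketch. The Python below is an illustration rather than the app's code: it rescales each mutation location to a position between 0 and 1 along the line and attaches a color based on the nucleotide change. The color palette and function name are hypothetical.

```python
# Hypothetical palette; any mapping from nucleotide change to color would do.
CHANGE_COLORS = {"C>T": "red", "A>G": "blue", "G>A": "green"}

def gene_map_ticks(mutations):
    """Turn (location, change) pairs into (position, color) ticks.
    Positions run from 0.0 at the smallest location to 1.0 at the largest."""
    locations = [loc for loc, _ in mutations]
    lo, hi = min(locations), max(locations)
    span = hi - lo
    ticks = []
    for loc, change in mutations:
        # If every mutation shares one location, put all ticks at the start.
        position = (loc - lo) / span if span else 0.0
        ticks.append((position, CHANGE_COLORS.get(change, "gray")))
    return ticks

print(gene_map_ticks([(100, "C>T"), (150, "A>G"), (200, "C>T")]))
```

Because the line is anchored to the observed minimum and maximum, it shows only the mutated stretch of the gene, not its true length.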
Additionally, using the selection on the left you can view mutations by nucleotide change, deletions or insertions.
DNA Repair Mechanisms
While the visuals are all well and good, I wanted to tie them back to research significance where possible. Unfortunately, that was difficult given limited time, but if you go over to the 'DNA Repair Mechanisms' tab you can see an image summarizing certain mutations, their causes and how the body deals with them.
The idea was to create a resource for scientists or students to use in genetics research, and to create a link between clinical science and molecular science, two areas which do not always go hand in hand. Instead of telling one specific story, this app contains a myriad of them, and you can use it to find the one you're looking for. Genetics research and cancer are both large fields with a wide variety of information. My goal was to make that information more accessible and visual.
I would like to continue to add to this app. An obvious flaw was that I lacked gene lengths, so the 'gene map' is only as long as the distance between a gene's minimum and maximum mutation locations. Additionally, the nucleotide information can be overwhelming, so a filter would make it more accessible.
Overall, more clinical significance would be beneficial. If clinicians were actually to make use of this, I would have to append more medical information to each gene. If molecular biologists were to make use of it, I would want to add more information on the molecular mechanisms for each protein and gene. The scope of the project could feasibly be endless, but it would be worthwhile to provide another resource in the fight against cancer.