Data in Shiny Parsing and Visualizing Transcriptomics

Posted on Apr 28, 2019

Project GitHub | LinkedIn:   Niki   Moritz   Hao-Wei   Matthew   Oren

The skills we demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.


The genome represents the history book and instruction set of all organisms on Earth. Over the last century, genetic data analysis has provided deep insight into the underpinnings of biological processes and the molecular etiology of disease. A crucial principle to understanding how genes function is that defects can arise due to mutation of the genes themselves, but also in the with the way that they are regulated and the amount of gene products they produce.

Messenger RNA (mRNA) copies of DNA genes are copies in a process called "transcription". These mRNA are then used in turn as templates to produce proteins in the cell in a process called "translation". The number of copies of the mRNA copies transcribed from genes and the number of copies of protein translated from mRNA are referred to as the degree to which these molecules are "expressed".

Figure 1. Brief discussion of gene expression

Expression of genes dynamically changes in response to cues from the environment and these changes coordinate cellular biochemistry, signaling and behavior. Aberrant gene expression changes are also associated with pathophysiology of many disease, for example numerous cancers, in which key genes are expressed at too high or too low of a level, which decouples certain pathways and behaviors from the normal function of the cell.

The study of transcriptional changes, known as transcriptomics, is an important component of the business model of biotech and pharmaceutical companies, often applied to the areas of synthetic biology or drug target discovery. The data analyzed in this project come from a next generation sequencing (NGS) technique called RNA sequencing (RNA-seq), which sequences and quantifies the RNA molecules in a cell.

Each of the samples analyzed in this set are from yeast, which is a very common model organism used for genetic research. Yeast are simple and inexpensive to work with, are eukaryotic cells (like the cells in our body), and have fewer genes compared to humans (a little over 6000 compared to our >20,000). Although they seem humble and simple, yeast genetics has led to the discovery of numerous crucial genetic pathways in our own cells, such as the molecules responsible for controlling and mediating cell division.

Link to the dataset on Kaggle
Github source code
App link

Data & Methods

Dataset Overview
This dataset consisted of several CSV files describing the experimental conditions of each sample, some information about the genes, and a table of RNA-seq reads (normalized in reads of the specific gene per million reads of mRNA) for each gene in each sample.

The total list of experimental condition in the sample set were extremely varied with many of them having < 3 replicate samples for the condition. To simplify presentability to the user and increase individual group data, only 4 comparisons were evaluated in this analysis:

Experimental Conditions:
• Yeast samples grown at high temperature (37 C) vs ideal growth conditions (30 C)
• Yeast samples grown at lower temperature (30 C) vs ideal growth conditions (15 C)
• Yeast samples growth with either glucose or ethanol as their carbon source
• Wildtype yeast samples or strains optimized for biofuel production

The other CSV files for intracellular localization and additional information about the genes were incomplete. This information will be filled in with web scraping of the SGD database ( at a later time and this information added to the analysis.

Data Cleaning
The columns for expression level were read in as strings and needed to be converted to numeric. Experimental conditions were converted to factors to provide categorical levels for grouping analysis.

For each data comparison below, the mean of each gene was calculated for the replica samples (experimental and control) and a 2-sample t-test for the comparison groups was carried out for the p-value of significant differences between of their distributions. Filtering, preprocessing and calculation of these data subsets were carried out ahead of time and provide for the webpage, because calculations and loading of larger datasets were initially impacting site performance. When users select the data, only the relevant expression and pre-calculated expression values are loaded for the experimental comparison of interest.

Volcano Plot

This data visualization was chosen because it is a common method of displaying gene expression data. The reason for this is because ratios of higher expression of genes (experimental/control) can range from 1 to no upper bound, whereas lower levels of comparable gene expression range from 0 to 1. This gives a skewed impression of expression change in a plot of raw expression ratios and undue weight to genes that have higher expression in the sample.

The log2(experimental) - log2(control) transformation of the data provides a more symmetrical and even-weighted display of the change in genes of both higher and lower expression level. Statistical significance (t-test: p-value) are plotted as -log10(p-value), so higher values on the y-axis reflect a higher degree of significance of that gene expression. The user may select the threshold for significance of gene expression difference (fold-change of expression and the p-value for t-test comparison of experimental variable vs control), and the genes are dynamically displayed in red on the plot.

Clustering Analysis and Heatmap

The gene names from the Gene List selected by user-defined thresholds were used to filter the raw dataset to collect the expression data from each replicate for the significantly expressed genes. An expression matrix of experimental replicas were applied to hierarchical clustering and visualization by an interactive heatmap. This visualization is useful to assess the consistency of gene expression between experimental conditions and whether the primary source of variance is the experimental conditions themselves or between experimental repeats or individual strains used for the samples.

Gene List

This tab shows a panel with a list of all of the significant genes that match the user threshold criteria along with their expression values for both experimental conditions, the fold gene expression change, and p-value. Additional information about these genes will be implemented in the future. The full list of these genes may be downloaded by the user with a click of a button in the side panel.

How to Use

The user may select 4 different datasets in which RNA expression from yeast grown in a comparison condition is compared against samples from control  growth conditions. Slider bars on the left side panel allow the user to select the threshold cutoffs for significantly different genes of interest. The Gene Expression tab displays a volcano plot of mean gene fold change (experiment vs control) against statistical significance of difference of the means (p-values). Genes of interest matching the thresholds set by the user are displayed in red.

The Clustering Analysis tab, displays a heat map of all user selected genes with gene expression from each experimental sample from the 2 comparison conditions. The margins of the heat map also have a dendrogram showing hierarchical clustering of the samples based on their gene expression profiles. This enables the user to view trends in genes across samples and experimental conditions, as well as determine whether the comparison of variance between their sample replicates and experimental conditions.

In the Gene List tab lists all of the user selected genes along with their mean normalized expression level (experimental condition vs control), ratio of gene expression (experimental/control) and the t-test p-value of the difference between experimental conditions. This list may be downloaded with the click of a button in the user side panel.


Using the app, I collected the genes with a gene expression change of ≥ 2-fold and p ≤ 0.02 for each of the experimental conditions featured in the app. I manually browsed the SGD database for these significant genes to get an impression of the biological processes, pathways and functions that maybe affected in these conditions (summarized below).

Future Directions

This project was also chosen because it will be easily extensible with the webscraping and machine learning functionality. Web scraping will be used to collect additional information on each gene from publicly available genetic repositories, such as standard name, description, functional/pathway information, gene ontology terms, and sequence data. Machine learning methods will be used on these additional data to glean further biological insights into the mechanism of genetic networks under these experimental conditions.

About Author

Ryan Willett

Ryan completed the NYCDSA program in June 2019. He holds a PhD in Pharmacology and Molecular Signaling from Columbia University, and BS and BA in Biology and Biochemistry, respectively, from Brandeis University. After a postdoctoral research fellowship in...
View all posts by Ryan Willett >

Related Articles

Leave a Comment

No comments found.

View Posts by Categories

Our Recent Popular Posts

View Posts by Tags

#python #trainwithnycdsa 2019 2020 Revenue 3-points agriculture air quality airbnb airline alcohol Alex Baransky algorithm alumni Alumni Interview Alumni Reviews Alumni Spotlight alumni story Alumnus ames dataset ames housing dataset apartment rent API Application artist aws bank loans beautiful soup Best Bootcamp Best Data Science 2019 Best Data Science Bootcamp Best Data Science Bootcamp 2020 Best Ranked Big Data Book Launch Book-Signing bootcamp Bootcamp Alumni Bootcamp Prep boston safety Bundles cake recipe California Cancer Research capstone car price Career Career Day citibike classic cars classpass clustering Coding Course Demo Course Report covid 19 credit credit card crime frequency crops D3.js data data analysis Data Analyst data analytics data for tripadvisor reviews data science Data Science Academy Data Science Bootcamp Data science jobs Data Science Reviews Data Scientist Data Scientist Jobs data visualization database Deep Learning Demo Day Discount disney dplyr drug data e-commerce economy employee employee burnout employer networking environment feature engineering Finance Financial Data Science fitness studio Flask flight delay gbm Get Hired ggplot2 googleVis H20 Hadoop hallmark holiday movie happiness healthcare frauds higgs boson Hiring hiring partner events Hiring Partners hotels housing housing data housing predictions housing price hy-vee Income Industry Experts Injuries Instructor Blog Instructor Interview insurance italki Job Job Placement Jobs Jon Krohn JP Morgan Chase Kaggle Kickstarter las vegas airport lasso regression Lead Data Scienctist Lead Data Scientist leaflet league linear regression Logistic Regression machine learning Maps market matplotlib Medical Research Meet the team meetup methal health miami beach movie music Napoli NBA netflix Networking neural network Neural networks New Courses NHL nlp NYC NYC Data Science nyc data science academy NYC Open Data nyc property NYCDSA NYCDSA Alumni Online Online Bootcamp Online Training Open Data painter pandas Part-time performance phoenix pollutants Portfolio Development precision measurement prediction Prework Programming public safety PwC python Python Data Analysis python machine learning python scrapy python web scraping python webscraping Python Workshop R R Data Analysis R language R Programming R Shiny r studio R Visualization R Workshop R-bloggers random forest Ranking recommendation recommendation system regression Remote remote data science bootcamp Scrapy scrapy visualization seaborn seafood type Selenium sentiment analysis sentiment classification Shiny Shiny Dashboard Spark Special Special Summer Sports statistics streaming Student Interview Student Showcase SVM Switchup Tableau teachers team team performance TensorFlow Testimonial tf-idf Top Data Science Bootcamp Top manufacturing companies Transfers tweets twitter videos visualization wallstreet wallstreetbets web scraping Weekend Course What to expect whiskey whiskeyadvocate wildfire word cloud word2vec XGBoost yelp youtube trending ZORI