
On The Rise of Data Science Startups

Joseph Lee and Avi Yashchin
Posted on Nov 24, 2015

Contributed by Avi Yashchin and Joseph Lee. They took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on their second class project, due in the 4th week of the program.

Collaborators:

Joseph Lee

Avi Yashchin

 

Introduction

Our project centers on the development of an open-source workbench that provides data scientists with automated tools for exploratory analysis and model selection.  The full stack is built in R, a statistical programming language.  Before getting into the low-level details, let's take a step back and think about the trending term "Data Science."

 

Part 1:

The Startup Phenomenon:  Making the Next Netflix of Machine Learning, Uber of Data Modeling, or Chipotle of Data?

It seems that everywhere you turn these days there's someone starting a "Data Science" company.  Are you a PhD dropout from Berkeley? Start a Data Science company. Are you a programmer who knows how to use MongoDB? Start a Data Science company.  Did you study English at Yale? That's right: Data Science.  The good people at cbinsights.com made a chart of venture investment in AI over the last 5 years:

 

[Chart: venture capital investment in AI over the last five years, via cbinsights.com.]

 

 

What gives, and why now?

  1. Better Hardware: The advance of Moore's Law has radically reduced memory, networking, and data storage costs.
  2. Better Software: Open-source tools let you train models of any flavor for free.  The statistics packages that came before required expensive contracts with private companies.
  3. Better Algorithms: Feature engineering used to be a large part of the data science experience.  New algorithms can learn the most predictive features automatically, without hand-built transformations or assumptions about the underlying distribution of the data.
  4. Education and Training: You can learn Data Science in R and Python free on Coursera, and learn Big Data tools like Apache Spark free on edX.  Seriously.

There's been a lot of attention on data science platforms and workbenches that attempt to improve the data scientist's workflow, or to let non-data-scientists perform data science through an immersive user interface.  We're going to show you how to build your very own open-source machine learning workbench in R.  Please steal it.

The Hardware Perspective 

Everyone has seen some form of the chart below, with computer processing power rising exponentially.  While fast CPUs have driven many innovations, the inexpensive CPU is not the critical factor in Machine Learning.

CPU Prices:

[Chart: CPU prices over time.]

The collapse in prices of Hard Disk Space, Memory, and Network Capacity:

Hard Disk Space:

[Chart: hard drive cost over time.]

Memory Prices:

[Chart: memory prices by year, showing a clear negative trend between year and price. Source: http://www.slideshare.net/IPExpo/slideshare1-0950-pathfinder]

Network Prices:

[Chart: network capacity prices over time. Source: https://mentaleffort.wordpress.com/tag/technical-debt/]

It's not just CPUs that are dropping in price, but every part of the PC.  Distributed machine learning algorithms depend as much on memory, network speed, and (to a smaller degree) hard disk speed as they do on CPU speed.  It's the aggregation of multiple exponential trends that is democratizing access.  Many people think that, despite these price drops, tools like AWS are still "too expensive."  This couldn't be further from the truth, so let's explore AWS pricing.

Old Model: Buying "Big Iron"

  • Buy or rent mainframe computers from IBM, Unisys, BMC, etc.
  • Buy a large network of custom servers, and only use them for a few days a year while modeling. Try to sell the excess server time to, umm ... Pixar?  Maybe the Weather Channel.
  • Huge excess capacity and large upfront costs.

Modern Alternative: Renting from "The Cloud"

  • AWS Reserved instances cost less than 13 cents per hour.  Spot instances can be acquired for pennies per hour.
  • Train your models, then *shut down your servers*.
  • No excess capacity, no upfront costs.

 

Here's an AWS pricing list as of 11/20/15.  The critical factor here is the $0.126-per-hour pricing.

 

[Table: Amazon EC2 pricing list as of 11/20/15.]

 

Assuming that you live in the Northeastern corridor, California, or the Midwest, and assuming that your computer draws at least 1 kW, renting server space from Amazon is less expensive than just paying for the computer's electricity in your home state.  I live in NYC, pay my own electricity, and I was able to save money by moving my computation demand onto AWS.
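
To put rough numbers on that claim, here is a back-of-the-envelope comparison in R.  The electricity rate and power draw below are our assumptions, not measured figures:

aws_rate  <- 0.126  # $/hour, the reserved-instance price from the table above
elec_rate <- 0.20   # $/kWh, an assumed NYC residential electricity rate
draw_kw   <- 1.0    # assumed power draw of a home modeling rig, in kW

hours     <- 24 * 30                       # one month of round-the-clock training
aws_cost  <- aws_rate * hours              # ~$91 for the month
home_cost <- elec_rate * draw_kw * hours   # ~$144 for the month, electricity alone
c(aws = aws_cost, home_electricity = home_cost)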

[Chart: cost of home electricity vs. AWS compute. Source: http://goanuj.freeshell.org/e/index.html]

With AWS, you can get a server with up to 244 GB of main memory and up to 40 CPUs; hardware and computation time no longer limit your R-based analyses.  While hardware price reductions are nice, we will see that machine learning software prices have collapsed even further.

 

The Software Perspective

Open-source machine learning libraries have been a revolution for the machine learning community.  The once obscure and specialized topics of machine learning and statistical learning can now be leveraged by a much larger demographic, globally.  Machine learning workbenches have been implemented in both academia and industry.  The following is a short list of popular platforms.

Old Software Pricing:

SAS enterprise miner ($140,000 for the first year)

IBM SPSS Statistics ($51,000 per user)

Alteryx Server ($58,500 per user)

Dataiku DSS, etc. (~$10,000 per user per year)

You get the idea.

The worst part about the SAS/IBM/H2O prices is that much of the software that runs these machine learning libraries is open source to begin with.  These companies have a business model of taking freely available open-source tools, building a GUI on top of the system, and charging tens of thousands of dollars per year for support.

Weka's software design is centered on Java. Dataiku DSS uses primarily Java, Python, and a kernel design compatible with multiple languages, including R.  H2O is Java and R.  IBM's software is built on SPSS, an older statistical package whose syntax dates back to punch cards. Most of these expensive products have proprietary model formats and data-cleaning requirements, making interoperability and portability of code a near impossibility.

New Software Pricing:

Scikit-Learn - Python libraries for Machine Learning (Free)

Weka - Java libraries for Machine Learning (Free)

TensorFlow (Google) Open Source Machine Learning (Free)

FAIR (Facebook) Open Source Machine Learning (Free)

R, Python, Spark, Hadoop, Caret (Free)

The machine learning servers and tools that used to be the exclusive domain of hedge funds, Fortune 1000 companies, and large drug manufacturers are now accessible to anyone. The data science workbench that we built over a week is meant to illustrate how easy it is to duplicate the features of the more expensive institutional packages using completely free software.  Our app is only a few hundred lines of code, something an enterprise could maintain with relative ease.

 

The Algorithms Perspective

Our project is focused on a full-stack implementation in R, in order to explore not only its computational nuances in a data science setting, but also how R's UI capabilities can contribute to a positive workflow and user experience.

Old design paradigms for Machine Learning required a developer to learn many different modeling packages, usually written by different people, with inconsistencies in how models are specified and predictions are made.

In the past, each row of the table below was a completely different workflow.  Each model class had its own input format and tuning parameters, and running a single data set through multiple models used to require hundreds of lines of code.  However, there is an R-based open-source alternative: caret, which standardizes model tuning.  We've been calling it "the scikit-learn for R."  The caret syntax is a dream to work with, and anyone can create, tune, and compare the results from multiple models with ease.  Caret is 100% free.

Model Class    Package    Native predict() Syntax
lda            MASS       predict(obj)  (no options needed)
glm            stats      predict(obj, type = "response")
gbm            gbm        predict(obj, type = "response", n.trees)
mda            mda        predict(obj, type = "posterior")
rpart          rpart      predict(obj, type = "prob")
Weka           RWeka      predict(obj, type = "probability")
LogitBoost     caTools    predict(obj, type = "raw", nIter)
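
To make the contrast concrete, here is a minimal sketch of caret's unified interface, using the built-in iris data (the two model choices are ours for illustration):

library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)

# the call shape is identical for two unrelated model classes...
fit_knn   <- train(Species ~ ., data = iris, method = "knn",   trControl = ctrl)
fit_rpart <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# ...and one predict() syntax replaces seven package-specific ones
predict(fit_knn,   newdata = head(iris))
predict(fit_rpart, newdata = head(iris))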

 

Caret Homepage

 

The Open Source Alternative - Shiny

Shiny is not the only new tool for visualization in R, but it is a fully functional web-app development package that can turn R code directly into an interactive page without any need to know JavaScript or HTML.  The Shiny package plays well with many other tools, including googleVis, Tableau, matplotlib, and bokeh (and a ton of others).  With R and Shiny, you can set up a web server and provide visualization tools to your BI teams in real time.  Did we mention this tool is free?   We're broke students, and our classroom is at WeWork, where we get free coffee and beer.  Shiny and R fit right in.
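
To give a flavor of how little code a working Shiny page needs, here is a self-contained hello-world sketch (not part of our workbench):

library(shiny)

# ui: a slider and a plot, with no HTML or JavaScript in sight
ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

# server: the histogram re-renders whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui, server)  # launches a local web server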

 

Part 2:

Our Data Science App

Let's now delve into the app we made.  As we mentioned before, we wanted to make an open-source application in keeping with the growth of data science startups.  Using a combination of Shiny, caret, and other great open-source tools, we made a fairly workable platform that can perform basic data analysis, preprocessing, modeling, and validation.  We spent only three days developing this app, so there are surely many bugs and glitches in the code.  Keep in mind that our main intention was to create a functional prototype that showcases a small fraction of the creative possibilities available to us through the open source community.

Brief Tutorial

We used RStudio as the main IDE for our app.  The R console works fine as well.

This blog post gives a general overview of our development process.  The code is available here if you wish to play around with it and learn more about our full stack.

To begin, we created two blank R files, ui.R and server.R, inside a new project directory (you can name the folder anything).  For the ui.R file, you will need to install and load the following packages.

require(shiny);
require(shinyIncubator);
library(shinydashboard);

In the server.R file, you will need to install and load the following packages as well.

require(shiny);
require(caret);
require(e1071);
require(randomForest);
require(nnet);
require(glmnet);
require(gbm);
library(mice);
library(VIM);
require(fastICA);
require(pastecs);
library(googleVis);
library("PASWR");
require("doMC")
source("helpers.R")

Once these packages are loaded and declared in their respective files, we can proceed with the UI phase.

For those new to Shiny and R, we recommend playing around with these tutorials as an introduction to Shiny, to gain a better understanding of the relationship between the ui.R and server.R files.  You can find the link to the main Shiny page here.

The UI & Server Code

In order to make a visually appealing and straightforward interface, we used the shinydashboard package.  The package has very straightforward documentation, which can be found here.  It ships with a large pool of high-quality icons, CSS themes, and other Bootstrap-quality elements.  We recommend following the shinydashboard tutorials and then cross-referencing your learning with our code to get the most out of this blog post.

The server.R code was trickier than the UI for a couple of reasons; reactive Shiny features are a must for creating an interactive Shiny app.  The main Shiny blog does a fantastic job explaining dynamic and reactive scripting in R, so we will leave the full explanation to them at this link.  Essentially, reactive functions in Shiny are smaller "pseudo-functions" that automatically receive user input when the user interacts with widgets such as a check-box or slider.  We wanted reactive functionality so that users could customize their tuning parameters for the modeling part of the app.
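
Here is a stripped-down sketch of that idea: a reactive() expression acts as a cached pseudo-function, and every render block that calls it re-executes when the widgets it reads change:

library(shiny)

ui <- fluidPage(
  sliderInput("n", "How many points", min = 1, max = 100, value = 10),
  checkboxInput("flip", "Negate the values"),
  verbatimTextOutput("summary")
)

server <- function(input, output) {
  # a reactive "pseudo-function": re-evaluated only when input$n or input$flip changes
  vals <- reactive({
    x <- seq_len(input$n)
    if (input$flip) -x else x
  })
  # every consumer of vals() re-renders automatically when it invalidates
  output$summary <- renderPrint(summary(vals()))
}

shinyApp(ui, server)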

Due to the size of the project, we won't go into the details of the code here.  However, if enough requests are made, we may consider creating a tutorial post that goes more in depth into the development of our app.  You will find the main files below.  Again, feel free to visit our GitHub if you wish to play around with our app and source code.

UI.R

require(shiny);
require(shinyIncubator);
library(shinydashboard);
require(pastecs);

shinyUI(dashboardPage(
  skin = "blue",

  dashboardHeader(title = "ML Explorer Beta 1.0"),
  dashboardSidebar(
    sidebarMenu(
      sidebarSearchForm(textId = "searchText", buttonId = "searchButton",
                    label = "Search..."),
      menuItem("Summary", tabName = "summary", icon = icon("dashboard")),
      #menuItem("Upload", tabName = "dataupload", icon = icon("upload")),
      #menuItem("Database", tabName = "database", icon = icon("database")),
      menuItem("Data Preparation", tabName = "datapreparation", icon = icon("wrench")),
      menuItem("Analysis", tabName = "analysis", icon = icon("cogs"),
        menuSubItem("Train & Validation",icon = icon("cog"), tabName = "trainvalidation"),
        menuSubItem("Features",icon = icon("cog"), tabName = "features"),
        menuSubItem("Algorithm",icon = icon("cog"), tabName = "algorithm")
        ),

      menuItem("Results", tabName = "results", icon = icon("dashboard")),
      menuItem("About", tabName = "about", icon = icon("info")),
      menuItem("Code", tabName = "code", icon = icon("code"))
    )

    ),
  dashboardBody(

    tabItems(
      # First tab content
      tabItem(tabName = "summary", 
        fluidRow(

          box(title = "How To Use", status = "primary", solidHeader = TRUE,
            collapsible = TRUE, width = 8,
            h4("Step 1: Upload Dataset"),
            h5("Ideally any csv file is useable.  It is recommended to perform cleaning and munging methods prior to the upload though. We intend to apply data munging/cleaning methods in this app in the near future."),
            h4("Step 2: Analyze Data"),
            h5("Current version allows the user to perform basic missing analysis."),
            h4("Step 3: Choose Pre-processing Methods"),
            h5("Basic K-Cross Validation Methods are applicable. "),
            h4("Step 4: Choose Model"),
            h5("Choose from a selection of machine learning models to run.  Selected parameters for each corresponding model are available to tune and manipulate."),
            h4("Step 5: Run Application"),
            h5("Once the model(s) have been executed, the results for each model can be viewed in the results tab for analysis."),
            imageOutput("image2"))),
        fluidRow(
          box(title = "Libraries/Dependencies",status = "primary", solidHeader = TRUE,
            collapsible = TRUE, width = 8,
            h4("- The caret package was used for the backend machine learning algorithms."),
            h4("- Shiny Dashboard was used for the front end development."),
            h4("- The application is compatiable with AWS for server usage.")))),

      ######################################
      # Data Preparation Tab Contents
      ######################################

      # Second tab content
      tabItem(tabName = "datapreparation",
          fluidPage(
            tabBox(
             id = "datapreptab", 
          

            tabPanel(h4("Data"), 

              fileInput('rawInputFile','Upload Data File',accept=c('text/csv', 'text/comma-separated-values,text/plain', '.csv')),
                            uiOutput("labelSelectUI"),
                            checkboxInput('headerUI','Header',TRUE),
                            radioButtons('sepUI','Separator',c(Comma=',',Semicolon=';',Tab='\t'),'Comma'),
                            radioButtons('quoteUI','Quote',c(None='','Double Quote'='"','Single Quote'="'"),'Double Quote')),

            tabPanel(h4("Data Analysis"), verbatimTextOutput("textmissing"), dataTableOutput("colmissing")),

            tabPanel(h4("View Data"), dataTableOutput("pre.data"))),
            infoBoxOutput("missingBox"))),

      ##################################################################################
      ####   Training/Splitting Tab Set Contents
      ##################################################################################

      tabItem(tabName = "trainvalidation", 

        radioButtons("crossFoldTypeUI","Cross Validation Type",c("K-Fold CV"='cv',"Repeated KFold CV"="repeatedcv"),"K-Fold CV"),
        numericInput("foldsUI","Number of Folds(k)",5),
        conditionalPanel(condition="input.crossFoldTypeUI == repeatedcv",
        numericInput("repeatUI","Number of Repeats",5)),
                uiOutput("CVTypeUI"),
                radioButtons("preprocessingUI","Pre-processing Type",c('No Preprocessing'="",'PCA'="pca",'ICA'="ica"),'No Preprocessing'),
                          uiOutput("ppUI")
        ),

      ##################################################################################
      ####   Algorithm Tab Set Contents
      ##################################################################################

      tabItem(tabName = "algorithm",
        fluidRow(
          box(title = "K- Nearest Neighbor", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,

              checkboxInput("KNNmodelSelectionUI", "On/Off", value = FALSE),
              h4("KNN is a non-parametric method used for classification and regression.  In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression."),
              uiOutput("KNNmodelParametersUI"),
                         tags$hr()
                         )
          ),

        fluidRow(
          box(title = "Boosted Logistic Regression", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
              checkboxInput("LGRmodelSelectionUI", "On/Off", value = FALSE),
              h4("LogitBoost is a boosting algorithm formulated by Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The original framework use the ADA boosting method in context with logistic regression."),
                  uiOutput("LGRmodelParametersUI"),
                         tags$hr()
                         )
          ),

        fluidRow(
          box(title = "Gradient Boosting Method", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
              checkboxInput("GBMmodelSelectionUI", "On/Off", value = FALSE),
              h4("Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function."),
                         uiOutput("GBMmodelParametersUI"),
                         tags$hr()
                         )
          ),


        fluidRow(
          box(title = "Neural Network", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
              checkboxInput("modelSelectionUI", "On/Off", value = FALSE),
              h4("Artifical Neural Networks are a family of statistical learning models inspired by biological neural networks (the central nervous systems of animals, in particular the brain) and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown."),
                          uiOutput("modelParametersUI"),
                          tags$hr()
            )
          ),


        uiOutput("dummyTagUI"),
        uiOutput("GBMdummyTagUI"),
        uiOutput("KNNdummyTagUI"),
        uiOutput("LGRdummyTagUI"),
        actionButton("runAnalysisUI", " Run", icon = icon("play"))),

      ############################################

      tabItem(tabName = "features", 
        fluidPage(plotOutput("caretPlotUI", width = "950px", height = "750px"))),

      ##################################################################################
      ####   Algorithm Tab Set Contents
      ##################################################################################

      tabItem(tabName = "results",
        fluidRow(
          box(title = "K-Nearest Neighbor", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,

              tabBox(
              tabPanel("Best Results",tableOutput("KNNbestResultsUI")),
              tabPanel("Train Results",tableOutput("KNNtrainResultsUI")),
              tabPanel("Accuracy Plot",plotOutput("KNNfinalPlotUI")))
            )
          ),

        fluidRow(
          box(title = "Logistic Regression", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
           tabBox(

              tabPanel("Best Results",tableOutput("LGRbestResultsUI")),
              tabPanel("Train Results",tableOutput("LGRtrainResultsUI")),
              tabPanel("Accuracy Plot",plotOutput("LGRfinalPlotUI"))
              )
            )
          ),

        fluidRow(
          box(title = "Gradient Boosting Method", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
            tabBox(

              tabPanel("Best Results",tableOutput("GBMbestResultsUI")),
              tabPanel("Train Results",tableOutput("GBMtrainResultsUI")),
              tabPanel("Accuracy Plot",plotOutput("GBMfinalPlotUI")))
            )
          ),


        fluidRow(
          box(title = "Neural Network", status = "primary", solidHeader = TRUE, collapsible = TRUE, width = 11,
            tabBox(

              tabPanel("Best Results",tableOutput("bestResultsUI")),
              tabPanel("Train Results",tableOutput("trainResultsUI")),
              tabPanel("Accuracy Plot",plotOutput("finalPlotUI"))
              )
            )
          )),

      ############################################

      tabItem(tabName = "about",
        fluidRow(
          box(title = "Contact", status = "primary", solidHeader = TRUE,
            collapsible = TRUE, width = 8,
            h4("Joseph Lee"), 
            h5("Programmer, Full Stack Developer"), 
            h4("Avi Yashchin"),
            h5("Co-programmer, Server Architect"),
            h4("NYC Data Science Academy")
            )),

        fluidRow(
          box(title = "Beta 1.0", status = "primary", solidHeader = TRUE,
            collapsible = TRUE, width = 8,
            h4("Version 1.0 Notes"), 
            h5("- Next version iteration will focus on data munging and cleaning as well as implementing more UI functions for feature engineering.
              The version should work relatively well with clean data."),
            h5("-Data that is not clean with mislabeled levels and factors will likely break the application or produce highly innacurate results."),
            h5("-Current version only uses Accuracy metric, next version will ideally incorporate ROC evaluation"),
            h5("-Next version will incorporate using test data to produce prediction data for kaggle competitions")
            ))),


      tabItem(tabName = "code",
        fluidRow(
          box(title = "Code", status = "primary", solidHeader = TRUE,
            collapsible = TRUE, width = 8,
            h5("The code is open source and available at the github link: [To Be Posted Soon]"))))
      ))


))

 


SERVER.R

require(shiny);
require(caret);
require(e1071);
require(randomForest);
require(nnet);
require(glmnet);
require(gbm);
library(mice);
library(VIM);
require(fastICA);
require(pastecs);
library(googleVis);
library("PASWR");
require("doMC")
source("helpers.R")

registerDoMC(cores = 2)  # parallel backend so caret can train resamples on 2 cores

options(shiny.maxRequestSize = 100*1024^2)  # raise the upload limit to 100 MB

shinyServer(function(input,output,session)
{
  
  #reactive object, responsible for loading the main data
  rawInputData = reactive({
    rawData = input$rawInputFile
    headerTag = input$headerUI;
    sepTag = input$sepUI;
    quoteTag = input$quoteUI;
    
    
    if(!is.null(rawData)) {
      data = read.csv(rawData$datapath,header=headerTag,sep=sepTag,quote=quoteTag);
    } else {
      return(NULL);
    }
    
  });


output$image2 <- renderImage({
  return(list(src = "images/reuse_these_tools.png", contentType = "image/png", align = "center"))
  }, deleteFile = FALSE)



 output$missingBox <- renderValueBox({
    data = rawInputData()
    missing = sapply(data, function(x) sum(is.na(x)))
    df.missing = data.frame(missing)
    total = sum(df.missing$missing)

    valueBox(
      paste0(total), "Missing Value(s)", icon = icon("calculator"),
      color = "purple"
    )
  })

  output$stat_sum <- renderDataTable({ 
    data = rawInputData()
    df.data = stat.desc(data)
    df.data
    
  })

  output$test.clean = renderDataTable({

    pre.data = rawInputData()

    missing.data = Missingness_Analysis(pre.data, 5, 5, "mean")  # operate on the uploaded data
    clean.data = missing.data

    df.clean = data.frame(clean.data)
    df.clean

  })

   output$str.data = renderText({

    pre.data = rawInputData()

    df.clean = str(pre.data)
    df.clean

  })

  #responsible for building the model, responds to the button
  #REQUIRED, as the panel that holds the result is hidden and trainResults will not react to it, this one will  
  output$dummyTagUI = renderUI({
    dataInput = trainResults()
    if(is.null(dataInput))
      return();
    activeTab = updateTabsetPanel(session,"mainTabUI",selected="Model Results View");
    return();
  })

  output$GBMdummyTagUI = renderUI({
    dataInput = GBMtrainResults()
    if(is.null(dataInput))
      return();
    activeTab = updateTabsetPanel(session,"mainTabUI",selected="Model Results View");
    return();

  })

  output$LGRdummyTagUI = renderUI({
    dataInput = LGRtrainResults()
    if(is.null(dataInput))
      return();
    activeTab = updateTabsetPanel(session,"mainTabUI",selected="Model Results View");
    return();

  })

  output$KNNdummyTagUI = renderUI({
    dataInput = KNNtrainResults()
    if(is.null(dataInput))
      return();
    activeTab = updateTabsetPanel(session,"mainTabUI",selected="Model Results View");
    return();

  })
  
  
########### Aux Functions ##############
  output$textmissing <- renderText({ 
    
    data = rawInputData()
    
    missing = sapply(data, function(x) sum(is.na(x)))
    df.missing = data.frame(missing)
    total = sum(df.missing$missing)
    
    paste("Number of Missing Values: ", total)
    
  })
  
  output$colmissing <- renderDataTable({ 
    data = rawInputData()
    missing = sapply(data, function(x) sum(is.na(x)))
    frame.missing = data.frame(missing)
    observation = rownames(frame.missing)
    df.missing = cbind(observation, frame.missing)
    df.missing
    
  })
  
  output$pre.data <- renderDataTable({ 
    data = rawInputData()
    df.data = data.frame(data)
    df.data
    
  })

  output$summary.data <- renderDataTable({

    data = rawInputData()
    df.sum = data.frame(summary(data))
    df.sum
  })
  
###############################################

  
  #this is the function that responds to the clicking of the button
  trainResults = eventReactive(input$runAnalysisUI,{
#     #respond to the button
#     input$runAnalysisUI;
    
    #the model we are interested in
    modelTag = isolate(input$modelSelectionUI);
    
    #make sure the data are loaded
    newData = isolate(rawInputData());
    if(is.null(newData))
      return();
    
    #grab the column
    column = isolate(input$modelLabelUI);
    
    columnElement = which(colnames(newData) == column);
    
    foldsType = isolate(input$crossFoldTypeUI);
    
    folds = isolate(input$foldsUI);
    
    control = trainControl(method=foldsType,number=folds)
    
    if(foldsType == "repeatedcv")
    {
      numberOfRepeats = isolate(input$repeatUI);
      control = trainControl(method=foldsType,number=folds,repeats=numberOfRepeats);
    }
    
    preprocessType = isolate(input$preprocessingUI);
    
    #build the equation
    form = as.formula(paste(column," ~ .",sep=""));
    
    kFolds = isolate(input$foldsUI);
    
    foldType = isolate(input$crossFoldTypeUI);
    
    if(preprocessType == "")
      preprocessType = NULL;
    
    results = NULL;
    
    results = withProgress(session, min=1, max=2, {
      setProgress(message = 'Neural Net in progress...')
    
      setProgress(value = 1)
      
      #choose the view based on the model
     if(modelTag == TRUE) {
        
        #familyData = isolate(input$nnModelTypeUI);
        nnRange = isolate(input$nnSizeUI);
        numNN = isolate(input$nnSizeRangeUI);
        nnDecayRange = isolate(input$nnDecayUI);
        numnnDecayRange = isolate(input$nnDecayRangeUI);
        
        gridding = expand.grid(.size=seq(nnRange[1],nnRange[2],length.out=numNN),.decay=seq(nnDecayRange[1],nnDecayRange[2],length.out=numnnDecayRange));
        
        results = train(form,data=newData,tuneGrid=gridding,method="nnet",trControl=control,preProcess=preprocessType);
        return(results);
        
      }

      setProgress(value = 2);
    });
    
    return(results);
    
  })


  GBMtrainResults = eventReactive(input$runAnalysisUI,{
#     #respond to the button
#     input$runAnalysisUI;
    
    #the model we are interested in
    modelTag = isolate(input$GBMmodelSelectionUI);
    
    #make sure the data are loaded
    newData = isolate(rawInputData());
    if(is.null(newData))
      return();
    
    #grab the column
    column = isolate(input$modelLabelUI);
    
    columnElement = which(colnames(newData) == column);
    
    foldsType = isolate(input$crossFoldTypeUI);
    
    folds = isolate(input$foldsUI);
    
    control = trainControl(method=foldsType,number=folds)
    
    if(foldsType == "repeatedcv")
    {
      numberOfRepeats = isolate(input$repeatUI);
      control = trainControl(method=foldsType,number=folds,repeats=numberOfRepeats);
    }
    
    preprocessType = isolate(input$preprocessingUI);
    
    
    
    #build the equation
    form = as.formula(paste(column," ~ .",sep=""));
    
    kFolds = isolate(input$foldsUI);
    
    foldType = isolate(input$crossFoldTypeUI);
    
    if(preprocessType == "")
      preprocessType = NULL;
    
    results = NULL;
    
    results = withProgress(session, min=1, max=2, {
      setProgress(message = 'GB Method in progress...')
    
      setProgress(value = 1)
      
      
      #choose the view based on the model
     if(modelTag == TRUE) {
        
        #familyData = isolate(input$gbmModelTypeUI);
        n.trees = isolate(input$gbmNTrees);
        shrinkage = isolate(input$gbmShrinkage);
        n.minobsinnode = isolate(input$gbmMinTerminalSize);
        interaction.depth = isolate(input$gbmInteractionDepth);

        
        gridding = expand.grid(n.trees = seq(1:n.trees),interaction.depth = c(1, 5, 9), shrinkage = shrinkage, n.minobsinnode = n.minobsinnode)
        
        
      
        
        results = train(form,data=newData,tuneGrid=gridding,method="gbm",trControl=control,preProcess=preprocessType);
        return(results);
        
      }

      setProgress(value = 2);
    });
    
    return(results);
     
  })

LGRtrainResults = eventReactive(input$runAnalysisUI,{
#     #respond to the button
#     input$runAnalysisUI;
    
    #the model we are interested in
    modelTag = isolate(input$LGRmodelSelectionUI);
    #make sure the data are loaded
    newData = isolate(rawInputData());
    if(is.null(newData))
      return();
    
    #grab the column
    column = isolate(input$modelLabelUI);
    columnElement = which(colnames(newData) == column);
    foldsType = isolate(input$crossFoldTypeUI);
    folds = isolate(input$foldsUI);
    control = trainControl(method=foldsType,number=folds)
    
    if(foldsType == "repeatedcv")
    {
      numberOfRepeats = isolate(input$repeatUI);
      control = trainControl(method=foldsType,number=folds,repeats=numberOfRepeats);
    }
    
    preprocessType = isolate(input$preprocessingUI);
    
    #build the equation
    form = as.formula(paste(column," ~ .",sep=""));
    
    kFolds = isolate(input$foldsUI);
    
    foldType = isolate(input$crossFoldTypeUI);
    
    if(preprocessType == "")
      preprocessType = NULL;
    
    results = NULL;
    
    results = withProgress(session, min=1, max=2, {
      setProgress(message = 'Boosted Logit in progress...')
    
      setProgress(value = 1)
      
      
      #choose the view based on the model
     if(modelTag == TRUE) {
        nIter = isolate(input$logregIter);
        gridding = expand.grid(nIter = seq(1:nIter))
        results = train(form,data=newData,tuneGrid=gridding,method="LogitBoost",trControl=control,preProcess=preprocessType);
        return(results);
      }

      setProgress(value = 2);
    });
    
    return(results);
  })


KNNtrainResults = eventReactive(input$runAnalysisUI,{
#     #respond to the button
#     input$runAnalysisUI;
    
    #the model we are interested in
    modelTag = isolate(input$KNNmodelSelectionUI);
    
    #make sure the data are loaded
    newData = isolate(rawInputData());
    if(is.null(newData))
      return();
    
    #grab the column
    column = isolate(input$modelLabelUI);
    
    columnElement = which(colnames(newData) == column);
    
    foldsType = isolate(input$crossFoldTypeUI);
    
    folds = isolate(input$foldsUI);
    
    control = trainControl(method=foldsType,number=folds)

    #ctrl = trainControl(method = "repeatedcv", repeats = 5)
    #ctrl <- trainControl(method="repeatedcv",repeats = input$repeatUI)
    
    if(foldsType == "repeatedcv")
    {
      numberOfRepeats = isolate(input$repeatUI);
      control = trainControl(method=foldsType,number=folds,repeats=numberOfRepeats);
    }
    
    preprocessType = isolate(input$preprocessingUI);
    
    #build the equation
    form = as.formula(paste(column," ~ .",sep=""));
    
    kFolds = isolate(input$foldsUI);
    
    foldType = isolate(input$crossFoldTypeUI);
    
    if(preprocessType == "")
      preprocessType = NULL;
    
    results = NULL;
    
    results = withProgress(session, min=1, max=2, {
      setProgress(message = 'KNN in progress...')
    
      setProgress(value = 1)
      
      #choose the view based on the model
     if(modelTag == TRUE) {
        tuneLength = isolate(input$knnTuneLength);
        ctrl <- trainControl(method = "repeatedcv", repeats = 5)
        results = train(form, data = newData, method = "knn", tuneLength = tuneLength, trControl = ctrl)
        return(results);  
      }
      setProgress(value = 2);
    });
    return(results);
  })


  #responsible for displaying the full results
  output$trainResultsUI = renderTable({
    data = trainResults();
    if(is.null(data))
      return();
    data$results
  })

  output$GBMtrainResultsUI = renderTable({
    data = GBMtrainResults();
    if(is.null(data))
      return();
    data$results
  })

  output$LGRtrainResultsUI = renderTable({
    data = LGRtrainResults();
    if(is.null(data))
      return();
    data$results
  })


  output$KNNtrainResultsUI = renderTable({
    data = KNNtrainResults();
    if(is.null(data))
      return();
    data$results
  })
  
  #the one that matches the best
  output$bestResultsUI = renderTable({
    data = trainResults();
    if(is.null(data))
      return();
    data$results[as.numeric(rownames(data$bestTune)[1]),];
  })

  #the one that matches the best
  output$GBMbestResultsUI = renderTable({
    data = GBMtrainResults();
    if(is.null(data))
      return();
    data$results[as.numeric(rownames(data$bestTune)[1]),];
  })

  output$LGRbestResultsUI = renderTable({
    data = LGRtrainResults();
    if(is.null(data))
      return();
    data$results[as.numeric(rownames(data$bestTune)[1]),];
  })


    #the one that matches the best
  output$KNNbestResultsUI = renderTable({
    data = KNNtrainResults();
    if(is.null(data))
      return();
    data$results[as.numeric(rownames(data$bestTune)[1]),];
  })
  
  #a feature plot using the caret package
  output$caretPlotUI = renderPlot({
    data = rawInputData();
    column = input$modelLabelUI;
    
    
    #check if the data is loaded first
    if(is.null(data)){
      return()
    } else {
      columnElement = which(colnames(data) == column);  
      
      p = featurePlot(x=data[,-columnElement],y=data[,columnElement],plot="pairs",auto.key=T);
      print(p);
    }
  })
  
  #the results graph of the caret output
  output$finalPlotUI = renderPlot({
    data = trainResults();
    if(is.null(data)){
      return();
    } else {
      
      #the model we are interested in
      modelTag = isolate(input$modelSelectionUI);

      if (modelTag == TRUE){
      
      
      #grab the column
        column = isolate(input$modelLabelUI);
      
      #build the equation
        form = as.formula(paste(column," ~ .",sep=""));
        par(mfrow=c(2,1));
        p = plot(data);
        print(p);
        }else{

        return()
    }
      
    }
  })

  output$LGRfinalPlotUI = renderPlot({
    data = LGRtrainResults();
    if(is.null(data)){
      return();
    } else {
      
      #the model we are interested in
      modelTag = isolate(input$LGRmodelSelectionUI);

      if (modelTag == TRUE) {
      
      
      #grab the column
        column = isolate(input$modelLabelUI);
      
      #build the equation
        form = as.formula(paste(column," ~ .",sep=""));
        par(mfrow=c(2,1));
        p = plot(data);
        print(p);
    } else {
      return()
    }
      
    }
  })

  output$GBMfinalPlotUI = renderPlot({
    data = GBMtrainResults();
    if(is.null(data)){
      return();
    } else {
      
      #the model we are interested in
      modelTag = isolate(input$GBMmodelSelectionUI);
      
      if(modelTag == TRUE){
      #grab the column
        column = isolate(input$modelLabelUI);
      
      #build the equation
        form = as.formula(paste(column," ~ .",sep=""));
        par(mfrow=c(2,1));
        p = plot(data);
        print(p);
      
      }else{
        return()
      }
      #       if(modelTag == "nn")
      #       {
      #       data$finalModel$call$formula = form;
      #       
      #       
      #       plot(data$finalModel);
      #       
      #       } else if(modelTag == "rf")
      #       {
      #         plot(data$finalModel);  
      #       }
      
    }
  })
   output$KNNfinalPlotUI = renderPlot({
    data = KNNtrainResults();
    if(is.null(data)){
      return();
    } else {
      
      #the model we are interested in
      modelTag = isolate(input$KNNmodelSelectionUI);

      if(modelTag ==TRUE){
      
      #grab the column
       column = isolate(input$modelLabelUI);
      
      #build the equation
       form = as.formula(paste(column," ~ .",sep=""));
       par(mfrow=c(2,1));
       p = plot(data);
       print(p);
     }else{
      return()
     }
      
      #       if(modelTag == "nn")
      #       {
      #       data$finalModel$call$formula = form;
      #       
      #       
      #       plot(data$finalModel);
      #       
      #       } else if(modelTag == "rf")
      #       {
      #         plot(data$finalModel);  
      #       }
      
    }
  })

  #simple datatable of the data
  output$rawDataView = renderDataTable({
    newData = rawInputData();
    if(is.null(newData))
      return();
    newData;
  });
  
  #responsible for selecting the label you want to regress on
  output$labelSelectUI = renderUI({
    
    data = rawInputData();
    #check if the data is loaded first
    if(is.null(data)){
      return(helpText("Choose a file to load"))
    } else {
      return(selectInput("modelLabelUI","Select Target Feature",colnames(data),colnames(data)[1]));
    }
  });
  
  #a dynamic table responsible for building the input types to the model
  output$modelParametersUI = renderUI({
    
    modelTag = input$modelSelectionUI;
    
    if (modelTag == TRUE) {
      tagList(
              sliderInput("nnSizeUI","NN Size",min=1,max=25,value=c(1,5)),
              numericInput("nnSizeRangeUI","NN Size Range",5),
              sliderInput("nnDecayUI","NN Decay",min=0.0,max=1.0,value=c(0,0.1),step=0.001),
              numericInput("nnDecayRangeUI","NN Decay Range",5))      
    }
    
  })

  output$GBMmodelParametersUI = renderUI({
    
    modelTag = input$GBMmodelSelectionUI;
    
    if (modelTag == TRUE) {
      tagList(
              sliderInput("gbmNTrees","NN Size",min=1,max=25,value=c(1,5)),
              numericInput("gbmInteractionDepth","Interaction Depth",5),
              sliderInput("gbmShrinkage","Shrinkage",min=0.0,max=5.0,value=c(0,1), step = 0.1),
              numericInput("gbmMinTerminalSize","n.minobsinnode",0))      
    }
    
  })

  output$LGRmodelParametersUI = renderUI({
    
    modelTag = input$LGRmodelSelectionUI;
    
    if (modelTag == TRUE) {
      tagList(
              numericInput("logregIter","Number of Iterations",5))   
    }
    
  })

  output$KNNmodelParametersUI = renderUI({
    
    modelTag = input$KNNmodelSelectionUI;
    
    if (modelTag == TRUE) {
      tagList(
              numericInput("knnTuneLength","Tune Length",20))
    }
    
  })
})

 

 

How To Use Our App

Our app has three main parts.  The first is the data preparation feature.  You can upload almost any csv file, and the app will automatically perform missingness analysis.  We mainly used the iris.csv data set as our test set.

The second part is the analysis feature, which contains three sub-features: preprocessing, feature graphing, and modeling.  For preprocessing, we kept the options minimal and included cross-validation, PCA, and ICA options that can be activated through Shiny widgets.  Feature graphing is a simple graphing application that produces a lattice plot of all the features in the data set.  For modeling, we made four algorithms available for the user to choose and tune: KNN, boosted logistic regression, the gradient boosting method, and neural networks.  All of the modeling algorithms come from caret, a fantastic package to familiarize yourself with if you want to pursue more machine learning applications in R.
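
Under the hood, those widget choices map directly onto caret arguments.  Roughly (a sketch on the iris data, with our own parameter values):

library(caret)

set.seed(42)
# the "Repeated KFold CV" radio button becomes trainControl()...
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# ...and the "PCA" preprocessing option becomes the preProcess argument
fit <- train(Species ~ ., data = iris,
             method     = "knn",
             preProcess = "pca",   # or "ica", or NULL for no preprocessing
             trControl  = ctrl)
fit$preProcess  # the PCA step caret applied before each resample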

The third part of the app is the results feature.  After selecting the models to run against the uploaded data set, you can see and compare the results from the different models on one page.

The demo video above provides a visual overview of how to operate the Shiny application.  Feel free to reach out to us if you have any questions or comments about the code.  Again, this is completely open source, so please take it and play around with it!
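
If you grab the source, launching the workbench locally takes one call.  The folder name below is a placeholder for wherever you put ui.R, server.R, and helpers.R:

library(shiny)
runApp("ml-explorer")  # placeholder path to the app directory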

 

 

Conclusion

We are living in an exciting era of data science.  Looking across the different perspectives on the open-source resources available to any data enthusiast, it's easy to see how so many startups are gaining traction in industry.  Everything you need to start a data science startup is only a few keystrokes away.  For fun, we loaded our app onto an AWS instance and compared its computational runtime with a local instance of the app.  We found that our app ran 10x faster on the AWS instance than on a local machine (MacBook Air, 8 GB RAM, 512 GB SSD).  This goes to show that anyone can create a budget-friendly data science startup, as long as you are creative and determined to see it through.  Thank you for reading our blog post, and as a bonus, please enjoy our deep learning art below!
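
The comparison itself required nothing fancy: we wrapped a training run in system.time() on each machine.  A sketch of the idea, with the iris data standing in for the real workload:

library(caret)

set.seed(42)
timing <- system.time(
  train(Species ~ ., data = iris, method = "knn",
        trControl = trainControl(method = "repeatedcv", number = 5, repeats = 20))
)
timing["elapsed"]  # run the identical call locally and on AWS, then compare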

 

Bonus Deep Learning Art

WE HAD SO MUCH EXTRA AWS HORSEPOWER, WE MADE DEEP LEARNING ART USING PICTURES OF JOE'S CAT:

[Images: three deep-learning renderings of Joe's cat.]

 

 

About Authors

Joseph Lee

A recent graduate of Northwestern University with a B.S. in Biomedical Engineering and a minor in computer science, Joseph has a strong background in computer engineering and programming concepts. His previous work and academic studies contain a panoply...

Avi Yashchin

Avi Yashchin is a serial entrepreneur in the technology, finance and education businesses. After a summer of working on the Sloan Digital Sky Survey at NASA, he began his career as a high frequency algorithmic trader in the...


Comments

Hank January 12, 2016
thanks for the good article. minor feedback: h2o is completely FOSS, licensed under apache 2.0 license.
