The tutorials in this NISS series cover ten key analytics topics used in business today. Students and faculty: these are perhaps the ten most important and practical topics that may not be covered in your program of study. (Review the Overview Presentation about all 10 sessions.)
Data Science Workflows
This tutorial focuses on data science workflows (or pipelines) for building machine learning models on structured data using R and Spark. R is a flexible, extensible statistical computing environment, but base R runs on a single core by default. Spark is a distributed computing environment that treats R as a first-class programming language. The first part of the tutorial introduces data science tools supporting reproducible reports, including Linux, bash, Git, Docker, and make. R is then presented as a functional programming language using R Markdown documents.
The course covers the initial steps in the data science process using R and Spark, i.e., extract, transform, load (ETL): extracting data from source systems; transforming data into a tidy form; and loading data into a relational database (PostgreSQL), a distributed file system (HDFS), a distributed data warehouse (Hive), and Spark SQL.
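As a rough illustration of the extract, transform, and load steps described above, here is a minimal sketch. It is not taken from the course materials: the course uses R with PostgreSQL, HDFS, Hive, and Spark SQL, whereas this sketch uses Python's standard library with an in-memory SQLite database as a stand-in. The sample data, the `measurements` table, and the function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source data in "wide" form: one column per year.
RAW_CSV = """country,2019,2020
A,10,12
B,7,9
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: reshape wide rows into tidy (country, year, value) tuples."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key != "country":
                tidy.append((row["country"], int(key), float(value)))
    return tidy

def load(conn, tidy):
    """Load: write the tidy rows into a relational table (SQLite stand-in)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurements "
        "(country TEXT, year INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", tidy)
    conn.commit()

# Run the three steps end to end against an in-memory database.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Query the loaded table: total value per year.
totals = conn.execute(
    "SELECT year, SUM(value) FROM measurements GROUP BY year ORDER BY year"
).fetchall()
print(totals)  # [(2019, 17.0), (2020, 21.0)]
```

Tidy form here means one row per observation (country, year, value), which is what makes the final grouped query straightforward.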
These R-based workflows are illustrated by using `dplyr` directly and as a frontend to SQL databases. The `sparklyr` package, with its `dplyr` interface to Spark, is then used for modeling big data with supervised learning methods for regression and classification. Unsupervised learning methods, such as clustering and dimension reduction, are also covered. Finally, methods for analyzing streaming data are presented.
The course content is available as R markdown and PDF files on GitHub here: https://github.com/jharner/rspark-tutorial
This content is embedded within an Amazon Machine Image (AMI), which provides a complete R and Spark computing environment. Students are encouraged to sign up for an AWS account to run the R markdown files interactively. Follow the instructions in the README file to create an account and launch an AWS instance here: https://github.com/jharner/rspark/tree/master/AWS
The content is also available as a Dockerized application called `rspark` here: https://github.com/jharner/rspark-docker
Follow the instructions in the README file to run `rspark` on supported laptops.
E. James Harner (Professor Emeritus and Former Chair in Statistics, West Virginia University)
NISS is interested in sharing knowledge. To this end, these webinars have been geared to provide practical information that you can use tomorrow. Examples, projects and code sharing are a part of these sessions wherever possible.
Participants need a working knowledge of probability distributions, statistical inference, statistical modeling, and time series analysis. Students who lack this foundation, or who have not reviewed this material within the past couple of years, will struggle with the concepts and methods that build on it.
Select a registration/payment option above the 'Register for this Event' button ($35/session, $250 for all 10 Essential Data Science for Business tutorial sessions). NISS affiliates (https://www.niss.org/affiliates-list), please send an email to email@example.com.
About the Instructor
E. James Harner is Professor Emeritus of Statistics at West Virginia University (WVU). He was the Chair of the Department of Statistics for 17 years and the Director of the Cancer Center Bioinformatics Core for 15 years at WVU. Currently, he is the Chairman of the Interface Foundation of North America, which has partnered with the American Statistical Association to organize the annual Symposium on Data Science and Statistics (SDSS) beginning in May 2018. His areas of technical and research expertise include bioinformatics, high-dimensional modeling, high-performance computing, streaming and big data modeling, and statistical machine learning.