R & Spark: Tools for Data Science Workflows

Monday and Tuesday, November 25-26, 2019, 9 am-4:30 pm

R is a flexible, extensible statistical computing environment, but by default it runs on a single core.  Spark is a distributed computing environment that treats R as a first-class programming language.  This course introduces data structures in R and their use in functional programming workflows relevant to data science.

The course covers the initial steps in the data science process: 

- extracting data from source systems,
- transforming data into a tidy form,
- loading data into distributed file systems, distributed data warehouses, and NoSQL databases, i.e., ETL. 

These R-based workflows are illustrated by using dplyr directly and as a frontend to SQL databases.

The sparklyr package, with its dplyr interface to Spark, is then used for modeling big data using supervised learning methods for regression and classification.  Unsupervised learning methods, such as clustering and dimension reduction, are also covered.  Finally, methods for analyzing streaming data are presented.  The course finishes with two in-depth examples.

Student accounts are provided so that attendees can run the R Markdown content interactively in Amazon’s cloud (AWS).  The computing infrastructure and the content are containerized, which allows the complete course environment to be downloaded and run on laptops that support Docker.

PREREQUISITES FOR THIS COURSE

Differential calculus, basic matrix algebra, a statistics course covering regression, basic R.
Operating Systems:  MacOS or Windows with an up-to-date browser, e.g., Chrome or Firefox.
Students must bring their own laptops.

FEES

$990 for the public
$760 for NISS affiliates
$590 for students with a valid student ID
$380 for students at NISS-affiliated institutions

EVENT LOCATION

Bureau of Labor Statistics Conference and Training Center, Conference Rooms 7-8, Postal Square Building, 1st Street, NE, Washington, DC 20212-0001.  Attendees are required to enter through the visitor entrance on First Street, NE (between Massachusetts Avenue and G Street, NE), across from Union Station.  Do not use the main entrance at 2 Massachusetts Avenue.  Please note that food is not allowed in any of the classrooms.  Only drinks with caps or lids are allowed in the classrooms.

ID REQUIREMENT AND SCREENING

All visitors must present a valid photo ID at the visitor’s entrance and pick up a visitor’s badge.  Visitors and packages will be processed through the x-ray and metal detector screening equipment.  Equipment brought into BLS requires property passes.  Equipment passes can be picked up from the receptionist.  Please arrive 10 minutes early to allow enough time to go through security. 

INSTRUCTOR

E. James Harner is Professor Emeritus of Statistics at West Virginia University (WVU).  He was the Chair of the Department of Statistics for 17 years and the Director of the Cancer Center Bioinformatics Core for 15 years at WVU.  Currently, he is the Chairman of the Interface Foundation of North America, which has partnered with the American Statistical Association to organize the annual Symposium on Data Science and Statistics (SDSS) beginning in May 2018.  His areas of technical and research expertise include bioinformatics, high-dimensional modeling, high-performance computing, streaming and big data modeling, and statistical machine learning.  You may direct questions about this course to the instructor, E. James Harner, at eharner@mail.wvu.edu, or call him on his cell phone at (304) 376-4170.

HOW TO REGISTER 

Pay Online with a Credit Card:  Please select one of the options from the drop-down menu at the top right of your screen to start.
Call or Email:  You can call (202) 800-3880 or email officeadmin@niss.org to have us register you.

Host

National Institute of Statistical Sciences

Location

Bureau of Labor Statistics Conference and Training Center
Conference Rooms 7-8, Postal Square Building
1st Street, NE
Washington, District of Columbia 20212-0001
United States