NISS/CANSSI Workshop on R & Spark: Tools for Data Science Workflows

Thursday, April 12, 2018 - 9:00am to Friday, April 13, 2018 - 5:00pm, in Toronto, Canada

COURSE OUTLINE
R is a flexible, extensible statistical computing environment, but by default it is limited to single-core execution. Spark is a distributed computing environment that treats R as a first-class programming language. This course introduces data structures in R and their use in functional programming workflows relevant to data science.

The course covers the initial steps in the data science process, i.e., extract, transform, load (ETL):

  • extracting data from source systems
  • transforming data into a tidy form
  • loading data into distributed file systems, distributed data warehouses, and NoSQL databases

This workflow is illustrated using the SparkR and sparklyr packages, which are R frontends to Spark.

SparkR and sparklyr are then used as interfaces for modeling big data using supervised learning methods for regression and classification. Unsupervised learning methods, such as clustering and dimension reduction, are also covered. Additional methods, such as gradient boosting and deep learning, are illustrated using the h2o and rsparkling R packages. Finally, methods for analyzing streaming data are presented. The course finishes with an in-depth example. The infrastructure and content are containerized for easy download to your laptop using Docker.

INSTRUCTOR

E. James Harner 
E. James Harner is Professor Emeritus of Statistics at West Virginia University (WVU). He was the Chair of the Department of Statistics for 17 years and the Director of the Cancer Center Bioinformatics Core for 15 years at WVU. Currently, he is the Chairman of the Interface Foundation of North America, which has partnered with the American Statistical Association to organize the annual Symposium on Data Science and Statistics (SDSS) beginning in May 2018. His areas of technical and research expertise include bioinformatics, high-dimensional modeling, high-performance computing, streaming and big data modeling, and statistical machine learning.

FEES: 
US $380 for currently enrolled students.
US $760 for employees of NISS Affiliates and members of CANSSI. 
US $990 for all others.

VENUE
Emmanuel College 
BC20 in 75 Queen's Park Crescent E
Victoria University in the University of Toronto
Toronto, ON M5S 1K7, Canada. (See map, right)

Nearby Hotel Information
A comprehensive list of hotels near the venue is available at http://www.fields.utoronto.ca/resources/housing-resources. The Holiday Inn Bloor Yorkville and the Eaton Chelsea Hotel are the most convenient to the location of the short course. If you prefer to search on TripAdvisor, the relevant neighborhoods are The Annex and Downtown. Please note that travel costs are not included in the course fee and are not reimbursable by NISS or CANSSI.

PREREQUISITES FOR THIS COURSE
Differential calculus, basic matrix algebra, a statistics course covering regression, and basic R. Special rates are available for students.
Operating systems: macOS 10.11 (El Capitan) or higher, or Windows 10 Professional. Students must bring their own laptops.

HOW TO REGISTER 

  1. Pay online with a credit card: Please select one of the options from the drop-down menu on the right side of your screen.
  2. Call or email: Call (202) 862-4316 or write to officeadmin@niss.org to register.

CONTACT US

  1. Direct questions about this course to the instructor, E. James Harner, at eharner@mail.wvu.edu, or call his cell phone at 304-376-4170.
  2. For other questions, contact officeadmin@niss.org 
Registration is not available.

Host

National Institute of Statistical Sciences (NISS)
Canadian Statistical Sciences Institute (CANSSI)

Location

Birge-Carnegie Library (Building #503)
BC20 in 75 Queen's Park Crescent E
Victoria University in the University of Toronto
Toronto, ON M5S 1K7, Canada
Map of Workshop Location