The course presents a new paradigm for thinking about the analysis of data. The emphasis is on getting participants comfortable with building workflows from data sources to data products, using prediction metrics rather than p-values (probability values) as the measure of success.
R is a flexible, extensible statistical computing environment, but base R is limited to single-threaded, in-memory execution. Spark is a relatively new distributed computing environment that, when accessed from R, extends this first-class programming language to multiple processors, increasing the effectiveness and efficiency with which Big Data is analyzed. Spark is used by major search engine organizations, such as Google and Yahoo; LinkedIn and Amazon use it to match advertisements to users in smart ways. This workflow is illustrated using the SparkR and sparklyr packages, which provide R frontends to Spark.
SparkR and sparklyr are then used as interfaces for modeling big data using regression and classification supervised learning methods. Unsupervised learning methods, such as clustering and dimension reduction, are also covered in this workshop. Additional methods, such as gradient boosting and deep learning, are illustrated using the h2o and rsparkling R packages. Finally, methods for analyzing streaming data are presented. The course finishes with an in-depth example.
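To give a flavor of the workflow described above, here is a minimal sketch, not taken from the course materials, of connecting R to a local Spark instance with sparklyr and fitting a regression model; the use of the built-in mtcars data set is purely illustrative.

```r
# Minimal sparklyr sketch: connect, copy data to Spark, fit a model.
library(sparklyr)
library(dplyr)

# Local mode for illustration; a cluster master URL would be used in practice.
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark as a Spark DataFrame.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Supervised learning: linear regression via Spark MLlib.
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```

The same pattern (connect, copy or read data, model, disconnect) carries over to the classification, clustering, and dimension-reduction methods covered in the workshop.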
Target Audience and Availability
The workshop is valuable to statisticians and data scientists trying to understand the distinction between statistics and data science; to students about to enter the workforce; and to employers who want to encourage their in-house data analysts toward a valuable form of continuing education. It is also valuable to people from a non-statistics background, especially those in a substantive area that uses statistics. The only prerequisite is that everyone attending the course should have some statistical training and basic knowledge of R.
DC area, late January to early March 2018.