Essential Data Science for Business: Unstructured Data Analysis - Text Mining with Tidy Data Principles

February 10, 2021, 1-4 pm ET

The tutorials in this NISS series cover the top 10 analytics approaches used in business today. Students and faculty: these are perhaps the ten most important and practical topics that may not be covered in your program of study. (Review the Overview Presentation about all 10 sessions.)

Unstructured Data Analysis - Text Mining with Tidy Data Principles

Overview

Have you ever encountered text data and suspected it held useful insight, but felt frustrated about how to find it? Have you used dplyr and ggplot2 before, and are you ready to learn how unstructured data such as text can be analyzed within the tidyverse ecosystem? Do you need a flexible framework for handling unstructured data that supports tasks from exploratory data analysis to supervised predictive modeling?

Unstructured data is increasingly important in many domains, and tidy data principles and tidy tools can make mining data like text easier and more effective. In this workshop, learn how to manipulate, summarize, and visualize the characteristics of unstructured and text data using these methods and R packages from the tidy tool ecosystem. These tools are highly effective for many analytical questions and allow analysts to integrate natural language processing into effective workflows already in wide use. Explore how to implement approaches such as sentiment analysis of texts, measuring tf-idf, network analysis of words, and, time permitting, building both supervised and unsupervised text models.

At the end of this tutorial, participants will understand how to:

  • Perform exploratory data analyses of text datasets, including summarization and data visualization
  • Understand and implement both tf-idf and sentiment analysis
  • Build models for text using tidy data principles
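
As a rough illustration of the tf-idf objective above, the following is a minimal sketch using dplyr and tidytext. It is not taken from the tutorial materials: the three-document `reviews` corpus and the column names `doc` and `text` are invented for this example.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus: three short "documents" invented for illustration.
reviews <- tibble(
  doc  = c("a", "b", "c"),
  text = c("The delivery was fast and the price was fair",
           "The delivery was late and the box was damaged",
           "Fair price but the support was slow")
)

# Tidy text format: one lowercased word per row.
tidy_reviews <- reviews %>%
  unnest_tokens(word, text)

# Count each word within each document, then add tf, idf, and tf_idf columns.
review_tf_idf <- tidy_reviews %>%
  count(doc, word, sort = TRUE) %>%
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))

review_tf_idf
```

Words that occur in every document (here "the" and "was") receive a tf-idf of zero, while words unique to one document rank highest. This is why tf-idf is often used to surface the terms that distinguish one document from the rest of a collection.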

Materials

Please tune into the tutorial with a computer that has the following installed (all available for free):
  • A recent version of R, available at https://cran.r-project.org/
  • A recent version of RStudio Desktop (RStudio Desktop Open Source License), available at https://www.rstudio.com/download
  • The following R packages, which you can install by connecting to the internet, opening RStudio, and running this command in the R console:

install.packages(c("tidyverse", "tidytext", 
                   "gutenbergr", "stm",
                   "tidymodels", "glmnet", 
                   "vip", "textrecipes"))
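
To confirm before the session that the installation succeeded, a short check along the following lines may help. This is a sketch rather than part of the official instructions; the variable names `pkgs`, `installed`, and `missing_pkgs` are chosen here for illustration.

```r
# Packages listed in the install.packages() call for this tutorial.
pkgs <- c("tidyverse", "tidytext", "gutenbergr", "stm",
          "tidymodels", "glmnet", "vip", "textrecipes")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed.
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) == 0) {
  message("All tutorial packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

Any package reported as missing can be passed to install.packages() again.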

Instructor

Julia Silge (RStudio)


Series Goals

NISS is interested in sharing knowledge. To this end, these tutorials are designed to provide practical information that you can use tomorrow. Examples, projects, and code sharing are part of these sessions wherever possible.

Series Prerequisites

Participants need a working knowledge of probability distributions, statistical inference, statistical modeling, and time series analysis. Students who lack this foundation, or who have not reviewed this material within the past couple of years, will struggle with the concepts and methods that build on it.

Registration

Select a registration/payment option above the 'Register for this Event' button: $35 for this Data Science Essentials tutorial session, or $250 for all 10 Essential Data Science for Business tutorial sessions.

Can't attend this session or any of the previous sessions? Post-session access to the tutorial materials and recording can be obtained for $35 after the event is over. Purchasing all 10 sessions also gives you access to all previous session recordings and materials.

NISS Affiliates (https://www.niss.org/affiliates-list), please send an email to officeadmin@niss.org.

Notifications: You will receive an email immediately after you pay confirming your payment. Links to the event will come via email the day before and one hour prior to the actual session.


About the Instructor

Julia Silge is a data scientist and software engineer at RStudio PBC, where she works on open source modeling tools. She is an author, an international keynote speaker, and a real-world practitioner focusing on data analysis and machine learning practice. Julia loves text analysis, making beautiful charts, and communicating about technical topics with diverse audiences.

Host

National Institute of Statistical Sciences

Sponsor

National Institute of Statistical Sciences

Cost

$35 for this session; $250 for all 10 Data Science Sessions

Location

Online Tutorial
Julia Silge (Data Scientist and Software Engineer at RStudio), Instructor of the "Unstructured Data Analysis - Text Mining with Tidy Data Principles" tutorial.