Speaker:
Lucas Mentch, Associate Professor, Department of Statistics at the University of Pittsburgh
Moderator:
Nancy McMillan, Data Science Research Leader, Health Research & Analytics Business Line at Battelle
Title: Random Forests: Why they Work and Why that’s a Problem
Abstract:
Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite this empirical success, a full and satisfying explanation for it has yet to be put forth. In this talk, we will show that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. From a model-complexity perspective, this means that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicit regularization procedures like the lasso. Realizing this, we demonstrate that alternative forms of randomness can provide similarly beneficial stabilization. In particular, we show that augmenting the feature space with additional features consisting of only random noise can substantially improve the predictive accuracy of the model. This surprising fact has been largely overlooked within the statistics community, but has crucial implications for thinking about how best to define and measure variable importance. Numerous demonstrations on both real and synthetic data are provided.
About the Speaker:
Lucas Mentch is an Associate Professor in the Department of Statistics at the University of Pittsburgh. He obtained his PhD from Cornell University in 2015 and an undergraduate degree in mathematics from Bucknell University in 2010. He has published dozens of scientific articles and book chapters at the intersection of statistical inference and machine learning and has applied these methodologies to a diverse array of application areas ranging from ecology to medicine to sports statistics to law, policing, and forensic science.
About the Moderator:
Nancy McMillan currently serves as Data Science Research Leader within Battelle’s Health Research & Analytics Business Line. For a diverse set of federal government clients, she currently leads development of a large language model (LLM) based biocuration acceleration pipeline and user tool; development of pipelines, analytics, and visualizations of electronic initial case reporting data; and development of analytical methods for achieving abbreviated new drug application (ANDA) approval for an agile drug manufacturing technology. Nancy has a long history of collaborative work across Battelle bringing statistics and machine learning to Battelle’s deep capability in biology, chemistry, and material science. As a researcher and Project Management Professional, Nancy has worked and published on environmental exposure and risk assessment; transportation safety benefits; quantitative risk assessment related to chemical, biological, radiological and nuclear (CBRN) terrorism; biosurveillance; and bioinformatics. From 2017 to 2023, she managed the Health Analytics Division, a team of approximately 100 data scientists that supports Battelle’s contract research business. Nancy is a member of the Board of Trustees for the National Institute of Statistical Sciences (NISS), the Chair of NISS’s Affiliates Committee, and a member of the Organ Procurement and Transplantation Network’s Data Advisory Committee.
About AI, Data Science, and Statistics in Practice
AI, Data Science, and Statistics in Practice is a monthly NISS event series, part of the NISS Collaboratory (CoLab), that brings together leading experts from industry and academia to discuss the latest advances and practical applications in AI, data science, and statistics. Each session features a keynote presentation on a cutting-edge topic, where attendees can engage with speakers on the challenges and opportunities in applying these technologies in real-world scenarios. The series is intended for professionals, researchers, and students interested in the intersection of AI, data science, and statistics, and it offers insight into how these fields are shaping various industries. It is designed to show participants how modern data analytic methods are being applied across those industries, offering theoretical insights, practical examples, and discussion of issues.
Featured Topics:
- Veridical Data Science
- Statistics and Experimentation Needs in Industry
- Generative AI for Use in Industry
- Causal AI in Finance and Technology Industries
- Uncertainty Quantification for Random Forests
- Deep Learning Methods for Closed-Loop Neuromodulation
- Machine Learning for Airborne Biological Hazard Detection
- Causal Inference in Marketing Analytics
- Practical Return on AI Investment
Event Type
- NISS Hosted