Speaker
Weijie Su, Associate Professor, Wharton Statistics and Data Science Department and, by courtesy, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania
Moderator
Whitney Huang, Associate Professor of Statistics, Clemson University
Abstract
Title: Recent Advances in the Statistical Foundations of Large Language Models
Abstract: In this talk, we advocate for the development of rigorous statistical foundations for large language models (LLMs). We begin by elaborating two key features that motivate statistical perspectives for LLMs: (1) the probabilistic, autoregressive nature of next-token prediction, and (2) the complexity and black box nature of Transformer architectures. To illustrate how statistical insights can directly benefit LLM development and applications, we present two concrete examples. First, we introduce a novel statistical framework to analyze the efficiency of watermarking schemes, with a focus on a watermarking scheme developed by OpenAI for which we derive optimal detection rules that outperform existing ones. Second, we demonstrate statistical inconsistencies and biases arising from the current approach to aligning LLMs with human preference. We propose a regularization term for aligning LLMs that is both necessary and sufficient to ensure consistent alignment. Collectively, these findings showcase how statistical insights can address pressing challenges in LLMs while simultaneously illuminating new research avenues for the broader statistical community to advance responsible generative AI research. This talk is based on arXiv:2404.01245, 2405.16455, 2503.10990, and 2510.22007.
About the Speaker
Weijie Su is an Associate Professor in the Wharton Statistics and Data Science Department and, by courtesy, in the Department of Biostatistics, Epidemiology, and Informatics at the University of Pennsylvania. He serves as a Co-Director of the Penn Research in Machine Learning Center. He received his Ph.D. in Statistics from Stanford University in 2016 and his Bachelor's degree in Mathematics from Peking University in 2011. His research interests span the statistical foundations of generative AI, high-dimensional statistics, privacy-preserving machine learning, and optimization. He is a founding Co-Editor of the journal Statistical Learning and Data Science and serves as an Associate Editor for JASA, AOAS, OPRE, JMLR, FnT in Statistics, and Harvard Data Science Review. He currently serves on the Organizing Committee of ICML 2026 as Scientific Integrity Chair, where his isotonic mechanism will be deployed to enhance peer review. His work has been recognized with many honors, including the Stanford Theodore Anderson Dissertation Award, NSF CAREER Award, Sloan Research Fellowship, IMS Peter Hall Prize, SIAM Early Career Prize in Data Science, ASA Noether Early Career Award, ICBS Frontiers of Science Award in Mathematics, and IMS Medallion Lectureship. He has authored two discussion papers in JRSSB and JASA and is a Fellow of the IMS.
About the Moderator
Whitney Huang is an Associate Professor of Statistics at Clemson University, where he has served since August 2019. Prior to joining Clemson, he was a Canadian Statistical Sciences Institute (CANSSI) and Statistical and Applied Mathematical Sciences Institute (SAMSI) postdoctoral fellow at the University of Victoria (UVic), affiliated with the Pacific Climate Impacts Consortium and the School of Earth and Ocean Sciences, working with Dr. Francis Zwiers and Prof. Adam Monahan. Before his time at UVic, he held a SAMSI/University of North Carolina postdoctoral position under the supervision of Prof. Richard Smith. He received his Ph.D. in Statistics from Purdue University in August 2017, advised by Prof. Hao Zhang. During his doctoral studies, he was actively involved in the Research Network for Statistical Methods for Atmospheric and Oceanic Sciences (STATMOS) and the Center for Robust Decision Making on Climate and Energy Policy (RDCEP), collaborating with Michael Stein and Elisabeth Moyer at the University of Chicago and Doug Nychka at the National Center for Atmospheric Research. Before pursuing his doctorate at Purdue, he earned a Master’s degree in Statistics from the University of Akron and a Bachelor’s degree in Mechanical Engineering from National Cheng Kung University in Taiwan. His research interests include statistics of extremes, spatio-temporal statistics, surrogate modeling for computer experiments, time-frequency analysis, multiscale statistical modeling, spatial point processes, environmental applications, and high-frequency physiological data analysis.
About AI, StAtIstics and Data Science in Practice
The NISS AI, Statistics and Data Science in Practice series is a monthly event series that brings together leading experts from industry and academia to discuss the latest advances and practical applications in AI, data science, and statistics. Each session features a keynote presentation on a cutting-edge topic, and attendees can engage with speakers on the challenges and opportunities of applying these technologies in real-world scenarios. The series is intended for professionals, researchers, and students interested in the intersection of AI, data science, and statistics, and offers insights into how these fields are shaping various industries. It is designed to give participants exposure to and an understanding of how modern data analytic methods are applied in practice, through theoretical insights, practical examples, and discussion of issues.
Featured Topics:
- Veridical Data Science - Speaker: Bin Yu, October 15, 2024
- Random Forests: Why they Work and Why that’s a Problem - Speaker: Lucas Mentch, November 19, 2024
- Causal AI in Business Practices - Speakers: Victor Lo and Victor Chen, January 24, 2025
- Large Language Models: Transforming AI Architectures and Operational Paradigms - Speaker: Frank Wei, February 18, 2025
- Machine Learning for Airborne Biological Hazard Detection - Speaker: Jared Schuetter, March 11, 2025
- Trustworthy AI in Weather, Climate, and Coastal Oceanography - Speaker: Dr. Amy McGovern, May 13, 2025
- Sequential Causal Inference in Experimental or Observational Settings - Speaker: Aaditya Ramdas, August 26, 2025
- POSTPONED: AI, Statistics & Data Science in Practice Webinar: Reinventing Operations Management’s Research and Practice with Data Science - Speaker: David Simchi-Levi (DATE TBD)
- Covariate Adjustment, Intro to Resampling, and Surprises - Speaker: Tim Hesterberg, October 3, 2025
- Bayesian Geospatial Approaches for Prediction of Opioid Overdose Deaths Utilizing the Real-Time Urine Drug Test - Speaker: Joanne Kim, November 18, 2025
- COVID-19 Focused Cost-benefit Analysis of Public Health Emergency Preparedness and Crisis Response Programs - Speaker: Nancy McMillan, December 11, 2025
Event Type
- NISS Hosted
