First CANSSI-NISS Health Data Science Workshop - FULL PROGRAM

THURSDAY, MAY 6

[Please Note: All times are Eastern Time] [Link back to Event page]

Short Course Registration

$35 registration fee per short course. Select the registration options on the right-hand side of the Event page, check the box "I am not a Robot", and then click "Register for this Event". Do this for each course you wish to attend (maximum of one short course per AM or PM session).

AM Sessions

10:00 - 13:00 Short Course 1 - "A Brief Introduction to Causal Inference" - Instructor: Yeying Zhu (University of Waterloo)

About the Course: This short course will introduce the basic concepts of causality, the potential outcomes framework, and the assumptions needed for drawing causal conclusions in observational studies. Popular methods for causal inference will be introduced: regression methods, propensity score-based methods, marginal structural modelling, instrumental variable analysis, and causal mediation analysis. The importance of achieving balance in observational studies will also be discussed. These methods and concepts will be illustrated through a health application.
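As a taste of the propensity score-based methods covered, the following sketch (a hypothetical simulation, not course material; all effect sizes and sample sizes are invented) shows inverse probability weighting (IPW) removing the confounding bias that distorts a naive comparison of treated and untreated groups:

```python
import random

random.seed(1)

# Hypothetical simulated observational study: C confounds both treatment
# assignment T and outcome Y, and the true average treatment effect is 2.
n = 20000
data = []
for _ in range(n):
    c = random.random() < 0.5                # binary confounder
    p_treat = 0.8 if c else 0.2              # treatment depends on C
    t = random.random() < p_treat
    y = 2.0 * t + 3.0 * c + random.gauss(0.0, 1.0)
    data.append((c, t, y))

# The naive difference in means is biased because treated subjects are
# disproportionately drawn from the C = 1 group.
treated = [y for c, t, y in data if t]
control = [y for c, t, y in data if not t]
naive = sum(treated) / len(treated) - sum(control) / len(control)

# Estimate the propensity score e(C) = P(T = 1 | C) from the data, then
# reweight by its inverse (IPW estimator of the average treatment effect).
def prop(c_val):
    grp = [t for c, t, y in data if c == c_val]
    return sum(grp) / len(grp)

e = {True: prop(True), False: prop(False)}
ipw = (sum(y / e[c] for c, t, y in data if t)
       - sum(y / (1.0 - e[c]) for c, t, y in data if not t)) / n

print(f"naive: {naive:.2f}, IPW: {ipw:.2f}")  # IPW recovers roughly 2
```

Here the propensity score is estimated by a simple group frequency because the confounder is binary; with many covariates one would model it instead, for example by logistic regression.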

About the Instructor: Dr. Zhu is currently an Associate Professor in the Department of Statistics & Actuarial Science at the University of Waterloo. She obtained her PhD at Pennsylvania State University in 2013 and her master's degree at the National University of Singapore in 2008. Dr. Zhu's research interests lie in causal inference, machine learning, and the interface between the two. Her recent focus is on the development of variable selection and dimension reduction procedures to adjust for confounding in high-dimensional settings. In addition, she has developed innovative machine learning algorithms for modeling propensity scores for binary, multi-level, and continuous treatments. Dr. Zhu also works on causal mediation analysis, which examines how a treatment or intervention affects the outcome through one or more intermediate variables. Applications of her research lie in biomedical studies, public health, and the social sciences.

10:00 - 13:00 Short Course 2 - "Introduction to Reinforcement Learning in Precision Medicine" - Instructor: Eric Laber (Duke University)

About the Course: A wide range of sequential decision problems in public health, including adaptive clinical trial design, optimal treatment regimes, and resource management, can be formalized as reinforcement learning (RL) problems. In this tutorial, we: (i) review the foundations of RL from a statistical perspective; (ii) illustrate the application of RL through several clinical case studies; and (iii) present open and emerging RL problems arising in mobile-/tele-health, respondent-driven sampling, management of infectious diseases, and platform trials.

About the Instructor:  Eric Laber is Professor of Statistical Science and Biostatistics and Bioinformatics at Duke University.  He has expertise in reinforcement learning and precision medicine and has published more than 60 articles and one book in this area.  His work has been recognized through numerous awards including the Noether Award and the Raymond J. Carroll Young Scholar award.   For the past two years, he has collaborated with Amazon on reinforcement learning problems in retail planning through the Amazon Scholars program.

PM Sessions

14:00 - 17:00 Short Course 3 - "Deep Learning" - Instructor: Ming Li (Amazon)

About the Course: In the past few years, deep learning has gained traction in many application areas, especially those involving non-traditional data sources such as text and images. Deep learning methods have become essential tools in the data scientist's toolbox. In this tutorial, we will first introduce general concepts in deep learning using feedforward neural networks (FFNN). We will then cover convolutional neural networks (CNN) for image-related applications and recurrent neural networks (RNN) for text-related applications. The purpose of this tutorial is to introduce the fundamental concepts and applications of deep learning. We will focus on applications, with hands-on exercises building deep learning models for two well-known datasets: a handwritten digit image dataset and the IMDB movie review dataset. If you are a statistician or data scientist with little experience or knowledge of deep learning, this is the right place to jump-start your deep learning application skills. After taking this course, you will be able to apply FFNN, CNN, and RNN methods in your day-to-day work, combining structured datasets of numerical and categorical features with unstructured text and image data.
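For a flavor of what such models look like under the hood, here is a minimal FFNN implemented from scratch in NumPy. The toy data, architecture, and training settings are invented for illustration and are not the course's MNIST/IMDB exercises:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (invented for this sketch): label points by whether they fall
# outside the unit circle, a boundary no purely linear model can capture.
X = rng.normal(size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(float).reshape(-1, 1)

# One-hidden-layer FFNN: 2 -> 8 (tanh) -> 1 (sigmoid), trained by plain
# full-batch gradient descent on the cross-entropy loss.
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return h, p

def loss(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)     # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

lr = 0.3
loss_before = loss(forward(X)[1])
for _ in range(500):
    h, p = forward(X)
    # Backpropagation: gradients of the cross-entropy w.r.t. each parameter.
    g_out = (p - y) / len(X)             # d loss / d output logits
    gW2, gb2 = h.T @ g_out, g_out.sum(0)
    g_h = (g_out @ W2.T) * (1 - h ** 2)  # back through the tanh layer
    gW1, gb1 = X.T @ g_h, g_h.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

p_final = forward(X)[1]
loss_after = loss(p_final)
acc = float(np.mean((p_final > 0.5) == y))
print(f"loss: {loss_before:.3f} -> {loss_after:.3f}, accuracy: {acc:.2f}")
```

The nonlinear hidden layer is the point of the exercise: it is what allows the network to fit a class boundary (here, a circle) that no linear classifier can represent.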

About the Instructor: Ming Li is currently a senior research scientist at Amazon and adjunct faculty at the University of Washington. He organized and presented the 2018 JSM Introductory Overview Lecture: Leading Data Science: Talent, Strategy, and Impact. He was Chair of the Quality & Productivity Section of the ASA. Before joining Amazon, he was a Data Scientist at Walmart and a Statistical Leader at the General Electric Global Research Center. He obtained his Ph.D. in Statistics from Iowa State University in 2010. With a deep statistics background and years of experience in data science and machine learning, he has trained and mentored numerous junior data scientists from backgrounds including statistics, programming, software development, database administration, and business analysis. He is also an instructor at Amazon's internal Machine Learning University and was one of the key founding members of Walmart's Analytics Rotational Program.

14:00 - 17:00 Short Course 4 - "Introduction to Disease Modeling" - Instructor: Rob Deardon (University of Calgary)

About the Course: With the ongoing COVID-19 pandemic, there has been an understandable increase in interest in the mathematical and statistical modelling of infectious disease epidemics. The modelling of infectious disease spread through a population generally requires the use of non-standard statistical models. This is primarily because infection events depend upon the infection status of other members of the population (i.e., we cannot assume independence of infection events). Typically, statistical inference for these models (e.g., parameter estimation) is done in a Bayesian context using computational techniques such as Markov chain Monte Carlo (MCMC). This is in part due to the non-standard form of the models, but also in part due to the fact that we often have missing or uncertain data; for example, infection times are rarely observed. Bayesian data augmentation provides a natural framework for allowing for such uncertainty. In this workshop, we will examine the main characteristics of such infectious disease models, and how to fit them to data within a Bayesian statistical framework. The workshop will include instruction on the use of the R packages deSolve, EpiILM, and EpiILMCT to implement models and statistical methods.
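To fix ideas before the workshop, here is the simplest deterministic compartmental model of this kind, the SIR model, integrated with plain Euler steps in Python. The parameter values are illustrative, and the course itself works with the R packages named above:

```python
# Minimal deterministic SIR model, integrated with Euler steps in plain
# Python (a stand-in for the deSolve-style ODE models covered in the course;
# beta, gamma, and the initial conditions below are illustrative).
def sir(beta, gamma, s0, i0, r0, days, dt=0.01):
    """Integrate dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I."""
    s, i, r = s0, i0, r0
    for _ in range(int(days / dt)):
        new_inf = beta * s * i * dt      # new infections this step
        new_rec = gamma * i * dt         # new recoveries this step
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
    return s, i, r

# Compartments are population proportions; R0 = beta / gamma = 3, so a
# large outbreak is expected before the epidemic burns out.
s, i, r = sir(beta=0.6, gamma=0.2, s0=0.99, i0=0.01, r0=0.0, days=120)
print(f"final susceptible: {s:.3f}, final recovered: {r:.3f}")
```

The non-independence the course description emphasizes is visible in the equations themselves: the rate of new infections depends on the product S*I, i.e., on the infection status of the rest of the population.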

About the Instructor:  Rob Deardon is a Professor of Biostatistics with a joint position in the Faculty of Veterinary Medicine and Department of Mathematics & Statistics at the University of Calgary. Much of his recent work has been in the area of infectious disease modelling, but he is also interested in Bayesian & computational statistics, experimental design, disease surveillance methods, spatio-temporal modelling, statistical learning and statistical modelling in general. He currently has a research group consisting of 11 postdocs and graduate students, and has published 65+ papers in peer-reviewed journals. He is the Graduate Coordinator of the Interdisciplinary Biostatistics Graduate Program at Calgary, and served as Chair of the Statistics Section of the NSERC Discovery Grant Mathematics & Statistics Evaluation Group.


FRIDAY, MAY 7

10:00 - 10:15  Opening Remarks

10:15 - 11:15

Plenary Talk: Xiao-Li Meng (Harvard University)

"Personalized Treatments: Sounds heavenly, but where on Earth did they find my guinea pigs?"

Abstract: Are you kidding me? Surely no one should take personalized literally. Fair enough, but then how un-personalized is personalized? That is, how fuzzy should “me” become before there are enough qualified “me”s to serve as my guinea pigs? Wavelet-inspired Multi-resolution (MR) inference (Meng, 2014, COPSS 50th Anniversary Volume) allows us to theoretically frame such a question, where the primary resolution level defines the appropriate fuzziness --- very much like identifying the best viewing resolution when taking a photo. Statistically, the search for the appropriate primary resolution level is a quest for a sensible bias-variance trade-off: estimating more precisely a less relevant treatment effect versus estimating less precisely a more relevant treatment effect for “me.” Theoretically, the MR framework provides a statistical foundation for transitional inference, an empirical concept rooted and practiced in clinical medicine since ancient Greece. Unexpectedly, the MR framework also reveals a world without the bias-variance trade-off, where the personal outcome is governed deterministically by potentially infinitely many personal attributes. This world without variance apparently prefers overfitting through the lens of statistical prediction and estimation, a discovery that might provide a clue to some of the puzzling success of deep learning and the like (Li and Meng, 2021, JASA).


11:20 - 12:40  Poster Competition for Students and New Researchers (Top 5 present!)

The First CANSSI-NISS Health Data Science Workshop invites graduate students and early researchers to submit posters that describe work in which they are currently engaged. (Individuals who earned their Ph.D. prior to January 1, 2019 are not eligible to submit.) Use the form link below to enter your contact information along with a pdf of your poster and a link to a short video presentation of your poster (maximum 10 minutes). Get to the point and be concise. Videos can be recorded using Zoom or other video platforms that allow you to submit a link to your recording. Video files will not be accepted. If you have questions or need help recording your video, email gjohnson@niss.org.

Judges will select the top five posters and these top five will present their posters live to attendees during this Poster Session for Students and New Researchers!  A first place award will be given, along with prizes for two runners-up!  Submit Your Poster Today!  (DEADLINE FOR SUBMISSIONS: April 21, 2021)

12:45 - 13:15  Break


Session 1:  Statistical Issues with COVID-19:

13:15 - 13:40  Nilanjan Chatterjee (Johns Hopkins University)

Title: "Modeling Individual- and Community-Level Risk for COVID-19 Mortality: Some Implications for Vaccine Allocations"

Abstract: Reducing COVID-19 burden for populations will require equitable and effective risk-based allocations of scarce preventive resources, including vaccinations. In this talk, I will describe our recent effort for using modern data integration techniques to develop a general population risk calculator for COVID-19 mortality based on various sociodemographic factors and pre-existing conditions for the US population. We will further show how such tools can be tailored to produce absolute risk estimates in future time frames by incorporating information on pandemic dynamics at the community level. Finally, we will describe methods for using this individual level model to map community-level risks using information on prevalence and co-occurrence of risk factors from health survey datasets.  We will illustrate applications of these methods for making risk projections in the US and various South American countries and highlight some implications for risk-based vaccine allocation.  The talk will include a live demonstration of a web-based risk calculator and interactive maps for viewing community-level risks in the US setting.

13:40 - 14:05  Rob Deardon (University of Calgary)

Title: "Modelling COVID-19 using machine learning-based inference methods"

Abstract:  Information obtained from statistical infectious disease transmission models can be used to inform the development of containment strategies. Inference procedures such as Bayesian Markov chain Monte Carlo are typically used to estimate parameters of such models, but are often computationally expensive. However, in an emerging epidemic, stakeholders must move quickly to contain spread. Here, we explore machine learning methods for carrying out fast inference via supervised classification. We consider the issues of regularization, model choice and parameter estimation. This is done within the context of spatial models, applied to both diseases of agriculture and the COVID-19 epidemic. We also consider how accurate such methods are in comparison with naïve, and much slower, MCMC approaches.

14:05 - 14:30  Xihong Lin (Harvard T.H. Chan School of Public Health)

Title: "Learning from COVID-19 Data on Transmission, Health Outcomes, Interventions and Vaccination"

Abstract: COVID-19 is an emerging respiratory infectious disease that has become a pandemic. In this talk, I will first provide a historical overview of the epidemic in Wuhan. I will then provide the analysis results of 32,000 lab-confirmed COVID-19 cases in Wuhan to estimate the transmission rates using Poisson partial differential equation-based transmission dynamic models. This model is also used to evaluate the effects of different public health interventions on controlling the COVID-19 outbreak, such as social distancing, isolation, and quarantine. I will present the results on the epidemiological characteristics of the cases. The results show that multi-faceted intervention measures successfully controlled the outbreak in Wuhan. I will next present transmission regression models for estimating transmission rates in the USA and other countries, as well as factors that affect transmission rates, including the effects of interventions such as social distancing and test-trace-isolate strategies. I will discuss estimation of the proportion of undetected cases, including asymptomatic, pre-symptomatic, and mildly symptomatic cases, the chances of resurgence in different scenarios, prevalence, and the factors that affect transmission. I will also present the US county-level analysis of the demographic, socio-economic, and comorbidity factors that are associated with COVID-19 case and death rates, as well as the analysis results of >500,000 participants of the HowWeFeel project on health outcomes and behaviors in the US, and discuss the factors associated with infection, behavior, and vaccine hesitancy. Finally, I will discuss efficient pooled testing design using hypergraph factorization.

14:30 - 14:55  Grace Yi (Western University)

Title: "Analyzing COVID-19 Data: Some Issues and Challenges"

Abstract: The mystery of the coronavirus disease 2019 (COVID-19) and the lack of effective treatment for COVID-19 have had a strikingly negative impact on public health. While research on COVID-19 has been ramping up rapidly, a very important yet somewhat overlooked challenge concerns the quality and unique features of COVID-19 data. The manifestations of COVID-19 are not yet well understood. The swift spread of the virus is largely attributed to its stealthy transmission, in which infected patients may be asymptomatic or exhibit only flu-like symptoms in the early stage. Because a substantial portion of infections are asymptomatic, confirmed case counts are typically under-reported, error-contaminated, and subject to substantial noise. In this talk, I will discuss some issues related to faulty COVID-19 data and how they may challenge inferential procedures.

14:55 - 15:00  Q&A


Session 2:  Causal Inference for Big Health Data:

15:15 - 15:40  Caroline Uhler (Massachusetts Institute of Technology)

Title: "Causal Inference in the Light of Drug Repurposing for COVID-19"

Abstract: Massive data collection holds the promise of a better understanding of complex phenomena and ultimately, of better decisions. An exciting opportunity in this regard stems from the growing availability of perturbation / intervention data (drugs, knockouts, overexpression, etc.) in biology. In order to obtain mechanistic insights from such data, a major challenge is the development of a framework that integrates observational and interventional data and allows predicting the effect of yet unseen interventions or transporting the effect of interventions observed in one context to another. I will present a framework for causal structure discovery based on such data and demonstrate how these ideas can be applied for drug repurposing in the current SARS-CoV-2 crisis.

15:40 - 16:05  Debashis Ghosh (Colorado School of Public Health)

Title: "Thinking Causally in High Dimensions"
  
Abstract: With the availability of ever-increasingly large observational databases in epidemiological and medical studies, we seek to understand how the classical potential outcomes framework and the attendant causal inferential procedures can be applied to this setting. While there has been a lot of work on statistical methods for high-dimensional data, we argue that there are aspects of causal inference that make the problem more challenging. In this talk, we will describe three non-intuitive findings: (i) the 'treatment positivity' assumption from causal inference becomes less innocuous in higher dimensions; (ii) margin theory from machine learning can be used in high-dimensional causal problems; and (iii) gradient boosting yields a powerful tool for causal effect estimation.
  
This is joint work with Efrén Cruz-Cortés (Penn State University), Kevin Josey (Harvard School of Public Health), Elizabeth Juaréz-Colunga (University of Colorado) and Fan Yang (University of Colorado).

16:05 - 16:30  Erica Moodie (McGill University)

Title: "Preserving privacy in estimating individualized treatment rules from distributed data collection sites"

Abstract: Precision medicine is a rapidly expanding area of health research wherein patient-level information is used to inform care via individualized treatment rules (ITR). Identifying the ITR which optimizes expected patient outcome is of obvious interest, and has to date been done exclusively using individual-level data. However, estimating ITRs requires large amounts of data and may necessitate multi-centre collaborations, which may raise concerns about data privacy. In this talk, I will introduce ITRs and a straightforward, doubly-robust estimation method, and discuss approaches to preserving privacy while producing unbiased estimates of rules that tailor treatment to individual characteristics. The preferred approach is illustrated via an analysis of optimal warfarin dosing. This is joint work with Coraline Danieli.

16:30 - 16:55  Dylan Small (The Wharton School, University of Pennsylvania)

Title: "Do C-Sections Protect Very Premature Babies? Dissonant Conclusions When Testing the Validity of an Instrumental Variable"

Abstract: An instrument or instrumental variable is often used in an effort to avoid selection bias in inference about the effects of treatments when treatment choice is based on thoughtful deliberation. Instruments are increasingly used in health outcomes research. An instrument is a haphazard push to accept one treatment or another, where the push can affect outcomes only to the extent that it alters the treatment received. There are two key assumptions here: (R) the push is haphazard or essentially random once adjustments have been made for observed covariates, (E) the push affects outcomes only by altering the treatment, the so-called “exclusion restriction.” These assumptions are often said to be untestable; however, that is untrue if testable means checking the compatibility of assumptions with other things we think we know. A test of this sort may result in a collection of claims that are individually plausible but mutually inconsistent, without a clear indication as to which claim is culpable for the inconsistency. We discuss this subject in the context of a study of the effects of delivery by cesarean section on the survival of extremely premature infants of 23–24 weeks gestational age. This talk is based on joint work with Fan Yang, Jose Zubizarreta, Dylan Small, Scott Lorch, and Paul Rosenbaum on one paper and joint work with Colin Fogarty, Pixu Shi, and Mark Mikkelsen on another paper.

16:55 - 17:00  Q&A


SATURDAY, MAY 8

10:00 - 11:00

Plenary Talk: Mary Thompson (University of Waterloo)

"The interface of health data science and survey methods"

Abstract: Data science encompasses data acquisition and management, modeling and analysis, summarization and dissemination. For health data science in particular, overarching themes are data quality and validity, security of data, and privacy. In health research, certain issues have particular emphasis: How can we facilitate making appropriate data and their summaries readily available to health care practitioners and administrators? How can we fill data gaps? How can we ensure the soundness of quantifiable conclusions? In this talk, with examples, I will discuss roles of survey methodology in health data science, including: the quantification of bias and precision with complex data collection designs; the combination of health survey data with data from new and traditional sources; the complementary roles of longitudinal and cross-sectional survey data; and the design of surveys to support causal inference.
 


Late-Breaking Session: AI and Health Data Science

11:05 - 11:35  Bin Yu (University of California, Berkeley)

Title: "Veridical Data Science for biomedical research: subgroup discovery for precision medicine via staDISC"

Abstract: "A.I. is like nuclear energy -- both promising and dangerous" -- Bill Gates, 2019.

Data science is a pillar of A.I. and has driven many of the recent cutting-edge discoveries in biomedical research. Its practice follows the data science life cycle (DSLC), which spans problem formulation, data collection, and data cleaning through modeling, interpretation of data results, and conclusions. Human judgment calls are ubiquitous at every step of this cycle, for example, in the choice of data cleaning methods, predictive algorithms, and data perturbation schemes. These judgment calls, when based on solid prior knowledge, help realize the promises of A.I., but when not warranted, are often linked to the "dangers" of A.I. To maximally realize the promises and mitigate the dangers, the PCS framework is introduced for the DSLC, based on the three core principles of data science: predictability, computability, and stability (PCS). It uses predictability as a universal reality check and stability as a paramount robustness consideration against unwarranted human judgment calls, while putting computation at its center. It consists of a workflow and documentation (in R Markdown or Jupyter Notebook) for the DSLC. PCS unifies, streamlines, and expands on the best practices of machine learning and statistics, and it takes a step forward towards veridical data science for responsible data analysis and decision making. Employing the PCS framework in causal inference for precision medicine, we developed staDISC for the stable discovery of interpretable subgroups via calibration. We used staDISC to analyze data from VIGOR, a clinical trial of Merck's painkiller Vioxx, and found that the subgroups discovered by staDISC using the VIGOR data are validated to a good extent by APPROVe, another clinical trial on Vioxx.

This talk is based on the following two papers:
B. Yu and K. Kumbier (2020) Veridical data science, PNAS, 117 (8), 3920-3929. https://www.pnas.org/content/117/8/3920
R. Dwivedi*, Y. Tan*, B. Park, M. Wei, K. Horgan, D. Madigan*, B. Yu* (2020). Stable discovery of interpretable subgroups via calibration in causal studies (staDISC).  International Statistical Review. https://onlinelibrary.wiley.com/doi/10.1111/insr.12427

11:35 - 12:05  David Buckeridge (McGill University)

Title: "Global Surveillance of Epidemics and Control Measures"

Abstract: National public health organizations and the World Health Organization rely increasingly on digital media (i.e., online news media, social media) to detect and monitor infectious disease events. Also known as event-based surveillance (EBS), this approach is meant to complement indicator-based surveillance (e.g., using the daily counts of laboratory-confirmed infections). Despite the widespread adoption of EBS, its impact is limited by methodological challenges. I will review two challenges and present work towards solutions. The first challenge is detecting the implementation, or changes in the use, of non-pharmaceutical interventions from online news media. I will describe a machine learning approach to this challenge. The second challenge is tracking and intelligently summarizing the evolution of an epidemic from online news media. I will present an initial framework for addressing this challenge. (The machine learning approach is joint work with Yue Li from McGill.)

12:05 - 12:35  Rob Tibshirani (Stanford University)

Title: "Understanding Cross-Validation and Prediction Error"

Abstract: Cross-validation is a widely-used technique to estimate prediction accuracy. However, its properties are not that well understood. First, it is not clear exactly what form of prediction error is being estimated by cross-validation: one would like to think that cross-validation estimates the prediction error for the model and the data at hand. Surprisingly, we show here that this is not the case (at least for the special case of linear models) and derive the actual estimand(s). Second, the standard (naïve) confidence intervals for prediction accuracy that are derived from cross-validation may fail to cover at the nominal rate, because each data point is used for both training and testing, inducing correlations among the measured accuracy for each fold. As a result, the variance of the CV estimate of error is larger than suggested by naïve estimators, which leads to confidence intervals for prediction accuracy that can have coverage far below the desired level. We introduce a nested cross-validation scheme to estimate the standard error of the cross-validation estimate of prediction error, showing empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail. (This is joint work with Stephen Bates and Trevor Hastie.)
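The quantities at issue can be sketched in a few lines: a K-fold cross-validation estimate of squared prediction error for a linear model, together with the standard "naive" standard error that, as the abstract argues, understates the true variability. The simulated data and sizes are invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated regression data (dimensions and coefficients are illustrative).
n, p = 100, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)        # noise variance 1

# Standard K-fold cross-validation estimate of squared prediction error for
# least squares, plus the usual "naive" standard error, which treats the n
# held-out errors as independent even though each point is also used for
# training in the other folds.
K = 5
folds = np.array_split(rng.permutation(n), K)
sq_err = np.empty(n)
for idx in folds:
    mask = np.ones(n, dtype=bool)
    mask[idx] = False                    # train on the other K-1 folds
    bhat, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    sq_err[idx] = (y[idx] - X[idx] @ bhat) ** 2

cv_est = sq_err.mean()
naive_se = sq_err.std(ddof=1) / np.sqrt(n)
print(f"CV error: {cv_est:.2f} +/- {naive_se:.2f} (naive SE)")
```

The abstract's point is that this `naive_se` is too small because the per-point errors are correlated across folds; the proposed nested cross-validation scheme (not sketched here) estimates the standard error more faithfully.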

12:35 - 12:45  Q&A


12:45 - 13:00 Lunch Break


Session 3:  Statistical Problems in Imaging and Genetics:

13:00 - 13:25  Brian Caffo (Johns Hopkins Bloomberg School of Public Health)

Title: "Parsimony and fit in unsupervised learning and decompositions"

Abstract: In this talk, we consider exploiting parsimony in unsupervised learning, especially in the form of decompositions. We review historical efforts in this area and show recent extensions. We apply the methods in the modern setting of complex biological measurements, including brain structure and function as well as high-throughput genomics. We demonstrate that building parsimony assumptions into complex models increases their explainability and represents a good tradeoff relative to purely unstructured approaches. We apply our methods to functional MRI brain connectivity analysis, complex bio-signals, and computational biology.

13:25 - 13:50  Radu Craiu (University of Toronto)

Title: "General Behaviour of P-Values Under the Null and Alternative"

Abstract: Hypothesis testing results often rely on simple, yet important assumptions about the behaviour of the distribution of p-values under the null and alternative. We examine tests for one dimensional parameters of interest that converge to a normal distribution, possibly in the presence of many nuisance parameters, and characterize the distribution of the p-values using techniques from the higher order asymptotics literature. We show that commonly held beliefs regarding the distribution of p-values are misleading when the variance or location of the test statistic is not well-calibrated or when the higher order cumulants of the test statistic are not negligible. We further examine the impact of having these misleading p-values on reproducibility of scientific studies, with some examples focused on GWAS studies. Corrected tests are proposed and are shown to perform better than their traditional counterparts in various settings. This is joint work with Yanbo Tang and Lei Sun.

13:50 - 14:15  Linglong Kong (University of Alberta)

Title: "A Review of Statistical Methods in Imaging Genetics"

Abstract: Simultaneously extracting and integrating the rich and diverse heterogeneous information in neuroimaging and/or genomics from big datasets could transform our understanding of how genetic variants impact brain structure and function, cognitive function, and brain-related disease risk across the lifespan. Such understanding is critical for diagnosis, prevention, and treatment of numerous complex brain-related disorders (e.g., schizophrenia and Alzheimer's disease). However, the development of analytical methods for the joint analysis of both high-dimensional imaging phenotypes and high-dimensional genetic data, called big data squared (BD²), presents major computational and theoretical challenges for existing analytical methods. Besides the high-dimensional nature of BD², various neuroimaging measures often exhibit strong spatial smoothness and dependence, and genetic markers may have a natural dependence structure arising from linkage disequilibrium. We review some recent developments of various statistical techniques for the joint analysis of BD², including massive univariate and voxel-wise approaches, reduced rank regression, mixture models, and group sparse multi-task regression. By doing so, we hope that this review may encourage others in the statistical community to enter into this new and exciting field of research.

14:15 - 14:40  Marina Vannucci (Rice University)

Title: "Bayesian models for Microbiome Data with Variable Selection"

Abstract: I will describe Bayesian models developed for understanding how the microbiome varies within a population of interest. I will focus on integrative analyses, where the goal is to combine microbiome data with other available information (e.g., dietary patterns) to identify significant associations between taxa and a set of predictors. For this, I will describe a general class of hierarchical Dirichlet-Multinomial (DM) regression models which use spike-and-slab priors for the selection of the significant associations. I will also describe data augmentation techniques to efficiently embed DM regression models into joint modeling frameworks, in order to investigate how the microbiome may affect the relation between dietary factors and phenotypic responses, such as body mass index. I will discuss advantages and limitations of the proposed methods with respect to current standard approaches used in the microbiome community, and will present results on the analysis of real datasets.

14:40 - 14:45  Q&A


Session 4:  Methods for Electronic Health Records (EHR) Data:

15:00 - 15:25  Rebecca Hubbard (University of Pennsylvania)

Title: "Expediting longitudinal cohort studies using electronic health records with Bayesian clustering and trajectory analysis"

Abstract: Use of routinely collected data from electronic health records (EHR) can expedite longitudinal studies that require exposure and outcome data separated by years or decades. For instance, characteristics of the body mass index (BMI) trajectory early in life may be associated with subsequent development of type 2 diabetes. Past studies investigating such relationships have used longitudinal cohort data collected over the course of many years. This type of study is slow, costly, and may have limited generalizability. In contrast, EHR data from routine clinical care can provide longitudinal information on a broad patient population without requiring any additional or specialized data collection. Nonetheless, using EHR data to conduct longitudinal observational research necessitates addressing several challenges including imperfect ascertainment of outcomes in EHR data and inconsistent assessment of exposures. We developed a Bayesian joint phenotyping and trajectory estimation model to address these challenges and demonstrate its performance relative to a standard rule-based phenotyping approach. We found that children with the highest early-life BMI had significantly increased odds of type 2 diabetes in adolescence (odds ratio = 6.2, 95% credible interval 4.4-8.2). These results demonstrate that EHR data coupled with appropriate methodologic approaches can improve efficiency and timeliness of longitudinal observational studies.

15:25 - 15:50  Marc Suchard (UCLA)

Title: "Reliable evidence at scale from administrative claims and EHR data sources"

Abstract: Concerns over reproducibility in science extend to research using existing healthcare data; many observational studies investigating the same topic produce conflicting results, even when using the same data. To address this problem, we propose a paradigm shift. The current paradigm centers on generating one estimate at a time using a unique study design with unknown reliability and publishing (or not) one estimate at a time. The new paradigm advocates for high-throughput observational studies using consistent and standardized methods, allowing evaluation, calibration, and unbiased dissemination to generate a more reliable and complete evidence base.  We demonstrate this new paradigm by comparing all hypertension treatments for a set of effectiveness and safety outcomes, producing 587,020 hazard ratios, each using methodology on par with state-of-the-art studies. We furthermore include control hypotheses to evaluate and calibrate our evidence generation process. Results agree with the limited number of randomized trials.  The distribution of effect size estimates reported in literature reveals an absence of small or null effects, with a sharp cutoff at p = 0.05. No such phenomena were observed in our results, suggesting more complete and more reliable evidence.

15:50 - 16:15  Eleanor Pullenayegum (Hospital for Sick Children, and the University of Toronto Dalla Lana School of Public Health)

Title: "Longitudinal studies using electronic health records: handling irregular observation"

Abstract: Longitudinal data collected as part of usual healthcare delivery are becoming increasingly available for research. However, a common feature of these data is that they are collected more frequently when patients are unwell. For example, newborns who are slow to regain their birthweight will require more frequent monitoring and will consequently have more weight measurements than their typically growing counterparts. Failing to account for this would lead to underestimation of the rate of growth of the population of newborns as a whole. I will argue that irregular observation should be handled with the same care with which we handle missing data: careful examination of the extent of irregularity, consideration of the reasons for irregularity, and use of analytic approaches that account for the potentially informative nature of the observation process.
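The newborn-weight example in the abstract can be mimicked with a toy simulation: if slower-growing (lower-weight) infants are measured more often, the raw mean of observed weights is biased downward, while weighting each observation by the inverse of its visit probability recovers the population mean. This is a simplified inverse-intensity-weighting illustration under made-up numbers, not the speaker's analysis; it also assumes the visit probabilities are known, whereas in practice they must be modeled.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population of newborn weights (arbitrary units).
n = 100_000
y = rng.normal(loc=100.0, scale=10.0, size=n)

# Informative observation: lower-weight infants are seen more often.
p_visit = np.clip(0.9 - 0.03 * (y - 100.0), 0.05, 0.95)
observed = rng.random(n) < p_visit

# Naive mean over observed visits oversamples low-weight infants.
naive = y[observed].mean()

# Inverse-intensity weighting: weight each observation by 1 / P(visit).
iiw = np.average(y[observed], weights=1.0 / p_visit[observed])

print(round(naive, 2), round(iiw, 2))  # naive sits below 100; IIW is close to 100
```

The same logic underlies inverse-intensity-of-visit weighting for longitudinal EHR analyses, where the visit intensity is typically estimated from a model of the observation process rather than known.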

16:15 - 16:40  Sherri Rose (Stanford University)

Title: "Identifying Undercompensated Groups Defined by Multiple Attributes in Risk Adjustment"

Abstract: Risk adjustment for health plan payment is known to undercompensate insurers for certain groups of enrollees defined by single attributes (e.g., individuals with mental health and substance use disorders). This undercompensation creates incentives for insurers to discriminate against these groups. To date, no systematic method exists for identifying marginalized groups in risk adjustment, or more broadly in the algorithmic fairness literature, including groups defined by a more complex set of attributes. Extending the concept of variable importance, we construct a new measure of “group importance” in the random forests algorithm to identify groups defined by multiple attributes. Applying our tool in both the health insurance Marketplaces and Medicare Advantage, we find previously unknown undercompensated groups. This work provides policy makers with a tool to uncover incentives for selection in insurance markets and a path towards more equitable health coverage. (Joint work with Anna Zink, Harvard University).
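To make the idea of extending variable importance to groups concrete, the sketch below computes a permutation-style "group importance" for a random forest: the accuracy drop when all columns in a group are permuted together, so that jointly informative attributes register as important even if each is weak alone. This is a simplified analogue on synthetic data, not the measure or implementation from the talk.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Hypothetical data: the outcome depends jointly on features 0 and 1
# (a "group" of attributes); feature 2 is pure noise.
n = 2000
X = rng.normal(size=(n, 3))
y = ((X[:, 0] + X[:, 1]) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def group_importance(model, X, y, group, n_rep=10, rng=rng):
    """Accuracy drop when all columns in `group` are permuted together,
    averaged over n_rep random permutations."""
    base = model.score(X, y)
    drops = []
    for _ in range(n_rep):
        Xp = X.copy()
        perm = rng.permutation(len(X))
        Xp[:, group] = Xp[perm][:, group]  # permute the whole group jointly
        drops.append(base - model.score(Xp, y))
    return float(np.mean(drops))

print(group_importance(rf, X, y, [0, 1]))  # large drop: the group drives y
print(group_importance(rf, X, y, [2]))     # small drop: noise feature
```

Permuting the group's columns with a single shared permutation preserves the dependence structure within the group while breaking its link to the outcome, which is what distinguishes group importance from summing single-variable importances.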

16:40 - 16:45  Q&A

16:45 - 17:00  Closing Remarks