Using Surveys to Improve the Representativeness of Nonprobability Samples in Epidemiologic Studies

March 11, 2019 1:00 pm – 4:30 pm


The National Institute of Statistical Sciences (NISS) has invited Yan Li of the University of Maryland at College Park to organize a half-day workshop on "Using Surveys to Improve the Representativeness of Nonprobability Samples in Epidemiologic Studies."

The workshop is an expansion of her popular JSM 2018 session with the same title. (Review the JSM Program Session Information.) See the list of workshop speakers below.
The workshop should be of interest to epidemiologists, survey statisticians, and others grappling with the theoretical and practical issues of drawing statistical inference from nonrandom epidemiological samples. Input from the attendees is not only welcome but encouraged. Statisticians and epidemiologists from all sectors (government, academia, and industry) are encouraged to participate.

Workshop Capacity: Limited to the first 62 registrants. BLS Conference Center, Rooms 9 & 10.


You can register for this workshop using the 'Register for this Event' link on the upper right of this web page.



National surveys collect probability samples that are population representative and allow valid inference for finite population parameters. Nonprobability samples are often collected for convenience or cost efficiency. The results drawn from analyzing nonprobability samples, however, can suffer from a lack of external validity, selection bias, and noncoverage of a target population, leading to biased estimation of parameters such as prevalence, incidence, and associations. In this workshop, we will present a combination of innovative statistical methods and applications on improving the representativeness of nonprobability samples in epidemiologic studies.

A 10-minute overview will be followed by four talks of approximately 30 minutes each, with plenty of time allocated for questions and discussion between the speakers and the audience.

Presenters' Abstracts

Overview of issues surrounding the role of representative sampling in epidemiologic studies

Hormuzd Katki - Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology & Genetics, National Cancer Institute

Few issues have engendered more disagreement between epidemiologists and statisticians than the role of representative sampling in the design of epidemiologic studies. Both sides agree that representativeness is critical for estimating prevalences of risk factors and diseases. However, representativeness is much less important for estimating odds ratios for association of disease and risk factors. I discuss the conceptual and practical issues involved, considering the views of both sides of the debate. I suggest that the best compromise is to improve the representativeness at the analysis stage. This will involve a fruitful collaboration between epidemiologists, biostatisticians, and survey statisticians. I suggest some important questions whose answers could help make this line of research most practically useful.

Population-based disease risk prediction modeling using national survey, clinical, and registry data: Application to risk prediction for oropharyngeal cancer in the US population

Barry Graubard - Senior Investigator, Biostatistics Branch, Division of Cancer Epidemiology & Genetics, National Cancer Institute

Population-based case-control studies (PBCCS) are a cornerstone of epidemiology for obtaining nearly unbiased estimates of association of exposures and rare diseases for targeted regional populations. This paper discusses how to combine a clinically-based case series, a national survey as a source of controls, and a national disease surveillance system to calibrate the case series to form a national PBCCS that is used to estimate risk of disease for the population. Because the cases are a non-representative sample, propensity weighting is developed using the disease surveillance system. The risk model is evaluated using cross-validation, area under the ROC curve, and quantification of risk stratification that take account of the propensity weighting and the sample design of the survey. These methods are illustrated for developing a prediction model of oropharyngeal cancer in the US population. The cases come from a clinical setting at Ohio State University, the population controls are from the National Health Examination Survey, and the National Cancer Institute Surveillance, Epidemiology, and End Results program is used to calibrate the cases. Methods for the model evaluation and variance estimation are discussed.

Combining Probability and Non-probability Samples: Theory and Practice

Michael Elliott - Professor of Biostatistics; Research Professor of Survey Methodology, Survey Methodology Program, University of Michigan, Ann Arbor, MI

Although probability sample designs remain a “gold standard” in survey research, demand for use of non-probability samples is increasing, due to, among other reasons, rising costs and falling response rates in probability samples and the availability of “big data” from administrative databases, social media users, and other sources. Design-based inference, in which the distribution for inference is generated by the random mechanism used by the sampler, cannot be used for non-probability samples. One alternative is quasi-randomization, in which pseudo-inclusion probabilities are estimated based on covariates available for sample and nonsample units. Another is superpopulation modeling for the analytic variables collected on the sample units, in which the model is used to predict values for the nonsample units. A third alternative is a model-assisted approach in which probability samples are used to develop calibration estimators. We will give an overview of these approaches and discuss their unique advantages in different analytic and application settings.
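
As a rough illustration of the quasi-randomization idea mentioned in this abstract, the sketch below estimates pseudo-inclusion probabilities within cells of a single categorical covariate. It is a deliberately simplified toy version and not the speaker's method: the function names, the cell labels, and all data values are hypothetical, and a real analysis would typically model the probabilities on many covariates.

```python
from collections import defaultdict


def pseudo_weights_by_cell(nonprob_cells, ref_cells, ref_weights):
    """Return one pseudo-weight per non-probability sample unit.

    Within each covariate cell, the pseudo-inclusion probability is
    (units in the non-probability sample) / (estimated population size),
    where the population size is the sum of reference-survey weights.
    The pseudo-weight is the inverse of that probability.
    """
    # Estimated population count per cell from the weighted reference survey.
    pop_by_cell = defaultdict(float)
    for cell, weight in zip(ref_cells, ref_weights):
        pop_by_cell[cell] += weight

    # Observed count per cell in the non-probability sample.
    n_by_cell = defaultdict(int)
    for cell in nonprob_cells:
        n_by_cell[cell] += 1

    weights = []
    for cell in nonprob_cells:
        if cell not in pop_by_cell:
            raise ValueError(f"cell {cell!r} is absent from the reference survey")
        # Inverse of the pseudo-inclusion probability n / N_hat.
        weights.append(pop_by_cell[cell] / n_by_cell[cell])
    return weights


def weighted_mean(values, weights):
    """Weighted mean of values (a weighted prevalence for 0/1 outcomes)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: the convenience sample over-represents "young".
nonprob_cells = ["young", "young", "young", "old"]
nonprob_outcome = [0, 0, 1, 1]          # 1 = has the condition
ref_cells = ["young", "old", "old"]     # reference probability survey
ref_weights = [300.0, 350.0, 350.0]     # survey weights (population counts)

weights = pseudo_weights_by_cell(nonprob_cells, ref_cells, ref_weights)
naive = sum(nonprob_outcome) / len(nonprob_outcome)
adjusted = weighted_mean(nonprob_outcome, weights)
```

In this toy example the unweighted prevalence is 0.5, while the pseudo-weighted estimate is 0.8, because the under-sampled "old" cell receives a larger weight.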

A Kernel Weighting Approach to Improve Population Representativeness for Estimating Prevalence of Risk-factors and Diseases

Yan Li - Associate Professor, Joint Program for Survey Methodology, University of Maryland, College Park, MD

Epidemiological (EPI) cohorts are often collected from convenience samples and suffer from selection bias and coverage issues. Accordingly, the analyses lack external validity, leading to biased point estimation and invalid variance estimation. We develop an efficient kernel weighting approach that treats a survey sample as a reference to create pseudo weights for the cohort by utilizing propensity scores and kernel smoothing techniques. The proposed jackknife (JK) variance estimator, in addition to accounting for the correlation induced by the homogeneity of the participants from the same study center, considers the variability due to estimating propensity scores. Monte Carlo simulation studies comparing the existing inverse propensity score (IPS) estimators with the proposed kernel-weighted estimators show that the kernel-weighted estimators reduce the bias and increase the efficiency of the estimated disease prevalence. The developed approach is further demonstrated using the National Health Interview Survey and the Prostate, Lung, Colorectal, and Ovarian cohort to estimate the prevalence of emphysema.
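
The abstract does not spell out the estimator, so the following is only one plausible sketch of how kernel smoothing over propensity scores could turn survey weights into cohort pseudo weights. It assumes the propensity scores have already been estimated, uses a Gaussian kernel and a fixed bandwidth chosen arbitrarily, and all names and numbers are hypothetical. Each survey unit's weight is shared among cohort members in proportion to how close their propensity scores are.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the constant cancels in the ratios."""
    return math.exp(-0.5 * u * u)


def kernel_pseudo_weights(cohort_scores, survey_scores, survey_weights, bandwidth):
    """Distribute each survey weight over cohort units by score similarity.

    For survey unit j with weight d_j and score p_j, cohort unit i receives
    d_j * K((p_i - p_j) / h) / sum_k K((p_k - p_j) / h).
    The pseudo weights therefore sum to the total of the survey weights.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")

    pseudo = [0.0] * len(cohort_scores)
    for p_j, d_j in zip(survey_scores, survey_weights):
        kernels = [gaussian_kernel((p_i - p_j) / bandwidth) for p_i in cohort_scores]
        total = sum(kernels)
        if total == 0.0:
            raise ValueError("no cohort unit is close enough; increase the bandwidth")
        for i, k_i in enumerate(kernels):
            pseudo[i] += d_j * k_i / total
    return pseudo


# Hypothetical, pre-estimated propensity scores and survey weights.
cohort_scores = [0.2, 0.5, 0.8]
cohort_outcome = [0, 1, 1]              # 1 = has the condition
survey_scores = [0.3, 0.7]
survey_weights = [400.0, 600.0]

pseudo = kernel_pseudo_weights(cohort_scores, survey_scores, survey_weights, bandwidth=0.1)

# Pseudo-weighted prevalence estimate for the cohort outcome.
prevalence = sum(w * y for w, y in zip(pseudo, cohort_outcome)) / sum(pseudo)
```

A useful sanity check under these assumptions is that the pseudo weights add up to the sum of the survey weights (1000 here), so the weighted cohort stands in for the same population the survey represents. Variance estimation, such as the jackknife described in the abstract, is outside the scope of this sketch.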

Improving External Validity of Association Estimation Using the Kernel Weighting Method

Lingxiao Wan - Joint Program for Survey Methodology, University of Maryland, College Park, MD; Predoctoral fellow, National Cancer Institute

In epidemiology, the association between risk factors and diseases is an important subject of study. There is fractious debate about the value of population-representative samples for the external validity of association estimation. Some literature argues that lack of representativeness may not lead to large bias in estimates of association when the confounders are appropriately controlled. Others argue that a representative sample is necessary. When few variables are available or the analysis model is misspecified, controlling for confounders may not substantially reduce the bias if the sample is non-representative. In this talk, I will present how a non-representative sample may bias the estimates of association in the population, and whether the kernel weighting method can help improve the external validity of the association estimation.

Bureau of Labor Statistics


National Institute of Statistical Sciences


Bureau of Labor Statistics Conference Center, Rooms 9 & 10
2 Massachusetts Ave NE
Please use 1st Street entrance!
District of Columbia
United States