The Longitudinal Business Database (LBD) is a census of business establishments in the U.S. with paid employees that was developed by the Center for Economic Studies at the U.S. Bureau of the Census. It supports an active research agenda focusing on business entry and exit, gross employment flows, employment volatility, industrial organization, and other topics that cannot be adequately addressed with establishment-level data. As the database contains confidential payroll and employment data that are protected by law under Title 13 and Title 26, gaining research access to the LBD is nontrivial. The Synthetic Longitudinal Database, or SynLBD project, aims to produce a useful public-use dataset analogous to those available for various demographic and health surveys.
Typically, public-use datasets are modified to protect confidentiality, while still allowing for many valid inferences. Longitudinal business census data present an unusual combination of statistical disclosure control challenges. Business data are at a greater risk of confidentiality breaches because many establishments are highly recognizable and can be identified from just a few variables. Longitudinal structures provide an additional characteristic that can aid in re-identification, and in fact, longitudinal public-use data are nearly as rare as public-use establishment data. Lastly, as the LBD is a census, there is no protection arising from sampling uncertainty. For these reasons, any public-use version of the LBD must be substantially altered to protect against disclosures. No actual values of continuous variables are permitted to be released, ruling out the use of methods such as data swapping and data shuffling. Other common methods, such as rounding or coarsening, would not provide an adequate balance between disclosure protection and analytic validity; however, synthetic data methods allow a great deal of modification to the data while still allowing for a range of broad analyses.
Synthetic data methods replace confidential values with multiple imputations. These are values drawn from the posterior predictive distribution of the data being altered. The SynLBD is considered partially synthetic data because the industry and geography are not synthesized, due to their being present in the public County Business Patterns. All other variables, including firm links and over 30 years of employment and payroll, are replaced with imputations. A disadvantage of synthetic data methods is that they can be difficult to implement for complex datasets. Hence the SynLBD project is part of an active research community developing repeatable methods for generating synthetic data.
An initial version of the SynLBD is available on the Census Bureau website at www.census.gov/ces/dataproducts/synlbd/. An enhanced version providing many additional requested features is nearing completion.
The Synthetic Longitudinal Database, or SynLBD project, aims to produce a useful public-use dataset analogous to those available for various demographic and health surveys.
Current team members include Saki Kinney (NISS), Jerry Reiter (Duke), and Javier Miranda (Census). Phase 1 also included Ron Jarmin and Arnold Reznek (Census) and John Abowd (Cornell).