Synthetic Longitudinal Business Database

Research Project


The Longitudinal Business Database (LBD) is a census of business establishments in the U.S. with paid employees that was developed by the Center for Economic Studies at the U.S. Bureau of the Census. It supports an active research agenda focusing on business entry and exit, gross employment flows, employment volatility, industrial organization, and other topics that cannot be adequately addressed without establishment-level data. As the database contains confidential payroll and employment data that are protected by law under Title 13 and Title 26, gaining research access to the LBD is nontrivial. The Synthetic Longitudinal Business Database, or SynLBD project, aims to produce a useful public-use dataset analogous to those available for various demographic and health surveys.
Typically, public-use datasets are modified to protect confidentiality, while still allowing for many valid inferences. Longitudinal business census data present an unusual combination of statistical disclosure control challenges. Business data are at greater risk of confidentiality breaches because many establishments are highly recognizable and can be identified from just a few variables. Longitudinal structure provides an additional characteristic that can aid in re-identification, and in fact, longitudinal public-use data are nearly as rare as public-use establishment data. Lastly, as the LBD is a census, there is no protection arising from sampling uncertainty. For these reasons, any public-use version of the LBD must be substantially altered to protect against disclosures. No actual values of continuous variables are permitted to be released, ruling out the use of methods such as data swapping and data shuffling. Other common methods, such as rounding or coarsening, would not provide an adequate balance between disclosure protection and analytic validity; however, synthetic data methods allow a great deal of modification to the data while still allowing for a broad range of analyses.

Synthetic data methods replace confidential values with multiple imputations. These are values drawn from the posterior predictive distribution of the data being altered. The SynLBD is considered partially synthetic data because the industry and geography are not synthesized, due to their being present in the public County Business Patterns. All other variables, including firm links and over 30 years of employment and payroll, are replaced with imputations. A disadvantage of synthetic data methods is that they can be difficult to implement for complex datasets. Hence the SynLBD project is part of an active research community developing repeatable methods for generating synthetic data.
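As a rough illustration of drawing imputations from a posterior predictive distribution, the sketch below synthesizes one continuous variable with a normal linear regression under a flat prior. It is not the SynLBD methodology: the model, the function name `synthesize`, and the toy variables `industry` and `log_emp` are all assumptions made for this example, and the real synthesis models may look quite different. The predictors stand in for the unsynthesized variables (such as industry), and the outcome stands in for a confidential value.

```python
import numpy as np

def synthesize(X, y, m=5, seed=0):
    """Return m synthetic copies of y drawn from the posterior predictive
    distribution of a normal linear model with a flat prior.
    Illustrative only; not the SynLBD synthesis model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y          # least-squares estimate
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)          # residual variance estimate
    copies = []
    for _ in range(m):
        # sigma^2 | y  ~  scaled inverse chi-square(n - p, s2)
        sigma2 = (n - p) * s2 / rng.chisquare(n - p)
        # beta | sigma^2, y  ~  N(beta_hat, sigma^2 (X'X)^-1)
        beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
        # y_syn | beta, sigma^2  ~  N(X beta, sigma^2 I)
        copies.append(X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n))
    return copies

# Toy data (hypothetical): an unsynthesized industry code and a
# confidential log-employment value for 200 establishments.
rng = np.random.default_rng(1)
industry = rng.integers(0, 3, size=200)
X = np.column_stack([np.ones(200), industry == 1, industry == 2]).astype(float)
log_emp = 2.0 + 0.5 * X[:, 1] - 0.3 * X[:, 2] + rng.normal(0.0, 0.4, size=200)

# Each copy replaces every confidential value with a fresh draw.
synthetic = synthesize(X, log_emp, m=3)
```

Because the parameters are redrawn for each copy, the differences between the copies reflect uncertainty about the model as well as noise in individual values, which is what lets an analyst combine results across the released datasets.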
An initial version of the SynLBD is available on the Census Bureau website at www.census.gov/ces/dataproducts/synlbd/. An enhanced version providing many additional requested features is nearing completion.

Project Goal: 

The Synthetic Longitudinal Business Database, or SynLBD project, aims to produce a useful public-use dataset analogous to those available for various demographic and health surveys.

Research Team: 

Current team members include Saki Kinney (NISS), Jerry Reiter (Duke), and Javier Miranda (Census). Phase 1 also included Ron Jarmin and Arnold Reznek (Census) and John Abowd (Cornell).

Individual Team Members: 
Satkartar (Saki) Kinney
Jerome P. Reiter

Funding Sponsors:

Research Presentations: 
“SynLBD: Providing firm characteristics for synthetic establishment data.” World Statistics Congress, Hong Kong, China. August 2013.
“Synthetic Data Generation for the Longitudinal Business Database.” International Workshop/Conference on Bayesian Theory and Applications, Varanasi, India. January 2013.
“Synthetic Longitudinal Business Data.” International Conference on Advances in Interdisciplinary Statistics and Combinatorics, Greensboro, NC. October 2012.
“Towards Unrestricted Public-Use Business Microdata: Construction of the Synthetic Longitudinal Business Database.” International Conference on Establishment Surveys, Montreal, Canada. June 2012.
“Making public-use synthetic files of the Longitudinal Business Database.” Privacy in Statistical Databases, Istanbul, Turkey. September 2008.
“The Synthetic Longitudinal Business Database.” Joint Statistical Meetings, Washington, DC. August 2009.
“Generating synthetic public-use files for the Longitudinal Business Database.” Census Research Data Centers Annual Research Conference, Washington, DC. September 2007.
“Making public-use synthetic files of the Longitudinal Business Database.” Joint Statistical Meetings, Salt Lake City, UT. July 2007.
“Making public-use, synthetic files of longitudinal establishment data.” International Comparative Analysis of Enterprise (micro) Data Conference, Chicago, IL. September 2006.
“SynLBD: Generation of the Synthetic Longitudinal Business Database.” MITRE Synthetic Data Technical Exchange Meeting, McLean, VA. July 2014.
“The Synthetic Longitudinal Business Database.” Joint Census-NSF-IRS Workshop on Synthetic Data, Suitland, MD. July 2009.
“The Synthetic Longitudinal Business Database: Analytic Validity and Confidentiality Protection.” Census Advisory Committee of Professional Associations Meeting, Suitland, MD. October 2008.
“Producing and protecting synthetic business microdata: The Longitudinal Business Database synthetic beta file.” Internal Revenue Service seminar, Washington, DC.