The USDA’s National Agricultural Statistics Service (NASS) conducts the U.S. Census of Agriculture in years ending in 2 and 7. Population estimates from the census are adjusted for under-coverage, non-response, and misclassification, and calibrated to known population totals. These adjustments are reflected in weights attached to each responding unit. Calculating these weights has been a two-part procedure: first, one calculates initial (Dual System Estimation, or DSE) weights that account for under-coverage, non-response, and misclassification; second, calibration adjusts the weights by forcing the weighted estimates obtained in the first step to match known population totals. Recently, a calibration algorithm, Integer Calibration

(INCA), was developed to produce integer calibrated weights as required in NASS publications. This paper considers combining the two steps of calculating weights into one. This new algorithm is based on a regularized constrained dual system estimation methodology, which combines capture-recapture and calibration (CaRC).
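The generic calibration idea in the second step above can be sketched in miniature. The snippet below is an illustration on hypothetical data using simple iterative raking; `rake_weights` is an illustrative name, and nothing here reproduces NASS's DSE, INCA, or CaRC algorithms (in particular, it does not produce integer weights):

```python
import numpy as np

def rake_weights(w, x, totals, iters=100, tol=1e-8):
    """Iteratively scale initial weights w so that the weighted counts of
    the 0/1 membership columns in x match the known population totals.
    x: (n_units, n_groups) indicator matrix; totals: known group totals."""
    w = w.astype(float).copy()
    for _ in range(iters):
        before = w.copy()
        for j in range(x.shape[1]):
            members = x[:, j] == 1
            current = w[members].sum()
            if current > 0:
                w[members] *= totals[j] / current   # force group j to match
        if np.max(np.abs(w - before)) < tol:        # converged
            break
    return w

# Hypothetical example: 4 respondents, two overlapping groups.
w0 = np.array([1.0, 1.0, 1.0, 1.0])      # initial (e.g., DSE-style) weights
x = np.array([[1, 0], [1, 0], [0, 1], [1, 1]])
totals = np.array([9.0, 6.0])            # known population totals
w = rake_weights(w0, x, totals)
```

After convergence, the weighted group counts reproduce both known totals simultaneously, which is the calibration constraint the abstract describes.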

Download: https://www.niss.org/sites/default/files/Toppin_CaRC_20170926.pdf

%0 Conference Paper %B 2015 FCSM Research Conference %D 2018 %T A design-sensitive approach to fitting regression models with complex survey data %A Phillip Kott %K design-based %K extended model %K generalized cumulative logistic model %K proportional-odds model %K Pseudo-maximum likelihood %XFitting complex survey data to regression equations is explored under a design-sensitive model-based framework. A robust version of the standard model assumes that the expected value of the difference between the dependent variable and its model-based prediction is zero no matter what the values of the explanatory variables are. The extended model assumes only that the difference is uncorrelated with the covariates. Little is assumed about the error structure of this difference under either model other than independence across primary sampling units. The standard model often fails in practice, but the extended model very rarely does. Under this framework, some of the methods developed in the conventional design-based, pseudo-maximum-likelihood framework, such as fitting weighted estimating equations and sandwich mean-squared-error estimation, are retained but their interpretations change. Few of the ideas here are new to the refereed literature. The goal instead is to collect those ideas and put them into a unified conceptual framework.
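As a rough illustration of the weighted-estimating-equation and sandwich-variance machinery the abstract refers to, here is the simplest linear case with a PSU-clustered sandwich estimator; the function name and the synthetic setup are assumptions for the sketch, not from the paper:

```python
import numpy as np

def weighted_ols_sandwich(y, X, w, psu):
    """Survey-weighted linear regression with a PSU-clustered sandwich
    covariance estimate. y: response; X: (n, p) design matrix;
    w: survey weights; psu: primary-sampling-unit labels."""
    W = np.diag(w)
    bread = np.linalg.inv(X.T @ W @ X)          # "bread" of the sandwich
    beta = bread @ (X.T @ (w * y))              # weighted estimating eqns
    resid = y - X @ beta
    # Sum weighted score contributions within each primary sampling unit,
    # relying only on independence across PSUs.
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(psu):
        m = psu == g
        s = (w[m] * resid[m]) @ X[m]            # cluster-level score
        meat += np.outer(s, s)
    cov = bread @ meat @ bread                  # sandwich covariance
    return beta, cov
```

With a perfect linear fit the residuals, and hence the sandwich covariance, collapse to zero, which makes the estimator easy to sanity-check.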

%B 2015 FCSM Research Conference %I Statistics Surveys %@ 1935-7516 %G eng %U https://projecteuclid.org/euclid.ssu/1516179619 %L Vol. 12 (2018) 1–17 %0 Conference Proceedings %B JSM Proceedings. Survey Research Methods Section. Alexandria, VA: American Statistical Association. %D 2017 %T Developments in Model-Based County Level Estimation of Agricultural Cash Rental Rates %A Bellow, Michael E. %A Cruze, Nathan %A Erciulescu, Andreea L. %B JSM Proceedings. Survey Research Methods Section. Alexandria, VA: American Statistical Association. %G eng %U https://www.niss.org/sites/default/files/2017%20-%20Developments%20in%20Model-Based%20County-Level%20Estimation%20of%20Ag%20Cash%20Rental%20Rates.pdf %& 2773 - 2790 %0 Journal Article %J ICES V Proceedings. Alexandria, VA: American Statistical Association %D 2016 %T Developments in Model-Based Estimation of County-Level Agricultural Estimates %A Cruze N.B. %A Erciulescu A.L. %A Nandram B. %A Barboza W.J. %A Young L.J. %B ICES V Proceedings. Alexandria, VA: American Statistical Association %G eng %U http://ww2.amstat.org/meetings/ices/2016/proceedings/131_ices15Final00229.pdf %0 Journal Article %J Molecular and Cellular Proteomics %D 2013 %T Design, Implementation and Multisite Evaluation of a System Suitability Protocol for the Quantitative Assessment of Instrument Performance in Liquid Chromatography-Multiple Reaction Monitoring-MS (LC-MRM-MS) %A Abbatiello, S. %A Feng, X. %A Sedransk, N. 
%A Mani, DR %A Schilling, B %A Maclean, B %A Zimmerman, LJ %A Cusack, MP %A Hall, SC %A Addona, T %A Allen, S %A Dodder, NG %A Ghosh, M %A Held, JM %A Hedrick, V %A Inerowicz, HD %A Jackson, A %A Keshishian, H %A Kim, JW %A Lyssand, JS %A Riley, CP %A Rudnick, P %A Sadowski, P %A Shaddox, K %A Smith, D %A Tomazela, D %A Wahlander, A %A Waldemarson, S %A Whitwell, CA %A You, J %A Zhang, S %A Kinsinger, CR %A Mesri, M %A Rodriguez, H %A Borchers, CH %A Buck, C %A Fisher, SJ %A Gibson, BW %A Liebler, D %A Maccoss, M %A Neubert, TA %A Paulovich, A %A Regnier, F %A Skates, SJ %A Tempst, P %A Wang, M %A Carr, SA %XMultiple reaction monitoring (MRM) mass spectrometry coupled with stable isotope dilution (SID) and liquid chromatography (LC) is increasingly used in biological and clinical studies for precise and reproducible quantification of peptides and proteins in complex sample matrices. Robust LC-SID-MRM-MS-based assays that can be replicated across laboratories and ultimately in clinical laboratory settings require standardized protocols to demonstrate that the analysis platforms are performing adequately. We developed a system suitability protocol (SSP), which employs a predigested mixture of six proteins, to facilitate performance evaluation of LC-SID-MRM-MS instrument platforms, configured with nanoflow-LC systems interfaced to triple quadrupole mass spectrometers. The SSP was designed for use with low multiplex analyses as well as high multiplex approaches when software-driven scheduling of data acquisition is required. Performance was assessed by monitoring of a range of chromatographic and mass spectrometric metrics including peak width, chromatographic resolution, peak capacity, and the variability in peak area and analyte retention time (RT) stability. The SSP, which was evaluated in 11 laboratories on a total of 15 different instruments, enabled early diagnoses of LC and MS anomalies that indicated suboptimal LC-MRM-MS performance. 
The observed range in variation of each of the metrics scrutinized serves to define the criteria for optimized LC-SID-MRM-MS platforms for routine use, with pass/fail criteria for system suitability performance measures defined as peak area coefficient of variation <0.15, peak width coefficient of variation <0.15, standard deviation of RT <0.15 min (9 s), and RT drift <0.5 min (30 s). The deleterious effect of a marginally performing LC-SID-MRM-MS system on the limit of quantification (LOQ) in targeted quantitative assays illustrates the use of and need for an SSP to establish robust and reliable system performance. Use of an SSP helps to ensure that analyte quantification measurements can be replicated with good precision within and across multiple laboratories and should facilitate more widespread use of MRM-MS technology by the basic biomedical and clinical laboratory research communities.
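The quoted pass/fail thresholds amount to a simple rule set, sketched below; `ssp_pass` is a hypothetical helper written for illustration, not part of the published SSP software:

```python
def ssp_pass(peak_area_cv, peak_width_cv, rt_sd_min, rt_drift_min):
    """Apply the pass/fail thresholds quoted in the abstract:
    peak area CV < 0.15, peak width CV < 0.15,
    RT standard deviation < 0.15 min, RT drift < 0.5 min.
    Returns per-metric results plus an overall verdict."""
    checks = {
        "peak_area_cv": peak_area_cv < 0.15,
        "peak_width_cv": peak_width_cv < 0.15,
        "rt_sd": rt_sd_min < 0.15,
        "rt_drift": rt_drift_min < 0.5,
    }
    checks["overall"] = all(checks.values())
    return checks
```

A system passes only when every metric clears its threshold; a single marginal metric (e.g., peak area CV of 0.20) fails the whole suitability check.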

%B Molecular and Cellular Proteomics %V 12 %P 2623-2639 %G eng %R 10.1074/mcp.M112.027078 %0 Journal Article %J Journal of Official Statistics %D 2013 %T Discussion of five papers on “Systems and architectures for high-quality statistics production” %A A. F. Karr %B Journal of Official Statistics %V 29 %P 157-163 %G eng %0 Journal Article %J Statistics, Politics and Policy %D 2012 %T Data, Statistics and Controversy: Making Scientific Data Intelligible %A Sedransk, N. %A Young, L. %A Spiegelman, C. %K data availability %K Daubert rule %K inference verification %K meta-data %K proprietary data %K publication bias %K reuse of data %K secondary analysis %K synthetic data %XMaking published, scientific research data publicly available can benefit scientists and policy makers only if there is sufficient information for these data to be intelligible. Thus the necessary meta-data go beyond the scientific, technological detail and extend to the statistical approach and methodologies applied to these data. The statistical principles that give integrity to researchers’ analyses and interpretations of their data require documentation. This is true when the intent is to verify or validate the published research findings; it is equally true when the intent is to utilize the scientific data in conjunction with other data or new experimental data to explore complex questions; and it is profoundly important when the scientific results and interpretations are taken outside the world of science to establish a basis for policy, for legal precedent or for decision-making. When research draws on already public databases, e.g., a large federal statistical database or a large scientific database, selection of data for analysis, whether by subsampling or by aggregating, is specific to that research, so that this (statistical) methodology is a crucial part of the meta-data.
Examples illustrate the role of statistical meta-data in the use and reuse of these public datasets and the impact on public policy and precedent.

%B Statistics, Politics and Policy %V 3 %P 1-20 %G eng %R 10.1515/2151-7509.1046 %0 Journal Article %J Statist. Neerlandica %D 2012 %T Discussion on statistical use of administrative data: old and new challenges %A A. F. Karr %B Statist. Neerlandica %V 66 %P 80-84 %G eng %0 Journal Article %J Significance %D 2011 %T Deming, data and observational studies. A process out of control and needing fixing %A Young SS %A Karr Alan %K observational studies %XAny claim coming from an observational study is most likely to be wrong. Startling, but true. Coffee causes pancreatic cancer. Type A personality causes heart attacks. Trans-fat is a killer. Women who eat breakfast cereal give birth to more boys. All these claims come from observational studies; yet when the studies are carefully examined, the claimed links appear to be incorrect. What is going wrong? Some have suggested that the scientific method is failing, that nature itself is playing tricks on us. But it is our way of studying nature that is broken and that urgently needs mending, say S. Stanley Young and Alan Karr; and they propose a strategy to fix it.

%B Significance %V 8 %P 116-120 %8 September %G eng %R 10.1111/j.1740-9713.2011.00506.x %0 Journal Article %J Journal of Privacy and Confidentiality %D 2010 %T Data confidentiality—the next five years: Summary and guide to papers %A A. F. Karr %A S. K. Kinney %A J. F. Gonzalez, Jr. %B Journal of Privacy and Confidentiality %V 1 %P 125-134 %G eng %0 Journal Article %J Statistical Methodology %D 2006 %T Data quality: A statistical perspective %A Alan F. Karr %A Ashish P. Sanil %A David L. Banks %B Statistical Methodology %V 3 %P 137–173 %G eng %0 Journal Article %J Statistical Science %D 2005 %T Data dissemination and disclosure limitation in a world without microdata: A risk-utility framework for remote access analysis servers %A A. F. Karr %A J. Feng %A X. Lin %A J. P. Reiter %A A. P. Sanil %A Young, S.S. %B Statistical Science %V 20 %P 163-177 %G eng %0 Conference Paper %B Bull. International Statistical Inst., 55th Session %D 2005 %T Data quality and data confidentiality for microdata: implications and strategies %A A. F. Karr %A A. P. Sanil %B Bull. International Statistical Inst., 55th Session %G eng %0 Journal Article %J Journal of Official Statistics %D 2005 %T Data Swapping as a Decision Problem %A Shanti Gomatam %A Alan F. Karr %A Ashish P. Sanil %K categorical data %K data confidentiality %K Data swapping %K data utility %K disclosure risk %K risk-utility frontier %XWe construct a decision-theoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, swap attribute(s) and, possibly, constraints on the unswapped attributes. Risk–utility frontiers, consisting of those candidates not dominated in (risk, utility) space by any other candidate, are a principal tool for reducing the scale of the decision problem. 
Multiple measures of disclosure risk and data utility, including utility measures based directly on use of the swapped data for statistical inference, are introduced. Their behavior and resulting insights into the decision problem are illustrated using data from the U.S. Current Population Survey, the well-studied “Czech auto worker data” and data on schools and administrators generated by the U.S. National Center for Education Statistics.
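The risk-utility frontier the abstract describes, the set of candidate releases not dominated in (risk, utility) space, can be sketched as follows; this is illustrative code, not the authors' implementation, with risk taken as lower-is-better and utility as higher-is-better:

```python
def risk_utility_frontier(candidates):
    """Return the candidates not dominated in (risk, utility) space.
    One release dominates another if it has no higher risk and no lower
    utility, and is strictly better on at least one of the two.
    candidates: list of (risk, utility) pairs."""
    frontier = []
    for i, (r_i, u_i) in enumerate(candidates):
        dominated = any(
            r_j <= r_i and u_j >= u_i and (r_j < r_i or u_j > u_i)
            for j, (r_j, u_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            frontier.append((r_i, u_i))
    return frontier

# Hypothetical candidate swaps: only the non-dominated ones survive.
releases = [(0.1, 0.5), (0.2, 0.9), (0.3, 0.8), (0.1, 0.4)]
frontier = risk_utility_frontier(releases)
```

Pruning to the frontier is what reduces the scale of the decision problem: the releasing agency then chooses among frontier points only, trading risk against utility directly.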

%B Journal of Official Statistics %V 21 %P 635–655 %G eng %0 Journal Article %J Annals of Statistics %D 2005 %T Default Priors for Gaussian Processes %A R. Paulo %K Computer model %K frequentist coverage %K Gaussian process %K integrated likelihood %K Jeffreys prior %K posterior propriety %K reference prior %XMotivated by the statistical evaluation of complex computer models, we deal with the issue of objective prior specification for the parameters of Gaussian processes. In particular, we derive the Jeffreys-rule, independence Jeffreys and reference priors for this situation, and prove that the resulting posterior distributions are proper under a quite general set of conditions. A proper flat prior strategy, based on maximum likelihood estimates, is also considered, and all priors are then compared on the grounds of the frequentist properties of the ensuing Bayesian procedures. Computational issues are also addressed in the paper, and we illustrate the proposed solutions by means of an example taken from the field of complex computer model validation.
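For reference, the Jeffreys-rule prior mentioned in the abstract has the standard general form below; the paper's GP-specific derivations and its independence-Jeffreys and reference variants are not reproduced here:

```latex
\pi^{J}(\theta) \;\propto\; \bigl|\det I(\theta)\bigr|^{1/2},
\qquad
I(\theta)_{ij} \;=\; -\,\mathrm{E}_{\theta}\!\left[
  \frac{\partial^{2}\log p(y \mid \theta)}{\partial\theta_{i}\,\partial\theta_{j}}
\right],
```

where \(I(\theta)\) is the Fisher information of the Gaussian-process model's parameters; the paper's contribution is proving posterior propriety for such priors in the GP setting.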

%B Annals of Statistics %V 33 %P 556-582 %G eng %R 10.1214/009053604000001264 %0 Journal Article %J Chance %D 2005 %T Discussion of ‘The impact of technology on the scientific method’ by S. Keller-McNulty, A. G. Wilson and G. Wilson %A A. F. Karr %B Chance %V 18 %P 1 %G eng %0 Conference Paper %B ICSE 2005 Workshop on Advances in Model-Based Software Testing (A-MOST) %D 2005 %T Distributed performance testing using statistical modeling %A A. A. Porter %A A. F. Karr %B ICSE 2005 Workshop on Advances in Model-Based Software Testing (A-MOST) %G eng %0 Conference Paper %B Proc. dg.o 2004, National Conference on Digital Government Research %D 2004 %T Data confidentiality, data quality and data integration for federal databases %A A. F. Karr %B Proc. dg.o 2004, National Conference on Digital Government Research %P 91-92 %G eng %0 Journal Article %J Current Opinion in Drug Discovery & Development %D 2004 %T Design of diversity and focused combinatorial libraries in drug discovery %A Young SS %A Ge N %B Current Opinion in Drug Discovery & Development %V 7 %P 318-324 %G eng %0 Journal Article %J Chance %D 2004 %T Disclosure Risk vs Data Utility: The R-U Confidentiality Map %A Duncan, George T. %A Stokes, S. Lynne %B Chance %7 3 %V 17 %P 16-20 %G eng %R 10.1080/09332480.2004.10554908 %0 Conference Paper %B Proc. dg.o 2003, National Conference on Digital Government Research %D 2003 %T Data swapping: A risk–utility framework and Web service implementation %A A. F. Karr %A S. Gomatam %A C. Liu %A A. P. Sanil %B Proc. 
dg.o 2003, National Conference on Digital Government Research %I Digital Government Research Center %P 245-248 %G eng %0 Journal Article %J Journal of Chemical Information and Computer Sciences %D 2003 %T Design of diverse and focused combinatorial libraries using an alternating algorithm %A Young SS %A Wang M %A Gu F %B Journal of Chemical Information and Computer Sciences %V 43 %P 1916-1921 %G eng %0 Conference Paper %B Proceedings of Conference on New Frontiers of Statistical Data Mining %D 2003 %T Disclosure limitation methods based on bounds for large contingency tables with application to disability data %A A. Dobra %A E. A. Erosheva %A S. E. Fienberg %B Proceedings of Conference on New Frontiers of Statistical Data Mining %I CRC Press %G eng %0 Journal Article %J IEEE Computer %D 2001 %T Disseminating information but protecting confidentiality %A A. F. Karr %A J. Hernandez %A S. Karimi %A J. Lee %A K. Litwin %A A. Sanil %B IEEE Computer %V 34 %P 36–37 %G eng %0 Conference Paper %B In IEEE Transactions on Software Engineering %D 2001 %T Does code decay? Assessing the evidence from change management data %A Stephen G. Eick %A Todd L. Graves %A Alan F. Karr %A J. S. Marron %A Audris Mockus %XA central feature of the evolution of large software systems is that change, which is necessary to add new functionality, accommodate new hardware, and repair faults, becomes increasingly difficult over time. In this paper, we approach this phenomenon, which we term code decay, scientifically and statistically. We define code decay and propose a number of measurements (code decay indices) on software and on the organizations that produce it, that serve as symptoms, risk factors, and predictors of decay. Using an unusually rich data set (the fifteen-plus year change history of the millions of lines of software for a telephone switching system), we find mixed, but on the whole persuasive, statistical evidence of code decay, which is corroborated by developers of the code. 
Suggestive indications that perfective maintenance can retard code decay are also discussed. Index Terms: Software maintenance, metrics, statistical analysis, fault potential, span of changes, effort modeling.

%B In IEEE Transactions on Software Engineering %P 1–12 %G eng %0 Conference Paper %B Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining %D 2000 %T Defection detection: Using online activity profiles to predict ISP customer vulnerability %A A. F. Karr %A N. Raghavan %A R. Bell %A M. Schonlau %A D. Pregibon %B Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining %P 506–515 %G eng %R 10.1145/347090.347193 %0 Conference Proceedings %B XI Pan American Conference in Traffic and Transportation Engineering %D 2000 %T Direct Signal Timing Optimization: Strategy Development and Results %A Rouphail, N. %A Park, B. %A J. Sacks %B XI Pan American Conference in Traffic and Transportation Engineering %P 19-23 %8 2000 %G eng %0 Journal Article %J Mathematical Geology %D 1999 %T Design and Analysis for Modeling and Predicting Spatial Contamination %A Abt, Markus %A Welch, William J. %A Jerome Sacks %K best linear unbiased prediction %K dioxin contamination %K Gaussian stochastic process %K lognormal kriging %K ordinary kriging %K spatial statistics %XSampling and prediction strategies relevant at the planning stage of the cleanup of environmental hazards are discussed. Sampling designs and models are compared using an extensive set of data on dioxin contamination at Piazza Road, Missouri. To meet the assumptions of the statistical model, such data are often transformed by taking logarithms. Predicted values may be required on the untransformed scale, however, and several predictors are also compared. Fairly small designs turn out to be sufficient for model fitting and for predicting. For fitting, taking replicates ensures a positive measurement error variance and smooths the predictor. This is strongly advised for standard predictors. Alternatively, we propose a predictor linear in the untransformed data, with coefficients derived from a model fitted to the logarithms of the data. 
It performs well on the Piazza Road data, even with no replication.
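The back-transformation issue this abstract raises, that predictions are needed on the untransformed scale after fitting a model to logarithms, can be illustrated with the standard lognormal mean correction; this is a generic sketch, not the linear predictor the paper proposes:

```python
import numpy as np

def lognormal_backtransform(mu_log, var_log):
    """If log Z is Gaussian with mean mu_log and variance var_log, the
    mean of Z on the original scale is exp(mu_log + var_log / 2);
    naively exponentiating the log-scale prediction is biased low."""
    return np.exp(mu_log + var_log / 2.0)

# Hypothetical log-scale prediction with variance 2: the corrected
# back-transform exceeds the naive exp(mu) by a factor of exp(1).
naive = np.exp(0.0)
corrected = lognormal_backtransform(0.0, 2.0)
```

The larger the log-scale prediction variance, the larger this correction factor, which is one reason predictor comparisons on the untransformed scale matter at the planning stage.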

%B Mathematical Geology %I Kluwer Academic Publishers-Plenum Publishers %V 31 %P 1-22 %G eng %U http://dx.doi.org/10.1023/A%3A1007504329298 %R 10.1023/A:1007504329298