The USDA's National Agricultural Statistics Service (NASS) conducts the U.S. Census of Agriculture in years ending in 2 and 7. Population estimates from the census are adjusted for undercoverage, nonresponse, and misclassification, and calibrated to known population totals. These adjustments are reflected in weights attached to each responding unit. Calculating these weights has been a two-step procedure. First, initial (Dual System Estimation, or DSE) weights are calculated to account for undercoverage, nonresponse, and misclassification. Second, calibration adjusts the weights by forcing the weighted estimates obtained in the first step to match known population totals. Recently, a calibration algorithm, Integer Calibration (INCA), was developed to produce the integer calibrated weights required in NASS publications. This paper considers combining the two steps of calculating weights into one. The new algorithm is based on a regularized constrained dual system estimation methodology that combines capture-recapture and calibration (CaRC).

Download: https://www.niss.org/sites/default/files/Toppin_CaRC_20170926.pdf

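To make the two-step weighting pipeline concrete, here is a minimal numerical sketch of the procedure the abstract describes: a Lincoln-Petersen-style dual-system adjustment, followed by calibration of the resulting weights to a known benchmark, with a final integer-rounding pass in the spirit of INCA. All counts, the single-benchmark calibration, and the rounding rule are invented for illustration; the actual NASS procedure, and the CaRC algorithm that replaces it, are far more elaborate.

```python
import numpy as np

# --- Step 1: dual-system (capture-recapture) adjustment ---
# Invented counts: n1 responded to the census list, n2 were found by the
# area survey, m were found by both (the matched cases).
n1, n2, m = 900, 300, 250
N_dse = n1 * n2 / m                # Lincoln-Petersen estimate of the population
dse_weight = N_dse / n1            # common initial weight for census respondents

# --- Step 2: calibrate weights to a known benchmark total ---
# x[i] = acres operated by respondent i (invented); benchmark = known total acres.
rng = np.random.default_rng(0)
x = rng.gamma(2.0, 150.0, size=n1)
w = np.full(n1, dse_weight)
benchmark = 170_000.0
w *= benchmark / np.sum(w * x)     # one-constraint calibration is a simple ratio

# --- INCA-like final pass: round to integers while preserving the total ---
base = np.floor(w).astype(int)
deficit = int(round(w.sum())) - base.sum()
base[np.argsort(base - w)[:deficit]] += 1   # round up the largest fractional parts
print(N_dse, w.sum(), base.sum())
```

CaRC's point of departure is that these two steps, run in sequence, can be replaced by a single regularized, constrained estimation.
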
Karr, A. F., and Kim, H. J. (Submitted). The effect of statistical disclosure limitation on parameter estimation for a finite population. Journal of Survey Statistics and Methodology, to appear.

Sartore, L., Toppin, K., and Spiegelman, C. (Submitted). Estimated Covariance Matrices Associated with Calibration. In JSM 2017.
Keywords: Agriculture, Calibration, Census, Estimation, NASS, Survey, Variance, Weighting
Surveys often provide numerous estimates of population parameters. Some of the population values may be known to lie within a small range with a high level of certainty. Calibration is used to adjust the survey weights associated with the observations in a data set; this process ensures that the sample estimates of the target population totals (benchmarks) lie within the anticipated ranges of those population values. The additional uncertainty due to the calibration process needs to be captured. In this paper, methods for estimating the variance of the population totals are proposed for an algorithmic calibration process based on minimizing the L1-norm relative error. The estimated covariance matrices for the calibration totals are produced either by linear approximation or by bootstrap techniques. Specific data structures are required to allow the computation of massively large covariance matrices; in particular, the implementation of the proposed algorithms exploits sparse matrices to reduce the computational burden and memory usage. The computational efficiency is demonstrated in a simulation study.
Download: https://www.niss.org/sites/default/files/Sartore_Variance_Estim_20170926.pdf

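The bootstrap branch of the variance-estimation paper above can be sketched in a few lines: resample units, recalibrate the weights in each replicate, and take the empirical covariance of the resulting calibrated totals. A one-constraint ratio calibration stands in for the paper's L1-norm calibration algorithm, and all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
w0 = rng.uniform(1, 5, n)                  # initial survey weights (invented)
X = rng.gamma(2.0, 10.0, (n, 3))           # three study variables per unit
z = rng.gamma(2.0, 5.0, n)                 # auxiliary variable with known total
benchmark = 1.05 * np.sum(w0 * z)          # known population total for z

def calibrated_totals(idx):
    """Recalibrate resampled weights to the benchmark, return totals of X."""
    w = w0[idx] * benchmark / np.sum(w0[idx] * z[idx])
    return w @ X[idx]

B = 1000
reps = np.array([calibrated_totals(rng.integers(0, n, n)) for _ in range(B)])
cov_hat = np.cov(reps, rowvar=False)       # bootstrap covariance of the 3 totals
print(cov_hat)
```

At the scale the abstract mentions, the replicate-by-total matrices would be held in sparse structures (e.g., scipy.sparse); dense arrays suffice at this toy scale.
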
Abernethy, J., Sartore, L., Benecha, H., and Spiegelman, C. (Submitted). Estimation of Capture Probabilities by Accounting for Sample Designs. In Proceedings of the Government Statistics Section, JSM 2017.
Keywords: Agriculture, Capture-Recapture, Estimation, Government, NASS, Research, Sample Designs, Weights
The United States Department of Agriculture's (USDA's) National Agricultural Statistics Service (NASS) conducts the Census of Agriculture every five years to estimate the number of U.S. farms, as well as other agriculturally related population totals. NASS applies a Dual-System Estimation (DSE) methodology to data collected from the Census and the June Area Survey (JAS) to estimate the number of farms in the U.S. Traditional multinomial-based capture-recapture methodology requires a model to estimate the probability of capture for every operation captured by either survey. The selection probabilities associated with the JAS area-frame design, however, differ from those associated with the Census, which makes it difficult to compute the exact JAS selection probabilities for farm records captured only by the Census. For this reason, we propose and compare three methods for estimating the overall capture probability: the first two approximate the JAS selection probabilities, and the third conditions them out. We investigate the precision of the three techniques in a simulation study.
Download: https://www.niss.org/sites/default/files/Abernethy_Capture_Probs_20170920.pdf

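One way to see the difficulty the abstract describes: with unequal JAS selection probabilities, the census capture probability must be estimated with design weights rather than raw counts. The toy sketch below weights the census-overlap indicator by invented JAS design weights and then applies the estimated capture probability to an invented census count. It illustrates the general idea only, not any of the paper's three methods.

```python
import numpy as np

rng = np.random.default_rng(2)
n_jas = 400
w_jas = 1.0 / rng.uniform(0.001, 0.01, n_jas)   # JAS design weights (invented)
in_census = rng.random(n_jas) < 0.85            # JAS farms also on the census list

# Design-weighted estimate of the census capture probability, then a
# dual-system-style population estimate from the census count alone.
p_census = np.sum(w_jas * in_census) / np.sum(w_jas)
n_census = 1_900_000                            # invented census farm count
print(p_census, n_census / p_census)
```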

Benecha, H., Abreu, D., Abernethy, J., Sartore, L., and Young, L. J. (Submitted). Evaluation of a New Approach for Estimating the Number of U.S. Farms. In JSM 2017.
Keywords: Agriculture, Area Frame, Big Data, Capture-Recapture, List Frame, Logistic Regression, Misclassification Error, NASS
USDA's National Agricultural Statistics Service (NASS) employs the June Area Survey (JAS) to produce annual estimates of U.S. farm numbers. The JAS is an area-frame-based survey conducted during the first two weeks of June each year. NASS also publishes an independent estimate of the number of farms from the quinquennial Census of Agriculture. Studies conducted by NASS have shown that farm-number estimates from the JAS can be biased, mainly because of misclassification of agricultural tracts during the pre-screening and data-collection processes. To adjust for the bias, NASS has developed a capture-recapture model that uses NASS's list frame as the second sample, where estimation is based on JAS records with matches in the list frame. In this paper, we describe an alternative capture-recapture approach that uses all available data from the JAS and the Census of Agriculture to correct for misclassification biases and to produce more stable farm-number estimates.
Download: https://www.niss.org/sites/default/files/Benecha_Estim_Farms_20170929.pdf

Cui, W., and Sedransk, N. (Submitted). Multidimensionality in the Performance-Based Online Reading Comprehension Assessment.

Kulikowich, J. M., Leu, D., Sedransk, N., Coiro, J., and Forzani, E. (Submitted). The Performance Characteristics of Three Formats for Assessing Internet Research Skills in Science.

Cui, W., and Sedransk, N. (Submitted). Psychometric Invariance of Online Reading Comprehension Assessment across Measurement Conditions.

Sartore, L., Benecha, H., Toppin, K., and Spiegelman, C. (Submitted). Restricted Multinomial Regression for a Triple-System Estimation with List Dependence. In JSM 2017.
Keywords: Agriculture, Big Data, Capture, Data Science, Dependence, Estimation, NASS, Probability, Triple-System, Weights
The National Agricultural Statistics Service (NASS) conducts the U.S. Census of Agriculture every five years. In 2012, NASS began using a capture-recapture approach to adjust the Census estimates for under-coverage, non-response, and misclassification. This approach requires two independent samples; NASS has therefore kept its Census Mailing List (CML) independent from the area frame used for the June Area Survey (JAS). NASS is also exploring the use of web-scraping to develop a third list frame (TL) that would be independent of both the CML and the area frame. In this paper, a Triple-System Estimation (TSE) methodology based on regularized multinomial regression is proposed to investigate possible dependence between the CML and the TL. A simulation study compares the proposed estimator, which can account for frame dependence, with others already presented in the literature.
Download: https://www.niss.org/sites/default/files/Sartore_RestMultiReg_TSE_20170901.pdf

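As background for the triple-system entry above, a minimal triple-system calculation: with three lists, the seven observed capture histories can be fit by a Poisson log-linear model and the unobserved (0,0,0) cell predicted from it. The sketch below fits the independence model with statsmodels; the paper's contribution, regularized multinomial regression with CML-TL dependence terms, would add penalized interaction columns to the design matrix. All counts are invented.

```python
import numpy as np
import statsmodels.api as sm

# Observed counts for the 7 capture histories (CML, JAS, TL); the history
# (0,0,0) -- missed by all three lists -- is unobservable.
histories = [(1,1,1), (1,1,0), (1,0,1), (0,1,1), (1,0,0), (0,1,0), (0,0,1)]
counts = np.array([120, 310, 95, 60, 900, 240, 180])

# Independence log-linear model: log E[count] = b0 + b1*CML + b2*JAS + b3*TL.
X = sm.add_constant(np.array(histories, dtype=float))
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

# The (0,0,0) cell is predicted from the intercept alone; adding it to the
# observed total gives the triple-system population estimate.
missed = np.exp(fit.params[0])
print(counts.sum() + missed)
```
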
Leu, D., Coiro, J., Kulikowich, J. M., and Cui, W. (Submitted). Using the Psychometric Characteristics of Multiple-Choice, Open Internet, and Closed (Simulated) Internet Formats to Refine the Development of Online Research and Comprehension Assessments in Science: Year Three of the ORCA Project.

Karr, A. F., and Cox, L. H. (Submitted). The World's Simplest Survey Microsimulator (WSSM). Journal of Official Statistics.

Nandram, B., Erciulescu, A. L., and Cruze, N. (In press). Bayesian Benchmarking of the Fay-Herriot Model Using Random Deletion. Survey Methodology.

Erciulescu, A. L., Berg, E., Cecere, W., and Ghosh, M. (In press). A Bivariate Hierarchical Bayesian Model for Estimating Cropland Cash Rental Rates at the County Level. Survey Methodology.

Erciulescu, A. L., Cruze, N., and Nandram, B. (2018). Benchmarking a Triplet of Official Estimates. Environmental and Ecological Statistics, 25(4), 523-547.

Erciulescu, A. L., and Fuller, W. A. (2018). Bootstrap Confidence Intervals for Small Area Proportions. Journal of Survey Statistics and Methodology. DOI: 10.1093/jssam/smy014.

Kott, P. (2018). A design-sensitive approach to fitting regression models with complex survey data. Statistics Surveys, 12, 1-17. (Earlier version presented at the 2015 FCSM Research Conference.) https://projecteuclid.org/euclid.ssu/1516179619
Keywords: design-based, extended model, generalized cumulative logistic model, proportional-odds model, pseudo-maximum likelihood
Fitting complex survey data to regression equations is explored under a design-sensitive, model-based framework. A robust version of the standard model assumes that the expected value of the difference between the dependent variable and its model-based prediction is zero regardless of the values of the explanatory variables; the extended model assumes only that this difference is uncorrelated with the covariates. Little is assumed about the error structure of the difference under either model other than independence across primary sampling units. The standard model often fails in practice, but the extended model very rarely does. Under this framework, some of the methods developed in the conventional design-based, pseudo-maximum-likelihood framework, such as fitting weighted estimating equations and sandwich mean-squared-error estimation, are retained, but their interpretations change. Few of the ideas here are new to the refereed literature; the goal instead is to collect those ideas and put them into a unified conceptual framework.

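For a linear model, the weighted estimating equations and sandwich mean-squared-error estimator that the paper carries over from the design-based framework reduce to a few lines of linear algebra, with scores summed within primary sampling units to respect the only error-structure assumption made (independence across PSUs). The data, weights, and PSU labels below are all simulated.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 600
psu = rng.integers(0, 30, n)                    # primary sampling unit labels
w = rng.uniform(1, 4, n)                        # survey weights
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 0.5]) + rng.normal(scale=1 + 0.5 * np.abs(X[:, 1]), size=n)

# Weighted estimating equations: solve sum_i w_i x_i (y_i - x_i'b) = 0.
XtWX = X.T @ (w[:, None] * X)
beta = np.linalg.solve(XtWX, X.T @ (w * y))

# Sandwich variance with scores summed within PSUs (independence across PSUs).
scores = w[:, None] * X * (y - X @ beta)[:, None]
meat = sum(np.outer(s, s) for s in
           (scores[psu == g].sum(axis=0) for g in np.unique(psu)))
bread = np.linalg.inv(XtWX)
se = np.sqrt(np.diag(bread @ meat @ bread))
print(beta, se)
```
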
Erciulescu, A. L., Cruze, N., and Nandram, B. (2018). Model-Based County-Level Crop Estimates Incorporating Auxiliary Sources of Information. Journal of the Royal Statistical Society, Series A. DOI: 10.1111/rssa.12390.

Bellow, M. E., Cruze, N., and Erciulescu, A. L. (2017). Developments in Model-Based County-Level Estimation of Agricultural Cash Rental Rates. In JSM Proceedings, Survey Research Methods Section (pp. 2773-2790). Alexandria, VA: American Statistical Association.
Download: https://www.niss.org/sites/default/files/2017%20-%20Developments%20in%20Model-Based%20County-Level%20Estimation%20of%20Ag%20Cash%20Rental%20Rates.pdf

Mo, Y., and Troia, G. A. (2017). Similarities and differences in constructs represented by U.S. states' middle school writing tests and the 2007 National Assessment of Educational Progress writing assessment. Assessing Writing, 33, 48-67. http://www.sciencedirect.com/science/article/pii/S1075293517300193

Erciulescu, A. L., Cruze, N. B., and Nandram, B. (2017). Small Area Estimates for End-of-Season Agricultural Quantities. In JSM Proceedings, Survey Research Methods Section. Alexandria, VA: American Statistical Association.
Download: https://www.niss.org/sites/default/files/2017%20-%20Small%20Area%20Estimates%20for%20End-Of-Season%20Agricultural%20Quantities.pdf

Troia, G. A., Olinghouse, N. G., Wilson, J., Stewart, K. O., Mo, Y., Hawkins, L., and Kopke, R. A. (2016). The Common Core Writing Standards: A descriptive study of content and alignment with a sample of former state standards. Reading Horizons.

Cruze, N. B., Erciulescu, A. L., Nandram, B., Barboza, W. J., and Young, L. J. (2016). Developments in Model-Based Estimation of County-Level Agricultural Estimates. In ICES V Proceedings. Alexandria, VA: American Statistical Association. http://ww2.amstat.org/meetings/ices/2016/proceedings/131_ices15Final00229.pdf

Bellow, M. E., Daniel, K., Gorsak, M., and Erciulescu, A. L. (2016). Evaluating Record Linkage Software for Agricultural Surveys. In JSM Proceedings, Survey Research Methods Section (pp. 3225-3235). Alexandria, VA: American Statistical Association. https://ww2.amstat.org/MembersOnly/proceedings/2016/data/assets/pdf/389754.pdf
Erciulescu, A. L., Cruze, N. B., and Nandram, B. (2016). Model-Based County-Level Crop Estimates Incorporating Auxiliary Sources of Information. In JSM Proceedings, Survey Research Methods Section (pp. 3591-3605). Alexandria, VA: American Statistical Association. https://ww2.amstat.org/MembersOnly/proceedings/2016/data/assets/pdf/389784.pdf

Mo, Y., and Troia, G. A. (2016). Predicting Students' Writing Performance on the NAEP from Student- and State-Level Variables. Reading & Writing: An Interdisciplinary Journal.

Sartore, L., Fabbri, P., and Gaetan, C. (2016). spMC: an R-package for 3D lithological reconstructions based on spatial Markov chains. Computers and Geosciences, 94, 40-47. DOI: 10.1016/j.cageo.2016.06.001.

Schifeling, T., Cheng, C., Reiter, J. P., and Hillygus, D. C. (2015). Accounting for nonignorable unit nonresponse and attrition in panel studies with refreshment samples. Journal of Survey Statistics and Methodology, 3(3), 265-295.

Abbatiello, S., Schilling, B., Mani, D. R., Shilling, L. I., Hall, S. C., McLean, B., Albetolle, M., Allen, S., Burgess, M., Cusack, M. P., Ghosh, M., Hedrick, V., Held, J. M., Inerowicz, H. D., Jackson, A., Keshishian, H., Kinsinger, C. R., Lyssand, J. S., Makowski, L., Mesri, M., Rodriguez, H., Rudnick, P., Sadowski, P., Sedransk, N., Shaddox, K., Skates, S. J., Kuhn, E., Smith, D., Whiteaker, J. R., Whitwell, C., Zhang, S., Borchers, C. H., Fisher, S. J., Gibson, B. W., Liebler, D. C., MacCoss, M. J., Neubert, T. A., Paulovich, A. G., Regnier, F. E., Tempst, P., and Carr, S. A. (2015). Large-Scale Interlaboratory Study to Develop, Analytically Validate and Apply Highly Multiplexed, Quantitative Peptide Assays to Measure Cancer-Relevant Proteins in Plasma. Molecular & Cellular Proteomics, 14(9), 2357-2374. DOI: 10.1074/mcp.M114.047050.
There is an increasing need in biology and clinical medicine to robustly and reliably measure tens to hundreds of peptides and proteins in clinical and biological samples with high sensitivity, specificity, reproducibility, and repeatability. Previously, we demonstrated that LC-MRM-MS with isotope dilution has suitable performance for quantitative measurements of small numbers of relatively abundant proteins in human plasma and that the resulting assays can be transferred across laboratories while maintaining high reproducibility and quantitative precision. Here, we significantly extend that earlier work, demonstrating that 11 laboratories using 14 LC-MS systems can develop, determine analytical figures of merit for, and apply highly multiplexed MRM-MS assays targeting 125 peptides derived from 27 cancer-relevant proteins and seven control proteins to precisely and reproducibly measure the analytes in human plasma. To ensure consistent generation of high-quality data, we incorporated a system suitability protocol (SSP) into our experimental design. The SSP enabled real-time monitoring of LC-MRM-MS performance during assay development and implementation, facilitating early detection and correction of chromatographic and instrumental problems. Low to sub-nanogram/ml sensitivity for proteins in plasma was achieved by one-step immunoaffinity depletion of 14 abundant plasma proteins prior to analysis.
Median intra- and interlaboratory reproducibility was <20%, sufficient for most biological studies and candidate protein biomarker verification. Digestion recovery of peptides was assessed and quantitative accuracy improved using heavy-isotope-labeled versions of the proteins as internal standards. Using the highly multiplexed assay, participating laboratories were able to precisely and reproducibly determine the levels of a series of analytes in blinded samples used to simulate an interlaboratory clinical study of patient samples. Our study further establishes that LC-MRM-MS using stable isotope dilution, with appropriate attention to analytical validation and appropriate quality control measures, enables sensitive, specific, reproducible, and quantitative measurements of proteins and peptides in complex biological matrices such as plasma.

Kim, H. J., Cox, L. H., Karr, A. F., Reiter, J. P., and Wang, Q. (2015). Simultaneous Edit-Imputation for Continuous Microdata. Journal of the American Statistical Association, 110, 987-999.

Karr, A. F., Kim, H. J., and Reiter, J. P. (2015). Statistical disclosure limitation in the presence of edit rules. Journal of Official Statistics, 31, 121-138.

Karr, A. F., and Reiter, J. P. (2014). Analytical frameworks for data release: A statistical view. In J. Lane, V. Stodden, H. Nissenbaum, and S. Bender (Eds.), Confidentiality and Data Access in the Use of Big Data: Theory and Practical Approaches. New York, NY: Cambridge University Press.

Karr, A. F., Zou, J., Datta, G. S., Grannis, S., and Lynch, J. (2014). A Bayesian spatio-temporal approach for real-time detection of disease outbreaks: A case study. BMC Medical Informatics and Decision Making, 14, 108. DOI: 10.1186/s12911-014-0108-4.

Karr, A. F., Ferrell, R., McCormick, T. H., and Ryan, P. B. (2014). Big data, big results: Knowledge discovery in output from large-scale analytics. Statistical Analysis and Data Mining, 7(5), 404-412. DOI: 10.1002/sam.11237.

Feng, X., Sedransk, N., and Xia, J.-Q. (2014). Calibration using Constrained Smoothing with Application to Mass Spectrometry Data. Biometrics, 70, 398-408. DOI: 10.1111/biom.12135.

Cox, L. H. (2014). Enabling statistical analysis of suppressed tabular data. In Privacy in Statistical Databases, Lecture Notes in Computer Science, vol. 8744 (pp. 1-10). Heidelberg: Springer.

Kim, H. J., and MacEachern, S. N. (2014). The generalized multiset sampler. Journal of Computational and Graphical Statistics. DOI: 10.1080/10618600.2014.962701.

Rudnick, P. A., Wang, X., Yan, E., Sedransk, N., and Stein, S. E. (2014). Improved Normalization of Systematic Biases Affecting Ion Current Measurements in Label-free Proteomics Data. Molecular & Cellular Proteomics, 13(5), 1341-1351.

Karr, A. F., Kim, H. J., Cox, L. H., Wang, Q., and Reiter, J. P. (2014). Multiple imputation of missing or faulty values under linear constraints. Journal of Business & Economic Statistics, 32(3), 375-386.
Sedransk, N., Leu, D., Forzani, E., Burlingame, C., Kulikowich, J., Coiro, J., and Kennedy, C. (2014). The New Literacies of Online Research and Comprehension: Assessing and Preparing Students for the 21st Century with Common Core State Standards. In S. B. Neuman and L. B. Gambrell (Eds.). International Reading Association, to appear.

Wang, X., Chambers, M. C., Vega-Montoto, L. J., Bunk, D. M., Stein, S. E., and Tabb, D. (2014). QC Metrics from CPTAC Raw LC-MS/MS Data Interpreted through Multivariate Statistics. Analytical Chemistry, 86, 2497-2509. DOI: 10.1021/ac4034455. http://pubs.acs.org/doi/pdf/10.1021/ac4034455
Shotgun proteomics experiments integrate a complex sequence of processes, any of which can introduce variability. Quality metrics computed from LC-MS/MS data have relied upon identifying MS/MS scans, but a new mode for the QuaMeter software produces metrics that are independent of identifications. Rather than evaluating each metric independently, we have created a robust multivariate statistical toolkit that accommodates the correlation structure of these metrics and allows for hierarchical relationships among data sets. The framework enables visualization and structural assessment of variability. Study 1 of the Clinical Proteomics Technology Assessment for Cancer (CPTAC), which analyzed three replicates of two common samples at each of two time points on 23 mass spectrometers in nine laboratories, provided the data to demonstrate this framework, and CPTAC Study 5 provided data from complex lysates run under Standard Operating Procedures (SOPs) to complement these findings. Identification-independent quality metrics enabled the differentiation of sites and run times through robust principal components analysis and subsequent factor analysis. Dissimilarity metrics revealed outliers in performance, and a nested ANOVA model revealed the extent to which all metrics, or individual metrics, were impacted by mass spectrometer and run time. Study 5 data revealed that even when SOPs have been applied, instrument-dependent variability remains prominent, although it may be reduced, while within-site variability is reduced significantly. Finally, identification-independent quality metrics were shown to be predictive of identification sensitivity in these data sets. QuaMeter and the associated multivariate framework are available from http://fenchurch.mc.vanderbilt.edu and http://homepages.uc.edu/~wang2x7/, respectively.

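A rough sketch of the identification-independent QC workflow described above: standardize the per-run metrics, project them with PCA for visualization, and flag aberrant runs with robust Mahalanobis distances. This uses scikit-learn's MinCovDet as a generic robust-covariance stand-in for the paper's robust principal components and factor analysis; the metric values are simulated.

```python
import numpy as np
from sklearn.covariance import MinCovDet
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
runs = rng.normal(size=(80, 6))                # 80 runs x 6 QC metrics (simulated)
runs[:3] += 4.0                                # three deliberately aberrant runs

Z = (runs - runs.mean(axis=0)) / runs.std(axis=0)   # standardize each metric
scores = PCA(n_components=2).fit_transform(Z)       # low-dimensional view of runs

# Robust squared Mahalanobis distances flag outlier runs without letting the
# outliers corrupt the covariance estimate they are judged against.
d2 = MinCovDet(random_state=0).fit(Z).mahalanobis(Z)
print(scores[:3])
print(np.argsort(d2)[-3:])                     # indices of the most aberrant runs
```
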
Kinney, S. K., Reiter, J. P., and Miranda, J. (2014). SynLBD 2.0: Improving the Synthetic Longitudinal Business Database. Statistical Journal of the International Association for Official Statistics, 30, 129-135.
Karr, A. F. (2014). Why data availability is such a hard problem. Statistical Journal of the International Association for Official Statistics, 30, 101. DOI: 10.3233/SJI-140809.
Zhang, K., Hughes-Oliver, J. M., and Young, S. S. (2013). Analysis of high-dimensional structure-activity screening datasets using the Optimal Bit String Tree. Technometrics, 55, 161-173. DOI: 10.1080/00401706.2012.760489.
Keywords: Classification, Drug discovery, High-throughput screening, Prediction, QSAR, Simulated annealing
We propose a new classification method called the Optimal Bit String Tree (OBSTree) to identify quantitative structure-activity relationships (QSARs). The method introduces the concept of a chromosome to describe the presence/absence context of a combination of descriptors; a descriptor set and its optimal chromosome form the splitting variable. A new stochastic search scheme that combines weighted sampling, simulated annealing, and a trimming procedure optimizes the choice of splitting variable. Simulation studies and an application to screening monoamine oxidase inhibitors show that OBSTree is accurate and effective at identifying QSAR rules and finding different classes of active compounds. Details of the algorithm, SAS code, and simulated and real datasets are available online as supplementary materials.

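The stochastic search at the core of OBSTree can be miniaturized as follows: score a candidate chromosome (a presence/absence bit string over a descriptor set), flip one bit at a time, and accept or reject moves under a simulated-annealing schedule. The purity score, data, and cooling constants below are stand-ins; the published algorithm adds the weighted descriptor-sampling scheme and trimming procedure.

```python
import numpy as np

rng = np.random.default_rng(5)
D = rng.integers(0, 2, (300, 8))          # binary descriptors for 300 compounds
active = rng.random(300) < 0.2            # activity labels (simulated)

def score(bits):
    """Fraction of actives among compounds matching the chromosome exactly."""
    match = (D == bits).all(axis=1)
    return active[match].mean() if match.any() else 0.0

bits = rng.integers(0, 2, 8)
best, best_score, temp = bits.copy(), score(bits), 1.0
for step in range(2000):
    cand = bits.copy()
    cand[rng.integers(8)] ^= 1            # flip one descriptor's bit
    delta = score(cand) - score(bits)
    if delta > 0 or rng.random() < np.exp(delta / temp):
        bits = cand                       # annealing acceptance rule
    if score(bits) > best_score:
        best, best_score = bits.copy(), score(bits)
    temp *= 0.997                         # geometric cooling schedule
print(best, best_score)
```
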
Carrillo, I. A., and Karr, A. F. (2013). Combining cohorts in longitudinal surveys. Survey Methodology, 39, 149-182.
Keywords: Joint-randomization inference, Multi-cohort longitudinal surveys, Replication variance estimation, Rotating panel surveys, Superpopulation parameters, Weighted Generalized Estimating Equations
A question that commonly arises in longitudinal surveys is how to combine differing cohorts of the survey. In this paper we present a novel method for combining different cohorts, using all available data, to estimate parameters of a semiparametric model that relates the response variable to a set of covariates. The procedure builds upon the Weighted Generalized Estimating Equations method for handling missing waves in longitudinal studies. Our method is set up under a joint-randomization framework for estimation of model parameters, which takes into account the superpopulation model as well as the survey-design randomization. We also propose a design-based and a joint-randomization variance estimation method. To illustrate the methodology, we apply it to the Survey of Doctorate Recipients, conducted by the U.S. National Science Foundation.

Sedransk, N., and Cui, W. (2013). Combining NAEP Items into a Baseline Offline Reading Assessment. U.S. Department of Education.

Karr, A. F., He, Z., Cohen, M. P., Battle, D., Achorn, D. L., and McKay, A. D. (2013). Construction of replicate weights for Project TALENT. In JSM Proceedings, Section on Survey Research Methods. American Statistical Association.

Abbatiello, S., Feng, X., Sedransk, N., Mani, D. R., Schilling, B., MacLean, B., Zimmerman, L. J., Cusack, M. P., Hall, S. C., Addona, T., Allen, S., Dodder, N. G., Ghosh, M., Held, J. M., Hedrick, V., Inerowicz, H. D., Jackson, A., Keshishian, H., Kim, J. W., Lyssand, J. S., Riley, C. P., Rudnick, P., Sadowski, P., Shaddox, K., Smith, D., Tomazela, D., Wahlander, A., Waldemarson, S., Whitwell, C. A., You, J., Zhang, S., Kinsinger, C. R., Mesri, M., Rodriguez, H., Borchers, C. H., Buck, C., Fisher, S. J., Gibson, B. W., Liebler, D., MacCoss, M., Neubert, T. A., Paulovich, A., Regnier, F., Skates, S. J., Tempst, P., Wang, M., and Carr, S. A. (2013). Design, Implementation and Multisite Evaluation of a System Suitability Protocol for the Quantitative Assessment of Instrument Performance in Liquid Chromatography-Multiple Reaction Monitoring-MS (LC-MRM-MS). Molecular and Cellular Proteomics, 12, 2623-2639. DOI: 10.1074/mcp.M112.027078.
Multiple reaction monitoring (MRM) mass spectrometry coupled with stable isotope dilution (SID) and liquid chromatography (LC) is increasingly used in biological and clinical studies for precise and reproducible quantification of peptides and proteins in complex sample matrices. Robust LC-SID-MRM-MS-based assays that can be replicated across laboratories, and ultimately in clinical laboratory settings, require standardized protocols to demonstrate that the analysis platforms are performing adequately. We developed a system suitability protocol (SSP), which employs a predigested mixture of six proteins, to facilitate performance evaluation of LC-SID-MRM-MS instrument platforms configured with nanoflow LC systems interfaced to triple quadrupole mass spectrometers. The SSP was designed for use with low-multiplex analyses as well as high-multiplex approaches in which software-driven scheduling of data acquisition is required. Performance was assessed by monitoring a range of chromatographic and mass spectrometric metrics, including peak width, chromatographic resolution, peak capacity, and the variability in peak area and analyte retention time (RT). The SSP, which was evaluated in 11 laboratories on a total of 15 different instruments, enabled early diagnosis of LC and MS anomalies indicating suboptimal LC-MRM-MS performance. The observed range of variation in each of the metrics scrutinized serves to define the criteria for optimized LC-SID-MRM-MS platforms for routine use, with pass/fail criteria for system suitability defined as: peak area coefficient of variation <0.15, peak width coefficient of variation <0.15, standard deviation of RT <0.15 min (9 s), and RT drift <0.5 min (30 s). The deleterious effect of a marginally performing LC-SID-MRM-MS system on the limit of quantification (LOQ) in targeted quantitative assays illustrates the use of, and need for, an SSP to establish robust and reliable system performance.
Use of an SSP helps to ensure that analyte quantification measurements can be replicated with good precision within and across multiple laboratories, and should facilitate more widespread use of MRM-MS technology by the basic biomedical and clinical laboratory research communities.

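The pass/fail criteria quoted at the end of the abstract translate directly into code: compute coefficients of variation for peak area and peak width, and the standard deviation and drift of retention time, then compare against the stated thresholds. The replicate values below are simulated, and drift is taken here as the first-to-last change, one simple operationalization.

```python
import numpy as np

rng = np.random.default_rng(6)
peak_area = rng.normal(1e6, 8e4, 20)      # 20 replicate injections (simulated)
peak_width = rng.normal(12.0, 1.0, 20)    # seconds
rt = rng.normal(31.0, 0.05, 20)           # retention time, minutes

cv = lambda x: x.std(ddof=1) / x.mean()   # coefficient of variation
checks = {
    "peak area CV < 0.15":  cv(peak_area) < 0.15,
    "peak width CV < 0.15": cv(peak_width) < 0.15,
    "RT std < 0.15 min":    rt.std(ddof=1) < 0.15,
    "RT drift < 0.5 min":   abs(rt[-1] - rt[0]) < 0.5,   # one simple drift definition
}
print(all(checks.values()), checks)
```
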
Karr, A. F. (2013). Discussion of five papers on "Systems and architectures for high-quality statistics production." Journal of Official Statistics, 29, 157-163.

Karr, A. F. (2013). National Institute of Statistical Sciences (US). In Encyclopedia of Environmetrics, second edition. Chichester: Wiley.

Zhou, Y.-C., and Sedransk, N. (2013). A New Functional Data Based Biomarker for Modeling Cardiovascular Behavior. Statistics in Medicine, 32, 153-164. DOI: 10.1002/sim.5518.
Keywords: electrocardiogram, QT interval, ventricular repolarization
Cardiac safety assessment in drug development concerns ventricular repolarization abnormalities (represented by the electrocardiogram (ECG) T-wave) within a cardiac cycle, which are widely believed to be linked with torsades de pointes, a potentially life-threatening arrhythmia. The most often used biomarker for such abnormalities is prolongation of the QT interval, which relies on correct annotation of the onset of the QRS complex and the offset of the T-wave on the ECG. A new biomarker generated from a functional data-based methodology is developed to quantify T-wave morphology changes from placebo to drug interventions. Comparisons of T-wave-form characteristics through a multivariate linear mixed model are made to assess the cardiovascular risk of drugs. Data from a study with 60 subjects participating in a two-period placebo-controlled crossover trial, with repeat ECGs obtained at baseline and 12 time points after intervention, are used to illustrate this methodology; different types of wave-form changes were characterized, motivating further investigation.

Leu, D., and Sedransk, N. (2013). The New Literacies of Online Research and Comprehension: Assessing and Preparing Students for the 21st Century with Common Core State Standards. In S. Neuman and L. Gambrell (Eds.), Quality Reading Instruction in the Age of Common Core Standards (ch. 16). International Reading Association.

Isukapati, I. K., List, G. F., Williams, B. M., and Karr, A. F. (2013). Synthesizing route travel time distributions from segment travel time distributions. Transportation Research Record, 71-81.

Karr, A. F., Banks, D. L., Datta, G., Lynch, J., Niemi, J., and Vera, F. (2012). Bayesian CAR models for syndromic surveillance on multiple data streams: Theory and practice. Information Fusion, 13, 105-116. DOI: 10.1016/j.inffus.2009.10.005.
Keywords: Bayes, CAR models, Gibbs distribution, Markov random field, Syndromic surveillance
Syndromic surveillance has, so far, considered only simple models for Bayesian inference. This paper details the methodology for a serious, scalable solution to the problem of combining symptom data from a network of US hospitals for early detection of disease outbreaks. The approach requires high-end Bayesian modeling and significant computation, but the strategy described in this paper appears to be feasible and offers attractive advantages over the methods currently used in this area. The method is illustrated by application to ten quarters' worth of data on opioid drug abuse surveillance from 636 reporting centers, and then compared to two other syndromic surveillance methods using simulation to create known signal in the drug abuse database.

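The spatial backbone of models like this one is a conditional autoregressive (CAR) prior: a Gaussian Markov random field whose precision matrix is tau * (D - rho * W) for an adjacency matrix W and diagonal degree matrix D. A minimal sketch of constructing that precision for a small graph and drawing one sample from the prior (inside the full model, this draw would sit within a Gibbs sampler):

```python
import numpy as np

# Adjacency for a 5-region chain graph (region i borders i-1 and i+1).
k = 5
W = np.diag(np.ones(k - 1), 1) + np.diag(np.ones(k - 1), -1)
D = np.diag(W.sum(axis=1))
tau, rho = 2.0, 0.9                # precision and spatial-dependence parameters
Q = tau * (D - rho * W)            # proper CAR precision matrix (|rho| < 1)

# Sample phi ~ N(0, Q^{-1}) using the Cholesky factor of Q:
# if L L' = Q, then L^{-T} z has covariance Q^{-1} for z ~ N(0, I).
L = np.linalg.cholesky(Q)
phi = np.linalg.solve(L.T, np.random.default_rng(7).normal(size=k))
print(phi)
```
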
Zou, J., Karr, A. F., Banks, D., Heaton, M. J., Datta, G., Lynch, J., and Vera, F. (2012). Bayesian methodology for the analysis of spatial-temporal surveillance data. Statistical Analysis and Data Mining, 5, 194-204. DOI: 10.1002/sam.10142.
Keywords: conditional autoregressive process, Markov random field, spatial statistics, spatio-temporal, Syndromic surveillance
Early and accurate detection of outbreaks is one of the most important objectives of syndromic surveillance systems. We propose a general Bayesian framework for syndromic surveillance systems, incorporating Gaussian Markov random field (GMRF) and spatio-temporal conditional autoregressive (CAR) modeling. By contrast, most previous approaches have been based on only spatial or time-series models. The model has appealing probabilistic representations as well as attractive statistical properties. Based on extensive simulation studies, the model is capable of capturing outbreaks rapidly while still limiting false positives.

Hughes-Oliver, J. M., Brooks, A., Welch, W., Khaledi, M. G., Hawkins, D. M., Young, S. S., Patil, K., Howell, G. W., Ng, R. T., and Chu, M. T. (2012). ChemModLab: A web-based cheminformatics modeling laboratory. Cheminformatics, 11, 61-81. DOI: 10.3233/CI-2008-0016.
ChemModLab, written by the ECCR @ NCSU consortium under NIH support, is a toolbox for fitting and assessing quantitative structure-activity relationships (QSARs). Its elements are: a cheminformatic front end used to supply molecular descriptors for use in modeling; a set of methods for fitting models; and methods for validating the resulting model. Compounds may be input as structures, from which standard descriptors are calculated using the freely available cheminformatic front end PowerMV, which also supports compound visualization. In addition, users can directly input their own choices of descriptors, so the capability for comparing descriptors is effectively unlimited. The statistical methodologies comprise a comprehensive collection of approaches whose validity and utility have been accepted by experts in the field. As far as possible, these tools are implemented in open-source software linked into the flexible R platform, giving the user the capability of applying many different QSAR modeling methods in a seamless way. As promising new QSAR methodologies emerge from the statistical and data-mining communities, they will be incorporated into the laboratory. The web site also links to public-domain data sets that can be used as test cases for proposed new modeling methods. The capabilities of ChemModLab are illustrated using a variety of biological responses, with different modeling methodologies applied to each; these show clear differences in the quality of the fitted QSAR models and in computational requirements. The laboratory is web-based and free to use. Researchers with new assay data, a new descriptor set, or a new modeling method may readily build QSAR models and benchmark their results against other findings. Users may also examine the diversity of the molecules identified by a QSAR model. Moreover, users may place their data sets in a public area to facilitate communication with other researchers, or keep them hidden to preserve confidentiality.

Kulikowich, J. M., and Sedransk, N. (2012). Current and emerging design and data analysis approaches. In APA Handbook of Educational Psychology. American Psychological Association.

Sedransk, N., Young, L., and Spiegelman, C. (2012). Data, Statistics and Controversy: Making Scientific Data Intelligible. Statistics, Politics and Policy, 3, 1-20. DOI: 10.1515/2151-7509.1046.
Keywords: data availability, Daubert rule, inference verification, meta-data, proprietary data, publication bias, reuse of data, secondary analysis, synthetic data
Making published scientific research data publicly available can benefit scientists and policy makers only if there is sufficient information for these data to be intelligible. The necessary meta-data therefore go beyond scientific and technological detail to the statistical approach and methodologies applied to the data. The statistical principles that give integrity to researchers' analyses and interpretations of their data require documentation. This is true when the intent is to verify or validate published research findings; it is equally true when the intent is to use the scientific data in conjunction with other data, or with new experimental data, to explore complex questions; and it is profoundly important when scientific results and interpretations are taken outside the world of science to establish a basis for policy, for legal precedent, or for decision-making. When research draws on already-public databases, e.g., a large federal statistical database or a large scientific database, the selection of data for analysis, whether by subsampling or by aggregation, is specific to that research, so this (statistical) methodology is a crucial part of the meta-data. Examples illustrate the role of statistical meta-data in the use and reuse of public datasets and the impact on public policy and precedent.

Karr, A. F. (2012). Discussion on statistical use of administrative data: old and new challenges. Statistica Neerlandica, 66, 80-84.

Karr, A. F., List, G. F., Williams, B. M., and Rouphail, N. M. (2012). Forging an understanding of travel time reliability for freeway and arterial networks. In Proc. 2012 International Symposium on Transportation Network Reliability (INSTR).

Reiter, J. P., and Kinney, S. K. (2012). Inferentially Valid, Partially Synthetic Datasets: Generating from Predictive Distributions Not Necessary. Journal of Official Statistics, 28(4), 1-9.

Heaton, M. J., Karr, A. F., Zou, J., Banks, D. L., Datta, G., Lynch, J., and Vera, F. (2012). A spatio-temporal absorbing state model for disease and syndromic surveillance. Statistics in Medicine, 31, 2123-2136.
Reliable surveillance models are an important tool in public health because they aid in mitigating disease outbreaks, identify where and when outbreaks occur, and predict future occurrences. Although many statistical models have been devised for surveillance purposes, none simultaneously achieve the important practical goals of good sensitivity and specificity, proper use of covariate information, inclusion of spatio-temporal dynamics, and transparent support to decision-makers. In an effort to achieve these goals, this paper proposes a spatio-temporal conditional autoregressive hidden Markov model with an absorbing state. The model performs well in both a large simulation study and an application to influenza/pneumonia fatality data.

Karr, A. F., Cox, L. H., and Kinney, S. K. (2012). The World's Simplest Survey Microsimulator (WSSM). Presented at the 2012 Federal Committee on Statistical Methodology (FCSM) Research Conference. http://www.fcsm.gov/12papers/Karr_2012FCSM_II-A.pdf

Wang, J. C., Holan, S. H., Nandram, B., Barboza, W., Toto, C., and Anderson, E. (2011). A Bayesian Approach to Estimating Agricultural Yield Based on Multiple Repeated Surveys. Journal of Agricultural, Biological, and Environmental Statistics, 17, 84-106. DOI: 10.1007/s13253-011-0067-5.
Keywords: Bayesian hierarchical model, Composite estimation, Dynamic model, Forecasting, Model comparison, Prediction

Young, S. S., and Karr, A. (2011). Deming, data and observational studies: A process out of control and needing fixing. Significance, 8, 116-120. DOI: 10.1111/j.1740-9713.2011.00506.x.
Any claim coming from an observational study is most likely to be wrong. Startling, but true. Coffee causes pancreatic cancer. Type A personality causes heart attacks. Trans-fat is a killer. Women who eat breakfast cereal give birth to more boys. All these claims come from observational studies; yet when the studies are carefully examined, the claimed links appear to be incorrect. What is going wrong? Some have suggested that the scientific method is failing, that nature itself is playing tricks on us. But it is our way of studying nature that is broken and urgently needs mending, say S. Stanley Young and Alan Karr; and they propose a strategy to fix it.

Sedransk, N., Cox, L. H., Nolan, D., Soper, K., Spiegelman, C., Young, L. J., Kelner, K. L., Moffitt, R. A., Thakar, A., Raddick, J., Ungvarsky, E. J., Carlson, R. W., and Apweiler, R. (2011). Make research data public? Not always so simple: A dialogue for statisticians and science editors. Statistical Science, 5, 41-50. DOI: 10.1214/10-STS320.
Putting data into the public domain is not the same thing as making those data accessible for intelligent analysis. A distinguished group of editors and experts, already engaged in one way or another with the issues inherent in making research data public, came together with statisticians to initiate a dialogue about the policies and practicalities of requiring published research to be accompanied by publication of the research data. This dialogue carried beyond the broad issues of advisability, intellectual integrity, and scientific exigencies to the relevance of these issues to statistics as a discipline, and the relevance of statistics, from inference to modeling to data exploration, to science and social-science policies on these issues.

Karr, A. F. (2011). National Institute of Statistical Sciences Configuration and Data Integration for Longitudinal Studies Technical Panel: Final Report. US Department of Education, Institute of Education Sciences, NCES.

Karr, A. F. (2011). National Institute of Statistical Sciences Data Confidentiality Technical Panel: Final Report. US Department of Education, Institute of Education Sciences, NCES.

Karr, A. F., and Kinney, S. K. (2011). Research access to restricted-use data. Chance, 24, 41-45.

Karr, A. F., Cox, L. H., and Kinney, S. K. (2011). Risk-utility paradigms for statistical disclosure limitation: How to think, but not how to act (with discussion). International Statistical Review, 79, 160-199. DOI: 10.1111/j.1751-5823.2011.00140.x.
Risk-utility formulations for problems of statistical disclosure limitation are now common. We argue that these approaches are powerful guides to official statistics agencies in regard to how to think about disclosure limitation problems, but that they fall short in essential ways of providing a sound basis for acting upon the problems. We illustrate this position in three specific contexts (transparency, tabular data, and survey weights), with shorter consideration of two key emerging issues: longitudinal data and the use of administrative data to augment surveys.

%B Statistical Science %V 5 %P 41-50 %G eng %R 10.1214/10-STS320 %0 Generic %D 2011 %T National Institute of Statistical Sciences Configuration and Data Integration for Longitudinal Studies Technical Panel: Final Report (2011). %A A. F. Karr %I US Department of Education, Institute of Education Sciences, NCES %G eng %0 Generic %D 2011 %T National Institute of Statistical Sciences Data Confidentiality Technical Panel: Final Report %A A. F. Karr %I US Department of Education, Institute of Education Sciences, NCES %G eng %0 Journal Article %J Chance %D 2011 %T Research access to restricted-use data %A A. F. Karr %A S. K. Kinney %B Chance %V 24 %P 41-45 %G eng %0 Journal Article %J International Statistical Review %D 2011 %T Risk-utility paradigms for statistical disclosure limitation: How to think, but not how to act (with discussion) %A A. F. Karr %A L. H. Cox %A S. K. Kinney %XRisk-utility formulations for problems of statistical disclosure limitation are now common. We argue that these approaches are powerful guides to official statistics agencies in regard to how to think about disclosure limitation problems, but that they fall short in essential ways from providing a sound basis for acting upon the problems. We illustrate this position in three specific contexts—transparency, tabular data and survey weights, with shorter consideration of two key emerging issues—longitudinal data and the use of administrative data to augment surveys.

%B International Statistical Review %V 79 %P 160-199 %G eng %R 10.1111/j.1751-5823.2011.00140.x %0 Conference Paper %B Proceedings, American Society for Engineering Education %D 2011 %T Studying the Reliability and Validity of Test Scores for Mathematical and Spatial Reasoning Tasks for Engineering Students %A Pauley, L. %A Kulikowich, J. %A Sedransk, N. %A Engel, R. %B Proceedings, American Society for Engineering Education %G eng %0 Journal Article %J PACE %D 2011 %T Systematic decrements in QTc between the first and second day of contiguous daily ECG recordings under controlled conditions %A Beasley CM Jr %A Benson C %A Xia JQ %A Young SS %A Haber H %A Mitchell MI %A Loghin C %K ECG %K QT interval %XBACKGROUND: Many thorough QT (TQT) studies use a baseline day and double delta analysis to account for potential diurnal variation in QTc. However, little is known about systematic changes in the QTc across contiguous days when normal volunteers are brought into a controlled inpatient environment.

Carr, S. A., Sedransk, N., Rodriguez, H., Tezak, Z., Mesri, M., Liebler, D. C., Fisher, S. J., Tempst, P., Hiltke, T., Kessler, L. G., Kinsinger, C. R., Philip, R., Ransohoff, D. F., Skates, S. J., Regnier, F. E., Anderson, N. L., and Mansfield, E., on behalf of the Workshop Participants (2010). Analytical Validation of Proteomic-Based Multiplex Assays: A Workshop Report by the NCI-FDA Interagency Oncology Task Force on Molecular Diagnostics. Clinical Chemistry, 56, 237-243. DOI: 10.1373/clinchem.2009.136416.
Clinical proteomics has the potential to enable early detection of cancer through the development of multiplex assays that can inform clinical decisions. However, there has been some uncertainty among translational researchers and developers as to the specific analytical measurement criteria needed to validate protein-based multiplex assays. To begin to address the causes of this uncertainty, a day-long workshop titled "Interagency Oncology Task Force Molecular Diagnostics Workshop" was held, in which members of the proteomics and regulatory communities discussed many of the analytical evaluation issues the field should address in developing protein-based multiplex assays for clinical use. This report explores the issues raised at the workshop and details the recommendations that came out of the day's discussions, such as a summary of the analytical evaluation issues that specific proteomic technologies should address when seeking US Food and Drug Administration approval.

%B PLoS1 %V 6 %P e14590 %G eng %R 10.1371/journal.pone.0014590 %0 Journal Article %J Journal of Clinical Chemistry %D 2010 %T Analytical Validation of Proteomic-Based Multiplex Assays: A Workshop Report by the NCI-FDA Interagency Oncology Task Force on Molecular Diagnostics %A Stephan A. Carr %A Nell Sedransk. %A Henry Rodriguez %A Zivana Tezak %A Mehdi Mesri %A Daniel C. Liebler %A Susan J. Fisher %A Paul Tempst %A Tara Hiltke %A Larry G. Kessler %A Christopher R. Kinsinger %A Reena Philip %A David F. Ransohoff %A Steven J. Skates %A Fred E. Regnier %A N. Leigh Anderson %A Elizabeth Mansfield %A on behalf of the Workshop Participants %XClinical proteomics has the potential to enable the early detection of cancer through the development of multiplex assays that can inform clinical decisions. However, there has been some uncertainty among translational researchers and developers as to the specific analytical measurement criteria needed to validate protein-based multiplex assays. To begin to address the causes of this uncertainty, a day-long workshop titled “Interagency Oncology Task Force Molecular Diagnostics Workshop” was held in which members of the proteomics and regulatory communities discussed many of the analytical evaluation issues that the field should address in development of protein-based multiplex assays for clinical use. This meeting report explores the issues raised at the workshop and details the recommendations that came out of the day’s discussions, such as a workshop summary discussing the analytical evaluation issues that specific proteomic technologies should address when seeking US Food and Drug Administration approval.

%B Journal of Clinical Chemistry %V 56 %P 237-243 %G eng %R 10.1373/clinchem.2009.136416 %0 Journal Article %J Journal of American Statistical Association %D 2010 %T Bayesian multiscale multiple imputation with implications to data confidentiality %A A. F. Karr %A S. H. Holan %A D. Toth %A M. A. R. Ferreira %XMany scientific, sociological, and economic applications present data that are collected on multiple scales of resolution. One particular form of multiscale data arises when data are aggregated across different scales both longitudinally and by economic sector. Frequently, such datasets experience missing observations in a manner that they can be accurately imputed, while respecting the constraints imposed by the multiscale nature of the data, using the method we propose known as Bayesian multiscale multiple imputation. Our approach couples dynamic linear models with a novel imputation step based on singular normal distribution theory. Although our method is of independent interest, one important implication of such methodology is its potential effect on confidential databases protected by means of cell suppression. In order to demonstrate the proposed methodology and to assess the effectiveness of disclosure practices in longitudinal databases, we conduct a large-scale empirical study using the U.S. Bureau of Labor Statistics Quarterly Census of Employment and Wages (QCEW). During the course of our empirical investigation it is determined that several of the predicted cells are within 1% accuracy, thus causing potential concerns for data confidentiality.

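The "imputation step based on singular normal distribution theory" can be illustrated by the basic conditioning it rests on: if a cell vector is modeled as x ~ N(mu, Sigma) and a linear aggregate a'x = t is published, then imputations drawn from the conditional (singular) normal reproduce the aggregate exactly. The parameters below are invented; the paper embeds this step within dynamic linear models.

```python
import numpy as np

rng = np.random.default_rng(8)
mu = np.array([10.0, 20.0, 30.0])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.8],
                  [0.5, 0.8, 5.0]])
a = np.ones(3)                             # the published aggregate is the sum
t = 63.0                                   # published total for the three cells

# Conditional distribution of x given a'x = t (standard normal conditioning).
s2 = a @ Sigma @ a
mu_c = mu + Sigma @ a * (t - a @ mu) / s2
Sigma_c = Sigma - np.outer(Sigma @ a, Sigma @ a) / s2   # singular: rank 2

# Draw an imputation; eigendecomposition handles the singular covariance.
vals, vecs = np.linalg.eigh(Sigma_c)
z = rng.normal(size=3) * np.sqrt(np.clip(vals, 0, None))
x = mu_c + vecs @ z
print(x, x.sum())                          # the sum reproduces t exactly
```
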
Pauley, L., Kulikowich, J. M., Sedransk, N., and Engel, R. (2010). Constructing mathematical and spatial-reasoning measures for engineering students. In Proceedings, American Society for Engineering Education.

Karr, A. F., Kinney, S. K., and Gonzalez, J. F., Jr. (2010). Data confidentiality—the next five years: Summary and guide to papers. Journal of Privacy and Confidentiality, 1, 125-134.

Zhou, Y.-C., and Sedransk, N. (2010). Marking the Ends of T-waves: Algorithms and Experts. Statistics in Biopharmaceutical Research, 2, 359-367. DOI: 10.1198/sbr.2009.08085.
Keywords: Bayesian algorithm, functional data analysis, QT interval
The prolongation of the QT interval on the electrocardiogram (ECG) is the current measure of cardiac safety used in drug development and drug approval. Although in thorough QT studies pharmaceutical companies need to measure QT intervals for thousands of beats, they rely mainly on experts to mark the QT interval endpoints. However, selected beats of data show that the difference between two experts' marks can easily exceed 10 milliseconds; note that for QT analyses presented to the FDA, if the maximal difference over all time points between QT measures comparing control to drug exceeds 10 milliseconds, the question of cardiac safety requires further discussion. Experts appear to use the slope and the curvature of the waveform differently in judging the end of the T-wave. This article develops a Bayesian approach combining both slope and curvature information. We show that the difference between the automatic Bayesian marks and either expert's marks is not statistically larger than the difference between the two experts' marks; thus the approach closely approximates the experts' results in marking the T-wave end, while being much faster and more cost-efficient. Being algorithmic, our method also offers the opportunity to be more consistent.

Karr, A. F., and Oganian, A. (2010). Masking methods that preserve positivity constraints in microdata. Journal of Statistical Planning and Inference, 141, 31-41.
Keywords: constraints, positivity, SDL method, statistical disclosure limitation (SDL)
Statistical agencies have conflicting obligations to protect confidential information provided by respondents to surveys or censuses and to make data available for research and planning activities. When the microdata themselves are to be released, statistical agencies apply statistical disclosure limitation (SDL) methods, such as noise addition, swapping, or microaggregation, to achieve these conflicting objectives. Some of these methods do not preserve important structure and constraints in the data, such as positivity of some attributes or inequality constraints between attributes. Failure to preserve constraints is not only problematic in terms of data utility but may also increase disclosure risk. In this paper, we describe an SDL method that preserves both positivity of attributes and the mean vector and covariance matrix of the original data. The basis of the method is to apply multiplicative noise with a proper, data-dependent covariance structure.

%B J. Statist. Planning Inf. %V 141 %P 31-41 %G eng %0 Journal Article %J Journal of Statistical Planning Inf. %D 2010 %T Preserving data utility via BART %A A. F. Karr %B Journal of Statistical Planning Inf. %V 140 %P 2551-2561 %G eng %0 Journal Article %J Journal of Privacy and Confidentiality %D 2010 %T Privacy-preserving maximum likelihood estimation for distributed data %A A. F. Karr %A X. Lin %B Journal of Privacy and Confidentiality %V 1 %P 213-222 %G eng %0 Conference Paper %B Social Statistics and Higher Education Conference Volume %D 2010 %T Psychometric and Statistical Modeling for the Study of Retention and Graduation in Undergraduate Engineering %A Sedransk, N. %A Kulikowich, J.M. %A Engel, R. %A X. Wang %A Gunning, P. %A Fleming, A. %B Social Statistics and Higher Education Conference Volume %G eng %0 Journal Article %J Journal of Privacy and Confidentiality %D 2010 %T Secure statistical analysis of distributed databases, emphasizing what we don’t know %A A. F. Karr %B Journal of Privacy and Confidentiality %V 1 %P 197-211 %G eng %0 Journal Article %J Journal of Official Statistics %D 2010 %T Statistical Careers in US Government Science Agencies %A Sedransk, N. %K complex system models %K engineering statistics %K high-dimensional data %K History of statistics %K metrology %XThe role of statistics in those U.S. government agencies that focus on progress in science and engineering became prominent at the end of the Second World War. The success of statistics in that historical period came from the power of statistics to enable science to advance more rapidly and with great assurance in the interpretation of experimental results. Over the past three quarters of a century, technology has changed both the practice of science and the practice of statistics. However, the comparative advantage of statistics still rests in the ability to achieve greater precision with fewer errors and a deeper understanding. Examples illustrate some of the challenges that complex science now presents to statisticians, demanding both creativity and technical skills.

%B Journal of Official Statistics %V 26 %P 443-453 %G eng %0 Conference Paper %B Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, 2009. WHISPERS ’09. First Workshop on %D 2009 %T Evaluation of unmixing methods for the separation of Quantum Dot sources %A Fogel, P. %A Gobinet, C. %A Young, S.S. %A Zugaj, D. %K Bayesian methods %K Bayesian positive source separation %K BPSS %K cadmium compounds %K CdSe %K consensus nonnegative matrix factorization %K Fluorescence %K hyperspectral images %K Hyperspectral imaging %K hyperspectral system %K ICA %K II-VI semiconductors %K independent component analysis %K Nanobioscience %K Nanocrystals %K nanometer dimensions %K NMF %K Photonic crystals %K Probes %K quantum dot sources %K Quantum dots %K semiconductor crystals %K semiconductor quantum dots %K Source separation %K spatial localization %K ultraviolet spectra %K unmixing methods %XQuantum Dots (QDs) are semiconductor crystals with nanometer dimensions, which have fluorescence properties that can be adjusted through controlling their diameter. Under ultraviolet light excitation, these nanocrystals re-emit photons in the visible spectrum, with a wavelength ranging from red to blue as their size diminishes. We created an experiment to evaluate unmixing methods for hyperspectral images. The wells of a 3 × 3 matrix were filled with individual or up to three of five QDs. The matrix was imaged by a hyperspectral system (Photon Etc., Montreal, QC, CA) and a data “cube” of 512 rows × 512 columns × 63 wavelengths was generated. For unmixing, we tested three approaches: independent component analysis (ICA), Bayesian positive source separation (BPSS) and our new consensus non-negative matrix factorization (CNMF) method. For each of these methods, we assessed the ability to separate the different sources from both spectral and spatial localization points of view. In this situation, we showed that BPSS and CNMF model estimates were very close to the original design of our experiment and were better than the ICA results. However, the time needed for the BPSS model to converge is substantially higher than for CNMF. In addition, we show how the CNMF coefficients can be used to provide reasonable bounds for the number of sources, a key issue for unmixing methods, and allow for an effective segmentation of the spatial signal.
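As a rough illustration of the factorization step only (a single plain NMF run via scikit-learn, not the consensus CNMF procedure of the paper), with a hypothetical cube shape and source count:

    import numpy as np
    from sklearn.decomposition import NMF

    def unmix(cube, n_sources):
        # cube: nonnegative hyperspectral array, rows x cols x bands
        r, c, b = cube.shape
        V = cube.reshape(r * c, b)                 # one spectrum per pixel
        model = NMF(n_components=n_sources, init="nndsvda", max_iter=500)
        W = model.fit_transform(V)                 # per-pixel source abundances
        H = model.components_                      # estimated source spectra
        return W.reshape(r, c, n_sources), H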

%B Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, 2009. WHISPERS ’09. First Workshop on %P 1-4 %@ 978-1-4244-4686-5 %G eng %R 10.1109/WHISPERS.2009.5289020 %0 Journal Article %J Annals of Applied Statistics %D 2009 %T Functional Data Analytic Approach of Modeling ECG T-wave shape to Measure Cardiovascular Behavior %A Zhou, Y-C. %A Sedransk, N. %K cardiac safety %K ECG T-wave %K Functional data analysis %K QT interval %K T-wave morphology %XThe T-wave of an electrocardiogram (ECG) represents the ventricular repolarization that is critical in restoration of the heart muscle to a pre-contractile state prior to the next beat. Alterations in the T-wave reflect various cardiac conditions; and links between abnormal (prolonged) ventricular repolarization and malignant arrhythmias have been documented. Cardiac safety testing prior to approval of any new drug currently relies on two points of the ECG waveform: onset of the Q-wave and termination of the T-wave; and only a few beats are measured. Using functional data analysis, a statistical approach extracts a common shape for each subject (reference curve) from a sequence of beats, and then models the deviation of each curve in the sequence from that reference curve as a four-dimensional vector. The representation can be used to distinguish differences between beats or to model shape changes in a subject’s T-wave over time. This model provides physically interpretable parameters characterizing T-wave shape, and is robust to the determination of the endpoint of the T-wave. Thus, this dimension reduction methodology offers the strong potential for definition of more robust and more informative biomarkers of cardiac abnormalities than the QT (or QT corrected) interval in current use.
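A minimal numerical sketch of the reference-curve idea, assuming beats already resampled and aligned on a common grid; the Legendre projection here is a generic four-dimensional summary, not the paper's physically interpretable parameterization.

    import numpy as np

    def twave_features(beats):
        # beats: n_beats x n_samples array of aligned T-waves for one subject
        B = np.asarray(beats, dtype=float)
        ref = B.mean(axis=0)                       # subject-specific reference curve
        t = np.linspace(-1.0, 1.0, B.shape[1])
        basis = np.polynomial.legendre.legvander(t, 3)     # n_samples x 4
        coef, *_ = np.linalg.lstsq(basis, (B - ref).T, rcond=None)
        return ref, coef.T                         # one 4-vector of deviations per beat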

%B Annals of Applied Statistics %V 3 %P 1382-1402 %G eng %R 10.1214/09-AOAS273

%0 Journal Article %J Journal of Privacy and Confidentiality %D 2009 %T Global measures of data utility for microdata masked for disclosure limitation %A A. F. Karr %A A. Oganyan %A J. P. Reiter %A M.-J. Woo %B Journal of Privacy and Confidentiality %V 1 %P 111-124 %G eng %0 Generic %D 2009 %T NISS/NESSI Task Force on Full Population Estimates for NAEP %A A. F. Karr %I National Institute of Statistical Sciences %G eng %0 Journal Article %J Journal of Official Statistics %D 2009 %T Privacy-preserving analysis of vertically partitioned data using secure matrix products %A A. F. Karr %A X. Lin %A J. P. Reiter %A A. P. Sanil %B Journal of Official Statistics %V 25 %P 125-138 %G eng %0 Conference Paper %B Proc. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality %D 2009 %T The role of transparency in statistical disclosure limitation %A A. F. Karr %B Proc. Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality %C Bilbao, Spain %8 December %G eng %U http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.46/2009/wp.41.e.pdf %0 Journal Article %J Computer Math Organ Theory %D 2009 %T Special issue on dynamic models for social networks %A David Banks %B Computer Math Organ Theory %V 15 %P 259-260 %8 12/2009 %G eng %& 259 %R 10.1007/s10588-009-9062-6 %0 Generic %D 2009 %T Task Force Report on Computer Adaptive Testing %A Sedransk, N. %I National Center for Education Statistics %G eng %0 Journal Article %J Computational Statistics and Data Analysis %D 2009 %T Verification servers: enabling analysts to assess the quality of inferences from public use data %A J. P. Reiter %A A. Oganyan %A A. F. Karr %XTo protect confidentiality, statistical agencies typically alter data before releasing them to the public. Ideally, although generally not done, the agency also provides a way for secondary data analysts to assess the quality of inferences obtained with the released data. Quality measures can help secondary data analysts to identify inaccurate conclusions resulting from the disclosure limitation procedures, as well as have confidence in accurate conclusions. We propose a framework for an interactive, web-based system that analysts can query for measures of inferential quality. As we illustrate, agencies seeking to build such systems must consider the additional disclosure risks from releasing quality measures. We suggest some avenues of research on limiting these risks.

%B Computational Statistics and Data Analysis %V 53 %P 1475-1482 %G eng %0 Journal Article %J Proceedings - Royal Society B %D 2008 %T Cereal-induced gender selection? Most likely a multiple testing false positive %A Young SS %A Bang H %A Oktay K %B Proceedings - Royal Society B %V 276 %P 1211-1212 %G eng %U http://rspb.royalsocietypublishing.org/content/276/1660/1211.full %R 10.1098/rspb.2008.1405 %0 Book Section %D 2008 %T Citizen access to government statistical information %A Alan F. Karr %E H. Chen %E L. Brandt %E V. Gregg %E R. Traunmüller %E S. Dawes %E E. Hovy %E A. Macintosh %E C. A. Larson %XModern electronic technologies have dramatically increased the volume of information collected and assembled by government agencies at all levels. This chapter describes digital government research aimed at keeping government data warehouses from turning into data cemeteries. The products of the research exploit modern electronic technologies in order to allow “ordinary citizens” and researchers access to government-assembled information. The goal is to help ensure that more data also means better and more useful data. Underlying the chapter are three tensions. The first is between comprehensiveness and understandability of information available to non-technically oriented “private citizens.” The second is between ensuring usefulness of detailed statistical information and protecting confidentiality of data subjects. The third tension is between the need to analyze “global” data sets and the reality that government data are distributed among both levels of government and agencies (typically, by the “domain” of data, such as education, health, or transportation).

%I Springer US %P 503-529 %G eng %& 25 %0 Journal Article %J Statistics in Medicine %D 2008 %T Estimation of propensity scores using generalized additive models %A Mi-ja Woo %A Jerome Reiter %A Alan F. Karr %B Statistics in Medicine %V 27 %P 3806-3816 %G eng %0 Book Section %B Terrorism Informatics %D 2008 %T Homeland Insecurity %A Stephen E. Fienberg %E Chen, Hsinchun %E Reid, Edna %E Sinai, Joshua %E Silke, Andrew %E Ganor, Boaz %XFollowing the events of September 11, 2001, there has been heightened attention in the United States and elsewhere to the use of multiple government and private databases for the identification of possible perpetrators of future attacks, as well as an unprecedented expansion of federal government data mining activities, many involving databases containing personal information. There have also been claims that prospective data mining could be used to find the “signature” of terrorist cells embedded in larger networks. We present an overview of why the public has concerns about such activities and describe some proposals for the search of multiple databases which supposedly do not compromise possible pledges of confidentiality to the individuals whose data are included. We also explore their link to the related literatures on privacy-preserving data mining. In particular, we focus on the matching problem across databases and the concept of “selective revelation” and their confidentiality implications.

%B Terrorism Informatics %S Integrated Series In Information Systems %I Springer US %V 18 %P 197-218 %@ 978-0-387-71612-1 %G eng %U http://dx.doi.org/10.1007/978-0-387-71613-8_10 %R 10.1007/978-0-387-71613-8_10 %0 Journal Article %J Journal of National Cancer Institute %D 2008 %T Low-fat dietary pattern and cancer incidence in the Women’s Health Initiative Dietary Modification Randomized Controlled Trial %A Young SS %B Journal of National Cancer Institute %V 100 %P 284 %G eng %U http://jnci.oxfordjournals.org/content/100/4/284.1.extract# %R 10.1093/jnci/djm309 %0 Journal Article %J Computational Statistics & Data Analysis %D 2008 %T Pooled ANOVA %A Michael Last %A Gheorghe Luta %A Alex Orso %A Adam Porter %A Stan Young %B Computational Statistics & Data Analysis %V 52 %P 5215 %G eng %0 Journal Article %J Q. Applied Mathematics %D 2008 %T Sensitivity to noise variance in a social network dynamics model %A H. T. Banks %A H. K. Nguyen %A J. R. Samuels, Jr. %A A. F. Karr %B Q. Applied Mathematics %V 66 %P 233-247 %G eng %0 Book Section %B Encyclopedia of Risk Assessment IV %D 2008 %T Social Networks %A D. L. Banks %A N. Hengartner %K block models %K counterterrorism %K exponential family %K latent space models %K p* models %XSocial networks models are a body of statistical procedures for describing relationships between agents. The term stems from initial applications that studied interactions within human communities, but the methodology is now used much more broadly and can analyze interactions among genes, proteins, nations, and websites. In the context of risk analysis, social network models have been used to describe the formation, persistence, and breakdown of terrorist cells. They also pertain to studies of organizational behavior.

%B Encyclopedia of Risk Assessment IV %I Wiley %G eng %R 10.1002/9780470061596.risk0667 %0 Journal Article %J Annals of Statistics %D 2007 %T Computer Model Validation with Functional Output %A M.J. Bayarri %A J. Berger %A Garcia-Donato, G. %A Liu, F. %A R. Paulo %A Jerome Sacks %A Palomo, J. %A Walsh, D. %A J. Cafeo %A Parthasarathy, R. %B Annals of Statistics %V 35 %P 1874-1906 %G eng %N 5 %0 Journal Article %J Journal of Chemical Information and Modeling %D 2007 %T Exploration of cluster structure-activity relationship analysis in efficient high-throughput screening %A Wang, X. S. %A Salloum, G.A. %A Chipman, H.A. %A Welch, W.J. %A Young, S.S. %XSequential screening has become increasingly popular in drug discovery. It iteratively builds quantitative structure-activity relationship (QSAR) models from successive high-throughput screens, making screening more effective and efficient. We compare cluster structure-activity relationship analysis (CSARA) as a QSAR method with recursive partitioning (RP), by designing three strategies for sequential collection and analysis of screening data. Various descriptor sets are used in the QSAR models to characterize chemical structure, including high-dimensional sets and some that by design have many variables not related to activity. The results show that CSARA outperforms RP. We also extend the CSARA method to deal with a continuous assay measurement.

%B Journal of Chemical Information and Modeling %V 47 %P 1206-1214 %G eng %R 10.1021/ci600458n %0 Journal Article %J Bioinformatics %D 2007 %T Inferential, robust non-negative matrix factorization analysis of microarray data %A Fogel, P. %A Young, S.S. %A Hawkins, D.M. %A Ledirac, N %XMotivation: Modern methods such as microarrays, proteomics and metabolomics often produce datasets where there are many more predictor variables than observations. Research in these areas is often exploratory; even so, there is interest in statistical methods that accurately point to effects that are likely to replicate. Correlations among predictors are used to improve the statistical analysis. We exploit two ideas: non-negative matrix factorization methods that create ordered sets of predictors; and statistical testing within ordered sets which is done sequentially, removing the need for correction for multiple testing within the set. Results: Simulations and theory point to increased statistical power. Computational algorithms are described in detail. The analysis and biological interpretation of a real dataset are given. In addition to the increased power, the benefit of our method is that the organized gene lists are likely to lead to better understanding of the biology. Availability: An SAS JMP executable script is available from http://www.niss.org/irMF

%B Bioinformatics %V 23 %P 44-49 %G eng %R 10.1093/bioinformatics/btl550 %0 Journal Article %J Journal of Computational Statistics and Data Analysis %D 2007 %T Secure computation with horizontally partitioned data using adaptive regression splines %A A. F. Karr %A J. Ghosh %A J. P. Reiter %XWhen several data owners possess data on different records but the same variables, known as horizontally partitioned data, the owners can improve statistical inferences by sharing their data with each other. Often, however, the owners are unwilling or unable to share because the data are confidential or proprietary. Secure computation protocols enable the owners to compute parameter estimates for some statistical models, including linear regressions, without sharing individual records’ data. A drawback to these techniques is that the model must be specified in advance of initiating the protocol, and the usual exploratory strategies for determining good-fitting models have limited usefulness since the individual records are not shared. In this paper, we present a protocol for secure adaptive regression splines that allows for flexible, semi-automatic regression modeling. This reduces the risk of model mis-specification inherent in secure computation settings. We illustrate the protocol with air pollution data.
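Protocols of this kind are typically built on secure summation of quantities computed locally; a toy round-robin version (with none of the collusion or integrity protections a real protocol needs) might look like:

    import numpy as np

    def secure_sum(values, rng=np.random.default_rng(7)):
        # values: each party's private quantity; in a real protocol only the
        # masked running total travels between parties, never the raw values
        mask = rng.uniform(0.0, 1e9)     # known only to the first party
        total = mask
        for v in values:                 # each party adds its own value in turn
            total += v
        return total - mask              # the first party removes its mask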

%B Journal of Computational Statistics and Data Analysis %V 51 %P 5813-5820 %8 August %G eng %R 10.1016/j.csda.2006.10.013 %0 Conference Paper %B Bulletin of International Statistics Institute %D 2007 %T Secure logistic regression with distributed databases %A A. F. Karr %A S. E. Fienberg %A Y. Nardi %A A. Slavkovic %B Bulletin of International Statistics Institute %G eng %0 Journal Article %J Journal of Data Science %D 2007 %T Statistics in metrology: International key comparisons and interlaboratory studies %A Sedransk, N. %A Rukhin, A. %B Journal of Data Science %V 5 %P 393-412 %G eng %0 Journal Article %J IEEE TRANSACTIONS ON SOFTWARE ENGINEERING %D 2007 %T Techniques for classifying executions of deployed software to support software engineering tasks %A Murali Haran %A Alan Karr %A Michael Last %A Alessandro Orso %A Adam A. Porter %A Ashish Sanil %A Sandro Fouché %B IEEE TRANSACTIONS ON SOFTWARE ENGINEERING %V 33 %P 287-304 %G eng %0 Journal Article %J Q&SPES News %D 2006 %T Clustering Scotch Whiskies using Non-Negative Matrix Factorization %A Young, S.S. %A Fogel, P. %A Hawkins, D.M. %B Q&SPES News %V 14 %P 11-13 %G eng %0 Conference Paper %B Privacy in Statistical Databases: CENEX–SDC Project International Conference, PSD 2006 Rome, Italy, December 13–15, 2006 Proceedings %D 2006 %T Combinations of SDC methods for microdata protection %A A. F. Karr %A A. Oganyan %E J. Domingo–Ferrer %E L. Franconi %B Privacy in Statistical Databases: CENEX–SDC Project International Conference, PSD 2006 Rome, Italy, December 13–15, 2006 Proceedings %8 December %G eng %0 Journal Article %J Statistical Methodology %D 2006 %T Data quality: A statistical perspective %A Alan F. Karr %A Ashish P. Sanil %A David L. Banks %B Statistical Methodology %V 3 %P 137–173 %G eng %0 Journal Article %J The American Statistician %D 2006 %T A framework for evaluating the utility of data altered to protect confidentiality %A A. F. Karr %A C. N. Kohnen %A A. Oganyan %A J. P. Reiter %A A. P. Sanil %B The American Statistician %V 60 %P 224-232 %G eng %0 Journal Article %J Journal of Chemical Information and Modeling %D 2006 %T PharmID: Pharmacophore identification using Gibbs sampling %A Feng J. %A Sanil A %A Young SS %XThe binding of a small molecule to a protein is inherently a 3D matching problem. As crystal structures are not available for most drug targets, there is a need to be able to infer from bioassay data the key binding features of small molecules and their disposition in space, the pharmacophore. Fingerprints of 3D features and a modification of Gibbs sampling to align a set of known flexible ligands, where all compounds are active, are used to discern possible pharmacophores. A clique detection method is used to map the features back onto the binding conformations. The complete algorithm is described in detail, and it is shown that the method can find common superimposition for several test data sets. The method reproduces answers very close to the crystal structure and literature pharmacophores in the examples presented. The basic algorithm is relatively fast and can easily deal with up to 100 compounds and tens of thousands of conformations. The algorithm is also able to handle multiple binding mode problems, which means it can superimpose molecules within the same data set according to two different sets of binding features. We demonstrate the successful use of this algorithm for multiple binding modes for a set of D2 and D4 ligands.

%B Journal of Chemical Information and Modeling %V 46 %P 1352-1359 %G eng %R 10.1021/ci050427v %0 Journal Article %J Technometrics %D 2006 %T Secure, privacy-preserving analysis of distributed databases %A Alan F. Karr %A Fulp, WJ %A F. Vera %A Young, S.S. %A X. Lin %A J. P. Reiter %XThere is clear value, in both industrial and government settings, derived from performing statistical analyses that, in effect, integrate data in multiple, distributed databases. However, the barriers to actually integrating the data can be substantial or even insurmountable. Corporations may be unwilling to share proprietary databases such as chemical databases held by pharmaceutical manufacturers, government agencies are subject to laws protecting confidentiality of data subjects, and even the sheer volume of the data may preclude actual data integration. In this paper, we show how tools from modern information technology, specifically secure multiparty computation and networking, can be used to perform statistically valid analyses of distributed databases. The common characteristic of the methods we describe is that the owners share sufficient statistics computed on the local databases in a way that protects each owner from the others. That is, while each owner can calculate the “complement” of its contribution to the analysis, it cannot discern which other owners contributed what to that complement. Our focus is on horizontally partitioned data: the data records rather than the data attributes are spread among the owners. We present protocols for secure regression, contingency tables, maximum likelihood and Bayesian analysis. For low-risk situations, we describe a secure data integration protocol that integrates the databases but prevents owners from learning the source of data records other than their own. Finally, we outline three current research directions: a software system implementing the protocols, secure EM algorithms, and partially trusted third parties, which reduce incentives to owners not to be honest.
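For horizontally partitioned linear regression, the shared-sufficient-statistics idea reduces to pooling X'X and X'y; a sketch with the secure-summation step elided:

    import numpy as np

    def partitioned_regression(parties):
        # parties: list of (X, y) held by separate owners; only the local
        # sufficient statistics X'X and X'y leave each site (in practice via
        # secure summation), yet the pooled least-squares fit is exact
        p = parties[0][0].shape[1]
        XtX, Xty = np.zeros((p, p)), np.zeros(p)
        for X, y in parties:
            XtX += X.T @ X
            Xty += X.T @ y
        return np.linalg.solve(XtX, Xty)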

%B Technometrics %V 48 %P 133-143 %G eng %R 10.1198/004017007000000209 %0 Journal Article %J Metrologia %D 2006 %T Statistical analysis for multiple artifact problem in key comparisons with linear trends %A Zhang, N.-F. %A Strawderman, W. %A Liu, H.-k. %A Sedransk, N. %K computational physics %K instrumentation and measurement %XA statistical analysis for key comparisons with linear trends and multiple artefacts is proposed. This is an extension of a previous paper for a single artefact. The approach has the advantage that it is consistent with the no-trend case. The uncertainties for the key comparison reference value and the degrees of equivalence are also provided. As an example, the approach is applied to key comparison CCEM–K2.

%B Metrologia %V 43 %P 21-26 %G eng %R 10.1088/0026-1394/43/1/003 %0 Journal Article %J Technometrics %D 2006 %T Statistical design of pools using optimal coverage and minimal collision %A Remlinger KS %A Hughes-Oliver JM %A Young SS %A Lam RL %K Pharmaceutical industry %K Pooled data %K Pooling %K Screening %K Throughput %XThe screening of large chemical libraries to identify new compounds can be simplified by testing compounds in pools. Two criteria for designing pools are considered: optimal coverage of the chemical space and minimal collision between compounds. Four pooling designs are applied to a public database and evaluated by determining how well the design criteria are met and whether the methods are able to find diverse active compounds. While one pool was outstanding, all designed pools outperformed randomly designed pools.

%B Technometrics %V 48 %P 133-143 %G eng %R 10.1198/004017005000000481 %0 Generic %D 2006 %T Survey Costs: Workshop Report and White Paper %A A. F. Karr %A M. Last %I National Institute of Statistical Sciences %G eng %0 Conference Proceedings %B Proc. ACM SIGSOFT Symposium Foundations of Software Engineering 2005 %D 2005 %T Applying classification techniques to remotely-collected program execution data %A A. F. Karr %A M. Haran %A A. A. Porter %A A. Orso %A A. P. Sanil %B Proc. ACM SIGSOFT Symposium Foundations of Software Engineering 2005 %I ACM %C New York %G eng %0 Journal Article %J Statistical Science %D 2005 %T Data dissemination and disclosure limitation in a world without microdata: A risk-utility framework for remote access analysis servers %A A. F. Karr %A J. Feng %A X. Lin %A J. P. Reiter %A A. P. Sanil %A Young, S.S. %B Statistical Science %V 20 %P 163-177 %G eng %0 Conference Paper %B Bull. International Statistical Inst., 55th Session %D 2005 %T Data quality and data confidentiality for microdata: implications and strategies %A A. F. Karr %A A. P. Sanil %B Bull. International Statistical Inst., 55th Session %G eng %0 Journal Article %J Journal of Official Statistics %D 2005 %T Data Swapping as a Decision Problem %A Shanti Gomatam %A Alan F. Karr %A Ashish P. Sanil %K categorical data %K data confidentiality %K Data swapping %K data utility %K disclosure risk %K risk-utility frontier %XWe construct a decision-theoretic formulation of data swapping in which quantitative measures of disclosure risk and data utility are employed to select one release from a possibly large set of candidates. The decision variables are the swap rate, swap attribute(s) and, possibly, constraints on the unswapped attributes. Risk–utility frontiers, consisting of those candidates not dominated in (risk, utility) space by any other candidate, are a principal tool for reducing the scale of the decision problem. Multiple measures of disclosure risk and data utility, including utility measures based directly on use of the swapped data for statistical inference, are introduced. Their behavior and resulting insights into the decision problem are illustrated using data from the U.S. Current Population Survey, the well-studied “Czech auto worker data” and data on schools and administrators generated by the U.S. National Center for Education Statistics.
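A minimal sketch of the swap operation itself (one attribute, unconstrained random pairs); the paper's decision-theoretic machinery then governs the choice of swap rate and attributes:

    import numpy as np

    def swap_attribute(values, rate, rng=np.random.default_rng(0)):
        # exchange one attribute between randomly chosen record pairs;
        # real releases usually constrain pairs to agree on other attributes
        col = np.array(values, copy=True)
        n_pairs = int(rate * len(col) / 2)
        idx = rng.choice(len(col), size=2 * n_pairs, replace=False)
        a, b = idx[:n_pairs], idx[n_pairs:]
        col[a], col[b] = col[b].copy(), col[a].copy()
        return col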

%B Journal of Official Statistics %V 21 %P 635–655 %G eng %0 Journal Article %J Annals of Statistics %D 2005 %T Default Priors for Gaussian Processes %A R. Paulo %K Computer model %K frequentist coverage %K Gaussian process %K integrated likelihood %K Jeffreys prior %K posterior propriety %K reference prior %XMotivated by the statistical evaluation of complex computer models, we deal with the issue of objective prior specification for the parameters of Gaussian processes. In particular, we derive the Jeffreys-rule, independence Jeffreys and reference priors for this situation, and prove that the resulting posterior distributions are proper under a quite general set of conditions. A proper flat prior strategy, based on maximum likelihood estimates, is also considered, and all priors are then compared on the grounds of the frequentist properties of the ensuing Bayesian procedures. Computational issues are also addressed in the paper, and we illustrate the proposed solutions by means of an example taken from the field of complex computer model validation.

%B Annals of Statistics %V 33 %P 556-582 %G eng %R 10.1214/009053604000001264 %0 Journal Article %J Chance %D 2005 %T Discussion of ‘The impact of technology on the scientific method’ by S. Keller–McNulty, A. G. Wilson and G. Wilson %A A. F. Karr %B Chance %V 18 %P 1 %G eng %0 Conference Paper %B ICSE 2005 Workshop on Advances in Model-Based Software Testing (A-MOST) %D 2005 %T Distributed performance testing using statistical modeling %A A. A. Porter %A A. F. Karr %B ICSE 2005 Workshop on Advances in Model-Based Software Testing (A-MOST) %G eng %0 Generic %D 2005 %T National Institute of Statistical Sciences/Education Statistics Services Institute Task Force on Graduation, Completion and Dropout Indicators: Final Report %A A. F. Karr %I US Department of Education, Institute of Education Sciences, NCES %8 November %G eng %0 Journal Article %J Journal of Chemical Information and Modeling %D 2005 %T PowerMV: A Software Environment for Molecular Viewing, Descriptor Generation, Data Analysis and Hit Evaluation %A Liu, J. %A J. Feng %A Young, S.S. %XIdeally, a team of biologists, medicinal chemists and information specialists will evaluate the hits from high throughput screening. In practice, it often falls to nonmedicinal chemists to make the initial evaluation of HTS hits. Chemical genetics and high content screening both rely on screening in cells or animals where the biological target may not be known. There is a need to place active compounds into a context to suggest potential biological mechanisms. Our idea is to build an operating environment to help the biologist make the initial evaluation of HTS data. To this end the operating environment provides viewing of compound structure files, computation of basic biologically relevant chemical properties and searching against biologically annotated chemical structure databases. The benefit is to help the nonmedicinal chemist, biologist and statistician put compounds into a potentially informative biological context. Although there are several similar public and private programs used in the pharmaceutical industry to help evaluate hits, these programs are often built for computational chemists. Our program is designed for use by biologists and statisticians.

%B Journal of Chemical Information and Modeling %V 45 %P 515-522 %G eng %R 10.1021/ci049847v %0 Journal Article %J Pharmacogenomics %D 2005 %T Recursive partitioning as a tool for pharmacogenetic studies of complex diseases: II. Statistical considerations %A Zaykin, D.V. %A Young, S.S. %XIdentifying genetic variations predictive of important phenotypes, such as disease susceptibility, drug efficacy, and adverse events, remains a challenging task. There are individual polymorphisms that can be tested one at a time, but there is the more difficult problem of the identification of combinations of polymorphisms or even more complex interactions of genes with environmental factors. Diseases, drug responses or side effects can result from different mechanisms. Identification of subgroups of people where there is a common mechanism is a problem for diagnosis and prescribing of treatment. Recursive partitioning (RP) is a simple statistical tool for segmenting a population into non-overlapping groups where the response of interest, disease susceptibility, drug efficacy and adverse events are more homogeneous within the segments. We suggest that the use of RP is not only more technically feasible than other search methods but it is less susceptible to multiple-testing problems. The number of combinations of gene–gene and gene–environment interactions is potentially astronomical and RP greatly reduces the effective search and inference space. Moreover, the certain reliance of RP on the presence of marginal effects is justifiable as was found by using analytical and numerical arguments. In the context of haplotype analysis, results suggest that the analysis of individual SNPs is likely to be successful even when susceptibilities are determined by haplotypes. Retrospective clinical studies where cases and controls are collected will be a common design. This report provides methods that can be used to adjust the RP analysis to reflect the population incidence of the response of interest. Confidence limits on the incidence of the response in the segmented subgroups are also discussed. RP is a straightforward way to create realistic subgroups, and prediction intervals for the within-subgroup disease incidence are easily obtained.

%B Pharmacogenomics %V 6 %P 77-89 %G eng %R 10.1517/14622416.6.1.77 %0 Journal Article %J Biostatistics %D 2005 %T Sample size calculation for multiple testing in microarray data analysis %A Jung SH %A Bang H %A Young SS %B Biostatistics %V 6 %P 157-169 %G eng %0 Journal Article %J J. Computer-Aided Molecular Design %D 2005 %T Secure analysis of distributed chemical databases without data integration %A Alan F. Karr %A Jun Feng %A Xiaodong Lin %A Ashish P. Sanil %A S. Stanley Young %A Jerome P. Reiter %B J. Computer-Aided Molecular Design %V 19 %P 739-747 %8 November %G eng %0 Journal Article %J J. Computational and Graphical Statist %D 2005 %T Secure Regression on Distributed Databases %A Alan F. Karr %A Xiaodong Lin %A Ashish P. Sanil %A Jerome P. Reiter %B J. Computational and Graphical Statist %V 14 %P 263–279 %G eng %0 Conference Paper %B Statistical Methods in Counterterrorism: Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication %D 2005 %T Secure statistical analysis of distributed databases using partially trusted third parties %A Alan F. Karr %A Xiaodong Lin %A Ashish P. Sanil %A Jerome P. Reiter %E D. Olwell %E A. G. Wilson %E G. Wilson %B Statistical Methods in Counterterrorism: Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication %I Springer–Verlag %C New York %G eng %0 Conference Paper %B Proceedings of 2004 Workshop on Verification & Validation of Computer Models of High-consequence Engineering Systems %D 2005 %T A statistical metrologist looks at computational system models %A Sedransk, N. %B Proceedings of 2004 Workshop on Verification & Validation of Computer Models of High-consequence Engineering Systems %G eng %0 Generic %D 2005 %T Title IX Data Collection: Technical Manual for Developing the User’s Guide %A A. F. Karr %A A. P. Sanil %I National Institute of Statistical Sciences %G eng %0 Journal Article %J Chance %D 2004 %T Analysis of integrated data without data integration %A A. F. Karr %A X. Lin %A J. P. Reiter %A A. P. Sanil %B Chance %V 17 %P 26-29 %G eng %0 Thesis %D 2004 %T Calibration and Validation of Macroscopic, Deterministic Traffic Models %A J. Lin %I North Carolina State University %C Raleigh %V Masters %G eng %9 masters %0 Conference Paper %B Proc. dg.o 2004, National Conference on Digital Government Research %D 2004 %T Data confidentiality, data quality and data integration for federal databases %A A. F. Karr %B Proc. dg.o 2004, National Conference on Digital Government Research %P 91-92 %G eng %0 Journal Article %J Current Opinion in Drug Discovery & Development %D 2004 %T Design of diversity and focused combinatorial libraries in drug discovery %A Young SS %A Ge N %B Current Opinion in Drug Discovery & Development %V 7 %P 318-324 %G eng %0 Journal Article %J Chance %D 2004 %T Disclosure Risk vs Data Utility: The R-U Confidentiality Map %A Duncan, George T. %A Stokes, S. Lynne %B Chance %7 3 %V 17 %P 16-20 %G eng %R 10.1080/09332480.2004.10554908 %0 Book Section %B Web Dynamics %D 2004 %T How Large Is the World Wide Web? %A Adrian Dobra %A Stephen E. Fienberg %XThere are many metrics one could consider for estimating the size of the World Wide Web, and in the present chapter we focus on size in terms of the number N of Web pages. Since a database with all the valid URLs on the Web cannot be constructed and maintained, determining N by counting is impossible.
For the same reasons, estimating N by directly sampling from the Web is also infeasible. Instead of studying the Web as a whole, one can try to assess the size of the publicly indexable Web, which is the part of the Web that is considered for indexing by the major search engines. Several groups of researchers have invested considerable efforts to develop sound sampling schemes that involve submitting a number of queries to several major search engines. Lawrence and Giles [8] developed a procedure for sampling Web documents by submitting various queries to a number of search engines. We contrast their study with the one performed by Bharat and Broder [2] in November 1997. Although both experiments took place almost in the same period of time, their estimates are significantly different. In this chapter we review how the size of the indexable Web was estimated by three groups of researchers using three different statistical models: Lawrence and Giles [8, 9], Bharat and Broder [2] and Bradlow and Schmittlein [3]. Then we present a statistical framework for the analysis of data sets collected by query-based sampling, utilizing a hierarchical Bayes formulation of the Rasch model for multiple list population estimation developed in [6]. We explain why this approach seems to be in reasonable accord with the real-world constraints and thus allows us to make credible inferences about the size of the Web. We give two different methods that lead to credible estimates of the size of the Web in a reasonable amount of time and are also consistent with the real-world constraints.
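The simplest instance of multiple-list estimation is the two-list case; a worked sketch with hypothetical figures (the chapter's hierarchical Rasch formulation relaxes the equal-catchability assumption this estimator makes):

    def lincoln_petersen(n1, n2, m):
        # n1 pages found by engine A, n2 by engine B, m found by both;
        # the classical two-list estimate of the total number of pages N
        if m == 0:
            raise ValueError("no overlap between the two lists")
        return n1 * n2 / m

    # hypothetical figures: 125e6 and 100e6 indexed pages with 40e6 in common,
    # so lincoln_petersen(125e6, 100e6, 40e6) gives about 312.5 million pages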

%B Web Dynamics %I Springer Berlin Heidelberg %P 23-43 %@ 978-3-642-07377-9 %G eng %U http://dx.doi.org/10.1007/978-3-662-10874-1_2 %R 10.1007/978-3-662-10874-1_2 %0 Conference Proceedings %B Proceedings of JSM 2004 %D 2004 %T A Model for Relating Browsing Behavior to Site Design on the World Wide Web %A M. Haran %A A. F. Karr %B Proceedings of JSM 2004 %I American Statistical Association %C Alexandria, VA %8 August %G eng %0 Conference Paper %B Proc. Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining %D 2004 %T Privacy preserving regression modelling via distributed computation %A A. F. Karr %A X. Lin %A J. P. Reiter %A A. P. Sanil %B Proc. Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining %P 677-682 %G eng %0 Conference Paper %B Proc. dg.o 2004, National Conference on Digital Government Research %D 2004 %T Regression on distributed databases via secure multi-party computation %A A. F. Karr %A X. Lin %A J. P. Reiter %A A. P. Sanil %B Proc. dg.o 2004, National Conference on Digital Government Research %P 405-406 %G eng %0 Conference Paper %B ASA Proceedings 2004 %D 2004 %T Secure regression for vertically partitioned, partially overlapping data %A A. F. Karr %A C. N. Kohnen %A X. Lin %A J. P. Reiter %A A. P. Sanil %B ASA Proceedings 2004 %G eng %0 Thesis %D 2004 %T Traffic Simulation Failure Detection and Analysis %A B. Wan %I North Carolina State University %C Raleigh %V Ph.D. %G eng %9 masters %0 Thesis %D 2003 %T Bayesian Stochastic Computation with application to Model Selection and Inverse Problems %A G. Molina %I Duke University %C Durham %G eng %9 masters %0 Book Section %B Foundations of Statistical Inference %D 2003 %T Bounding Entries in Multi-way Contingency Tables Given a Set of Marginal Totals %A Adrian Dobra %A Stephen E. Fienberg %E Haitovsky, Yoel %E Ritov, Yaacov %E Lerche, HansRudolf %XWe describe new results for sharp upper and lower bounds on the entries in multi-way tables of counts based on a set of released and possibly overlapping marginal tables. In particular, we present a generalized version of the shuttle algorithm proposed by Buzzigoli and Giusti that computes sharp integer bounds for an arbitrary set of fixed marginals. We also present two examples which illustrate the practical import of the bounds for assessing disclosure risk.
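For a single cell of a two-way table, the bounds have a closed form; a one-function sketch (the generalized shuttle algorithm extends and sharpens this to multi-way tables with overlapping released margins):

    def cell_bounds(row_total, col_total, grand_total):
        # Frechet bounds for one cell of a two-way count table given its margins:
        # the cell can be no larger than either of its margins, and no smaller
        # than what the margins force once the rest of the table is filled in
        upper = min(row_total, col_total)
        lower = max(0, row_total + col_total - grand_total)
        return lower, upper

    # e.g. cell_bounds(20, 30, 100) -> (0, 20); cell_bounds(80, 90, 100) -> (70, 80)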

%B Foundations of Statistical Inference %S Contributions to Statistics %I Physica-Verlag HD %P 3-16 %@ 978-3-7908-0047-0 %G eng %U http://dx.doi.org/10.1007/978-3-642-57410-8_1 %R 10.1007/978-3-642-57410-8_1 %0 Conference Paper %B Foundations of Statistical Inference, Proceedings of the Shoresh Conference 2000 %D 2003 %T Bounding entries in multi-way contingency tables given a set of marginal totals %A A. Dobra %A S. E. Fienberg %B Foundations of Statistical Inference, Proceedings of the Shoresh Conference 2000 %I Spr %G eng %0 Conference Paper %B Proc. dg.o 2003, National Conference on Digital Government Research %D 2003 %T Data swapping: A risk–utility framework and Web service implementation %A A. F. Karr %A S. Gomatam %A C. Liu %A A. P. Sanil %B Proc. dg.o 2003, National Conference on Digital Government Research %I Digital Government Research Center %P 245-248 %G eng %0 Journal Article %J Journal of Chemical Information and Computer Sciences %D 2003 %T Design of diverse and focused combinatorial libraries using an alternating algorithm %A Young SS %A Wang M %A Gu F %B Journal of Chemical Information and Computer Sciences %V 43 %P 1916-1921 %G eng %0 Conference Paper %B Proceedings of Conference on New Frontiers of Statistical Data Mining %D 2003 %T Disclosure limitation methods based on bounds for large contingency tables with application to disability data %A A. Dobra %A E. A. Erosheva %A S. E. Fienberg %B Proceedings of Conference on New Frontiers of Statistical Data Mining %I CRC Press %G eng %0 Journal Article %J Chance %D 2003 %T Exploring blood spectra for signs of ovarian cancer %A Hawkins, D.M. %A Wolfinger, R.D. %A L. Liu %A Young, S.S. %B Chance %V 16 %P 19-23 %G eng %R 10.1080/09332480.2003.10554870 %0 Thesis %D 2003 %T Methods for Calibrating and Validating Stochastic Micro-Simulation Traffic Models %A N. Siddiqui %I North Carolina State University %C Raleigh %V Masters %G eng %9 masters %0 Journal Article %J J. Statist. Software %D 2003 %T NISS WebSwap: A Web Service for Data Swapping %A Ashish Sanil %A Shanti Gomatam %A Alan F. Karr %B J. Statist. Software %V 8 %P 2003 %G eng %0 Journal Article %J STATISTICS AND COMPUTING %D 2003 %T Preserving confidentiality of high-dimensional tabular data: Statistical and computational issues %A Adrian Dobra %A Alan F. Karr %A Ashish P. Sanil %B STATISTICS AND COMPUTING %V 8 %P 363–370 %G eng %0 Journal Article %J Phys Rev E Stat Nonlin Soft Matter Phys %D 2003 %T Random-walk-based estimates of transport properties in small specimens of composite materials %A Jeffrey D. Picka %A Chermakani, Karthik %K Advanced Traveler Information Systems %K random walks %XA method based on random walks is developed for estimating the dc conductance and similar transport properties in small specimens of composite materials. The method is valid over a much wider range of material structures than are asymptotic methods, and requires only that the internal structure of the material be known. The error in its estimates is limited primarily by CPU speed. It is found to work best for composites consisting of a bulk conducting phase and inclusions of lower conductivity.
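A cartoon of the random-walk estimator class (not the paper's method): walkers on a periodic 2D grid accept moves in proportion to the target cell's conductivity, and the slope of mean-squared displacement against time gives an effective diffusivity as a stand-in transport property.

    import numpy as np

    rng = np.random.default_rng(3)

    def effective_diffusivity(conduct, n_walkers=2000, n_steps=2000):
        # conduct: n x n array of cell conductivities scaled into (0, 1]
        n = conduct.shape[0]
        pos = rng.integers(0, n, size=(n_walkers, 2))    # unwrapped positions
        start = pos.copy()
        steps = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])
        for _ in range(n_steps):
            trial = pos + steps[rng.integers(0, 4, n_walkers)]
            cell = trial % n                             # periodic medium
            accept = rng.random(n_walkers) < conduct[cell[:, 0], cell[:, 1]]
            pos[accept] = trial[accept]
        msd = ((pos - start) ** 2).sum(axis=1).mean()
        return msd / (4.0 * n_steps)                     # 2D Einstein relation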

%B Phys Rev E Stat Nonlin Soft Matter Phys %V 4 %G eng %0 Journal Article %J Proceedings of the National Academy of Sciences %D 2003 %T Robust singular value decomposition analysis of microarray data %A Liu L %A Hawkins DM %A Ghosh S %A Young SS %B Proceedings of the National Academy of Sciences %V 100 %P 13167-13172 %G eng %0 Journal Article %J Comm. ACM %D 2003 %T Table servers protect confidentiality in tabular data releases %A Alan F. Karr %A Adrian Dobra %A Ashish P. Sanil %B Comm. ACM %V 46 %P 57–58 %G eng %0 Journal Article %J Journal of Regional Science %D 2003 %T Validation of multiclass urban travel forecasting models combining origin-destination, mode, and route choices %A Bar-Gera, H. %A Boyce, D. E. %B Journal of Regional Science %V 43 %P 517-540 %G eng %& 517 %0 Book Section %D 2002 %T Advances in Digital Government %A A. F. Karr %A J. Lee %A A. P. Sanil %A J. Hernandez %A S. Karimi %A K. Litwin %E E. Elmagarmid %E W. M. McIver %XThe Internet provides an efficient mechanism for Federal agencies to distribute their data to the public. However, it is imperative that such data servers have built-in mechanisms to ensure that confidentiality of the data, and the privacy of individuals or establishments represented in the data, are not violated. We describe a prototype dissemination system developed for the National Agricultural Statistics Service that uses aggregation of adjacent geographical units as a confidentiality-preserving technique. We also outline a Bayesian approach to statistical analysis of the aggregated data.
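A greedy sketch of the aggregation rule behind such a system: fold small units into a neighbor until every reporting unit clears a minimum respondent count. Unit names, neighbor lists and the threshold are all hypothetical.

    def aggregate_until_safe(counts, neighbors, threshold):
        # counts: {unit: respondent count}; neighbors: {unit: [adjacent units]}
        # (the neighbors dict is modified in place as units merge)
        counts = dict(counts)
        groups = {u: {u} for u in counts}
        while True:
            small = [u for u in counts if counts[u] < threshold]
            merged = False
            for u in sorted(small, key=counts.get):
                nbrs = [v for v in neighbors.get(u, []) if v in counts and v != u]
                if not nbrs:
                    continue
                v = min(nbrs, key=counts.get)        # fold u into smallest neighbor
                counts[v] += counts.pop(u)
                groups[v] |= groups.pop(u)
                neighbors[v] = [w for w in set(neighbors.get(v, [])) |
                                set(neighbors.get(u, [])) if w not in (u, v)]
                merged = True
                break
            if not merged:
                return groups                        # all units safe or unmergeable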

%I Kluwer %C Boston %P 181-196 %@ 978-1-4020-7067-9 %G eng %& Web-based systems that disseminate information from data but preserve confidentiality %R 10.1007/0-306-47374-7_11 %0 Conference Paper %B Bayesian Statistics 7, Proceedings of the Seventh Valencia International Meeting on Bayesian Statistics %D 2002 %T Assessing the Risk of Disclosure of Confidential Categorical Data %A Dobra, A., %A Fienberg, S.E., %A Trottini , M %B Bayesian Statistics 7, Proceedings of the Seventh Valencia International Meeting on Bayesian Statistics %I Oxford Press %G eng %0 Conference Paper %B Proceedings of Conference on Foundation of Statistical Inference and Its Applications, Jerusalem %D 2002 %T Bounding entries in multi-way contingency tables given a set of marginal totals %A A. Dobra %A S. E. Fienberg %B Proceedings of Conference on Foundation of Statistical Inference and Its Applications, Jerusalem %I Springer-Verlag %G eng %0 Book Section %B Trade, Networks and Hierarchies %D 2002 %T Combined Model of Interregional Commodity Flows on a Transportation Network %A Boyce, David %E Hewings, Geoffrey J.D. %E Sonis, Michael %E Boyce, David %XThis chapter is motivated by two ongoing research objectives of the author. The first concerns models of flows on transportation networks. Whether the subject is personal travel or freight transportation, representation of the transportation network is necessary to determine realistically interzonal/interregional travel/transportation costs. The methodological effort required to achieve such results is nontrivial, but the issues raised by such an attempt are enlightening and worthwhile. This insight is demonstrated once more by the models considered here.

%B Trade, Networks and Hierarchies %S Advances in Spatial Science %I Springer Berlin Heidelberg %P 29-40 %@ 978-3-642-07712-8 %G eng %U http://dx.doi.org/10.1007/978-3-662-04786-6_3 %R 10.1007/978-3-662-04786-6_3 %0 Journal Article %J Journal of Chemical Information and Computer Sciences %D 2002 %T The construction and assessment of a statistical model for the prediction of protein assay data %A Jennifer Pittman Clarke %A Jerome Sacks %A S. Stanley Young %XThe focus of this work is the development of a statistical model for a bioinformatics database whose distinctive structure makes model assessment an interesting and challenging problem. The key components of the statistical methodology, including a fast approximation to the singular value decomposition and the use of adaptive spline modeling and tree-based methods, are described, and preliminary results are presented. These results are shown to compare favorably to selected results achieved using comparative methods. An attempt to determine the predictive ability of the model through the use of cross-validation experiments is discussed. In conclusion, a synopsis of the results of these experiments and their implications for the analysis of bioinformatic databases in general is presented.

%B Journal of Chemical Information and Computer Sciences %V 42 %P 729-741 %G eng %R 10.1021/ci0103828

%0 Journal Article %J Technometrics %D 2002 %T Experimental designs for estimating a response surface and variance components %A Bruce E Ankenman %A Hui Liu %A Alan F. Karr %A Jeffrey D. Picka %B Technometrics %V 44 %P 45-54 %G eng %R 10.1198/004017002753398317 %0 Conference Proceedings %B Workshop on Foundations for Modeling and Simulation %D 2002 %T A Framework for Validating Computer Models %A M.J. Bayarri %A J. Berger %A D. Higdon %A M. Kottas %A R. Paulo %A J. Sacks %A J. Cafeo %A J. Cavendish %A C. Lin %A J. Tu %B Workshop on Foundations for Modeling and Simulation %I Society for Computer Simulation %8 2002 %G eng %0 Book Section %B Encyclopedia of Environmetrics %D 2002 %T National Institute of Statistical Sciences (US) %A A. F. Karr %B Encyclopedia of Environmetrics %I Wiley, Chichester %G eng %0 Conference Paper %B Proc. dgo.2002, National Conference on Digital Government Research %D 2002 %T Optimal tabular releases from confidential data %A A. F. Karr %A A. Dobra %A A. P. Sanil %B Proc. dgo.2002, National Conference on Digital Government Research %G eng %0 Book Section %B Transportation Planning %D 2002 %T Origin-based Network Assignment %A Bar-Gera, Hillel %A Boyce, David %E Patriksson, Michael %E Labbé, Martine %K network optimization %K Origin-based traffic assignment %K user equilibrium %XMost solution methods for the traffic assignment problem can be categorized as either link-based or route-based. Only a few attempts have followed the intermediate, origin-based approach. This paper describes the main concepts of a new, origin-based method for the static user equilibrium traffic assignment problem. Computational efficiency in time and memory makes this method suitable for large-scale networks of practical interest. Experimental results show that the new method is especially efficient in finding highly accurate solutions.

%B Transportation Planning %S Applied Optimization %I Springer US %V 64 %P 1-17 %@ 978-1-4020-0546-6 %G eng %U http://dx.doi.org/10.1007/0-306-48220-7_1 %R 10.1007/0-306-48220-7_1 %0 Journal Article %J ASCE J. Materials %D 2002 %T Permeability of Cracked Steel Fiber–Reinforced Concrete %A Julie Rapoport %A Corina-Maria Aldea %A Surendra P. Shah %A Bruce Ankenman %A Alan F. Karr %XThis research explores the relationship between permeability and crack width in cracked, steel fiber–reinforced concrete. In addition, it inspects the influence of steel fiber reinforcement on concrete permeability. The feedback–controlled splitting tension test (also known as the Brazilian test) is used to induce cracks of up to 500 microns (0.02 in) in concrete specimens without reinforcement, and with steel fiber reinforcement volumes of both 0.5% and 1%. The cracks relax after induced cracking. The steel fibers decrease permeability of specimens with relaxed cracks larger than 100 microns.

%B ASCE J. Materials %V 14 %P 355–358 %G eng %0 Thesis %D 2002 %T Problems on the Bayesian-Frequentist Interface %A R. Paulo %I Duke University %G eng %9 masters %0 Journal Article %J Int. Journal of Uncertainty, Fuzziness and Knowledge Based Systems %D 2002 %T Software Systems for Tabular Data Releases %A Adrian Dobra %A Alan F. Karr %A Ashish P. Sanil %A Stephen E. Fienberg %B Int. Journal of Uncertainty, Fuzziness and Knowledge Based Systems %V 10 %P 529-544 %G eng %0 Journal Article %J Journal of Forecasting %D 2002 %T Statistical Analyses of Freeway Traffic Flows %A Claudia Tebaldi %A Mike West %A Alan F. Karr %B Journal of Forecasting %V 21 %P 39–68 %G eng %0 Journal Article %J Journal of Transportation and Statistics %D 2002 %T Statistically-Based Validation of Computer Simulation Models in Traffic Operations and Management %A Jerome Sacks %A Nagui M. Rouphail %A B. Brian Park %A Piyushimita Thakuriah %K Advanced traffic management systems %K computer simulation %K CORSIM %K model validation %K transportation policy %XThe process of model validation is crucial for the use of computer simulation models in transportation policy, planning, and operations. This article lays out obstacles and issues involved in performing a validation. We describe a general process that emphasizes five essential ingredients for validation: context, data, uncertainty, feedback, and prediction. We use a test bed to generate specific (and general) questions as well as to give concrete form to answers and to the methods used in providing them. The traffic simulation model CORSIM serves as the test bed; we apply it to assess signal-timing plans on a street network of Chicago. The validation process applied in the test bed demonstrates how well CORSIM can reproduce field conditions, identifies flaws in the model, and shows how well CORSIM predicts performance under new (untried) signal conditions. We find that CORSIM, though imperfect, is effective with some restrictions in evaluating signal plans on urban networks.

%B Journal of Transportation and Statistics %V 5 %G eng %0 Journal Article %J Transportation Research Record C %D 2002 %T Variability of travel times on arterial streets: effects of signals and volume %A A. F. Karr %A T.L. Graves %A A. Mockus %A P. Schuster %B Transportation Research Record C %V 10 %P 000-000 %G eng %0 Journal Article %J Journal Comp. Graphical Statistics %D 2002 %T Visual Scalability %A A. F. Karr %A S. G. Eick %B Journal Comp. Graphical Statistics %V 11 %P 22-43 %G eng %0 Journal Article %J INTERACTIONS %D 2002 %T Visualizing Software Changes %A Stephen G. Eick %A Paul Schuster %A Audris Mockus %A Todd L. Graves %A Alan F. Karr %B INTERACTIONS %V 17 %P 29–31 %G eng %0 Journal Article %J Res. Official Statist %D 2001 %T Analysis of aggregated data in survey sampling with application to fertilizer/pesticide usage surveys %A Jaeyong Lee %A Christopher Holloman %A Alan F. Karr %A Ashish P. Sanil %XIn many cases, the public release of survey or census data at fine geographical resolution (for example, counties) may endanger the confidentiality of respondents. A strategy for such cases is to aggregate neighboring regions into larger units that satisfy confidentiality requirements. An aggregation procedure employed in a prototype system for the US National Agricultural Statistics Service is used as context to investigate the impact of aggregation on statistical properties of the data. We propose a Bayesian simulation approach for the analysis of such aggregated data. As a consequence, we are able to specify the type of additional information (such as certain sample sizes) that needs to be released in order to enable the user to perform meaningful analyses with the aggregated data.

%B Res. Official Statist %V 4 %P 11–6 %G eng %0 Journal Article %J Transportation Research Record %D 2001 %T Assessment of Stochastic Signal Optimization Method Using Microsimulation %A Byungkyu Park %A Nagui M. Rouphail %A Jerome Sacks %XA stochastic signal optimization method based on a genetic algorithm (GA-SOM) that interfaces with the microscopic simulation program CORSIM is assessed. A network in Chicago consisting of nine signalized intersections is used as an evaluation test bed. Taking CORSIM as the best representation of reality, the performance of the GA-SOM plan sets a ceiling on how good any (fixed) signal plan can be. An important aspect of this approach is its accommodation of variability. Also discussed is the robustness of an optimal plan under changes in demand. This benchmark is used to assess the best signal plan generated by TRANSYT-7F (T7F), Version 8.1, from among 12 reasonable strategies. The performance of the best T7F plan falls short of the benchmark on several counts, reflecting the need to account for variability in the highly stochastic system of traffic operations, which is not possible under the deterministic conditions intrinsic to T7F. As a sidelight, the performance of the GA-SOM plan within T7F is also computed and it is found to perform nearly as well as the optimum T7F plan.

%B Transportation Research Record %V 1748 %P 40-45 %G eng %R 10.3141/1748-05 %0 Conference Paper %B Concrete Under Severe Conditions, Proceedings of the Third International Conference on Concrete Under Severe Conditions %D 2001 %T Combined effect of cracking and water permeability of fiber-reinforced concrete %A A. F. Karr %A C.-M. Aldea %A J. Rapoport %A S. P. Shah %B Concrete Under Severe Conditions, Proceedings of the Third International Conference on Concrete Under Severe Conditions %P 71–78 %G eng %0 Journal Article %J Statistical Science %D 2001 %T Computer intrusion: detecting masqueraders %A Alan Karr %A William DuMouchel %A Wen-Hua Ju %A Martin Theus %A Yehuda Vardi %K Anomaly %K Bayes %K compression %K computer security %K high-order Markov %K profiling %K Unix %XMasqueraders in computer intrusion detection are people who use somebody else's computer account. We investigate a number of statistical approaches for detecting masqueraders. To evaluate them, we collected UNIX command data from 50 users and then contaminated the data with masqueraders. The experiment was blinded. We show results from six methods, including two approaches from the computer science community.

%B Statistical Science %V 16 %P 1-17 %G eng %0 Journal Article %J IEEE Computer %D 2001 %T Disseminating information but protecting confidentiality %A A. F. Karr %A J. Hernandez %A S. Karimi %A J. Lee %A K. Litwin %A A. Sanil %B IEEE Computer %V 34 %P 36–37 %G eng %0 Conference Paper %B IEEE Transactions on Software Engineering %D 2001 %T Does code decay? Assessing the evidence from change management data %A Stephen G. Eick %A Todd L. Graves %A Alan F. Karr %A J. S. Marron %A Audris Mockus %XA central feature of the evolution of large software systems is that change, which is necessary to add new functionality, accommodate new hardware, and repair faults, becomes increasingly difficult over time. In this paper, we approach this phenomenon, which we term code decay, scientifically and statistically. We define code decay and propose a number of measurements (code decay indices) on software and on the organizations that produce it, that serve as symptoms, risk factors, and predictors of decay. Using an unusually rich data set (the fifteen-plus year change history of the millions of lines of software for a telephone switching system), we find mixed, but on the whole persuasive, statistical evidence of code decay, which is corroborated by developers of the code. Suggestive indications that perfective maintenance can retard code decay are also discussed. Index Terms: software maintenance, metrics, statistical analysis, fault potential, span of changes, effort modeling.

%B IEEE Transactions on Software Engineering %P 1–12 %G eng %0 Conference Paper %B 2001 International Symposium on Advanced Highway Technology %D 2001 %T A Framework for Traffic Simulation Model Validation Procedure Using CORSIM as a Test-Bed %A Park, B. %A N. M. Rouphail %A J. Sacks %B 2001 International Symposium on Advanced Highway Technology %8 2001 %G eng %0 Journal Article %D 2001 %T A Hybrid High-Order Markov Chain Model for Computer Intrusion Detection %A Ju, W.-H. %A Yehuda Vardi %XA hybrid model based mostly on a high-order Markov chain and occasionally on a statistical-independence model is proposed for profiling command sequences of a computer user in order to identify a "signature behavior" for that user. Based on the model, an estimation procedure for such a signature behavior driven by maximum likelihood (ML) considerations is devised. The formal ML estimates are numerically intractable, but the ML-optimization problem can be substituted by a linear inverse problem with positivity constraint (LININPOS), for which the EM algorithm can be used as an equation solver to produce an approximate ML-estimate. The intrusion detection system works by comparing a user’s command sequence to the user’s and others’ estimated signature behaviors in real time through statistical hypothesis testing. A form of likelihood-ratio test is used to detect if a given sequence of commands is from the proclaimed user, with the alternative hypothesis being a masquerader user. Applying the model to real-life data collected from AT&T Labs-Research indicates that the new methodology holds some promise for intrusion detection.
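A minimal sketch of the likelihood-ratio idea described in the abstract above, assuming simple first-order Markov profiles rather than the paper's high-order hybrid model fit via LININPOS/EM; the command histories and the zero decision threshold are made up for illustration:

```python
# Sketch: likelihood-ratio masquerade test with first-order Markov
# "signature" profiles (a simplification; the paper uses a hybrid
# high-order model). All command data here are hypothetical.
from collections import defaultdict
import math

def fit_markov(sequences, alpha=0.5):
    """Smoothed first-order transition log-probabilities."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in sequences:
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    logp = {}
    for a in vocab:
        total = sum(counts[a].values()) + alpha * len(vocab)
        logp[a] = {b: math.log((counts[a][b] + alpha) / total) for b in vocab}
    return logp

def loglik(seq, logp, floor=-20.0):
    """Log-likelihood of a command sequence; unseen transitions get a floor."""
    return sum(logp.get(a, {}).get(b, floor) for a, b in zip(seq, seq[1:]))

user_hist = [["ls", "cd", "ls", "vi", "make", "make", "ls"]]
masq_hist = [["netstat", "ps", "netstat", "ps", "kill"]]
test_seq = ["ps", "netstat", "ps", "kill", "netstat"]

# Positive ratio favors the proclaimed user; negative suggests a masquerader.
lr = loglik(test_seq, fit_markov(user_hist)) - loglik(test_seq, fit_markov(masq_hist))
print("masquerader suspected" if lr < 0.0 else "consistent with user")
```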

%V 10 %P 277-295 %G eng %0 Journal Article %J Statistica Sinica %D 2001 %T Propriety of posteriors with improper priors in hierarchical linear mixed models %A Sun, Dongchu %A Tsutakawa, R. K. %A Z. He %B Statistica Sinica %V 11 %P 77-95 %G eng %0 Conference Paper %B Proceedings dg.o 2001 %D 2001 %T Web-based systems that disseminate information but protect confidential data %A Alan F. Karr %B Proceedings dg.o 2001 %I Digital Government Research Center %P 159–166 %G eng %0 Conference Paper %B Advances in Digital Government. Kluwer, Amsterdam %D 2001 %T Web-Based Systems that Disseminate Information but Protect Confidential Data %A Alan F. Karr %A Ashish P. Sanil %B Advances in Digital Government. Kluwer, Amsterdam %I Kluwer %G eng %0 Generic %D 2001 %T Workshop Report: Affiliates Workshop on Data Quality %A A. F. Karr %A A. P. Sanil %A J. Sacks %A A. Elmagarmid %I National Institute of Statistical Sciences %G eng %0 Generic %D 2001 %T Workshop Report: Workshop on Statistics and Information Technology %A A. F. Karr %A J. Lee %A A. P. Sanil %I National Institute of Statistical Sciences %G eng %0 Book Section %B Molecular Modeling and Prediction of Bioactivity %D 2000 %T Analysis of a Large, High-Throughput Screening Data Using Recursive Partitioning %A Young, S. Stanley %A Jerome Sacks %E Gundertofte, Klaus %E Jørgensen, Flemming Steen %XAs biological drug targets multiply through the human genome project and as the number of chemical compounds available for screening becomes very large, the expense of screening every compound against every target becomes prohibitive. We need to improve the efficiency of the drug screening process so that active compounds can be found for more biological targets and turned over to medicinal chemists for atom-by-atom optimization. We create a method for analysis of the very large, complex data sets coming from high throughput screening, and then integrate the analysis with the selection of compounds for screening so that the structure-activity rules derived from an initial compound set can be used to suggest additional compounds for screening. Cycles of screening and analysis become sequential screening rather than the mass screening of all available compounds. We extend the analysis method to deal with multivariate responses. Previously, a screening campaign might screen hundreds of thousands of compounds; sequential screening can cut the number of compounds screened by up to eighty percent. Sequential screening also gives SAR rules that can be used to mathematically screen compound collections or virtual chemical libraries.

%B Molecular Modeling and Prediction of Bioactivity %I Springer US %P 149-156 %@ 978-1-4613-6857-1 %G eng %U http://dx.doi.org/10.1007/978-1-4615-4141-7_17 %R 10.1007/978-1-4615-4141-7_17 %0 Journal Article %J Statistics in Medicine %D 2000 %T Bayesian Analysis of Mortality Rates with Disease Maps %A Sun, Dongchu %A Tsutakawa, R. K. %A Kim, H. %A Z. He %XThis article summarizes our research on estimation of age-specific and age-adjusted mortality rates for chronic obstructive pulmonary disease (COPD) for white males. Our objectives are more precise and informative displays (than previously available) of geographic variation of the age-specific mortality rates for COPD, and investigation of the relationships between the geographic variation in mortality rates and the corresponding variation in selected covariates. For a given age class, our estimates are displayed in a choropleth map of mean rates. We develop a variation map that identifies the geographical areas where inferences are reliable. Here, the variation is measured by considering a set of maps produced using samples from the posterior distribution of the population mortality rates. Finally, we describe the spatial patterns in the age-specific maps and relate these to patterns in potential explanatory covariates such as smoking rate, annual rainfall, population density, elevation, and measures of air quality.

%B Statistics in Medicine %V 19 %P 2015-2035 %G eng %0 Conference Proceedings %B Proceedings of the National Academy of Sciences of the United States of America %D 2000 %T Bounds for Cell Entries in Contingency Tables Given Marginal Totals and Decomposable Graphs %A Adrian Dobra %A Stephen E. Fienberg %XUpper and lower bounds on cell counts in cross-classifications of nonnegative counts play important roles in a number of practical problems, including statistical disclosure limitation, computer tomography, mass transportation, cell suppression, and data swapping. Some features of the Fréchet bounds are well known, intuitive, and regularly used by those working on disclosure limitation methods, especially those for two-dimensional tables. We previously have described a series of results relating these bounds to theory on loglinear models for cross-classified counts. This paper provides the actual theory and proofs for the special case of decomposable loglinear models and their related independence graphs. It also includes an extension linked to the structure of reducible graphs and a discussion of the relevance of other results linked to nongraphical loglinear models.
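The well-known two-way special case mentioned in the abstract above can be stated compactly; a minimal sketch with made-up margins (the paper's actual contribution, bounds under decomposable loglinear models, is not implemented here):

```python
# Classical two-way Frechet bounds: given row totals r_i, column
# totals c_j, and grand total N, each cell satisfies
#   max(0, r_i + c_j - N) <= n_ij <= min(r_i, c_j).
# The margins below are illustrative numbers only.
row_totals = [20, 30, 50]
col_totals = [40, 60]
N = sum(row_totals)
assert N == sum(col_totals)

for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        lower = max(0, r + c - N)
        upper = min(r, c)
        print(f"cell ({i}, {j}): {lower} <= n_ij <= {upper}")
```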

%B Proceedings of the National Academy of Sciences of the United States of America %V 97 %P 11885-11892 %G eng %0 Journal Article %J Genetics %D 2000 %T Conditional Genotypic Probabilities for Microsatellite Loci %A Graham, Jinko %A Curran, James %A Weir, Bruce %XModern forensic DNA profiles are constructed using microsatellites, short tandem repeats of 2-5 bases. In the absence of genetic data on a crime-specific subpopulation, one tool for evaluating profile evidence is the match probability. The match probability is the conditional probability that a random person would have the profile of interest given that the suspect has it and that these people are different members of the same subpopulation. One issue in evaluating the match probability is population differentiation, which can induce coancestry among subpopulation members. Forensic assessments that ignore coancestry typically overstate the strength of evidence against the suspect. Theory has been developed to account for coancestry; assumptions include a steady-state population and a mutation model in which the allelic state after a mutation event is independent of the prior state. Under these assumptions, the joint allelic probabilities within a subpopulation may be approximated by the moments of a Dirichlet distribution. We investigate the adequacy of this approximation for profiled loci that mutate according to a generalized stepwise model. Simulations suggest that the Dirichlet theory can still overstate the evidence against a suspect with a common microsatellite genotype. However, Dirichlet-based estimators were less biased than the product-rule estimator, which ignores coancestry.
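For readers wanting the formulas behind the "Dirichlet-based estimators" above, a sketch of the standard single-locus match probabilities with coancestry theta (the Balding-Nichols/NRC II form; the paper evaluates this theory against stepwise-mutation simulations, which this sketch does not reproduce). The allele frequencies and theta values below are made up:

```python
# Dirichlet-based single-locus match probabilities with coancestry
# theta. Illustrative values only; not the paper's simulation study.
def match_prob_het(p_i, p_j, theta):
    """P(random subpopulation member shares heterozygote AiAj | suspect has it)."""
    return (2 * (theta + (1 - theta) * p_i)
              * (theta + (1 - theta) * p_j)
            / ((1 + theta) * (1 + 2 * theta)))

def match_prob_hom(p_i, theta):
    """P(random subpopulation member shares homozygote AiAi | suspect has it)."""
    return ((2 * theta + (1 - theta) * p_i)
            * (3 * theta + (1 - theta) * p_i)
            / ((1 + theta) * (1 + 2 * theta)))

# theta = 0 recovers the product rule, which ignores coancestry and
# so understates the match probability (overstating the evidence).
print(match_prob_het(0.1, 0.2, theta=0.0))   # product rule: 2 * p_i * p_j
print(match_prob_het(0.1, 0.2, theta=0.03))  # coancestry-adjusted
print(match_prob_hom(0.1, theta=0.03))
```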

%B Genetics %V 155 %P 1973-1980 %G eng %0 Conference Paper %B Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining %D 2000 %T Defection detection: Using online activity profiles to predict ISP customer vulnerability %A A. F. Karr %A N. Raghavan %A R. Bell %A M. Schonlau %A D. Pregibon %B Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining %P 506–515 %G eng %R 10.1145/347090.347193 %0 Conference Proceedings %B XI Pan American Conference in Traffic and Transportation Engineering %D 2000 %T Direct Signal Timing Optimization: Strategy Development and Results %A Rouphail, N. %A Park, B. %A J. Sacks %B XI Pan American Conference in Traffic and Transportation Engineering %P 19-23 %8 2000 %G eng %0 Journal Article %J ACI Materials Journal %D 2000 %T Estimation of water flow through cracked concrete under load %A A. F. Karr %A C.-M. Aldea %A M. Ghandehari %A S. P. Shah %XThis research studied the relationship between cracking and water permeability of normal-strength concrete under load and compared the experimental results with theoretical models. A feedback-controlled wedge splitting test was used to generate width-controlled cracks. Speckle interferometry was used to record the cracking history. Water permeability of the loaded specimens was evaluated by a low-pressure water permeability test at the designed crack mouth opening displacements (CMODs). Water permeability results were compared with those previously obtained for unloaded specimens for which cracks were induced by a feedback-controlled splitting tension test. The experimental results indicate that water permeability of cracked material significantly increases with increasing crack width. The flow for the same cracking level is repeatable regardless of the procedure used for inducing the cracks. No direct relationship between water flow and crack length was observed, whereas clear relationships existed between CMOD or crack area and flow characteristics. Experimentally measured flow was compared with theoretical models of flow through cracked rocks with parallel walls and a correction factor accounting for the tortuosity of the crack was determined. Calculated flow through cracks induced by a wedge-splitting test provided an acceptable approximation of the measured flow.

%B ACI Materials Journal %V 97 %P 567–575 %G eng %0 Journal Article %J ASTM Cement, Concrete and Aggregates %D 2000 %T Experimental and statistical study of chloride permeability of cracked high strength concrete %A A. F. Karr %A C.-M. Aldea %A J.D. Picka %A S. P. Shah %A S.S. Jaiswal %A T. Igusa %XWithin any cast cylinder of concrete, the coarse aggregate will tend to be inhomogeneously distributed. This variability may arise as a result of segregation caused by gravity or as a result of the wall effect that is caused by the inability of the aggregate to penetrate the walls of the mold. Using methods from image analysis, stereology, and statistics, local estimates of aggregate inhomogeneity are defined that quantify phenomena that have been qualitatively described in the past. These methods involve modification of the two-dimensional images to prepare them for analysis, as well as simple diagnostic statistics for determining the presence of a wall effect. While the techniques presented herein are developed specifically for cast cylinders, they can be generalized to other cast or cored concrete specimens.

%B ASTM Cement, Concrete and Aggregates %V 22 %P 000-000 %8 December %G eng %R 10.1520/CCA10473J %0 Conference Paper %B Proc. 12th Engrg. Mechanics Conf %D 2000 %T Impact of the interfacial transition zone on the chloride permeability of concrete %A A. F. Karr %A S. P. Shah %A S.S. Jaiswal %A B.E. Ankenman %A J.D. Picka %A T. Igusa %B Proc. 12th Engrg. Mechanics Conf %P 1134-1137 %G eng %0 Journal Article %J Transportation %D 2000 %T Micro-simulation of daily activity-travel patterns for travel demand forecasting %A Kitamura, Ryuichi %A Chen, Cynthia %A Pendyala, Ram M. %A Narayanan, Ravi %K daily activity-travel patterns %K forecasting %K micro-simulation %K synthetic travel patterns %XThe development and initial validation results of a micro-simulator for the generation of daily activity-travel patterns are presented in this paper. The simulator assumes a sequential history and time-of-day dependent structure. Its components are developed based on a decomposition of a daily activity-travel pattern into components to which certain aspects of observed activity-travel behavior correspond, thus establishing a link between mathematical models and observational data. Each of the model components is relatively simple and is estimated using commonly adopted estimation methods and existing data sets. A computer code has been developed and daily travel patterns have been generated by Monte Carlo simulation. Study results show that individuals’ daily travel patterns can be synthesized in a practical manner by micro-simulation. Results of validation analyses suggest that properly representing rigidities in daily schedules is important in simulating daily travel patterns.

%B Transportation %I Kluwer Academic Publishers %V 27 %P 25-51 %G eng %U http://dx.doi.org/10.1023/A%3A1005259324588 %R 10.1023/A:1005259324588 %0 Journal Article %J IEEE Transactions on Software Engineering %D 2000 %T Predicting fault incidence using software change history %A A. F. Karr %A S. G. Eick %A T.L. Graves %A J. S. Marron %A H. Siy %K aging %K change history %K degradation %K management of change %K software fault tolerance %K software maintenance %XThis paper is an attempt to understand the processes by which software ages. We define code to be aged or decayed if its structure makes it unnecessarily difficult to understand or change and we measure the extent of decay by counting the number of faults in code in a period of time. Using change management data from a very large, long-lived software system, we explore the extent to which measurements from the change history are successful in predicting the distribution over modules of these incidences of faults. In general, process measures based on the change history are more useful in predicting fault rates than product metrics of the code: For instance, the number of times code has been changed is a better indication of how many faults it will contain than is its length. We also compare the fault rates of code of various ages, finding that if a module is, on the average, a year older than an otherwise similar module, the older module will have roughly a third fewer faults. Our most successful model measures the fault potential of a module as the sum of contributions from all of the times the module has been changed, with large, recent changes receiving the most weight.

%B IEEE Transactions on Software Engineering %V 26 %P 653–661 %G eng %R 10.1109/32.859533 %0 Journal Article %J Cement Concrete and Aggregates %D 2000 %T Quantitative description of coarse aggregate volume fraction gradients %A A. F. Karr %A S.S. Jaiswal %A T. Igusa %A J.D. Picka %A S. P. Shah %XWithin any cast cylinder of concrete, the coarse aggregate will tend to be inhomogeneously distributed. This variability may arise as a result of segregation caused by gravity or as a result of the wall effect that is caused by the inability of the aggregate to penetrate the walls of the mold. Using methods from image analysis, stereology, and statistics, local estimates of aggregate inhomogeneity are defined that quantify phenomena that have been qualitatively described in the past. These methods involve modification of the two-dimensional images to prepare them for analysis, as well as simple diagnostic statistics for determining the presence of a wall effect. While the techniques presented herein are developed specifically for cast cylinders, they can be generalized to other cast or cored concrete specimens.

%B Cement Concrete and Aggregates %V 22 %P 151-159 %G eng %R 10.1520/CCA10473J %0 Book Section %B Generalized Linear Models: A Bayesian Perspective %D 2000 %T Random effects in generalized linear mixed models (GLMMs) %A Sun, Dongchu %A Speckman, Paul %A Tsutakawa, R. K. %B Generalized Linear Models: A Bayesian Perspective %I Marcel Dekker, Inc. %P 23-40 %G eng %0 Journal Article %J Environmetrics %D 2000 %T Regression models for air pollution and daily mortality: analysis of data from Birmingham, Alabama %A Richard L. Smith %A J.M. Davis %A Jerome Sacks %A Speckman, Paul %A P. Styer %K Air Pollutants/adverse effects %K Air Pollutants/analysis %K Air Pollution/adverse effects %K Air Pollution/analysis %K Air Pollution/statistics & numerical data %K Alabama/epidemiology %K Humans %K Mortality %K Poisson Distribution %K Regression Analysis %K Risk %K Sensitivity and Specificity %K Statistical Models %XSeveral recent studies have reported associations between common levels of particulate air pollution and small increases in daily mortality. This study examined whether a similar association could be found in the southern United States, with different weather patterns than the previous studies, and examined the sensitivity of the results to different methods of analysis and covariate control. Data were available in Birmingham, Alabama, from August 1985 through 1988. Regression analyses controlled for weather, time trends, day of the week, and year of study and removed any long-term patterns (such as seasonal and monthly fluctuations) from the data by trigonometric filtering. A significant association was found between inhalable particles and daily mortality in Poisson regression analysis (relative risk = 1.11, 95% confidence interval 1.02-1.20). The relative risk was estimated for a 100 μg/m3 increase in inhalable particles. Results were unchanged when least squares regression was used, when robust regression was used, and under an alternative filtering scheme. Diagnostic plots showed that the filtering successfully removed long wavelength patterns from the data. The generalized additive model, which models the expected number of deaths as nonparametric smoothed functions of the covariates, was then used to ensure adequate control for any nonlinearities in the weather dependence. Essentially identical results for inhalable particles were seen, with no evidence of a threshold down to the lowest observed exposure levels. The association also was unchanged when all days with particulate air pollution levels in excess of the National Ambient Air Quality Standards were deleted. The magnitude of the effect is consistent with recent estimates from Philadelphia, Steubenville, Detroit, Minneapolis, St. Louis, and Utah Valley.

%B Environmetrics %V 11 %P 719-743 %G eng %0 Journal Article %J Concrete Science and Engineering %D 2000 %T Statistical studies of the conductivity of concrete using ASTM C1202-94 %A A. F. Karr %A S.S. Jaiswal %A J.D. Picka %A T. Igusa %A S. P. Shah %A B.E. Ankenman %A P. Styer %B Concrete Science and Engineering %V 2 %P 97-105 %G eng %0 Conference Proceedings %B American Society of Civil Engineers %D 2000 %T Traffic Signal Offset Optimization Using Microscopic Simulation Program with Stochastic Process Model %A Park, B. %A N. M. Rouphail %A J. Sacks %B American Society of Civil Engineers %G eng %0 Journal Article %J Advances in Applied Probability %D 2000 %T Variance Reducing Modifications for Estimators of Standardized Moments of Random Sets %A Jeffrey D. Picka %XIn the statistical analysis of random sets, it is useful to have simple statistics that can be used to describe the realizations of these sets. The cumulants and several other standardized moments such as the correlation and second cumulant can be used for this purpose, but their estimators can be excessively variable if the most straightforward estimation strategy is used. Through exploitation of similarities between this estimation problem and a similar one for a point process statistic, two modifications are proposed. Analytical results concerning the effects of these modifications are found through use of a specialized asymptotic regime. Simulation results establish that the modifications are highly effective at reducing estimator standard deviations for Boolean models. The results suggest that the reductions in variance result from a balanced use of information in the estimation of the first and second moments, through eliminating the use of observations that are not used in second moment estimation.

%B Advances in Applied Probability %V 32 %P 682-700 %G eng %0 Conference Paper %B Proc. 12th Engrg. Mechanics Conf %D 2000 %T Water permeability of cracked concrete %A A. F. Karr %A C.-M. Aldea %A S.S. Jaiswal %A B.E. Ankenman %A J.D. Picka %A T. Igusa %B Proc. 12th Engrg. Mechanics Conf %P 1158–1162 %G eng %0 Journal Article %J Journal of Agricultural Biological and Environmental Statistics %D 1999 %T A bivariate Bayes method for improving the estimates of mortality rates with a twofold conditional autoregressive model %A Woodard, R. %A Sun, Dongchu %A Z. He %A Sheriff, S. %XThe Missouri Turkey Hunting Survey (MTHS) is a post-season mail survey conducted by the Missouri Department of Conservation to monitor and aid in the regulation of the turkey hunting season. Questionnaires are distributed after the hunting season to a simple random sample of persons who purchased permits to hunt wild turkey during the spring season. For the 1996 turkey hunting season 95,801 persons purchased hunting permits. From these individuals a simple random sample of 6,999 hunters was selected for the survey and 5,005 of these responded.

%B Journal of Agricultural Biological and Environmental Statistics %G eng %0 Journal Article %J Journal of Educational and Behavioral Statistics %D 1999 %T Controlling error in multiple comparisons, with special attention to the National Assessment of Educational Progress %A Valerie S. L. Williams %A Lyle V. Jones %A John W. Tukey %XThree alternative procedures to adjust significance levels for multiplicity are the traditional Bonferroni technique, a sequential Bonferroni technique developed by Hochberg (1988), and a sequential approach for controlling the false discovery rate proposed by Benjamini and Hochberg (1995). These procedures are illustrated and compared using examples from the National Assessment of Educational Progress (NAEP). A prominent advantage of the Benjamini and Hochberg (B-H) procedure, as demonstrated in these examples, is the greater invariance of statistical significance for given comparisons over alternative family sizes. Simulation studies show that all three procedures maintain a false discovery rate bounded above, often grossly, by α (or α/2). For both uncorrelated and pairwise families of comparisons, the B-H technique is shown to have greater power than the Hochberg or Bonferroni procedures, and its power remains relatively stable as the number of comparisons becomes large, giving it an increasing advantage when many comparisons are involved. We recommend that results from NAEP State Assessments be reported using the B-H technique rather than the Bonferroni procedure. Two questions often asked about each of a set of observed comparisons are: (a) should we be confident about the direction or the sign of the corresponding underlying population comparison, and (b) for what interval of values should we be confident that it contains the value for the population comparison?
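A minimal sketch of the Benjamini-Hochberg step-up procedure the paper recommends; the p-values below are made up:

```python
# Benjamini-Hochberg step-up: sort the m p-values, find the largest k
# with p_(k) <= (k/m) * q, and reject the hypotheses with the k
# smallest p-values, controlling the false discovery rate at level q.
def benjamini_hochberg(pvalues, q=0.05):
    """Return one reject/accept flag per p-value."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.60]))
```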

%B Journal of Educational and Behavioral Statistics %V 24 %P 42–69 %G eng %0 Journal Article %J Mathematical Geology %D 1999 %T Design and Analysis for Modeling and Predicting Spatial Contamination %A Abt, Markus %A Welch, William J. %A Jerome Sacks %K best linear unbiased prediction %K dioxin contamination %K Gaussian stochastic process %K lognormal kriging %K ordinary kriging %K spatial statistics %XSampling and prediction strategies relevant at the planning stage of the cleanup of environmental hazards are discussed. Sampling designs and models are compared using an extensive set of data on dioxin contamination at Piazza Road, Missouri. To meet the assumptions of the statistical model, such data are often transformed by taking logarithms. Predicted values may be required on the untransformed scale, however, and several predictors are also compared. Fairly small designs turn out to be sufficient for model fitting and for predicting. For fitting, taking replicates ensures a positive measurement error variance and smooths the predictor. This is strongly advised for standard predictors. Alternatively, we propose a predictor linear in the untransformed data, with coefficients derived from a model fitted to the logarithms of the data. It performs well on the Piazza Road data, even with no replication.

%B Mathematical Geology %I Kluwer Academic Publishers-Plenum Publishers %V 31 %P 1-22 %G eng %U http://dx.doi.org/10.1023/A%3A1007504329298 %R 10.1023/A:1007504329298 %0 Journal Article %J ASCE Journal of Materials in Civil Engineering %D 1999 %T Effect of cracking on water and chloride permeability of concrete %A A. F. Karr %A C.-M. Aldea %A S. P. Shah %XThe goal of this research was to study the relationship between cracking and concrete permeability, and to support treating permeability and cracking resistance, in addition to strength, as criteria to be considered in mix design to achieve a durable concrete. The effects of material composition [normal-strength concrete (NSC) and high-strength concrete (HSC) with two different mix designs] and crack width (ranging from 50 to 400 μm) on water and chloride permeability were examined. Cracks of designed widths were induced in the concrete specimens using a feedback-controlled splitting tensile test. Chloride permeability of the cracked samples was evaluated using a rapid chloride permeability test and the water permeability of cracked concrete was then evaluated by a low-pressure water permeability test. Uncracked HSC was less water permeable than NSC, as expected, but cracking changed the material behavior in terms of permeability. Both NSC and HSC were affected by cracking, and the water permeability of cracked samples increased with increasing crack width. Among the tested materials, only for the HSC with a very low water-to-cement ratio was chloride permeability sensitive to cracking. Results indicate that the water permeability is significantly more sensitive than the chloride permeability with respect to the crack widths used in this study.

%B ASCE Journal of Materials in Civil Engineering %V 11 %P 181–187 %G eng %R http://dx.doi.org/10.1061/(ASCE)0899-1561(1999)11:3(181) %0 Journal Article %J Transportation Research Record %D 1999 %T Effect of microcracking on durability of high strength concrete %A A. F. Karr %A C.-M. Aldea %A S. P. Shah %XThe relationship between cracking and chloride and water permeability of high-strength concrete (HSC) was studied. Two different mix designs were used: HSC_1 (w/b = 0.31) and HSC_2 (w/b = 0.25). The effects of crack width and sample thickness on permeability were examined. Cracks of designed widths were induced in the concrete specimens using the feedback-controlled splitting tensile test. Chloride permeability of the cracked samples was evaluated by using a rapid chloride permeability test. The water permeability of cracked concrete was then evaluated by a low-pressure water permeability test. Among the materials tested, only for the high-strength concrete with a very low water-to-cement ratio is conductivity sensitive to cracking. The water permeability of cracked HSC significantly increases with increasing crack width. Among the parameters considered, crack parameters significantly affect water permeability, and there is little thickness effect. The results indicate that the water permeability is significantly more sensitive than conductivity with respect to the crack widths used.

%B Transportation Research Record %V 1668 %P 86-90 %G eng %R 10.3141/1668-13 %0 Journal Article %J Papers in Regional Science %D 1999 %T Estimation of Demand due to Welfare Reform %A Sen, Ashish %A P. Metaxatos %A Sööt, Siim %A Piyushimita Thakuriah %B Papers in Regional Science %V 78 %P 195–211 %G eng %0 Journal Article %J Environmetrics %D 1999 %T Meteorologically-dependent trends in urban ozone %A Huang, Li-Shan %A Richard L. Smith %K ANOVA %K empirical Bayes %K regression tree %XOzone concentrations are affected by precursor emissions and by meteorological conditions. As part of a broad study to assess the effects of standards imposed by the U.S. Environmental Protection Agency (EPA), it is of interest to analyze trends in ozone after adjusting for meteorological influences. Previous papers have studied this problem for ozone data from Chicago, using a variety of regression techniques. This paper presents a different approach, in which the meteorological influence is treated nonlinearly through a regression tree. A particular advantage of this approach is that it allows us to consider different trends within the clusters produced by the regression tree analysis. The variability of trend estimates between clusters is reduced by applying an empirical Bayes adjustment. The results confirm the findings of previous authors that there is an overall downward trend in Chicago ozone values, but they also go beyond previous analyses by showing that the trend is stronger at higher levels of ozone.

%B Environmetrics %V 10 %P 103–118 %G eng %0 Journal Article %D 1999 %T Network equilibrium models of travel choices with multiple classes %A Boyce, D. E. %A Bar-Gera, H. %G eng %0 Journal Article %J Materials and Structures %D 1999 %T Permeability of cracked concrete %A A. F. Karr %A C.-M. Aldea %A S. P. Shah %XThe goal of the research presented here was to study the relationship between cracking and water permeability. A feedback-controlled test was used to generate width-controlled cracks. Water permeability was evaluated by a low-pressure water permeability test. The factors chosen for the experimental design were material type (paste, mortar, normal and high strength concrete), thickness of the sample and average width of the induced cracks (ranging from 50 to 350 micrometers). The water permeability test results indicated that the relationships between permeability and material type differ for uncracked and cracked material, and that there was little thickness effect. Permeability of uncracked material decreased from paste, mortar, normal strength concrete (NSC) to high strength concrete (HSC). Water permeability of cracked material significantly increased with increasing crack width. For cracks above 100 microns, NSC showed the highest permeability coefficient, whereas mortar showed the lowest one.

%B Materials and Structures %V 32 %P 370-376 %G eng %R 10.1007/BF02479629 %0 Conference Paper %B Proceedings of the International Symposium on High Performance and Reactive Powder Concretes %D 1999 %T Permeability of cracked high strength concrete %A A. F. Karr %A C.-M. Aldea %A S. P. Shah %E P. C. Aïtcin %E Y. Delagrave %XThe goal of the research presented here was to study the relationship between cracking and water permeability. A feedback-controlled test was used to generate width-controlled cracks. Water permeability was evaluated by a low-pressure water permeability test. The factors chosen for the experimental design were material type (paste, mortar, normal and high strength concrete), thickness of the sample and average width of the induced cracks (ranging from 50 to 350 micrometers). The water permeability test results indicated that the relationships between permeability and material type differ for uncracked and cracked material, and that there was little thickness effect. Permeability of uncracked material decreased from paste, mortar, normal strength concrete (NSC) to high strength concrete (HSC). Water permeability of cracked material significantly increased with increasing crack width. For cracks above 100 microns, NSC showed the highest permeability coefficient, whereas mortar showed the lowest one.

%B Proceedings of the International Symposium on High Performance and Reactive Powder Concretes %P 211-219 %G eng %0 Journal Article %J Biometrika %D 1999 %T Posterior distribution of hierarchical models using CAR(1) distributions %A Sun, Dongchu %A Tsutakawa, R. K. %A Speckman, Paul %K Gibbs sampling %K Linear mixed model %K Multivariate normal %K Partially informative normal distribution %XWe examine properties of the conditional autoregressive model, or CAR(1) model, which is commonly used to represent regional effects in Bayesian analyses of mortality rates. We consider a Bayesian hierarchical linear mixed model where the fixed effects have a vague prior such as a constant prior and the random effect follows a class of CAR(1) models including those whose joint prior distribution of the regional effects is improper. We give sufficient conditions for the existence of the posterior distribution of the fixed and random effects and variance components. We then prove the necessity of the conditions and give a one-way analysis of variance example where the posterior may or may not exist. Finally, we extend the result to the generalised linear mixed model, which includes as a special case the Poisson log-linear model commonly used in disease mapping.

%B Biometrika %V 86 %P 341-350 %G eng %R 10.1093/biomet/86.2.341 %0 Book Section %D 1999 %T Probe-based surveillance for travel time information in ITS %A A. F. Karr %A P. Thakuriah %A A. Sen %E R. Emmerink %E P. Nijkamp %I Ashgate Publishing Ltd %P 393-425 %G eng %& 17 %0 Conference Proceedings %B 14th International Symposium on Transportation and Traffic Theory %D 1999 %T Route flow entropy maximization in origin-based traffic assignment %A Bar-Gera, H. %A Boyce, D. E. %B 14th International Symposium on Transportation and Traffic Theory %I Elsevier Science %G eng %0 Journal Article %J International Transactions in Operational Research %D 1999 %T Variances of link travel time estimates: Implications for optimal routes %A A. F. Karr %A A. Sen %A P. Thakuriah %A X. Zhu %K Advanced Traveler Information System %K Covariance of travel times %K Dependence in travel time observations %K Intelligent Transportation System %K Probe vehicles %K Variance of travel time estimates %K Vehicle simulation model %XIn this paper, we explore the consequences of using link travel time estimates with high variance to compute the minimum travel time route between an origin and destination pair. Because of platoon formation or for other reasons, vehicles on a link separated by small headways tend to have similar travel times. In other words, the covariance of link travel times of distinct vehicles which are close together may not be zero. It follows that the variance of the mean of travel times obtained from a sample of n vehicles on the same link over small time intervals is of the form a+b/n, where a and b would usually be positive. This result has an important implication for the quality of road network travel time information given by Intelligent Transportation Systems (ITS): that the variance of the estimate of mean travel time does not go to zero with increasing n. Thus the quality of information disseminated by ITS is not necessarily improved by increasing the market penetration of vehicles monitoring the system with the necessary equipment (termed probe vehicles). Estimates of a and b for a set of links are presented in the paper and consequences for probe-based ITS are explored by means of a simulation of such a system which is operational on an actual network.
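The a+b/n form follows directly if travel times of vehicles on the same link are positively equicorrelated; a sketch under that assumption (the paper estimates a and b from data, which is not attempted here), with made-up variance and covariance values:

```python
# If n travel times each have variance s2 and common pairwise
# covariance c > 0 (platooning), then
#   Var(mean) = (1/n^2) * (n*s2 + n*(n-1)*c) = c + (s2 - c)/n,
# i.e. a = c and b = s2 - c, so the variance never drops below c
# no matter how many probe vehicles report. Values are illustrative.
def var_of_mean(s2, c, n):
    return c + (s2 - c) / n

s2, c = 100.0, 16.0  # seconds^2, hypothetical
for n in (1, 10, 100, 10**6):
    print(n, var_of_mean(s2, c, n))  # approaches a = c = 16.0, not 0
```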

%B International Transactions in Operational Research %V 6 %P 75-87 %8 January %G eng %R 10.1111/j.1475-3995.1999.tb00144.x %0 Conference Proceedings %B Transportation Networks: Recent Methodological Advances. Selected Proceedings of the 4th EURO Transportation Meeting %D 1999 %T Variational Inequality Model of Ideal Dynamic User-Optimal Route Choice %A Boyce, D. E. %A Lee, D.-H. %A Janson, B.N. %K Advanced traffic management systems %K Advanced Traveler Information Systems %K Links (Networks) %K Route choice %K Variational inequalities (Mathematics) %XAn ideal dynamic user-optimal (DUO) route choice model is described for predicting dynamic traffic conditions, as required for off-line evaluation of Advanced Traffic Management Systems and Advanced Traveler Information Systems. The model is formulated as a variational inequality (VI), a general way of describing a dynamic network equilibrium. Although route-based VI models have an intuitive interpretation, their computational complexity makes them intractable for real applications. Consequently, the proposed model is formulated as a link-based variational inequality for use in large-scale implementations. Using the diagonalization technique with discrete time intervals, the model is solved to a specified level of convergence. Computational results for a real, large-scale traffic network are presented.

%B Transportation Networks: Recent Methodological Advances. Selected Proceedings of the 4th EURO Transportation Meeting %C Newcastle, England %P 289-302 %G eng %0 Journal Article %J Papers in Regional Science %D 1999 %T Welfare reform and spatial matching between clients and jobs %A Sen, Ashish %A Metaxatos, Paul %A Sööt, Siim %A Thakuriah, Vonu %K C12 %K C13 %K C51 %K C52 %K I31 %K J23 %K R12 %K R41 %K R53 %K entry-level job openings %K targeted service %K travel demand %K welfare to work %XThe recent Welfare Reform Act requires several categories of public assistance recipients to transition to the work force. In most metropolitan areas public assistance clients reside great distances from areas of entry-level jobs. Any program designed to provide access to these jobs, for those previously on public aid, needs relevant transportation services when the job search process begins. Therefore it is essential that the latent demand for commuting among public aid clients be assessed in developing public transportation services. The location of entry-level jobs must also be known or, as in this article, estimated using numerous data sources. This article reports on such a demand estimation effort, focusing primarily on the use of Regional Science methods.

%B Papers in Regional Science %I Springer-Verlag %V 78 %P 195-211 %G eng %U http://dx.doi.org/10.1007/s101100050021 %R 10.1007/s101100050021 %0 Journal Article %J Journal of the American Statistical Association %D 1998 %T Bayesian Inference on Network Traffic Using Link Count Data %A Claudia Tebaldi %A Michael West %XWe study Bayesian models and methods for analysing network traffic counts in problems of inference about the traffic intensity between directed pairs of origins and destinations in networks. This is a class of problems very recently discussed by Vardi in a 1996 JASA article and is of interest in both communication and transportation network studies. The current article develops the theoretical framework of variants of the origin-destination flow problem and introduces Bayesian approaches to analysis and inference. In the first, the so-called fixed routing problem, traffic or messages pass between nodes in a network, with each message originating at a specific source node, and ultimately moving through the network to a predetermined destination node. All nodes are candidate origin and destination points. The framework assumes no travel time complications, considering only the number of messages passing between pairs of nodes in a specified time interval. The route count, or route flow, problem is to infer the set of actual number of messages passed between each directed origin-destination pair in the time interval, based on the observed counts flowing between all directed pairs of adjacent nodes. Based on some development of the theoretical structure of the problem and assumptions about prior distributional forms, we develop posterior distributions for inference on actual origin-destination counts and associated flow rates. This involves iterative simulation methods, or Markov chain Monte Carlo (MCMC), that combine Metropolis-Hastings steps within an overall Gibbs sampling framework. We discuss issues of convergence and related practical matters, and illustrate the approach in a network previously studied in Vardi’s article. We explore both methodological and applied aspects much further in a concrete problem of a road network in North Carolina, studied in transportation flow assessment contexts by civil engineers. This investigation generates critical insight into limitations of statistical analysis, and particularly of non-Bayesian approaches, due to inherent structural features of the problem. A truly Bayesian approach, imposing partial stochastic constraints through informed prior distributions, offers a way of resolving these problems and is consistent with prevailing trends in updating traffic flow intensities in this field. Following this, we explore a second version of the problem that introduces elements of uncertainty about routes taken by individual messages in terms of Markov selection of outgoing links for messages at any given node. For specified route choice probabilities, we introduce the concept of a super-network, namely, a fixed routing problem in which the stochastic problem may be embedded. This leads to solution of the stochastic version of the problem using the methods developed for the original formulation of the fixed routing problem. This is also illustrated.
Finally, we discuss various related issues and model extensions, including inference on stochastic route choice selection probabilities, questions of missing data and partially observed link counts, and relationships with current research on road traffic network problems in which travel times within links are nonnegligible and may be estimated from additional data.

%B Journal of the American Statistical Association %V 93 %P 557-573 %8 06/1998 %G eng %U http://www.jstor.org/stable/2670105 %0 Book Section %B Statistics in Science and Technology: Case Studies 4 %D 1998 %T Bayesian Mixture Models in Exploration of Structure-Activity Relationships in Drug Design %A Susan Paddock %A Michael West %A S. Stanley Young %A M. Clyde %B Statistics in Science and Technology: Case Studies 4 %I Springer-Verlag %G eng %0 Book Section %B Case Studies in Environmental Statistics %D 1998 %T Categorical Exposure-Response Regression Analysis of Toxicology Experiments %A Xie, Minge %A Simpson, Douglas %E Nychka, Douglas %E Piegorsch, Walter W. %E Lawrence H. Cox %XIn the mid-1980s, an accident at the Union Carbide pesticides plant in Bhopal, India released the toxic gas methylisocyanate (MIC) in that densely populated region, killing more than 4000 people and injuring 500,000 others. Even today, many people in Bhopal are affected by illnesses related to that earlier exposure. This notorious industrial disaster not only forced scientists to pay greater attention to identifying and handling of hazardous chemicals but also prompted greater awareness of those common industrial products that contain hazardous pollutants.

%B Case Studies in Environmental Statistics %S Lecture Notes in Statistics %I Springer US %V 132 %P 121-141 %@ 978-0-387-98478-0 %G eng %U http://dx.doi.org/10.1007/978-1-4612-2226-2_7 %R 10.1007/978-1-4612-2226-2_7 %0 Journal Article %J Journal of the Royal Statistical Society: Series C %D 1998 %T Circuit optimization via sequential computer experiments: design of an output buffer %A Aslett, Robert %A Buck, Robert J. %A Duvall, Steven G. %A Jerome Sacks %A Welch, William J. %K Circuit simulator %K Computer code %K Computer model %K Engineering design %K Parameter design %K Stochastic process %K Visualization %XIn electrical engineering, circuit designs are now often optimized via circuit simulation computer models. Typically, many response variables characterize the circuit’s performance. Each response is a function of many input variables, including factors that can be set in the engineering design and noise factors representing manufacturing conditions. We describe a modelling approach which is appropriate for the simulator’s deterministic input–output relationships. Non-linearities and interactions are identified without explicit assumptions about the functional form. These models lead to predictors to guide the reduction of the ranges of the designable factors in a sequence of experiments. Ultimately, the predictors are used to optimize the engineering design. We also show how a visualization of the fitted relationships facilitates an understanding of the engineering trade-offs between responses. The example used to demonstrate these methods, the design of a buffer circuit, has multiple targets for the responses, representing different trade-offs between the key performance measures.

%B Journal of the Royal Statistical Society: Series C %V 47 %P 31-48 %G eng %0 Conference Paper %B Proceedings of the 20th International Conference on Software Engineering %D 1998 %T An Empirical Study of Regression Test Selection Techniques %A Todd L. Graves %A Harrold, Mary Jean %A Kim, Jung-Min %A Adam Porter %A Rothermel, Gregg %B Proceedings of the 20th International Conference on Software Engineering %S ICSE ’98 %I IEEE Computer Society %C Washington, DC, USA %P 188–197 %@ 0-8186-8368-6 %G eng %U http://dl.acm.org/citation.cfm?id=302163.302182 %0 Journal Article %J Mathematical and Computer Modelling %D 1998 %T Estimation of static travel times in a dynamic route guidance system—II %A Sen, Ashish %A Sööt, Siim %A Piyushimita Thakuriah %A Condie, Helen %K Advanced Traveler Information Systems %K Dynamic Route Guidance %K Link travel times %K Static estimates %XIn an earlier paper a method for computing static profiles of link travel times was given. In this paper, the centrality of such profiles for ATIS is examined and the methods given in the earlier paper are applied to actual data. Except for a minor, easily correctable problem, the methods are shown to work very well under real-life conditions.

%B Mathematical and Computer Modelling %V 27 %P 67–85 %G eng %R 10.1016/S0895-7177(98)00052-1 %0 Journal Article %J Lecture Notes-Monograph Series %D 1998 %T Global versus Local Search in Constrained Optimization of Computer Models %A M. Schonlau %A Welch, William J. %A Jones, Donald R. %K Bayesian global optimization %K Computer code %K sequential design %K Stochastic process %XEngineering systems are now frequently optimized via computer models. The input-output relationships in these models are often highly nonlinear deterministic functions that are expensive to compute. Thus, when searching for the global optimum, it is desirable to minimize the number of function evaluations. Bayesian global optimization methods are well-suited to this task because they make use of all previous evaluations in selecting the next search point. A statistical model is fit to the sampled points which allows predictions to be made elsewhere, along with a measure of possible prediction error (uncertainty). The next point is chosen to maximize a criterion that balances searching where the predicted value of the function is good (local search) with searching where the uncertainty of prediction is large (global search). We extend this methodology in several ways. First, we introduce a parameter that controls the local-global balance. Secondly, we propose a method for dealing with nonlinear inequality constraints from additional response variables. Lastly, we adapt the sequential algorithm to proceed in stages rather than one point at a time. The extensions are illustrated using a shape optimization problem from the automotive industry.
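A sketch of the expected-improvement criterion underlying this line of work; the paper generalizes it with a parameter controlling the local-global balance and adds constraint handling, neither of which is implemented here, and the surrogate mean and standard deviation below are made-up numbers:

```python
# Standard expected improvement for minimization: at a candidate
# point, given the surrogate's predicted mean mu and standard
# deviation sigma and the best sampled value f_min,
#   EI = (f_min - mu) * Phi(z) + sigma * phi(z),  z = (f_min - mu)/sigma.
# Large EI comes from a good predicted value (local search) or large
# uncertainty (global search). Inputs below are illustrative only.
import math

def expected_improvement(mu, sigma, f_min):
    if sigma <= 0.0:
        return max(f_min - mu, 0.0)
    z = (f_min - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal PDF
    return (f_min - mu) * Phi + sigma * phi

f_min = 1.0  # best objective value observed so far (hypothetical)
print(expected_improvement(mu=0.8, sigma=0.1, f_min=f_min))  # good mean
print(expected_improvement(mu=1.5, sigma=2.0, f_min=f_min))  # high uncertainty
```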

%B Lecture Notes-Monograph Series %V 34 %P 11-25 %G eng %0 Book Section %D 1998 %T Good Statistical Practice %A Alan Karr %E C. E. Minder %E F. Friedl %I Austrian Statistical Society %P 175–179 %G eng %& Modeling software changes %0 Journal Article %J Statistics and Computing %D 1998 %T A hybrid Markov chain for the Bayesian analysis of the multinomial probit model %A Nobile, Agostino %K Bayesian analysis %K Gibbs sampling %K Metropolis algorithm %K Multinomial probit model %XBayesian inference for the multinomial probit model, using the Gibbs sampler with data augmentation, has been recently considered by some authors. The present paper introduces a modification of the sampling technique, by defining a hybrid Markov chain in which, after each Gibbs sampling cycle, a Metropolis step is carried out along a direction of constant likelihood. Examples with simulated data sets motivate and illustrate the new technique. A proof of the ergodicity of the hybrid Markov chain is also given.

%B Statistics and Computing %I Kluwer Academic Publishers %V 8 %P 229-242 %G eng %U http://dx.doi.org/10.1023/A%3A1008905311214 %R 10.1023/A:1008905311214 %0 Conference Paper %B Fifth International Software Metrics Symposium (Metrics 1998), Proceedings %D 1998 %T Inferring change effort from configuration management databases %A T.L. Graves %A A. Mockus %XIn this paper we describe a methodology and algorithm for historical analysis of the effort necessary for developers to make changes to software. The algorithm identifies factors which have historically increased the difficulty of changes. This methodology has implications for research into cost drivers. As an example of a research finding, we find that a system under study was “decaying” in that changes grew more difficult to implement at a rate of 20% per year. We also quantify the difference in costs between changes that fix faults and additions of new functionality: fixes require 80% more effort after accounting for size. Since our methodology adds no overhead to the development process, we also envision it being used as a project management tool: for example, developers can identify code modules which have grown more difficult to change than previously, and can match changes to developers with appropriate expertise. The methodology uses data from a change management system, supported by monthly time sheet data if available. The method’s performance does not degrade much when the quality of the time sheet data is limited. We validate our results using a survey of the developers under study: the change efforts resulting from the algorithm match the developers’ opinions. Our methodology includes a technique based on the jackknife to determine factors that contribute significantly to change effort.
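The multiplicative reading of the findings above (fixes roughly 80% harder, difficulty growing roughly 20% per year) suggests a log-linear effort regression; a sketch on synthetic data, not the paper's algorithm, which also imputes effort from time sheets and applies a jackknife:

```python
# Log-linear effort model: regress log(effort) on log(size), a
# fault-fix indicator, and calendar time, so exp(coefficient) reads
# as a multiplicative effect (e.g., a fix multiplier near 1.8, a
# per-year multiplier near 1.2). All data below are synthetic.
import numpy as np

# columns: log(lines changed), is_fault_fix, years since start
X = np.array([[np.log(120), 0, 0.0],
              [np.log(30),  1, 0.5],
              [np.log(400), 0, 1.0],
              [np.log(45),  1, 2.0],
              [np.log(200), 1, 2.5],
              [np.log(80),  0, 3.0]])
log_effort = np.array([1.1, 1.0, 2.0, 1.9, 2.8, 1.9])

A = np.column_stack([np.ones(len(X)), X])  # add intercept column
beta, *_ = np.linalg.lstsq(A, log_effort, rcond=None)
print("fix multiplier:     ", np.exp(beta[2]))
print("per-year multiplier:", np.exp(beta[3]))
```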

%B Fifth International Software Metrics Symposium (Metrics 1998), Proceedings %P 267-273 %8 Nov %G eng %R 10.1109/METRIC.1998.731253 %0 Book Section %B Case Studies in Environmental Statistics %D 1998 %T Introduction: Problems in Environmental Monitoring and Assessment %A Lawrence H. Cox %A Nychka, Douglas %A Piegorsch, Walter W. %XThe need for innovative statistical methods for modern environmental assessment is undisputed. The case studies in this book are a sampling of the broad sweep of statistical applications available in the environmental sciences, targeted to environmental monitoring and assessment. A unique feature of the applications presented here is that they are not isolated projects but were, instead, fostered under a long-term collaborative association between the U.S. Environmental Protection Agency (EPA) and the National Institute of Statistical Sciences (NISS). This institutional support resulted in a strong interdisciplinary component to the research, and common threads of statistical methodology and data analysis principles are seen across all of the projects. The case studies necessarily are detailed and technical and so this introductory chapter will give an overview of what follows and emphasize common themes that tie the projects together. Research, by its very nature, does not follow a direct path and depends on past results for the next step. This process is enriched through the collaboration of statisticians with other scientists.

%B Case Studies in Environmental Statistics %S Lecture Notes in Statistics %I Springer US %V 132 %P 1-4 %@ 978-0-387-98478-0 %G eng %U http://dx.doi.org/10.1007/978-1-4612-2226-2_1 %R 10.1007/978-1-4612-2226-2_1 %0 Journal Article %J Journal of Educational Measurement %D 1998 %T Projecting to the NAEP Scale: Results from the North Carolina End-of-Grade Testing Program %A Williams, Valerie %A Billeaud, Kathleen %A Davis, Lori A. %A Thissen, David %A Sanford, Eleanor E. %XData from the North Carolina End-of-Grade test of eighth-grade mathematics are used to estimate the achievement results on the scale of the National Assessment of Educational Progress (NAEP) Trial State Assessment. Linear regression models are used to develop projection equations to predict state NAEP results in the future, and the results of such predictions are compared with those obtained in the 1996 administration of NAEP. Standard errors of the parameter estimates are obtained using a bootstrap resampling technique.

%B Journal of Educational Measurement %V 35 %P 277-296 %G eng %0 Book Section %B Knowledge and Networks in a Dynamic Economy %D 1998 %T Roadway Incident Analysis with a Dynamic User-Optimal Route Choice Model %A Boyce, D. E. %A Lee, D.-H. %A Janson, B.N. %E Beckmann, Martin J. %E Johannsson, Börje %E Snickars, Folke %E Thord, Roland %XThe transportation system conveys interdependencies. When analysing the costs and benefits of transport investment projects, it is therefore necessary to address the question of linkages among projects. Such linkages can occur in terms of economies of scale arising from the combination of projects during the construction phase. Intelligent Transportation Systems (ITS), also known as Intelligent Vehicle Highway Systems (IVHS), are applying advanced technologies (such as navigation, automobile, computer science, telecommunication, electronic engineering, automatic information collection and processing) in an effort to bring revolutionary improvements in traffic safety, network capacity utilization, vehicle emission reductions, travel time and fuel consumption savings, etc. Within the framework of ITS, Advanced Traffic Management Systems (ATMS) and Advanced Traveler Information Systems (ATIS) both aim to manage and predict traffic congestion and provide historical and real-time network-wide traffic information to support drivers’ route choice decisions. To enable ATMS/ATIS to achieve the above-described goals, traffic flow prediction models are needed for system operation and evaluation. Linkages may also arise in supply through interaction among network components, or among the producers of transportation services. Linkages may also emerge in demand through the creation of new opportunities for interaction.

%B Knowledge and Networks in a Dynamic Economy %I Springer Berlin Heidelberg %P 371-390 %@ 978-3-642-64350-7 %G eng %U http://dx.doi.org/10.1007/978-3-642-60318-1_21 %R 10.1007/978-3-642-60318-1_21 %0 Book Section %D 1998 %T SoftStat '97: Advances in Statistical Software 6 %A A. F. Karr %A G. Eick %A A. Mockus %A T.L. Graves %E W. Badilla %E F. Faulbaum %I Lucius & Lucius %P 3-10 %G eng %& Web-based text visualization %0 Journal Article %J Transportation Research Record %D 1998 %T Transportation Planning Process for Linking Welfare Recipients to Jobs %A Metaxatos, Paul %A Sööt, Siim %A Piyushimita Thakuriah %A Sen, Ashish %B Transportation Research Record %V 1626 %P 149-158 %G eng %0 Journal Article %D 1998 %T Traveler Destination Choice Behavior: Effects of Time of Day, Activity Duration and Home Location %A Kitamura, Ryuichi %A Chen, Cynthia %A Narayanan, Ravi %K Choice models %K Hypothesis testing %K Logits %K Multinomial logits %K Origin and destination %K Periods of the day %K Residential location %K Time duration %XMultinomial logit destination choice models are developed and the following hypotheses are examined: (a) time of day affects destination choice behavior, (b) the duration of stay at the destination affects destination choice, and (c) home location affects non-home-based destination choice. The statistical results offer strong evidence in support of the hypotheses.

%P 76-81 %@ 0309065178 %G eng %0 Journal Article %J World Wide Web %D 1998 %T A Web laboratory for software data analysis %A G. Eick %A A. Mockus %A T.L. Graves %A A. F. Karr %XWe describe two prototypical elements of a World Wide Web-based system for visualization and analysis of data produced in the software development process. Our system incorporates interactive applets and visualization techniques into Web pages. A particularly powerful example of such an applet, SeeSoft™, can display thousands of lines of text on a single screen, allowing detection of patterns not discernible directly from the text. In our system, Live Documents replace static statistical tables in ordinary documents by dynamic Web-based documents, in effect allowing the “reader” to customize the document as it is read. Use of the Web provides several advantages. The tools access data from a very large central data base, instead of requiring that it be downloaded; this ensures that readers are always working with the most up-to-date version of the data, and relieves readers of the responsibility of preparing data for their use. The tools encourage collaborative research, as one researcher’s observations can easily be replicated and studied in greater detail by other team members. We have found this particularly useful while studying software data as part of a team that includes researchers in computer science, software engineering, and statistics, as well as development managers. Live documents will also help the Web revolutionize scientific publication, as papers published on the Web can contain Java applets that permit readers to confirm the conclusions reached by the authors’ statistical analyses.

%B World Wide Web %V 1 %P 55-60 %G eng %R 10.1023/A:1019299211575 %0 Book Section %B Case Studies in Environmental Statistics %D 1998 %T Workshop: Statistical Methods for Combining Environmental Information %A Lawrence H. Cox %E Nychka, Douglas %E Piegorsch, Walter W. %E Lawrence H. Cox %XPrimary objectives of the NISS-USEPA cooperative research agreement were to identify important environmental problems to which statistical science could contribute, to perform interdisciplinary research on these problems and stimulate related research and problem identification within the broader statistical community, to assess important examples and areas of environmetric research, and to identify new research problems and directions. To provide a forum for identifying and examining new research and problem areas, a NISS-USEPA workshop series was established within the cooperative research program.

%B Case Studies in Environmental Statistics %S Lecture Notes in Statistics %I Springer US %V 132 %P 143-158 %@ 978-0-387-98478-0 %G eng %U http://dx.doi.org/10.1007/978-1-4612-2226-2_8 %R 10.1007/978-1-4612-2226-2_8 %0 Journal Article %J Environmental and Ecological Statistics %D 1997 %T Bayes methods for combining disease and exposure data in assessing environmental justice %A Waller, Lance A. %A Louis, Thomas A. %A Carlin, Bradley P. %K environmental equity %K hierarchical model %K Markov chain Monte Carlo %K regulation %XEnvironmental justice reflects the equitable distribution of the burden of environmental hazards across various sociodemographic groups. The issue is important in environmental regulation, siting of hazardous waste repositories and prioritizing remediation of existing sources of exposure. We propose a statistical framework for assessing environmental justice. The framework includes a quantitative assessment of environmental equity based on the cumulative distribution of exposure within population subgroups linked to disease incidence through a dose-response function. This approach avoids arbitrary binary classifications of individuals solely as ’exposed’ or ’unexposed’. We present a Bayesian inferential approach, implemented using Markov chain Monte Carlo methods, that accounts for uncertainty in both exposure and response. We illustrate our method using data on leukemia deaths and exposure to toxic chemical releases in Allegheny County, Pennsylvania.

%B Environmental and Ecological Statistics %I Kluwer Academic Publishers %V 4 %P 267-281 %G eng %U http://dx.doi.org/10.1023/A%3A1018586715034 %R 10.1023/A:1018586715034 %0 Journal Article %J Ecological Modeling %D 1997 %T Characterization of Parameters in Mechanistic Models: A Case Study of PCB Fate and Transport in Surface Waters %A Steinberg, Laura J. %A Reckhow, Kenneth H. %A Wolpert, Robert L. %B Ecological Modeling %V 97 %G eng %N 1 %0 Journal Article %J Journal of Transportation Engineering, ASCE %D 1997 %T Frequency of probe vehicle reports and variances of link travel time estimates %A A. Sen %A P. Thakuriah %A X. Zhu %A A. F. Karr %XAn important design issue relating to probe-based Advanced Traveler Information Systems (ATISs) and Advanced Traffic Management Systems is the sample size of probes (or the number of link traversals by probe vehicles) per unit time used in order to obtain reliable network information in terms of link travel time estimates. The variance of the mean of travel times obtained from n probes for the same link over a fixed time period may be shown to be of the form a+b/n where a and b are link-specific parameters. Using probe travel time data from a set of signalized arterials, it is shown that a is positive for well-traveled signalized links. This implies that the variance does not go to zero with increasing n. Consequences of this fact for probe-based systems are explored. While the results presented are for a specific set of links, we argue that because of the nature of the underlying travel time process, the broad conclusions would hold for most well-traveled links with signal control.
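The a + b/n variance form can be fitted by ordinary least squares with 1/n as the regressor; a positive intercept a is a variance floor that no number of probes removes. A sketch on invented numbers (not the paper's data):

    import numpy as np

    n = np.array([1, 2, 4, 8, 16, 32], dtype=float)     # probe reports per period
    var_hat = np.array([9.8, 6.1, 4.0, 3.1, 2.6, 2.3])  # hypothetical variance estimates

    X = np.column_stack([np.ones_like(n), 1.0 / n])
    (a, b), *_ = np.linalg.lstsq(X, var_hat, rcond=None)
    print(a, b)   # a > 0: the variance tends to a, not 0, as n grows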

%B Journal of Transportation Engineering, ASCE %V 123 %P 290-297 %G eng %R http://dx.doi.org/10.1061/(ASCE)0733-947X(1997)123:4(290) %0 Conference Paper %B Brittle Matrix Composites - International Symposium %D 1997 %T Influence of microstructure and fracture on the transport properties in cement-based materials %A S. Jaiswal %A T. Igusa %A T. Styer %A A. F. Karr %B Brittle Matrix Composites - International Symposium %V 5 %P 199-220 %G eng %0 Journal Article %J Cement Concrete Res. %D 1997 %T Permeability study of cracked concrete %A K. Wang %A D.C. Jansen %A S. P. Shah %A A. F. Karr %XCracks in concrete generally interconnect flow paths and increase concrete permeability. The increase in concrete permeability due to the progression of cracks allows more water or aggressive chemical ions to penetrate into the concrete, facilitating deterioration. The present work studies the relationship between crack characteristics and concrete permeability. In this study, feedback controlled splitting tests are introduced to generate crack width-controlled concrete specimens. Sequential crack patterns with different crack widths are viewed under a microscope. The permeability of cracked concrete is evaluated by water permeability tests. The preliminary results indicate that crack openings generally accelerate water flow rate in concrete. When a specimen is loaded to have a crack opening displacement smaller than 50 microns prior to unloading, the crack opening has little effect on concrete permeability. When the crack opening displacement increases from 50 microns to about 200 microns, concrete permeability increases rapidly. After the crack opening displacement reaches 200 microns, the rate of water permeability increases steadily. The present research may provide insight into developing design criteria for a durable concrete and in predicting service life of a concrete structure.

%B Cement Concrete Res. %V 27 %P 381-393 %G eng %R http://dx.doi.org/10.1016/S0008-8846(97)00031-8 %0 Book Section %B Case Studies in Bayesian Statistics %D 1997 %T A Random-Effects Multinomial Probit Model of Car Ownership Choice %A Nobile, Agostino %A Bhat, Chandra R. %A Pas, Eric I. %E Gatsonis, Constantine %E Hodges, James S. %E Kass, Robert E. %E McCulloch, Robert %E Rossi, Peter %E Singpurwalla, Nozer D. %K car ownership %K longitudinal data %K Multinomial probit model %XThe number of cars in a household has an important effect on its travel behavior (e.g., choice of number of trips, mode to work and non-work destinations), hence car ownership modeling is an essential component of any travel demand forecasting effort. In this paper we report on a random effects multinomial probit model of car ownership level, estimated using longitudinal data collected in the Netherlands. A Bayesian approach is taken and the model is estimated by means of a modification of the Gibbs sampling with data augmentation algorithm considered by McCulloch and Rossi (1994). The modification consists in performing, after each Gibbs sampling cycle, a Metropolis step along a direction of constant likelihood. An examination of the simulation output illustrates the improved performance of the resulting sampler.

%B Case Studies in Bayesian Statistics %S Lecture Notes in Statistics %I Springer New York %V 121 %P 419-434 %@ 978-0-387-94990-1 %G eng %U http://dx.doi.org/10.1007/978-1-4612-2290-3_13 %R 10.1007/978-1-4612-2290-3_13 %0 Book Section %B Modelling Longitudinal and Spatially Correlated Data %D 1997 %T Scaled Link Functions for Heterogeneous Ordinal Response Data* %A Xie, Minge %A Simpson, Douglas G. %A Carroll, Raymond J. %E Gregoire, Timothy G. %E Brillinger, David R. %E Diggle, Peter J. %E Russek-Cohen, Estelle %E Warren, William G. %E Wolfinger, Russell D. %K Aggregated observations %K Generalized likelihood inference %K Marginal modeling approach %K Ordinal regression %XThis paper describes a class of ordinal regression models in which the link function has scale parameters that may be estimated along with the regression parameters. One motivation is to provide a plausible model for group level categorical responses. In this case a natural class of scaled link functions is obtained by treating the group level responses as threshold averages of possibly correlated latent individual level variables. We find scaled link functions also arise naturally in other circumstances. Our methodology is illustrated through environmental risk assessment data where (correlated) individual level responses and group level responses are mixed.

%B Modelling Longitudinal and Spatially Correlated Data %S Lecture Notes in Statistics %I Springer New York %V 122 %P 23-36 %@ 978-0-387-98216-8 %G eng %U http://dx.doi.org/10.1007/978-1-4612-0699-6_3 %R 10.1007/978-1-4612-0699-6_3 %0 Journal Article %J Environmental and Ecological Statistics %D 1997 %T Spatial sampling and the environment: some issues and directions %A Dennis D. Cox %A Lawrence H. Cox %A Ensor, Katherine B. %K environmental monitoring %K experimental design %K kriging %K multiphase sampling %K spatial statistics %B Environmental and Ecological Statistics %I Kluwer Academic Publishers %V 4 %P 219-233 %G eng %U http://dx.doi.org/10.1023/A%3A1018578513217 %R 10.1023/A:1018578513217 %0 Journal Article %J Atmospheric Environment %D 1996 %T Accounting for Meteorological Effects in Measuring Urban Ozone Levels and Trends %A Bloomfield, Peter %A Royle, Andy %A Steinberg, Laura J. %A Yang, Qing %K median polish %K meteorological adjustment %K nonlinear regression %K nonparametric regression %K Ozone concentration %XObserved ozone concentrations are valuable indicators of possible health and environmental impacts. However, they are also used to monitor changes and trends in the sources of ozone and of its precursors, and for this purpose the influence of meteorological variables is a confounding factor. This paper examines ozone concentrations and meteorology in the Chicago area. The data are described using least absolute deviations and local regression. The key relationships observed in these analyses are then used to construct a nonlinear regression model relating ozone to meteorology. The model can be used to estimate that part of the trend in ozone levels that cannot be accounted for by trends in meteorology, and to ‘adjust’ observed ozone concentrations for anomalous weather conditions.
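A linear caricature of the adjustment idea; the paper uses least absolute deviations, local regression, and a nonlinear model, so this ordinary-least-squares sketch with invented covariates only shows how a trend term is separated from meteorology:

    import numpy as np

    rng = np.random.default_rng(2)
    days = 1500
    t = np.arange(days) / 365.0                        # time in years
    temp = 20 + 8 * np.sin(2 * np.pi * t) + rng.normal(0, 2, days)
    wind = rng.gamma(2.0, 2.0, days)
    ozone = 40 + 1.5 * temp - 2.0 * wind - 0.8 * t + rng.normal(0, 5, days)

    X = np.column_stack([np.ones(days), temp, wind, t])
    beta, *_ = np.linalg.lstsq(X, ozone, rcond=None)
    print(beta[3])                            # estimated trend net of meteorology
    adjusted = ozone - X[:, 1:3] @ beta[1:3]  # weather-adjusted ozone series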

%B Atmospheric Environment %V 30 %P 3067–3077 %G eng %N 17 %R 10.1016/1352-2310(95)00347-9 %0 Journal Article %J Journal of Environmental Engineering %D 1996 %T Bayesian Model for Fate and Transport of Polychlorinated Biphenyl in Upper Hudson River %A Steinberg, Laura J. %A Reckhow, Kenneth H. %A Wolpert, Robert L. %K Bayesian analysis %K Hudson River %K PCB %K simulation models %K transport phenomena %XModelers of contaminant fate and transport in surface waters typically rely on literature values when selecting parameter values for mechanistic models. While the expert judgment with which these selections are made is valuable, the information contained in contaminant concentration measurements should not be ignored. In this full-scale Bayesian analysis of polychlorinated biphenyl (PCB) contamination in the upper Hudson River, these two sources of information are combined using Bayes’ theorem. A simulation model for the fate and transport of the PCBs in the upper Hudson River forms the basis of the likelihood function while the prior density is developed from literature values. The method provides estimates for the anaerobic biodegradation half-life, aerobic biodegradation plus volatilization half-life, contaminated sediment depth, and resuspension velocity of 4,400 d, 3.2 d, 0.32 m, and 0.02 m/yr, respectively. These are significantly different than values obtained with more traditional methods, and are shown to produce better predictions than those methods when used in a cross-validation study.

%B Journal of Environmental Engineering %V 122 %P 341-349 %8 May 1996 %G eng %R http://dx.doi.org/10.1061/(ASCE)0733-9372(1996)122:5(341) %0 Conference Paper %B Proceedings of the International Workshop on Empirical Studies of Software Maintenance %D 1996 %T An empirical exploration of code evolution %A A. F. Karr %A A. A. Porter %A L. G. Votta %B Proceedings of the International Workshop on Empirical Studies of Software Maintenance %G eng %0 Journal Article %J Journal of Agricultural Biological and Environmental Statistics %D 1996 %T Interval Censoring And Marginal Analysis In Ordinal Regression %A Simpson, Douglas G. %A Carroll, Raymond %A Xie, Minge %K categorical data %K categorical response %K environmental statistics %XThis paper develops methodology for regression analysis of ordinal response data subject to interval censoring. This work is motivated by the need to analyze data from multiple studies in toxicological risk assessment. Responses are scored on an ordinal severity scale, but not all responses can be scored completely. For instance, in a mortality study, information on nonfatal but adverse outcomes may be missing. In order to address possible within-study correlations we develop a generalized estimating approach to the problem, with appropriate adjustments to uncertainty statements. We develop expressions relating parameters of the implied marginal model to the parameters of a conditional model with random effects, and, in a special case, we note an interesting equivalence between conditional and marginal modeling of ordinal responses. We illustrate the methodology in an analysis of a toxicological database.

%B Journal of Agricultural Biological and Environmental Statistics %V 4 %G eng %R 10.2307/1400524 %0 Journal Article %J Transportation Research Record %D 1996 %T Non-response and Urban Travel Models %A Piyushimita Thakuriah %A Sen, Ashish %A Sööt, Siim %A Christopher, Ed J. %B Transportation Research Record %V 1551 %P 82-87 %G eng %0 Journal Article %J Journal of Agricultural, Biological, and Environmental Statistics %D 1996 %T Predicting ozone levels and trends with semiparametric modeling %A Gao, Feng %A Jerome Sacks %A Welch, William %B Journal of Agricultural, Biological, and Environmental Statistics %V 1 %P 404-425 %G eng %& 404 %0 Journal Article %J Transportation Research Part C: Emerging Technologies %D 1996 %T Quality of Information given by Advanced Traveler Information Systems %A Piyushimita Thakuriah %A Sen, Ashish %B Transportation Research Part C: Emerging Technologies %V 4 %P 249-266 %G eng %0 Journal Article %J Environmetrics %D 1995 %T The ability of wet deposition networks to detect temporal trends %A Oehlert, Gary W. %K discrete smoothing %K wet deposition networks %XWe use the spatial/temporal model developed in Oehlert (1993) to estimate the detectability of trends in wet-deposition sulphate. Precipitation volume adjustments of sulphate concentration dramatically improve the detectability and quantifiability of trends. Anticipated decreases in sulphate of about 30 per cent in the Eastern U.S. by 2005 predicted by models should be detectable much earlier, say, 1997, but accurate quantification of the true decrease will require several additional years of monitoring. It is possible to delete a few stations from the East without materially affecting the detectability or quantifiability of trends. Careful siting of new stations can provide substantial improvement to regional trend estimation.

%B Environmetrics %V 6 %P 327–339 %G eng %R 10.1002/env.3170060402 %0 Journal Article %J Environmental Health Perspectives %D 1995 %T Effect of outdoor airborne particulate matter on daily death count %A P. Styer %A McMillan, N %A Gao, F %A Davis, J %A Jerome Sacks %XTo investigate the possible relationship between airborne particulate matter and mortality, we developed regression models of daily mortality counts using meteorological covariates and measures of outdoor PM10. Our analyses included data from Cook County, Illinois, and Salt Lake County, Utah. We found no evidence that particulate matter ≤ 10 microns (PM10) contributes to excess mortality in Salt Lake County, Utah. In Cook County, Illinois, we found evidence of a positive PM10 effect in spring and autumn, but not in winter and summer. We conclude that the reported effects of particulates on mortality are unconfirmed.
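In modern terms the daily-mortality models above are close to a Poisson regression of death counts on PM10 and meteorological covariates. A hedged sketch on synthetic data (not the study's data, covariates, or seasonal stratification), using the statsmodels GLM interface:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    days = 1000
    pm10 = rng.gamma(3.0, 12.0, days)                 # ug/m3, invented
    temp = 15 + 10 * np.sin(2 * np.pi * np.arange(days) / 365) + rng.normal(0, 3, days)
    lam = np.exp(3.0 + 0.0005 * pm10 + 0.01 * temp)   # synthetic mean daily deaths
    deaths = rng.poisson(lam)

    X = sm.add_constant(np.column_stack([pm10, temp]))
    res = sm.GLM(deaths, X, family=sm.families.Poisson()).fit()
    print(res.params)   # per-unit log-rate effects of PM10 and temperature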

%B Environmental Health Perspectives %V 103 %P 490–497 %G eng %0 Journal Article %J Mathematical and Computer Modelling %D 1995 %T Estimation of Static Travel Times in a Dynamic Route Guidance System %A Sen, Ashish %A Piyushimita Thakuriah %K Advanced Travel Information System %K Autonomous route guidance %K Dynamic Route Guidance %K Link travel time estimate %K Link Travel Time Process %XIn an Advanced Traveler Information System where route guidance is provided, a driver chooses a route before he/she actually traverses the links in the route. For such systems, link travel times need to be forecasted. However, information on several thousand links would take a fair amount of time to be conveyed to the driver, and very few drivers would be willing to wait very long to get route information. In the ADVANCE demonstration, to be implemented in suburban Chicago, the in-vehicle unit in each participating vehicle will be provided with the capability of accessing default travel time information, which will provide the vehicle with an autonomous navigation capability. The default estimates will be overwritten by dynamic up-to-the-minute forecasts if such forecasts are different from the default estimates. This paper describes the approach used to compute default travel time estimates.

%B Mathematical and Computer Modelling %V 22 %P 83–101 %G eng %0 Journal Article %J Journal of Statistical Planning and Inference %D 1995 %T On high level exceedance modeling and tail inference %A M. R. Leadbetter %K Central limit theory %K Exceedance modeling %K Extreme values %K Tail estimation %XThis paper discusses a general framework common to some varied known and new results involving measures of threshold exceedance by high values of stationary stochastic sequences. In particular these concern the following. (a) Probabilistic modeling of infrequent but potentially damaging physical events such as storms, high stresses, high pollution episodes, describing both repeated occurrences and associated ‘damage’ magnitudes. (b) Statistical estimation of ‘tail parameters’ of a stationary stochastic sequence {Xj}. This includes a variety of estimation problems and in particular cases such as estimation of expected lengths of clusters of high values (e.g. storm durations), of interest in (a). ‘Very high’ values (leading to Poisson-based limits for exceedance statistics) and ‘high’ values (giving normal limits) are considered and exhibited as special cases within the general framework of central limit results for ‘random additive interval functions’. The case of array sums of dependent random variables is revisited within this framework, clarifying the role of dependence conditions and providing minimal conditions for characterization of possible limit types. The methods are illustrated by the construction of confidence limits for the mean of an ‘exceedance statistic’ measuring high ozone levels, based on Philadelphia monitoring data.
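A small illustration of the exceedance bookkeeping behind (a) and (b): find maximal runs of consecutive values above a high threshold and report cluster counts and mean cluster length (e.g., storm durations). This is only the simplest run-based clustering, on invented data, not the paper's general framework:

    import numpy as np

    def exceedance_clusters(x, u):
        """Return lengths of maximal runs of consecutive values with x > u."""
        above = (x > u).astype(np.int8)
        change = np.diff(np.concatenate(([0], above, [0])))
        starts = np.flatnonzero(change == 1)    # run begins
        ends = np.flatnonzero(change == -1)     # one past run end
        return ends - starts

    rng = np.random.default_rng(4)
    x = rng.normal(size=10_000)
    lengths = exceedance_clusters(x, u=2.0)
    print(len(lengths), lengths.mean())   # cluster count and mean duration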

%B Journal of Statistical Planning and Inference %V 45 %P 247-280 %G eng %R 10.1016/0378-3758(94)00075-1 %0 Journal Article %J Annual Review of Psychology %D 1995 %T Multiple Hypothesis Testing: A Review %A Shaffer, Juliet Popper %B Annual Review of Psychology %V 46 %P 561-584 %G eng %0 Journal Article %J Atmospheric Environment %D 1995 %T Point process approach to modeling trends in tropospheric ozone based on exceedances of a high threshold %A Smith, R.L. %A Shively, Thomas S. %B Atmospheric Environment %V 29 %P 3489–3499 %G eng %& 3489 %R 10.1016/1352-2310(95)00030-3 %0 Journal Article %J Journal of Educational and Behavioral Statistics %D 1995 %T Questioning Multilevel Models %A de Leeuw, Jan %A Kreft, Ita G.G. %XIn this article, practical problems with multilevel techniques are discussed. These problems, brought to our attention by the National Center for Education Statistics (NCES), have to do with terminology, computer programs employing different algorithms, and interpretations of the coefficients in one or two steps. We discuss the usefulness of the hierarchical linear model (HM) in the most common situation in education, that of a large number of relatively small groups. We also point to situations where the more complicated HMs can be replaced with simpler models, with statistical properties that are easier to study. We conclude that more studies need to be done to establish the claimed superiority of restricted versus unrestricted maximum likelihood, to study the effects of shrinkage on the estimators, and to explore the merits of simpler methods such as weighted least squares. Finally, distinctions must be made between choice of model, choice of technique, choice of algorithm, and choice of computer program. While HMs are an elegant conceptualization, they are not always necessary. Traditional techniques perform as well, or better, if there are large groups and small intraclass correlations, and if the researcher is interested only in the fixed-level regression coefficients.

%B Journal of Educational and Behavioral Statistics %V 20 %P 171-189 %G eng %0 Journal Article %J Atmospheric Environment %D 1995 %T Shrinking a wet deposition network %A Oehlert, Gary W. %K Monitoring network %K network design %K spatial smoothing %K trend analysis %XSuppose that we must delete stations from a monitoring network. Which stations should be deleted if we wish the remaining network to have the smallest possible trend estimate variances? We use the spatial-temporal model described in Oehlert (1993, J. Am. Statist. Assoc., 88, 390–399) to model concentration of sulfate in wet deposition. Based on this model and three criteria, we choose good sets of candidate stations for deletion from the NADP/NTN network. The three criteria are that the sum of 11 regional trend estimate variances be as small as possible, that the sum of local trend estimation variance be as small as possible, and that the sum of local mean estimation variance be as small as possible. Good choices of stations for deletion result in a modest increase in criteria (about 7 to 34%) for 100 stations deleted from the network, while random sets of 100 stations can increase criteria by a factor of two or more.
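The deletion problem lends itself to a greedy backward-elimination sketch; criterion(keep) below is a placeholder for, say, the summed trend-estimate variances under the spatial-temporal model, and none of this is the paper's actual search procedure:

    def greedy_delete(stations, n_delete, criterion):
        """Drop n_delete stations one at a time, each time removing the
        station whose removal increases the design criterion the least."""
        keep = set(stations)
        for _ in range(n_delete):
            cheapest = min(keep, key=lambda s: criterion(keep - {s}))
            keep.discard(cheapest)
        return keep

    # e.g., remaining = greedy_delete(range(200), 100, criterion)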

%B Atmospheric Environment %V 30 %P 1347–1357 %G eng %R 10.1016/1352-2310(95)00333-9 %0 Journal Article %J Journal of Geophysical Research: Oceans %D 1994 %T Arctic sea ice variability: Model sensitivities and a multidecadal simulation %A Chapman, W.L. %A Welch, W. %A Bowman, K.P. %A Jerome Sacks %A Walsh, J.E. %K Arctic region %K Climate and interannual variability %K Ice mechanics and air/sea/ice exchange processes %K Information Related to Geographic Region: Arctic region %K Numerical modeling %XA dynamic-thermodynamic sea ice model is used to illustrate a sensitivity evaluation strategy in which a statistical model is fit to the output of the ice model. The statistical model response, evaluated in terms of certain metrics or integrated features of the ice model output, is a function of a selected set of d (= 13) prescribed parameters of the ice model and is therefore equivalent to a d-dimensional surface. The d parameters of the ice model are varied simultaneously in the sensitivity tests. The strongest sensitivities arise from the minimum lead fraction, the sensible heat exchange coefficient, and the atmospheric and oceanic drag coefficients. The statistical model shows that the interdependencies among these sensitivities are strong and physically plausible. A multidecadal simulation of Arctic sea ice is made using atmospheric forcing fields from 1960 to 1988 and parametric values from the approximate midpoints of the ranges sampled in the sensitivity tests. This simulation produces interannual variations consistent with submarine-derived data on ice thickness from 1976 and 1987 and with ice extent variations obtained from satellite passive microwave data. The ice model results indicate that (1) interannual variability is a major contributor to the differences of ice thickness and extent over timescales of a decade or less, and (2) the timescales of ice thickness anomalies are much longer than those of ice-covered areas. However, the simulated variations of ice coverage have less than 50% of their variance in common with observational data, and the temporal correlations between simulated and observed anomalies of ice coverage vary strongly with longitude.
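The surrogate-fitting strategy, a statistical approximation to model output over the d = 13 prescribed parameters, can be caricatured with a quadratic response surface; the design, outputs, and run count below are invented, and the original work used more refined statistical approximations:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(5)
    d, n_runs = 13, 200
    X = rng.uniform(-1, 1, (n_runs, d))      # scaled parameters, one row per model run
    y = X @ rng.normal(size=d) + 0.5 * X[:, 0] * X[:, 1] + rng.normal(0, 0.1, n_runs)

    # surrogate design matrix: intercept, main effects, two-way interactions
    cols = [np.ones(n_runs)] + [X[:, j] for j in range(d)]
    cols += [X[:, i] * X[:, j] for i, j in combinations(range(d), 2)]
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    # large |coef| entries flag the parameters and pairs that drive the output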

%B Journal of Geophysical Research: Oceans %V 99 %P 919-935 %G eng %& 919 %R 10.1029/93JC02564 %0 Journal Article %D 1994 %T Multiworker Household Travel Demand %A Sööt, Siim %A Sen, Ashish %A Marston, J. %A Piyushimita Thakuriah %K Automobile ownership %K Demographics %K Employed %K Highway travel %K Households %K Income %K New products %K Population density %K Travel behavior %K Travel surveys %K Trip generation %K Urban areas %K Vehicle miles of travel %X The purpose of this study is to examine the travel behavior and related characteristics of multiworker households (MWHs) (defined as households with at least two workers) and how they contribute to the ever-increasing demand for transportation services. On average, they have incomes that exceed the national household average, often have multiple automobiles, and generate a considerable number of trips. The near absence of previous studies of MWHs makes an overview of their characteristics and their travel behavior necessary. This study reveals that the number of MWHs has continued to grow, as has their use of highways; they are found in disproportionate numbers in low-density urban areas distant from public transportation. They also have new vehicles and drive each vehicle more miles than other households. As households, MWHs travel more than do other households. However, an individual worker’s ability and desire to travel is constrained by time factors, among others, and transportation use by MWHs, when calculated on a per-worker basis, is relatively low. %I Federal Highway Administration %V 1 %P 30 p %G eng %U http://nhts.ornl.gov/1990/doc/demographic.pdf %0 Journal Article %J Journal of Marine Research %D 1994 %T Parameter space exploration of an ocean general circulation model using an isopycnal mixing parameterization %A Gough, William A. %A Welch, William J. %X In this study we have employed statistical methods to efficiently design experiments and analyze output of an ocean general circulation model that uses an isopycnal mixing parameterization. Full ranges of seven inputs are explored using 51 numerical experiments. Fifteen of the cases fail to reach satisfactory equilibria. These are attributable to numerical limitations specific to the isopycnal model. Statistical approximating functions are evaluated using the remaining cases to determine the dependency of each of the six scalar outputs on the inputs. With the exception of one output, the approximating functions perform well. Known sensitivities, particularly the importance of diapycnal (vertical) eddy diffusivity and wind stress, are reproduced. The sensitivities of the model to two numerical constraints specific to the isopycnal parameterization, maximum allowable isopycnal slope and horizontal background eddy diffusivity, are explored. Isopycnal modelling issues, convection reduction and the Veronis effect, are examined and found to depend crucially on the isopycnal modelling constraints. %B Journal of Marine Research %V 52 %P 773-796 %G eng %R 10.1357/0022240943076911 %0 Generic %D 1994 %T Statistics and Materials Science: Report of a Workshop %A A. F. Karr %I National Institute of Statistical Sciences %G eng %0 Book Section %B Dynamic Urban Transportation Network Models %D 1994 %T Variational Inequality Models of Ideal Dynamic User-Optimal Route Choice Problems %A Ran, Bin %A Boyce, David E. %XIn this chapter, we present both route-based and link-based variational inequality models for the ideal dynamic user-optimal route choice problem.
In Section 13.1, a route-time-based VI model for ideal DUO route choice is proposed. This model is the most straightforward formulation of route-time-based, ideal DUO route choice conditions. In Section 13.2, a multi-group route-time-based VI model is developed. In this model, each group of travelers is associated with a disutility function. Thus, the route-based ideal DUO route choice conditions are defined for each group of travelers on the basis of travel disutilities instead of travel times only.
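In generic finite-dimensional form (standard VI notation, not necessarily the chapter's exact formulation), such route choice conditions read: find $u^{*} \in K$ such that

    F(u^{*})^{\top}\,(u - u^{*}) \;\ge\; 0 \qquad \text{for all } u \in K,

where $u$ stacks the time-dependent route (or link) flows, $K$ is the set of flows satisfying conservation and nonnegativity constraints, and $F(u)$ maps flows to route (or link) travel times or disutilities; an ideal DUO state is a solution $u^{*}$.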

%B Dynamic Urban Transportation Network Models %S Lecture Notes in Economics and Mathematical Systems %I Springer Berlin Heidelberg %V 417 %P 267-290 %@ 978-3-540-58360-8 %G eng %U http://dx.doi.org/10.1007/978-3-662-00773-0_13 %R 10.1007/978-3-662-00773-0_13 %0 Generic %D 1993 %T Multivariate Threshold Methods %A Richard L. Smith %G eng %0 Journal Article %D 1993 %T Non-response Bias and Trip Generation Models %A Piyushimita Thakuriah %A Sen, Ashish %A Sööt, Siim %A Christopher, Ed J. %K Bias (Statistics) %K Travel surveys %K Trip generation %XThere is serious concern over the fact that travel surveys often overrepresent smaller households with higher incomes and better education levels and, in general, that nonresponse is nonrandom. However, when the data are used to build linear models, such as trip generation models, and the model is correctly specified, estimates of parameters are unbiased regardless of the nature of the respondents, and concerns about response rates and nonresponse bias are thereby ameliorated. The more important task then is the complete specification of the model, without leaving out variables that have some effect on the variable to be predicted. The theoretical basis for this reasoning is given along with an example of how bias may be assessed in estimates of trip generation model parameters. Some of the methods used are quite standard, but the manner in which these and other more nonstandard methods have been systematically put together to assess bias in estimates shows that careful model building, not concern over bias in the data, becomes the key issue in developing trip generation and other models.
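The unbiasedness claim can be checked in a toy simulation: when response depends only on variables already in a correctly specified model, OLS recovers the population coefficients even under heavy nonresponse. All numbers below are invented:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 200_000
    hh_size = rng.integers(1, 7, n).astype(float)
    trips = 1.0 + 2.0 * hh_size + rng.normal(0, 1, n)   # true population model

    # nonrandom response: larger households respond less often
    respond = rng.uniform(size=n) < np.clip(1.0 - 0.12 * hh_size, 0.2, 1.0)

    X = np.column_stack([np.ones(respond.sum()), hh_size[respond]])
    beta, *_ = np.linalg.lstsq(X, trips[respond], rcond=None)
    print(beta)   # close to (1.0, 2.0) despite ~40% covariate-driven nonresponse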

%I Transportation Research Board %P 64-70 %@ 0309055598 %G eng %0 Journal Article %J Wiley StatsRef: Statistics Reference Online %D 0 %T Combining Estimates from Multiple Surveys %A Elliott, M. R. %A Raghunathan, T. E. %A Schenker, N. %K dual frame %K imputation %K missing data %K non-probability samples %K small-area estimation %K Weighting %XCombining estimates from multiple surveys can be very useful, especially when the question of interest cannot be addressed well by a single, existing survey. In this paper, we provide a brief review of methodology for combining estimates, with a focus on dual frame, weighting-based, joint-modeling, missing-data, and small-area methods. Many such methods are useful in situations outside the realm of combining estimates from surveys, such as combining information from surveys with administrative data and combining probability-sample data with non-probability sample, or “big” data. We also provide examples of comparability issues that must be kept in mind when information from different sources is being combined.
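One of the simplest combining rules, precision (inverse-variance) weighting of two survey estimates of the same total, as a hedged illustration of the weighting-based family; the numbers are invented, and real dual-frame estimators must also handle frame overlap and comparability:

    import numpy as np

    # two surveys estimating the same population total
    est = np.array([1.32e6, 1.25e6])
    se = np.array([4.0e4, 7.0e4])

    w = 1.0 / se**2
    combined = np.sum(w * est) / np.sum(w)
    combined_se = np.sqrt(1.0 / np.sum(w))
    print(combined, combined_se)   # precision-weighted estimate and its SE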

%B Wiley StatsRef: Statistics Reference Online %G eng %U https://www.niss.org/sites/default/files/Elliott%2C%20Raghunathan%2C%20%26%20Schenker%20for%20Wiley%20StatsRef.pdf