%0 Conference Proceedings %B JSM 2017 %T Evaluation of a New Approach for Estimating the Number of U.S. Farms %A Benecha, H. %A Abreu, D. %A Abernethy, J. %A Sartore, L. %A Young, L. Y. %K Agriculture %K Area-frame %K BigData %K Capture-Recapture %K List Frame %K Logistic Regression %K Misclassification Error %K NASS %X

USDA’s National Agricultural Statistics Service (NASS) employs the June Area Survey (JAS) to produce annual estimates of U.S. farm numbers. The JAS is an area-frame-based survey conducted every year during the first two weeks of June. NASS also publishes an independent estimate of the number of farms from the quinquennial Census of Agriculture. Studies conducted by NASS have shown that farm number estimates from the JAS can be biased, mainly due to misclassification of agricultural tracts during the pre-screening and data collection processes. To adjust for the bias, NASS has developed a capture-recapture model that uses NASS’s list frame as the second sample, where estimation is performed based on records in the JAS with matches in the list frame. In the current paper, we describe an alternative capture-recapture approach that uses all available data from the JAS and the Census of Agriculture to correct for biases due to misclassification and to produce more stable farm number estimates.
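The adjustment rests on standard capture-recapture reasoning. As a minimal, hedged sketch (not NASS's actual model, which conditions capture probabilities on covariates via logistic regression), the classical dual-system Lincoln-Petersen estimator can be written as follows; all counts are invented for illustration.

```python
# Minimal dual-system (Lincoln-Petersen) sketch of the capture-recapture idea
# behind the farm-number adjustment. Counts are illustrative placeholders,
# not NASS data; the actual NASS model estimates capture probabilities with
# logistic regression rather than assuming them equal.

def lincoln_petersen(n_area, n_list, n_both):
    """Estimate a total population size from two overlapping samples.

    n_area : farms captured by the area frame (JAS)
    n_list : farms captured by the second source (list frame / census)
    n_both : farms captured by both sources
    """
    if n_both == 0:
        raise ValueError("No overlap between samples; estimator undefined.")
    return n_area * n_list / n_both

# Hypothetical counts for illustration only.
print(lincoln_petersen(n_area=1200, n_list=1500, n_both=900))  # ~2000 farms
```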

%B JSM 2017 %G eng %U https://www.niss.org/sites/default/files/Benecha_Estim_Farms_20170929.pdf %0 Journal Article %J Molecular & Cellular Proteomics %D 2014 %T Improved Normalization of Systematic Biases Affecting Ion Current Measurements in Label-free Proteomics Data %A P. A. Rudnick %A X. Wang %A E. Yan %A Sedransk, N. %A S. E. Stein %B Molecular & Cellular Proteomics %V 13 %P 1341-1351 %G eng %N 5 %0 Journal Article %J Molecular and Cellular Proteomics %D 2013 %T Design, Implementation and Multisite Evaluation of a System Suitability Protocol for the Quantitative Assessment of Instrument Performance in Liquid Chromatography-Multiple Reaction Monitoring-MS (LC-MRM-MS) %A Abbatiello, S. %A Feng, X. %A Sedransk, N. %A Mani, DR %A Schilling, B %A Maclean, B %A Zimmerman, LJ %A Cusack, MP %A Hall, SC %A Addona, T %A Allen, S %A Dodder, NG %A Ghosh, M %A Held, JM %A Hedrick, V %A Inerowicz, HD %A Jackson, A %A Keshishian, H %A Kim, JW %A Lyssand, JS %A Riley, CP %A Rudnick, P %A Sadowski, P %A Shaddox, K %A Smith, D %A Tomazela, D %A Wahlander, A %A Waldemarson, S %A Whitwell, CA %A You, J %A Zhang, S %A Kinsinger, CR %A Mesri, M %A Rodriguez, H %A Borchers, CH %A Buck, C %A Fisher, SJ %A Gibson, BW %A Liebler, D %A Maccoss, M %A Neubert, TA %A Paulovich, A %A Regnier, F %A Skates, SJ %A Tempst, P %A Wang, M %A Carr, SA %X

Multiple reaction monitoring (MRM) mass spectrometry coupled with stable isotope dilution (SID) and liquid chromatography (LC) is increasingly used in biological and clinical studies for precise and reproducible quantification of peptides and proteins in complex sample matrices. Robust LC-SID-MRM-MS-based assays that can be replicated across laboratories and ultimately in clinical laboratory settings require standardized protocols to demonstrate that the analysis platforms are performing adequately. We developed a system suitability protocol (SSP), which employs a predigested mixture of six proteins, to facilitate performance evaluation of LC-SID-MRM-MS instrument platforms, configured with nanoflow-LC systems interfaced to triple quadrupole mass spectrometers. The SSP was designed for use with low multiplex analyses as well as high multiplex approaches when software-driven scheduling of data acquisition is required. Performance was assessed by monitoring a range of chromatographic and mass spectrometric metrics including peak width, chromatographic resolution, peak capacity, and the variability in peak area and analyte retention time (RT) stability. The SSP, which was evaluated in 11 laboratories on a total of 15 different instruments, enabled early diagnoses of LC and MS anomalies that indicated suboptimal LC-MRM-MS performance. The observed range in variation of each of the metrics scrutinized serves to define the criteria for optimized LC-SID-MRM-MS platforms for routine use, with pass/fail criteria for system suitability performance measures defined as peak area coefficient of variation <0.15, peak width coefficient of variation <0.15, standard deviation of RT <0.15 min (9 s), and RT drift <0.5 min (30 s). The deleterious effect of a marginally performing LC-SID-MRM-MS system on the limit of quantification (LOQ) in targeted quantitative assays illustrates the use of and need for an SSP to establish robust and reliable system performance. Use of an SSP helps to ensure that analyte quantification measurements can be replicated with good precision within and across multiple laboratories and should facilitate more widespread use of MRM-MS technology by the basic biomedical and clinical laboratory research communities.
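The pass/fail thresholds quoted above lend themselves to a simple check across replicate injections. The sketch below is not the SSP software; it merely illustrates, with invented replicate values, how the coefficient-of-variation and retention-time criteria could be evaluated for one monitored peptide.

```python
import numpy as np

# Pass/fail thresholds quoted in the abstract.
CV_PEAK_AREA_MAX = 0.15
CV_PEAK_WIDTH_MAX = 0.15
RT_SD_MAX_MIN = 0.15      # minutes (9 s)
RT_DRIFT_MAX_MIN = 0.5    # minutes (30 s)

def cv(x):
    """Coefficient of variation of a set of replicate measurements."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / x.mean()

def system_suitability(peak_areas, peak_widths, retention_times):
    """Return metric -> pass/fail for one peptide across replicate injections."""
    rt = np.asarray(retention_times, dtype=float)
    return {
        "peak_area_cv":  cv(peak_areas)  < CV_PEAK_AREA_MAX,
        "peak_width_cv": cv(peak_widths) < CV_PEAK_WIDTH_MAX,
        "rt_sd":         rt.std(ddof=1)  < RT_SD_MAX_MIN,
        "rt_drift":      abs(rt[-1] - rt[0]) < RT_DRIFT_MAX_MIN,  # first vs last injection
    }

# Invented replicate-injection values, for illustration only.
print(system_suitability(
    peak_areas=[1.02e6, 0.98e6, 1.05e6, 1.00e6],
    peak_widths=[0.31, 0.30, 0.32, 0.31],
    retention_times=[24.10, 24.12, 24.08, 24.15],
))
```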

%B Molecular and Cellular Proteomics %V 12 %P 2623-2639 %G eng %R 10.1074/mcp.M112.027078 %0 Journal Article %J Statistics, Politics and Policy %D 2012 %T Data, Statistics and Controversy: Making Scientific Data Intelligible %A Sedransk, N. %A Young, L. %A Spiegelman, C. %K data availability %K Daubert rule %K inference verification %K meta-data %K proprietary data %K publication bias %K reuse of data %K secondary analysis %K synthetic data %X

Making published scientific research data publicly available can benefit scientists and policy makers only if there is sufficient information for these data to be intelligible. Thus, the necessary meta-data go beyond the scientific, technological detail and extend to the statistical approach and methodologies applied to these data. The statistical principles that give integrity to researchers’ analyses and interpretations of their data require documentation. This is true when the intent is to verify or validate the published research findings; it is equally true when the intent is to utilize the scientific data in conjunction with other data or new experimental data to explore complex questions; and it is profoundly important when the scientific results and interpretations are taken outside the world of science to establish a basis for policy, for legal precedent or for decision-making. When research draws on already public databases, e.g., a large federal statistical database or a large scientific database, the selection of data for analysis, whether by subsampling or by aggregation, is specific to that research, so this (statistical) methodology is a crucial part of the meta-data. Examples illustrate the role of statistical meta-data in the use and reuse of these public datasets and the impact on public policy and precedent.

%B Statistics, Politics and Policy %V 3 %P 1-20 %G eng %R 10.1515/2151-7509.1046 %0 Journal Article %J Statistical Science %D 2011 %T Make research data public? - Not always so simple: A Dialogue for statisticians and science editors %A Nell Sedransk %A Lawrence H. Cox %A Deborah Nolan %A Keith Soper %A Cliff Spiegelman %A Linda J. Young %A Katrina L. Kelner %A Robert A. Moffitt %A Ani Thakar %A Jordan Raddick %A Edward J. Ungvarsky %A Richard W. Carlson %A Rolf Apweiler %X

Putting data into the public domain is not the same thing as making those data accessible for intelligent analysis. A distinguished group of editors and experts who were already engaged in one way or another with the issues inherent in making research data public came together with statisticians to initiate a dialogue about policies and practicalities of requiring published research to be accompanied by publication of the research data. This dialogue carried beyond the broad issues of advisability, intellectual integrity, and scientific exigency to the relevance of these issues to statistics as a discipline, and to the relevance of statistics, from inference to modeling to data exploration, for science and social science policies on these issues.

%B Statistical Science %V 26 %P 41-50 %G eng %R 10.1214/10-STS320 %0 Conference Paper %B Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, 2009. WHISPERS ’09. First Workshop on %D 2009 %T Evaluation of unmixing methods for the separation of Quantum Dot sources %A Fogel, P. %A Gobinet, C. %A Young, S.S. %A Zugaj, D. %K Bayesian methods %K Bayesian positive source separation %K BPSS %K cadmium compounds %K CdSe %K consensus nonnegative matrix factorization %K Fluorescence %K hyperspectral images %K Hyperspectral imaging %K hyperspectral system %K ICA %K II-VI semiconductors %K independent component analysis %K Nanobioscience %K Nanocrystals %K nanometer dimensions %K NMF %K Photonic crystals %K Probes %K quantum dot sources %K Quantum dots %K semiconductor crystals %K semiconductor quantum dots %K Source separation %K spatial localization %K ultraviolet spectra %K unmixing methods %X

Quantum Dots (QDs) are semiconductor crystals with nanometer dimensions, whose fluorescence properties can be tuned by controlling their diameter. Under ultraviolet light excitation, these nanocrystals re-emit photons in the visible spectrum, with a wavelength ranging from red to blue as their size diminishes. We created an experiment to evaluate unmixing methods for hyperspectral images. The wells of a 3 × 3 matrix were filled with individual QDs or with mixtures of up to three of the five QDs. The matrix was imaged by a hyperspectral system (Photon Etc., Montreal, QC, CA) and a data “cube” of 512 rows × 512 columns × 63 wavelengths was generated. For unmixing, we tested three approaches: independent component analysis (ICA), Bayesian positive source separation (BPSS) and our new consensus non-negative matrix factorization (CNMF) method. For each of these methods, we assessed the ability to separate the different sources from both spectral and spatial localization points of view. In this situation, we showed that BPSS and CNMF model estimates were very close to the original design of our experiment and were better than the ICA results. However, the time needed for the BPSS model to converge is substantially longer than for CNMF. In addition, we show how the CNMF coefficients can be used to provide reasonable bounds for the number of sources, a key issue for unmixing methods, and allow for an effective segmentation of the spatial signal.
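A rough sense of the matrix-factorization step can be had from a single plain NMF run on the unfolded pixels-by-wavelengths matrix. The sketch below uses scikit-learn's NMF as a stand-in for the consensus CNMF procedure (which aggregates many such runs); the cube is random noise and the spatial dimensions are reduced, purely so the example runs.

```python
import numpy as np
from sklearn.decomposition import NMF

# Sketch only: one plain NMF run on an unfolded hyperspectral cube, standing in
# for the consensus NMF (CNMF) procedure described above. The cube here is
# random noise purely to make the example runnable.
rows, cols, bands = 64, 64, 63          # the paper's cube is 512 x 512 x 63
cube = np.random.rand(rows, cols, bands)

X = cube.reshape(rows * cols, bands)    # pixels x wavelengths, non-negative
n_sources = 5                           # five quantum dots in the experiment

model = NMF(n_components=n_sources, init="nndsvda", max_iter=500, random_state=0)
abundances = model.fit_transform(X)     # spatial mixing coefficients
spectra = model.components_             # estimated source spectra

abundance_maps = abundances.reshape(rows, cols, n_sources)
print(spectra.shape, abundance_maps.shape)
```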

%B Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, 2009. WHISPERS ’09. First Workshop on %P 1-4 %@ 978-1-4244-4686-5 %G eng %R 10.1109/WHISPERS.2009.5289020 %0 Journal Article %J Computational Statistics & Data Analysis %D 2008 %T Pooled ANOVA %A Michael Last %A Gheorghe Luta %A Alex Orso %A Adam Porter %A Stan Young %B Computational Statistics & Data Analysis %V 52 %P 5215 %G eng %0 Journal Article %J Journal of Chemical Information and Modeling %D 2007 %T Exploration of cluster structure-activity relationship analysis in efficient high-throughput screening %A Wang, X. S. %A Salloum, G.A. %A Chipman, H.A. %A Welch, W.J. %A Young, S.S. %X

Sequential screening has become increasingly popular in drug discovery. It iteratively builds quantitative structure-activity relationship (QSAR) models from successive high-throughput screens, making screening more effective and efficient. We compare cluster structure-activity relationship analysis (CSARA) as a QSAR method with recursive partitioning (RP), by designing three strategies for sequential collection and analysis of screening data. Various descriptor sets are used in the QSAR models to characterize chemical structure, including high-dimensional sets and some that by design have many variables not related to activity. The results show that CSARA outperforms RP. We also extend the CSARA method to deal with a continuous assay measurement.

%B Journal of Chemical Information and Modeling %V 47 %P 1206-1214 %G eng %R 10.1021/ci600458n %0 Journal Article %J Bioinformatics %D 2007 %T Inferential, robust non-negative matrix factorization analysis of microarray data %A Fogel, P. %A Young, S.S. %A Hawkins, D.M. %A Ledirac, N %X

Motivation: Modern methods such as microarrays, proteomics and metabolomics often produce datasets where there are many more predictor variables than observations. Research in these areas is often exploratory; even so, there is interest in statistical methods that accurately point to effects that are likely to replicate. Correlations among predictors are used to improve the statistical analysis. We exploit two ideas: non-negative matrix factorization methods that create ordered sets of predictors; and statistical testing within ordered sets, which is done sequentially, removing the need for correction for multiple testing within the set. Results: Simulations and theory point to increased statistical power. Computational algorithms are described in detail. The analysis and biological interpretation of a real dataset are given. In addition to the increased power, the benefit of our method is that the organized gene lists are likely to lead to a better understanding of the biology. Availability: An SAS JMP executable script is available from http://www.niss.org/irMF
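The core idea, ordering predictors by an NMF factor and then testing down the ordered list until the first non-significant result, can be sketched as follows. This is not the irMF JMP script referenced above; the data are simulated, and an ordinary two-sample t-test stands in for whatever test a given study would use.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.decomposition import NMF

# Sketch of the ordered-set / sequential-testing idea, not the irMF JMP script.
# Random data purely for runnability; rows are samples, columns are genes.
rng = np.random.default_rng(0)
X = rng.random((40, 200))                 # non-negative expression matrix
groups = np.array([0] * 20 + [1] * 20)    # two-group design

# 1) Order genes within one factor by their NMF loading.
model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
model.fit(X)
factor = 0
order = np.argsort(model.components_[factor])[::-1]   # largest loading first

# 2) Test genes sequentially down the ordered list; stop at the first failure,
#    so no multiplicity correction is needed within the ordered set.
alpha, selected = 0.05, []
for gene in order:
    _, p = ttest_ind(X[groups == 0, gene], X[groups == 1, gene])
    if p >= alpha:
        break
    selected.append(gene)

# With pure-noise data this list will usually be empty; the point is the mechanism.
print("genes declared significant in factor 0:", selected)
```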

%B Bioinformatics %V 23 %P 44-49 %G eng %R 10.1093/bioinformatics/btl550 %0 Journal Article %J Q&SPES News %D 2006 %T Clustering Scotch Whiskies using Non-Negative Matrix Factorization %A Young, S.S. %A Fogel, P. %A Hawkins, D.M. %B Q&SPES News %V 14 %P 11-13 %G eng %0 Journal Article %J Technometrics %D 2006 %T Secure, privacy-preserving analysis of distributed databases %A Alan F. Karr %A Fulp, WJ %A F. Vera %A Young, S.S. %A X. Lin %A J. P. Reiter %X

There is clear value, in both industrial and government settings, derived from performing statistical analyses that, in effect, integrate data in multiple, distributed databases. However, the barriers to actually integrating the data can be substantial or even insurmountable. Corporations may be unwilling to share proprietary databases such as chemical databases held by pharmaceutical manufacturers, government agencies are subject to laws protecting confidentiality of data subjects, and even the sheer volume of the data may preclude actual data integration. In this paper, we show how tools from modern information technology, specifically secure multiparty computation and networking, can be used to perform statistically valid analyses of distributed databases. The common characteristic of the methods we describe is that the owners share sufficient statistics computed on the local databases in a way that protects each owner from the others. That is, while each owner can calculate the “complement” of its contribution to the analysis, it cannot discern which other owners contributed what to that complement. Our focus is on horizontally partitioned data: the data records rather than the data attributes are spread among the owners. We present protocols for secure regression, contingency tables, maximum likelihood and Bayesian analysis. For low-risk situations, we describe a secure data integration protocol that integrates the databases but prevents owners from learning the source of data records other than their own. Finally, we outline three current research directions: a software system implementing the protocols, secure EM algorithms, and partially trusted third parties, which reduce owners’ incentives to be dishonest.
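A building block behind such protocols is secure summation of locally computed sufficient statistics. The sketch below simulates the familiar ring-based version in a single process; it is an illustration of the idea, not the authors' software system, and the values are arbitrary.

```python
import random

# Sketch of the secure-summation building block often used to pool sufficient
# statistics (e.g., local contributions to X'X and X'y for a regression) across
# database owners without revealing any single owner's contribution.

def secure_sum(local_values, upper_bound=10**9):
    """Simulate ring-based secure summation among cooperating owners."""
    # Owner 1 masks its value with a large random offset...
    offset = random.randrange(upper_bound)
    running = (offset + local_values[0]) % upper_bound
    # ...each subsequent owner adds its own value to the masked running total...
    for v in local_values[1:]:
        running = (running + v) % upper_bound
    # ...and owner 1 removes the offset to recover the exact total.
    return (running - offset) % upper_bound

# Three owners' local statistics (values must be smaller than upper_bound).
print(secure_sum([120, 345, 78]))   # 543, with no owner seeing another's value
```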

%B Technometrics %V 48 %P 133-143 %G eng %R 10.1198/004017007000000209 %0 Journal Article %J Statistical Science %D 2005 %T Data dissemination and disclosure limitation in a world without microdata: A risk-utility framework for remote access analysis servers %A A. F. Karr %A J. Feng %A X. Lin %A J. P. Reiter %A A. P. Sanil %A Young, S.S. %B Statistical Science %V 20 %P 163-177 %G eng %0 Journal Article %J Journal of Chemical Information and Modeling %D 2005 %T PowerMV: A Software Environment for Molecular Viewing, Descriptor Generation, Data Analysis and Hit Evaluation %A Liu, J. %A J. Feng %A Young, S.S. %X

Ideally, a team of biologists, medicinal chemists and information specialists will evaluate the hits from high-throughput screening (HTS). In practice, it often falls to nonmedicinal chemists to make the initial evaluation of HTS hits. Chemical genetics and high content screening both rely on screening in cells or animals where the biological target may not be known. There is a need to place active compounds into a context to suggest potential biological mechanisms. Our idea is to build an operating environment to help the biologist make the initial evaluation of HTS data. To this end, the operating environment provides viewing of compound structure files, computation of basic biologically relevant chemical properties and searching against biologically annotated chemical structure databases. The benefit is to help the nonmedicinal chemist, biologist and statistician put compounds into a potentially informative biological context. Although there are several similar public and private programs used in the pharmaceutical industry to help evaluate hits, these programs are often built for computational chemists. Our program is designed for use by biologists and statisticians.
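For readers who want a concrete picture of the descriptor-generation step, the snippet below computes a few simple, biologically relevant properties from SMILES strings using the open-source RDKit toolkit as a stand-in; PowerMV itself is a separate application, and the example molecules are arbitrary.

```python
# Rough stand-in for the descriptor-generation step, using the open-source RDKit
# toolkit rather than PowerMV itself. The SMILES strings are arbitrary examples.
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]  # ethanol, phenol, aspirin

for smi in smiles:
    mol = Chem.MolFromSmiles(smi)
    print(smi,
          round(Descriptors.MolWt(mol), 1),        # molecular weight
          round(Descriptors.MolLogP(mol), 2),      # estimated logP
          Descriptors.NumHDonors(mol),             # hydrogen-bond donors
          Descriptors.NumHAcceptors(mol))          # hydrogen-bond acceptors
```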

%B Journal of Chemical Information and Modeling %V 45 %P 515-522 %G eng %R 10.1021/ci049847v %0 Journal Article %J Pharmacogenomics %D 2005 %T Recursive partitioning as a tool for pharmacogenetic studies of complex diseases: II. Statistical considerations %A Zaykin, D.V. %A Young, S.S. %X

Identifying genetic variations predictive of important phenotypes, such as disease susceptibility, drug efficacy, and adverse events, remains a challenging task. There are individual polymorphisms that can be tested one at a time, but there is the more difficult problem of the identification of combinations of polymorphisms or even more complex interactions of genes with environmental factors. Diseases, drug responses or side effects can result from different mechanisms. Identification of subgroups of people where there is a common mechanism is a problem for diagnosis and prescribing of treatment. Recursive partitioning (RP) is a simple statistical tool for segmenting a population into non-overlapping groups where the response of interest (disease susceptibility, drug efficacy, or adverse events) is more homogeneous within the segments. We suggest that the use of RP is not only more technically feasible than other search methods but also less susceptible to multiple-testing problems. The number of combinations of gene-gene and gene-environment interactions is potentially astronomical, and RP greatly reduces the effective search and inference space. Moreover, RP’s reliance on the presence of marginal effects is justifiable, as shown by analytical and numerical arguments. In the context of haplotype analysis, results suggest that the analysis of individual SNPs is likely to be successful even when susceptibilities are determined by haplotypes. Retrospective clinical studies where cases and controls are collected will be a common design. This report provides methods that can be used to adjust the RP analysis to reflect the population incidence of the response of interest. Confidence limits on the incidence of the response in the segmented subgroups are also discussed. RP is a straightforward way to create realistic subgroups, and prediction intervals for the within-subgroup disease incidence are easily obtained.
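A minimal stand-in for the RP segmentation step, using scikit-learn's CART trees on simulated case-control genotype data (not the authors' software or data), might look like this; each leaf of the fitted tree is one of the non-overlapping subgroups whose incidence the paper then adjusts to the population scale.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Minimal stand-in for recursive partitioning on case/control SNP data,
# using scikit-learn's CART implementation rather than the authors' software.
# Genotypes and outcomes are simulated purely for illustration.
rng = np.random.default_rng(1)
n, n_snps = 500, 10
genotypes = rng.integers(0, 3, size=(n, n_snps))        # 0/1/2 minor-allele counts
risk = 0.15 + 0.25 * (genotypes[:, 2] == 2)             # one SNP carries signal
cases = rng.random(n) < risk

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=30, random_state=0)
tree.fit(genotypes, cases)

# Each leaf is a non-overlapping subgroup; its mean outcome is the sample
# incidence, which would then be adjusted to the population incidence.
print(export_text(tree, feature_names=[f"SNP{i}" for i in range(n_snps)]))
```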

%B Pharmacogenomics %V 6 %P 77-89 %G eng %R 10.1517/14622416.6.1.77 %0 Journal Article %J J. Computer-Aided Molecular Design %D 2005 %T Secure analysis of distributed chemical databases without data integration %A Alan F. Karr %A Jun Feng %A Xiaodong Lin %A Ashish P. Sanil %A S. Stanley Young %A Jerome P. Reiter %B J. Computer-Aided Molecular Design %V 19 %P 739-747 %8 November %G eng %0 Journal Article %J Journal of Chemical Information and Computer Science %D 2002 %T The construction and assessment of a statistical model for the prediction of protein assay data %A Jennifer Pittman Clarke %A Jerome Sacks %A S. Stanley Young %X

The focus of this work is the development of a statistical model for a bioinformatics database whose distinctive structure makes model assessment an interesting and challenging problem. The key components of the statistical methodology, including a fast approximation to the singular value decomposition and the use of adaptive spline modeling and tree-based methods, are described, and preliminary results are presented. These results are shown to compare favorably to selected results achieved using comparative methods. An attempt to determine the predictive ability of the model through the use of cross-validation experiments is discussed. In conclusion, a synopsis of the results of these experiments and their implications for the analysis of bioinformatic databases in general is presented.
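The "fast approximation to the singular value decomposition" is not specified here; as a hedged illustration of the general approach, a randomized SVD from scikit-learn can supply low-dimensional features for downstream spline and tree-based models. The matrix below is random and the routine is a stand-in, not the paper's method.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Stand-in for a fast approximate SVD, applied to a random matrix purely so the
# example runs; the paper's own approximation and data are not reproduced here.
rng = np.random.default_rng(0)
X = rng.random((2000, 300))          # assay-by-descriptor matrix, illustrative

U, s, Vt = randomized_svd(X, n_components=10, random_state=0)
X_reduced = U * s                    # low-dimensional features for downstream
                                     # spline / tree-based modeling
print(X_reduced.shape)               # (2000, 10)
```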

%B Journal of Chemical Information and Computer Science %V 42 %P 729-741 %G eng %R 10.1021/ci0103828 %0 Book Section %B Molecular Modeling and Prediction of Bioactivity %D 2000 %T Analysis of a Large, High-Throughput Screening Data Using Recursive Partitioning %A Young, S. Stanley %A Jerome Sacks %E Gundertofte, Klaus %E Jørgensen, Flemming Steen %X

As biological drug targets multiply through the human genome project and as the number of chemical compounds available for screening becomes very large, the expense of screening every compound against every target becomes prohibitive. We need to improve the efficiency of the drug screening process so that active compounds can be found for more biological targets and turned over to medicinal chemists for atom-by-atom optimization. We create a method for analysis of the very large, complex data sets coming from high throughput screening, and then integrate the analysis with the selection of compounds for screening so that the structure-activity rules derived from an initial compound set can be used to suggest additional compounds for screening. Cycles of screening and analysis become sequential screening rather than the mass screening of all available compounds. We extend the analysis method to deal with multivariate responses. Previously, a screening campaign might screen hundreds of thousands of compounds; sequential screening can cut the number of compounds screened by up to eighty percent. Sequential screening also gives SAR rules that can be used to mathematically screen compound collections or virtual chemical libraries.
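The screen-analyze-select cycle described above can be sketched in a few lines. In the sketch below a random forest stands in for the recursive-partitioning SAR model, and the compound pool, descriptors, and activity rule are all simulated; it illustrates the sequential-screening loop rather than the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Sketch of the sequential-screening cycle: screen a batch, fit an SAR model,
# and use it to pick the next batch from the unscreened pool. A random forest
# stands in for recursive partitioning; descriptors and activities are simulated.
rng = np.random.default_rng(0)
pool = rng.random((5000, 50))                       # compound descriptors
true_activity = pool[:, 0] + pool[:, 1] > 1.4       # hidden structure-activity rule

screened = list(rng.choice(len(pool), 200, replace=False))   # initial random batch
for cycle in range(4):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(pool[screened], true_activity[screened])
    unscreened = np.setdiff1d(np.arange(len(pool)), screened)
    scores = model.predict_proba(pool[unscreened])[:, 1]
    next_batch = unscreened[np.argsort(scores)[::-1][:200]]  # most promising first
    screened.extend(next_batch.tolist())
    print(f"cycle {cycle}: hits found so far = {int(true_activity[screened].sum())}")
```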

%B Molecular Modeling and Prediction of Bioactivity %I Springer US %P 149-156 %@ 978-1-4613-6857-1 %G eng %U http://dx.doi.org/10.1007/978-1-4615-4141-7_17 %R 10.1007/978-1-4615-4141-7_17 %0 Book Section %B Statistics in Science and Technology: Case Studies 4 %D 1998 %T Bayesian Mixture Models in Exploration of Structure-Activity Relationships in Drug Design %A Susan Paddock %A Michael West %A S. Stanley Young %A M. Clyde %B Statistics in Science and Technology: Case Studies 4 %I Springer-Verlag %G eng %0 Journal Article %J Atmospheric Environment %D 1996 %T Accounting for meteorological effects in measuring urban ozone levels and trends %A Bloomfield, Peter %A Royle, Andy %A Steinberg, Laura J. %A Yang, Qing %K median polish %K meteorological adjustment %K nonlinear regression %K nonparametric regression %K Ozone concentration %X

Observed ozone concentrations are valuable indicators of possible health and environmental impacts. However, they are also used to monitor changes and trends in the sources of ozone and of its precursors, and for this purpose the influence of meteorological variables is a confounding factor. This paper examines ozone concentrations and meteorology in the Chicago area. The data are described using least absolute deviations and local regression. The key relationships observed in these analyses are then used to construct a nonlinear regression model relating ozone to meteorology. The model can be used to estimate that part of the trend in ozone levels that cannot be accounted for by trends in meteorology, and to ‘adjust’ observed ozone concentrations for anomalous weather conditions.
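As a hedged illustration of meteorological adjustment, the sketch below fits a simple linear model (ozone on temperature, wind, and a time trend) to simulated data and reads off the trend that meteorology cannot account for; the paper's actual model is nonlinear and uses a richer set of meteorological covariates.

```python
import numpy as np

# Rough sketch of meteorological adjustment: regress ozone on meteorology plus a
# linear time trend, and report the trend that meteorology cannot account for.
# Ordinary least squares on simulated data stands in for the paper's nonlinear model.
rng = np.random.default_rng(0)
n_days = 1500
t = np.arange(n_days) / 365.0                       # time in years
temp = 20 + 10 * np.sin(2 * np.pi * t) + rng.normal(0, 2, n_days)
wind = rng.gamma(2.0, 2.0, n_days)
ozone = 40 + 2.0 * temp - 1.5 * wind - 0.8 * t + rng.normal(0, 5, n_days)

X = np.column_stack([np.ones(n_days), temp, wind, t])
beta, *_ = np.linalg.lstsq(X, ozone, rcond=None)
print("meteorologically adjusted trend (ppb/year):", round(beta[3], 2))

# 'Adjusted' concentrations: replace each day's meteorology with its average.
adjusted = ozone - (beta[1] * (temp - temp.mean()) + beta[2] * (wind - wind.mean()))
print(adjusted[:5].round(1))
```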

%B Atmospheric Environment %V 30 %P 3067-3077 %G eng %N 17 %R 10.1016/1352-2310(95)00347-9