%0 Journal Article %J Analytical Chemistry %D 2014 %T QC Metrics from CPTAC Raw LC-MS/MS Data Interpreted through Multivariate Statistics %A X. Wang %A M. C. Chambers %A L. J. Vega-Montoto %A D. M. Bunk %A S. E. Stein %A D. Tabb %X
Shotgun proteomics experiments integrate a complex sequence of processes, any of which can introduce variability. Quality metrics computed from LC-MS/MS data have relied upon identifying MS/MS scans, but a new mode for the QuaMeter software produces metrics that are independent of identifications. Rather than evaluating each metric independently, we have created a robust multivariate statistical toolkit that accommodates the correlation structure of these metrics and allows for hierarchical relationships among data sets. The framework enables visualization and structural assessment of variability. Study 1 for the Clinical Proteomics Technology Assessment for Cancer (CPTAC), which analyzed three replicates of two common samples at each of two time points among 23 mass spectrometers in nine laboratories, provided the data to demonstrate this framework, and CPTAC Study 5 provided data from complex lysates under Standard Operating Procedures (SOPs) to complement these findings. Identification-independent quality metrics enabled the differentiation of sites and run times through robust principal components analysis and subsequent factor analysis. Dissimilarity metrics revealed outliers in performance, and a nested ANOVA model revealed the extent to which all metrics or individual metrics were impacted by mass spectrometer and run time. Study 5 data revealed that even when SOPs have been applied, instrument-dependent variability remains prominent, although it may be reduced, while within-site variability is reduced significantly. Finally, identification-independent quality metrics were shown to be predictive of identification sensitivity in these data sets. QuaMeter and the associated multivariate framework are available from http://fenchurch.mc.vanderbilt.edu and http://homepages.uc.edu/~wang2x7/, respectively.
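As a concrete illustration of the workflow this abstract describes (not taken from the paper), the sketch below reduces a runs-by-metrics table with principal components analysis; robust scaling stands in for the paper's robust PCA, and the metric names and data are invented:

```python
# Hypothetical sketch: identification-independent QC metrics for a set of LC-MS/MS
# runs, reduced with PCA after robust scaling (a stand-in for robust PCA).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
metrics = ["MS1-TIC-Q2", "MS2-Count", "RT-Duration", "Charge-2-Fraction"]  # illustrative names
X = rng.normal(size=(23, len(metrics)))        # rows = runs, e.g. one per instrument

X_scaled = RobustScaler().fit_transform(X)     # median/IQR scaling resists outlier runs
pca = PCA(n_components=2).fit(X_scaled)
scores = pca.transform(X_scaled)               # site and run-time structure would appear here
print("explained variance ratio:", pca.explained_variance_ratio_)
```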
%B Analytical Chemistry %V 86 %P 2497–2509 %G eng %U http://pubs.acs.org/doi/pdf/10.1021/ac4034455 %R 10.1021/ac4034455 %0 Journal Article %J Information Fusion %D 2012 %T Bayesian CAR models for syndromic surveillance on multiple data streams: Theory and practice %A A. F. Karr %A D. L. Banks %A G. Datta %A J. Lynch %A J. Niemi %A F. Vera %K Bayes %K CAR models %K Gibbs distribution %K Markov random field %K Syndromic surveillance %X

Syndromic surveillance has, so far, considered only simple models for Bayesian inference. This paper details the methodology for a serious, scalable solution to the problem of combining symptom data from a network of US hospitals for early detection of disease outbreaks. The approach requires high-end Bayesian modeling and significant computation, but the strategy described in this paper appears to be feasible and offers attractive advantages over the methods that are currently used in this area. The method is illustrated by application to ten quarters' worth of data on opioid drug abuse surveillance from 636 reporting centers, and then compared to two other syndromic surveillance methods using simulation to create a known signal in the drug abuse database.
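To make the modeling concrete, here is a toy construction (not from the paper) of the proper conditional autoregressive prior that underlies such models, with precision Q = tau * (D - rho * W); the four-region adjacency matrix and parameter values are invented:

```python
# Illustrative sketch of a proper CAR (conditional autoregressive) prior:
# precision Q = tau * (D - rho * W) for an adjacency matrix W.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # which reporting regions neighbor which
D = np.diag(W.sum(axis=1))                  # number of neighbors per region
tau, rho = 2.0, 0.9                         # precision scale and spatial dependence
Q = tau * (D - rho * W)                     # positive definite here since |rho| < 1

# Sample one realization of the latent spatial field via the Cholesky factor of Q.
L = np.linalg.cholesky(Q)
z = np.random.default_rng(1).normal(size=4)
phi = np.linalg.solve(L.T, z)               # phi ~ N(0, Q^{-1})
print(phi)
```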

%B Information Fusion %V 13 %P 105–116 %G eng %U http://dx.doi.org/10.1016/j.inffus.2009.10.005 %0 Journal Article %J Statistical Analysis and Data Mining %D 2012 %T Bayesian methodology for the analysis of spatial-temporal surveillance data %A Zou, Jian %A Karr, Alan F. %A Banks, David %A Heaton, Matthew J. %A Datta, Gauri %A Lynch, James %A Vera, Francisco %K conditional autoregressive process %K Markov random field %K spatial statistics %K spatio-temporal %K Syndromic surveillance %X

Early and accurate detection of outbreaks is one of the most important objectives of syndromic surveillance systems. We propose a general Bayesian framework for syndromic surveillance that incorporates Gaussian Markov random field (GMRF) and spatio-temporal conditional autoregressive (CAR) modeling; by contrast, most previous approaches have been based on only spatial or time series models. The model has appealing probabilistic representations as well as attractive statistical properties. Extensive simulation studies show that the model is capable of capturing outbreaks rapidly while still limiting false positives.
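One standard way to assemble a spatio-temporal GMRF precision of the kind mentioned here is a Kronecker sum of a temporal random-walk precision and a spatial CAR precision; the sketch below is an illustrative assumption, not the paper's construction, and the sizes, adjacency, and small ridge are invented:

```python
# Toy spatio-temporal GMRF precision: temporal random walk combined with an
# intrinsic spatial CAR via a Kronecker sum; a small ridge keeps it proper.
import numpy as np

T, S = 5, 4                                  # time points, spatial regions
R = (np.diag([1., 2., 2., 2., 1.])           # RW1 (random-walk) precision over time
     - np.diag(np.ones(T - 1), 1)
     - np.diag(np.ones(T - 1), -1))
W = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]], float)
Qs = np.diag(W.sum(1)) - W                   # intrinsic CAR spatial precision

Q = np.kron(R, np.eye(S)) + np.kron(np.eye(T), Qs) + 1e-6 * np.eye(T * S)
print(Q.shape)                               # (20, 20): joint precision over space-time
```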

%B Statistical Analysis and Data Mining %I Wiley Subscription Services, Inc., A Wiley Company %V 5 %P 194–204 %G eng %U http://dx.doi.org/10.1002/sam.10142 %R 10.1002/sam.10142 %0 Journal Article %J Statistics in Medicine %D 2012 %T A spatio-temporal absorbing state model for disease and syndromic surveillance %A M. J. Heaton %A A. F. Karr %A J. Zou %A D. L. Banks %A G. Datta %A J. Lynch %A F. Vera %X

Reliable surveillance models are an important tool in public health because they aid in mitigating disease outbreaks, identify where and when outbreaks occur, and predict future occurrences. Although many statistical models have been devised for surveillance purposes, none are able to simultaneously achieve the important practical goals of good sensitivity and specificity, proper use of covariate information, inclusion of spatio-temporal dynamics, and transparent support to decision-makers. In an effort to achieve these goals, this paper proposes a spatio-temporal conditional autoregressive hidden Markov model with an absorbing state. The model performs well both in a large simulation study and in an application to influenza/pneumonia fatality data.
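A toy version of the absorbing-state idea (much simpler than the paper's model): a two-state hidden Markov model whose outbreak state cannot be left, with a forward recursion turning Poisson counts into a running outbreak probability; all numbers are invented:

```python
# Two-state HMM with an absorbing "outbreak" state and a forward recursion.
import numpy as np
from scipy.stats import poisson

P = np.array([[0.95, 0.05],      # endemic -> endemic / outbreak
              [0.00, 1.00]])     # outbreak is absorbing
rates = np.array([10.0, 25.0])   # Poisson mean counts under each state
counts = [9, 11, 10, 24, 27]     # weekly syndrome counts (made up)

belief = np.array([1.0, 0.0])    # start in the endemic state
for y in counts:
    belief = belief @ P                    # predict the next state distribution
    belief *= poisson.pmf(y, rates)        # update with the observed count
    belief /= belief.sum()                 # normalize to a probability vector
    print(f"count={y:3d}  P(outbreak)={belief[1]:.3f}")
```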

%B Statistics in Medicine %V 31 %P 2123-2136 %G eng %0 Journal Article %J Technometrics %D 2006 %T Secure, privacy-preserving analysis of distributed databases %A Alan F. Karr %A Fulp, WJ %A F. Vera %A Young, S.S. %A X. Lin %A J. P. Reiter %X

There is clear value, in both industrial and government settings, derived from performing statistical analyses that, in effect, integrate data in multiple, distributed databases. However, the barriers to actually integrating the data can be substantial or even insurmountable. Corporations may be unwilling to share proprietary databases such as chemical databases held by pharmaceutical manufacturers, government agencies are subject to laws protecting confidentiality of data subjects, and even the sheer volume of the data may preclude actual data integration. In this paper, we show how tools from modern information technology, specifically secure multiparty computation and networking, can be used to perform statistically valid analyses of distributed databases. The common characteristic of the methods we describe is that the owners share sufficient statistics computed on the local databases in a way that protects each owner from the others. That is, while each owner can calculate the "complement" of its contribution to the analysis, it cannot discern which other owners contributed what to that complement. Our focus is on horizontally partitioned data: the data records rather than the data attributes are spread among the owners. We present protocols for secure regression, contingency tables, maximum likelihood, and Bayesian analysis. For low-risk situations, we describe a secure data integration protocol that integrates the databases but prevents owners from learning the source of data records other than their own. Finally, we outline three current research directions: a software system implementing the protocols, secure EM algorithms, and partially trusted third parties, which reduce owners' incentives to be dishonest.
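The sufficient-statistics idea is easy to sketch for horizontally partitioned regression: each owner releases only X'X and X'y, and solving with the pooled sums reproduces the regression on the combined data. In the sketch below, plain sums stand in for the secure-summation protocol and the data are synthetic:

```python
# Horizontally partitioned regression from shared sufficient statistics.
import numpy as np

rng = np.random.default_rng(2)
beta_true = np.array([1.0, -2.0, 0.5])

def local_stats(n):
    """One owner's contribution: only X'X and X'y leave the site."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ beta_true + rng.normal(scale=0.1, size=n)
    return X.T @ X, X.T @ y

owners = [local_stats(n) for n in (100, 250, 75)]   # three database owners
XtX = sum(s[0] for s in owners)                     # stand-in for secure summation
Xty = sum(s[1] for s in owners)
beta_hat = np.linalg.solve(XtX, Xty)                # equals regression on pooled data
print(beta_hat)
```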

%B Technometrics %V 48 %P 133-143 %G eng %R 10.1198/004017007000000209 %0 Journal Article %J Statistical Science %D 2001 %T Computer intrusion: detecting masqueraders %A Alan Karr %A William DuMouchel %A Wen-Hua Ju %A Martin Theus %A Yehuda Vardi %K Anomaly %K Bayes %K compression %K computer security %K high-order Markov %K profiling %K Unix %X

Masqueraders in computer intrusion detection are people who use somebody else's computer account. We investigate a number of statistical approaches for detecting masqueraders. To evaluate them, we collected UNIX command data from 50 users and then contaminated the data with masqueraders. The experiment was blinded. We show results from six methods, including two approaches from the computer science community.

%B Statistical Science %V 16 %P 1-17 %G eng %0 Journal Article %J Journal of Computational and Graphical Statistics %D 2001 %T A Hybrid High-Order Markov Chain Model for Computer Intrusion Detection %A Ju, Wen-Hua %A Vardi, Yehuda %X

A hybrid model based mostly on a high-order Markov chain and occasionally on a statistical-independence model is proposed for profiling command sequences of a computer user in order to identify a "signature behavior" for that user. Based on the model, an estimation procedure for such a signature behavior, driven by maximum likelihood (ML) considerations, is devised. The formal ML estimates are numerically intractable, but the ML optimization problem can be replaced by a linear inverse problem with a positivity constraint (LININPOS), for which the EM algorithm can be used as an equation solver to produce an approximate ML estimate. The intrusion detection system works by comparing a user's command sequence to the user's and others' estimated signature behaviors in real time through statistical hypothesis testing. A form of likelihood-ratio test is used to detect whether a given sequence of commands is from the proclaimed user, with the alternative hypothesis being a masquerading user. Applying the model to real-life data collected from AT&T Labs-Research indicates that the new methodology holds some promise for intrusion detection.
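As a first-order stand-in for the high-order hybrid model (and with an invented command vocabulary and invented sequences), the sketch below estimates per-user command-transition probabilities with smoothing and scores a session by a log likelihood ratio against the proclaimed user's profile:

```python
# First-order Markov profile of command sequences with a likelihood-ratio score.
from collections import Counter, defaultdict
from math import log

def fit_transitions(seq, alpha=1.0, vocab=None):
    """Smoothed command-transition probabilities estimated from one user's history."""
    vocab = vocab or sorted(set(seq))
    counts = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return {a: {b: (counts[a][b] + alpha) / (sum(counts[a].values()) + alpha * len(vocab))
                for b in vocab} for a in vocab}

def log_lik(seq, model):
    return sum(log(model[a][b]) for a, b in zip(seq, seq[1:]))

vocab = ["ls", "cd", "vi", "make", "mail"]
user = fit_transitions(["ls", "cd", "ls", "vi", "make", "make", "vi"] * 20, vocab=vocab)
other = fit_transitions(["mail", "ls", "mail", "cd", "mail"] * 20, vocab=vocab)

session = ["mail", "ls", "mail", "cd", "mail", "ls"]
lr = log_lik(session, user) - log_lik(session, other)   # proclaimed user vs. alternative
print("masquerader suspected" if lr < 0 else "looks like the user")
```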

%B Journal of Computational and Graphical Statistics %V 10 %P 277-295 %G eng %0 Conference Paper %B Proceedings of the International Workshop on Empirical Studies of Software Maintenance %D 1996 %T An empirical exploration of code evolution %A A. F. Karr %A A. A. Porter %A L. G. Votta %B Proceedings of the International Workshop on Empirical Studies of Software Maintenance %G eng