National Institute of Statistical Sciences
karr@niss.org; sacks@niss.org
National Institute of Statistical Sciences
P.O. Box 14006
Research Triangle Park, NC 27709-4006
February 28, 1997
Federal information services face both unique and more generic problems associated with the scale and complexity of data sets. Confidentiality, geographical dispersion, legacy hardware and software, and data quality, while not unique, are more acute in Federal agencies than elsewhere, and their confluence presents special challenges. Progress in accessing data is rapid, however, and exciting new techniques for understanding information (especially visualization, as in Eick97) are currently being developed.
That solutions to some problems are in sight exacerbates rather than attenuates two central questions:
For example, detection of incidents (e.g., accidents, vehicle breakdowns) and prediction of breakdown (a technical term connoting sudden decreases in speed and volume not caused by identifiable external events such as incidents) on freeways and arterial networks require identification of anomalous patterns in high-dimensional (for example, space, time, flow and speed) data. The available data come from sensors (most often, loop detectors) with disparate temporal aggregation and spatial coverage, but in massive quantities. (Some systems with which we have dealt, such as a relatively small portion of I-5 in Seattle, WA, produce megabytes of data per day.) Heterogeneity and multiplicity issues are rampant for detection on a link-by-link, time period-by-time period basis. Also involved is data integration -- for example, of video data, detector output, data from probe vehicles and anecdotal ("*999" on cellular telephones) reports.
The question of what data to collect -- how to instrument an Intelligent Transportation System informatively, manageably and cost effectively -- is yet another critical and under-addressed problem.
Analogous systems at the Federal level -- notably, the software that runs air traffic control, that used by the Internal Revenue Service, or that used for Federal Information Services -- deal with data sets that are terabytes in size. The modular (and hence graphical) structure of software raises issues of analysis and visualization. Integration of data sets (regarding, among other things, change history of the code, organizational structure, personnel and project management information) overlaps with similar needs in other venues.
Improved statistical modeling of large insurance claims (for example, combining extreme value theory with Bayesian hierarchical modeling to produce predictive distributions of future losses, as in Smith97) helps cope with enormous uncertainties about key parameters. Future research, however, must make explicit use of statistics from extreme environmental events such as hurricanes, since extreme losses arise from hurricanes that are extreme in other ways as well (for example, duration, precipitation, wind speed, and route). Integration of data on meteorology, sociodemography, commerce, land use and infrastructure is necessary. Sequential sampling from databases is likely to be necessary, coupled with theoretical ideas for extremes in multivariate and spatially distributed data and Bayesian analyses.
Characterizing extreme heterogeneity among customers will require application of tools for clustering and classification to very large databases. Multiplicity is extreme: millions of decisions must be made every day, which is not possible without innovative strategies to reduce errors. Design of query sequences and inference on the basis of multiple, data-dependent queries call for Bayesian strategies, which allow for characterization and updating of time-varying uncertainty. Construction of call patterns is a classification problem, approachable using tools such as CART, neural networks, multivariate correspondence analysis and HOMALS (Homogeneity Analysis by Alternating Least Squares).
The last of these examples bears on important Federal concerns such as Medicare fraud.
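The Bayesian updating of time-varying uncertainty mentioned in the telecommunications example can be illustrated with a minimal sketch. It assumes a single unknown event rate (say, the fraction of calls that are anomalous) with a Beta prior, updated as daily counts arrive; the function names, the counts and the `discount` factor that down-weights older evidence are all hypothetical choices made for illustration, not a method described in this paper.

```python
def update_beta(alpha, beta, events, non_events, discount=1.0):
    """Conjugate Beta-Binomial update of an unknown event rate.

    discount < 1 shrinks the weight of earlier evidence before adding the
    new counts, a simple (assumed) way to let the rate drift over time.
    discount = 1 gives the standard conjugate update.
    """
    return (discount * alpha + events, discount * beta + non_events)


def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Illustrative (made-up) counts: start from a uniform Beta(1, 1) prior.
prior = (1.0, 1.0)
day1 = update_beta(*prior, events=3, non_events=97)
day2 = update_beta(*day1, events=10, non_events=90, discount=0.5)

print(day1, posterior_mean(*day1))
print(day2, posterior_mean(*day2))
```

Because the update touches only two numbers per monitored quantity, it can in principle be repeated for very many customers or call patterns without revisiting past data.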
The size of these problems militates against strategies that bombard the problem with an array of methods (such as a slew of regression techniques, incorporating different model selection methods and cross-validations). Early, rapid assessment of a few good strategies is essential. At the same time, inadequate approaches must be recognized and re-oriented quickly, lest the whole exploration bog down. These thoughts run counter to statistical practice customarily applied to modest-sized data sets, and therefore pose novel obstacles to a penetrating analysis.
Querying the data amounts to measuring or computing a function or functions f(x) of data points x; some queries are themselves suggested by the data. Therefore, analysis of a large data set can be difficult not only because there are a lot of data points or because they are high-dimensional, but also because there are a lot of relevant queries. Multiplicity must be dealt with in order to analyze such "meta-queries."
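To make the multiplicity problem concrete, the following sketch treats each query as a hypothesis test that yields a p-value and compares three decision rules: naive thresholding, the Bonferroni correction, and the Benjamini-Hochberg step-up procedure. These are standard textbook corrections chosen for illustration, not the strategies proposed in this paper, and the p-values are invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject query i only if p_i <= alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values from ten queries against the same data set.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = [p <= 0.05 for p in p_values]
print("naive rejections:     ", sum(naive))
print("Bonferroni rejections:", sum(bonferroni(p_values)))
print("BH rejections:        ", sum(benjamini_hochberg(p_values)))
```

With these invented inputs the naive rule flags five queries, while the two corrected rules flag fewer. With millions of queries per day, the gap between uncorrected and corrected decisions is the practical content of the multiplicity issue.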
Heterogeneity is especially acute when there are many data points, since its discovery and characterization, rather than techniques that mitigate its effects, are an essential component of the problem.
High-dimensional data can make otherwise routine computations bedeviling. When the dimension of the data increases with time, additional feasibility issues, especially recursive updating of models, must be dealt with.
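Recursive updating can be illustrated in its simplest form by a running mean and variance that absorb one observation at a time without storing or rescanning the data. The sketch below uses Welford's algorithm; the class name and sample values are assumptions for illustration, and a real model update (for instance, of regression coefficients) would be more involved.

```python
class RunningStats:
    """Running mean and sample variance, updated one observation at a time."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # current mean
        self.m2 = 0.0     # sum of squared deviations from the current mean

    def update(self, x):
        """Fold a new observation into the summary in constant time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 until two points are seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)

print(stats.n, stats.mean, stats.variance)
```

The storage cost is three numbers regardless of how many observations arrive, which is the property that makes recursive schemes attractive when data accumulate over time.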
Following a national site selection process, NISS was created as a non-profit North Carolina corporation in January, 1991, with ten parent organizations, including the American Statistical Association, the International Biometric Society and the Institute of Mathematical Statistics. (The others are Section U of the American Association for the Advancement of Science, the Committee of Presidents of Statistical Societies, Duke University, North Carolina State University, the University of North Carolina at Chapel Hill, the Research Triangle Institute and the Triangle Universities Center for Advanced Studies, Inc.) NISS is governed by its own Board of Trustees, currently chaired by John C. Bailar, III (University of Chicago). Management and scientific leadership are vested in Jerome Sacks, Director (and also Professor, Statistics and Decision Sciences, Duke University) and Alan F. Karr, Associate Director (and also Professor, Statistics and Biostatistics, University of North Carolina at Chapel Hill).
Currently, NISS is engaged in a series of industrially partnered pilot projects involving large data sets, which individually and collectively target the central issues of (sequential) design, heterogeneity and multiplicity. In addition to the telecommunication fraud project mentioned in § 2.1, these include work on statistical techniques for drug discovery, in collaboration with Glaxo Wellcome, Research Triangle Park, NC (Tobias95, Welch97); and work with General Motors Research, Warren, MI, directed at identifying new, "niche" automobile markets. Current efforts with government agencies include research supported by the National Center for Education Statistics to address multiplicity issues in educational data sets, and work with the Environmental Protection Agency dealing with complexities of data reporting and visualization in connection with large data sets on the quality of drinking water. Details about NISS, its programs, its history and its governance are available on the World Wide Web at http://www.niss.org.
