Turning Data into Knowledge: Pilot Projects

Publication Date: February 28, 1997

National Institute of Statistical Sciences
P.O. Box 14006
Research Triangle Park, NC 27709-4006



The power of new hardware and software technologies will be unrealized unless the data they access and display can be turned into knowledge. This process is inherently statistical, and raises hitherto untreated issues of heterogeneity, multiplicity and design. We outline pilot projects targeted at these long-run issues, but in the context of high-stakes, immediate problems.

1. The Urgent Need

Large data sets are a given in modern industry, commerce, technology, science and government, and central issues concerning them demand statistical attention. A recent report (CATS96) by the Committee on Applied and Theoretical Statistics (CATS) of the National Research Council focused on the relevance of statistics to "data mining" and "Knowledge Discovery in Databases" in contexts ranging from space science to retail merchandising. A principal finding is that central issues for large data sets -- clustering, sampling, sequential design, heterogeneity, the multiplicity of queries made of the data and decisions based upon the results, uncertainty of forecasts and predictions, design in data and query space and feasibility of new methods -- are either innately statistical or approachable by means of statistical strategies.

Federal information services face both unique and more generic problems associated with the scale and complexity of data sets. Confidentiality, geographical dispersion, legacy hardware and software, and data quality, while not unique, are more acute in Federal agencies than elsewhere, and their confluence presents special challenges. Progress in accessing data is rapid, however, and exciting new techniques for understanding information (especially visualization, as in Eick97) are currently being developed.

That solutions to some problems are in sight exacerbates rather than attenuates two central questions:

  • What can we learn using techniques such as visualization that we did not know beforehand, and how sure can we be of it? Put differently, how can statistical inference be integrated with new technologies for data access and display? A surfeit of data, no matter how organized or rapidly accessible, and no matter how cleverly displayed, is not a surfeit of knowledge, and so the essential need to transform data to knowledge must be confronted. 
  • What data should be collected? This design issue may be pivotal: if irrelevant, uninformative data are not collected, they need not be assessed, mined or visualized. The financial costs of collecting uninformative data may be enormous. 

We propose below a series of pilot projects that begin to answer these questions.

2. Addressing the Need

The questions in § 1 can be approached by means of incisively chosen pilot projects that stimulate new modes of thought, collaboration and partnering; identify entry points for the research; and assess the credibility of speculative lines of investigation rapidly and cost-effectively. To work, pilots must have strong cross-disciplinary character, be bold and reach out to major issues, two of the most central of which are:

  • Heterogeneity: different models are necessary for different parts of a large, complex data space. 
  • Multiplicity: when many statements or comparisons are made, the need to understand accuracy is underscored by the fact that a (potentially large) number of the statements can be wrong purely by chance. 
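The multiplicity point can be made concrete with a little arithmetic. The sketch below (figures illustrative, not drawn from any of the pilot projects) shows that testing many hypotheses at a conventional significance level yields hundreds of spurious "discoveries" by chance alone, and that a Bonferroni-style correction makes the per-query standard correspondingly stringent:

```python
# Illustrative multiplicity arithmetic; the counts below are
# hypothetical, not taken from any of the pilot projects.

def expected_false_positives(num_tests, alpha):
    """Expected number of spurious rejections when every null is true."""
    return num_tests * alpha

def bonferroni_level(alpha_family, num_tests):
    """Per-test level controlling the family-wise error rate at alpha_family."""
    return alpha_family / num_tests

m = 10_000       # queries made of the data
alpha = 0.05     # conventional per-test significance level

print(expected_false_positives(m, alpha))   # 500 spurious findings expected
print(bonferroni_level(0.05, m))            # per-test threshold of 5e-06
```

Even this crude calculation shows why decisions based on vast numbers of queries require explicit error-rate control.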

2.1 Pilot Projects

The pilot projects described here target specific scientific issues with high-stakes implications for the Federal government at large. They employ available data sets, enabling immediate progress, but at the same time address long-run issues such as heterogeneity and multiplicity. Substantial scientific overlap among the projects will catalyze effective transfer of methods. Necessarily, pilot projects are speculative, in the sense that the path from data to information to knowledge is not known in advance. Potential pilot projects building on past and current NISS experience (see § 3) include


Intelligent Transportation Systems.

The Department of Transportation's Intelligent Transportation Systems (ITS) initiative depends intimately on acquisition and processing of large volumes of data. Failure to convert these data to knowledge has major economic, environmental and safety consequences. 

For example, detection of incidents (e.g., accidents, vehicle breakdowns) and prediction of breakdown (a technical term connoting sudden decreases in speed and volume not caused by identifiable external events such as incidents) on freeways and arterial networks require identification of anomalous patterns in high-dimensional (for example, space, time, flow and speed) data. The available data come from sensors (most often, loop detectors) at disparate temporal aggregation and spatial coverage, but in massive quantities. (Some systems with which we have dealt, such as (a relatively small portion of) I-5 in Seattle, WA, produce megabytes of data per day.) Heterogeneity and multiplicity issues are rampant for detection on a link-by-link, time period-by-time period basis. Also involved is data integration -- for example, of video data, detector output, data from probe vehicles and anecdotal ("*999" on cellular telephones) reports.

The question of what data to collect -- how to instrument an Intelligent Transportation System informatively, manageably and cost effectively -- is yet another critical and under-addressed problem.
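The link-by-link detection problem can be caricatured in a few lines. The sketch below is a hypothetical rolling-window rule, not any deployed ITS algorithm: it flags time points where speed on a single link falls far below its own recent history, one crude proxy for "breakdown." The window length and threshold are assumed values.

```python
# Hypothetical single-link breakdown flag; window and z are assumptions.
from statistics import mean, stdev

def flag_breakdowns(speeds, window=12, z=3.0):
    """Return indices where speed drops more than z trailing standard
    deviations below the trailing-window mean."""
    flags = []
    for t in range(window, len(speeds)):
        hist = speeds[t - window:t]
        m, s = mean(hist), stdev(hist)
        if s > 0 and speeds[t] < m - z * s:
            flags.append(t)
    return flags

# 20-second loop-detector speed readings (mph), with a sudden drop at the end
readings = [60, 61, 59, 60, 62, 61, 60, 59, 61, 60, 62, 61, 35, 30]
print(flag_breakdowns(readings))
```

Applying such a rule separately to every link and every time period is exactly where the heterogeneity and multiplicity issues above become rampant: each link needs its own baseline, and thousands of simultaneous tests guarantee false alarms unless error rates are controlled.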

Legacy Software.

A NISS project in partnership with Lucent Technologies is targeted at describing and quantifying how changes to large software systems become increasingly difficult over time. Ultimately, the goal is to devise organizational and other strategies to retard or remediate this "decay." 

Analogous systems at the Federal level -- notably, the software that runs air traffic control, that used by the Internal Revenue Service, or that used for Federal Information Services -- deal with data sets that are terabytes in size. The modular (and hence graphical) structure of software raises issues of analysis and visualization. Integration of data sets (regarding, among other things, change history of the code, organizational structure, personnel and project management information) overlaps with similar needs in other venues.

Extreme Risks.

Recent environmental disasters such as the Mississippi and Ohio River floods, and Hurricanes Andrew and Fran, have focused attention on the financial and other (for example, health) risks that the Federal government and other institutions face from very large claims. (Further evidence of this is the crisis that faced Lloyd's of London in the early 1990s, and the efforts being expended by the large reinsurance companies to assess and manage their risks better.) Informed action (prospective or retrospective) requires characterization of risks. 

Improved statistical modeling of large insurance claims, for example, combining extreme value theory with Bayesian hierarchical modeling to produce predictive distributions of future losses (Smith97), helps cope with enormous uncertainties about key parameters. Future research, however, must make explicit use of statistics from extreme environmental events such as hurricanes, since extreme losses arise from hurricanes that are extreme in other ways as well (for example, duration, precipitation, wind speed, and route). Integration of data on meteorology, sociodemography, commerce, land use and infrastructure is necessary. Sequential sampling from databases is likely to be necessary, coupled with theoretical ideas for extremes in multivariate and spatially distributed data and Bayesian analyses.
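The peaks-over-threshold idea underlying such modeling can be sketched briefly. In extreme value theory, excesses over a high threshold are modeled by a generalized Pareto distribution; the parameter values below are assumed purely for illustration, not estimated from any loss data.

```python
# Generalized Pareto tail probability; shape and scale are assumed
# illustrative values, not estimates from real claims data.
import math

def gpd_exceedance_prob(y, shape, scale):
    """P(excess over threshold > y) under a generalized Pareto model."""
    if shape == 0:
        return math.exp(-y / scale)
    return (1 + shape * y / scale) ** (-1 / shape)

# Suppose losses above a $10M threshold follow a GPD with a heavy tail
# (shape 0.5) and scale $5M.  Probability an exceedance tops $50M,
# i.e., an excess of $40M over the threshold:
print(gpd_exceedance_prob(40.0, shape=0.5, scale=5.0))
```

In a Bayesian hierarchical treatment, the shape and scale would themselves carry posterior distributions, which is precisely how the enormous parameter uncertainty propagates into predictive distributions of future losses.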

Telecommunications Fraud.

A pilot project partnered by AT&T Laboratories, Murray Hill, NJ, is planned to start later this year. The need is to detect the less than 1% of telecommunications traffic that is fraudulent, from transaction data comprising 200 million calls per day and involving thousands of customer accounts. The system is a time-varying network, in which nodes correspond to telephone numbers and links to in-progress calls. The primary data are characteristics, possibly aggregated, of individual telephone calls. The key need is to identify what factors, including those from other data bases (such as characteristics of the originating numbers), are common to a set of suspect calls, in order to define a potential pattern of fraudulent behavior. Should these calls actually be fraudulent, one must design (sequential) queries that detect it. 

Characterizing extreme heterogeneity among customers will require application of tools for clustering and classification to very large databases. Multiplicity is extreme: millions of decisions must be made every day, which is not possible without innovative strategies to reduce errors. Design of query sequences and inference on the basis of multiple, data-dependent queries call for Bayesian strategies, which allow for characterization and updating of time-varying uncertainty. Construction of call patterns is a classification problem, approachable using tools such as CART, neural networks, multivariate correspondence analysis and HOMALS (Homogeneity Analysis by Alternating Least Squares).
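The per-account aggregation and screening step can be sketched in miniature. The code below is an illustrative toy, not AT&T's system; the call-record fields and the screening thresholds are hypothetical.

```python
# Toy per-account call summarization for fraud screening.
# Record fields (account, duration_minutes, is_international) and the
# thresholds below are hypothetical illustrations.
from collections import defaultdict

def summarize(calls):
    """Aggregate raw call records into per-account totals."""
    totals = defaultdict(lambda: {"calls": 0, "minutes": 0.0, "intl": 0})
    for acct, minutes, intl in calls:
        t = totals[acct]
        t["calls"] += 1
        t["minutes"] += minutes
        t["intl"] += int(intl)
    return totals

def suspects(totals, intl_frac=0.8, min_calls=5):
    """Flag accounts whose traffic is dominated by international calls."""
    return sorted(a for a, t in totals.items()
                  if t["calls"] >= min_calls
                  and t["intl"] / t["calls"] >= intl_frac)

calls = ([("A", 2.0, False)] * 6 +
         [("B", 30.0, True)] * 5 + [("B", 1.0, False)])
print(suspects(summarize(calls)))
```

At 200 million calls per day, every account is subjected to such a screen daily, so the multiplicity of decisions, and the heterogeneity of legitimate calling behavior across accounts, dominate the design of any workable rule.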

The last of these examples bears on important Federal concerns such as Medicare fraud.

The size of these problems militates against strategies that bombard the problem with an array of methods (such as a slew of regression techniques, incorporating different model selection methods and cross-validations). Early, rapid assessment of a few good strategies is essential. At the same time, inadequate approaches must be recognized and re-oriented quickly, lest the whole exploration bog down. These thoughts run counter to statistical practice customarily applied to modest-sized data sets, and therefore pose novel obstacles to a penetrating analysis.


2.2 A Formulation of the Problem

Discovery in large data sets often arises through a process of "querying" the data and utilizing the answers to draw inferences. This process is hampered by the inability of standard algorithms to capture structure and relationships, and by heterogeneity, which not only complicates conventional tactics but, more importantly, requires that the differing components be identified and dealt with. Analyses involving vast numbers of queries can produce spurious results unless the multiplicity issue is faced.

Querying the data amounts to measuring or computing a function or functions f(x) of data points x; some queries are themselves suggested by the data. Therefore, analysis of a large data set can be difficult not only because there are a lot of data points or because they are high-dimensional, but also because there are a lot of relevant queries. Multiplicity must be dealt with in order to analyze such "meta-queries."

Heterogeneity is especially acute when there are many data points, since its discovery and characterization, rather than techniques that mitigate its effects, are an essential component of the problem.

High-dimensional data can make otherwise routine computations bedeviling. When the dimension of the data increases with time, additional feasibility issues, especially recursive updating of models, must be dealt with.
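Recursive updating has well-known minimal instances. Welford's one-pass algorithm for the mean and variance, sketched below, shows how summaries can be maintained as data accumulate, without revisiting earlier observations; it stands in for the far richer model updates the text envisions.

```python
# Welford's online algorithm: a minimal instance of recursive updating,
# maintaining running mean and variance in one pass over the data.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summaries."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the observations seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

rs = RunningStats()
for x in [2.0, 4.0, 6.0, 8.0]:
    rs.update(x)
print(rs.mean, rs.variance)
```

The same one-pass discipline, never re-scanning terabytes already processed, is what makes recursive model updating a feasibility issue rather than a mere convenience.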


3. About NISS

The National Institute of Statistical Sciences, an initiative of the national statistics community, is charged to perform and stimulate cross-disciplinary statistical research; to stimulate new paths and modes of investigation; to produce results with genuine impact on science and policy; and to engage young scientists in the research. The need for this to happen and the rationale for an independent institute to make it happen are detailed in the 1988 report of a panel of the Institute of Mathematical Statistics (IMS88).

Following a national site selection process, NISS was created as a non-profit North Carolina corporation in January, 1991, with ten parent organizations, including the American Statistical Association, the International Biometric Society and the Institute of Mathematical Statistics. (The others are Section U of the American Association for the Advancement of Science, the Committee of Presidents of Statistical Societies, Duke University, North Carolina State University, the University of North Carolina at Chapel Hill, the Research Triangle Institute and the Triangle Universities Center for Advanced Studies, Inc.) NISS is governed by its own Board of Trustees, currently chaired by John C. Bailar, III (University of Chicago). Management and scientific leadership are vested in Jerome Sacks, Director (and also Professor, Statistics and Decision Sciences, Duke University) and Alan F. Karr, Associate Director (and also Professor, Statistics and Biostatistics, University of North Carolina at Chapel Hill).

Currently, NISS is engaged in a series of industrially partnered pilot projects involving large data sets, which individually and collectively target the central issues of (sequential) design, heterogeneity and multiplicity. In addition to the telecommunications fraud project mentioned in § 2.1, these include work on statistical techniques for drug discovery, in collaboration with Glaxo Wellcome, Research Triangle Park, NC (Tobias95, Welch97); and work with General Motors Research, Warren, MI, directed at identifying new, "niche" automobile markets. Current efforts with government agencies include research supported by the National Center for Education Statistics addressing multiplicity issues in educational data sets, and work with the Environmental Protection Agency dealing with complexities of data reporting and visualization in connection with large data sets on the quality of drinking water. Details about NISS, its programs, its history and its governance are available on the World Wide Web at http://www.niss.org.



References

Committee on Applied and Theoretical Statistics (1996). Massive Data Sets: Proceedings of a Workshop. National Academy Press, Washington. 


Eick, S. G., Graves, T. L., Karr, A. F., and Mockus, A. (1996). Web-based text visualization. Technical memorandum, Bell Labs. 


IMS Panel on Cross-Disciplinary Research in the Statistical Sciences (1988).  Cross-Disciplinary Research in the Statistical Sciences. Institute of Mathematical Statistics. Reprinted in Statist. Sci. 5 121-146 (1990). 


Smith, R. L. (1997). Extreme value analysis of insurance risk. Submitted to ASTIN Bull.


Tobias, R. (1995). SAS QC Software, Volume I: Usage and Reference. SAS Institute, Cary, NC. 


Welch, W. J. (1997). Criteria and algorithms for space-filling designs. (In preparation)