National Institute of Statistical Sciences
19 T. W. Alexander Drive
P.O. Box 14006
Research Triangle Park, NC 27709-4006
Tel: 919.685.9300
FAX: 919.685.9310
admin@niss.org

Pilot Projects to Explore Large Data Sets

Project Information


PIs: Alan F. Karr, NISS, and Jerome Sacks, NISS
Funded by: Grant Opportunities for Academic Liaison with Industry (GOALI), NSF
Funding: $800,582 over three years
Duration: May 15, 1997 - May 14, 2000

Project Description

Large data sets are a given in modern industry, technology and science, and central issues concerning them demand statistical attention. While the need is apparent, paths to bringing statistical science to bear on large data sets are not clearly mapped. General approaches will emerge most rapidly if adequate focus is placed on specific industrial and scientific problems, in collaboration with those who hold the data. This research effort comprises two interconnected pilot projects dealing with large data sets, each involving a major industrial partner:

Drug Discovery: Glaxo-Wellcome, Research Triangle Park, NC
Network Computer Intrusion: AT&T Laboratories, Murray Hill, NJ

In each instance, there are specific scientific issues with high-stakes implications for the industry at large; the available data sets are for current and prospective needs, lending immediacy to the research. At the same time, the pilot projects address long-run issues (such as heterogeneity and multiplicity) whose relevance extends beyond the corporations and industries they involve. Each pilot project is speculative, in the sense that the path from data to information to knowledge is not known in advance. The substantial scientific overlap among the projects will lead to effective transfer of methods from one project to the other. Finally, there are major commitments of resources by the industrial partners, without which these projects could not be undertaken.

The project constitutes a response by the statistics community to the July 1995 workshop on Massive Data Sets held by the Committee on Applied and Theoretical Statistics (CATS) of the National Research Council (NRC). Furthering the purposes of CATS and the NRC, these pilot projects expose new problems and generate approaches whose exploration may be taken up by others.

The CATS workshop focused on the relevance of statistics to "data mining" and "knowledge discovery in databases," in contexts ranging from space science to retail merchandising. A principal finding was that central issues for large data sets - clustering, sampling, sequential design, heterogeneity of the data, the multiplicity of queries made of the data and decisions based upon the results, uncertainty of forecasts and predictions, design in data and query space, and feasibility of new methods - are either innately statistical or approachable by means of statistical strategies.

The size of the problems we treat militates against strategies that bombard the problem with an array of methods (such as a slew of regression techniques, incorporating different model selection methods and cross-validations). Early, rapid assessment of a few good strategies is essential. At the same time, inadequate approaches must be recognized and re-oriented quickly, lest the whole exploration bog down. These thoughts run counter to statistical practice customarily applied to modest-sized data sets, and therefore pose novel obstacles to a penetrating analysis.

An experienced research team, led jointly by NISS and the industrial partner and comprising senior researchers from the partner, universities and NISS, postdoctoral fellows and graduate students, will address each pilot project. Because of proximity to the partners, researchers at Rutgers University will play important roles. The teams embody close collaboration among computer scientists, statisticians and subject-matter experts; interactions among the teams will drive the transfer of results across projects.

 
