National Institute of Statistical Sciences
19 T. W. Alexander Drive
P.O. Box 14006
Research Triangle Park, NC 27709-4006
Tel: 919.685.9300
FAX: 919.685.9310
admin@niss.org
| PIs | Alan F. Karr, NISS, and Jerome Sacks, NISS |
|---|---|
| Funded by | Grant Opportunities for Academic Liaison with Industry (GOALI), NSF |
| Funding | $800,582 over three years |
| Duration | May 15, 1997 - May 14, 2000 |
Large data sets are a given in modern industry, technology and science, and central issues concerning them demand statistical attention. While the need is apparent, paths to bringing statistical science to bear on large data sets are not clearly mapped. General approaches will emerge most rapidly if adequate focus is placed on specific industrial and scientific problems, in collaboration with those who hold the data. This research effort comprises two interconnected pilot projects dealing with large data sets, each involving a major industrial partner:
The project constitutes a response by the statistics community to the July 1995 workshop on Massive Data Sets held by the Committee on Applied and Theoretical Statistics (CATS) of the National Research Council (NRC). Furthering the purposes of CATS and the NRC, these pilot projects expose new problems and generate approaches whose exploration may be taken up by others.
The CATS workshop focused on the relevance of statistics to "data mining" and "knowledge discovery in databases," in contexts ranging from space science to retail merchandising. A principal finding was that central issues for large data sets - clustering, sampling, sequential design, heterogeneity of the data, the multiplicity of queries made of the data and decisions based upon the results, uncertainty of forecasts and predictions, design in data and query space and feasibility of new methods - are either innately statistical or approachable by means of statistical strategies.
The size of the problems we treat militates against strategies that bombard the problem with an array of methods (such as a slew of regression techniques, incorporating different model selection methods and cross-validations). Early, rapid assessment of a few good strategies is essential. At the same time, inadequate approaches must be recognized and re-oriented quickly, lest the whole exploration bog down. These thoughts run counter to statistical practice customarily applied to modest-sized data sets, and therefore pose novel obstacles to a penetrating analysis.
An experienced, NISS-and-partner-led research team, comprising senior researchers from the partner, universities and NISS, postdoctoral fellows and graduate students, will address each pilot project. Because of their proximity to the partners, researchers at Rutgers University will play important roles. The teams embody close collaboration among computer scientists, statisticians and subject-matter experts; interactions among the teams will drive the transfer of results across projects.