Large data sets are a given in modern industry, technology and science, and the central issues concerning them demand statistical attention. While the need is apparent, the paths to bringing statistical science to bear on large data sets are not clearly mapped. General approaches will emerge most rapidly if adequate focus is placed on specific industrial and scientific problems, in collaboration with those who hold the data. This research effort comprised two interconnected pilot projects dealing with large data sets, each involving a major industrial partner:
- Drug Discovery: Glaxo Wellcome, Research Triangle Park, NC
- Network Computer Intrusion: AT&T Laboratories, Murray Hill, NJ
In each instance, there are specific scientific issues with high-stakes implications for the industry at large; the available data sets bear on current and prospective needs, lending immediacy to the research. At the same time, the pilot projects addressed long-run issues such as heterogeneity and multiplicity, whose relevance extends beyond the corporations and industries involved. Each pilot project was speculative, in the sense that the path from data to information to knowledge was not known in advance.
Workshop(s): Workshop on Massive Data Sets, Committee on Applied and Theoretical Statistics (CATS) of the National Research Council (NRC), July 1995.
Work with industry partners on large data sets.
Principal Investigator(s): Jerome Sacks, NISS; Alan Karr, NISS
Senior Investigator(s): Michael West, Duke; S. Stanley Young, Glaxo Wellcome; M. Clyde