National
Institute of Statistical Sciences
19 T. W. Alexander
Drive
P.O. Box 14006
Research Triangle Park, NC 27709-4006
Tel: 919.685.9300
FAX: 919.685.9310
admin@niss.org
![]()
PROGRESS REPORT (DMS-9700867)
Pilot Projects to Explore Large
Data Sets
Alan F. Karr, Principal Investigator
Jerome Sacks, Project Director
June 16, 1998
Contents
1. Drug Discovery
Work on large data sets in drug discovery, carried out in close collaboration with corporate partner Glaxo Wellcome (GW), has been moved forward in several directions.
1.1. Recursive Partitioning
An efficient algorithm for classifying large numbers of chemical compounds into groups of molecules with similar potencies (determined by high throughput screening of the compounds in biological assays) is described and applied in [4]. The algorithm capitalizes on recursive partitioning of high dimensional vectors with 0/1 entries. Each vector is description of a chemical compound, based on pairs (or even triples) of atoms; the associated with a numerical response is measured potency.
The algorithm can handle hundreds of thousands of compounds and vectors of dimension in the millions. It can also be applied to deal with classes of compounds in the same data set that might be binding in different ways. Research is under way to extend it to utilize more complex, three-dimensional molecular descriptions that take conformation of the molecule into account.
The algorithm is utilized for sequential screening (Section 1.2).
Participants: S. Kazachki (undergraduate student, CalTech), C. Keefer (graduate student, North Carolina State), C. Lambert (graduate student, Duke), S. Paddock (graduate student, Duke), A. Rusinko (GW), M. West (Statistics and Decision Sciences, Duke), S. Young (GW).
1.2. Sequential Screening
The challenge to produce sequential design methods to reduce the costs inherent in synthesizing and assaying thousands of compounds is addressed in [1]. Here, an initial screening of a modest number of selected compounds (first-stage) is used to construct a structure-activity relationship (SAR). Based on this model, a second stage sample is then selected, the SAR can be updated and used for subsequent sampling stages and, upon stopping, to predict the activities of not-yet-tested compounds.
Using as testbed existing data on the potency and chemical structure of 70,223 compounds, a number of sequential schemes were explored. The evidence is that sequential methods can double the rate at which active compounds are identified while producing effective SAR rules that enable prediction of the potency of untested compounds.
One surprising conclusion of the study is that the selection of the initial sample stage may be unimportant: random selection or systematic methods based on chemical structures are equally effective. The size of the initial sample is relevant, as is the design at the second stage. Interestingly, it does not appear to be useful to go beyond two stages, both for practical reasons and because performance is not noticeably improved.
Participants: M. Abt (NISS postdoc), Sacks, M. Xie (NISS postdoc, now at Rutgers), Y. Lim (visitor to NISS), Young.
1.3. Multiple Trees
The algorithm in Section 1.1 currently relies on a single "greedy" classification tree. For many applications primary interest is focused on predicting those cases with an extreme response rather than uniformly accurate predictions for a full range of cases. Generating and averaging multiple trees offers improvement over greedy-tree models when the criterion is uniform accuracy. The structure rules of the aggregated predictor, however, are difficult to understand. Reference [5] introduces ways of growing and combining multiple trees when the criterion is accuracy for extreme values.
Two predictors based on multiple trees are especially tailored to this setting. The first increases accuracy over greedy-tree models while maintaining the interpretability of single tree descriptors. The second is often more accurate than aggregation, but still lacks interpretability. Used in high throughput screening of chemical compounds, these methods lead to as much as to 50% improvement over greedy trees.
Participants: C. Gu (Statistics, Purdue, visiting NISS), K. Tatsuoka (NISS postdoc), Sacks, Young.
1.4. Pooling; Group Testing
Assaying compounds in groups (pools) is a classical approach: individual compounds need not be tested unless the pool tests positive for potency. However, many contexts are complicated by the possible presence of blockers, which can lead to pools that test inactive but contain active compounds.
Several testing schemes to reduce the rates of false negatives were devised and analyzed in [6]. Algorithms for square array pooling schemes were developed to provide probability estimates for a compound to be active or a blocker. Methods to produce optimal choices of pool size were developed as well. The methodology copes with imperfect test procedures.
An alternate approach uses the molecular descriptions of the compounds as covariates for each member of a pool and applies recursive partitioning to analyze the pool data. The rules coming from the analysis can be used to identify active compounds in the pool without resorting to further physical assay.
Participants: Tatsuoka, Xie, Young.
1.5. Plans for Year 2
The research planned for Year 2 continues that from Year 1, but at the same time moves in important new directions.
Recursive Partitioning. Research to apply the algorithm of Section 1.1 to gene finding has been initiated. Given a large population of individuals, without pedigree information, and a large number of genetic markers for each individual, find the association of phenotype with the markers. The problems involve numbers of markers in the tens of thousands, but numbers of individuals are comparatively small (in the thousands, arising from clinical trials). Complications arise because the hidden genes can interact, and the disease phenotype may result from several different underlying mechanisms. A simulation model is under development to explore the potential of the recursive partitioning approach; actual data are not yet available in sufficient quantity.
Sequential Screening. The methods described in Section 1.2 will be explored in the context of "virtual libraries:" collections of combinatorially generated and compounds made using standard reactions and reagents. These libraries can be very large -- millions to hundreds of millions of compounds. A sequential scheme is critical in these contexts; it is plausible that more than two sampling stages may be necessary to treat such large spaces of compounds.
Multiple Trees. Forthcoming work will bring these methods to bear on sequential screening problems (Section 1.2).
Proposed research to refine techniques, developed by corporate partner AT&T Research (ATT), for detection of telephone fraud has been re-directed to detection of intrusion into networked computer systems. The reasons for this are (1) Rapid progress on fraud by ATT, leaving reduced opportunity for important contributions; (2) The timeliness, importance and difficulty of the intrusion detection problem; and (3) The possibility to transfer toll fraud detection technology to intrusion detection.
However, the analogy between the two problems is incomplete at best. By contrast with toll fraud, intrusion detection is complicated by multiple intruder motives, hard-to-quantify losses, massive data (gigabytes per day in a large organization) and elusiveness of precise formulations.
Work in Year 1 first focused, therefore, on problem formulations. A reading group was formed to survey existing literature, formulate problems and identify data needs. Participants, in addition to others mentioned below, included R. Becker, C. Mallows, D. Pregibon, J. Reeds and A. Wilks from ATT.
Because of accessibility and tractability of data, one central problem was formulated and investigated: Characterization of and differentiation among users of a computer system. Four approaches were explored: User-unique (on nearly unique) commands (Section 2.1); One-step command transitions (Section 2.2); Multiple-step command transitions (Section2.3); and Compression methods (Section 2.4). As a by-product, this work has yielded novel, quantitative insights into the usage of large computer systems.
The data -- 5,000,000 commands constituting several weeks' usage of a Unix system by approximately 50 researchers -- comprise only command names, not qualifiers or arguments. (Becker and Wilks assembled the data. The Unix acct facility was used, which does record commands generated by shell scripts or other commands.) For several methods, data are divided into historical data used to construct algorithms and test data used to evaluate them.
2.1. User-Unique Commands
Of the set of some 700 commands appearing in the data, more than one-half were used by only one user. A visualization of the data appears in Figure 1.

Figure 1. Visualization of Data for User-Unique Commands. The rectangles correspond to users; within each, the x-axis represent time. Starting from the bottom, the y-axis is divided into blocks representing commands used by 1, 2, ... users.}
Methods to detect signal in unique and rare commands were developed. They are based on uniqueness scores (for user i) of the form Score(i) = (1/Ni) Sumj Iij Sj, where j indexes commands, Ni is the number of commands used by i, Iij is one if i has used j before and -1 otherwise, and Sj is the number of users minus the number of users of j.
While this method discriminates effectively among users, as shown in Figure 2, it places too much weight on users who use rare commands frequently. Schemes incorporating weights can alleviate this problem.

Figure 2. ROC Curve for User-Unique Commands.
Participants: Karr, M. Schonlau (NISS postdoc, based at ATT), M. Theus (ATT postdoc).
2.2. One-Step Transitions
Building on previous research (The principal threads are attack signatures and anomaly detection based on user profiles derived from command frequencies.) statistical techniques have been developed based on command transition probabilities
Pu(j,k) = P(next command = k | current command = k, user u).
Estimates of these probabilities are constructed by smoothing empirical counts from the historical data.
Suppose that User 0 generates commands (in the test data) C0, ..., CT: could these have come from another user instead? The statistical hypothesis test used to answer such a question has as null hypothesis that the transition probabilities P generating these commands are those from user zero:
H0: P(j,k) = P0(j,k)
and as alternative that P deviates from P0 in the direction of one or more other users:
H1: P(j,k) = P(j,k,b) = P0(j,k) + Sumu = 1, ..., U bju[Pu(j,k) - P0(j,k)].
The key difficulty to be overcome was the high dimensionality of the alternative hypothesis. This was done using principal components (which in effect define classes of users); five components sufficed. A Fisher score statistic was used, based on weighted principal components. Once computations based on historical data are completed, the method is implementable in real time.
The resultant technique [2,3] is very effective in the user population under study, as shown in Figures 3 and 4.
Participants: W. DuMouchel (ATT), Schonlau.
Figure 3. Test Results for One-Step Command Transition Probabilities. In each column, the "true" user is shown by a plus sign and other users by dots. The true user should be, and in most cases is, at the bottom, with the lowest score.

Figure 4. ROC Curve for One-Step Transition Probabilities. Here, every block of 100 commands was treated as a separate experiment.
2.3. Multiple-Step Transitions
Complementing the work on one-step command transitions, models are being examined for m-step transitions, assuming that commands from a single user constitute an mth order Markov chain with transition probabilities P(jm+1|j1, ,jm) = Sumk = 1, m lk Q(jm+1|jk), where Q is a single unknown transition matrix and the lk are positive and add to one.
The computational approach alternatively maximizes the log likelihood with respect to Q and l. Initial results are promising, and the research is continuing.
Participants: W.-H. Ju (graduate student, Rutgers), Schonlau, Y. Vardi (Statistics, Rutgers).
2.4. Compression Methods
This approach is based on the concept that "dictionary-based" compression methods, such as Lempel-Ziv, can capture high-order structure that may be too complicated to identify explicitly. As for command transitions, the data used for user i were an historical (training) data set H(i) of and a test data set T(i), in this case, each of approximately 1000 commands.
The method calculates, for each pair of users,
D(i,j) = |compress(H(i) + T(j))| - |compress(H(i))|,
where "compress" is the Unix compress command and H(i) + T(j) is the concatenation of the two data sets in that order.
Effective performance of the algorithm means that for i different from j, D(j,j) < D(i,j); the error rate for the initial implementation of the algorithm is 9.6%.
Participants: Karr, Schonlau.
2.5. Plans for Year 2
The goals are to complete research on command data and move on to network data.
Command Data. Issues remaining to be addressed are:
This line of research will be complete by the end of the summer of 1998.
Network Data. Collaborating network security researchers from ATT will begin making network traffic data available in the Fall of 1998; "trial" data set is expected during the summer. Both for tractability and because of privacy concerns, initial attention will focus on IP packet headers, which contain source and destination machine addresses, session type (e.g., telnet, FTP, HTTP) and time stamp information. The scale of the data and (relative) uninformativeness of individual elements of the data are issues to be confronted at once. The approach will be to develop informative visualizations.
Presentations of this research were made at the Gordon Research Conference (1997), Monsanto (1997) and the 30th Symposium on the Interface of Computing Science and Statistics (1998).
[1] Abt, M., Lim, Y., Sacks, J., Xie, M., and Young, S. (1997). A sequential approach for screening large chemical databases. Technical Report, NISS.
[2] DuMouchel, W., and Schonlau, M. (1998a). A fast computer intrusion detection algorithm based on hypothesis testing of command transition probabilities. Proc. KDD 98. (to appear)
[3] DuMouchel, W., and Schonlau, M. (1998b). A comparison of test statistics for computer intrusion detection based on principal components regression of transition probabilities. Proc. 30th Symp. Interface of Computing Sci. and Statist. (to appear)
[4] Rusinko, A., Farmen, M. W., Lambert, C. G., Brown, P. L., and Young, S. (1998). Analysis of a large structure/biological activity data set using recursive partitioning. Submitted to J. Amer. Chem. Soc.
[5] Tatsuoka, K., Gu, C., and Young, S. (1998). Predicting extreme values in large data sets. Technical Report, NISS.
[6] Xie, M., Tatsuoka, K, and Young, S. (1998). Square array group screening of samples with blockers. (in preparation)
![]() |
![]() |
| Help |