19 T.W. Alexander Drive P.O. Box 14006 Research Triangle Park, NC 27709-4006
919.685.9300  FAX 919.685.9310  admin@niss.org


Current NISS Research Projects

Digital Government II
Data Confidentiality, Data Quality and Data Integration for Federal Databases:
Foundations to Software Prototypes


Proposal Summary:
This is a proposal for a large-scale, cross-disciplinary, high-impact research program to create abstractions, theory, implementable methodology and software prototypes to meet three central, interacting, data-driven challenges facing Federal statistical agencies—data confidentiality (DC), data quality (DQ) and data integration (DI).

Problems unique to the Federal government, especially the necessity, in an electronic world, to balance privacy and confidentiality against user access to high-quality statistical data, define the research. The project will create effective, credible ways to ensure DC in the face of strong, even competing, concern about DQ and the growing need and capability for DI.

The project addresses fundamental research questions in multiple disciplines: computer science, to formulate abstractions and design algorithms that accommodate interactions among DC, DQ and DI; the statistical sciences, to provide decision-theoretic formulations that account for both the risk and the utility of disseminating information, and of the consequences of DC, DQ and DI for inference; and software and systems engineering, to build prototype systems that operate at realistic scales, in order to evaluate and refine new theory and methodology. Complementing these are domain knowledge, to link uses of information to requirements for DC and DQ; and visualization, to support understanding of abstractions, algorithms, and system operation.

Scalability pervades the research: many techniques for ensuring DC, improving DQ and performing DI are untried, and must be evaluated, in the context of the size and dimension of the databases, the diversity of user needs and the complexity of analyses that we will address. One central challenge is to build systems that correctly implement solutions to technical problems and scale to the large, complex databases maintained by Federal statistical agencies.

The project will be carried out by statistical and computer scientists from the National Institute of Statistical Sciences, Carnegie Mellon University, the University of Maryland College Park, the Institute for Social Research at the University of Michigan, Purdue University, Southern Methodist University and the Los Alamos National Laboratory.

As partners in the project, five leading Federal statistical agencies (the Bureau of Labor Statistics, the Bureau of Transportation Statistics, the Census Bureau, the National Agricultural Statistics Service and the National Center for Education Statistics) will ensure that the research is relevant, timely and applicable. The partners will provide essential access to data and participation of personnel in development and evaluation of methods and software systems.


DG II Project Web Site


 

Software Engineering
Collaborative Research: Acquiring Accurate Dynamic Field Data
Using Lightweight Instrumentation

Proposal Summary: Dynamic analyses, such as profiling and testing, are a key part of state-of-the-art analysis and validation approaches to software quality assurance. The effectiveness of dynamic analyses depends, in part, on the degree to which the sample input data that are exercised reflect the way programs are actually used in the field. Acquiring accurate data is hard, expensive, and rarely done.

Software developers use a variety of program analysis and testing techniques to increase their confidence that a program has particular properties, such as functional correctness, liveness, and scalability. State-of-the-art analysis is powerful and effective, but limited in two ways. First, resources (particularly time, money, and machine cycles) are finite, albeit extensive. Second, it is challenging for developers and testers to fully predict the actual environment and the usage patterns for programs in the field, but the effectiveness of their analysis activities often depends on the accuracy of that prediction.

In this proposal, we address these two limitations with an approach that augments conventional analysis techniques by instrumenting fielded instances of programs, e.g., during beta test. Our approach increases confidence in particular program properties by combining information acquired from the collection of fielded instances with earlier local analysis and testing information. Each fielded instance will be instrumented in a lightweight way, perturbing it only slightly, and different instances will in general be instrumented to gather different information. Information gathered from the distributed instances will allow us to perform more effective analyses when combined with information computed during the earlier conventional analysis and testing phase.
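
As a rough illustration of the idea above, the following Python sketch gives each fielded instance a small, different subset of probes and then merges the reports into one picture. The probe names, the round-robin assignment and both function names are assumptions made for this example; the summary does not describe the project's actual instrumentation scheme.

```python
from collections import Counter

def assign_probes(probes, n_instances, per_instance):
    """Hypothetical round-robin scheme: instance i receives per_instance
    consecutive probes, so each instance carries only a light load."""
    assignment = []
    for i in range(n_instances):
        start = (i * per_instance) % len(probes)
        subset = [probes[(start + k) % len(probes)] for k in range(per_instance)]
        assignment.append(subset)
    return assignment

def merge_reports(reports):
    """Sum the per-probe hit counts reported by the individual instances."""
    total = Counter()
    for report in reports:
        total.update(report)
    return total

probes = ["p0", "p1", "p2", "p3", "p4", "p5"]
assignment = assign_probes(probes, n_instances=3, per_instance=2)

# Simulated field data: each instance reports one hit for each of its own probes.
reports = [{p: 1 for p in subset} for subset in assignment]
merged = merge_reports(reports)
```

No single instance observes more than two probes here, yet the merged report covers all six, which is the trade-off the paragraph describes.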

Preliminary experimental results show our approach (1) uses resources more effectively by parallelizing key aspects of the analysis process and (2) builds better predictive models that capture actual environments and usage patterns by getting early feedback from executions of the program from users as part of the analysis process. As a result, we are confident that distributed dynamic analyses using lightweight instrumentation will succeed for the following reasons:

To demonstrate the feasibility of our techniques and tools in practice, we will focus on two large-scale, production-quality, performance-intensive infrastructure software projects: ACE and TAO, which are widely used, open-source middleware. We will use ACE+TAO to demonstrate empirically that our technologies and processes can enable real-world developers and users to tailor their QA tools, techniques and processes to improve such areas as fault detection, performance evaluation, memory footprint minimization, and power reduction. We have chosen to focus on the ACE+TAO projects because (1) we control their development process and source code, (2) they are production-quality software that embodies many characteristics of performance-intensive infrastructure software, and (3) they exemplify key trends in software R&D.

 

Web Data
Bayesian Models Linking Web Site Structure

Proposal Summary: This is a proposal to create a set of four increasingly complex, but scalable, Bayesian models that relate the usage (specifically, user page transitions) of a Web site to its structure, and to apply, validate and refine the models using real data from four qualitatively different Web sites—an E-commerce site, a site operated by a large financial institution, a content site and an information site.

The Bayesian models share the one essential characteristic that makes them scalable: the destinations from a given page (whether it is static or generated dynamically) are classes of pages that mirror the tree structure of the site, rather than individual pages. Examples are the parent, children and siblings of a page. Scalability results from replacing the full [page X page] transition matrix by the much “narrower” [page X destination classes] matrix. All four models assume Dirichlet prior distributions for transitions from each page. The first three employ very aggregated classes of transitions, and differ according to whether the transition distributions and the priors are the same for all pages. The fourth model disaggregates the “child” and “sibling” destinations. Calculation of posterior distributions varies in difficulty: some are available in closed form, while others require intensive MCMC computation.
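
For the closed-form case, a minimal Python sketch may help: with a Dirichlet prior on one page's transition probabilities and observed transition counts per destination class, the posterior is again Dirichlet, with each parameter increased by the corresponding count. The class labels, the uniform prior and the counts below are illustrative assumptions, not values from the proposal.

```python
def dirichlet_posterior_mean(prior, counts):
    """Posterior mean of one page's transition probabilities.

    prior  -- Dirichlet parameters, one per destination class
    counts -- observed transitions per destination class
    The posterior parameters are prior + counts (conjugate update).
    """
    posterior = {c: prior[c] + counts.get(c, 0) for c in prior}
    total = sum(posterior.values())
    return {c: posterior[c] / total for c in posterior}

# Assumed destination classes and a uniform Dirichlet(1, 1, 1, 1) prior.
prior = {"parent": 1.0, "child": 1.0, "sibling": 1.0, "other": 1.0}
counts = {"parent": 2, "child": 5, "sibling": 1}

mean = dirichlet_posterior_mean(prior, counts)
```

Because the update touches one short vector per page, the cost grows with the number of destination classes rather than with the number of pages, which is the scalability argument made above.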

Applications include relating user behavior to site structure (for example, pages with frequent transitions to destinations other than the parent, siblings, children and special destinations such as the home page are dissonant with respect to the Web site structure); comparison of site usage at different times, or for different classes of users; segmentation of sessions; quantification of inter-relationships among pages (which also may not respect the site structure); simulation (for example, to evaluate hardware or application server capabilities); and prediction of user behavior, including forecasts, for example, of the economic impact of promotional campaigns. The ultimate impact is more efficient Web sites that serve users more effectively.
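
The first application can be sketched very simply: flag a page as dissonant when a large share of its observed transitions fall outside the structural destination classes. The function name, the "other" label, the page paths and the 0.5 threshold are all assumptions for illustration; the proposal itself would address this through the Bayesian models rather than a raw share.

```python
def dissonant_pages(transitions, threshold=0.5):
    """Return pages whose share of transitions to the 'other' class
    exceeds the (assumed) threshold."""
    flagged = []
    for page, counts in transitions.items():
        total = sum(counts.values())
        if total == 0:
            continue  # no observed transitions, nothing to judge
        if counts.get("other", 0) / total > threshold:
            flagged.append(page)
    return sorted(flagged)

# Hypothetical transition counts per page and destination class.
transitions = {
    "/products": {"parent": 3, "child": 10, "sibling": 2, "other": 1},
    "/faq": {"parent": 1, "child": 0, "sibling": 1, "other": 8},
}
flagged = dissonant_pages(transitions)
```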

The Bayesian framework allows these applications to be addressed quantitatively, using formal hypothesis tests and predictive distributions. In addition, rigorous model assessment will provide insight into what level of aggregation is appropriate to which analyses of Web data.

 

Social Networks
Dynamics for Social Network Processes:
Comparing Statistical Models with Intelligent Agents

Proposal Summary: The goal of this research is to reconcile two methods for modeling change in social networks over time—p* models and intelligent agent models. The latter family has received much attention from social scientists but little from mathematicians and almost none from statisticians, and so constitutes a promising and important opportunity for collaboration.

We will contrast the properties of these two approaches, exploring in particular what kinds of qualitative behavior in social networks are captured usefully and interpretably by each. The primary tools will be latent variable representations and dynamical systems analysis. We seek not to declare that one class of models is “better” than the other, but to construct a framework that yields insight into their strengths and limitations in multiple settings. The impact on the social sciences will be dramatic: researchers will be able to choose in a principled manner a model whose dynamic properties are most appropriate to each application.

To evaluate and refine our framework, and to assess its scalability, we will use as a testbed a very complex simulation model that describes the formation of terrorist networks.

Specific components of the research include time-evolving networks; comparison of the evolution of multiple models, which presents challenging “calibration” problems because the models' fundamental formulations do not map readily onto one another; a unified dynamics for social networks and intelligent agents; sensitivity analyses; selection of comparison and validation criteria; and visualization.

The project will have two phases, each lasting one year. The first will be construction of the framework, and the second its implementation and evaluation on the testbed, with results fed back to improve the framework.

The project team spans the social, mathematical, statistical and computer sciences, and is drawn from NISS, the lead institution on the proposal, Carnegie Mellon University, Duke University, North Carolina State University and the University of North Carolina at Chapel Hill. The participants are experienced, influential researchers, each with an established record of cross-disciplinary collaboration. Many subsets of the team have worked with one another previously.

 

Chemical Informatics
Web-Enabled Virtual Screening

Proposal Summary: The long-term objective of this project is to develop computational algorithms and software to gain theoretical and empirical insights into the use of chemical diversity for determining quantitative structure-activity relationships (QSARs). In addition to addressing scientific and technical goals with respect to QSAR modeling, planning-period tasks will include specific activities to bring together the researchers and to facilitate inter-disciplinary communication. Specific Aim 1 is to develop and enhance collaborations among three broad disciplines: statistics, computer science, and chemistry. This will be accomplished primarily through several intense workshops per year and regularly scheduled status meetings. Specific Aim 2 is to initiate a benchmarking study to compare structural descriptors, modeling strategies, and methods of model assessment. Through a web server, results will be posted from analyzing several datasets using many QSAR modeling techniques, a variety of molecular descriptors, and a number of assessment criteria. Specific Aim 3 is to design and beta-test web-accessibility of modeling software. PowerMV, a cheminformatics software tool created at the National Institute of Statistical Sciences, will be upgraded and made web-accessible. Specific Aim 4 is to develop a broad view of cheminformatics tools based on the singular value decomposition and other similar decompositions, where computations take advantage of the high degree of sparseness often exhibited by high-throughput screening (HTS) data sets. The significance of these specific aims, in support of the long-term objective, is to reduce resource requirements for, and thus streamline, the process of drug discovery.

 

Education Statistics

Education Statistics Services Institute-Statistics


NAEP Education Statistics Services Institute


 




Entire site © 2000-2003, National Institute of Statistical Sciences. All Rights Reserved.
This page updated on May 1, 2006 3:23 PM