National Institute of Statistical Sciences


Digital Government Prototype Software
Table Servers

 

Table servers being developed by NISS disseminate tabular summaries of statistical data in response to user queries for marginal sub-tables of a large (e.g., 40 dimensions with 4 categories each) contingency table containing counts or sums. Table servers evaluate disclosure risk dynamically, in light of previously answered queries.

The central challenges to building such a system are development of the abstractions that underlie it and of scalable algorithms that implement the abstractions.

Abstractions. The query space Q, which contains all 2K sub-tables of a K-way table, is partially ordered by set inclusion of variables in subtables. At any time t, the set R(t) of all tables released through time t, contains both direct releases in response to queries and indirect releases (previously unreleased children of direct releases). This set is specified completely by the released frontier R(t), which consists of the maximal elements of R(t). In Figures 1 and 2, the released frontier lies in the lower left portion of the query space.

Underlying release decisions is a risk criterion defined on subsets of the query: at all times the system must maintain the risk below a threshold a set by the operators. Reflecting historical usage, a typical risk criterion is accuracy of bounds based on for sensitive (small count) cells in the full table. Bounds can be computed using network methods and the "shuttle algorithm." There are also exact techniques for special cases: if the released marginals constitute the minimal sufficient statistics of a decomposable graphical model, then bounds can be expressed as explicit functions of these marginals.

Whenever an answered query releases previously unreleased information, other queries become unanswerable. Consequently, at t there are also an unreleasable set U(t) of subtables whose release would be too risky, with an unreleasable frontier UF(t) of its minimal elements. In Figure 1, the unreleasable set lies to the upper right; the unreleasable frontier is its "lower boundary."

Release rules determine which requests for unreleased tables will be fulfilled. The simplest is the myopic rule of releasing any requested table that is not too risky: table T will be released if requested as long as doing so does not violate the risk threshold, but taking into account all previous releases.

The myopic rule, however, allows the table server to take very large steps, by releasing tables containing many more variables than those that have been released. To prevent this, one can allow only tables adding but one variable to a previously released table to be eligible for release (see Figure 1). More subtle issues arise, though. Typically, it is necessary to prevent a single user (or a set of colluding users) from driving the table server into a region of the query space that suits their needs but not necesarily those of other users. A response is to use release rules that are biased against releases that add large numbers of other tables to U(t). Another alternative is rules that incorporate a suitably defined value of releases, such as the accuracy of reconstruction of the full table by iterative proportional fitting.

System Design and Prototypes. A prototype table server, written as a Java application, is shown in Figures 1 and 2. It operates on an 8-dimensional table of data from the Current Population Survey. This prototype is valuable for its engaging but non-scalable visualization of the query space.

Figure 3 shows the architecture of a more powerful table server written using the Java 2 Enterprise Edition (J2EE) platform, with HTTP processing performed by Java Servlets. This prototype uses a 14-dimensional table derived from another Current Population Survey. This table has approximately 435,000,000 cells, but is extremely sparse.

Figure 4 shows the user input screen. If the requested table lies on or below the released frontier, it is provided immediately, via a screen display if the table is "small," and otherwise via downloaded XML. Releases are governed by the myopic and "at most one step away from the released frontier" rules. Disclosure risk is evaluated in real time. The query history database, with tables for users, queries and the time trajectories of RF(t) and UF(t), is maintained in a MySQL database server. A frontier display facility (Figure 5) monitors evolution of the system.

The risk computation procedure employs data structures based on hash tables for storing tables, and algorithms that exploit sparsity. The risk criterion is accuracy of cell bounds computed via a generalized shuttle algorithm.

 

 

Figure 1: Java Prototype, showing Unreleased Frontier

 

Figure 2: Java prototype, showing effect of releasing one 5-way table

 

Figure 3: J2EE prototype system architecture

 

Figure 4: J2EE prototype input screen

 

Figure 5: J2EE prototype frontier display

 

Additional Information

Web Systems That Disseminate Information
But Protect Confidential Data
Alan F. Karr and Ashish P. Sanil
To appear in Proc. Internat. Assoc. Survey Statist.

View/Download PDF (1014 KB)

 

 

Navigation: www.niss.org > Digital Government > Software > Table Servers