National Institute of Statistical Sciences


Digital Government Prototype Software

NASS System for Geographic Aggregation

 

This system, developed by NISS for the National Agricultural Statistics Service (NASS), disseminates survey data on usage of agricultural chemicals in far greater geographical detail than was previously available, while protecting the identities of farms in the survey.

The NASS Data. The database contains 194,410 records, from 30,500 farms, detailing use of 322 chemicals (fertilizers, fungicides, herbicides, pesticides) on 67 crops in the years 1996-1998. Record attributes are Farm ID, size in acres, crop, chemical, pounds of the chemical applied, state, county and year.

User queries are for application rates (pounds applied per acre) of certain chemicals on particular crops, ideally at the county level. Currently NASS releases application rates only at the state level. The system we describe produces more informative aggregations than state-level releases, but preserves confidentiality.
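
As a minimal sketch of what such a query computes, the following assumes that the rate for a geographical unit is the ratio of total pounds applied to total acres over the matching records. The record layout follows the attributes listed above, but the tuple ordering, the function name `county_rates` and the sample values are all invented for illustration; any survey weighting used in practice is ignored.

```python
from collections import defaultdict

# Invented sample records: (farm_id, acres, crop, chemical, pounds, state, county, year)
records = [
    ("F1", 100.0, "corn", "atrazine", 150.0, "NC", "Wake", 1997),
    ("F2", 300.0, "corn", "atrazine", 300.0, "NC", "Wake", 1997),
    ("F3", 200.0, "corn", "atrazine", 100.0, "NC", "Durham", 1997),
    ("F3", 200.0, "corn", "glyphosate", 80.0, "NC", "Durham", 1997),
]

def county_rates(records, crop, chemical, year):
    """Return {county: pounds applied per acre} for one crop, chemical and year."""
    pounds = defaultdict(float)
    acres = defaultdict(float)
    for _farm, farm_acres, r_crop, r_chem, r_pounds, _state, county, r_year in records:
        if (r_crop, r_chem, r_year) == (crop, chemical, year):
            pounds[county] += r_pounds
            acres[county] += farm_acres
    # Assumption: the rate of a unit is total pounds divided by total acres.
    return {c: pounds[c] / acres[c] for c in pounds if acres[c] > 0}

rates = county_rates(records, "corn", "atrazine", 1997)
```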

Aggregation for Disclosure-Risk Reduction. For the application rate in a geographical unit to be disclosable, NASS requires that two widely employed rules be satisfied. The N-rule requires that the unit contain at least N = 3 surveyed farms for the specified chemical, crop and year. The p-rule prohibits a dominant farm that comprises more than p = 60% of the total acreage of all farms surveyed in the unit.
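
The two rules can be sketched as a single check on the acreages of the surveyed farms in a unit. This is an illustrative reading of the rules as stated above, not the system's own code; the function name `is_disclosable` and its parameter names are hypothetical, and the input is assumed to be already restricted to one chemical, crop and year.

```python
def is_disclosable(farm_acres, n_min=3, p_max=0.60):
    """Check the N-rule and p-rule for one geographical unit.

    farm_acres: acreage of each surveyed farm in the unit (assumed to be
    already filtered to the specified chemical, crop and year).
    """
    # N-rule: the unit must contain at least n_min surveyed farms.
    if len(farm_acres) < n_min:
        return False
    total = sum(farm_acres)
    if total <= 0:
        return False
    # p-rule: no single farm may exceed a fraction p_max of the total acreage.
    return max(farm_acres) <= p_max * total

enough_and_balanced = is_disclosable([100, 120, 90])
too_few_farms = is_disclosable([100, 120])
dominant_farm = is_disclosable([700, 100, 100])
```

In this example only the first unit passes: the second fails the N-rule, and in the third one farm holds 700 of 900 acres, which exceeds 60%.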

At the county level these rules are too restrictive to permit release: more than 50% of counties are undisclosable. Our system aggregates undisclosable counties with neighboring counties (in the same state) to form disclosable "supercounties," allowing NASS to release data at the highest resolution consistent with the risk criteria.

Aggregations must be computed automatically in response to user queries. The computation can be formulated as an NP-hard combinatorial optimization problem over the edge set of the adjacency graph of the counties in a state. We have solved this problem using simulated annealing, but long running times make that approach infeasible in practice. Instead, we employ heuristic "greedy" algorithms, which produce aggregations differing only insignificantly from those produced by simulated annealing.

Two heuristic algorithms have been developed, which share a common structure: examine the undisclosable (super)counties in a random order and merge them with a neighboring (super)county until only disclosable (super)counties remain.

The algorithms differ only in the rule that governs merging. The pure rule favors leaving disclosable counties unmerged (preserving "purity" of their data), but can create large supercounties composed of many undisclosable counties. The small rule, by contrast, favors forming small supercounties by merging an undisclosable region with the neighboring region most likely to achieve disclosability.
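
The common structure and the two merging rules might be sketched as follows. This is not the C implementation used by the system: the function names, the data layout (a dictionary of farm acreages per county and a within-state adjacency dictionary) and, in particular, the concrete scoring inside `pure_rule` and `small_rule` are assumptions made to illustrate the descriptions above.

```python
import random

def is_disclosable(farm_acres, n_min=3, p_max=0.60):
    """N-rule and p-rule applied to the surveyed farm acreages in a region."""
    total = sum(farm_acres)
    return len(farm_acres) >= n_min and total > 0 and max(farm_acres) <= p_max * total

def pure_rule(region, neighbors, acres, rng):
    # Assumed reading of "pure": prefer undisclosable neighbors so that
    # disclosable regions stay unmerged whenever possible.
    undisclosable = [n for n in neighbors if not is_disclosable(acres[n])]
    return rng.choice(undisclosable or neighbors)

def small_rule(region, neighbors, acres, rng):
    # Assumed proxy for "most likely to achieve disclosability": prefer a
    # merge that is immediately disclosable, then the fewest counties.
    def score(n):
        return (not is_disclosable(acres[region] + acres[n]), len(region | n))
    best = min(score(n) for n in neighbors)
    return rng.choice([n for n in neighbors if score(n) == best])  # random tie-break

def greedy_aggregate(farms, adjacency, rule, seed=0):
    """Merge undisclosable regions with neighbors until none can be merged."""
    rng = random.Random(seed)
    region_of = {c: frozenset([c]) for c in farms}            # county -> region
    acres = {region_of[c]: list(a) for c, a in farms.items()}  # region -> acreages
    while True:
        bad = [r for r in acres if not is_disclosable(acres[r])]
        rng.shuffle(bad)  # examine undisclosable regions in random order
        for region in bad:
            neighbors = {region_of[n] for c in region for n in adjacency[c]} - {region}
            if neighbors:
                break
        else:
            return set(acres)  # no undisclosable region can be merged further
        other = rule(region, sorted(neighbors, key=sorted), acres, rng)
        merged = region | other
        acres[merged] = acres.pop(region) + acres.pop(other)
        for county in merged:
            region_of[county] = merged

# Invented example: four counties in a row, A - B - C - D.
farms = {"A": [100, 100, 100], "B": [50], "C": [60, 70], "D": [80, 90, 100]}
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
pure_result = greedy_aggregate(farms, adjacency, pure_rule)
small_result = greedy_aggregate(farms, adjacency, small_rule)
```

In this toy state the pure rule always joins the two undisclosable counties B and C and leaves A and D untouched, whereas the small rule may attach B or C to a disclosable neighbor, depending on the random choices.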

Both algorithms randomize the order in which candidate mergers are considered (and break ties randomly). Each can produce aggregations in which some supercounties can be decomposed into smaller disclosable units. To alleviate this, our implementation first runs the small algorithm, and then runs the pure algorithm within each supercounty produced by small. This composite procedure works fast and well.

NASS System Architecture and Operation. Figure 1 shows the system architecture. A major design decision is to store (in ASCII text files) the result of every query answered by the system, which can be retrieved if the query is ever received again. This avoids recomputation of aggregations, but the primary motivation is to prevent inconsistent (and even privacy-threatening) responses arising from the two randomized steps in the aggregation algorithms.

All queries are recorded in the Query History Database (QHDB). The QHDB allows us to monitor usage of the system, to understand user behavior, and to check if a query has been requested previously.

For previously answered queries, the stored results are passed to the output display module. Otherwise, CGI scripts (in Perl) extract the relevant raw data from the NASS database, and the adjacency structure for counties is retrieved from a GIS (we use ArcIMS). The survey and adjacency data are then input to the aggregation routine, which is a stand-alone program written in C. The output from the aggregation routine is then stored in a file, for both future use and access by the output display module. The directory and file naming convention we use makes it easy to identify and retrieve results corresponding to previous queries.
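
The store-and-reuse flow can be sketched as below. The actual system uses Perl CGI scripts and its own directory and file naming convention, neither of which is specified here; the path layout in `result_path`, the JSON serialization and the `fake_aggregation` stand-in are all assumptions chosen for illustration.

```python
import json
import random
import tempfile
from pathlib import Path

def result_path(root, state, years, crop, chemical):
    """Hypothetical naming convention: <root>/<state>/<years>/<crop>__<chemical>.txt"""
    year_key = "-".join(str(y) for y in sorted(years))
    return Path(root) / state / year_key / f"{crop}__{chemical}.txt"

def answer_query(root, query, compute):
    """Return the stored result for a query, computing and storing it only once."""
    path = result_path(root, **query)
    if path.exists():
        # Previously answered: reuse the stored result rather than re-running
        # the randomized aggregation, which could give a different answer.
        return json.loads(path.read_text())
    result = compute(query)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result

calls = []

def fake_aggregation(query):
    # Stand-in for the aggregation routine; the random draw mimics the
    # randomized steps that make repeated runs inconsistent.
    calls.append(query)
    return {"supercounties": [["A"], ["B", "C"]], "draw": random.random()}

with tempfile.TemporaryDirectory() as root:
    q = {"state": "NC", "years": [1997], "crop": "corn", "chemical": "atrazine"}
    first = answer_query(root, q, fake_aggregation)
    second = answer_query(root, q, fake_aggregation)
```

The second call reads the stored file, so both responses are identical even though the stand-in routine is randomized.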

Figures 2 and 3 show screenshots from the prototype NASS system. The user first selects (Figure 2) a state and year(s) of interest. JavaScript routines then dynamically generate drop-down menus of relevant crops and chemicals. The user next selects either a crop (in which case the chemical menu is regenerated to contain only chemicals applied to that crop) or a chemical (causing the crop menu to be regenerated). Finally, an output format is selected: map (the default), on-screen table or XML download (see Figure 4). The XML DTD mirrors the hierarchical nature of the aggregated data.
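
To illustrate how an XML document can mirror the hierarchy of counties within supercounties, the sketch below builds one with Python's standard library. The element and attribute names (`aggregation`, `supercounty`, `county`, `rate`) are invented for this example and are not taken from the system's actual DTD; the sample values are likewise made up.

```python
import xml.etree.ElementTree as ET

def to_xml(state, year, crop, chemical, supercounties):
    """Serialize an aggregation as nested XML: query, then supercounties, then counties."""
    root = ET.Element("aggregation", state=state, year=str(year),
                      crop=crop, chemical=chemical)
    for sc in supercounties:
        # One element per supercounty, carrying its application rate.
        sc_el = ET.SubElement(root, "supercounty", rate=f"{sc['rate']:.3f}")
        for county in sc["counties"]:
            ET.SubElement(sc_el, "county").text = county
    return ET.tostring(root, encoding="unicode")

# Invented sample aggregation.
xml_text = to_xml("NC", 1997, "corn", "atrazine", [
    {"rate": 1.125, "counties": ["Wake"]},
    {"rate": 0.5, "counties": ["Durham", "Orange"]},
])
```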

If not available from a previous query, the aggregation is computed on the fly. The result is stored in case the query is received again, and transaction information is written to the QHDB.

Map output is shown in Figure 3. Supercounties are colored according to the application rate of the chosen chemical on the chosen crop; the color bar also shows the state-wide average rate. Both supercounty boundaries and the boundaries of counties within supercounties are shown, in visually distinct styles. Multiple years appear on separate maps with a common color scale.

 

 

Figure 1: System Architecture

Figure 2: Input Screen

 

Figure 3: Output Screen

 

Figure 4: Excerpt of XML download

 

Additional Information

Analysis of Aggregated Data in Survey Sampling
with Application to Fertilizer/Pesticide Usage Surveys

J. Lee, C. Holloman, A. F. Karr and A. P. Sanil
Research in Official Statistics 4 (2001) 101-116


 

Web-Based Systems that Disseminate Information
from Data but Protect Confidentiality

Alan F. Karr, Jaeyong Lee, Ashish P. Sanil,
Joel Hernandez, Sousan Karimi and Karen Litwin
To appear in Advances in Digital Government
by A. Elmagarmid and W. McIver
Kluwer Academic Publishers


 

Disseminating Information but Protecting Confidentiality
Alan Karr, Jaeyong Lee, Ashish Sanil, Joel Hernandez, Sousan Karimi, Karen Litwin
IEEE Computer 34(2) (2001) 36-37.


 

 
