Abstract:
Classification methods have trouble with missing data. Even CART, which was designed to deal with missing data, performs poorly when run with over 90% of the predictors unobserved. We use the Apriori algorithm to fit decision trees by converting the continuous predictors to categorical variables, which bypasses the missing data problem by treating missing data as absent items. We demonstrate our methodology in a setting that simulates a distributed, low-overhead quality assurance system, where we have control over which predictors are missing for each observation. We also demonstrate how performance can be improved by introducing a simple adaptive sampling method.
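As a rough illustration of the encoding idea in the abstract, the sketch below bins continuous predictors into categorical items and simply emits no item for a missing value, so each observation becomes a transaction of whatever was observed. Everything in it is an assumption made for illustration and is not taken from the report: the names `to_items`, `itemset_counts`, and `rule_confidence`, the cut points, and the toy data are all hypothetical. The counting step is brute force over small itemsets; Apriori proper would obtain the same counts while pruning infrequent candidates.

```python
import math
from collections import Counter
from itertools import combinations


def to_items(row, cut_points):
    """Encode one observation as a set of items, skipping missing predictors."""
    items = set()
    for name, value in row.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # missing predictor: the item is simply absent
        # Bin index = number of cut points strictly below the value.
        level = sum(value > c for c in cut_points[name])
        items.add(f"{name}=bin{level}")
    return frozenset(items)


def itemset_counts(transactions, max_size=2):
    """Count every itemset up to max_size (brute force, no Apriori pruning)."""
    counts = Counter()
    for t in transactions:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(t), k):
                counts[frozenset(combo)] += 1
    return counts


def rule_confidence(counts, antecedent, label_item):
    """Confidence of the rule antecedent -> label_item."""
    body = frozenset(antecedent)
    return counts[body | {label_item}] / counts[body]


# Hypothetical toy data: two predictors, most values unobserved.
cut_points = {"x1": [0.5], "x2": [2.0]}
rows = [
    ({"x1": 0.2, "x2": None}, "ok"),
    ({"x1": 0.9, "x2": 3.0}, "bad"),
    ({"x1": None, "x2": 3.5}, "bad"),
    ({"x1": 0.8, "x2": float("nan")}, "bad"),
    ({"x1": 0.1, "x2": 1.0}, "ok"),
]

# The class label is added as one more item in each transaction.
transactions = [to_items(x, cut_points) | {f"class={y}"} for x, y in rows]
counts = itemset_counts(transactions)
conf = rule_confidence(counts, {"x1=bin1"}, "class=bad")
```

Because absent items carry no information rather than a special "missing" category, observations with very different sets of observed predictors can still contribute to the same rule counts.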
Publication Date:
Wednesday, November 1, 2006
File Attachment:

Report Number:
164