Abstract
In this paper, we explore how anonymizing data to preserve privacy affects the utility of the classification rules discoverable in the data. For an analysis of anonymized data to yield useful results, the data should retain as much of the information contained in the original data as possible. Therein lies a problem: how does one make sure that anonymized data still contains the information it had before anonymization? This question is not the same as asking whether an accurate classifier can be built from the anonymized data. Often in the literature, the prediction accuracy of a classifier built from anonymized data is used as evidence that the data are similar to the original. We demonstrate that this is not the case, and we propose a new methodology for measuring how well the rules that existed in the original data are retained. We then use our methodology to design three easily implemented measures, each capturing aspects of the data that no pre-existing technique can measure. These measures do not negate the usefulness of prediction accuracy or other measures; they are complementary to them, and they support our argument that one measure is almost never enough.
| Original language | English |
| --- | --- |
| Pages (from-to) | 175-201 |
| Number of pages | 27 |
| Journal | Transactions on Data Privacy |
| Volume | 10 |
| Issue number | 3 |
| Publication status | Published - Dec 2017 |