Abstract
In the data-driven society of the 21st century, mining data to discover information about people is becoming increasingly valuable. The information can be used to learn more about society and humanity, or to build models that enable us to predict future events. Applications of data mining range from commercial endeavors to contributions to the common good through demographic and medical studies. Unfortunately, real-world considerations sometimes conflict with the goals of data mining; in particular, the privacy of the people whose data is being mined needs to be considered. This requires that the output of data mining algorithms be modified to protect sensitive information without ruining the informative or predictive power of the resulting model.
Over the years, many techniques have been developed to preserve privacy, but one stands out above the rest: differential privacy. Differential privacy is an enforceable definition of privacy that can be used in data mining algorithms, guaranteeing that nothing will be learned about the people in the data that could not already be discovered without their personal information.
In this thesis, we focus on one particular data mining algorithm, the decision tree, and on how differential privacy interacts with each of the components that constitute decision tree algorithms. We analyze the conflicts that arise when balancing privacy requirements with the utility of a model. We view "utility" as a two-sided coin: on one side there is prediction accuracy, and on the other there is knowledge discovery. Optimal results for both sides cannot be achieved at the same time, and the importance of each side depends on the user's needs. We explore the trade-offs that need to be made when prioritizing one side over the other.
Original language | English |
---|---|
Qualification | Doctor of Philosophy |
Supervisors/Advisors | |
Award date | 01 Jun 2017 |
Publisher | |
Publication status | Published - 2017 |