Team for Research in
Ubiquitous Secure Technology

Achieving Anonymity via Clustering
Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Samir Khuller, Rina Panigrahy, Dilys Thomas, An Zhu

Citation
Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Samir Khuller, Rina Panigrahy, Dilys Thomas, An Zhu. "Achieving Anonymity via Clustering". 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June, 2006.

Abstract
Publishing data for analysis from a table containing personal records, while maintaining individual privacy, is a problem of increasing importance today. The traditional approach of de-identifying records is to remove identifying fields such as social security number, name etc. However, recent research has shown that a large fraction of the US population can be identified using non-key attributes (called quasi-identifiers) such as date of birth, gender, and zip code~\cite{swe00}. Sweeney~\cite{swe02} proposed the $k$-anonymity model for privacy where non-key attributes that leak information are suppressed or generalized so that, for every record in the modified table, there are at least $k-1$ other records having exactly the same values for quasi-identifiers. We propose a new method for anonymizing data records, where quasi-identifiers of data records are first clustered and then cluster centers are published. To ensure privacy of the data records, we impose the constraint that each cluster must contain no fewer than a pre-specified number of data records. This technique is more general since we have a much larger choice for cluster centers than $k$-Anonymity. In many cases, it lets us release a lot more information without compromising privacy. We also provide constant-factor approximation algorithms to come up with such a clustering. This is the first set of algorithms for the anonymization problem where the performance is independent of the anonymity parameter $k$. We further observe that a few outlier points can significantly increase the cost of anonymization. Hence, we extend our algorithms to allow an $\epsilon$ fraction of points to remain unclustered, \ie, deleted from the anonymized publication. Thus, by not releasing a small fraction of the database records, we can ensure that the data published for analysis has less distortion and hence is more useful. Our approximation algorithms for new clustering objectives are of independent interest and could be applicable in other clustering scenarios as well.

Electronic downloads

Citation formats  
  • HTML
    Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Samir
    Khuller, Rina Panigrahy, Dilys Thomas, An Zhu. <a
    href="http://www.truststc.org/pubs/99.html"
    >Achieving Anonymity via Clustering</a>, 25th ACM
    SIGMOD-SIGACT-SIGART Symposium on Principles of Database
    Systems, June, 2006.
  • Plain text
    Gagan Aggarwal, Tomas Feder, Krishnaram Kenthapadi, Samir
    Khuller, Rina Panigrahy, Dilys Thomas, An Zhu.
    "Achieving Anonymity via Clustering". 25th ACM
    SIGMOD-SIGACT-SIGART Symposium on Principles of Database
    Systems, June, 2006.
  • BibTeX
    @inproceedings{AggarwalFederKenthapadiKhullerPanigrahyThomasZhu06_AchievingAnonymityViaClustering,
        author = {Gagan Aggarwal, Tomas Feder, Krishnaram
                  Kenthapadi, Samir Khuller, Rina Panigrahy, Dilys
                  Thomas, An Zhu},
        title = {Achieving Anonymity via Clustering},
        booktitle = {25th ACM SIGMOD-SIGACT-SIGART Symposium on
                  Principles of Database Systems},
        month = {June},
        year = {2006},
        abstract = {Publishing data for analysis from a table
                  containing personal records, while maintaining
                  individual privacy, is a problem of increasing
                  importance today. The traditional approach of
                  de-identifying records is to remove identifying
                  fields such as social security number, name etc.
                  However, recent research has shown that a large
                  fraction of the US population can be identified
                  using non-key attributes (called
                  quasi-identifiers) such as date of birth, gender,
                  and zip code~\cite{swe00}. Sweeney~\cite{swe02}
                  proposed the $k$-anonymity model for privacy where
                  non-key attributes that leak information are
                  suppressed or generalized so that, for every
                  record in the modified table, there are at least
                  $k-1$ other records having exactly the same values
                  for quasi-identifiers. We propose a new method for
                  anonymizing data records, where quasi-identifiers
                  of data records are first clustered and then
                  cluster centers are published. To ensure privacy
                  of the data records, we impose the constraint that
                  each cluster must contain no fewer than a
                  pre-specified number of data records. This
                  technique is more general since we have a much
                  larger choice for cluster centers than
                  $k$-Anonymity. In many cases, it lets us release a
                  lot more information without compromising privacy.
                  We also provide constant-factor approximation
                  algorithms to come up with such a clustering. This
                  is the first set of algorithms for the
                  anonymization problem where the performance is
                  independent of the anonymity parameter $k$. We
                  further observe that a few outlier points can
                  significantly increase the cost of anonymization.
                  Hence, we extend our algorithms to allow an
                  $\epsilon$ fraction of points to remain
                  unclustered, \ie, deleted from the anonymized
                  publication. Thus, by not releasing a small
                  fraction of the database records, we can ensure
                  that the data published for analysis has less
                  distortion and hence is more useful. Our
                  approximation algorithms for new clustering
                  objectives are of independent interest and could
                  be applicable in other clustering scenarios as
                  well. },
        URL = {http://www.truststc.org/pubs/99.html}
    }
    

Posted by Dilys Thomas on 30 May 2006.
For additional information, see the Publications FAQ or contact webmaster at www truststc org.

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.