Streaming Document Filtering using Distributed Non-Parametric Representations

Citation
"Streaming Document Filtering using Distributed Non-Parametric Representations". The Twenty-Third Text REtrieval Conference (TREC 2014) Proceedings, NIST, December 2014; Pre-publication.

Abstract
When dealing with large, streaming corpora of text documents, practitioners are often interested in identifying references to entities of interest, and studying their prominence and topics over time. Current tools are quite restrictive for this setting; they are unable to handle streaming data, do not partition the references according to topics, and do not identify the vitality of the references. In this work, we propose a system that uses a flexible representation of entity contexts that are updated in a streaming fashion. Each entity context is represented by topic clusters that are estimated in a non-parametric manner by assuming that the context of each entity in a single document belongs to a single topic. To address lexical sparsity and generalize to unseen documents, each document is represented by its mean word embedding, while each topic cluster is represented by the mean embedding vector of the documents in the cluster. Further, we associate a staleness measure with each topic cluster, dynamically estimating the relevance of each entity based on document frequencies. We update the topic identities, number of topics, and the staleness of topics in an online fashion, observing only a single document at a time. This combination of non-parametric clustering, staleness, and distributed word embeddings provides an efficient yet accurate representation of entity contexts that can be updated in a streaming manner.
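
The abstract does not give the assignment rule or the staleness formula, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes a cosine-similarity threshold (`new_topic_threshold`, a hypothetical parameter) decides whether a document joins an existing topic cluster or opens a new one, and it assumes staleness is simply the number of documents seen for the entity since the cluster was last updated. The names `mean_embedding` and `EntityContext` and the toy word vectors are illustrative only.

```python
import numpy as np


def mean_embedding(tokens, word_vectors):
    """Average the vectors of the tokens that have an embedding."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None  # no known words, so no representation
    return np.mean(vecs, axis=0)


class EntityContext:
    """Topic clusters for one entity, updated one document at a time."""

    def __init__(self, new_topic_threshold=0.7):
        # Assumption: a cosine cutoff decides when to open a new topic.
        self.new_topic_threshold = new_topic_threshold
        self.centroids = []  # mean document embedding per topic
        self.counts = []     # number of documents assigned to each topic
        self.last_seen = []  # index of the last document assigned to each topic
        self.n_docs = 0

    def update(self, doc_vec):
        """Assign doc_vec to the most similar topic, or create a new topic."""
        step = self.n_docs
        self.n_docs += 1
        best, best_sim = None, -1.0
        for k, c in enumerate(self.centroids):
            sim = float(doc_vec @ c / (np.linalg.norm(doc_vec) * np.linalg.norm(c)))
            if sim > best_sim:
                best, best_sim = k, sim
        if best is None or best_sim < self.new_topic_threshold:
            # The number of topics grows with the data (non-parametric).
            self.centroids.append(np.array(doc_vec, dtype=float))
            self.counts.append(1)
            self.last_seen.append(step)
            return len(self.centroids) - 1
        # Incremental mean: the centroid stays the average of its documents.
        self.counts[best] += 1
        self.centroids[best] += (doc_vec - self.centroids[best]) / self.counts[best]
        self.last_seen[best] = step
        return best

    def staleness(self, k):
        """Assumed measure: documents seen since topic k was last updated."""
        return self.n_docs - 1 - self.last_seen[k]


# Toy two-dimensional word vectors, invented for illustration.
word_vectors = {
    "goal": np.array([1.0, 0.0]),
    "match": np.array([0.9, 0.1]),
    "vote": np.array([0.0, 1.0]),
    "senate": np.array([0.1, 0.9]),
}
ctx = EntityContext()
docs = [["goal", "match"], ["vote", "senate"], ["match", "goal", "unknown"]]
assignments = [ctx.update(mean_embedding(d, word_vectors)) for d in docs]
print(assignments, [ctx.staleness(k) for k in range(len(ctx.centroids))])
```

Each call to `update` touches only the current document and the stored centroids, so no earlier documents need to be kept in memory. A decayed document-frequency score could replace the simple counter used for staleness here; the abstract leaves that choice open.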

Citation formats  
  • HTML
     <a
    href="http://www.terraswarm.org/pubs/435.html"
    ><i>Streaming Document Filtering using Distributed
    Non-Parametric Representations</i></a>, The
    Twenty-Third Text REtrieval Conference (TREC 2014)
    Proceedings, NIST, December 2014; Pre-publication.
  • Plain text
     "Streaming Document Filtering using Distributed
    Non-Parametric Representations". The Twenty-Third Text
    REtrieval Conference (TREC 2014) Proceedings, NIST,
    December 2014; Pre-publication.
  • BibTeX
    @inproceedings{Proceedings14_StreamingDocumentFilteringUsingDistributedNonParametric,
        title = {Streaming Document Filtering using Distributed
                  Non-Parametric Representations},
        booktitle = {The Twenty-Third Text REtrieval Conference (TREC
                  2014) Proceedings},
        organization = {NIST},
        month = {December},
        year = {2014},
        note = {Pre-publication},
        abstract = {When dealing with large, streaming corpora of text
                  documents, practitioners are often interested in
                  identifying references to entities of interest,
                  and studying their prominence and topics over
                  time. Current tools are quite restrictive for this
                  setting; they are unable to handle streaming data,
                  do not partition the references according to
                  topics, and do not identify the vitality of the
                  references. In this work, we propose a system that
                  uses a flexible representation of entity contexts
                  that are updated in a streaming fashion. Each
                  entity context is represented by topic clusters
                  that are estimated in a non-parametric manner by
                  assuming that the context of each entity in a
                  single document belongs to a single topic. To
                  address lexical sparsity and generalize to
                  unseen documents, each document is represented by
                  its mean word embedding, while each topic cluster
                  is represented by the mean embedding vector of the
                  documents in the cluster. Further, we associate a
                  staleness measure with each topic cluster,
                  dynamically estimating the relevance of each
                  entity based on document frequencies. We update
                  the topic identities, number of topics, and the
                  staleness of topics in an online fashion,
                  observing only a single document at a time. This
                  combination of non-parametric clustering,
                  staleness, and distributed word embeddings
                  provides an efficient yet accurate representation
                  of entity contexts that can be updated in a
                  streaming manner.},
        URL = {http://terraswarm.org/pubs/435.html}
    }
    

Posted by Ignacio Cano on 4 Nov 2014.

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.