Streaming Document Filtering using Distributed Non-Parametric Representations
Ignacio Cano, Sameer Singh, Carlos Guestrin

Citation
Ignacio Cano, Sameer Singh, Carlos Guestrin. "Streaming Document Filtering using Distributed Non-Parametric Representations". Talk or presentation, October 30, 2014; Poster presented at the 2014 TerraSwarm Annual Meeting.

Abstract
When dealing with large, streaming datasets (such as text documents), practitioners are often interested in identifying references to certain entities, and studying their prominence and topics over time. Current tools are quite restrictive for this setting: they are unable to handle streaming data, do not partition the references according to topics, and do not identify the vitality of the references. In this work, we propose a system that uses a flexible representation of entity contexts that are updated in a streaming fashion. We use distributed vector-based representations of contexts, which address lexical sparsity in the data and generalize to unseen documents. Since each entity has several different aspects in which it may be mentioned (its topics), we propose a non-parametric clustering approach to identify which aspect of the entity is mentioned in each document. It is also crucial to capture the temporal dynamics of the content; we propose a "staleness" measure for each entity topic, dynamically estimating the relevance of each entity. To enable analysis on massive-scale streams, we propose updates to the topic identities, number of topics, and the staleness of topics in an online fashion, observing only a single document at a time. This combination of non-parametric clustering, staleness, and distributed word embeddings provides an efficient yet accurate representation of entity context that can be updated in a streaming manner. Our browser-based visualization demo provides an easy-to-use interface that enables users to switch between multiple entities of interest, select the time ranges to explore over, explore the prominence of topics over time, and understand the topics using word clouds.
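The abstract does not spell out the clustering rule or the staleness formula, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes each document has already been reduced to a context vector (for example, an average of word embeddings), creates a new topic whenever no existing centroid is similar enough (a threshold rule in the spirit of DP-means, chosen here as a stand-in for the non-parametric step), and models staleness as a half-life decay since the topic was last seen. All names (`EntityTopics`, `min_similarity`, `half_life`) and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class Topic:
    """One topic (cluster) of an entity: a running mean of context vectors."""

    def __init__(self, vector, time):
        self.centroid = list(vector)
        self.count = 1
        self.last_seen = time

    def add(self, vector, time):
        # Incremental mean: no past documents need to be stored.
        self.count += 1
        self.centroid = [c + (x - c) / self.count
                         for c, x in zip(self.centroid, vector)]
        self.last_seen = time

    def staleness(self, now, half_life):
        # Assumed measure: 0 right after a mention, approaching 1 with age.
        return 1.0 - 0.5 ** ((now - self.last_seen) / half_life)


class EntityTopics:
    """Single-pass topic tracker for one entity (illustrative only)."""

    def __init__(self, min_similarity=0.7, half_life=10.0):
        self.min_similarity = min_similarity  # assumed new-topic threshold
        self.half_life = half_life            # assumed staleness half-life
        self.topics = []

    def observe(self, vector, time):
        """Assign one document's context vector to a topic; return its index."""
        best, best_sim = None, -1.0
        for i, topic in enumerate(self.topics):
            sim = cosine(topic.centroid, vector)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.min_similarity:
            # No existing topic is close enough: the number of topics grows.
            self.topics.append(Topic(vector, time))
            return len(self.topics) - 1
        self.topics[best].add(vector, time)
        return best

    def staleness(self, now):
        """Staleness of every topic at time `now`, in topic order."""
        return [t.staleness(now, self.half_life) for t in self.topics]


# Toy stream of (context vector, timestamp) pairs, one document at a time.
tracker = EntityTopics(min_similarity=0.7, half_life=10.0)
stream = [([1.0, 0.0], 0), ([0.9, 0.1], 1), ([0.0, 1.0], 2), ([0.1, 0.9], 12)]
assignments = [tracker.observe(vec, t) for vec, t in stream]
print(assignments)  # [0, 0, 1, 1]
```

In this toy stream the first two documents share a topic, the third opens a new one, and by time 12 the first topic has become partly stale while the second, just mentioned, has staleness 0. A real system would tune the threshold and decay to the embedding space and stream rate.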

Electronic downloads
Internal. This publication has been marked by the author for TerraSwarm-only distribution, so electronic downloads are not available without logging in.
Citation formats  
  • HTML
    Ignacio Cano, Sameer Singh, Carlos Guestrin. <a
    href="http://www.terraswarm.org/pubs/433.html"><i>Streaming
    Document Filtering using Distributed Non-Parametric
    Representations</i></a>, Talk or presentation,
    October 30, 2014; Poster presented at the <a
    href="http://www.terraswarm.org/conferences/14/annual"
    >2014 TerraSwarm Annual Meeting</a>.
  • Plain text
    Ignacio Cano, Sameer Singh, Carlos Guestrin. "Streaming
    Document Filtering using Distributed Non-Parametric
    Representations". Talk or presentation, October 30,
    2014; Poster presented at the 2014 TerraSwarm Annual
    Meeting.
  • BibTeX
    @presentation{CanoSinghGuestrin14_StreamingDocumentFilteringUsingDistributedNonParametric,
        author = {Ignacio Cano and Sameer Singh and Carlos Guestrin},
        title = {Streaming Document Filtering using Distributed
                  Non-Parametric Representations},
        day = {30},
        month = {October},
        year = {2014},
        note = {Poster presented at the 2014 TerraSwarm Annual Meeting},
        abstract = {When dealing with large, streaming datasets (such
                  as text documents), practitioners are often
                  interested in identifying references to certain
                  entities, and studying their prominence and topics
                  over time. Current tools are quite restrictive for
                  this setting; they are unable to handle streaming
                  data, do not partition the references according to
                  topics, and do not identify the vitality of the
                  references. In this work, we propose a system that
                  uses a flexible representation of entity contexts
                  that are updated in a streaming fashion. We use
                  distributed vector-based representations of
                  contexts, which address lexical sparsity in the
                  data and generalize to unseen documents. Since
                  each entity has several different aspects in which
                  it may be mentioned (its topics), we propose a
                  non-parametric clustering approach to identify
                  which aspect of the entity is mentioned in each
                  document. It is also crucial to capture the temporal
                  dynamics of the content; we propose a "staleness"
                  measure for each entity topic, dynamically
                  estimating the relevance of each entity. To enable
                  analysis on massive-scale streams, we propose
                  updates to the topic identities, number of topics,
                  and the staleness of topics in an online fashion,
                  observing only a single document at a time. This
                  combination of non-parametric clustering,
                  staleness, and distributed word embeddings
                  provides an efficient yet accurate representation
                  of entity context that can be updated in a
                  streaming manner. Our browser-based visualization
                  demo provides an easy-to-use interface that
                  enables users to switch between multiple entities
                  of interest, select the time ranges to explore
                  over, explore the prominence of topics over time,
                  and understand the topics using word clouds.},
        URL = {http://terraswarm.org/pubs/433.html}
    }
    

Posted by Ignacio Cano on 4 Nov 2014.

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.