Summarization of and Learning in TerraSwarm Big Data
Jeffrey A. Bilmes

Citation
Jeffrey A. Bilmes. "Summarization of and Learning in TerraSwarm Big Data". Talk or presentation, 30, October, 2014.

Abstract
TerraSwarm collects unprecedentedly vast amounts of heterogeneous data and uses it to make timely, well-informed decisions and take actions in the external world. New computational strategies are needed to deal with such masses of data, which are also likely to be highly redundant. Redundancy is detrimental to scientific discovery, to computational resources (storage, CPU, communications), and to multi-pass machine learning methods. Such redundancy is unavoidable, however, since it is not cost-effective to deploy coordinated strategies at the network leaf nodes (the sensors) to remove redundancy at data collection time. Our approach is to develop mathematical models of the "information" in the data so that it can be summarized cost-effectively; that is, we compute subsets that have minimal information loss. We use generalized "information functions" (in the form of submodular functions) to find data subsets that retain the information of the whole. In this poster, we describe how submodular summarization methods have been very effective for summarizing quite different types of data, including large sets of text documents, machine learning training data sets (e.g., for speech recognition and machine translation), and photo collections.
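
To make the idea of a submodular "information function" concrete, the following is a minimal sketch and not the method used in the presentation, which the abstract does not specify. It assumes a facility-location objective, f(S) = sum over items i of the largest similarity between i and any item in S, which is a standard monotone submodular function, and it selects a subset with the classic greedy rule. For monotone submodular functions under a cardinality constraint, greedy selection is known to come within a factor of (1 - 1/e) of the optimum. The names `facility_location`, `greedy_summary`, and the toy similarity matrix `sim` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], subset: Sequence[int]) -> float:
    """f(S): for every item, take its best similarity to a selected item, then sum."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_summary(sim: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Select up to `budget` items, each time adding the largest marginal gain."""
    selected: List[int] = []
    remaining = set(range(len(sim)))
    current = 0.0  # value of f on the items selected so far
    while remaining and len(selected) < budget:
        best, best_gain = None, 0.0
        for j in sorted(remaining):
            gain = facility_location(sim, selected + [j]) - current
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:  # no candidate adds any information
            break
        selected.append(best)
        remaining.remove(best)
        current += best_gain
    return selected


# Assumed toy data: items 0-2 are near-duplicates of each other, as are items 3-4.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]

summary = greedy_summary(sim, budget=2)
print(summary)
```

With this toy matrix, the first pick is the item most similar to everything else, and the second pick comes from the other group of near-duplicates, because a second item from the same group adds almost no marginal gain. This naive version re-evaluates f for every candidate at every step; at scale one would likely use an accelerated ("lazy") greedy variant instead.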

Electronic downloads


Internal. This publication has been marked by the author for TerraSwarm-only distribution, so electronic downloads are not available without logging in.
Citation formats  
  • HTML
    Jeffrey A. Bilmes. <a
    href="http://www.terraswarm.org/pubs/405.html"
    ><i>Summarization of and Learning in TerraSwarm Big
    Data</i></a>, Talk or presentation, 30,
    October, 2014.
  • Plain text
    Jeffrey A. Bilmes. "Summarization of and Learning in
    TerraSwarm Big Data". Talk or presentation, 30,
    October, 2014.
  • BibTeX
    @presentation{Bilmes14_SummarizationOfLearningInTerraSwarmBigData,
        author = {Jeffrey A. Bilmes},
        title = {Summarization of and Learning in TerraSwarm Big
                  Data},
        day = {30},
        month = {October},
        year = {2014},
        abstract = {TerraSwarm collects unprecedentedly vast amounts
                  of heterogeneous data, and makes well-informed
                  external-world timely decisions and actions based
                  thereon. New computational strategies are needed
                  to deal with such masses of data, which are likely
                  also to be highly redundant. Redundancy is
                  detrimental to scientific discovery, computational
                  resources (storage, CPU, communications), and
                  multi-pass machine learning methods. Such
                  redundancy is unavoidable, however, since it is
                  not cost-effective to deploy, at data collection
                  time, coordinated strategies at network leaf-nodes
                  (the sensors) to remove redundancy. Our approach
                  is to develop mathematical models of the
                  "information" in the data to be able to summarize
                  it cost effectively --- this means we compute
                  subsets that have minimal information loss. We
                  utilize generalized "information functions" (in
                  the form of submodular functions) to find data
                  subsets that retain the information of the whole.
                  In this poster, we describe how submodular
                  summarization methods have been very effective for
                  summarizing quite different types of data,
                  including large sets of text documents, machine
                  learning training data sets (e.g., for speech
                  recognition and machine translation), and also
                  photo collections. },
        URL = {http://terraswarm.org/pubs/405.html}
    }
    

Posted by Jeffrey A. Bilmes on 28 Oct 2014.
Groups: services

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.