StreamLab: Large-Scale Machine Learning on Streams
Sameer Singh, Ignacio Cano, Tianqi Chen, Marco Ribeiro, Carlos Guestrin

Citation
Sameer Singh, Ignacio Cano, Tianqi Chen, Marco Ribeiro, Carlos Guestrin. "StreamLab: Large-Scale Machine Learning on Streams". Talk or presentation, 28 October 2014; poster presented at the 2014 TerraSwarm Annual Meeting.

Abstract
We are exploring a novel abstraction for distributed predictive modeling over streaming data, called StreamLab. Existing architectures, such as GraphLab, are designed to handle large collections of structured data; however, the same modeling paradigms do not hold when data is in the form of multiple streams of high-volume items. The team is exploring machine learning for parallel, online training that can capture dependencies across streams, and designing primitives that can be easily used to implement a large family of complex learning algorithms. Along with the implementation that includes a distributed parameter server that allows parallel online learners to share parameters, and stream aggregators that compute approximate statistics over large streams, the team has developed a novel “learning over counts” algorithm. This training algorithm uses aggregated statistics for prediction, and experiments on a variety of datasets from KDDCup and Kaggle demonstrate that “learning by counts” provides a scalable, accurate, and generalizable machine learning approach. Future work on StreamLab includes design and implementation of a number of existing machine learning algorithms into the current framework, in the process identifying primitive operations and reusable modules common to large-scale stream analytics.
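
The abstract does not spell out how the counts-based training works, so the following is only a rough illustration of the general idea of predicting from aggregated statistics, not the StreamLab algorithm itself. It assumes binary labels, keeps exact counts in memory (the abstract mentions approximate statistics), and simply averages smoothed per-feature positive rates instead of training a model on them. The names CountAggregator, rate, and predict are hypothetical.

```python
from collections import defaultdict


class CountAggregator:
    """Hypothetical stand-in for a stream aggregator.

    Keeps exact label counts per (feature name, feature value) pair.
    A real system over large streams would likely use approximate counts.
    """

    def __init__(self):
        # counts[(name, value)] -> [number of label-0 items, number of label-1 items]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, features, label):
        """Consume one stream item: a dict of features and a 0/1 label."""
        for name, value in features.items():
            self.counts[(name, value)][label] += 1

    def rate(self, name, value, prior=0.5, strength=1.0):
        """Smoothed fraction of positive labels seen with this feature value."""
        neg, pos = self.counts.get((name, value), (0, 0))
        return (pos + prior * strength) / (neg + pos + strength)


def predict(agg, features):
    """Score an item by averaging the smoothed positive rates of its features.

    This averaging is an illustrative assumption; a learner could instead
    be trained on the rates as input features.
    """
    rates = [agg.rate(name, value) for name, value in features.items()]
    return sum(rates) / len(rates)


# A tiny made-up stream of (features, label) items.
agg = CountAggregator()
agg.update({"user": "a", "ad": "x"}, 1)
agg.update({"user": "a", "ad": "y"}, 1)
agg.update({"user": "b", "ad": "x"}, 0)

score_a = predict(agg, {"user": "a", "ad": "x"})
score_b = predict(agg, {"user": "b", "ad": "x"})
print(score_a > score_b)
```

Because each item only increments a few counters, updates of this kind are cheap and can be merged across parallel workers by adding counts, which is one plausible reason count-style statistics suit streaming settings.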

Electronic downloads


Internal. This publication has been marked by the author for TerraSwarm-only distribution, so electronic downloads are not available without logging in.
Citation formats  
  • HTML
    Sameer Singh, Ignacio Cano, Tianqi Chen, Marco Ribeiro,
    Carlos Guestrin. <a
    href="http://www.terraswarm.org/pubs/455.html"><i>StreamLab:
    Large-Scale Machine Learning on Streams</i></a>,
    Talk or presentation, 28 October 2014; poster presented
    at the <a
    href="http://www.terraswarm.org/conferences/14/annual"
    >2014 TerraSwarm Annual Meeting</a>.
  • Plain text
    Sameer Singh, Ignacio Cano, Tianqi Chen, Marco Ribeiro,
    Carlos Guestrin. "StreamLab: Large-Scale Machine
    Learning on Streams". Talk or presentation, 28 October
    2014; poster presented at the 2014 TerraSwarm Annual
    Meeting.
  • BibTeX
    @presentation{SinghCanoChenRibeiroGuestrin14_StreamLabLargeScaleMachineLearningOnStreams,
        author = {Sameer Singh and Ignacio Cano and Tianqi Chen and
                  Marco Ribeiro and Carlos Guestrin},
        title = {StreamLab: Large-Scale Machine Learning on Streams},
        day = {28},
        month = {October},
        year = {2014},
        note = {Poster presented at the 2014 TerraSwarm Annual Meeting.},
        abstract = {We are exploring a novel abstraction for
                  distributed predictive modeling over streaming
                  data, called StreamLab. Existing architectures,
                  such as GraphLab, are designed to handle large
                  collections of structured data; however, the same
                  modeling paradigms do not hold when data is in the
                  form of multiple streams of high-volume items. The
                  team is exploring machine learning for parallel,
                  online training that can capture dependencies
                  across streams, and designing primitives that can
                  be easily used to implement a large family of
                  complex learning algorithms. Along with the
                  implementation that includes a distributed
                  parameter server that allows parallel online
                  learners to share parameters, and stream
                  aggregators that compute approximate statistics
                  over large streams, the team has developed a novel
                  ``learning over counts'' algorithm. This
                  training algorithm uses aggregated statistics for
                  prediction, and experiments on a variety of
                  datasets from KDDCup and Kaggle demonstrate that
                  ``learning by counts'' provides a scalable,
                  accurate, and generalizable machine learning
                  approach. Future work on StreamLab includes design
                  and implementation of a number of existing machine
                  learning algorithms into the current framework, in
                  the process identifying primitive operations and
                  reusable modules common to large-scale stream
                  analytics.},
        URL = {http://terraswarm.org/pubs/455.html}
    }
    

Posted by Sameer Singh on 10 Nov 2014.
Groups: tools

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.