StreamLab: Large-Scale Machine Learning on Streams

Sameer Singh, Ignacio Cano, Tianqi Chen, Marco Ribeiro, Carlos Guestrin

Citation
Sameer Singh, Ignacio Cano, Tianqi Chen, Marco Ribeiro, Carlos Guestrin. "StreamLab: Large-Scale Machine Learning on Streams". Talk or presentation, 28 October 2014; Poster presented at the 2014 TerraSwarm Annual Meeting.

Abstract
We are exploring a novel abstraction for distributed predictive modeling over streaming data, called StreamLab. Existing architectures, such as GraphLab, are designed to handle large collections of structured data; however, the same modeling paradigms do not hold when data is in the form of multiple streams of high-volume items. The team is exploring machine learning for parallel, online training that can capture dependencies across streams, and designing primitives that can be easily used to implement a large family of complex learning algorithms. Along with an implementation that includes a distributed parameter server that allows parallel online learners to share parameters, and stream aggregators that compute approximate statistics over large streams, the team has developed a novel “learning over counts” algorithm. This training algorithm uses aggregated statistics for prediction, and experiments on a variety of datasets from KDDCup and Kaggle demonstrate that “learning by counts” provides a scalable, accurate, and generalizable machine learning approach. Future work on StreamLab includes the design and implementation of a number of existing machine learning algorithms in the current framework, in the process identifying primitive operations and reusable modules common to large-scale stream analytics.
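
The abstract does not specify how the stream aggregators or the count-based learner work, so the following is only an illustrative sketch of the general idea, not StreamLab's implementation. It assumes a count-min sketch as the approximate aggregator and a simple binary predictor that averages smoothed positive rates per feature value. The class names `CountMinSketch` and `CountLearner`, their parameters, and the toy feature `user` are all hypothetical.

```python
import hashlib


class CountMinSketch:
    """Approximate counter for stream items (hypothetical stand-in for a
    stream aggregator). Estimates never undercount, but may overcount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One independent hash per row, derived by salting with the row id.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # The minimum over rows is the least-contaminated counter.
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )


class CountLearner:
    """Binary predictor built only from aggregated counts (an assumption
    about what a count-based learner could look like)."""

    def __init__(self, smoothing=1.0):
        self.positives = CountMinSketch()
        self.totals = CountMinSketch()
        self.smoothing = smoothing

    def update(self, features, label):
        # Online update: one pass over the features of a single stream item.
        for name, value in features.items():
            key = f"{name}={value}"
            self.totals.add(key)
            if label:
                self.positives.add(key)

    def predict_proba(self, features):
        # Average the smoothed positive rate of each observed feature value.
        rates = []
        for name, value in features.items():
            key = f"{name}={value}"
            pos = self.positives.estimate(key)
            total = self.totals.estimate(key)
            rates.append((pos + self.smoothing) / (total + 2 * self.smoothing))
        return sum(rates) / len(rates) if rates else 0.5


# Toy stream: user "a" always has a positive label, user "b" never does.
learner = CountLearner()
for _ in range(20):
    learner.update({"user": "a"}, 1)
    learner.update({"user": "b"}, 0)

print(learner.predict_proba({"user": "a"}))
print(learner.predict_proba({"user": "b"}))
```

In this sketch the learner keeps no per-example state, only two fixed-size count tables, which is what makes a count-based approach attractive for high-volume streams. A distributed version would additionally need to merge or share these tables across workers, which is not shown here.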

Electronic downloads

Internal. This publication has been marked by the author for TerraSwarm-only distribution, so electronic downloads are not available without logging in.

Citation formats

HTML

Sameer Singh, Ignacio Cano, Tianqi Chen, Marco Ribeiro,
Carlos Guestrin. <a
href="http://www.terraswarm.org/pubs/455.html"><i>StreamLab:
Large-Scale Machine Learning on Streams</i></a>,
Talk or presentation, 28 October 2014; Poster presented
at the <a
href="http://www.terraswarm.org/conferences/14/annual"
>2014 TerraSwarm Annual Meeting</a>.

Plain text

Sameer Singh, Ignacio Cano, Tianqi Chen, Marco Ribeiro,
Carlos Guestrin. "StreamLab: Large-Scale Machine
Learning on Streams". Talk or presentation, 28 October
2014; Poster presented at the 2014 TerraSwarm Annual
Meeting.

BibTeX

@presentation{SinghCanoChenRibeiroGuestrin14_StreamLabLargeScaleMachineLearningOnStreams,
    author = {Sameer Singh and Ignacio Cano and Tianqi Chen and
              Marco Ribeiro and Carlos Guestrin},
    title = {StreamLab: Large-Scale Machine Learning on Streams},
    day = {28},
    month = {October},
    year = {2014},
    note = {Poster presented at the 2014 TerraSwarm Annual Meeting
              (http://www.terraswarm.org/conferences/14/annual).},
    abstract = {We are exploring a novel abstraction for
              distributed predictive modeling over streaming
              data, called StreamLab. Existing architectures,
              such as GraphLab, are designed to handle large
              collections of structured data; however, the same
              modeling paradigms do not hold when data is in the
              form of multiple streams of high-volume items. The
              team is exploring machine learning for parallel,
              online training that can capture dependencies
              across streams, and designing primitives that can
              be easily used to implement a large family of
              complex learning algorithms. Along with the
              implementation that includes a distributed
              parameter server that allows parallel online
              learners to share parameters, and stream
              aggregators that compute approximate statistics
              over large streams, the team has developed a novel
              ``learning over counts'' algorithm. This
              training algorithm uses aggregated statistics for
              prediction, and experiments on a variety of
              datasets from KDDCup and Kaggle demonstrate that
              ``learning by counts'' provides a scalable,
              accurate, and generalizable machine learning
              approach. Future work on StreamLab includes design
              and implementation of a number of existing machine
              learning algorithms into the current framework, in
              the process identifying primitive operations and
              reusable modules common to large-scale stream
              analytics.},
    URL = {http://terraswarm.org/pubs/455.html}
}

Posted by Sameer Singh on 10 Nov 2014.
Groups: tools

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.