Machine Learning Module for Big Data Analysis in Kepler

Mai Nguyen, Daniel Crawl, Jianwu Wang, Ilkay Altintas

Citation
Mai Nguyen, Daniel Crawl, Jianwu Wang, Ilkay Altintas. "Machine Learning Module for Big Data Analysis in Kepler". Talk or presentation, 16 October 2015. Presented at the Eleventh Biennial Ptolemy Miniconference, Berkeley.

Abstract
Kepler is a scientific workflow system built on the Ptolemy II framework. Its graphical user interface lets users design scientific workflows by dragging and dropping actors that implement different operations and linking them together to form the steps of a workflow. Machine learning techniques provide a data-driven way to analyze the problem being studied and are an essential part of many scientific processes. The machine learning module in Kepler allows users to integrate machine learning functionality into a workflow, even if the machine learning algorithms are implemented on different tools or platforms. For example, an R script can be executed within Kepler using the RExpression actor, and a Mahout algorithm or KNIME workflow can be executed in command-line mode using the ExternalExecution actor. For big data processing, actors implementing Spark MLlib algorithms are being developed in Kepler. Spark is a cluster computing framework, and MLlib is a distributed machine learning library built on top of Spark. Spark’s distributed in-memory architecture provides fast and scalable processing of iterative operations, which suits machine learning algorithms well. The machine learning module in Kepler can also create a single actor for a machine learning algorithm that is based on different implementations. As an example, the kmeans-all actor in Kepler implements the k-means clustering algorithm using R, Spark MLlib, Mahout, and KNIME. This allows the user to compare accuracy and processing results for a single algorithm across implementations using the same actor, with the only change being the choice of implementation when the actor is executed. The user does not need to know much about any of the underlying tools (e.g., R or Spark) in order to use these actors. Each actor in the machine learning module can also be connected to other actors available in Kepler to build complex workflows.
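
The idea behind an actor such as kmeans-all (one algorithm, several interchangeable implementations, selected by a single parameter) can be illustrated with a minimal Python sketch. This is not Kepler code and does not call R, Spark MLlib, Mahout, or KNIME. All names here (BACKENDS, register, kmeans_all, wcss, and the two backend names) are hypothetical, and the two pure-Python k-means variants only stand in for real implementations, under the assumption that each backend accepts the same inputs and returns one cluster label per point.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Backend = Callable[[Sequence[Point], int, int], List[int]]

# Hypothetical registry: implementation name -> function returning one label per point.
BACKENDS: Dict[str, Backend] = {}


def register(name: str) -> Callable[[Backend], Backend]:
    """Decorator that adds a k-means implementation to the registry."""
    def wrap(fn: Backend) -> Backend:
        BACKENDS[name] = fn
        return fn
    return wrap


def _nearest(p: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to p (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))


def _lloyd(points: Sequence[Point], centers: Sequence[Point], max_iter: int = 100) -> List[int]:
    """Lloyd's iterations: recompute centers, reassign points, stop when labels settle."""
    centers = list(centers)
    labels = [_nearest(p, centers) for p in points]
    for _ in range(max_iter):
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        new_labels = [_nearest(p, centers) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


@register("random_init")
def kmeans_random(points: Sequence[Point], k: int, seed: int) -> List[int]:
    """Stand-in backend 1: k-means started from k randomly sampled points."""
    rng = random.Random(seed)
    return _lloyd(points, rng.sample(list(points), k))


@register("farthest_first_init")
def kmeans_farthest_first(points: Sequence[Point], k: int, seed: int) -> List[int]:
    """Stand-in backend 2: deterministic farthest-first start (seed is unused)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return _lloyd(points, centers)


def kmeans_all(points: Sequence[Point], k: int,
               implementation: str = "random_init", seed: int = 0) -> List[int]:
    """Single entry point; only the `implementation` argument changes between runs."""
    if implementation not in BACKENDS:
        raise ValueError(
            f"unknown implementation {implementation!r}; available: {sorted(BACKENDS)}")
    return BACKENDS[implementation](points, k, seed)


def wcss(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Within-cluster sum of squares, one simple way to compare results."""
    total = 0.0
    for j in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == j]
        center = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    for name in sorted(BACKENDS):
        result = kmeans_all(data, k=2, implementation=name)
        print(name, result, round(wcss(data, result), 3))
```

In this sketch the caller never touches a backend directly. Switching implementations means changing one string, which mirrors the abstract's point that the same actor is reused and only the choice of implementation differs. A wrapper around an external tool could be registered in the same way, provided it follows the same input and output convention.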

Electronic downloads

Nguyen_ML4BigDataAnalysisInKepler_PtolemyMiniConference_16Oct2015.pptx · application/vnd.openxmlformats-officedocument.presentationml.pre · 1538 kbytes

Citation formats

HTML

Mai Nguyen, Daniel Crawl, Jianwu Wang, Ilkay Altintas. <a
href="http://chess.eecs.berkeley.edu/pubs/1123.html"><i>Machine
Learning Module for Big Data Analysis in
Kepler</i></a>, Talk or presentation, 16 October 2015.
Presented at the <a
href="http://ptolemy.eecs.berkeley.edu/conferences/15/"
>Eleventh Biennial Ptolemy Miniconference</a>,
Berkeley.

Plain text

Mai Nguyen, Daniel Crawl, Jianwu Wang, Ilkay Altintas.
"Machine Learning Module for Big Data Analysis in
Kepler". Talk or presentation, 16 October 2015.
Presented at the Eleventh Biennial Ptolemy Miniconference,
Berkeley.

BibTeX

@presentation{NguyenCrawlWangAltintas15_MachineLearningModuleForBigDataAnalysisInKepler,
    author = {Mai Nguyen and Daniel Crawl and Jianwu Wang and
              Ilkay Altintas},
    title = {Machine Learning Module for Big Data Analysis in
              Kepler},
    day = {16},
    month = {October},
    year = {2015},
    note = {Presented at the Eleventh Biennial Ptolemy Miniconference,
              Berkeley},
    abstract = {Kepler is a scientific workflow system that is
              built on the Ptolemy II framework. Kepler provides
              a graphical user interface that allows users to
              easily design scientific workflows by simply
              dragging and dropping actors implementing
              different operations and linking them together to
              create the steps necessary for a specific
              workflow. Machine learning techniques provide a
              way to analyze the problem being studied using a
              data-driven approach, and are an essential part of
              many scientific processes. The machine learning
              module in Kepler allows users to integrate machine
              learning functionality into a workflow, even if
              the machine learning algorithms are implemented on
              different tools or platforms. For example, an R
              script can be executed within Kepler using the
              RExpression actor. A Mahout algorithm or KNIME
              workflow can be executed in command-line mode
              using the ExternalExecution actor. For big data
              processing, actors implementing Spark MLlib
              algorithms are being developed in Kepler. Spark is
              a cluster computing framework, and MLlib is a
              distributed machine learning library on top of
              Spark. Spark's distributed in-memory
              architecture provides fast and scalable processing
              of iterative operations, which is ideal for
              machine learning algorithms. The machine learning
              module in Kepler can also create an actor for a
              single machine learning algorithm based on
              different implementations. As an example, the
              kmeans-all actor in Kepler implements the k-means
              clustering algorithm using R, Spark MLlib, Mahout,
              and KNIME. This feature allows the user to compare
              accuracy and processing results for a single
              algorithm using different implementations. This
              can be accomplished using the same actor, with the
              only change being the choice of implementation
              when the actor is executed. The user does not need
              to know much about any of the underlying tools
              (e.g., R or Spark) in order to use these actors.
              Each actor in the machine learning module can also
              be connected to other actors available in Kepler
              to build complex workflows. },
    URL = {http://chess.eecs.berkeley.edu/pubs/1123.html}
}

Posted by Christopher Brooks on 19 Oct 2015.
Groups: ptolemy

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.