MODEC: Multimodal Decomposable Models for Human Pose Estimation
Ben Sapp, Ben Taskar

Citation
Ben Sapp, Ben Taskar. "MODEC: Multimodal Decomposable Models for Human Pose Estimation". IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.

Abstract
The authors propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. The authors instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, the approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. It also employs a cascaded mode-selection step that controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. The model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve on several pose datasets, including a newly collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets. The new dataset and code are available online.
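
The abstract compresses the key inference idea: score an image under several pose modes, and use a cheap cascaded step to prune modes before running the full model. The sketch below illustrates that idea in Python. It is only a rough illustration under stated assumptions: the (cheap mode scorer, pose estimator) interface, the function names, and the top-fraction pruning rule are hypothetical stand-ins, not the authors' released code or API.

    import numpy as np

    def modec_style_inference(features, modes, keep_frac=0.5):
        """Illustrative multimodal inference with a cascaded
        mode-selection step (hypothetical interface, not MODEC's API).

        `modes`: list of (cheap_scorer, pose_estimator) pairs, where
        cheap_scorer(features) -> scalar score, and
        pose_estimator(features) -> (pose, score).
        """
        # Cascade: cheaply score every pose mode, then keep only the
        # top fraction. This step trades a little accuracy for speed.
        cheap = np.array([scorer(features) for scorer, _ in modes])
        n_keep = max(1, int(np.ceil(keep_frac * len(modes))))
        survivors = np.argsort(cheap)[::-1][:n_keep]

        # Full model: run the expensive mode-specific pose estimator
        # only on surviving modes, and pick the best (mode, pose) pair.
        best = None
        for m in survivors:
            pose, score = modes[m][1](features)
            if best is None or score > best[2]:
                best = (m, pose, score)
        return best  # (mode index, pose, score)

    # Toy usage with two hypothetical pose modes.
    rng = np.random.default_rng(0)
    toy_modes = [
        (lambda f: f.mean(), lambda f: ("arms-down pose", float(f.sum()))),
        (lambda f: -f.mean(), lambda f: ("arms-up pose", float(-f.sum()))),
    ]
    print(modec_style_inference(rng.normal(size=8), toy_modes))

The cascade is the source of the reported speedup: the cheap scorers discard most modes up front, so the expensive mode-specific estimators run only on the survivors, and the final (mode, pose) pair is chosen jointly by score.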

Citation formats  
  • HTML
    Ben Sapp, Ben Taskar. <a href="http://www.terraswarm.org/pubs/65.html">MODEC: Multimodal Decomposable Models for Human Pose Estimation</a>, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
  • Plain text
    Ben Sapp, Ben Taskar. "MODEC: Multimodal Decomposable
    Models for Human Pose Estimation". IEEE Conference on
    Computer Vision and Pattern Recognition (CVPR), June 2013.
  • BibTeX
    @inproceedings{SappTaskar13_MODECMultimodalDecomposableModelsForHumanPoseEstimation,
        author = {Ben Sapp and Ben Taskar},
        title = {MODEC: Multimodal Decomposable Models for Human
                  Pose Estimation},
        booktitle = {IEEE Conference on Computer Vision and Pattern
                  Recognition (CVPR)},
        month = {June},
        year = {2013},
        abstract = {The authors propose a multimodal, decomposable
                  model for articulated human pose estimation in
                  monocular images. A typical approach to this
                  problem is to use a linear structured model, which
                  struggles to capture the wide range of appearance
                  present in realistic, unconstrained images. The
                  authors instead propose a model of human pose that
                  explicitly captures a variety of pose modes.
                  Unlike other multimodal models, the approach
                  includes both global and local pose cues and uses
                  a convex objective and joint training for mode
                  selection and pose estimation. It also employs a
                  cascaded mode-selection step that controls the
                  trade-off between speed and accuracy, yielding a
                  5x speedup in inference and learning. The model
                  outperforms state-of-the-art approaches across the
                  accuracy-speed trade-off curve on several pose
                  datasets, including a newly collected dataset of
                  people in movies, FLIC, which contains an order of
                  magnitude more labeled data for training and
                  testing than existing datasets. The new dataset
                  and code are available online.},
        URL = {http://terraswarm.org/pubs/65.html}
    }
    

Posted by Mila MacBain on 13 May 2013.

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.