CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis

¹IIIS, Tsinghua University  ²Shanghai Artificial Intelligence Laboratory  ³Shanghai Qi Zhi Institute
*Equal contribution with the order determined by rolling dice
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Abstract

In this work, we focus on a novel task of category-level functional hand-object manipulation synthesis covering both rigid and articulated object categories. Given an object geometry, an initial human hand pose, and a sparse control sequence of object poses, our goal is to generate a physically plausible hand-object manipulation sequence that performs the task like a human. To address this challenge, we first design CAnonicalized Manipulation Spaces (CAMS), a two-level space hierarchy that canonicalizes hand poses in an object-centric and contact-centric view. Benefiting from the representation capability of CAMS, we then present a two-stage framework for synthesizing human-like manipulation animations. Our framework achieves state-of-the-art performance for both rigid and articulated categories with impressive visual effects.

Video

Method

Our framework consists of a CVAE-based planner module and an optimization-based synthesizer module. Given the generation condition as input, the planner first generates a per-stage CAMS representation containing contact reference frames and finger embedding sequences. The synthesizer then optimizes the whole manipulation animation based on the CAMS embedding. A high-level sketch of this pipeline is given below.
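The following is a minimal sketch of the pipeline under assumed interfaces. The names generate_manipulation, planner.sample, and synthesizer.optimize, as well as the exact contents of the condition, are illustrative placeholders rather than the authors' released API.

# A minimal pipeline sketch (hypothetical interfaces, not the released code).
def generate_manipulation(planner, synthesizer, object_geometry,
                          init_hand_pose, object_pose_controls):
    """Condition -> per-stage CAMS plan -> optimized hand-object animation."""
    # 1) Pack the generation condition: object shape, initial hand pose,
    #    and the sparse control sequence of object poses.
    condition = {
        "object_geometry": object_geometry,
        "init_hand_pose": init_hand_pose,
        "object_pose_controls": object_pose_controls,
    }
    # 2) The CVAE-based planner samples a CAMS representation: per-stage
    #    contact reference frames plus canonicalized finger embeddings.
    cams_plan = planner.sample(condition)
    # 3) The synthesizer optimizes the full manipulation animation so that it
    #    follows the CAMS plan while remaining physically plausible.
    return synthesizer.optimize(cams_plan, condition)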

CAMS Representation of Hand Motion

CAnonicalized Manipulation Spaces use a two-level canonicalization to represent manipulation. At the root level, the canonicalized contact targets (top right) describe the discrete contact information. At the leaf level, the canonicalized finger embedding (bottom right) transforms finger motion from global space into local reference frames defined on the contact targets.
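To make the leaf-level idea concrete, here is a small sketch of expressing finger keypoints in a local reference frame attached to a contact target. Building the frame from a contact point and normal is an assumption for illustration; the paper defines its canonical frames on its own contact targets.

import numpy as np

def make_contact_frame(contact_point, contact_normal):
    # Build a 4x4 rigid transform whose z-axis is the (assumed) contact normal.
    z = contact_normal / np.linalg.norm(contact_normal)
    # Pick any vector not parallel to z to complete an orthonormal basis.
    helper = np.array([1.0, 0.0, 0.0]) if abs(z[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    x = np.cross(helper, z); x /= np.linalg.norm(x)
    y = np.cross(z, x)
    frame = np.eye(4)
    frame[:3, :3] = np.stack([x, y, z], axis=1)  # columns are the frame axes
    frame[:3, 3] = contact_point
    return frame

def canonicalize_finger(finger_keypoints_world, contact_frame):
    # Map (T, J, 3) finger keypoints from global space into the contact frame:
    # p_local = R^T (p_world - t).
    R, t = contact_frame[:3, :3], contact_frame[:3, 3]
    return (finger_keypoints_world - t) @ R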

CAMS-CVAE

The CVAE-based motion planner module takes the task configuration and object shape as inputs and generates a CAMS sample of motion corresponding to the input condition.
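For reference, the following is a bare-bones conditional VAE sketch in PyTorch. The dimensions, the flat condition vector, and the single-vector CAMS output are simplifying assumptions; the actual CAMS-CVAE conditions on the object shape and decodes structured per-stage contact frames and finger-embedding sequences.

import torch
import torch.nn as nn

class CAMSCVAE(nn.Module):
    # Placeholder architecture: encode (CAMS, condition) to a latent,
    # decode (latent, condition) back to a CAMS vector.
    def __init__(self, cond_dim=512, cams_dim=256, latent_dim=64, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(cams_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))           # -> (mu, logvar)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, cams_dim))
        self.latent_dim = latent_dim

    def forward(self, cams, cond):
        mu, logvar = self.encoder(torch.cat([cams, cond], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        recon = self.decoder(torch.cat([z, cond], -1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl                                  # reconstruction + KL terms

    @torch.no_grad()
    def sample(self, cond):
        # At test time, draw the latent from the prior and decode a CAMS plan.
        z = torch.randn(cond.shape[0], self.latent_dim, device=cond.device)
        return self.decoder(torch.cat([z, cond], -1))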

Optimization-based Synthesizer

The synthesizer adopts a two-stage optimization method that first optimizes the MANO pose parameters to best fit the CAMS finger embedding and then optimizes the contact effect to improve physical plausibility.
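The sketch below illustrates such a two-stage optimization loop under assumed interfaces: hand_model stands in for a differentiable MANO layer mapping pose parameters to hand joints, and embedding_loss / plausibility_loss are placeholders for the CAMS-fitting and contact/penetration terms. The loss weights and step counts are illustrative, not the paper's settings.

import torch

def synthesize(init_pose, n_frames, hand_model,
               embedding_loss, plausibility_loss, steps=300, lr=1e-2):
    # Optimize one pose vector per frame, initialized from the given hand pose.
    pose = init_pose.detach().clone().repeat(n_frames, 1).requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)

    # Stage 1: fit the hand motion to the CAMS finger embeddings, with a
    # temporal smoothness regularizer on the pose trajectory.
    for _ in range(steps):
        joints = hand_model(pose)                                  # (T, J, 3)
        loss = embedding_loss(joints) + 1e-2 * (pose[1:] - pose[:-1]).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: refine the contact effect for physical plausibility, e.g.
    # encouraging firm contact and penalizing hand-object penetration.
    for _ in range(steps):
        joints = hand_model(pose)
        loss = plausibility_loss(joints)
        opt.zero_grad(); loss.backward(); opt.step()

    return pose.detach()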

Experiments

Results

Qualitative results on four object categories: Kettle, Laptop, Pliers, and Scissors. Each example shows the input condition together with two additional views of the synthesized manipulation.

Mode Diversity

Diverse manipulation modes generated for the Laptop, Pliers, and Scissors categories, each shown from multiple views.

Comparison

Qualitative comparison of our method against GraspTTA and ManipNet, shown from multiple views.

BibTeX

@InProceedings{Zheng_2023_CVPR,
    author    = {Zheng, Juntian and Zheng, Qingyuan and Fang, Lixing and Liu, Yun and Yi, Li},
    title     = {CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {585-594}
}

Contact

If you have any questions, please feel free to contact us:
Juntian Zheng: jt-zheng20@mails.tsinghua.edu.cn
Lixing Fang: flx20@mails.tsinghua.edu.cn