Animating the Uncaptured
Humanoid Mesh Animation with Video Diffusion Models

Technical University of Munich

Abstract

Animation of humanoid characters is essential in many graphics applications, but creating realistic animations requires significant time and cost.

We propose an approach to synthesize 4D animated sequences from input static 3D humanoid meshes, leveraging strong, generalized motion priors from generative video models -- such models encode powerful motion information covering a wide variety of human motions.

From an input static 3D humanoid mesh and a text prompt describing the desired animation, we synthesize a corresponding video conditioned on a rendered image of the 3D mesh.

We then employ an underlying SMPL representation to animate the corresponding 3D mesh according to the video-generated motion, based on our motion optimization. This provides a cost-effective and accessible solution for synthesizing diverse and realistic 4D animations.

Video

Method Overview


We present a novel approach for text-driven animation of 3D humanoid meshes.

  1. Given a static 3D humanoid mesh and a text prompt, we generate a motion video using a Video Diffusion Model (VDM) conditioned on the rendering of the mesh from a single view.
  2. To animate the input mesh, we use the SMPL body model as a deformation proxy. This involves: 1) Fitting the SMPL model to the input mesh, and 2) Anchoring mesh vertices with SMPL faces.
  3. Finally, we optimize the SMPL parameters to track the motion from the video, using body landmarks, silhouette masks, and DINOv2 dense features.
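The anchoring step above (binding mesh vertices to the faces of a fitted body-model proxy, then replaying the proxy's deformation on the mesh) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a simple nearest-centroid face query (a real pipeline would use a BVH or `trimesh` proximity query), and encodes each vertex as barycentric coordinates on its anchor face plus a signed offset along the face normal.

```python
# Hedged sketch of vertex-to-proxy-face anchoring (barycentric coords + normal
# offset), a standard deformation-transfer technique; names are illustrative.
import numpy as np

def barycentric(p, tri):
    """Barycentric coordinates of the projection of point p onto triangle tri (3x3)."""
    a, b, c = tri
    ab, ac, ap = b - a, c - a, p - a
    d00, d01, d11 = ab @ ab, ab @ ac, ac @ ac
    d20, d21 = ap @ ab, ap @ ac
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - v - w, v, w])

def anchor_vertices(mesh_v, proxy_v, proxy_f):
    """Bind each mesh vertex to a proxy face: (face index, barycentric coords, normal offset)."""
    centroids = proxy_v[proxy_f].mean(axis=1)
    anchors = []
    for p in mesh_v:
        fi = int(np.argmin(np.linalg.norm(centroids - p, axis=1)))  # toy nearest-face query
        tri = proxy_v[proxy_f[fi]]
        bary = barycentric(p, tri)
        n = np.cross(tri[1] - tri[0], tri[2] - tri[0])
        n /= np.linalg.norm(n)
        offset = float((p - bary @ tri) @ n)  # signed distance along the face normal
        anchors.append((fi, bary, offset))
    return anchors

def apply_deformation(anchors, deformed_proxy_v, proxy_f):
    """Reconstruct mesh vertices from the deformed proxy using the stored anchors."""
    out = []
    for fi, bary, offset in anchors:
        tri = deformed_proxy_v[proxy_f[fi]]
        n = np.cross(tri[1] - tri[0], tri[2] - tri[0])
        n /= np.linalg.norm(n)
        out.append(bary @ tri + offset * n)
    return np.array(out)
```

Because each vertex is expressed relative to its anchor face, any posing of the proxy (e.g. from optimized SMPL parameters) carries the detailed mesh along with it while preserving local surface offsets.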

Comparisons with MDM

More Results

"protecting his eyes from the sun"

"pushing a trolley"

"practicing ice skating"


See more results here! (Videos may take a few seconds to load on devices with slow internet connections or limited memory.)

Related Links

  • MotionDreamer: A method for animating arbitrary 3D meshes from text prompts using motion priors from VDMs.
  • MDM: A diffusion-based generative model for human motion enabling various conditioning modes.
  • SMPL-X: A unified 3D human model that captures full-body pose, hand articulation, and facial expressions from images.
  • WHAM: A method for reconstructing 3D human motion in global coordinates from video.
  • 3D meshes sourced from Mixamo.

BibTeX


      @misc{millán2025animatinguncapturedhumanoidmesh,
        title={Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models}, 
        author={Marc Benedí San Millán and Angela Dai and Matthias Nießner},
        year={2025},
        eprint={2503.15996},
        archivePrefix={arXiv},
        primaryClass={cs.GR},
        url={https://arxiv.org/abs/2503.15996}, 
      }