Animating the Uncaptured

Abstract

Animation of humanoid characters is essential in various graphics applications, but requires significant time and cost to create realistic animations.

We propose an approach to synthesize 4D animated sequences of input static 3D humanoid meshes, leveraging strong generalized motion priors from generative video models -- as such video models contain powerful motion information covering a wide variety of human motions.

From an input static 3D humanoid mesh and a text prompt describing the desired animation, we synthesize a corresponding video conditioned on a rendered image of the 3D mesh.

We then employ an underlying SMPL representation to animate the corresponding 3D mesh according to the video-generated motion, based on our motion optimization. This enables a cost-effective and accessible solution to enable the synthesis of diverse and realistic 4D animations.

Method Overview

Sizes of model trees

We present a novel approach for text-driven animation of 3D humanoid meshes.

Given a static 3D humanoid mesh and a text prompt, we generate a motion video using a Video Diffusion Model (VDM) conditioned on the rendering of the mesh from a single view.
To animate the input mesh, we use the SMPL body model as a deformation proxy. This involves: 1) Fitting the SMPL model to the input mesh, and 2) Anchoring mesh vertices with SMPL faces.
Finally, we optimize for the neural parameters to track the motion from the video using body landmarks, silhouette mask and DINOv2 dense features.

Comparisons with MDM