Alibaba released Wan 2.2 Animate: a video AI for single-photo animation and face swapping

Alibaba has introduced Wan 2.2 Animate, an open-source video model that animates a character from a single photo and replaces faces in existing videos. It belongs to the Wan 2.2 family of models, which also includes text-to-video (T2V), speech-to-video (S2V), and other variants. The model is available on Hugging Face and GitHub under the Apache-2.0 license, and it can also be tried online.

The Animate-14B architecture is a Mixture of Experts (MoE) with two experts: one handles the high-noise denoising stages, the other the low-noise stages. The model holds 27 billion parameters in total, but only 14 billion are active at each step, which saves compute. The gains in cinematic aesthetics and complex-motion handling come from an expanded training set: 65.6% more images and 83.2% more videos than the previous version.
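The two-expert idea can be illustrated with a minimal sketch. The threshold value and the expert stubs below are hypothetical; in the real model the switch happens inside the diffusion sampler, and each expert is itself a 14B-parameter network:

```python
# Sketch of a two-expert MoE routed by noise level (illustrative only).
# Each "expert" stands in for a 14B-parameter denoiser; only one runs
# per step, even though both together hold ~27B parameters.

def high_noise_expert(latents: list[float]) -> list[float]:
    # Stub for the expert trained on early, high-noise steps.
    return [x * 0.5 for x in latents]

def low_noise_expert(latents: list[float]) -> list[float]:
    # Stub for the expert trained on late, low-noise steps.
    return [x * 0.9 for x in latents]

def denoise_step(latents: list[float], noise_level: float,
                 boundary: float = 0.5) -> list[float]:
    # Route the step to exactly one expert based on the noise level.
    # The boundary value here is an arbitrary placeholder.
    expert = high_noise_expert if noise_level >= boundary else low_noise_expert
    return expert(latents)

# Walk the noise schedule down: early steps hit the high-noise expert,
# late steps the low-noise one.
latents = [1.0, -1.0]
for t in [1.0, 0.75, 0.5, 0.25, 0.0]:
    latents = denoise_step(latents, noise_level=t)
```

The point of the split is that each expert specializes in a different phase of denoising (rough layout early, fine detail late), while per-step cost stays that of a single 14B model.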

The workflow is simple: you provide a reference photo of the character and a driving video with the desired motion. The system extracts poses and masks, and then offers two modes. In Animation mode, a new clip is assembled from the photo: the model transfers the motion and facial expressions from the driving video onto the character, producing a video of your character making the same gestures from the same angles. In Replacement mode, the original video stays as-is (scene, background, camera, timing), but the model swaps the person in it for the character from the photo; you can limit the swap to the face or replace the full body, while poses and lip sync are preserved.
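The difference between the two modes comes down to what is kept from each input. A small sketch of that split (all names and fields here are hypothetical, not the actual Wan 2.2 Animate API):

```python
# Illustrative sketch of what each Wan 2.2 Animate mode keeps and
# generates. Names are hypothetical; the real pipeline is the model's
# own code on GitHub / Hugging Face.

from dataclasses import dataclass

@dataclass
class AnimateRequest:
    reference_photo: str                # character to animate or swap in
    driving_video: str                  # source of poses, expressions, timing
    mode: str                           # "animation" or "replacement"
    replace_region: str = "full_body"   # or "face" (replacement mode only)

def plan_generation(req: AnimateRequest) -> dict:
    """Summarize which parts come from which input in each mode."""
    if req.mode == "animation":
        # A new clip is assembled around the reference character;
        # only the motion is taken from the driving video.
        return {
            "background": "generated",
            "character": req.reference_photo,
            "motion_source": req.driving_video,
        }
    if req.mode == "replacement":
        # Scene, camera, and timing stay from the original video;
        # only the person (face or full body) is swapped.
        return {
            "background": req.driving_video,
            "character": req.reference_photo,
            "region": req.replace_region,
            "motion_source": req.driving_video,
        }
    raise ValueError(f"unknown mode: {req.mode}")
```

In both modes the driving video supplies the motion; the modes differ only in whether the background is generated anew or inherited from the source clip.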

For local use, the full Animate-14B requires about 80 GB of VRAM, but it can also run on 24 GB (e.g., an RTX 4090) with offloading (moving part of the weights to system RAM) or FP8 quantization. The lighter TI2V-5B variant runs on a 4090 and produces 720p video at 24 fps.
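Back-of-the-envelope arithmetic shows why FP8 and offloading matter for a 24 GB card. Weights alone for the 14 billion active parameters already exceed 24 GB at 16-bit precision (activations, caches, and framework overhead add more on top, which is how the full setup reaches ~80 GB):

```python
# Rough VRAM estimate for the weights of the 14B active parameters.
# This counts weights only; activations and overhead are extra.

PARAMS = 14e9  # 14 billion active parameters per step

def weight_gb(bytes_per_param: float) -> float:
    """Weight memory in GB for a given bytes-per-parameter precision."""
    return PARAMS * bytes_per_param / 1e9

bf16_gb = weight_gb(2)  # 16-bit weights -> 28.0 GB, over a 24 GB card
fp8_gb = weight_gb(1)   # 8-bit (FP8) weights -> 14.0 GB, fits with headroom
```

At 16 bits the weights alone do not fit in 24 GB, so part of them must be offloaded to system RAM; FP8 halves the weight footprint and leaves room for activations on-card.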