LongCat-Video: Meituan's Open-Source 13.6B Video Generation Model — Long Video Is the Real Battleground

Why Long Video Generation Is Hard

You’ve probably tried several video generation tools by now — Kling, Runway, Pika, Wan2.1. Five-second clips all look decent. But try generating 30 seconds or even a minute of video, and the problems start.

Colors begin drifting after a few seconds — that blue sky gradually shifts to green. Facial details on people dissolve frame by frame. Objects in motion suddenly jump, breaking physical consistency. Most commonly, quality drops off a cliff in the later seconds, as if a different model took over.

This isn’t a bug in any one model. It’s a structural problem with the current video generation paradigm.

The mainstream approach is “concatenative long video”: generate the first 5-second clip, use the last few frames as conditioning input for the next clip, and repeat. The problem is that each clip is independently inferred. The model was never trained to “continue” video, so information breaks at the seams. It’s like having ten people write consecutive chapters of a novel without reading each other’s work — the joins will inevitably crack.
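
The loop described above can be sketched in a few lines. This is a toy illustration, not any particular tool's pipeline: `fake_generate_clip` is a hypothetical stand-in for a model call, and frames are plain integers rather than images. The point is structural: the only information that crosses each seam is the short `cond` slice.

```python
from typing import Callable, List, Optional

Frame = int          # stand-in for an image tensor; here just a frame index
Clip = List[Frame]


def fake_generate_clip(prompt: str, cond: Optional[Clip], num_frames: int) -> Clip:
    """Hypothetical model call: returns consecutive frame ids after `cond`."""
    start = 0 if not cond else cond[-1] + 1
    return list(range(start, start + num_frames))


def concatenative_long_video(
    prompt: str,
    num_clips: int,
    frames_per_clip: int,
    overlap: int,
    generate: Callable[[str, Optional[Clip], int], Clip] = fake_generate_clip,
) -> Clip:
    """Chain independently generated clips, conditioning on the last frames."""
    if overlap < 1:
        raise ValueError("overlap must be at least 1 frame")
    video: Clip = []
    cond: Optional[Clip] = None
    for _ in range(num_clips):
        clip = generate(prompt, cond, frames_per_clip)  # independent inference call
        video.extend(clip)
        cond = clip[-overlap:]  # everything else about earlier clips is lost here
    return video


video = concatenative_long_video("a cat on a beach", num_clips=3,
                                 frames_per_clip=10, overlap=2)
print(len(video), "frames,", 3 - 1, "seams")
```

With N clips there are N - 1 seams, and whatever the conditioning frames fail to carry (color statistics, identity details, motion state) has to be re-guessed at each one.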

LongCat-Video’s core idea: make Video-Continuation a native pretraining task, so the model learns how to continue video during training, rather than hacking it together at inference time.

Project Overview

LongCat-Video is an open-source foundational video generation model from Meituan’s LongCat team — 13.6B parameters, built on the Diffusion Transformer (DiT) architecture, MIT licensed.

It unifies three tasks in a single model: Text-to-Video, Image-to-Video, and Video-Continuation. Video-Continuation is a native pretraining task, not an inference-time workaround.

All model weights are open-sourced, including the base model and two Avatar variants for audio-driven human video generation.

Core Technical Analysis

Unified Architecture: One Model, Three Tasks

The most notable design choice in LongCat-Video is task unification.

Many teams train separate models — one for T2V, one for I2V, one for continuation. At inference, you load different weights depending on the task. This is engineering-simple, but each model’s training data only covers a single task distribution, capping model capability.

LongCat-Video unifies all three tasks within a single DiT framework. The input conditions differentiate task types: text-only for T2V, text + first-frame image for I2V, text + preceding video for Video-Continuation. The model shares all parameters across tasks, and training signals from different tasks reinforce each other.
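
One way to picture this conditioning scheme is a dispatcher that infers the task from whatever visual context accompanies the prompt. The sketch below is an assumption for illustration only: `GenerationRequest` and `infer_task` are hypothetical names, not part of the LongCat-Video codebase, and frames are represented by placeholder strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GenerationRequest:
    """Hypothetical request object: a prompt plus optional conditioning frames."""
    prompt: str
    cond_frames: List[str] = field(default_factory=list)  # placeholder frame handles


def infer_task(req: GenerationRequest) -> str:
    """Map the supplied conditions to one of the three tasks named above."""
    n = len(req.cond_frames)
    if n == 0:
        return "text_to_video"        # text only
    if n == 1:
        return "image_to_video"       # text + first-frame image
    return "video_continuation"       # text + preceding video


print(infer_task(GenerationRequest("a dog running")))
print(infer_task(GenerationRequest("a dog running", ["frame_0"])))
print(infer_task(GenerationRequest("a dog running", ["frame_0", "frame_1", "frame_2"])))
```

Because the task is determined by the inputs rather than by which weights are loaded, the same network serves all three cases.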

The benefit: the model sees more diverse data distributions. T2V training teaches “text description to visual scene mapping.” I2V teaches “first-frame consistency.” Video-Continuation teaches “temporal and style continuity from prior context.” Shared parameters allow knowledge transfer across tasks.

The cost is more complex training: loss weights must be balanced across tasks so that no single task dominates the gradients. Judging by the evaluation results, the T2V scores show little sign of a multi-task penalty, although the I2V scores trail the leaders.

Coarse-to-Fine Generation Strategy

A direct challenge in long video generation is compute. A 1-minute video at 720p, 30fps contains 1800 frames. Denoising frame-by-frame is prohibitively expensive.

LongCat-Video employs a “spatiotemporal coarse-to-fine” strategy:

  • Temporal axis: First generate a low-framerate video skeleton (e.g., only a few keyframes per second), then interpolate to fill in the intermediate frames
  • Spatial axis: First generate a low-resolution version, then upsample to 720p

The core insight: video information density is uneven. Keyframes carry the core motion and semantics; intermediate frames are mostly smooth transitions. Low-resolution versions carry scene structure and layout; details can be added later. Getting the big structure right first, then adding details level by level, is more efficient than full-resolution generation in one pass.
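
A rough back-of-the-envelope calculation shows why this pays off. The coarse settings below (854x480 at 15 fps) are assumptions chosen for illustration and are not stated above; the sketch also assumes token count scales linearly with pixel count and that full self-attention cost grows with the square of the token count.

```python
def video_volume(width: int, height: int, fps: int, seconds: int) -> int:
    """Total pixel count of a clip: width x height x number of frames."""
    return width * height * fps * seconds


SECONDS = 60

# Target output from the text above: 720p at 30 fps for one minute.
full = video_volume(1280, 720, fps=30, seconds=SECONDS)

# Assumed coarse pass (illustrative values, not from the source).
coarse = video_volume(854, 480, fps=15, seconds=SECONDS)

frames_full = 30 * SECONDS              # 1800 frames, as stated above
volume_ratio = full / coarse            # how much more data the full pass touches
attention_ratio = volume_ratio ** 2     # quadratic cost of dense self-attention

print(f"frames at full rate: {frames_full}")
print(f"pixel volume ratio (full / coarse): {volume_ratio:.2f}")
print(f"dense attention cost ratio: {attention_ratio:.1f}")
```

Under these assumptions the expensive structural decisions are made on a small fraction of the data, and the refinement stage only has to add detail to a layout that is already fixed.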

Combined with Block Sparse Attention, the model skips unimportant attention blocks at high resolution, further reducing computation. In practice, 720p 30fps video generates in minutes.
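
To make the idea of skipping attention blocks concrete, here is a generic top-k block selection sketch in pure Python. It is not LongCat-Video's implementation: the selection criterion (dot product of mean-pooled query and key blocks) and the names `block_means` and `select_key_blocks` are assumptions for illustration, since the text above does not say how blocks are scored.

```python
from typing import List

Vector = List[float]


def block_means(rows: List[Vector], block_size: int) -> List[Vector]:
    """Average each consecutive group of `block_size` rows into one vector."""
    means = []
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        dim = len(block[0])
        means.append([sum(r[d] for r in block) / len(block) for d in range(dim)])
    return means


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_key_blocks(queries: List[Vector], keys: List[Vector],
                      block_size: int, top_k: int) -> List[List[int]]:
    """For each query block, return sorted indices of the top_k key blocks."""
    q_blocks = block_means(queries, block_size)
    k_blocks = block_means(keys, block_size)
    selected = []
    for q in q_blocks:
        scores = [dot(q, k) for k in k_blocks]
        ranked = sorted(range(len(k_blocks)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(ranked[:top_k]))  # attention runs only on these blocks
    return selected


queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
print(select_key_blocks(queries, keys, block_size=2, top_k=1))
```

With `top_k` key blocks kept out of `B` total, each query block computes roughly `top_k / B` of the dense attention work, which is where the savings at high resolution come from.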

Multi-Reward RLHF: GRPO

Video generation models trained with only pretraining and SFT tend to produce inconsistent quality — misaligned text understanding, unnatural motion, rough visual fidelity.

LongCat-Video uses Multi-reward GRPO (Group Relative Policy Optimization) for reinforcement learning alignment. The core idea is to optimize multiple reward signals simultaneously: text alignment, visual quality, motion quality, and so on. GRPO’s advantage over traditional PPO is that it doesn’t require a separate value network; the baseline comes from comparing samples within a group, which saves memory and removes one source of training instability.
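
The group-relative part of GRPO is easy to show in isolation. In the sketch below, several samples generated for the same prompt are scored, and each sample's advantage is its reward normalized by the group mean and standard deviation, so no learned value network is needed. How LongCat-Video actually combines its rewards is not described here; the weighted sum and the `REWARD_WEIGHTS` values are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical weights; the real combination scheme is not given in the text.
REWARD_WEIGHTS: Dict[str, float] = {
    "text_alignment": 1.0,
    "visual_quality": 1.0,
    "motion_quality": 1.0,
}


def combined_reward(rewards: Dict[str, float]) -> float:
    """Collapse several reward signals into one scalar via a weighted sum."""
    return sum(REWARD_WEIGHTS[name] * rewards[name] for name in REWARD_WEIGHTS)


def group_relative_advantages(group: List[Dict[str, float]],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each sample's reward against its own group (the GRPO baseline)."""
    totals = [combined_reward(r) for r in group]
    mu = mean(totals)
    sigma = pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]


# Three videos sampled for the same prompt, scored by three reward models.
group = [
    {"text_alignment": 3.0, "visual_quality": 3.0, "motion_quality": 3.0},
    {"text_alignment": 4.0, "visual_quality": 3.5, "motion_quality": 3.5},
    {"text_alignment": 2.0, "visual_quality": 2.5, "motion_quality": 2.5},
]
print(group_relative_advantages(group))
```

Samples that beat their group get a positive advantage and are reinforced; samples below the group mean are pushed down.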

RLHF reportedly brings substantial improvements. LongCat-Video achieves performance comparable to closed-source solutions on both internal and public benchmarks. As a 13.6B dense model, its overall quality is close to that of a 28B MoE model (Wan 2.2-A14B).

Video-Continuation: Native Pretraining for Long Video

This is LongCat-Video’s biggest differentiator from other open-source video models.

The problem with concatenative long video is that each segment is independently inferred, lacking global temporal consistency. LongCat-Video makes Video-Continuation a pretraining task — the model learns to “generate subsequent content given preceding video” during training. This means:

  • The model natively understands temporal correlation between segments, no inference-time patching needed
  • Color consistency and style consistency are learned during training rather than patched in by post-processing
  • Long video generation doesn’t suffer quality degradation — later segments match the quality of the first

The model supports generating minute-length videos without color drift or quality degradation.

LongCat-Video-Avatar: Audio-Driven Digital Humans

Beyond the base video generation model, the project also open-sources two Avatar variants for audio-driven human video generation.

Avatar 1.0 uses a Wav2Vec2 audio encoder, supporting single-person and multi-person audio input. It handles Audio-Text-to-Video, Audio-Image-to-Video, and continuation-based long video generation.

Avatar 1.5 is the latest upgrade with these key improvements:

  • Audio encoder upgraded from Wav2Vec2 to Whisper-large-v3 for significantly better lip synchronization
  • Step distillation compresses inference to 8 steps for faster generation
  • INT8 quantization support reduces VRAM usage
  • Better generalization — supports stylized domains including anime, animals, and complex real-world conditions
  • Supports both single-stream and multi-stream audio inputs

Evaluation Results

Text-to-Video

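| Metric | Veo3 | PixVerse-V5 | Wan 2.2-T2V-A14B | LongCat-Video |
| --- | --- | --- | --- | --- |
| Accessibility | Closed | Closed | Open Source | Open Source |
| Architecture | - | - | MoE | Dense |
| Total Params | - | - | 28B | 13.6B |
| Activated Params | - | - | 14B | 13.6B |
| Text-Alignment | 3.99 | 3.81 | 3.70 | 3.76 |
| Visual Quality | 3.23 | 3.13 | 3.26 | 3.25 |
| Motion Quality | 3.86 | 3.81 | 3.78 | 3.74 |
| Overall Quality | 3.48 | 3.36 | 3.35 | 3.38 |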

With 13.6B dense parameters, LongCat-Video reaches an overall quality of 3.38, slightly ahead of the 28B MoE Wan 2.2 (3.35) and PixVerse-V5 (3.36), though behind Veo3 (3.48). Its visual quality of 3.25 is essentially level with Wan 2.2 (3.26) and slightly above Veo3 (3.23).

Image-to-Video

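| Metric | Seedance 1.0 | Hailuo-02 | Wan 2.2-I2V-A14B | LongCat-Video |
| --- | --- | --- | --- | --- |
| Image-Alignment | 4.12 | 4.18 | 4.18 | 4.04 |
| Text-Alignment | 3.70 | 3.85 | 3.33 | 3.49 |
| Visual Quality | 3.22 | 3.18 | 3.23 | 3.27 |
| Motion Quality | 3.77 | 3.80 | 3.79 | 3.59 |
| Overall Quality | 3.35 | 3.27 | 3.26 | 3.17 |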

On I2V, LongCat-Video has the highest visual quality (3.27) but trails the leaders in motion quality (3.59 vs. 3.80 for Hailuo-02) and image alignment (4.04 vs. 4.18), and its overall score (3.17) is the lowest of the four. This may be a trade-off of multi-task unification: capacity spent on T2V and continuation could come at some cost to I2V motion detail.
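
For readers who want to check these gaps themselves, the short script below recomputes them from the I2V scores listed in this section. The helper name `gap_to_leader` is just for illustration; when two models tie for the lead, the first one in table order is reported.

```python
from typing import Dict, Tuple

# Scores copied from the Image-to-Video table in this section.
I2V_SCORES: Dict[str, Dict[str, float]] = {
    "Image-Alignment": {"Seedance 1.0": 4.12, "Hailuo-02": 4.18,
                        "Wan 2.2-I2V-A14B": 4.18, "LongCat-Video": 4.04},
    "Text-Alignment": {"Seedance 1.0": 3.70, "Hailuo-02": 3.85,
                       "Wan 2.2-I2V-A14B": 3.33, "LongCat-Video": 3.49},
    "Visual Quality": {"Seedance 1.0": 3.22, "Hailuo-02": 3.18,
                       "Wan 2.2-I2V-A14B": 3.23, "LongCat-Video": 3.27},
    "Motion Quality": {"Seedance 1.0": 3.77, "Hailuo-02": 3.80,
                       "Wan 2.2-I2V-A14B": 3.79, "LongCat-Video": 3.59},
    "Overall Quality": {"Seedance 1.0": 3.35, "Hailuo-02": 3.27,
                        "Wan 2.2-I2V-A14B": 3.26, "LongCat-Video": 3.17},
}


def gap_to_leader(scores: Dict[str, Dict[str, float]],
                  model: str) -> Dict[str, Tuple[str, float]]:
    """For each metric, return (leading model, leader score minus `model` score)."""
    result = {}
    for metric, row in scores.items():
        leader = max(row, key=row.get)
        result[metric] = (leader, round(row[leader] - row[model], 2))
    return result


for metric, (leader, gap) in gap_to_leader(I2V_SCORES, "LongCat-Video").items():
    print(f"{metric:16s} leader={leader:18s} gap={gap:.2f}")
```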

Quick Start

Installation

git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video

conda create -n longcat-video python=3.10
conda activate longcat-video

pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install ninja psutil packaging flash_attn==2.7.4.post1
pip install -r requirements.txt

For Avatar support, additionally install:

conda install -c conda-forge librosa ffmpeg
pip install -r requirements_avatar.txt

Download Model Weights

pip install "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video

Text-to-Video

Single GPU:

torchrun run_demo_text_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

Multi-GPU:

torchrun --nproc_per_node=2 run_demo_text_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
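
If you launch these runs from a script, a small helper can assemble the same command lines. This is a convenience sketch, not part of the repository: `build_t2v_command` is a hypothetical name, it only uses the flags shown above, and it assumes `--context_parallel_size` should equal the GPU count, as in the 2-GPU example. Whether other combinations are valid is not covered here.

```python
import shlex
from typing import List


def build_t2v_command(checkpoint_dir: str, num_gpus: int = 1,
                      compile_model: bool = True) -> List[str]:
    """Assemble the torchrun argument list for the text-to-video demo."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    cmd = ["torchrun"]
    if num_gpus > 1:
        cmd.append(f"--nproc_per_node={num_gpus}")
    cmd.append("run_demo_text_to_video.py")
    if num_gpus > 1:
        # Assumption: context parallel size matches the number of processes.
        cmd.append(f"--context_parallel_size={num_gpus}")
    cmd.append(f"--checkpoint_dir={checkpoint_dir}")
    if compile_model:
        cmd.append("--enable_compile")
    return cmd


cmd = build_t2v_command("./weights/LongCat-Video", num_gpus=2)
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```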

Long Video Generation

torchrun run_demo_long_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

Interactive Video Generation

torchrun run_demo_interactive_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

Avatar 1.5 Audio-Driven Generation

torchrun --nproc_per_node=2 run_demo_avatar_single_audio_to_video.py \
  --context_parallel_size=2 \
  --checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 \
  --stage_1=ai2v \
  --input_json=assets/avatar/single_example_1.json \
  --use_distill --model_type avatar-v1.5 --use_int8

Avatar 1.5 requires the --use_distill flag for step distillation. INT8 quantization is optional and reduces VRAM usage.

Web UI

streamlit run ./run_streamlit.py --server.fileWatcherType none --server.headless=false

When to Use It

  • Long-form video production: Ad clips, product demos, short films — anything needing more than 10 seconds of temporally consistent video. Native continuation is LongCat-Video’s core advantage
  • Digital human videos: The Avatar series supports audio-driven generation, suitable for virtual presenters, educational videos, customer service dialogues
  • Video continuation and extension: When you have existing footage that needs to be extended in duration, LongCat-Video can seamlessly continue it
  • Interactive video generation: Supports real-time interactive generation workflows, suitable for creative exploration and prototyping

Limitations and Caveats

  • I2V motion quality gap: From evaluation data, I2V motion quality (3.59) is notably lower than top competitors (Hailuo-02 at 3.80). If your primary use case is image-to-video with strict motion naturalness requirements, benchmark it yourself first
  • High VRAM requirements: The 13.6B-parameter model needs multi-GPU or large-VRAM GPUs for inference; at 16-bit precision the weights alone occupy roughly 27 GB (13.6B parameters x 2 bytes). INT8 quantization is only supported on Avatar 1.5, not the base model
  • Avatar 1.5 Audio CFG tuning: Lip sync quality is sensitive to the Audio CFG parameter. Recommended range is 3-5; you’ll need to tune per audio clip
  • Avatar repeated action issues: Mitigable by adjusting --ref_img_index (0-24 for consistency, 30 to reduce repetition) and --mask_frame_range (larger reduces repetition but may introduce artifacts)
  • Training code not open-sourced: Only inference code and weights are available; the training pipeline is not public, so fine-tuning means writing your own training code for now
  • Early-stage community: The project was open-sourced in October 2025; community contributions and ecosystem are still developing

Comparison with Alternatives

| Dimension | LongCat-Video | Wan 2.2 | CogVideoX |
| --- | --- | --- | --- |
| Parameters | 13.6B Dense | 28B MoE (14B activated) | 5B |
| Native long video | Yes | No | No |
| Unified T2V/I2V/Continuation | Yes | No | No |
| Avatar digital human | Yes (1.0 + 1.5) | No | No |
| License | MIT | Apache 2.0 | Apache 2.0 |
| T2V overall quality | 3.38 | 3.35 | - |

LongCat-Video’s differentiator isn’t crushing any single metric. It’s long video and task unification. If you need 5-second high-quality clips, Wan 2.2 may be more straightforward. If your use case demands minute-length video, continuation capability, or audio-driven digital humans, LongCat-Video is currently the most pragmatic open-source choice.

Conclusion

The next battleground for video generation models isn’t 5-second clip quality — that problem is largely solved. The real challenge is long video: how to make AI generate minutes of temporally consistent, quality-stable video.

LongCat-Video offers an answer through native Video-Continuation pretraining. It’s not the best model at any single task, but among the open-source models compared here it’s the only one that treats long video as a first-class citizen.

If your video generation needs go beyond 5 seconds, it’s worth trying.

Repo: https://github.com/meituan-longcat/LongCat-Video
