FramePack

Packing Input Frame Context for Video Generation

With an innovative next-frame prediction neural network architecture, FramePack continuously generates videos by compressing input frame context to a fixed length, making the generation workload independent of video length.


Video Diffusion That Feels Like Image Diffusion

FramePack employs a next-frame prediction neural network structure to generate videos continuously by compressing input context to a fixed length, enabling length-invariant generation.

  • Processes a very large number of frames with 13B models, even on laptop GPUs
  • Requires only 6 GB of GPU memory
  • Can be trained with a much larger batch size, similar to image diffusion training
  • Generates 1-minute, 30 fps videos (1,800 frames)
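
To make the fixed-length packing idea above concrete, here is a minimal, illustrative PyTorch sketch. It is not FramePack's actual implementation; it only shows how pooling older frames more and more aggressively keeps the total number of context tokens bounded no matter how long the video gets. All function names and numbers here are our own.

```python
import torch
import torch.nn.functional as F

def pack_frame_context(frame_latents, tokens_per_side=8, max_levels=4):
    """Toy sketch: compress past frames into a bounded-length token context.

    Recent frames keep the most spatial detail; older frames are pooled
    down further, and everything beyond `max_levels` collapses into a
    single averaged "memory" entry, so the context length stops growing
    with video length.
    """
    recent_first = list(reversed(frame_latents))     # frame_latents[-1] is newest
    packed = []
    for age, latent in enumerate(recent_first[:max_levels]):   # latent: (C, H, W)
        side = max(tokens_per_side >> age, 1)        # 8, 4, 2, 1 tokens per side
        pooled = F.adaptive_avg_pool2d(latent.unsqueeze(0), side).squeeze(0)
        packed.append(pooled.flatten(1).T)           # (side*side, C) tokens
    if len(recent_first) > max_levels:
        tail = torch.stack(recent_first[max_levels:]).mean(dim=0)
        pooled = F.adaptive_avg_pool2d(tail.unsqueeze(0), 1).squeeze(0)
        packed.append(pooled.flatten(1).T)           # one summary token
    return torch.cat(packed, dim=0)                  # at most 64+16+4+1+1 = 86 tokens

# The packed context has the same shape for a 10-frame and a 1000-frame history:
frames = [torch.randn(16, 64, 64) for _ in range(1000)]
print(pack_frame_context(frames).shape)              # torch.Size([86, 16])
```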

Key Features

Minimal Memory Requirements

Generate 60-second, 30fps (1800 frames) videos with a 13B model using only 6GB VRAM. Laptop GPUs can handle it easily.
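
Running a 13B model in 6 GB of VRAM implies that the weights are streamed through the GPU rather than kept resident all at once. FramePack ships its own memory manager; the snippet below is only a generic sketch of that block-swapping idea under our own names, not FramePack's code.

```python
import torch
from torch import nn

class BlockOffloader:
    """Generic sketch: keep all weights in system RAM and move one block at a
    time onto the GPU, so peak VRAM is roughly one block plus activations
    instead of the whole model. Slower per step, but fits far smaller GPUs.
    """

    def __init__(self, blocks, device="cuda"):
        self.blocks = [block.to("cpu") for block in blocks]
        self.device = device

    @torch.no_grad()
    def forward(self, x):
        x = x.to(self.device)
        for block in self.blocks:
            block.to(self.device)   # load just this block into VRAM
            x = block(x)
            block.to("cpu")         # release VRAM before the next block
        return x

# Example: eight large blocks processed sequentially on a small GPU
if torch.cuda.is_available():
    blocks = [nn.Sequential(nn.Linear(4096, 4096), nn.GELU()) for _ in range(8)]
    print(BlockOffloader(blocks).forward(torch.randn(1, 4096)).shape)
```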

Instant Visual Feedback

As a next-frame prediction model, you'll directly see the generated frames, getting plenty of visual feedback throughout the entire generation process.
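
Because frames come out of a next-frame prediction model section by section, a preview can be refreshed long before the full clip is finished. The sketch below shows that pattern in general terms: `generate_next_section` is a hypothetical stand-in for the model call, and writing MP4 with imageio assumes the imageio-ffmpeg backend is installed.

```python
import numpy as np
import imageio.v2 as imageio

def generate_next_section(history):
    """Hypothetical stand-in for the model: returns the next 30 frames."""
    return [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8) for _ in range(30)]

frames = []
for section in range(4):
    frames.extend(generate_next_section(frames))
    # Re-write the preview after every section so progress is visible
    # long before the final frame is generated.
    with imageio.get_writer("preview.mp4", fps=30) as writer:
        for frame in frames:
            writer.append_data(frame)
    print(f"after section {section + 1}: preview.mp4 has {len(frames)} frames")
```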

Compressed Input Context

Compresses input contexts to a constant length, making generation workload invariant to video length and supporting ultra-long video generation.

Standalone Desktop Software

Ships as a feature-complete desktop application with a minimal, standalone high-quality sampling system and built-in memory management.

Amazing Demos

  • Anime (anime.mp4)
  • Girl2 (girl2.mp4)
  • Boy (boy.mp4)
  • Boy2 (boy2.mp4)
  • Girl3 (girl3.mp4)
  • Girl4 (girl4.mp4)
  • Foxpink (foxpink.mp4)
  • Girlflower (girlflower.mp4)
  • Girl (girl.mp4)

How It Works

  1. Installation & Setup

    Clone FramePack from GitHub and install all dependencies in your environment.

  2. Define Your Initial Frame

    Upload an image or generate one from a text prompt to start your video sequence.

  3. Create Motion Prompts

    Describe the desired movement and action in natural language to guide the video generation, for example: "The girl dances gracefully, with clear movements, full of charm."

  4. Generate & Review

    FramePack generates your video frame by frame with impressive temporal consistency. Download and share your results.

No credit card required. Start creating amazing videos today.

Get Started

### Manual Installation on Windows

1. Open Command Prompt in the folder where you want FramePack installed and clone the repository
   git clone https://github.com/lllyasviel/FramePack.git
   cd FramePack

2. Create and activate a Python virtual environment (Python 3.10 recommended)
   python -m venv venv
   venv\Scripts\activate.bat

3. Upgrade pip and install dependencies
   python -m pip install --upgrade pip
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
   pip install -r requirements.txt

4. Install Triton and Sage Attention
   pip install triton-windows
   pip install https://github.com/woct0rdho/SageAttention/releases/download/v2.1.1-windows/sageattention-2.1.1+cu126torch2.6.0-cp312-cp312-win_amd64.whl
   Note: this example wheel targets CUDA 12.6, PyTorch 2.6, and Python 3.12 (cp312); adjust the URL to match your CUDA and Python versions.

5. Optional: Install Flash Attention
   pip install packaging ninja
   set MAX_JOBS=4
   pip install flash-attn --no-build-isolation

6. Launch the Gradio UI
   python demo_gradio.py
   Then open http://localhost:7860 in your browser.

### Manual Installation on Linux

1. Clone the repository (as in step 1 above) and create an independent Python 3.10 environment (recommended)

2. Install dependencies
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
   pip install -r requirements.txt

3. Start the GUI
   python demo_gradio.py
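
After either install, it is worth confirming that the CUDA build of PyTorch is the one that got installed, since a CPU-only build is a common cause of extremely slow generation. A quick check, run inside the activated environment:

```python
import torch

print("torch", torch.__version__, "| CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
```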

Research Paper

Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

FramePack is a revolutionary video generation technology that compresses input contexts to a constant length, making the generation workload invariant to video length. Learn about our methods, architecture, and experimental results in detail.