Video generation models as world simulators

OpenAI Blog

OpenAI released Sora, a text-conditional diffusion transformer that generates up to one minute of high-fidelity video from a text prompt. OpenAI presents scaling such models as a promising path toward general-purpose simulators of the physical world.

Categories: Model Releases, Research

Excerpt

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
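
To illustrate why operating on spacetime patches can accommodate variable durations, resolutions and aspect ratios, here is a minimal Python sketch that counts how many patches (tokens) a latent grid would yield. It assumes non-overlapping patches that tile the latent grid exactly. The function name count_spacetime_patches and every size used are hypothetical; the excerpt does not state the model's latent dimensions or patch sizes.

```python
from math import prod


def count_spacetime_patches(latent_shape, patch_shape):
    """Return the number of non-overlapping spacetime patches (tokens).

    latent_shape: (frames, height, width) of the latent grid.
    patch_shape:  (frames, height, width) of a single patch.
    All values are illustrative assumptions, not the model's real settings.
    """
    if len(latent_shape) != 3 or len(patch_shape) != 3:
        raise ValueError("expected (frames, height, width) triples")
    for size, patch in zip(latent_shape, patch_shape):
        # Require exact tiling so that every patch is full-sized.
        if patch <= 0 or size % patch != 0:
            raise ValueError("each latent dimension must be a multiple of a positive patch size")
    # Patches per axis, multiplied across time, height and width.
    return prod(size // patch for size, patch in zip(latent_shape, patch_shape))


# A short video latent and a single-frame image latent with made-up sizes.
video_tokens = count_spacetime_patches((16, 32, 32), (2, 2, 2))
image_tokens = count_spacetime_patches((1, 32, 48), (1, 2, 2))
print(video_tokens, image_tokens)
```

Under these assumptions, a video and an image differ only in how many tokens they produce, so a transformer that accepts variable-length sequences could in principle train on both.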