NVIDIA Cosmos is an open platform for building Physical AI with omnimodal world models, datasets, and tools, centered on Cosmos 3 for reasoning and generation across text, images, video, audio, and actions.
Cosmos is NVIDIA’s open platform for Physical AI, aimed at developers working on robots, autonomous vehicles, smart infrastructure, and related systems. In the README, the centerpiece is Cosmos 3, a model family that combines understanding and generation across several modalities, with separate runtime surfaces for reasoning and for generation.
The project addresses the need for models and tools that can understand physical environments, predict what happens next, and generate realistic multimodal outputs for tasks like simulation, planning, and robot training. The README frames this as unifying capabilities that are often handled by separate systems, such as vision-language models, video generators, world simulators, and world-action models.
At a high level, Cosmos 3 uses a unified Mixture-of-Transformers design with two modes. In Reasoner mode, it processes text and vision for next-token prediction to support understanding, grounding, planning, and forecasting; in Generator mode, it denoises multimodal tokens to produce images, video, sound, and action outputs. The README also says both modes share multimodal attention layers and a unified 3D rotary position embedding to represent spatial and temporal structure across modalities.
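To make the idea of a 3D rotary position embedding concrete, here is a minimal NumPy sketch. It is not Cosmos 3's implementation: the README only names the technique, so the equal three-way split of the feature dimension across time, height, and width, the frequency base of 10000, and the helper names `rope_1d` and `rope_3d` are all assumptions made for illustration.

```python
import numpy as np


def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding along one axis.

    x:   float array of shape (n, d) with d even
    pos: array of shape (n,) holding each token's coordinate on this axis
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) rotation frequencies
    ang = pos[:, None] * inv_freq[None, :]          # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    # Rotate each (even, odd) feature pair by its angle.
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out


def rope_3d(x, t, h, w):
    """Hypothetical 3D variant: one third of the features per axis (assumption)."""
    d = x.shape[-1]
    assert d % 6 == 0, "need an even-sized chunk for each of the three axes"
    c = d // 3
    return np.concatenate(
        [
            rope_1d(x[..., :c], t),           # temporal chunk
            rope_1d(x[..., c:2 * c], h),      # height chunk
            rope_1d(x[..., 2 * c:], w),       # width chunk
        ],
        axis=-1,
    )


# Example: a tiny video of 2 frames, each 2x3 patches, with 12-dim features.
T, H, W, D = 2, 2, 3, 12
tt, hh, ww = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
tokens = np.random.default_rng(0).standard_normal((T * H * W, D))
rotated = rope_3d(tokens, tt.ravel(), hh.ravel(), ww.ravel())
```

The property that makes this useful for attention is that the dot product between a rotated query and a rotated key depends only on their relative offset along each axis, not on absolute position, so the same spatial or temporal relationship is scored the same way anywhere in the clip.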
It is drawing attention because it is a large, newly presented model family for Physical AI with clear developer paths for both research and deployment. The README highlights several access options: Diffusers and Transformers for Python-first use, plus vLLM-Omni, vLLM, and NIM for serving. That range makes it attractive to builders who want one platform spanning understanding, generation, and deployment.
The README does not name direct competitors. Instead, it positions Cosmos 3 as combining roles typically split across vision-language models, video generators, world simulators, and world-action models. For integration, it points to Diffusers, Transformers, vLLM-Omni, vLLM, and NIM as different ways to use the models, not as competing model families.