www.ptreview.co.uk

02 Jun '26

Written on June 2, 2026. Modified on June 2, 2026.

Nvidia News

NVIDIA Launches Cosmos 3, the Open Frontier Foundation Model for Physical AI

This open omnimodel combines vision reasoning and action prediction to slash training cycles from months to days for robots and autonomous vehicles.

www.nvidia.com

NVIDIA Corporation has introduced Cosmos 3, a mixture-of-transformers architecture designed to integrate vision reasoning, world generation, and action prediction within a unified system. The framework targets physical AI applications across robotics, autonomous vehicles, and industrial vision systems, with the aim of accelerating training and simulation workflows.

Architectural Integration of Multimodal Processing
The architecture addresses generalization challenges in physical AI by pairing a reasoning transformer with an expert generation transformer. This two-part system first processes object interactions, spatio-temporal relationships, and motion vectors, and only then generates video or action trajectories. Because text, image, video, ambient sound, and action trajectories are all handled natively within a single system, the architecture avoids the data fragmentation typical of decoupled simulation stacks.
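The "reason first, then generate" ordering can be illustrated with a toy two-stage sketch. Everything here is hypothetical and for illustration only: the names `SceneReasoning`, `reason`, and `generate` are not part of any NVIDIA API, and simple linear extrapolation stands in for what would be learned transformer components.

```python
from __future__ import annotations

from dataclasses import dataclass

Point = tuple[float, float]


@dataclass
class SceneReasoning:
    """Hypothetical intermediate result handed from stage 1 to stage 2."""
    positions: dict[str, Point]
    velocities: dict[str, Point]


def reason(prev: dict[str, Point], curr: dict[str, Point], dt: float) -> SceneReasoning:
    """Stage 1: estimate a motion vector per object from two observations."""
    velocities = {
        name: ((curr[name][0] - prev[name][0]) / dt,
               (curr[name][1] - prev[name][1]) / dt)
        for name in curr
    }
    return SceneReasoning(positions=dict(curr), velocities=velocities)


def generate(scene: SceneReasoning, steps: int, dt: float) -> list[dict[str, Point]]:
    """Stage 2: roll the reasoned scene forward into future frames."""
    frames = []
    for k in range(1, steps + 1):
        frames.append({
            name: (x + scene.velocities[name][0] * k * dt,
                   y + scene.velocities[name][1] * k * dt)
            for name, (x, y) in scene.positions.items()
        })
    return frames


# Two observations of one object, one time unit apart.
scene = reason({"box": (0.0, 0.0)}, {"box": (1.0, 0.5)}, dt=1.0)
frames = generate(scene, steps=2, dt=1.0)
print(frames)
```

The point of the sketch is only the data flow: the generation stage never sees raw observations, only the structured output of the reasoning stage.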

Benchmark Performance and Implementation Variants
On open-source model benchmarks, the architecture ranks first in world generation accuracy on Artificial Analysis, Physics-IQ, PAI-Bench, and R-Bench. For action policy evaluation, it leads on RoboLab and RoboArena, and it holds top positions on the VANTAGE-Bench and TAR leaderboards for vision understanding.

The framework is deployed in distinct configurations tailored to specific computational constraints:

Super Configuration: Optimized for post-training workflows in robotics and autonomous vehicles where high physics accuracy and generation quality are required.
Nano Configuration: Designed for low-latency video and action reasoning applications executing in fractions of a second.
Edge Configuration: Developed for localized, real-time inference deployment at the edge.
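
As a rough guide to how the three configurations map onto deployment needs, the following sketch encodes the descriptions above as a simple decision rule. The function `choose_variant` and its two boolean inputs are assumptions made for illustration; the announcement does not define selection criteria or thresholds.

```python
from enum import Enum


class Variant(Enum):
    SUPER = "Super"
    NANO = "Nano"
    EDGE = "Edge"


def choose_variant(runs_on_edge_device: bool, needs_subsecond_latency: bool) -> Variant:
    """Pick a configuration from two coarse requirements (illustrative rule only)."""
    if runs_on_edge_device:
        # Localized, real-time inference at the edge.
        return Variant.EDGE
    if needs_subsecond_latency:
        # Low-latency video and action reasoning.
        return Variant.NANO
    # Post-training workflows that prioritize physics accuracy and quality.
    return Variant.SUPER


print(choose_variant(runs_on_edge_device=False, needs_subsecond_latency=False).value)
print(choose_variant(runs_on_edge_device=False, needs_subsecond_latency=True).value)
print(choose_variant(runs_on_edge_device=True, needs_subsecond_latency=True).value)
```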

Ecosystem Integration and Industrial Use Cases
A global coalition including Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI has been established to standardize open world models and evaluation techniques using shared training tools and cloud infrastructure.

In industrial operations, companies such as Doosan Robotics, LG Electronics, and Samsung Electronics utilize the platform for robotics development. Li Auto applies the architecture to autonomous vehicle training, while enterprises including Centific, Fogsphere, Linker Vision, Milestone Systems, and Yuan deploy the system for industrial vision agents and spatial reasoning in smart environments. The underlying platform provides specialized datasets covering human motion, warehouse safety, and neural scene reconstruction to generate synthetic data and augment defect-image classification.

Additional Context
This section details technical specifications and competitive benchmarking not included in the original product announcement.

The mixture-of-transformers approach represents a shift from traditional single-modality pipelines, such as standalone vision-language models paired with separate reinforcement learning policies. While traditional setups introduce cumulative latency during cross-model communication, unified architectures process multimodal inputs within a single shared latent space.
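
The latency argument is simple arithmetic: in a decoupled pipeline, stage times add up and every boundary between models adds a handoff cost. The sketch below makes that explicit. All numbers are invented for illustration and are not measurements of any real system, and `pipeline_latency_ms` is a hypothetical helper.

```python
def pipeline_latency_ms(stage_ms: list[float], handoff_ms: float) -> float:
    """Sequential stages plus one handoff per boundary between stages."""
    handoffs = max(len(stage_ms) - 1, 0)
    return sum(stage_ms) + handoff_ms * handoffs


# Assumed figures: a vision-language model followed by a separate policy,
# versus a single unified model with no cross-model handoff.
decoupled = pipeline_latency_ms([40.0, 25.0], handoff_ms=10.0)
unified = pipeline_latency_ms([55.0], handoff_ms=10.0)

print(f"decoupled: {decoupled} ms, unified: {unified} ms")
```

With these assumed inputs the decoupled pipeline pays for one handoff on top of its two stages, while the single-stage case pays none. Whether a unified model is faster overall still depends on its own compute time.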

In physical simulation benchmarks, standard video generation models often exhibit physical inconsistencies, such as object permanence failures or incorrect gravity scaling. This architecture competes directly with proprietary world simulation models by accepting explicit action-vector inputs, which lets the system predict environmental state changes conditioned on specific robotic forces. This approach narrows the sim-to-real gap, an area where open-source alternatives have historically required extensive domain randomization to match real-world performance.
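
Two ideas in this paragraph, action-conditioned state prediction and domain randomization, can be shown with a toy point-mass example. This is a hand-written physics step, not the learned model described in the article. The names `State`, `step`, and `randomized_masses` and all parameter values are assumptions for illustration.

```python
import random
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass(frozen=True)
class State:
    height: float    # metres
    velocity: float  # metres per second, positive is up


def step(state: State, force: float, mass: float, dt: float) -> State:
    """Predict the next state given an applied vertical force (the action)."""
    accel = force / mass - G
    velocity = state.velocity + accel * dt
    height = state.height + velocity * dt
    return State(height, velocity)


def randomized_masses(nominal: float, spread: float, n: int, seed: int = 0) -> list[float]:
    """Domain randomization: sample masses within +/- spread of the nominal value."""
    rng = random.Random(seed)
    return [nominal * rng.uniform(1 - spread, 1 + spread) for _ in range(n)]


start = State(height=1.0, velocity=0.0)

# A force equal to the object's weight holds it in place; zero force lets it fall.
hover = step(start, force=2.0 * G, mass=2.0, dt=0.1)
fall = step(start, force=0.0, mass=2.0, dt=0.1)
print(hover, fall)

# The same action gives different outcomes when the mass is uncertain.
masses = randomized_masses(nominal=2.0, spread=0.2, n=5)
outcomes = [step(start, force=2.0 * G, mass=m, dt=0.1) for m in masses]
print(outcomes)
```

The same force produces a different next state for each sampled mass. That spread is what domain randomization exposes a policy to during training.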

Edited by Evgeny Churilov, Induportals Media - Adapted by AI.
