

Hosts: Junfan Zhu, Aurora Feng
discord.gg/WH7DrTHRXK
🤖🥘 Saturday Robotics x SpatialVerse/Manycore x Neural Motion | CVPR 2026 Denver Research Night & Academic Salon, with Dinner | Robotics & World Models Reading Club 11
Date: June 6, 2026 (Saturday)
Time: 5:30 PM — 9:30 PM (tentative)
Location: Denver (Downtown), CO 80202
Organized by: Saturday Robotics & World Models Reading Club x SpatialVerse by Manycore Tech x Neural Motion (Saturday Robotics, @neuralmotion, @junfanzhu98, @aurorafeng_01)
Since CVPR 2026 is taking place in Denver, our regular San Francisco Saturday Robotics & World Models Reading Club session on June 6 will be upgraded into a special CVPR Denver gathering.
This will keep the same high-signal, deep-discussion format that defines our weekly Saturday sessions: technical sharing, sharp questions, open discussion, and real exchange between people actively working on robotics, world models, embodied AI, computer vision, and physical intelligence.
Cohosts for this special session include:
Saturday Robotics: Saturday Robotics is a high-signal reading group for robotics & world models researchers, founders, and builders in SF. Previous sessions have hosted researchers and builders from teams including Boston Dynamics, Google DeepMind, NVIDIA, Stanford, UC Berkeley, Rhoda AI, Meta FAIR, Generalist, Dyna Robotics, and leading Bay Area robotics startups. The weekly discussions have also generated in-depth public technical writeups and community posts that have drawn attention from researchers across the field, including engagement from Yann LeCun.
SpatialVerse by Manycore Tech: SpatialVerse works at the forefront of spatial intelligence, with a focus on deepening our collective understanding of the physical world. They collaborate across the ecosystem, from world models and synthetic 3D environments to embodied AI and robotics.
Neural Motion: Neural Motion is a robotics neolab building a generative video-action model for universal embodiment transfer. Their model powers robots to learn from any other robot in both the dynamics and observation space, reaching seamless transfer across different robot embodiments and domains.
Opening Remarks (3 min each)
Junfan Zhu & Aurora Feng, Founders of Saturday Robotics
Anthony Zhao, Head of North America at Manycore Tech SpatialVerse
Research Night Lightning Talks (1 hour total, 10 min each)
Aurora Feng, Founder at Neural Motion. NM-GenET.
Neural Motion will be releasing NM-GenET, a generative video-action model for universal embodiment transfer and cross-embodiment, cross-domain policy learning.
Max Zhaoshuo Li, Robotics and World Model Tech Lead at NVIDIA Cosmos. Cosmos 3.
TLDR: Next-gen Cosmos 3 will be released shortly before CVPR. It is an omni model that is SoTA for image generation, video generation, sound generation, embodied reasoning, AND robot policy control. Yes, it is a real omni model!
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI, effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state of the art across a diverse suite of understanding and generation tasks, establishing omnimodal world models as scalable, general-purpose backbones for embodied agents. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation’s OpenMDW-1.1 License at huggingface.co/collections/nvidia/cosmos3. The project website is available at research.nvidia.com/labs/cosmos-lab/cosmos3.
Xiaofan Li, World Model Tech Lead at X Square Robot. WALL-WM.
TLDR: From next-chunk prediction to next-event prediction. WALL-WM is a new training and inference workflow for world modeling that focuses on semantic event signals rather than rigid frame chunks. It also explores integrating agent intelligence with WAMs to better perceive and predict real-world dynamic behaviors.
WALL-WM is a World Action Model (WAM) built around event-level Vision-Language-Action (V-L-A) pretraining. Existing WAMs often initialize from multimodal and video foundation models, then train and infer fixed-length chunks directly conditioned on the current observation and instruction. However, because text, vision, and action live on different manifolds and temporal scales, this direct joint optimization can distort the pretrained prior. WALL-WM therefore treats the semantic event, rather than a fixed frame window, as the atomic unit of video-action learning, and pairs this training scheme with a data ecosystem organized around event-level captions and cluster-balanced sampling. WALL-WM supports two inference modes on the same event-pretrained backbone: an event mode that consumes a next-event description and allows variable execution chunk sizes, and a unified mode that uses a VLM with Staircase Layer-Relay CoT Decoding to condition conventional fixed-length chunk inference while keeping the V-L-A path gradient-continuous. Together with Muon-optimizer-based large-scale pretraining infrastructure, this forms a scale-up recipe for general-purpose WAMs. WALL-WM demonstrates broad generalisation across language, scenes, and tasks, and achieves the best performance in large-scale real-world generalisation evaluation.
Zesen Zhao, University of Michigan. Test-Time Scaling for World Action Models via Zero-Shot Geometric Verification.
World Action Models jointly predict future visual observations and actions, but their imagined futures can contain visual or geometric artifacts that degrade downstream action quality. In this talk, I will present our recent CoRL submission on a training-free and model-agnostic verifier for WAM rollouts. The core idea is to use cross-view geometric consistency as a proxy for rollout quality: if a predicted multi-view future is physically plausible, the different camera views should describe a consistent 3D structure. We instantiate this with a frozen VGGT model and a cross-view depth reprojection consistency score, then use it for Best-of-N test-time rollout selection. I will also discuss an adaptive variant that triggers additional sampling only when predicted video motion conflicts with action-implied end-effector motion, reducing inference cost while recovering most of the Best-of-N gain.
The talk will be technical and research-driven, focusing on test-time scaling, geometric verification, and how to evaluate rollout quality in embodied world models.
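For readers new to test-time scaling, the selection logic described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sample`, `score`, `conflict`, and `threshold` are hypothetical placeholders. In the talk's setting, `score` would be the cross-view depth reprojection consistency score and `conflict` would measure disagreement between predicted video motion and action-implied end-effector motion.

```python
from typing import Callable, Tuple, TypeVar

Rollout = TypeVar("Rollout")


def best_of_n(
    sample: Callable[[], Rollout],
    score: Callable[[Rollout], float],
    n: int,
) -> Rollout:
    """Draw n rollouts and keep the one the verifier scores highest."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)


def adaptive_best_of_n(
    sample: Callable[[], Rollout],
    score: Callable[[Rollout], float],
    conflict: Callable[[Rollout], float],
    n: int,
    threshold: float,
) -> Tuple[Rollout, int]:
    """Draw one rollout; draw the remaining n - 1 only if it looks suspicious.

    Returns the chosen rollout and the number of rollouts actually sampled.
    The conflict measure and threshold are assumptions made for illustration.
    """
    first = sample()
    if conflict(first) <= threshold:
        # The first rollout looks self-consistent, so skip extra sampling.
        return first, 1
    candidates = [first] + [sample() for _ in range(n - 1)]
    return max(candidates, key=score), n
```

The adaptive variant pays for the full Best-of-N budget only on rollouts that trip the conflict check, which is how it can reduce average inference cost.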
Jie Wang, University of Pennsylvania, GRASP Lab. Toward a Robotics MMLU: Lessons from Sim & Real Evaluations of Generalist Policies.
Robot policies are transitioning into foundation models, but our evaluation methodology is not. The LLM community iterates on benchmarks like MMLU that decompose capability into reproducible, comparable axes. Robotics has no equivalent. Drawing on four works I have contributed to (π0 in the Wild, RoboArena, TiPToP, and MolmoAct2 Evals), this talk argues that current practice consistently misses the failure modes that matter most for generalist deployment, and surfaces structural issues with current real-world evaluation. I outline what a "robotics MMLU" might require, and call for the community to treat evaluation methodology as a first-class research problem.
Gordon Qian, Senior AI Researcher at Snap. Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning.
TLDR: Scalar rewards are too coarse for video diffusion fine-tuning. Diffusion-DRF turns rich VQA explanations from a VLM into rewards that provide spatially and semantically precise credit assignment for video alignment tuning.
RL-style post-training has become a powerful recipe for LLM reasoning and image generation, but video diffusion models expose a harder problem: a single scalar reward often cannot tell the model which object, motion, frame, or physical inconsistency caused failure. In practice, GRPO-style methods that work well in text or image settings can become unstable or weak for video, often leading to reward-hacked results within roughly 300 training steps. In this lightning talk, I will share insights from Diffusion-DRF, a reward framework for video diffusion fine-tuning that enables stable training beyond 3K steps. Diffusion-DRF decomposes prompts into structured questions across text-video alignment, physical fidelity, and visual quality. The VLM’s free-form explanations and next-token logits become rich differentiable rewards, whose gradients can be backpropagated through the VAE decoder and final denoising steps. The key takeaway: VLM gradients are not just evaluators; they can provide precise credit assignment for video generation failures.
Pre-Readings: https://arxiv.org/abs/2601.04153
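As a rough intuition for how next-token logits can act as a differentiable reward, here is a toy sketch. It assumes each structured question is answered with a yes/no token and that per-question rewards are averaged; both choices, and all function names, are illustrative assumptions rather than details taken from the paper.

```python
import math
from typing import List, Tuple


def yes_log_prob(logit_yes: float, logit_no: float) -> float:
    """log p(yes) under a two-way softmax over the 'yes' and 'no' logits."""
    d = logit_yes - logit_no
    # Numerically stable log-sigmoid(d).
    if d >= 0:
        return -math.log1p(math.exp(-d))
    return d - math.log1p(math.exp(d))


def yes_log_prob_grad(logit_yes: float, logit_no: float) -> Tuple[float, float]:
    """Gradient of log p(yes) with respect to (logit_yes, logit_no)."""
    p = math.exp(yes_log_prob(logit_yes, logit_no))
    return (1.0 - p, -(1.0 - p))


def aggregate_reward(question_logits: List[Tuple[float, float]]) -> float:
    """Mean log p(yes) over all questions (assumed aggregation rule)."""
    return sum(yes_log_prob(y, n) for y, n in question_logits) / len(question_logits)
```

Because the reward is a smooth function of the logits, its gradient is nonzero whenever the VLM is not already certain of "yes". In a real pipeline an autodiff framework would carry that gradient back through the VLM to the generated frames; the hand-written gradient here only makes the idea concrete.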
Open Floor Discussions + Q&A
We'd also love to hear your hot takes on:
"What is still the bottleneck for embodied AI systems: model architecture, embodied data, physical grounding, or evaluation?"
We welcome CVPR attendees, researchers, engineers, founders, and builders working on relevant frontier topics, including but not limited to:
Robotics world models and action-conditioned prediction
JEPA / V-JEPA-style predictive representation learning
Vision-language-action models, world-action models, and robot foundation models
Human egocentric video, UMI-style data, teleoperation, and robot-collected interaction data
Multimodal world models across vision, tactile, proprioception, language, and action
Cross-embodiment transfer, long-horizon planning, and embodied generalization
Physical grounding, spatial understanding, simulation, and evaluation for robot learning
Video generation/video world models for robotics
CVPR 2026 robotics, embodied AI, world-model, and 3D vision papers
And most importantly, your hot 🔥 takes
Schedule (tentative)
5:30 to 6:15 PM
Doors open, dinner, drinks, strawberries, and networking. 🥘🍾🍓
6:15 to 6:30 PM
Opening remarks from Saturday Robotics and SpatialVerse by Manycore Tech.
6:30 to 7:15 PM
Keynote/lightning talks from CVPR authors or domain experts.
7:15 to 8:15 PM
Open technical Q&A and roundtable discussion.
8:15 to 9:30 PM
Free-form discussion, curated introductions, and open mingling.
Registration
This event is intended for relevant researchers, PhD students, postdocs, faculty, senior engineers, technical founders, and builders working in AI, robotics, computer vision, reinforcement learning, world models, 3D vision, simulation, or embodied intelligence.
This Denver edition continues the Saturday Robotics tradition during CVPR week: an in-person gathering for robotics, world-model, and embodied AI people who want to go deeper than the usual conference hallway conversation.
When you RSVP, please briefly note your affiliation and research focus. We review applications to keep the room high-signal, relevant, and productive.
Whether you are building physical robots, training world models, studying embodied generalization, or debating what actually helps machines understand and act in the physical world, this is the room you want to be in at CVPR.
See you in Denver! 🤖🍓📚
Join Discord Community
https://discord.gg/WH7DrTHRXK
Follow Saturday Robotics
https://x.com/saturdayrobotic
https://www.linkedin.com/company/saturdayrobotic/
Follow Our YouTube Channel