

Robotics World Model Reading Club 01 – San Francisco
A high-signal reading group for AI researchers pushing the frontiers of embodied intelligence, world models, and robotic foundation models.
We conduct technical deep-dives into the latest papers and industry breakthroughs at the intersection of video-based world models, diffusion policies, VLAs, cross-embodiment learning, spatial intelligence, and precise manipulation refinement. Expect detailed discussions on model architectures, training recipes, generalization mechanisms, sim-to-real strategies, online adaptation, and scaling paths toward physical AGI.
Core Themes & Spotlight Papers
Generalist Robotic Foundation Models & Precise Refinement
DreamZero (NVIDIA): World Action Models (WAMs) that jointly predict dense future video states and actions via pretrained video diffusion backbones. Enables strong zero-shot generalization and few-shot embodiment adaptation using heterogeneous data (a rough joint-prediction sketch follows this list).
RL Tokens (RLT) – Precise Manipulation with Efficient Online RL (Physical Intelligence): Extracts a compact RL token from a frozen VLA to interface with lightweight actor-critic networks for fast online RL, targeting the "last millimeter" problem in precise, contact-rich tasks (e.g., screwdriver alignment, zip-tie fastening, Ethernet/power-cord insertion). Achieves a 2–3× throughput speedup in critical phases with only 15–120 minutes of real-world data, in some cases surpassing human teleoperation speed, via efficient real-time adaptation without full-model fine-tuning (a minimal interface sketch also follows below).
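For intuition, here is a rough sketch of the joint world-action prediction idea. This is our abstraction under assumed shapes, not NVIDIA's DreamZero code; `WorldActionHead` and `backbone_tokens` are placeholder names:

```python
import torch
import torch.nn as nn

class WorldActionHead(nn.Module):
    """One shared backbone, two outputs: future video latents and an action chunk."""
    def __init__(self, d_model: int = 1024, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        self.action_dim, self.horizon = action_dim, horizon
        self.video_head = nn.Linear(d_model, d_model)                # per-token future latents
        self.action_head = nn.Linear(d_model, action_dim * horizon)  # action chunk

    def forward(self, backbone_tokens: torch.Tensor):
        # backbone_tokens: (batch, num_tokens, d_model), produced by a video
        # diffusion backbone conditioned on past frames and the instruction.
        video_latents = self.video_head(backbone_tokens)
        pooled = backbone_tokens.mean(dim=1)                          # (batch, d_model)
        actions = self.action_head(pooled).view(-1, self.horizon, self.action_dim)
        return video_latents, actions
```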
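Likewise, a minimal sketch of the RL-token interface, assuming the VLA stays frozen and exposes an intermediate feature sequence; `vla.encode`, `RLTokenActorCritic`, and all dimensions are hypothetical, not Physical Intelligence's API:

```python
import torch
import torch.nn as nn

class RLTokenActorCritic(nn.Module):
    """Lightweight heads trained online; the full VLA stays frozen."""
    def __init__(self, token_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(token_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))
        self.critic = nn.Sequential(
            nn.Linear(token_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, rl_token: torch.Tensor):
        # rl_token: (batch, token_dim) compact summary from the frozen VLA.
        return self.actor(rl_token), self.critic(rl_token)

@torch.no_grad()
def extract_rl_token(vla, observation) -> torch.Tensor:
    """Pool an intermediate activation of the frozen VLA into one vector.
    `vla.encode` is a placeholder for whatever feature hook you attach."""
    features = vla.encode(observation)   # assumed shape: (seq_len, token_dim)
    return features.mean(dim=0)          # one compact token per control step
```

Because only the small actor-critic is updated, online RL can run in minutes of real-world interaction rather than requiring full-model fine-tuning.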
4D Scene Understanding & Dynamic World Modeling
D4RT (DeepMind): A unified feedforward transformer that jointly infers depth, spatio-temporal correspondence (tracking), and full camera parameters from monocular video. Disentangles camera motion, object motion, and static geometry for accurate 4D reconstruction. Up to 300× more efficient than prior methods, enabling real-time inference. Directly supports robotics needs such as dynamic navigation, safe interaction in populated environments, dexterous manipulation, and building true predictive world models for physical AGI.
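As a rough sketch of the unified-feedforward idea (our reading, with assumed shapes and head dimensions, not DeepMind's released interface):

```python
import torch
import torch.nn as nn

class Joint4DHeads(nn.Module):
    """Shared video tokens feed three task heads in a single forward pass."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.depth_head = nn.Linear(d_model, 1)    # per-token depth
        self.track_head = nn.Linear(d_model, 2)    # per-token 2D correspondence offset
        self.camera_head = nn.Linear(d_model, 9)   # per-frame intrinsics + 6-DoF pose (assumed layout)

    def forward(self, video_tokens: torch.Tensor):
        # video_tokens: (batch, frames, tokens_per_frame, d_model) from a
        # shared transformer encoder over monocular video.
        depth = self.depth_head(video_tokens)
        tracks = self.track_head(video_tokens)
        camera = self.camera_head(video_tokens.mean(dim=2))  # pool per frame
        return depth, tracks, camera
```

Predicting all three quantities from one shared encoding is what lets camera motion, object motion, and static geometry be disentangled in a single efficient pass.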
Human Demonstration Interfaces & Policy Improvement
Compliant Residual DAgger (CR-DAgger, Shuran Song Lab): A compliant intervention interface that uses compliance control for safe, precise human delta corrections in contact-rich tasks, paired with force-feedback residual policies for efficient on-policy learning (a residual-composition sketch follows this list).
UMI-FT (Shuran Song Lab): Force/torque-augmented Universal Manipulation Interface with fingertip sensors and mobile multimodal vision (iPhone + ARKit) for collecting rich compliant manipulation data directly in-the-wild, enabling robot-free teaching of forceful skills.
EgoScale (NVIDIA): Human-to-dexterous-manipulation transfer framework pretrained on >20,000 hours of action-labeled egocentric human video. Uncovers a clear log-linear scaling law (R² ≈ 0.998) between human data scale and action prediction loss, which strongly predicts real-robot success. Enables long-horizon dexterous tasks (e.g., shirt rolling, syringe injection, card sorting) and one-shot adaptation with minimal robot data (~4 hours of mid-training alignment), improving success rates by 54% over scratch baselines on 22-DoF hands and transferring to lower-DoF embodiments. A toy fit of the reported scaling law follows this list.
EgoVerse (NVIDIA): Ecosystem and "living" dataset for curating, accessing, and learning from diverse global human egocentric data tailored for robot learning. Supports rigorous cross-task, cross-embodiment human-to-robot transfer research with dense annotations, continuous expansion by consortium/community, and focus on scalable supervision for foundation models in manipulation.
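A residual-composition sketch for the CR-DAgger item above; the bounded-delta design and all names (`ForceResidualPolicy`, `max_delta`) are our assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class ForceResidualPolicy(nn.Module):
    """Correct a base policy's action with a small force-aware delta."""
    def __init__(self, ft_dim: int = 6, proprio_dim: int = 7,
                 action_dim: int = 7, max_delta: float = 0.005):
        super().__init__()
        self.max_delta = max_delta  # keep corrections small, like human deltas
        self.net = nn.Sequential(
            nn.Linear(ft_dim + proprio_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh())

    def forward(self, base_action, force_torque, proprio):
        x = torch.cat([base_action, force_torque, proprio], dim=-1)
        # Residual is bounded, mirroring the small delta corrections a human
        # provides through the compliant intervention interface.
        return base_action + self.max_delta * self.net(x)
```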
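And a toy illustration of the EgoScale log-linear fit; the data points below are synthetic stand-ins, not the paper's measurements:

```python
import numpy as np

# Hypothetical (hours of human video, action prediction loss) pairs.
hours = np.array([100., 500., 2_000., 8_000., 20_000.])
loss = np.array([0.92, 0.78, 0.66, 0.54, 0.46])

# Ordinary least squares on (log hours, loss): loss ~ a + b * log(hours).
A = np.stack([np.log(hours), np.ones_like(hours)], axis=1)
(slope, intercept), *_ = np.linalg.lstsq(A, loss, rcond=None)

pred = A @ np.array([slope, intercept])
r2 = 1 - np.sum((loss - pred) ** 2) / np.sum((loss - loss.mean()) ** 2)
print(f"loss ~= {intercept:.2f} {slope:+.3f} * log(hours), R^2 = {r2:.3f}")
```

The practical implication of such a fit is that you can extrapolate how much additional human video is needed to hit a target action-prediction loss, which the paper reports is itself predictive of real-robot success.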
Cross-Embodiment Generalization
AirExo-2: Low-cost exoskeleton platform with demonstration adaptors that transform large-scale in-the-wild human demonstrations into pseudo-robot data for scalable, generalizable imitation learning.
LAP (Language-Action Pre-training, LAP-3B): Novel VLA recipe that encodes low-level actions in natural language to align with VLM distributions, achieving strong zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning.
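A toy round-trip showing the flavor of language-action encoding; the exact text format below is our invention, not the LAP-3B recipe:

```python
# Render continuous end-effector deltas as plain text so a VLM can model
# them with its ordinary token distribution (hypothetical format).
def action_to_text(delta_xyz, delta_rpy, gripper) -> str:
    return (f"move x {delta_xyz[0]:+.3f} y {delta_xyz[1]:+.3f} "
            f"z {delta_xyz[2]:+.3f} roll {delta_rpy[0]:+.3f} "
            f"pitch {delta_rpy[1]:+.3f} yaw {delta_rpy[2]:+.3f} "
            f"gripper {'close' if gripper else 'open'}")

def text_to_action(text: str):
    tokens = text.split()
    values = [float(tokens[i]) for i in (2, 4, 6, 8, 10, 12)]
    return values[:3], values[3:6], tokens[14] == "close"

# Round-trip example:
s = action_to_text([0.01, -0.02, 0.0], [0.0, 0.0, 0.1], True)
assert text_to_action(s) == ([0.01, -0.02, 0.0], [0.0, 0.0, 0.1], True)
```

Keeping actions in the VLM's native text distribution is what plausibly enables zero-shot transfer to unseen embodiments: no new action vocabulary or embodiment-specific head has to be learned.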
Industry Frontier Technical Analysis
We break down the latest releases and blog posts from leading labs and startups:
Physical Intelligence (π): General-purpose robotic foundation models and VLAs (e.g., π series progression to π*0.6 with Recap RL for experience-driven improvement, doubling throughput on hard tasks like espresso making, box assembly, laundry folding; Multi-Scale Embodied Memory (MEM) for long/short-term memory enabling >10-min complex tasks; RL token-based online RL for precise manipulation; emergence of human-to-robot transfer in scaled VLAs; real-time chunking for low-latency inference in dynamic tasks)
Figure AI: Humanoid scaling laws and real-world whole-body control
1X: Mobile manipulation systems and data-efficient learning
DeepMind: Robotic foundation models, multimodal simulators, and long-horizon planning
Skild AI: Unified omni-bodied foundation models with cross-embodiment transfer
OpenAI: Generative world models and embodied agent architectures
Sunday Robotics: ACT-1 foundation model trained on zero robot teleoperation data using Skill Capture Gloves across real homes — enabling long-horizon autonomous household manipulation and navigation
World Labs (Fei-Fei Li): Frontier spatial intelligence and generative 3D world models (Marble) for consistent multimodal scene generation, persistent 3D simulation, and interactive spatial reasoning
Video Generation as Predictive World Models
Frontier video diffusion and generative models as the backbone for embodied simulation, long-horizon planning, and joint world-action prediction. We analyze consistency over time, integration with action experts, and their role in closing the reality gap for robotics.
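As a schematic of how such a model can drive planning, here is a random-shooting loop over imagined rollouts; `world_model.rollout` and `scorer` are placeholder interfaces, not any specific lab's API:

```python
import torch

def plan_with_video_world_model(world_model, scorer, obs_frames,
                                horizon=16, num_candidates=64, action_dim=7):
    """Random-shooting planner over imagined video futures."""
    candidates = torch.randn(num_candidates, horizon, action_dim)
    best_score, best_actions = -float("inf"), None
    for actions in candidates:
        # world_model.rollout is assumed to autoregressively generate
        # future frames conditioned on past frames and an action sequence.
        imagined = world_model.rollout(obs_frames, actions)
        score = scorer(imagined)  # e.g., similarity to a goal image
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions[0]  # execute the first action, then replan (MPC-style)
```

Temporal consistency of the generated video directly bounds how far such a planner can look ahead, which is why it is one of the open challenges we discuss.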
Format
Pre-read focused sections of these papers
Member-led deep technical discussions, critiques, ablation insights, and implementation brainstorming
Open floor on open challenges: video world model consistency, data efficiency, hierarchical planning, cross-embodiment scaling, force-aware control, and online RL for precision refinement
Casual networking with top AI & robotics researchers in the Bay Area
Date & Time
Saturday, March 28, 2026 | 2:00 PM – 5:00 PM
Location
San Francisco
What to expect
Light refreshments: fresh strawberries 🍓 and other juicy fruits, plus assorted drinks
Roundtable Discussion
Roundtable open-floor discussion (about 10–20 minutes per person per topic) focused on the spotlight papers, or any paper you'd like to highlight. Feel free to share why the paper matters, its potential impact, and its technical details.
After the structured discussion, the floor will remain open for free-flowing Q&A and casual conversations/networking.
Logistics
Please bring your laptop if you'd like to share slides or a Keynote presentation.
If you'd like to discuss additional papers not listed above, please share them in advance in the Discord channel (https://discord.gg/jmqa55PD) so we can prepare accordingly.
Future sessions: invited keynotes + deep technical talks from leading researchers and industry engineers, followed by open discussion
If you are actively researching or building video world models, VLAs, diffusion policies, compliant manipulation, cross-embodiment systems, or efficient online RL for dexterous precision tasks, this is the room for you. High-signal technical environment. Space is limited.
Join Discord Community