Benchmarks evolving environments as sequences of progressive updates and introduces EvoMem, a patch-based memory that records structured update histories so that LLM agents can reason about how an environment has changed. Reports gains on EvoArena and other benchmarks.
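To make the idea of a patch-based memory concrete, here is a minimal sketch of one way such a memory could be structured. It is not EvoMem's actual design: the `Patch` and `PatchMemory` names, their fields, and the example keys are all hypothetical, chosen only to show how recording each update as a structured patch lets an agent query both the current state and the history that led to it.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Patch:
    """One recorded update (hypothetical schema, not EvoMem's)."""
    step: int   # index of the environment update
    key: str    # which part of the environment changed
    old: Any    # value before the update (None if the key is new)
    new: Any    # value after the update


@dataclass
class PatchMemory:
    """Keeps the current state plus the ordered list of patches."""
    state: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def apply(self, step: int, key: str, new: Any) -> None:
        # Record what changed before overwriting the current value.
        self.history.append(Patch(step, key, self.state.get(key), new))
        self.state[key] = new

    def changes_to(self, key: str) -> list:
        # Every patch that touched this key, in the order applied.
        return [p for p in self.history if p.key == key]

    def state_at(self, step: int) -> dict:
        # Replay patches up to and including `step`.
        # Assumes patches were applied in non-decreasing step order.
        snapshot = {}
        for p in self.history:
            if p.step <= step:
                snapshot[p.key] = p.new
        return snapshot


memory = PatchMemory()
memory.apply(1, "login_button", "top-right")
memory.apply(2, "login_button", "sidebar")
memory.apply(3, "search_api", "v2")
```

With this layout, `memory.changes_to("login_button")` returns the two patches that moved the button, and `memory.state_at(1)` reconstructs the environment as it looked after the first update, which is the kind of query a plain snapshot of the latest state cannot answer.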
Provides a training-free, code-as-action framework that lets VLM-backed agents write and run stateful Python cells to compose perception and geometry primitives for open-ended 3D/4D spatial reasoning. Demonstrates consistent gains across 20 benchmarks and multiple VLM backbones.
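The code-as-action pattern with stateful cells can be illustrated with a small sketch. This is an assumption-laden toy, not the framework's implementation: `CellSession`, `detect_center`, and `distance` are hypothetical names, and the perception primitive is stubbed with fixed coordinates where a real system would call a vision model. The point it shows is that variables defined in one cell remain available to later cells, so an agent can build up a computation step by step.

```python
import contextlib
import io
import math


class CellSession:
    """Runs code cells against one shared namespace (hypothetical helper)."""

    def __init__(self, primitives: dict):
        # Primitives are exposed to the generated code as plain globals.
        self.namespace = dict(primitives)

    def run(self, source: str) -> str:
        # Capture printed output so it can be fed back to the agent.
        # exec is unsafe on untrusted code; a real system would sandbox it.
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()


def detect_center(label: str) -> tuple:
    # Stub perception primitive: fixed 3D centers instead of a real detector.
    scene = {"chair": (0.0, 0.0, 0.0), "table": (3.0, 4.0, 0.0)}
    return scene[label]


def distance(a: tuple, b: tuple) -> float:
    # Geometry primitive: Euclidean distance between two points.
    return math.dist(a, b)


session = CellSession({"detect_center": detect_center, "distance": distance})

# Cell 1: perception. The results persist in the session namespace.
session.run("chair = detect_center('chair')\ntable = detect_center('table')")

# Cell 2: geometry, reusing the variables defined by cell 1.
output = session.run("print(distance(chair, table))")
```

Here the second cell prints `5.0` because `chair` and `table` survive from the first cell. Keeping state between cells is what lets the agent inspect intermediate results and decide on the next primitive to call, rather than having to emit one monolithic program.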