Two papers on giving models a grip on physical space

The week’s most substantial work was about teaching models the physical world rather than just text about it. Two arXiv papers came at it from opposite ends — perception and action — and both leaned on the same lever: scale, plus a representation that actually fits the problem.

“Imaginative Perception Tokens” tackles a real weakness in vision-language models: they reason well about what’s in frame and badly about what isn’t. The paper adds tokens that let a VLM infer unseen viewpoints — stitching partial observations into a coherent sense of the space around an object, rather than only the pixels it was handed. The framing is the interesting part: spatial understanding treated as something a model imagines beyond its input, not a property it reads straight off the image. Whether the gains survive outside the paper’s own benchmarks is the usual open question, but the problem it names — models that can’t reason about the occluded or off-screen — is the right one to be chasing.

Humanoid-GPT comes at the body instead of the eye. It’s a GPT-style transformer trained on a billion-scale motion corpus for whole-body control, and it reports zero-shot motion tracking across varied scenarios — generalising to movements it wasn’t explicitly trained on. The headline is the recipe more than the result: take the same “scale the data, keep the architecture boring” approach that worked for language and point it at retargeted motion data. It’s another data point for the view that a lot of embodied-AI progress has quietly become a dataset problem.

Put the two together and the week reads as a small bet that the route to models which understand physical space runs through more data and better-fitted tokens, not exotic new architectures: the same trajectory language took, arriving a few years later for perception and motion.

The other two things worth flagging were smaller and sharper. A careful piece on 255 vs 256 division works through a detail every graphics programmer gets wrong at least once: whether to map an 8-bit colour channel to the 0–1 range by dividing by 255 or 256, and why the answer isn’t arbitrary. It’s the kind of low-level correctness that quietly decides whether colours round-trip cleanly through a pipeline. And “I made my phone slow on purpose” is a reflective note on deliberately degrading a device’s responsiveness to make it less compelling to reach for — a constraint-as-feature argument that cuts against the usual “make everything faster” reflex.
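To make the 255-versus-256 point concrete, here is a minimal Python sketch. It is not taken from the linked piece, and the function names are hypothetical, chosen only for illustration. It assumes the common convention where 8-bit value 255 should map to exactly 1.0, and shows what happens when a pipeline divides by 256 on the way in but scales by 255 on the way out.

```python
# Illustrative sketch only: function names are hypothetical, and this is
# not the linked article's code.

def to_unit_255(v: int) -> float:
    """Map an 8-bit channel to [0, 1] so that 255 lands exactly on 1.0."""
    return v / 255


def to_unit_256(v: int) -> float:
    """Map an 8-bit channel by dividing by 256; 255 lands just below 1.0."""
    return v / 256


def from_unit_255(x: float) -> int:
    """Inverse of to_unit_255: scale by 255, round to nearest, clamp."""
    return min(255, max(0, int(x * 255 + 0.5)))


# Matching conventions: every 8-bit value survives the round trip.
roundtrip_ok = all(from_unit_255(to_unit_255(v)) == v for v in range(256))

# Mixed conventions: divide by 256 going in, scale by 255 coming out.
mixed_mismatches = [
    v for v in range(256) if from_unit_255(to_unit_256(v)) != v
]

print("255 convention round-trips:", roundtrip_ok)
print("white under /255:", to_unit_255(255))
print("white under /256:", to_unit_256(255))
print("values changed by mixing conventions:", len(mixed_mismatches))
```

The takeaway is less about which divisor is right in the abstract and more about consistency: whichever mapping a pipeline uses to leave 8-bit space has to be the one it inverts on the way back, otherwise brighter values drift down by one step.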

If there’s a thread across all four, it’s fit over raw power: a token that fits the gap in a model’s spatial reasoning, a dataset that fits an architecture borrowed wholesale, a division that fits the maths, and a slowdown that fits how someone actually wants to use their phone. Not a flashy week — a useful one.
