The week’s most substantial work was about teaching models the physical world rather than just text about it. Two arXiv papers came at it from opposite ends — perception and action — and both leaned on the same lever: scale, plus a representation that actually fits the problem.
“Imaginative Perception Tokens” tackles a real weakness in vision-language models: they reason well about what’s in frame and badly about what isn’t. The paper adds tokens that let a VLM infer unseen viewpoints — stitching partial observations into a coherent sense of the space around an object, rather than only the pixels it was handed. The framing is the interesting part: spatial understanding treated as something a model imagines beyond its input, not a property it reads straight off the image. Whether the gains survive outside the paper’s own benchmarks is the usual open question, but the problem it names — models that can’t reason about the occluded or off-screen — is the right one to be chasing.
Humanoid-GPT comes at the body instead of the eye. It’s a GPT-style transformer trained on a billion-scale motion corpus for whole-body control, and it reports zero-shot motion tracking across varied scenarios — generalising to movements it wasn’t explicitly trained on. The headline is the recipe more than the result: take the same “scale the data, keep the architecture boring” approach that worked for language and point it at retargeted motion data. It’s another data point for the view that a lot of embodied-AI progress has quietly become a dataset problem.
Put the two together and the week reads as a small bet that the route to models which understand physical space runs through more data and better-fitted tokens, not exotic new architectures: the same trajectory language took, arriving a few years later for perception and motion.
The other two things worth flagging were smaller and sharper. A careful piece on 255 vs 256 division works through a detail every graphics programmer gets wrong at least once: whether to map an 8-bit colour channel to the 0–1 range by dividing by 255 or 256, and why the answer isn’t arbitrary. It’s the kind of low-level correctness that quietly decides whether colours round-trip cleanly through a pipeline. And “I made my phone slow on purpose” is a reflective note on deliberately degrading a device’s responsiveness to make it less compelling to reach for — a constraint-as-feature argument that cuts against the usual “make everything faster” reflex.
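On the 255 vs 256 question, the arithmetic is easy to check by hand. What follows is a minimal sketch, not a reproduction of the linked piece's argument: the function names are made up for illustration, and the two conventions shown (divide by 255 and round on the way back, or divide by 256 and floor on the way back) are simply the two most common pairings. Dividing by 255 puts 0 at exactly 0.0 and 255 at exactly 1.0; dividing by 256 tops out just short of 1.0. Either pairing round-trips cleanly on its own, and the trouble starts when a pipeline mixes them.

```python
# Illustrative helpers; the names are hypothetical, not from any library.

def to_unit_255(v: int) -> float:
    """Map 0..255 onto the closed range [0.0, 1.0]; 255 lands exactly on 1.0."""
    return v / 255


def to_unit_256(v: int) -> float:
    """Map 0..255 onto [0.0, 1.0); 255 lands on 255/256, just short of 1.0."""
    return v / 256


def from_unit_round(x: float) -> int:
    """Inverse that pairs with to_unit_255: scale by 255, round to nearest."""
    return min(255, max(0, int(x * 255 + 0.5)))


def from_unit_floor(x: float) -> int:
    """Inverse that pairs with to_unit_256: scale by 256, round down."""
    return min(255, max(0, int(x * 256)))


# Matched pairs: every 8-bit value survives the round trip.
matched_255 = all(from_unit_round(to_unit_255(v)) == v for v in range(256))
matched_256 = all(from_unit_floor(to_unit_256(v)) == v for v in range(256))

# Mismatched pair: encode with /256, decode as if it had been /255.
drifted = [v for v in range(256) if from_unit_round(to_unit_256(v)) != v]

print(matched_255, matched_256, len(drifted))
```

In this sketch the mismatched pairing comes back one step low for every input above 128, so pure white (255) returns as 254. That is the sort of silent drift that decides whether a colour survives a pipeline intact.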
If there’s a thread across all four, it’s fit over raw power: a token that fits the gap in a model’s spatial reasoning, a dataset that fits an architecture borrowed wholesale, a division that fits the maths, and a slowdown that fits how someone actually wants to use their phone. Not a flashy week — a useful one.