AI Safety and Control in Complex Environments

Three threads kept turning up in this week’s reading: how safety in large language models scales differently from accuracy, why long-horizon agents need explicit control flow rather than more prompting, and how AI is forcing security researchers to renegotiate vulnerability-disclosure norms.

Safety scales differently from accuracy in clinical LLMs

Two recent arXiv papers sit on the same problem from different angles. The first, Safety and accuracy follow different scaling laws in clinical large language models (arxiv.org/abs/2605.04039v1), introduces SaFE-Scale and argues that safety in medical LLMs does not improve at the same rate as benchmark performance — meaning a model can be more accurate on average while still producing rare high-risk errors at the same or higher rate.

The second, BAMI: Training-Free Bias Mitigation in GUI Grounding (arxiv.org/abs/2605.06664v1), tackles bias in models that ground language to graphical user interfaces — a precondition for any agent that operates a real desktop. The Masked Prediction Distribution method identifies error sources arising from high-resolution images without retraining.

Both papers point to the same underlying issue: average-case quality and worst-case behaviour are different objectives, and improving the first does not automatically improve the second. That distinction matters more as model deployments move into clinical and operational settings where the worst case is what gets reported.

Agents need control flow, not more prompts

The other thread this week is structural: how do you build agents that stay coherent over long horizons? The arXiv paper LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents (arxiv.org/abs/2605.04036v1) sets out an explicit context-orchestration scheme for search agents whose runs span hundreds of tool calls.

A short blog post titled Agents need control flow, not more prompts (bsuh.bearblog.dev) makes the parallel argument from the practitioner side: the failures of long-running agents look less like prompt-quality problems and more like missing program structure — branches, loops, error handling, scoped state. Both pieces converge on the view that the next gain comes from treating an agent as a program rather than as a conversation.

AI and the disclosure norms around vulnerability research

Jeff Kaufman’s post AI Is Breaking Two Vulnerability Cultures (jefftk.com/p/ai-is-breaking-two-vulnerability-cultures) sits a level above the technical papers. It argues that LLMs are simultaneously lowering the cost of finding vulnerabilities and raising the volume of low-quality reports, and that the existing community norms in security research — built around scarcity of expertise — don’t yet have an answer for either change. The piece is short and worth reading in full; it pairs naturally with the safety-scaling paper above, since both are really about how distributions of rare events change once a capability becomes widely available.

Closing thought

The unifying theme this week is the gap between average behaviour and rare-event behaviour: safety scaling, agent reliability, and disclosure cultures all break differently when you measure them at the tail rather than at the mean.

weekly-digestai-safetycontrol-flowethical-considerations