Luke Simmons — Blog

Audio Console: a channel strip for movie night

2026-06-12T00:00:00Z

Author’s note: this post was drafted by Claude (Anthropic) from my project notes and source code, then reviewed and edited by me before publishing. The voice and judgments are mine; the typing isn’t.

The problem is one every movie watcher knows: dialogue is mixed for a cinema, so at home you ride the remote — up for the whispering, down fast for the explosion. The fix is also decades old: it’s called dynamic range compression, and every live sound desk does it on every channel, all the time. Audio Console is that fix as a small Windows app: capture the movie’s audio, run it through an EQ → compressor → limiter channel strip in real time, and play it out the speakers.

Built like one channel of a mixing desk

The design is borrowed deliberately from a live console — Allen & Heath’s Avantis — and two ideas from that desk shaped the whole app.

One channel strip does the job. Every channel on the desk runs the same chain: gain → high-pass filter → gate → EQ → compressor → limiter → fader. Taming a movie needs exactly that chain on one stereo feed — not 42 buses, not a surround processor, one good strip. The app’s signal path is that chain, plus a couple of macros:

movie/app audio → [capture] → input gain → 10-band graphic EQ
   → tone macros (bass / clarity) → compressor → ambience → stereo width
   → limiter → output gain → speakers

Separate the engine from the surface. The desk’s “mini” sibling is the same DSP engine with a smaller control surface, and the app copies that split: the DSP (biquad EQ, compressor, limiter) and the audio plumbing are their own modules; the Tkinter control panel is just one view bolted onto them. A web UI could replace it without touching the audio path.
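A minimal sketch of that split, assuming nothing about the real module layout: the stage names, the Settings fields, and the two stages chosen here are illustrative only. The point is that the engine is plain functions over audio frames, and a control surface only ever edits a settings object.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frame = Tuple[float, float]  # one stereo sample: (left, right)


@dataclass
class Settings:
    """Everything a control surface is allowed to touch (illustrative fields)."""
    input_gain: float = 1.0  # linear gain, 1.0 = unity
    width: float = 1.0       # 1.0 = unchanged, 0.0 = mono, above 1.0 = wider


def gain_stage(frame: Frame, s: Settings) -> Frame:
    left, right = frame
    return left * s.input_gain, right * s.input_gain


def width_stage(frame: Frame, s: Settings) -> Frame:
    """Mid/side widening: scale the side signal, leave the mid signal alone."""
    left, right = frame
    mid = (left + right) / 2.0
    side = (left - right) / 2.0 * s.width
    return mid + side, mid - side


Stage = Callable[[Frame, Settings], Frame]
CHAIN: List[Stage] = [gain_stage, width_stage]


def process(block: List[Frame], s: Settings) -> List[Frame]:
    """The engine: run every frame through the chain. No UI code in sight."""
    out = []
    for frame in block:
        for stage in CHAIN:
            frame = stage(frame, s)
        out.append(frame)
    return out
```

A Tkinter slider, a web form, or a test script can all drive the same process() call by changing Settings, which is the property the paragraph above is after.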

The console runs its chain on an FPGA for a fixed ~0.7 ms of latency, and charges accordingly. This app accepts software buffer latency in exchange for being free and fully hackable — and compensates at the player instead (more below).

The capture trick

Windows doesn’t let an app casually sit between another app and the speakers, so the routing goes through a free virtual audio device (VB-CABLE). The movie player’s output is pointed at the cable’s input; the app captures the cable’s output, processes, and plays to the real speakers. Capture and playback are different devices by construction, which is also what prevents a feedback loop.

The per-app version of this is the genuinely useful one: Windows’ volume mixer can route just the movie player into the cable while Discord and system sounds stay on the normal output. Only the film gets the treatment.

The controls that matter

The headline control is Dynamic Boost — a 0–100 macro that drives the compressor, pulling loud peaks down and lifting quiet dialogue up. That single knob is the actual fix for jumpy movie dynamics. Around it: Clarity (a presence + air shelf for dialogue intelligibility), Bass Boost, mid/side stereo widening, a light Ambience, a 10-band graphic EQ with a live response curve and a high-pass that kills explosion rumble, and a brickwall limiter as the safety net so no transient can ever blow past the ceiling.
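To make the Dynamic Boost idea concrete, here is a sketch of a static compressor curve driven by a single 0–100 macro. The mapping from the macro to threshold, ratio, and make-up gain is hypothetical (the numbers are mine, chosen to be readable, not the app's), and a real compressor also smooths the gain with attack and release times, which this leaves out.

```python
def boost_to_params(boost):
    """Hypothetical macro mapping: 0..100 -> (threshold_db, ratio, makeup_db)."""
    t = max(0.0, min(100.0, boost)) / 100.0
    threshold_db = -10.0 - 20.0 * t  # -10 dB at 0, -30 dB at 100
    ratio = 1.0 + 5.0 * t            # 1:1 (no compression) at 0, 6:1 at 100
    makeup_db = 12.0 * t             # lifts everything, quiet dialogue included
    return threshold_db, ratio, makeup_db


def compressor_gain_db(level_db, threshold_db, ratio, makeup_db):
    """Gain to apply, in dB, to a signal currently at level_db.

    Above the threshold the output rises 1 dB for every `ratio` dB of input;
    below it only the make-up gain applies.
    """
    over = max(0.0, level_db - threshold_db)
    reduction = over - over / ratio
    return makeup_db - reduction


if __name__ == "__main__":
    for level in (-40.0, -20.0, 0.0):
        print(level, compressor_gain_db(level, *boost_to_params(100)))
```

With the macro at 100 under these made-up numbers, a whisper at -40 dB is lifted while a peak at 0 dB is pulled down, which is the "ride the remote" job done automatically.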

Presets cover the obvious cases (Night - gentle, Dialogue boost, Cinema, Heavy - late night, Off), and there’s one design touch I’m pleased with: the moment you move any control, the preset flips to Personal and your settings autosave to disk. Next launch picks up exactly where you left off, and auditioning a built-in preset never overwrites yours. Set-and-forget is completed by a start-with-Windows toggle and auto-run on launch.

The honest limits

This is a prototype, and three trade-offs are worth stating plainly.

Latency. The audio takes a detour through a Python process, so it lags the video by a few tens of milliseconds. In practice you nudge the player’s A/V offset once (VLC does this in ~50 ms steps from the keyboard) and forget about it; a smaller block size buys less latency for more CPU.

Drift. The virtual cable and the speakers run on different clocks, so a tiny, rare click is possible over very long playback. A future version could resample to correct it; movie-length sessions haven’t needed it yet.

One band. It’s a single-band compressor, not multiband. For movies that’s plenty; mastering engineers may avert their eyes.
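The latency trade-off above comes down to simple arithmetic, sketched here. The 48 kHz sample rate and the block sizes are assumptions for illustration; the real lag is higher than one block, because capture and playback each hold at least a block and the devices add buffering of their own.

```python
def block_latency_ms(block_frames, sample_rate_hz):
    """How much audio one processing block represents, in milliseconds."""
    return block_frames / sample_rate_hz * 1000.0


def player_offset_ms(measured_lag_ms, step_ms=50):
    """Nearest A/V offset reachable in fixed steps (50 ms is the VLC step)."""
    return round(measured_lag_ms / step_ms) * step_ms


if __name__ == "__main__":
    for frames in (256, 512, 1024, 2048):
        print(frames, "frames =", round(block_latency_ms(frames, 48_000), 1), "ms")
```

Halving the block halves its share of the delay but doubles how often the Python callback has to run, which is the CPU cost mentioned above.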

The principled alternative is an in-pipeline processor like Equalizer APO, which runs as an Audio Processing Object inside the Windows audio stack, with no added loopback latency and no second device clock to drift against. If lip-sync ever bugs me more than the dynamics did, that’s the fallback. The trade as built buys me a UI I completely control and DSP I can read and modify, which, for a project that exists partly to be understood, is the point.

— Luke Simmons, Auckland

Control Plane: one window for every local service

2026-06-12T00:00:00Z

By the middle of this year I was running enough local services — this website’s dev server, a streaming aggregator, a political research tool, a finance dashboard, an LLM runtime, and more — that the overhead of operating them became its own problem. Each had its own start script, its own port, its own logs scrolling in its own terminal window. Worst of all was the orphan problem: stop a launcher script and its child process keeps the port, so the restart fails and you’re in Task Manager hunting PIDs.

Control Plane is the fix: a local desktop app — Rust backend, system WebView, built with Tauri — that starts, stops, and restarts each service, watches health, streams logs, embeds each service’s web UI in a tab, and runs the website’s deploy and publish jobs from one window. It also doubles as a Rust/Tauri learning project: there’s a Code tab inside the app that explains its own source, file by file.

(Screenshots are pending — the dashboard shows real machine paths and service names, so they need a scrub-and-crop pass before they go on the public internet. The fleet view on /projects/ is a faithful cousin in the meantime.)

The core insight

Every service I run already has a web UI on localhost. So the app only needs to be two things: a process supervisor — the part a browser can’t do — and a web-view shell that embeds the UIs that already exist. Everything else follows from refusing to build more than that.

The registry is the source of truth

Services are declared in a services.toml registry, one block each:

[[service]]
id        = "myapp"
name      = "my app (local)"
cwd       = 'C:\path\to\project'
command   = ['.venv\Scripts\python.exe', 'app.py']
port      = 5000                  # used by the health check
embed_url = "http://127.0.0.1:5000/"
managed   = true                  # false = watch-only (no start/stop)

The registry is re-read on every start and on the health poll, so edits take effect in seconds with no rebuild. Adding a service is a config change, not a code change. Some entries are deliberately watch-only — infrastructure the app reports on but must never start or stop, because it doesn’t own them.

The real registry is gitignored (it holds actual machine paths); a template ships in its place, and the in-app Code tab embeds the template, never the real file — so my filesystem layout can’t end up compiled into a binary.

The health-cache decision

The one architectural decision worth a section. Tauri runs synchronous commands on the UI thread, and an early version did live TCP health-checks inline in the list_services call — so every 3-second poll briefly froze the window, and a port that hit its connect timeout froze it properly.

The fix has two layers: a background thread probes every port on a ~2-second timer and writes results into a mutex-guarded cache, so the hot path reads memory and does no I/O; and every command that touches files, network, or subprocesses is declared async, which moves it to a worker thread. The rule of thumb that fell out, and that I’d reuse on any Tauri project: any command that does I/O is async; only pure instant reads stay sync.
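The first layer can be sketched in a few lines. This is a Python stand-in for the Rust original, with class and method names invented for the example: one background thread does the slow TCP connects, and the hot path only copies a dictionary under a lock.

```python
import socket
import threading


class HealthCache:
    """Background prober writes; UI-facing reads never touch the network."""

    def __init__(self, ports, interval_s=2.0, timeout_s=0.25):
        self._ports = list(ports)
        self._interval_s = interval_s
        self._timeout_s = timeout_s
        self._lock = threading.Lock()
        self._status = {port: False for port in self._ports}
        self._stop = threading.Event()

    def probe_once(self):
        """Try a TCP connect per port; the slow part runs outside the lock."""
        results = {}
        for port in self._ports:
            try:
                with socket.create_connection(
                    ("127.0.0.1", port), timeout=self._timeout_s
                ):
                    results[port] = True
            except OSError:
                results[port] = False
        with self._lock:
            self._status.update(results)

    def start(self):
        def loop():
            while not self._stop.is_set():
                self.probe_once()
                self._stop.wait(self._interval_s)

        threading.Thread(target=loop, daemon=True).start()

    def stop(self):
        self._stop.set()

    def snapshot(self):
        """Hot path: return a copy of the cached results, no I/O."""
        with self._lock:
            return dict(self._status)
```

The lock is held only for the dictionary update and the copy, so a port that hangs until its connect timeout delays the next refresh, never the caller of snapshot().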

Windows process management, the rabbit hole

Three gotchas, all handled in the supervisor, all earned the hard way:

Relative program paths. Command::current_dir does not affect how Windows resolves the program path — it looks next to the parent process. So a relative exe like .venv\Scripts\python.exe is joined with the service’s cwd into an absolute path before spawning.
.cmd shims. npm is really npm.cmd, which CreateProcess can’t launch directly. Anything shim-shaped gets wrapped as cmd /c npm ….
Tree kill. Killing a launcher orphans its children — the classic “stop script leaves the listener alive.” The supervisor walks the process tree with a Win32 Toolhelp snapshot and terminates each descendant, and finds the PID holding a port via GetExtendedTcpTable. It deliberately does not shell out to taskkill or netstat: those exact command lines trip antivirus “malicious command line” heuristics, which I learned by tripping them.
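The logic behind the three gotchas is small once the Win32 calls are set aside. The sketch below is Python rather than the supervisor's Rust, the function names and the list of shim names are hypothetical, and the process list is passed in as plain (pid, parent_pid) pairs standing in for what a Toolhelp snapshot would yield.

```python
from pathlib import PureWindowsPath

SHIM_SUFFIXES = {".cmd", ".bat"}
KNOWN_SHIMS = {"npm", "npx"}  # illustrative: bare names that are really .cmd files


def normalise_command(cwd, command):
    """Return the argv to actually spawn for a registry `command` list."""
    program, args = command[0], list(command[1:])
    path = PureWindowsPath(program)
    # Shim-shaped programs are launched through the command interpreter.
    if path.suffix.lower() in SHIM_SUFFIXES or program.lower() in KNOWN_SHIMS:
        return ["cmd", "/c", program, *args]
    # A relative path with a directory part is anchored at the service's cwd.
    if not path.is_absolute() and len(path.parts) > 1:
        program = str(PureWindowsPath(cwd) / path)
    return [program, *args]


def descendants(root_pid, processes):
    """All PIDs below root_pid, breadth-first, given (pid, parent_pid) pairs."""
    children = {}
    for pid, parent_pid in processes:
        children.setdefault(parent_pid, []).append(pid)
    found, queue, seen = [], [root_pid], {root_pid}
    while queue:
        current = queue.pop(0)
        for child in children.get(current, []):
            if child not in seen:  # stale parent PIDs can otherwise form loops
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found
```

A bare name such as python is left alone so the normal PATH lookup still applies; only paths with a directory component get joined to the service's working directory.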

The remaining hardening item I’d do next is Windows Job Objects, which reap the whole tree automatically even if the supervisor itself dies.

Deploy is not Publish

The Ops tab runs the website’s two outward-facing jobs as two separate buttons, because they are two separate acts: Deploy freezes the site and ships it to the host; Publish runs the curated source through a content scanner and pushes the mirror to GitHub. Different targets, different content sets — the site carries media the repo doesn’t, the repo carries source the site doesn’t. The publish path always runs non-interactively so it can’t hang on a prompt, and it is not allowed to bypass the scanner with a direct git push. The same tab watches the scheduled agents (blog digests, backups, the file organiser) with their live last-run/next-run state, and only the explicitly allowlisted ones get a “Run now” button.

The security posture, in one paragraph

Local only — the backend binds nothing to the network; it spawns local processes and embeds 127.0.0.1 URLs. No secrets in the registry — environment references, never inline keys. And the frontend can only call an explicit allow-list of commands across the IPC bridge. A supervisor with start/stop/kill powers is exactly the kind of tool that should be boring about security.

Where the design language came from

If you’ve read the projects page or noticed this site’s status dots and flat cards: that idiom started here. A card per service, an honest status glyph driven by a real health check, logs one click away, no decoration that doesn’t carry information. When I reworked the website, the Control Plane’s dashboard was the reference; the site now speaks the same language, and the projects page is in effect the public, read-only control plane of the ecosystem. The app came first; the aesthetic followed the operations.

— Luke Simmons, Auckland

Sovereign Suite: assembling an NZ-resident cloud

2026-06-12T00:00:00Z

Sovereign Suite is my answer to a question I kept circling: what would it take to give a New Zealand family the things they actually use Google and Facebook for — files, documents, photos, chat, a social feed — without ads, without behavioural tracking, and with the data physically in this country under NZ law? The working title is Aotearoa Cloud. A first slice of it is live on my own hardware, and my family can use it. This post is the vision, the one engineering rule that shapes everything, and the licensing story — which turns out to be the most interesting part.

The principles, because they decide everything

The design doc opens with principles rather than features, and every trade-off since has been resolved against them:

Data residency is absolute. All user data at rest lives on NZ infrastructure — including AI inference. No offshore processing. This is the product, not a feature.
No ads, no data sale, no tracking. Revenue, if this ever becomes more than a family service, is subscription-only.
Assemble, don’t reinvent. More on this below.
One identity, one bill, one shell. A single sign-on and a single subscription across every service; the user should never feel they’re juggling nine apps.
The AI is a guest in the user’s data, not the owner. Local open-weight models, per-user boundaries, a full audit log, revocable access. The planned private agent — ask questions across your own mail, files, and photos, with nothing leaving the country — is the differentiator, and it’s also the part I’ve deliberately scheduled after the boring foundations.

Assemble, don’t fork

The engineering rule: build on mature open source — Nextcloud for files, Collabora for documents, Immich for photos, Matrix for chat, Pixelfed and Mastodon for social, Keycloak for identity — and spend the actual engineering effort on integration, identity, UX, and the AI layer. Nobody needs me to re-implement a word processor.

The subtler half of the rule is don’t fork. Each app runs unmodified, as upstream ships it, and everything of mine talks to them over their network APIs. That’s partly maintenance sanity — unmodified apps take upstream updates forever — but it’s also, it turns out, the legal architecture.

The licensing story

I went into the licensing research braced for bad news and came out with the opposite: open source does not mean non-commercial. Every component in the stack permits charging for a hosted service. The obligations are about sharing modifications and respecting trademarks — not about whether you may profit. (Standard disclaimer: this is general information from my own notes, not legal advice, and the plan explicitly includes a real open-source-licensing lawyer before any significant scale.)

The core of it is the AGPL, which most of the big apps use (Nextcloud, Immich, Mastodon, Pixelfed, Synapse, OnlyOffice). AGPL §13 closes the “SaaS loophole”: if users interact over a network with a modified AGPL app, you must offer them your modified source. The key word is modified. Run the apps unmodified — the assemble-don’t-fork rule — and the duty is trivial: link upstream, publish your configs and any patches, keep the notices.

And the same rule is what keeps my own code mine. A launcher, an orchestrator, or a billing system that talks to Nextcloud over its HTTP API is a separate work, not a derivative — it doesn’t inherit the AGPL. Fork an app and edit its source, and everything you wrote there is AGPL-bound. The safe side of the line is a documented network protocol; the copyleft side is modifying or linking their code. Assemble-don’t-fork isn’t just an engineering preference; it’s the moat.

The rest of the licensing map, briefly: trademarks are separate from code licenses — you can’t market a service as “Nextcloud” or use upstream logos, which is exactly why the rebrand-to-our-own-name approach is correct. Collabora expects a paid subscription for production use (fair; or swap to OnlyOffice, which is AGPL). Redis relicensed to an anti-SaaS license, so the stack uses Valkey, the BSD drop-in fork. And LLM weights vary wildly — Llama carries use restrictions; Apache-licensed Qwen or Mistral models are the clean choice for anything commercial.

What’s actually running

The demo slice — files, collaborative documents, and a custom launcher page (the genuinely-ours part) — runs in containers behind a Caddy reverse proxy on my own machine, on the home LAN. Family members reach it through the router’s VPN rather than anything exposed to the public internet, and I’m keeping the specifics (addresses, ports, the exact access path) off this page on purpose: the suite’s whole premise is that its infrastructure is private.

Two build stories are worth telling honestly. Docker Desktop on Windows kept crashing on a component I couldn’t disable, so the stack pivoted to plain Docker Engine inside WSL2 — less convenient, far more reliable. Then came the mystery outage where the site was down for everyone but every check I ran passed: WSL2 idle-shuts-down the distro when nothing’s using it, and each diagnostic command I ran was itself rebooting the distro before testing it. A keepalive task holds it open now. The lesson generalises: a health check that can revive the thing it’s checking will lie to you.

Backups got the most careful engineering in the slice. The user data lives in Docker volumes inside WSL2, invisible to the Windows backup tool, so a nightly script exports it first: the file store goes into maintenance mode so the database and files stay mutually consistent (with a trap that guarantees it comes back off even on failure), the database is dumped uncompressed so the deduplicating backup tool sees mostly-unchanged content each night instead of a new opaque blob, and everything is written to a temp name then renamed so a half-written dump can never be swept up. The restore path isn’t theoretical either — I’ve restored the dump into a scratch container and verified table counts and file records match the live system. A backup you haven’t restored is a hope, not a backup.
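Two of those backup habits translate directly into code. The real script is a shell job; this Python sketch shows the same two guarantees under assumed names: a guard that always leaves maintenance mode, and a write that only ever appears under its final name once complete. It assumes the backup tool is configured to skip the temporary suffix.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def maintenance_mode(enable, disable):
    """Run the body with maintenance mode on; always switch it off afterwards."""
    enable()
    try:
        yield
    finally:
        disable()  # plays the role of the shell trap: runs on failure too


def write_atomically(path, data):
    """Write bytes to a temp name in the same directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic when both names share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The temp file is created next to the target on purpose: a rename is only atomic within one filesystem, so writing to a system temp directory and moving it across would lose the guarantee.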

What’s next

The backlog, in order: single sign-on across the apps, photos with phone auto-backup, then the local AI agent — the piece the whole design is really for — then mail, calendar, chat and video, then the private social layer. Each slice has to earn its place by being something my family actually uses, because the honest test of a Google alternative isn’t whether it runs. It’s whether people who don’t care how it works choose to keep using it.

— Luke Simmons, Auckland

Three doors into a knowledge game

2026-06-12T00:00:00Z

I’ve been trying to turn a knowledge base into a video game. Not an educational game in the quiz-with-graphics sense — an actual game, where understanding something real is the thing that lets you progress. Over one intense stretch I built five prototypes across three projects, and the most valuable output wasn’t any of them. It was a classification that now sorts every idea in this space in about five seconds. This is the writeup of the prototypes and that finding.

The vision

The dream, from the very first conversation: a game where you “go to university with real knowledge.” You don’t get quizzed or lectured. You explore a world, and the knowledge graph itself generates the play space — knowing a real relationship between two ideas lets you do something you otherwise couldn’t.

The genre this points at has a name now: the metroidbrainia. In a metroidvania, progression is gated by items; in a metroidbrainia, progression lives in the player’s head, not a save file. Outer Wilds, Return of the Obra Dinn, The Case of the Golden Idol, The Witness. You can replay any of them, but you can’t un-know them. The win condition is that you understood something — which is exactly the win condition of learning, so the fit is natural. The hard part is everything else.

Consilience: the confirmation board

The first prototype, playable at /labs/consilience/, is a Flask game in the Golden Idol mould. You wander a small university, study fact-nodes — each one a real, cited fact — and assemble what you’ve learned on a confirmation board to name a connection the departments never tell you about each other. The test revelation is a real one, the kind of thing two different lecture halls each know half of.

It works. People get the “oh!” moment. But mechanically, filling slots on a board from a set of collected tokens is recognition — and recognition, however nicely dressed, is a quiz with good taste. That nagging feeling is what eventually became the three-doors finding below.

One engineering invariant from Consilience carried forward into everything since: reachability is an invariant. Any required fact, puzzle, or goal must be provably reachable from the start, enforced with a check at build time — not by hope. A treasure no one can reach is a bug, even if every individual room is fine.
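A build-time check of that kind can be as small as a breadth-first search. This sketch assumes the world is a simple directed graph of named places and ignores gating (a door that needs a fact you collect elsewhere), which a real check would have to model as extra state; the room names are invented.

```python
from collections import deque


def unreachable(graph, start, required):
    """Return the required nodes that cannot be reached from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return sorted(set(required) - seen)


def check_build(graph, start, required):
    """Fail the build loudly instead of shipping a treasure no one can reach."""
    missing = unreachable(graph, start, required)
    if missing:
        raise SystemExit(f"build failed, unreachable: {missing}")


if __name__ == "__main__":
    world = {"gate": ["library", "quad"], "quad": ["lab"], "vault": ["lab"]}
    print(unreachable(world, "gate", ["lab", "vault"]))
```

Here the vault has a way out but no way in, so every individual room looks fine and the check still fails, which is exactly the class of bug the invariant exists to catch.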

Tickscape: the OSRS-shaped delivery vehicle

The second prototype answers a different question: what should moving through the world feel like? I’ve spent a lot of hours in Old School RuneScape, so Tickscape is a Rust prototype (macroquad) of the OSRS engine essentials (the 600 ms game tick, tile-based pathfinding, a skilling loop), delivering the Consilience content: walk a campus, study at fact-nodes, solve a revelation.

It proved the thing it was built to prove, with a sharp caveat: OSRS is the shell, not the game. A walkable world where knowledge has a place is genuinely valuable — place-memory is real; you remember where you learned something. But OSRS’s core verb is grinding, and grinding is exactly the wrong learning mechanic. So the engine survives as a wrapper for whatever the real mechanic turns out to be, and each “node” in that world should be a self-contained concept-toy.

The three doors

Here’s the finding that sorts everything. There are three ways to make a concept matter in a game:

Recognise it — pick the right answer from collected tokens. That’s a quiz with good taste. (Consilience’s board.)
Implement it — write the code that proves you get it. That’s a lecture, or a lab exercise. (An early Foundry node did this: write memoisation or the runtime’s time gate kills your exponential fib(45). Satisfying — and unmistakably coursework.)
Inhabit it — the concept is the physics of the world. You play, and understanding is the residue.

Only the third one is a game. The load-bearing sentence in my design notes is: the concept should be the toy, not the test. And its corollary: name the concept after the player wins, as a reward — never as a briefing. The games that already do this properly are the reference set: Patrick’s Parabox, Baba Is You, Turing Complete, Recursed.

Computer science turns out to be arguably the best-fit domain for door three, because the computer can run your understanding — something history can’t do. The trap is that “the computer runs it” drifts naturally toward door two and homework. The answer is to use executability to verify play, not to assign exercises.

The concept toys

To test door three directly I built three single-file web toys, each one CS concept made playable, each following the name-it-after-you-win rule:

“You Are the Search” — graph search. Your only two moves are Flood (expand the oldest discovered node) or Dive (expand the newest). That one choice is the difference between breadth-first and depth-first, and different maps reward different choices. Mechanically the cleanest of the three — and rejected, because a node-and-edge graph on screen looks like a textbook figure. Ironically lecture-y.
“Inside” — recursion. A room is split by a wall with no door. One box on your side is empty; the other contains the room it sits in, drawn recursively — you can see the room nested inside it, and the room inside that. Stepping into the self-containing box folds you across the wall. The draw function is itself recursive, to render recursion. This is the front-runner, and the reason is instructive: it’s the only prototype that’s a world with a character in it rather than an interactive diagram.
“The Mechanism” — logic gates. Wire a lock from AND/OR/NOT/NAND against a live truth table, building up to XOR from three gates — the whole “a computer is just gates stacked” lesson in one move. The deepest payload of the three, but also the most diagram-like, which is the same quality that sank the search toy.
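Two of those toys reduce to very little code, which is part of why they are tempting. The sketch below is not the toys' source: the function names and the sample map are mine, the search runs one move for the whole game rather than letting the player choose each turn, and the XOR is one possible three-gate wiring rather than necessarily the one the lock expects.

```python
def explore(graph, start, goal, move):
    """Expand nodes until the goal is expanded; move is "flood" or "dive"."""
    frontier, discovered, order = [start], {start}, []
    while frontier:
        # Flood expands the oldest discovered node (a queue: breadth-first).
        # Dive expands the newest discovered node (a stack: depth-first).
        node = frontier.pop(0) if move == "flood" else frontier.pop()
        order.append(node)
        if node == goal:
            return order
        for neighbour in graph.get(node, []):
            if neighbour not in discovered:
                discovered.add(neighbour)
                frontier.append(neighbour)
    return order


def nand(a, b):
    return not (a and b)


def xor(a, b):
    """XOR from three gates: an OR, a NAND, and an AND joining them."""
    return (a or b) and nand(a, b)


if __name__ == "__main__":
    campus = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
    print(explore(campus, "A", "E", "flood"))
    print(explore(campus, "A", "E", "dive"))
```

On this small map Dive reaches E after expanding three nodes and Flood needs five; move the goal to a shallow branch on a deep map and the advantage flips, which is the choice the toy asks the player to feel.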

The pattern across the verdicts is consistent: the more a prototype looks like the way the concept is taught, the worse it plays. The more it looks like a place you’re standing in, the better.

The honest open problems

Two risks have survived every prototype unsolved, and I’d rather state them than pretend the design is further along than it is.

Authoring is the whole ballgame. Curating which connections and puzzles make a learner actually gasp is taste-bound work, and it doesn’t scale with compute. The knowledge pipeline can generate content; it can’t generate revelations.

Discovery fires once. Each “oh!” is single-use per player, so content burns fast. The mitigation is framing: a finite, curated experience — a “playable course” of a few hours, like Her Story — rather than a thousand-hour live game. Which of those two products this is changes every other decision, and I haven’t decided. CS softens the problem a little, because doing (executable puzzles) is renewable practice layered on top of single-use revelations.

The project is paused deliberately, with a pick-up-here document whose most important section is a list of questions — what’s actually on the screen, what you’re doing moment to moment, where the knowledge lives — because the prototypes kept being good guesses at a picture in my head that I haven’t fully articulated yet. Five builds taught me the shape of the wrong answers. That’s worth more than it sounds.

— Luke Simmons, Auckland

University of Luke: a private university that doesn't make things up

2026-06-12T00:00:00Z

University of Luke is a personal academic hub: a knowledge base spanning 132 fields across 11 faculties, with a tutor you can ask questions, course and curriculum generators, flashcards with spaced repetition, and a cross-domain concept map — all running offline on the RTX 3070 in my desktop. A static export of it is browsable at /labs/university/, and the read-along book it feeds — a CS textbook with a 3D avatar and a narrated audiobook excerpt — is at /labs/read-along/.

The interesting part isn’t the feature list. It’s that the answers are accurate, and why they’re accurate.

The problem with small local models

Local LLMs in the 7B–14B range are attractive for all the obvious reasons: private, free to run, no internet required. They’re also, out of the box, unreliable for knowledge work. Ask a 7B to write about a topic from memory and it invents plausible-but-wrong specifics — dates, names, citations. The standard fixes both defeat the point: a bigger model doesn’t fit (a 14B needs ~10 GB and my GPU has 8, so it spills 40–60% onto the CPU and takes minutes per task), and calling a frontier API gives up the privacy, the offline operation, and the zero marginal cost.

An education tool that invents facts is worse than useless. So the constraint set was: one 8 GB consumer GPU, fully offline in the hot path, and accuracy that isn’t negotiable.

The insight: synthesis, not recall

The observation the whole system rests on: local models are good at synthesis and bad at generation from thin air. Asked to recall, a 7B confabulates. Given verified source material and asked to reshape it, the same 7B is fast and accurate. So the architecture splits the two jobs:

Authoring (one-time, high-effort, offline from the user’s perspective): a frontier model with web verification writes a deep, fact-checked knowledge base — the textbook.
Synthesis (local, on-demand, cheap): the small model only ever reshapes retrieved, verified passages into answers, courses, and quizzes. It never has to remember anything.

The one-line version: Claude writes the textbook; the local model teaches from it.

The build

The authoring pipeline ran as an agentic fan-out — on the order of 120 parallel agents, each web-verifying facts and citations for one field, each writing one structured module (a summary, nine sections, six key works). The output is ~289,000 words with 787 unique cited references, validated programmatically: schema checks, coverage checks against the catalogue, 132/132 modules present, zero gaps or orphans.

Serving is a straightforward retrieval loop: section-level embeddings (~1,200 passages, nomic-embed-text, cached on disk), cosine retrieval across all domains at once — which is what makes cross-disciplinary questions work — and then the local model synthesises a grounded answer with inline citations. The grounding policy is KB-first: answer from the verified corpus when it covers the query, and only fall back to live open APIs (Wikipedia, OpenAlex, Crossref and friends — 18 keyless academic APIs in total) when it doesn’t.
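The retrieval step and the KB-first decision can be sketched without any libraries. The three-dimensional vectors stand in for real nomic-embed-text embeddings, the passage ids are invented, and the 0.6 threshold is a placeholder: how the real system decides that the corpus "covers the query" is not something this sketch claims to reproduce.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, passages, k=3):
    """passages: list of (passage_id, vector). Returns top-k (score, id) pairs."""
    scored = [(cosine(query_vec, vec), pid) for pid, vec in passages]
    return sorted(scored, reverse=True)[:k]


def choose_source(hits, min_score=0.6):
    """KB-first: answer from the corpus if the best hit clears the threshold."""
    return "kb" if hits and hits[0][0] >= min_score else "live_apis"


if __name__ == "__main__":
    corpus = [
        ("physics/optics", [1.0, 0.0, 0.0]),
        ("biology/vision", [0.8, 0.6, 0.0]),
        ("history/rome", [0.0, 0.0, 1.0]),
    ]
    hits = retrieve([1.0, 0.2, 0.0], corpus, k=2)
    print(hits, choose_source(hits))
```

Because every field's passages sit in one list, a question about light can pull from physics and biology in the same ranking, which is the cross-disciplinary behaviour described above.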

One corpus, eight products on top of it: the tutor, course and curriculum generators, custom interdisciplinary “majors”, flashcards, quizzes, the concept map, and a bibliography. They all reuse the same retrieval and synthesis engine.

The 7B vs 14B decision, measured

The trade-off I care most about defending: the default model is the 7B, and that was decided by measurement, not vibes. Once grounded, the 7B produced specific, accurate output — an ML course with the correct 1956 Dartmouth date, transformers at 2017, real key works — in about 42 seconds. The 14B took 8–12 minutes for the same task on my hardware (that CPU spillover) and gave smoother prose but no better facts.

That’s the finding in one sentence: the architecture bought the accuracy, so the bigger model had nothing left to add except latency. Going by those timings (about 42 seconds against 8–12 minutes), the 7B is roughly 11× to 17× faster at near-equal quality.

How accurate, honestly

The claim I’ll stand behind is “zero observed hallucination on spot-checked specifics” — dates, names, attributions checked by hand against the corpus and the world. That is not the same as a formal evaluation harness with faithfulness scoring, which the project doesn’t have. The honest statement is: grounding plus verified authoring removed every fabrication I went looking for, and I went looking in the places small models usually fail. If I were productionising this for anyone but me, an eval gate (faithfulness and citation-accuracy scoring in CI) is the first thing I’d add, along with a real vector store in place of the file caches.

There’s also a quieter robustness layer that mattered in practice: long local generations survive disconnects, build queues persist and resume after a crash, and the system degrades gracefully when a source API is down. A local box is a messy place to run a pipeline; the code assumes that.

Try it

/labs/university/ is the static export — the faculties, fields, essays, and concept map, browsable as-is. The live tutor and generators need the local model running, so they stay on my machine, but the corpus they teach from is the thing you’re reading.

/labs/read-along/ is the same knowledge base in a different mode: a computer-science book authored from it, read along by a fully offline 3D avatar, with a ten-minute excerpt of the audiobook narrated by a clone of my voice. The full audiobook is 80 chapters and 1.6 GB, built locally; one excerpt ships because of file-size limits, and honesty about that beats pretending otherwise.

Stack, for the record: Python, Flask (stdlib-lean), Ollama running qwen2.5 7B/14B and nomic-embed-text, on-disk embedding and content caches, vanilla JS frontend, one RTX 3070 on Windows.

— Luke Simmons, Auckland

Two papers on giving models a grip on physical space

2026-06-07T00:00:00Z

The week’s most substantial work was about teaching models the physical world rather than just text about it. Two arXiv papers came at it from opposite ends — perception and action — and both leaned on the same lever: scale, plus a representation that actually fits the problem.

“Imaginative Perception Tokens” tackles a real weakness in vision-language models: they reason well about what’s in frame and badly about what isn’t. The paper adds tokens that let a VLM infer unseen viewpoints — stitching partial observations into a coherent sense of the space around an object, rather than only the pixels it was handed. The framing is the interesting part: spatial understanding treated as something a model imagines beyond its input, not a property it reads straight off the image. Whether the gains survive outside the paper’s own benchmarks is the usual open question, but the problem it names — models that can’t reason about the occluded or off-screen — is the right one to be chasing.

Humanoid-GPT comes at the body instead of the eye. It’s a GPT-style transformer trained on a billion-scale motion corpus for whole-body control, and it reports zero-shot motion tracking across varied scenarios — generalising to movements it wasn’t explicitly trained on. The headline is the recipe more than the result: take the same “scale the data, keep the architecture boring” approach that worked for language and point it at retargeted motion data. It’s another data point for the view that a lot of embodied-AI progress has quietly become a dataset problem.

Put the two together and the week reads as a small bet that the route to models which understand physical space runs through more data and better-fitted tokens, not new exotic architectures — the same trajectory language took, arriving a few years later for perception and motion.

The other two things worth flagging were smaller and sharper. A careful piece on 255 vs 256 division works through a detail every graphics programmer gets wrong at least once: whether to map an 8-bit colour channel to the 0–1 range by dividing by 255 or 256, and why the answer isn’t arbitrary. It’s the kind of low-level correctness that quietly decides whether colours round-trip cleanly through a pipeline. And “I made my phone slow on purpose” is a reflective note on deliberately degrading a device’s responsiveness to make it less compelling to reach for — a constraint-as-feature argument that cuts against the usual “make everything faster” reflex.
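
The 255-versus-256 point is easy to check by hand. The sketch below is not taken from the linked piece; it is a minimal illustration, with hypothetical helper names, of two facts: dividing by 255 maps full white to exactly 1.0 and round-trips every 8-bit value, while dividing by 256 never reaches 1.0.

```python
def to_unit(value: int, divisor: int) -> float:
    """Map an 8-bit channel value (0-255) into the unit range."""
    return value / divisor


def to_byte(x: float) -> int:
    """Map a unit-range float back to 8 bits: scale by 255, round, clamp."""
    return max(0, min(255, round(x * 255)))


# Dividing by 255: every one of the 256 values survives the round trip.
round_trips_255 = all(to_byte(to_unit(v, 255)) == v for v in range(256))

# Dividing by 256: full white (255) lands just short of 1.0.
white_255 = to_unit(255, 255)
white_256 = to_unit(255, 256)

print(round_trips_255)  # True
print(white_255)        # 1.0
print(white_256)        # 0.99609375
```

Which divisor is right depends on what the other end of the pipeline expects; the sketch only shows that the two choices are not interchangeable.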

If there’s a thread across all four, it’s fit over raw power: a token that fits the gap in a model’s spatial reasoning, a dataset that fits an architecture borrowed wholesale, a division that fits the maths, and a slowdown that fits how someone actually wants to use their phone. Not a flashy week — a useful one.

A counterexample, a scaling law, and a task with edges

2026-05-31T00:00:00Z

OpenAI published a result in which one of their models disproved a long-standing conjecture in discrete geometry — a counterexample that human mathematicians had not produced. The novelty here is less that “AI did mathematics” and more that the proof artefact is a concrete object that mathematicians can now verify and build on. That is a different mode of contribution from the usual “passes the benchmark” framing.

A complementary thread runs through the week’s arXiv drop. “Variance Reduction for Expectation with Diffusion Teachers” works on the numerical side of the same problem: improving the efficiency with which models estimate quantities they are nominally good at, by using diffusion-trained teachers as control variates. It is a small, sharp paper — the kind that is easy to overlook between training-run announcements but that quietly changes what is cheap to compute.

The second thread is theoretical. “LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws” reframes scaling-law work in information-theoretic terms, and uses that lens to talk about phenomena that have been treated as separate puzzles — catastrophic overtraining, quantisation-induced degradation, and the various non-monotonic behaviours people have noticed at the edges of training runs. Whether the framework survives empirical scrutiny is the next question; right now it is an organising story rather than a settled theory. Alongside it, “Good Token Hunting: A Hitchhiker’s Guide to Token Selection for Visual Geometry Transformers” is the more applied version of the same impulse — reducing input sequence length so visual-geometry transformers can predict multiple 3D attributes in a single forward pass without the usual quadratic blow-up in compute.

The third thread is applied agents — specifically the part of the agent literature that bothers to be evaluable. “PhotoFlow: Agentic 3D Virtual Photography Missions” asks an agent to infer suitable camera shots from scene information and natural-language intent. The interesting part is not that the agent can do it, but that the task is well-specified enough to measure. A lot of agent papers right now have the opposite shape: vague task, hand-waved evaluation, an evocative demo video. PhotoFlow has edges.

Put the three together and the week reads as a small step away from “bigger model, bigger benchmark” toward concrete artefacts — a real counterexample, a tighter theoretical lens, a task with a precise success criterion. None of it is a phase change on its own. The interesting question is whether the next three months keep producing this kind of work, or whether the next training-run announcement resets the conversation.

How a 7B local LLM actually processes a job listing

2026-05-28T00:00:00Z

Why this document exists

The post-mortem on Job Scout explains what failed: the 7B model was wrong about 30–40% of location verdicts, and it incorrectly excluded “Junior Front End Developer” as manual labour. This one is about mechanisms — why that kind of failure is predictable, not a bug or a bad prompt but a structural property of how these models work. The same pattern shows up everywhere people use small LLMs for judgment-shaped tasks.

Layer 1 — The Python wrapper

Here’s the score_job function from scout_mvp.py, with comments added for this walkthrough:

import json

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:7b"
MAX_TO_SCORE = 80

def score_job(job: dict) -> dict | None:
    # 1. Truncate long descriptions to fit the context window.
    #    3,500 characters ≈ ~900 tokens at typical English density.
    #    The model has 8,192 tokens total, so this leaves ~7,200 tokens
    #    for the system prompt + output.
    desc = job["description"]
    if len(desc) > 3500:
        desc = desc[:3500] + "\n…[truncated]"

    # 2. Assemble the user-turn payload: structured fields + description.
    #    This is the "document" the model will reason about.
    user_payload = (
        f"TITLE: {job['title']}\n"
        f"COMPANY: {job['company']}\n"
        f"LOCATION: {job['location_raw']}\n"
        f"TAGS: {', '.join(job['tags'])}\n"
        f"SALARY: {job['salary_raw'] or 'unspecified'}\n"
        f"---\n{desc}"
    )

    # 3. Ollama request body.
    body = {
        "model": MODEL,
        "prompt": user_payload,
        "system": SYSTEM_PROMPT,   # ~40 lines encoding Luke's profile + scoring rubric
        "format": "json",          # constrained decoding — explained in Layer 4
        "stream": False,           # wait for complete response, not token-by-token
        "options": {
            "temperature": 0.1,    # near-deterministic — peaks the probability distribution sharply
            "num_ctx": 8192,       # context window size in tokens
        },
    }

    # 4. Synchronous HTTP POST to the local Ollama server.
    #    timeout=180 because inference on 7B takes ~10-30s depending on load.
    try:
        resp = requests.post(OLLAMA_URL, json=body, timeout=180)
        resp.raise_for_status()
    except requests.RequestException:
        # Network error, timeout, or non-2xx status: skip this listing.
        return None

    # 5. Unwrap: Ollama returns {"response": "<json string>", "done": true, ...}
    #    The inner response is the model's output, already valid JSON (enforced
    #    by constrained decoding).
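    envelope = resp.json()
    raw_output = envelope.get("response", "")
    try:
        return json.loads(raw_output)   # parse model output into a Python dict
    except json.JSONDecodeError:
        # Empty or truncated output (e.g. generation was cut short): skip it.
        return None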

Then back in the main loop:

def composite(v: dict) -> float:
    # Weighted average of three 0–100 scores the model returned.
    return (
        0.45 * _num(v.get("growth"))
        + 0.35 * _num(v.get("relevance"))
        + 0.20 * _num(v.get("attainability"))
    )

Pipeline: fetch listing → format as text → POST to Ollama → get back JSON with growth/relevance/attainability + exclude flag + reason → compute composite → sort top 10. What happens inside that POST is where the interesting machinery lives.
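
To make the weighting concrete, here is a small worked example. The `_num` helper is not shown in the post, so the version below is a hypothetical stand-in (coerce to float, fall back to 0), and the three verdicts are invented for illustration.

```python
def _num(value, default: float = 0.0) -> float:
    """Hypothetical stand-in for the post's _num helper: coerce to float,
    falling back to a default when the model returned junk or nothing."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def composite(v: dict) -> float:
    # Same weights as the post: growth 0.45, relevance 0.35, attainability 0.20.
    return (
        0.45 * _num(v.get("growth"))
        + 0.35 * _num(v.get("relevance"))
        + 0.20 * _num(v.get("attainability"))
    )


# Invented verdicts; "C" is deliberately malformed.
verdicts = [
    {"title": "A", "growth": 75, "relevance": 60, "attainability": 50},
    {"title": "B", "growth": 40, "relevance": 90, "attainability": 80},
    {"title": "C", "growth": "high", "relevance": None},
]

ranked = sorted(verdicts, key=composite, reverse=True)
for v in ranked:
    print(v["title"], round(composite(v), 2))
```

Listing A works out to 33.75 + 21 + 10 = 64.75 and listing B to 18 + 31.5 + 16 = 65.5, so B ranks first despite the lower growth score. The malformed verdict scores 0 under this stand-in and sinks to the bottom rather than crashing the sort.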

Layer 2 — Ollama as a server

Ollama is a Go HTTP server wrapping llama.cpp. On first use it loads the Qwen2.5-7B weights (~4.5 GB in Q4_K_M quantisation) into the RTX 3070’s 8 GB of VRAM, with room left for the KV cache. The combined system prompt + user payload is tokenised via Qwen’s BPE vocabulary (~150K tokens); most job-listing payloads land around 800–1,200 of the 8,192-token window. The token sequence runs through the forward pass (Layer 3), generating output one token at a time with invalid JSON tokens masked at each step (Layer 4). Because stream is false, Ollama accumulates all 300–500 output tokens (each ~50–100 ms on the 3070, so 15–50 seconds in total) before returning, which is why the timeout is a generous 180 seconds. It then wraps the output in a response envelope:

{
  "model": "qwen2.5:7b",
  "response": "{\"excluded\": false, \"exclude_reason\": \"...\", \"growth\": 75, ...}",
  "done": true,
  "eval_count": 312,
  "eval_duration": 18400000000
}

The actual model output is the response field — a JSON string that score_job parses.
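
The timing fields in the envelope are enough to sanity-check the per-token figures above. A minimal sketch, using the example envelope with a shortened response string (the elided fields are left out, not guessed); Ollama reports eval_duration in nanoseconds.

```python
import json

# The example envelope from above, with the response string shortened.
envelope = {
    "model": "qwen2.5:7b",
    "response": "{\"excluded\": false, \"growth\": 75}",
    "done": True,
    "eval_count": 312,                 # output tokens generated
    "eval_duration": 18_400_000_000,   # nanoseconds spent generating them
}

seconds = envelope["eval_duration"] / 1e9
tokens_per_second = envelope["eval_count"] / seconds
ms_per_token = 1000 * seconds / envelope["eval_count"]

# The model's answer is a JSON string nested inside the JSON envelope.
verdict = json.loads(envelope["response"])

print(f"{seconds:.1f} s, {tokens_per_second:.1f} tok/s, {ms_per_token:.0f} ms/token")
print(verdict["growth"])
```

For this envelope that comes to about 17 tokens per second, or roughly 59 ms per token, inside the 50–100 ms range quoted for the 3070.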

Layer 3 — What the transformer actually does

This is the level most people skip, and it’s the one that explains the failure modes.

Tokens are not words

The model doesn’t see text. It sees a sequence of token IDs. “Junior Front End Developer” tokenises to something like [14571, 11657, 8770, 30567] — each ID is an index into the vocabulary. Before any computation, each ID gets converted to a high-dimensional vector (an embedding) — about 3,584 floats for Qwen2.5-7B.

28 layers of attention + feed-forward

The sequence of embeddings passes through 28 transformer blocks. Each block does two things:

- Self-attention: Each token looks at itself and every earlier token in the context (the model is decoder-only, so attention is causal) and adjusts its own representation based on which tokens are “relevant” to it. This is the mechanism that creates context: “Developer” can pull in signal from “Junior” and “Front End” earlier in the sequence. Concretely, it is a set of matrix multiplications over queries, keys, and values, expensive but parallelisable on GPU.
- Feed-forward network: A gated MLP applied to each position independently. This is where the model’s “world knowledge” mainly lives: associations baked in during pretraining.

After all 28 layers, each token position has a rich contextual embedding. The last token’s embedding is what matters for prediction.
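
A toy version of the attention step may help. The sketch below is single-head, pure Python, and feeds the same made-up 2-dimensional vectors in as queries, keys, and values; a real block uses learned projections, many heads, and 3,584-dimensional vectors. What it does show faithfully is the causal pattern: position i only mixes in positions 0 to i.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Position i scores itself and earlier positions only.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


# Invented vectors standing in for three token embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)

# Changing the last token leaves the earlier positions untouched.
x_changed = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out_changed = causal_attention(x_changed, x_changed, x_changed)
print(out[:2] == out_changed[:2])  # True
```

The first position can only attend to itself, so its output is its own value vector. That asymmetry is also why the last position is the one used for prediction: it is the only one that has seen the whole sequence.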

KV cache

Ollama caches the key and value matrices from each layer once the prompt has been processed, so each later generation step only computes keys and values for the newest token and attends it against the cached ones. This is why each generated token costs far less than re-running the whole sequence would.

Temperature 0.1

After the final layer, the model produces a logit vector — one float per vocabulary token. To turn logits into probabilities:

softmax(logits / temperature)

With temperature: 1.0 the logits pass through unchanged, so you sample from the model’s own distribution. With temperature: 0.1 every logit is multiplied by 10 before the softmax, which sharpens the distribution dramatically: the top token gets almost all the mass. You’re almost always sampling the mode, so output is consistent across runs. That doesn’t make the output correct. If the model’s probability estimates are off, it makes the output consistently wrong in the same way.
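
The sharpening is easy to see numerically. A minimal sketch with three made-up logits (a real vocabulary has roughly 150K entries):

```python
import math


def softmax_with_temperature(logits, temperature):
    """softmax(logits / temperature), computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # invented values for three candidate tokens

p_warm = softmax_with_temperature(logits, 1.0)
p_cold = softmax_with_temperature(logits, 0.1)

print([round(p, 4) for p in p_warm])  # top token gets roughly 0.63
print([round(p, 4) for p in p_cold])  # top token gets more than 0.9999
```

The ranking of the tokens never changes; only the gap between them does. If the top-ranked token is the wrong one, a low temperature just picks it more reliably.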

Layer 4 — Constrained decoding and why it makes hallucinations look confident

format: "json" enables Ollama’s constrained decoding mode. This is the feature most responsible for the confident-looking wrong answers.

How it works

At each generation step, before sampling, Ollama applies a logit mask derived from a JSON grammar and the current partial output. Any token that would produce syntactically invalid JSON has its logit set to negative infinity, so its probability after the softmax is zero. The model can only generate tokens that keep the JSON valid: no unclosed strings, no missing commas, no unquoted keys, no wrong structural types. As long as generation runs to completion, the output is syntactically valid JSON, even if the model is completely confused about the content.
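
As a toy illustration of the masking step (not Ollama’s actual implementation, which derives the allowed set from a grammar), the sketch below uses a four-token vocabulary with invented logits. The token the model would prefer in free-form mode is not valid JSON at this point, so it is masked and the best valid token wins.

```python
import math


def apply_mask(logits: dict, allowed: set) -> dict:
    """Set the logit of every disallowed token to negative infinity."""
    return {tok: (s if tok in allowed else -math.inf) for tok, s in logits.items()}


def softmax(logits: dict) -> dict:
    m = max(logits.values())
    exps = {tok: math.exp(s - m) for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Invented logits for the next token after the partial output {"growth":
logits = {"I": 3.2, "7": 2.1, '"': 1.0, "}": 0.3}

# Tokens that keep the JSON valid here, within this toy vocabulary.
allowed = {"7", '"'}

unmasked_choice = max(logits, key=logits.get)
probs = softmax(apply_mask(logits, allowed))
masked_choice = max(probs, key=probs.get)

print(unmasked_choice)  # I
print(masked_choice)    # 7
print(probs["I"])       # 0.0
```

Nothing in the mask knows whether 7 is a sensible growth score. It only knows that a digit is legal here and the start of a sentence is not.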

Why this is dangerous

The mask enforces syntax, not semantics. The model still has to produce some value for exclude_reason, so it produces whatever string has the highest probability given the context — plausible-sounding, not necessarily correct. For “Junior Front End Developer”, the high-probability completion after "requires " (given the physical-office and “hands-on” cues in the description) was "manual labour". Likely given the preceding tokens; not grounded in what the role actually requires.

Constrained decoding makes this worse in one specific way: correct and incorrect JSON look identical. Free-form hallucinations ramble, hedge, contradict themselves — they’re visible. JSON-mode hallucinations are crisp, structured, and authoritative-looking. The "nz_remote_eligible": false verdicts that were wrong 30–40% of the time came with coherent exclude_reason values like "appears to require US work authorization". Structured, plausible, wrong.

The replacement: what app.py does instead

SQLite schema (same as before, no change here)

CREATE TABLE IF NOT EXISTS jobs (
    id           TEXT PRIMARY KEY,
    source       TEXT NOT NULL,
    source_id    TEXT,
    url          TEXT,
    title        TEXT,
    company      TEXT,
    location     TEXT,
    description  TEXT,
    posted_at    TEXT,
    salary       TEXT,
    tags         TEXT,
    fetched_at   TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'new'
);

No score, no exclude_reason, no growth, no attainability. The database stores what was actually observed — nothing inferred by a model that might be wrong.
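
A quick way to see the schema in action is an in-memory database. The sketch below is illustrative only: the id format, the comma-joined tags string, and the sample listing are assumptions, not values from app.py.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id           TEXT PRIMARY KEY,
    source       TEXT NOT NULL,
    source_id    TEXT,
    url          TEXT,
    title        TEXT,
    company      TEXT,
    location     TEXT,
    description  TEXT,
    posted_at    TEXT,
    salary       TEXT,
    tags         TEXT,
    fetched_at   TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'new'
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Hypothetical listing; only observed fields are stored, nothing inferred.
conn.execute(
    "INSERT INTO jobs (id, source, title, location, tags, fetched_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    (
        "remoteok:123",
        "remoteok",
        "Junior Front End Developer",
        "Remote (Worldwide)",
        "javascript,react",
        datetime.now(timezone.utc).isoformat(),
    ),
)

row = conn.execute(
    "SELECT title, location, status FROM jobs WHERE id = ?", ("remoteok:123",)
).fetchone()
print(row)
```

The only column the human ever changes is status, which starts at 'new' by default. Flags such as loc:remote are computed at read time from the stored strings, so a regex fix applies to old rows without a migration.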

Location classification — deterministic regex

AUCKLAND_RE  = re.compile(r"\bauckland\b", re.I)
NZ_CITIES_RE = re.compile(
    r"\b(wellington|christchurch|hamilton|tauranga|dunedin|napier|nelson|"
    r"rotorua|new plymouth|palmerston north|whangarei|invercargill|queenstown|"
    r"hastings|gisborne|whanganui|pukekohe)\b",
    re.I,
)
NZ_RE     = re.compile(r"\b(new zealand|aotearoa|\bnz\b|north island|south island)\b", re.I)
REMOTE_RE = re.compile(r"\b(remote|worldwide|anywhere|telecommute|wfh|work from home)\b", re.I)

def classify_location(location: str) -> str:
    """Return one of: loc:auckland, loc:remote, loc:nz-other, loc:overseas-onsite, loc:unknown."""
    if not location or location.strip() in ("—", "-"):
        return "loc:unknown"
    has_remote = bool(REMOTE_RE.search(location))
    has_akl    = bool(AUCKLAND_RE.search(location))
    has_other_nz_city = bool(NZ_CITIES_RE.search(location))
    has_nz     = bool(NZ_RE.search(location))

    if has_akl:
        return "loc:auckland"
    if has_remote and not has_other_nz_city:
        return "loc:remote"
    if has_other_nz_city or (has_nz and not has_remote):
        return "loc:nz-other"
    if has_remote:
        return "loc:remote"
    return "loc:overseas-onsite"   # conservative: unknown → assume overseas

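This doesn’t infer that “Remote (US)” is NZ-ineligible from plausible-sounding reasoning. It looks at the string and returns a label. Empty or placeholder locations return loc:unknown and stay in the feed (don’t hide things you’re unsure about), because the eligibility filter only drops loc:nz-other and loc:overseas-onsite. A non-empty string that matches none of the patterns falls through to loc:overseas-onsite, the conservative default on the last line. The LLM’s failure was inventing reasons why a listing was ineligible; this function doesn’t reason, it pattern-matches on observable strings.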
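
Running the classifier over a few strings shows the behaviour directly. The block repeats the definitions from above so it runs on its own; the sample locations are invented.

```python
import re

AUCKLAND_RE = re.compile(r"\bauckland\b", re.I)
NZ_CITIES_RE = re.compile(
    r"\b(wellington|christchurch|hamilton|tauranga|dunedin|napier|nelson|"
    r"rotorua|new plymouth|palmerston north|whangarei|invercargill|queenstown|"
    r"hastings|gisborne|whanganui|pukekohe)\b",
    re.I,
)
NZ_RE = re.compile(r"\b(new zealand|aotearoa|\bnz\b|north island|south island)\b", re.I)
REMOTE_RE = re.compile(r"\b(remote|worldwide|anywhere|telecommute|wfh|work from home)\b", re.I)


def classify_location(location: str) -> str:
    if not location or location.strip() in ("—", "-"):
        return "loc:unknown"
    has_remote = bool(REMOTE_RE.search(location))
    has_akl = bool(AUCKLAND_RE.search(location))
    has_other_nz_city = bool(NZ_CITIES_RE.search(location))
    has_nz = bool(NZ_RE.search(location))

    if has_akl:
        return "loc:auckland"
    if has_remote and not has_other_nz_city:
        return "loc:remote"
    if has_other_nz_city or (has_nz and not has_remote):
        return "loc:nz-other"
    if has_remote:
        return "loc:remote"
    return "loc:overseas-onsite"


samples = [
    "Auckland, New Zealand",  # loc:auckland
    "Remote (Worldwide)",     # loc:remote
    "Remote (US)",            # loc:remote: no eligibility verdict is invented
    "Remote, Wellington",     # loc:nz-other: a named NZ city outranks "remote"
    "Sydney, Australia",      # loc:overseas-onsite
    "",                       # loc:unknown
]
for s in samples:
    print(repr(s), "->", classify_location(s))
```

“Remote (US)” comes back as loc:remote, so it stays in the feed and the human reads the listing to decide. That is the opposite trade-off to the LLM, which hid listings on the strength of a guess.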

Focus classification — tiered regex matching

PRIORITY_RE = re.compile(
    r"\b(python|rust|machine[- ]learning|deep[- ]learning|pytorch|"
    r"llm|gpt|claude|nlp|ai[- ]engineer|ml[- ]engineer|mlops|"
    r"generative[- ]ai|rag\b|retrieval[- ]augmented|ai[- ]agent|"
    r"rlhf|ai[- ]safety|alignment[- ]research|"
    r"developer[- ]advocate|devrel|developer[- ]relations)\b",
    re.I,
)
TECH_RE = re.compile(
    r"\b(engineer|developer|software|programmer|devops|sre|infrastructure|"
    r"backend|frontend|fullstack|technician|network|audio|sound|broadcast)\b",
    re.I,
)
NONTECH_RE = re.compile(
    r"\b(sales[- ]executive|account[- ]executive|accountant|bookkeeper|"
    r"chef|cook|driver|cashier|cleaner|nurse|hospitality|warehouse|forklift)\b",
    re.I,
)
DOMAIN_NONTECH_RE = re.compile(
    r"\b(pharmacy|aviation|civil[- ]engineer|traffic[- ]engineer|"
    r"structural[- ]engineer|mechanical[- ]engineer|electrical[- ]engineer|"
    r"chemical[- ]engineer|food[- ]technologist)\b",
    re.I,
)

def compute_flags(job: dict) -> list[str]:
    flags = [classify_location(job.get("location") or "")]

    title     = job.get("title") or ""
    raw_tags  = job.get("tags") or ""
    desc_head = (job.get("description") or "")[:1500]
    haystack  = " ".join([title, job.get("company") or "", raw_tags, desc_head])

    has_priority    = bool(PRIORITY_RE.search(haystack))
    title_tech      = bool(TECH_RE.search(title))
    title_nontech   = bool(NONTECH_RE.search(title)) or bool(DOMAIN_NONTECH_RE.search(title))
    has_tech_any    = title_tech or bool(TECH_RE.search(desc_head))
    has_nontech_any = title_nontech or bool(NONTECH_RE.search(haystack)) or bool(DOMAIN_NONTECH_RE.search(haystack))

    if has_priority:
        flags.append("focus:priority")
    elif title_nontech:
        flags.append("focus:non-tech")
    elif has_tech_any:
        flags.append("focus:tech")
    elif has_nontech_any:
        flags.append("focus:non-tech")

    return flags

Title-level non-tech signals override description-level tech signals, the inverse of the LLM’s behaviour (a PRIORITY_RE hit anywhere in the haystack still outranks both). A front-end developer job that mentions “you’ll be hands-on” doesn’t get excluded. DOMAIN_NONTECH_RE handles cases like “Electrical Engineer”, which in a job-listing context almost always means a utilities role rather than software.

Filter endpoint — human does the judgment

@app.route("/api/jobs")
def api_jobs():
    q       = (request.args.get("q") or "").strip().lower()     # keyword search
    source  = request.args.get("source") or ""                   # remoteok | remotive | hn | ...
    status  = request.args.get("status") or "active"             # active | starred | hidden
    eligible = request.args.get("eligible", "1") == "1"          # filter to AKL + remote only
    focus   = request.args.get("focus", "tech")                  # priority | tech | all
    days_param = request.args.get("days", "30")                   # recency filter

    # ... (SQL query + Python-side filtering) ...

    # eligibility filter: drop overseas-onsite and NZ-other-only
    if eligible:
        loc = next((f for f in flags if f.startswith("loc:")), "loc:unknown")
        if loc in ("loc:nz-other", "loc:overseas-onsite"):
            continue

    # focus filter: priority | tech | all
    if focus == "priority":
        if "focus:priority" not in flags:
            continue
    elif focus == "tech":
        if not ("focus:priority" in flags or "focus:tech" in flags):
            continue

    # keyword search with word boundaries (prevents "rust" matching "trust")
    if q_patterns:
        hay = " ".join([title, company, location, description[:3000], tags])
        if not any(p.search(hay) for p in q_patterns):
            continue

    # Priority-tagged jobs float to the top
    out.sort(key=lambda j: 0 if "focus:priority" in j["flags"] else 1)
    return jsonify(out)

The dashboard gives the human:
- Location + focus filters: AKL/remote toggle and priority/tech/all tiers
- Keyword search: word-boundary regex, so ?q=rust doesn’t surface “trust administration”
- Star / hide / recency: persistent per-listing status in SQLite, plus 7/30/90-day windows
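
The endpoint excerpt uses q_patterns without showing how it is built. One plausible construction, offered as an assumption rather than as what app.py does, is to split the query on whitespace and wrap each escaped term in word boundaries:

```python
import re


def build_patterns(q: str) -> list:
    """Hypothetical helper: one word-boundary regex per search term."""
    return [
        re.compile(r"\b" + re.escape(term) + r"\b", re.I)
        for term in q.lower().split()
    ]


def matches(hay: str, patterns: list) -> bool:
    # Same test as the endpoint: keep the listing if any term matches.
    return any(p.search(hay) for p in patterns)


patterns = build_patterns("rust")

print(matches("Senior Rust engineer, remote", patterns))  # True
print(matches("Trust administration officer", patterns))  # False
```

One caveat with this approach: \b only fires between a word character and a non-word character, so a term ending in punctuation such as “c++” will not match when it is followed by a space. Terms like that would need their own handling.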

The human looks at the filtered list and decides what to apply for. Triage takes ~10 minutes for 40 listings. That’s all it needs to do.

The actual comparison

Dimension	LLM scoring (scout_mvp.py)	Regex + dashboard (app.py)
Location verdicts	~60–70% accurate	~95% accurate (on clearly-stated locations)
Focus classification	Inconsistent; over-excluded on edge cases	Consistent; transparent exclusion logic
Latency	~15–30s per listing × 80 listings = 20–40 min	Sub-second for any filter change
Failure mode	Confident wrong answers, hard to spot	Miss (unknown label) rather than misflag; transparent
Debuggability	“Why did the model exclude this?” requires re-running	Read the regex; it’s a function with 10 lines
Where the judgment sits	Model (unreliable)	Human (reliable, and that’s appropriate)

The lesson as a system design principle

Both systems do the same top-level task: surface job listings worth reading. They differ in where they put the hard parts. scout_mvp.py offloaded the fuzziest cases — “is this actually manual labour?”, “is this truly NZ-remote-eligible?” — to the model. Constrained decoding forced confident-looking output, but the output was pattern-matching against surface text, not reasoning about the real-world referent. app.py keeps the hard parts with the human: deterministic classification on observable signals, human judgment on the rest.

The principle: route extraction tasks to small models or code; route judgment tasks to humans or large models. Extraction is “pull these fields out of this text.” Judgment is “decide whether this text implies a real-world property it doesn’t directly state.” Small models are reliable at the first and unreliable at the second. This isn’t an argument against local LLMs — it’s an argument for being precise about which sub-task you’re giving them, and testing on the hard cases, not the easy ones.

Companion to: Why I dropped a 7B local LLM from my job aggregator — the same project, one layer higher.

— Luke Simmons, Auckland

Why I dropped a 7B local LLM from my job aggregator

2026-05-28T00:00:00Z

TL;DR

I built a job aggregator that used a local 7B model (Qwen2.5-7B via Ollama, on an RTX 3070) to score each listing for relevance, growth, and attainability against my profile. It worked end-to-end in one sitting. It also returned verdicts that were systematically wrong in two specific, consistent ways. I dropped the LLM scoring layer and replaced it with a dashboard that surfaces the data and lets me — the human — do the judgment. The model still has a place in the system, just not the place I originally gave it.

This isn’t a “local LLMs are bad” post. It’s a “local 7B models can’t do calibrated judgment under category-edge ambiguity, even when the prompt is good” post — and the spec I wrote before building this literally hedged on exactly that. The lesson is about which sub-tasks of the system the model is suitable for, not whether to use one at all.

What I built

A vertical slice: pull a feed, normalise it, score each listing with a local LLM, print the top 10 to stdout. ~230 lines of Python, no database, no dedup.

Stack: RemoteOK’s public JSON API for fetching (~50–100 listings per call), normalisation into a canonical record with HTML stripped, then per-listing scoring via a local Ollama instance running qwen2.5:7b. Composite score = 0.45 * growth + 0.35 * relevance + 0.20 * attainability, each axis a 0–100 integer the model returned.

The system prompt was the load-bearing artifact: ~40 lines encoding my candidate profile (early-career technical generalist in Auckland, AI-engineering aspiration), in-scope domains, hard exclusions (manual labour, non-NZ-eligible remote), and the three scoring axes.

The hypothesis: a well-prompted local 7B model can do this include/exclude + rough-scoring work reliably enough to be useful. Not perfect accuracy — just “directionally right often enough that I trust the top 10.”

It wasn’t.

What broke

Two failure modes, both consistent, both fundamental enough that no prompt tweak fixed them.

1. Over-aggressive exclusion on category-edge cases

The prompt’s hard-constraint clause: “EXCLUDE any role requiring manual labour, physical lifting, driving, trades, warehouse, hospitality, or on-feet-all-day work.” Reasonable rule, meant for delivery drivers and hospitality managers.

What it actually did: it excluded “Junior Front End Developer” with exclude_reason: "requires manual labour".

This wasn’t a one-off. The model latched onto incidental words in long descriptions (“ship features,” “build pipelines,” “hands-on work”) and pattern-matched them into the manual-labour bucket. Mentions of a physical office, on-call rotation, or “fast-paced environment” did the same. The model wasn’t reasoning about the role — it was running fuzzy keyword similarity against my exclusion list, then writing a plausible-sounding justification.

For a 7B model in JSON mode, this is the dominant failure pattern. Plenty of capacity to produce structured output and pattern-match on surface features. Not enough capacity to hold “what does this role actually require” in mind while also holding the rubric.

2. Hallucinated location constraints

The location clause required reading the listing carefully and deciding whether the listed location was NZ-eligible (Auckland, NZ-wide, or remote explicitly open to NZ/APAC/worldwide).

What it did instead: it invented constraints that weren’t in the listing. A “Remote (Worldwide)” tag would come back as nz_remote_eligible: false with exclude_reason: "appears to require US work authorization" — even when the listing said nothing about US authorization. Conversely, some US-only roles came back as nz_remote_eligible: true because the description mentioned “we hire globally” in a recruiting blurb that didn’t reflect actual policy.

Roughly 30–40% of location verdicts were wrong, in both directions. The model was generating plausible location reasoning rather than grounded location reasoning. JSON-mode output made the hallucinations more confident-looking, because they came wrapped in structure.

Why these matter

Either failure mode alone would have been a tuning problem. Together they composed into something worse: the surviving top-10 was systematically biased away from exactly the roles I most wanted to see. Front-end and full-stack junior roles got excluded by the manual-labour misfire. NZ-eligible remote roles got excluded by the hallucinated US-only constraint. I read the digests for about a week. The signal-to-noise was worse than browsing RemoteOK directly, which is the failure condition that matters.

What I should have caught before building

The spec I wrote before coding the MVP literally said this:

8B model ceiling: good enough for include/exclude and rough scoring. If you want it to reason about subtle career-fit nuance, that’s where the home-lab box and a larger model would earn their keep — but prove you need it first.

I read that, agreed with it, and then built as if I disagreed with it. The fix isn’t “trust the model more.” It’s recognising that judgment under edge cases is exactly the kind of subtlety the spec was hedging against. The MVP wasn’t really testing whether the model handled the easy cases (it did). It was testing whether it handled the hard cases: the category boundaries between knowledge work and manual labour, between truly-global remote and let’s-say-we-hire-globally remote. It didn’t.

The test that matters isn’t whether the model handles the typical case. It’s whether it handles the cases where the rubric and the data are both fuzzy at the same time.

What replaced it

I dropped the LLM scoring layer entirely. The current system is a Flask dashboard (SQLite-backed) that renders the full job list with filter affordances and lets me do the judgment myself. Triage is fast — 40 listings in 10 minutes.

The honest framing: the AI was supposed to do the part of the work that’s actually mine to do. The aggregator’s value is “Luke’s eyes on the right data, fast” — not “AI tells Luke what to apply for.” Removing the LLM made the system simpler, faster, and more honest about where the value comes from.

What I’d do if I were doing it again

I’d put the LLM back in — but on different sub-tasks, and with a different model class.

A 7B model is good at summarisation (compress a 4,000-character description into a 40-word brief), tag extraction (stack, seniority, remote policy, salary range; structured extraction is its home turf), and cross-source deduplication (embedding similarity beats keyword matching). It’s bad at judgment-against-rubric where the rubric edges are fuzzy. For that, either I do it myself, or I use a Claude/GPT-4-class model and accept the per-call cost: 80 listings/day at ~$0.01 each is about $0.80 a day, or $24 over a 30-day month, which is trivial.

The architectural decision I’d make differently from day one: separate “extraction” tasks from “judgment” tasks in the design, not just in retrospect. A v2 of Job Scout would explicitly route those to different model classes — local for extraction, cloud for judgment, human for ranking.
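
A sketch of what that routing could look like in code. Everything here is hypothetical: the task names, the Route enum, and the default are illustrations of the design, not part of Job Scout.

```python
from enum import Enum


class Route(Enum):
    LOCAL_MODEL = "local 7B model"
    CLOUD_MODEL = "cloud model"
    HUMAN = "human"


# Hypothetical task names, split along the extraction/judgment line.
ROUTES = {
    "summarise": Route.LOCAL_MODEL,
    "extract_tags": Route.LOCAL_MODEL,
    "dedupe": Route.LOCAL_MODEL,
    "judge_eligibility": Route.CLOUD_MODEL,
    "rank": Route.HUMAN,
}


def route(task: str) -> Route:
    # Anything unclassified goes to the human rather than to a model guess.
    return ROUTES.get(task, Route.HUMAN)


def monthly_cloud_cost(listings_per_day: int, cost_per_call: float, days: int = 30) -> float:
    """Rough cost of sending every listing to the cloud model once."""
    return listings_per_day * cost_per_call * days


print(route("extract_tags").value)       # local 7B model
print(route("judge_eligibility").value)  # cloud model
print(route("something_new").value)      # human
print(round(monthly_cloud_cost(80, 0.01), 2))
```

Making the table explicit means a new sub-task has to be assigned to a tier on purpose. The default sends anything unassigned to the human, which is the failure direction the MVP should have had.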

The MVP taught me where the boundary lives. That’s worth more than the MVP itself.

The wider lesson, briefly

A lot of 2026 product discussion treats “local LLM” and “cloud LLM” as a deployment-cost trade-off — same capability, different cost curve. That framing misses the point. The local-vs-cloud line is also a capability line for judgment-shaped tasks, and it’s sharper than people who haven’t tried it think. A 7B model on consumer hardware can do extraction, summarisation, classification on clean categories, and structured rephrasing well enough to be useful. It cannot do calibrated judgment when the rubric and the data are both ambiguous — and most real-world filtering problems are exactly that.

If you’re designing a system that uses local LLMs for anything, the test that matters is not “does it work on the easy cases.” It’s “what does it do on the cases where a human would have to think carefully.” If the model can’t, route those to a larger model or to a human. Don’t try to prompt your way out of a capability gap.

Part of an ongoing series of project writeups. Companion piece: How a 7B local LLM actually processes a job listing — same project, one layer deeper.

— Luke Simmons, Auckland

Cloning my own voice

2026-05-27T00:00:00Z

The narrator on the intro for my portfolio site isn’t me reading a script. It’s a model reading a script in my voice. I built it from about 25 seconds of audio I recorded at my desk in one take, it runs locally on an RTX 3070, and to my ear it’s a good likeness — not a perfect match, and I don’t want it to be. That’s the build log.

The stack

The engine is Chatterbox-Turbo from Resemble AI — a ~350M-parameter zero-shot TTS model, MIT-licensed, released as chatterbox-tts==0.1.7 on PyPI. Zero-shot is the load-bearing word: there’s no fine-tuning step. You hand the model a reference clip of the target voice and a string of text, and it produces speech in that voice. The weights don’t change between calls. The voice “training data” is just the reference clip.

I considered the obvious alternatives. Fish Speech (Apache 2.0) was the closest contender and lost on a blind test; XTTS-v2, F5-TTS, and Kokoro were all eliminated on license or fit before they got that far. The decision rule was strict: MIT or Apache only, because the YouTube channel this feeds into is monetised and I’d rather not discover a licensing problem after the fact.

The rest of the stack is Python 3.11 in a fresh venv, Torch + CUDA, and FFmpeg on PATH for muxing the narration onto scene images. Hardware is a Ryzen 7 7700X with an RTX 3070 (8 GB VRAM). VRAM peak on a two-line smoke run was ~3.3 GB. The 8 GB ceiling only becomes a real constraint if I ever try to fine-tune on a larger corpus of my own audio, which I haven’t and probably won’t.

The training data isn’t really training data

This part is worth being honest about. With a zero-shot model, what most people would call “training” is really “providing a reference clip.” There is no gradient step. The model has already been trained on huge multi-speaker corpora; what I’m giving it is a prompt in audio form that tells it which speaker to mimic.

The reference clip is ref/my_voice.wav — about 25 seconds of me reading paragraph-length text into a USB condenser mic in my home office. Mono, 24 kHz WAV, with a light cleanup pass in Audacity.

The quality lever everyone misses is that clone quality is dominated by reference quality, not model choice. So the care went into the recording, not the code: a quiet room, a consistent distance from the mic, and no hard breath before the first word. The clone inherits prosody and energy from the sample, so the calm, level read in that clip is the register the narrator speaks in. The model copies what you give it.

If your clone sounds bad, the answer is almost never “use a bigger model.” It’s “record a better reference.”

The first time it sounded convincing

The smoke test was two throwaway lines:

“This is the first line of the smoke test narration.”
“And this is the second line, generated as a separate file so a single bad take can be re-rolled.”

First take, played through my desk speakers, I had the genuine “that’s me” reaction: the timbre was right, the pace was right, the way I clip the end of “narration” was right. The first generation that worked was also the first generation, full stop. Zero-shot earns its name.

What it got wrong early — and still gets wrong occasionally — falls into three categories: proper nouns (less-common technical terms like Chatterbox or Ollama land the stress wrong, fixed by spelling them phonetically in the script), hard consonant transitions (back-to-back sibilants occasionally smear), and question inflection (rising intonation is hit-or-miss, and rephrasing as a statement is more reliable than relying on the question mark).

The pipeline writes one .wav per line, so generate.py --only 7 re-rolls line 7 in isolation when a take lands wrong.
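A minimal sketch of that one-file-per-line layout, not the real generate.py. Only the --only flag comes from the script itself; the synthesize callable (a stand-in for the actual TTS call), the line_NNN.wav naming, and the function names are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def select_lines(lines, only=None):
    """Return (line_number, text) pairs, 1-indexed; just one pair if `only` is set."""
    numbered = list(enumerate(lines, start=1))
    if only is None:
        return numbered
    if not 1 <= only <= len(lines):
        raise ValueError(f"--only {only} is outside 1..{len(lines)}")
    return [numbered[only - 1]]


def generate(lines, out_dir, synthesize, only=None):
    """Write one .wav per selected line; re-rolling a line overwrites its old take."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in select_lines(lines, only):
        target = out_dir / f"line_{number:03d}.wav"  # hypothetical naming scheme
        synthesize(text, target)                     # hypothetical TTS hook
        written.append(target)
    return written


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Generate narration, one file per line.")
    parser.add_argument("--only", type=int, default=None,
                        help="re-roll a single line (1-indexed)")
    return parser.parse_args(argv)
```

Because each line maps to its own file, a bad take costs one regeneration rather than a full re-render of the script.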

How good is it, honestly

I haven’t run a blind test — I haven’t played it to anyone and asked them to guess — so I’m not going to claim nobody could tell. The narrower, true thing is this: to my own ear it’s a good representation of how I sound. The timbre is right and the pace is close.

It is not a perfect match, and I don’t want it to be. An undetectable clone of my own voice isn’t a goal I have; a good-enough narrator that’s clearly disclosed is. The artefacts a careful ear could catch are the usual ones for zero-shot TTS: a slight flatness to the prosody on longer sentences, occasional consonant smearing on fast clusters, and a consistent lack of the breath-and-restart pattern you get from a real human reading aloud. None of these jump out in short narration. All of them would show over a 10-minute audiobook.

That’s the honest bound: it works for short-form narration. It would not hold up over a 30-minute podcast, where the cumulative absence of human pacing irregularities would start to register as off.

The disclosure call

The choice that took the most thought wasn’t technical. It was framing.

Using a voice clone as the narrator on my own portfolio site is meta. It’s also exactly the kind of thing that could feel deceptive if I didn’t say what it is. So I made the rule explicit, for myself: the narrator is disclosed as a clone, on the page, in plain text. The site copy frames it as “the narrator is an AI voice clone of Luke’s voice” — not buried in a footer, but in the intro context where you’d encounter it on first watch.

The reasoning is consequentialist, not deontological. I don’t think there’s anything inherently wrong with using a voice clone. I do think there’s something wrong with using one in a context where the listener would reasonably assume it’s a human, and not telling them. Disclosure is the easy fix for the easy version of the problem.

One technical detail worth mentioning: Chatterbox bakes in PerthNet (Implicit), Resemble AI’s inaudible audio-provenance watermark, on every output. The audio is identifiable as Chatterbox-generated to anyone running a detector. It’s the right default for a model used on monetised content.

Where I’ll use it next, and where I won’t

Will use:

YouTube intro narration on the portfolio site (the current target).
Voiceover for short project demos — sub-2-minute videos where the disclosure framing is in the intro.
Drafts of longer narrations, to hear how a script lands before I decide whether to record it properly.

Won’t use:

Anything where the clone could plausibly be mistaken for me speaking live, without disclosure.
Long-form podcasts or audiobooks. The honest quality bound says: not yet.
Anything that involves saying things I wouldn’t actually say. The model is willing. I’m the constraint.

The next question, and the one I’m deferring, is whether to fine-tune on a larger sample of my own audio. Zero-shot at this quality already clears the bar for the disclosed, short-form use cases I actually have. I’ll revisit if it starts to feel limiting. It hasn’t yet.

— Luke Simmons, Auckland

Designing a smart home on my terms

2026-05-26T00:00:00Z

What I’m actually trying to build

Most smart-home writeups start with the wants: “I want the lights to turn on when I get home.” Mine starts with the don’t-wants, because that list decided every other choice in the stack.

What I don’t want:

No cloud accounts in the control path. If my internet drops, every automation in the house should keep working. If a vendor turns off their servers in 2030, no light switch in my house should become a paperweight.
No always-on microphones. No Alexa, no Google Assistant, no Apple HomePod listening in the lounge. I don’t trust the threat model and I don’t want the ambient surveillance even if I did.
No app per device. The number of single-purpose apps that ship with consumer IoT gear is a usability disaster and a security one. One control surface or it doesn’t go in.
No phoning home. If a device’s only way to reach me is via the manufacturer’s cloud relay, it doesn’t belong on this network.

What I do want:

Lights, climate, and presence-aware automations that just work, locally, without me thinking about them.
A single dashboard I actually look at, not buried in three vendor apps.
A platform I can poke at with code — the automation layer is exactly the kind of place where a small LLM running on my home-lab box could earn its keep, and I want that door left open.
An exit ramp. Every device and every protocol should be replaceable without ripping the rest out.

That set of constraints rules out about 80% of the consumer smart-home market in one swing. The remaining 20% is what this writeup is about.

The stack

The radio-and-protocol layer is the part most writeups skip past. It’s also the part that decides everything downstream, so it’s worth being explicit about.

Radio layer

Zigbee for the bulk of the sensor fleet: 2.4 GHz mesh, mains-powered devices act as repeaters, fully local once you have a USB coordinator dongle and a broker, biggest catalogue, lowest per-device cost. The downside is sharing 2.4 GHz with Wi-Fi, which is manageable with channel planning.
Thread for new buys where there’s a certified Matter-over-Thread option that’s actually mature. In early 2026 that’s a much smaller list than the marketing suggests, so the plan starts Zigbee-heavy.
No Wi-Fi sensors: cheap Wi-Fi devices are almost always cloud-locked, power-hungry for anything battery-driven, and add noise to a band I’d rather keep clean.

Application + brain

Home Assistant, running locally. The staging setup right now is Home Assistant OS in a VM on my main desktop — good enough to design against, deliberately not the production target. Production lives on a dedicated small box once the design stops moving. Zigbee2MQTT will talk to a USB coordinator (still choosing between SkyConnect, Sonoff ZBDongle-E, and ConBee III based on Z2M compatibility for the device list I end up with). No HA Cloud, no Nabu Casa. Remote access, if I want it later, goes through a self-hosted reverse proxy or a Tailscale tailnet — not anyone’s relay.

The Home Assistant choice is the load-bearing one. Of the platforms I considered, it’s the only one that takes “fully local, mixes protocols, owns its own state” as a first principle rather than a marketing checkbox. Apple Home is more polished and SmartThings has wider out-of-box device support, but both route through a cloud I don’t want in the path.

Hardware

Aqara for most of the planned sensors — motion, door/window, temperature/humidity. Zigbee, cheap, pair cleanly with Zigbee2MQTT, unobtrusive form factors. The first-wave fleet is small — under twenty devices across the three categories — because the right scope for v1 is “one room, end to end” rather than “every room, half-built.” Smart switches are the next layer after sensors, and the irreversible constraint there is the neutral wire at the switch box — a wiring decision that has to be made before the wall closes.

How the rooms will be organised

Room-first, not device-first. Entities follow a flat area.device.function naming scheme; areas group them into the unit a human actually reasons about (“the lounge is occupied”); scenes are explicit named states rather than ad-hoc on/off lists; automations are intent-named (presence.lounge.arrive, climate.bedroom.sleep) so future-me reading the YAML in eighteen months can tell what each one is for without opening it.

The local-only constraint shows up in a specific way: every trigger has to resolve from local state. No cloud webhooks, no IFTTT-style “phone entered geofence as reported by Google.” Presence detection itself isn’t implemented yet — that’s the most interesting open design question, and the one most likely to change once I have real day-to-day data. The candidates I’m weighing are mmWave (Aqara FP2 or similar), Bluetooth tracking via ESPresense, and motion-only-as-baseline. mmWave is the strongest signal but adds cost and a sensor per room; Bluetooth tracks phones rather than people; motion is the cheapest and the worst at telling “sitting on the couch” apart from “walking through.”

Where this sits today

Honestly: the design is locked, the staging VM runs, and no production hardware is on the network yet. This writeup is the design log, not a deployment log. The reason it exists at this stage is that the architectural decisions — local-first, no microphones, exit-ramps, room-first composition — are the part most consumer smart-home writeups skip past, and the part that decides everything else downstream. Writing them down now is how I avoid drift once the first box of Aqara sensors lands.

Known design tensions

A few open questions the design doesn’t fully answer yet, kept here as honest TODOs:

Zigbee + Wi-Fi coexistence on 2.4 GHz. Channel planning is straightforward in theory but only verifiable once both networks are running real traffic. I’ll know within a week of bringing up the first room whether the placement assumptions hold.
Presence detection. Genuine fork in the road; I’ll pick after one round of testing rather than committing now.
Automation race conditions. Multiple automations racing over the same target state is the category I’m watching for as more rules go in. Hasn’t bitten yet — but also hasn’t had the chance to, because the rules aren’t live.

What’s next, in order

Pick the Zigbee coordinator dongle.
Stand up Home Assistant on the chosen production box (off the desktop VM).
Start with one room — motion + door sensor + a single smart switch — wired end to end, before scaling.
Document what actually broke against this design, in a follow-up post. That one will be a deployment log, not a design doc.

The version of this house that exists in a year will have more sensors, smarter presence, and probably an LLM somewhere in the loop. The version that exists today is a design I trust enough to start buying hardware against — which is the only version of any of this I was ever interested in shipping.

— Luke Simmons, Auckland

A personal finance dashboard built around Akahu

2026-05-25T00:00:00Z

The problem

I was doing my own financial reconciliation in spreadsheets, which is what most people do, and like most people I was doing it badly. Every month I’d download the bank exports, ctrl-F for each expected payment, tick them off, chase the missing ones, then categorise the rest of the spend by hand. The first time I did it, it took an evening. The second time, two hours. Around the third month I stopped, and the spreadsheets quietly went out of date.

The actual pain wasn’t any single one of those tasks. It was the reconciliation step — the part where you sit down with a CSV and your own memory and try to match them against each other. That’s the work the dashboard exists to remove.

What it does

The app is a local-only web app. It runs on my laptop, on my home network, and is never exposed beyond the machine: the database holds real bank data, so it has to stay off the internet. The browser tab has a small number of sections: an overview with cashflow charts and balances, a transaction view with filter and re-categorise affordances, and a recurring-payments view that highlights anything expected-but-missing.

The whole thing is wired to Akahu, a New Zealand open banking API, which is the single biggest reason this tool exists at all. More on that below.

The stack — and what I’d do differently now

Backend is Node + Express. Frontend is React via Vite. Database is SQLite through better-sqlite3. Charts are Recharts. Styling is Tailwind. There are ~31 API endpoints in server.js (~1,000 lines) and a handful of backend modules covering the schema, the Akahu client, the categoriser, and the import/sync pipeline.

If I were starting today I’d build this in Python with Flask. Not because Node is bad at this — it’s fine at it, and the app works — but because every other project I currently maintain is Python or Rust, and the cognitive switching cost of jumping between Node-flavoured async and Python-flavoured everything-else is real overhead when you’re maintaining six projects at once. Consistency matters more than language merit at my scale.

I’m not going to rewrite it. That would be a couple of weeks of work for zero new functionality, and the existing code runs. But I won’t start another project in this stack, and that’s the honest version of “what would you choose differently.” The constraint I set later — Python or Rust only — was set partly because of this project. Maintaining a one-off Node app inside an otherwise Python portfolio taught me that picking a stack per project is a tax I keep paying forever.

Why Node was the right call at the time: I’d just come off a React project, the frontend was going to be the thing I actually saw every week, and going same-language across the stack felt like the right move when I was building this in spare evenings and couldn’t afford a Python-to-React context switch every time I wanted to add a column. It was a defensible choice for that moment. It’s just not the choice I’d make from where I am now.

Akahu is the leverage point

The thing that makes this dashboard worth maintaining isn’t the React UI. It’s that I never have to enter a transaction by hand.

Akahu is an NZ open banking aggregator — it brokers OAuth-style access to the major NZ banks. The user goes through Akahu’s portal once, approves enduring consent, and from then on the app can pull transactions from connected accounts on demand. The OAuth flow is the standard authorization-code-for-access-token dance. The token gets stored in SQLite’s config table and reused on every sync.

The integration is in backend/akahu.js. The shape of it:

Hit /accounts to get the list of connected accounts.
For each account, call /accounts/{id}/refresh to force Akahu to poll the bank for fresh data (otherwise you get cached results — important during reconciliation).
Wait ~8 seconds for the refresh to land. Yes, an 8-second setTimeout. Akahu doesn’t give you a webhook for refresh-complete on the tier I’m on, and polling for it added more code than the sleep was worth.
Fetch transactions with cursor-based pagination, 100 per page, until the cursor runs out.
Run each transaction through the categoriser and INSERT OR IGNORE into SQLite, keyed on Akahu’s unique transaction ID.
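The steps above can be sketched as one sync function. This is written in Python rather than the Node of backend/akahu.js, and the client object is a hypothetical wrapper: list_accounts, refresh, and transactions are illustrative names, not Akahu’s SDK, and the table layout is invented for the example. Refreshing every account first and then waiting once is also an assumption about ordering.

```python
import sqlite3
import time


def sync(client, db, categorise, refresh_wait=8.0, sleep=time.sleep):
    """Pull every connected account into SQLite; return the number of new rows.

    `client` is a hypothetical wrapper exposing list_accounts(), refresh(id)
    and transactions(id, cursor) -> (page, next_cursor_or_None).
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS transactions ("
        "akahu_id TEXT PRIMARY KEY, account_id TEXT, description TEXT, "
        "amount_cents INTEGER, category TEXT)"
    )
    before = db.total_changes
    accounts = client.list_accounts()
    for account in accounts:
        client.refresh(account["id"])   # ask for fresh data, not cached results
    sleep(refresh_wait)                 # fixed wait instead of polling for completion
    for account in accounts:
        cursor = None
        while True:
            page, cursor = client.transactions(account["id"], cursor)
            for tx in page:
                # The unique transaction ID makes re-syncing idempotent.
                db.execute(
                    "INSERT OR IGNORE INTO transactions VALUES (?, ?, ?, ?, ?)",
                    (tx["id"], account["id"], tx["description"],
                     tx["amount_cents"], categorise(tx)),
                )
            if cursor is None:          # cursor ran out: that was the last page
                break
    db.commit()
    return db.total_changes - before
```

Passing sleep in as a parameter keeps the 8-second wait out of any test run while leaving the production behaviour unchanged.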

The three things that were hard:

Auth. OAuth is fine when there’s documentation. The fiddly part was the difference between the app token (identifies my app to Akahu) and the user token (identifies me to Akahu as a person who’s granted access). Both go on every request, in different headers. I burned an evening on that.

Dedup between two sources. There are two sources of truth: CSVs imported from before Akahu was connected, and the live Akahu feed afterwards. They overlap, and they don’t share IDs. Imported rows get a composite unique index on (date, amount_cents, description, reference). Akahu rows get a unique index on the Akahu ID. The two never collide because the imported rows have akahu_id IS NULL and the composite index is a partial index scoped to that condition. SQLite’s partial unique indexes carried that design. Postgres supports them too, so the design isn’t SQLite-specific.
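A sketch of that index pair using Python’s built-in sqlite3 module. The column names follow the ones mentioned above; the index names, the insert_tx helper, and the choice to default reference to an empty string (so a missing reference still participates in the uniqueness check) are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE transactions (
    id           INTEGER PRIMARY KEY,
    akahu_id     TEXT,                      -- NULL for CSV-imported rows
    date         TEXT    NOT NULL,
    amount_cents INTEGER NOT NULL,
    description  TEXT    NOT NULL,
    reference    TEXT    NOT NULL DEFAULT ''
);
-- Live-feed rows: unique on the Akahu ID (NULLs never conflict with each other).
CREATE UNIQUE INDEX ux_tx_akahu ON transactions (akahu_id);
-- Imported rows: composite key, scoped to rows without an Akahu ID.
CREATE UNIQUE INDEX ux_tx_imported
    ON transactions (date, amount_cents, description, reference)
    WHERE akahu_id IS NULL;
"""


def insert_tx(db, date, amount_cents, description, reference="", akahu_id=None):
    """INSERT OR IGNORE one row; return True if it was actually inserted."""
    before = db.total_changes
    db.execute(
        "INSERT OR IGNORE INTO transactions "
        "(akahu_id, date, amount_cents, description, reference) "
        "VALUES (?, ?, ?, ?, ?)",
        (akahu_id, date, amount_cents, description, reference),
    )
    return db.total_changes > before


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
```

In this sketch each source stays unique on its own key: re-importing the same CSV row is a no-op, re-syncing the same Akahu ID is a no-op, and an Akahu row with the same date, amount, and description as an imported row is still accepted because the composite index ignores it.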

Categorisation rule ordering. The categoriser is a list of ordered rules matching payee name and reference. The interesting wrinkle is that the same payee can appear under several different formattings — banks change how they emit payee names between financial years, merchants rebrand, payment networks restructure. The categoriser has to be a list of patterns rather than a lookup table because the source data isn’t stable. There’s a load-bearing comment in the file explaining why one rule has to come before another that looks similar — different reference field, different category. Stuff like that is the actual work of an internal tool.
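To make the ordering point concrete, here is a small first-match-wins categoriser in Python. The payees, categories, and the Rule shape are all made up for the example; the real rules live in the app’s Node categoriser and will look different.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    category: str
    payee: str                       # regex, matched case-insensitively
    reference: Optional[str] = None  # optional regex on the reference field


RULES = [
    # Order matters: the narrower rule (payee plus reference) has to come
    # before the broader payee-only rule, or it would never fire.
    Rule("Insurance", r"acme", r"policy"),
    Rule("Utilities", r"acme"),
    # One pattern covering two formattings of the same payee.
    Rule("Groceries", r"corner\s*store"),
]


def categorise(payee, reference="", rules=RULES, default="Uncategorised"):
    """Return the category of the first rule whose patterns all match."""
    for rule in rules:
        if not re.search(rule.payee, payee, re.IGNORECASE):
            continue
        if rule.reference is not None and not re.search(
            rule.reference, reference, re.IGNORECASE
        ):
            continue
        return rule.category
    return default
```

Swapping the first two rules would send every "acme" transaction to Utilities, which is the kind of silent miscategorisation the ordering comment guards against.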

What’s modelled and what isn’t

In the schema: accounts, transactions, recurring payment expectations, transaction categorisation rules, plus a small config table for the Akahu tokens and a few user preferences.

Not in the schema: anything resembling double-entry bookkeeping. This is a dashboard, not an accounting system. The dashboard exists to make my own reconciliation fast, not to replace anyone else’s accounting workflow.

Lessons from a tool with one user

You can ship ugly when you’re the only user. The UI has rough edges. Tab spacing is inconsistent. Some modules borrow half their components from others and the seams show. None of that matters because the only person who sees it is me, and I built it, and I know which buttons do what.

You can also let an ugly internal tool rot, and that’s the trap. The dashboard is in active use because I designed it around one painful task — reconciliation — and made that task take 90 seconds instead of an hour. As long as that core loop stays fast, I’ll keep opening the app, and the rest of it stays current by osmosis. The week I stop reconciling is the week the whole app starts decaying. The discipline isn’t “keep it pretty.” It’s “keep one loop sharp.”

The other lesson: an internal tool’s value is the integration it owns, not the code it runs. The Akahu connection is the value. The React shell could be Flask, could be a Streamlit prototype, could be a CLI with a Rich table — and the dashboard would be equally useful. What it couldn’t be is “manually downloaded CSV files every month.” That’s the line.

If I built v2, the Akahu integration is the only piece I’d keep verbatim. Everything else I’d port to Python so it stops being the odd-one-out in my portfolio. But there’s no v2 on the roadmap, because v1 still does the job. That’s the bar for a small internal tool: does it still do the job. Not: is it the stack I’d pick today.

— Luke Simmons, Auckland

Organising the files on my machine, safely

2026-05-24T00:00:00Z

The mess

The trigger was a specific moment: I was building my website, wanted photos of my grandfather (David Roy Simmons, ethnologist, 1930-2015), and after twenty minutes of clicking around realised the photos I needed had been sitting inside Windows Mail’s attachment cache the whole time. The full path, for the record:

C:\Users\Luke Simmons\AppData\Local\Packages\
  microsoft.windowscommunicationsapps_8wekyb3d8bbwe\
  LocalState\Files\S0\4\Attachments\

774 files, 227 MB, completely invisible to normal browsing. That was the prompt to audit the rest. A few highlights:

C:\Users\Luke Simmons\Downloads: 5,680 files, 3.47 GB — mostly .jar and .class debris from a university Minecraft modding project.
D:\Downloads: 929 files, 19.40 GB, most under 30 days old — the actual active Downloads folder.
D:\Documents and D:\Hard Drive - SONY: near-identical extension profiles. One was almost certainly a forgotten backup of the other.

In total, roughly 7,500 files in active staging locations plus ~2,700 photo candidates to consolidate. None of it would fit in my head, which is why I’d been ignoring it.

The two-pronged tool choice

I installed Everything (Voidtools) for instant filename search. That solved half the problem — the Mail-cache discovery would have taken thirty seconds with Everything installed, instead of a year of low-grade frustration.

For the actual sorting I picked organize — Python, MIT-licensed, YAML-configured. The rules live in config\config.yaml and organize sim runs the whole pipeline in dry-run mode without touching disk. The rules themselves are boring (e.g. D:\Downloads images → D:\Photos\Inbox\{created.year}-{created.month}\). Boring is the point. The cleverness lives in everything around the rules engine.

The safety design — this is the writeup

This isn’t really “a file organiser.” It’s a set of safety gates wrapped around a file organiser. Every design decision was about not eating my data.

Rule 1: never delete. Move, don’t rm. When the destination is unclear, the rule moves the file to D:\Quarantine\YYYY-MM-DD\ and leaves it there for a week. I have not yet found a “delete this file” rule worth writing.

Rule 2: dry-run first, every time, no exceptions. organize sim walks every rule against the real filesystem and prints what would move. The hard rule: no rule runs for real until it’s had a full successful dry-run pass with output I’ve actually read. This caught at least one bug per rule on average.

Rule 3: every operation writes a JSONL log line. I wrote a custom organize action (actions/log_move.py, ~80 lines) that wraps shutil.move and shutil.copy2 with structured logging. Every move writes one line to logs\moves.jsonl:

{"ts": "...", "rule_id": "downloads-images", "reason": "...",
 "mode": "move", "src": "...", "dest": "...", "size": 123456}

The log is the audit trail and the rollback trail — if a rule goes wrong I can scan the JSONL, find every file the bad rule touched, and reverse those specific moves.
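A sketch of what that rollback could look like, assuming the log fields shown above (rule_id, mode, src, dest). The function names are hypothetical, and it defaults to a dry run so the plan can be read before anything moves.

```python
import json
import shutil
from pathlib import Path


def moves_for_rule(log_path, rule_id):
    """Read the JSONL log and return the 'move' entries written by one rule."""
    entries = []
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if entry.get("rule_id") == rule_id and entry.get("mode") == "move":
                entries.append(entry)
    return entries


def rollback(log_path, rule_id, dry_run=True):
    """Reverse one rule's moves, newest first; return the (from, to) pairs planned."""
    planned = []
    for entry in reversed(moves_for_rule(log_path, rule_id)):
        src, dest = Path(entry["src"]), Path(entry["dest"])
        # Skip if the file is no longer at dest, or if something now sits at
        # the original path: a rollback should never overwrite anything.
        if not dest.exists() or src.exists():
            continue
        planned.append((dest, src))
        if not dry_run:
            src.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(dest), str(src))
    return planned
```

Copies are deliberately ignored here: a copy leaves the original in place, so there is nothing to put back.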

Rule 4: copy-only mode for irreversible sources. The Mail attachment cache is the canonical case. Windows Mail may still need the originals, so the rule that pulls from microsoft.windowscommunicationsapps_8wekyb3d8bbwe\LocalState\Files\ runs in mode="copy", never mode="move". The log_move helper also short-circuits if a same-name same-size file already exists at the destination.
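A simplified sketch of a log_move-style helper, not the real actions/log_move.py. It is a plain function rather than an organize action class, the signature is an assumption, and where the real tool has its own conflict handling this version simply refuses to overwrite a destination that differs in size.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def log_move(src, dest, rule_id, reason, log_path, mode="move"):
    """Move or copy src to dest and append one JSONL line; return True if it acted."""
    src, dest, log_path = Path(src), Path(dest), Path(log_path)
    if mode not in ("move", "copy"):
        raise ValueError(f"unknown mode: {mode!r}")
    size = src.stat().st_size
    if dest.exists():
        if dest.stat().st_size == size:
            return False  # same name, same size already there: nothing to do
        raise FileExistsError(f"refusing to overwrite {dest}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    if mode == "copy":
        shutil.copy2(src, dest)            # original stays where it is
    else:
        shutil.move(str(src), str(dest))
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "rule_id": rule_id,
        "reason": reason,
        "mode": mode,
        "src": str(src),
        "dest": str(dest),
        "size": size,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return True
```

The early return is what makes a scheduled copy rule idempotent: a second run over the same cache finds the file already in place and writes nothing.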

Rule 5: off-limits directories are explicit. The config has a fixed exclusion list — D:\SteamLibrary, the Hyper-V VM, D:\ai-website-manager (this website), plus the usual .venv\, .git\, node_modules\. The rules engine is not allowed to make decisions about my active project folders.

digiKam, for the photo half

I didn’t build my own photo deduplicator. digiKam already does it better than I would have, so I used digiKam. The split worked out clean: organize handles the ongoing flow (new images land in dated inbox folders), digiKam handles the one-time consolidation pass and perceptual-hash dedup.

What broke

A compressed-files rule matched too broadly on its first dry-run — it would have swept up .tar and .gz files inside a Linux dotfile backup I’d forgotten about. The dry-run output was the only reason I caught it; one-line fix with max_depth: 0.

The Mail-attachments copy rule, on its first real run, created a duplicate loop — the JSONL log showed the same file being copied as (1), (2), (3) on consecutive scheduled runs. Fixed with the same-name-same-size check in log_move. The dry-run hadn’t caught it because the duplicates only appeared after a real run created the first copy. Honest lesson: dry-runs catch the rules-that-should-not-fire problem; the JSONL log is what made the state-after-the-rule-fires problem visible.

Where it sits now

The pipeline runs hourly via a Windows Task Scheduler job that invokes scripts\run_organize.ps1. Every run appends to logs\organize-runs.log and, for any actual moves, logs\moves.jsonl. Restic backs the lot up nightly.

The principle

Most file-automation tools fail one basic test: they assume the user trusts them. The user shouldn’t. Filesystem automation has the same risk profile as a database migration: silent corruption is more dangerous than a loud failure, the consequences are durable, and “undo” is rarely free. The right default is the opposite of what most tools ship with: every action reversible, every action audited, every rule dry-runnable.

The reason this project works for me isn’t the YAML rules. It’s that there is no path through the system that touches my filesystem without leaving a JSONL receipt and without having first been shown to me in simulation. That’s the test I’d want any future automation I build to pass.

— Luke Simmons, Auckland

A political research tool for New Zealand

2026-05-23T00:00:00Z

The itch

Every three years a major outlet stands up an election-cycle quiz, you take it once, you screenshot the result, and then the URL rots. The methodology was never documented in a way you could interrogate, and after the election the whole thing disappears.

The other half of the friction is the lookup problem. When I want to remind myself who an MP is, what they actually stand for, and how their party is positioned, I end up in five tabs that don’t compose.

I wanted one local tool that did both jobs, that I could pull apart and edit, and that didn’t go away after election night. So I built it: a Flask app on 127.0.0.1:5000, ~175 lines in app.py, three editable JSON files in data/, and three thin service modules in services/. In debug mode it reloads the data files on every request, so I can tweak a party’s stance on the wealth tax, hit refresh, and watch the match percentages move.

What’s in it

Three things, deliberately small.

A party browser. All seven parties currently in the conversation — Labour, National, Greens, ACT, NZ First, Te Pāti Māori, TOP. Each has a curated profile in data/nz_parties.json: leader, founding year, summary, five key policies, a colour, and a Political Compass position. Party pages also pull a live Wikipedia summary as a neutral third-party blurb next to my characterisation.

An international browser. Curated entries for a handful of countries in data/countries.json, enriched live with Wikipedia. The /world page fans the fetches across a thread pool so it loads in the time of the slowest single request.
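The fan-out is a few lines with the standard library’s concurrent.futures. This is a sketch of the pattern rather than the app’s code: fetch_all and its fetch parameter are hypothetical names, and swallowing per-title failures into None is an assumption about how a page like /world would want to degrade.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(titles, fetch, max_workers=8):
    """Run fetch(title) for every title concurrently; return {title: result or None}."""
    titles = list(titles)
    if not titles:
        return {}

    def safe(title):
        # One failed lookup should not take the whole page down.
        try:
            return fetch(title)
        except Exception:
            return None

    workers = min(max_workers, len(titles))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(safe, titles))  # map preserves input order
    return dict(zip(titles, results))
```

Threads suit this job because the work is waiting on the network, not computing; with enough workers the total wait is roughly that of the slowest single fetch.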

A quiz. Twenty questions in data/questions.json, each tagged by category (Economic, Environment, Social, Treaty, Foreign), each with an econ_weight, a social_weight, and a per-party expected answer on a five-point scale (-2 to +2). Submitting projects your answers onto the Political Compass and computes a Vote-Compass-style percentage match against each party.

Why a REST endpoint, not a scraper

The Wikipedia integration hits https://en.wikipedia.org/api/rest_v1/page/summary/<title>. No API key, no scraping, no HTML parsing: just a small JSON blob with an extract and a link out. Each request has a six-second timeout, and responses are cached in-process for an hour. For the use case I’m actually solving (“remind me who this person is in 30 seconds”) a scraper would be overkill.
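A minimal sketch of that fetch-and-cache layer using only the standard library. The endpoint, the one-hour cache, and the six-second timeout come from the text; the SummaryCache class, the User-Agent string, and the shape of the returned dict are assumptions for illustration.

```python
import json
import time
import urllib.parse
import urllib.request

SUMMARY_URL = "https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
CACHE_TTL_SECONDS = 3600
TIMEOUT_SECONDS = 6


def http_get_json(url, timeout=TIMEOUT_SECONDS):
    """GET a URL and decode the JSON body."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "nz-politics-tool/0.1 (local research)"}  # hypothetical
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


class SummaryCache:
    """In-process cache of page summaries, keyed by title, with a time-to-live."""

    def __init__(self, fetch=http_get_json, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._fetch, self._ttl, self._clock = fetch, ttl, clock
        self._entries = {}  # title -> (stored_at, summary)

    def get(self, title):
        now = self._clock()
        hit = self._entries.get(title)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        slug = urllib.parse.quote(title.replace(" ", "_"), safe="")
        data = self._fetch(SUMMARY_URL.format(title=slug))
        summary = {
            "title": data.get("title"),
            "extract": data.get("extract"),
            "url": data.get("content_urls", {}).get("desktop", {}).get("page"),
        }
        self._entries[title] = (now, summary)
        return summary
```

Usage is one line, for example SummaryCache().get("New Zealand Labour Party"); a second call within the hour is served from memory.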

The hard part: where do you put each party on the compass?

This is the part that has to be defensible or the whole tool is theatre.

Each party gets a compass coordinate stored in nz_parties.json. Labour sits at (-3, -1). National at (5, 2). The Greens at (-6, -5). ACT at (8, -4). NZ First at (1, 6) — economically near the centre but socially the most authoritarian of the seven. Te Pāti Māori at (-5, -2). TOP at (-1, -4).

These are not endorsements and the data file says so in a leading _note field. They’re characterisations based on published platforms, calibrated against each party’s official policy pages, their voting record in the House, and the per-question positions I encoded in questions.json. The per-question positions are the audit trail. If you think I’ve put the Greens too far left on the economic axis, you can open the questions file and see exactly which positions on wealth tax, capital gains, welfare, rent controls, and SOE ownership produced that placement. Change the number, refresh, and the compass position shifts.

This matters most for the parties whose platforms don’t fit a clean left-right line. NZ First is the standout: economically interventionist in places (protecting SOEs, regional development) but socially conservative and nationalist in a way that doesn’t map onto either Labour or National. The Greens are the inverse — a left economic platform paired with a libertarian-leaning social one, which is why they sit in the lower-left quadrant rather than the top-left. TOP doesn’t really sit anywhere conventional; their land-value tax and UBI agenda is economically heterodox in a way that a single left-right number genuinely struggles with.

I’d rather show that ambiguity than smooth it away. The compass coordinate is one summary. The per-question breakdown is the real thing.

How the matching works

The maths is deliberately boring. For each question, the absolute distance between your answer and the party’s expected answer runs from 0 to 4 on the -2 to +2 scale; similarity maps that distance linearly onto [0, 1], with 0 apart scoring 1 and 4 apart scoring 0. The overall match is the unweighted mean of those similarities, expressed as a percent. The compass projection is a separate weighted sum of answers against each question’s econ_weight and social_weight, normalised and clamped to [-10, +10].
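The same arithmetic as a short Python sketch. The similarity and mean follow the description directly; the normalisation in compass (dividing by the largest magnitude the weighted sum could reach) is one plausible reading, not necessarily what the app does, and the function names are hypothetical.

```python
def similarity(user, party):
    """Map the distance between two answers on the -2..+2 scale onto [0, 1]."""
    return 1.0 - abs(user - party) / 4.0


def match_percent(user_answers, party_answers):
    """Unweighted mean similarity across all answered questions, as a percent."""
    sims = [similarity(user_answers[q], party_answers[q]) for q in user_answers]
    return 100.0 * sum(sims) / len(sims)


def compass(user_answers, weights):
    """Project answers onto (economic, social) axes, each clamped to [-10, +10].

    `weights` maps question id -> {"econ_weight": float, "social_weight": float}.
    """
    def axis(key):
        raw = sum(user_answers[q] * weights[q][key] for q in user_answers)
        # Assumed normalisation: the largest magnitude this sum could reach.
        max_raw = sum(2 * abs(weights[q][key]) for q in user_answers)
        if max_raw == 0:
            return 0.0
        return max(-10.0, min(10.0, 10.0 * raw / max_raw))

    return axis("econ_weight"), axis("social_weight")
```

For example, agreeing exactly on one question and sitting at opposite extremes on another gives similarities of 1 and 0, so a 50% match.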

Equal weighting per question is a choice. A more sophisticated version would let you mark questions as “important to me” and weight those higher — Vote Compass does this. I haven’t built it yet because I want to use v1 through an election cycle first and see whether the unweighted version is actually wrong in practice or just feels wrong in theory.

What I deliberately didn’t build

No predictions. No polling. No vote-share modelling. No “you should vote for X.” No tactical-voting calculator that says “in your electorate, your party vote is most efficiently spent on Y.” The tool is research, not strategy. The output is “here is how your answers line up with each party’s positions, and here is where you sit on the compass” — not “here is what you should do with that information.”

I also didn’t build a headlines integration. The country page calls a get_recent_headlines(country_id) stub in services/news.py that returns an empty list. The hook is there if I want to drop in NewsAPI or the Guardian Open Platform later. Anything behind that stub becomes something to maintain across an election cycle, and I’d rather ship the questionnaire first.

The civic-tech principle

There’s a class of small, local-first civic tools that respect the user’s intelligence. They show you the data, document their assumptions, let you disagree, and don’t tell you what to do with what you’ve found. The election-cycle quizzes mostly don’t — they’re built to be consumed once, screenshotted, and forgotten. A tool you can edit, that runs on your own machine, that you can use between elections, is a different kind of object.

That’s the version I wanted. So that’s the version I built.

What’s next

Council-level data, starting with Auckland Council and the local boards — the layer of government that affects me most and that I see least.
A platform refresh before the next general election, since every party will republish policy and the per-question encodings will need a pass.
An “importance weight” affordance on the quiz, once I’ve used the unweighted version through a cycle.

None of these are blocking. The tool already does the two things I built it for: it lets me look up an MP or party without leaving the tab, and it gives me a calibrated, editable version of the quiz I take every three years.

— Luke Simmons, Auckland

streamfinder: a streaming aggregator that knows about NZ free TV

2026-05-22T00:00:00Z

The gap I’m filling

If you live in New Zealand and you want to know who’s streaming a given title, your options are JustWatch or the others — and JustWatch is also the data source behind TMDB’s /watch/providers endpoint, so it’s implicitly behind most aggregators too. For paid services in NZ it’s mostly fine — Netflix, Disney+, Neon, Prime, Apple TV+ all get indexed reasonably well.

The free side is where it falls apart.

TVNZ+ has a catalogue of roughly 1,200 shows, including a large BBC slice that I’m currently paying for via Sky/Neon. ThreeNow has roughly 580. (Both counts are approximate and still to be verified.) Between them you can put together a meaningful “what can I watch tonight without paying for another subscription” answer for a NZ household. JustWatch under-indexes TVNZ+ badly and the matching is patchy enough that I stopped trusting it.

The international aggregators aren’t going to bother — NZ is a small market with two free services no global product is going to bend its schema around. So if I want this solved properly, I have to solve it myself.

That’s streamfinder.

What it actually does

Search any title → metadata, ratings, runtime, and every NZ-region streaming provider, with free options surfaced prominently.
Free-service search runs against a locally indexed catalogue of TVNZ+ and ThreeNow, not against a third party’s interpretation of those services.
Deep links open the title in the user’s existing browser session on the service that has it. No credential storage, no playback, no proxying.
Local-first. SQLite file in data/streamfinder.db. The TMDB calls are the only external traffic, logged to logs/api-calls.jsonl with the API key stripped.
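As a rough illustration of the "API key stripped" logging described above, here is a minimal sketch. It assumes the key travels as an api_key query parameter (which TMDB's v3 API accepts). The function names and the record fields are hypothetical, not taken from the project.

```python
import json
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the only secret in the URL is the api_key query parameter.
SECRET_PARAMS = {"api_key"}


def redact_url(url: str) -> str:
    """Return the URL with secret query parameters replaced by a placeholder."""
    parts = urlsplit(url)
    query = [
        (key, "REDACTED" if key in SECRET_PARAMS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


def format_record(method: str, url: str, status: int) -> str:
    """Build one JSONL line describing an outbound call (hypothetical fields)."""
    return json.dumps(
        {"ts": time.time(), "method": method, "url": redact_url(url), "status": status}
    )


def append_jsonl(path: str, line: str) -> None:
    """Append a single record to the log file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

Redacting at the point where the record is built, rather than when the log is read, means the key never reaches disk.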

Stack: Python 3.13, FastAPI on port 8765, HTMX + Tailwind via CDN (no build step — I want to read every line of every template), SQLite with FTS5, TMDB for international provider data, custom sitemap fetchers for the NZ free side. CLI first, web UI second, same data underneath.

TMDB for the paid side

For the paid services I lean on TMDB’s /watch/providers endpoint. One call gives you every region in one response, and you slice it client-side. The Python wrapper is about 130 lines including search, details, recommendations, and the provider lookup. The provider data comes from JustWatch under the hood — good for Netflix / Disney+ / Neon / Prime, mediocre for TVNZ+, missing for ThreeNow.
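To make "slice it client-side" concrete, here is a small sketch that pulls one region out of a /watch/providers response. It assumes the documented response shape: a results object keyed by region code, with provider lists grouped under monetisation buckets. The function name and the sample payload are made up for illustration.

```python
from typing import Any

# Monetisation buckets that can appear inside each region entry.
BUCKETS = ("free", "ads", "flatrate", "rent", "buy")


def providers_for_region(payload: dict[str, Any], region: str = "NZ") -> dict[str, list[str]]:
    """Return {bucket: [provider names]} for one region, skipping empty buckets."""
    entry = payload.get("results", {}).get(region, {})
    return {
        bucket: [provider["provider_name"] for provider in entry[bucket]]
        for bucket in BUCKETS
        if entry.get(bucket)
    }


# Made-up payload in the same shape as the API response, for illustration only.
SAMPLE = {
    "id": 603,
    "results": {
        "NZ": {
            "link": "https://example.invalid/nz",
            "flatrate": [{"provider_id": 8, "provider_name": "Netflix"}],
            "rent": [{"provider_id": 2, "provider_name": "Apple TV"}],
        },
        "US": {
            "link": "https://example.invalid/us",
            "buy": [{"provider_id": 3, "provider_name": "Google Play Movies"}],
        },
    },
}

nz = providers_for_region(SAMPLE)           # NZ slice
missing = providers_for_region(SAMPLE, "FR")  # region absent from the payload
```

A region that is missing from the response comes back as an empty dict, which is the case where an overseas-region fallback would take over.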

TMDB also gives me canonical metadata plus a useful merged “similar titles” view from combining /recommendations (editorial) and /similar (algorithmic) — nice for discovery that isn’t just “type the name of the thing you already know.”

TMDB is the licensed access path. Free, documented, rate-limit-friendly. Scraping JustWatch is fragile and rude. Use the API that exists.

Sitemap parsing for the free side

This is the part I’m proud of, because it sidesteps a category of problems most people would reach for the wrong tool to solve.

Both TVNZ+ and ThreeNow publish XML sitemaps for SEO purposes:

https://www.tvnz.co.nz/sitemap/sitemap-video.xml — full <video:video> blocks with title, description, thumbnail, category, and a requires_subscription flag (TVNZ wraps some shows behind a free account).
https://www.threenow.co.nz/sitemap_shows.xml — URL-only entries. Title gets reconstructed from the slug.

Both robots.txt files are permissive, so this is the sanctioned access path. The combined payload is around [CHECK: 1.5 MB], trivial to re-pull nightly. No JavaScript rendering, no API key, no rate limiter to dance around. The data is structured, stable, and — crucially — the format the services want search engines to consume. If they break the sitemap, their Google ranking dies, so the format has strong incentives against breaking.

Much better than scraping the front-end, reverse-engineering the mobile APIs, or asking the services for a feed.

The sitemap is the answer the service already wrote down. Use it.
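A minimal sketch of both parsing styles, assuming the TVNZ+ file follows the standard Google video-sitemap layout (the namespace URIs below are the standard ones, but the real file's exact fields have not been checked here). The function names and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard sitemap and video-sitemap namespaces (assumed to match the real files).
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "video": "http://www.google.com/schemas/sitemap-video/1.1",
}


def parse_video_sitemap(xml_text: str) -> list[dict]:
    """Return one dict per <video:video> block found in the sitemap."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        for video in url.findall("video:video", NS):
            def text(tag: str, default: str = "") -> str:
                return video.findtext(f"video:{tag}", default=default, namespaces=NS).strip()

            rows.append({
                "url": loc,
                "title": text("title"),
                "description": text("description"),
                "category": text("category"),
                "requires_subscription": text("requires_subscription", "no").lower() == "yes",
            })
    return rows


def title_from_slug(url: str) -> str:
    """Rebuild a display title from the last path segment of a URL-only entry."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    return " ".join(word.capitalize() for word in slug.split("-") if word)


# Made-up sample, not real catalogue data.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://example.invalid/shows/example-show</loc>
    <video:video>
      <video:title>Example Show</video:title>
      <video:description>Made-up entry for illustration.</video:description>
      <video:category>Drama</video:category>
      <video:requires_subscription>no</video:requires_subscription>
    </video:video>
  </url>
</urlset>"""

rows = parse_video_sitemap(SAMPLE)
```

Slug reconstruction is lossy: title_from_slug turns "the-block-nz" into "The Block Nz", so acronyms and punctuation need a later matching step to recover.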

The parser is one file (free_index.py, ~180 lines), uses xml.etree.ElementTree, and produces one row per show with a normalised slug for cross-matching against TMDB titles. Upserts by (service, service_id) so re-running is idempotent. Will be scheduled nightly via Task Scheduler once Phase 6 is fully wired up.
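The idempotent upsert can be sketched with SQLite's ON CONFLICT clause (available from SQLite 3.24). Only the (service, service_id) key comes from the description above. The other column names and the helper functions are assumptions for illustration.

```python
import sqlite3

# Assumed columns; only the UNIQUE (service, service_id) key is from the post.
SCHEMA = """
CREATE TABLE IF NOT EXISTS free_titles (
    id INTEGER PRIMARY KEY,
    service TEXT NOT NULL,
    service_id TEXT NOT NULL,
    title TEXT NOT NULL,
    slug TEXT NOT NULL,
    UNIQUE (service, service_id)
);
"""

UPSERT = """
INSERT INTO free_titles (service, service_id, title, slug)
VALUES (:service, :service_id, :title, :slug)
ON CONFLICT (service, service_id) DO UPDATE SET
    title = excluded.title,
    slug = excluded.slug;
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_rows(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Insert new shows and refresh existing ones in a single transaction."""
    with conn:
        conn.executemany(UPSERT, rows)
```

Running the same ingest twice leaves one row per show, and a changed title overwrites the old one instead of creating a duplicate.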

Why FTS5 SQLite over Elasticsearch or Postgres

For a personal-scale project — single user, low thousands of records, on-device — FTS5 is the right answer and it isn’t even close.

Elasticsearch: absurd. It’s a clustered search engine. I have one user. The JVM heap alone would dwarf the rest of the app.

Postgres + tsvector: also overkill. Now I’m running a database server, managing a connection pool, writing migrations, and getting search quality that’s worse than FTS5 for the prefix-typeahead behaviour I want. Postgres is right when you have multiple writers, real concurrency, or you’re already running it. For a Windows desktop app with one user, it’s a service to babysit.

FTS5: virtual table, lives in the same .db file as everything else, prefix matching with term* syntax for nice typeahead, the unicode61 tokenizer with remove_diacritics 2 so “pokemon” finds “Pokémon,” and triggers that keep it in sync with the base table automatically. Zero ops. About 20 lines of SQL in the schema plus a 15-line query function.
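A sketch of that setup, following the external-content pattern from the SQLite FTS5 documentation. The table and trigger names, the column list, and the search helper are assumptions rather than the project's actual schema. It needs a Python build whose SQLite includes FTS5, and remove_diacritics 2 needs SQLite 3.27 or later.

```python
import sqlite3

SCHEMA = """
CREATE TABLE free_titles (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    description TEXT NOT NULL DEFAULT ''
);
-- The index stores no copy of the text; it reads it back from free_titles.
CREATE VIRTUAL TABLE free_titles_fts USING fts5(
    title, description,
    content='free_titles', content_rowid='id',
    tokenize='unicode61 remove_diacritics 2'
);
CREATE TRIGGER free_titles_ai AFTER INSERT ON free_titles BEGIN
    INSERT INTO free_titles_fts(rowid, title, description)
    VALUES (new.id, new.title, new.description);
END;
CREATE TRIGGER free_titles_ad AFTER DELETE ON free_titles BEGIN
    INSERT INTO free_titles_fts(free_titles_fts, rowid, title, description)
    VALUES ('delete', old.id, old.title, old.description);
END;
CREATE TRIGGER free_titles_au AFTER UPDATE ON free_titles BEGIN
    INSERT INTO free_titles_fts(free_titles_fts, rowid, title, description)
    VALUES ('delete', old.id, old.title, old.description);
    INSERT INTO free_titles_fts(rowid, title, description)
    VALUES (new.id, new.title, new.description);
END;
"""


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Create the base table, the FTS5 index, and the three sync triggers."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def search(conn: sqlite3.Connection, text: str, limit: int = 10) -> list[str]:
    """Typeahead search: every term must match, the last one as a prefix."""
    terms = text.split()
    if not terms:
        return []
    # Quote each term so user input is never parsed as FTS5 query syntax.
    quoted = ['"' + term.replace('"', '""') + '"' for term in terms]
    quoted[-1] += "*"
    rows = conn.execute(
        "SELECT title FROM free_titles_fts WHERE free_titles_fts MATCH ? "
        "ORDER BY rank LIMIT ?",
        (" ".join(quoted), limit),
    )
    return [row[0] for row in rows]
```

Ordering by rank uses FTS5's default bm25 scoring, so no separate ranking code is needed.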

Tradeoffs I’m accepting: no multi-process writers (only one ingest job writes, WAL mode lets the web app read concurrently), no ranking beyond bm25 (I’m not running Google), no distributed scaling (there is exactly one node and it is my laptop).

Most search-engine choices in side projects are people picking the tool they read about, not the tool that fits the problem. FTS5 fits this problem. If “what if I scale” ever becomes real I can swap the backend behind the same interface — the schema is portable.

The schema, briefly

The free-catalogue side is free_titles (one row per show, unique on (service, service_id) for idempotent re-ingest) and free_titles_fts (FTS5 virtual table over title + description, kept in sync via three triggers so I never have to remember to reindex). There are also caches for two probes that didn’t pan out; schema kept around in case I find a better signal later.

What I haven’t built yet: a unified media / watchlist / history set of tables for the full Phase 3 flow. That’s coming.

Where the project is right now

Status as of late May 2026:

Phase 1 — CLI + TMDB validation. Done. search and lookup subcommands; confirmed TMDB is good enough for the paid side, partial for the free side.
Phase 2 — web UI. Done. FastAPI on 8765, HTMX search, results with provider badges colour-coded by monetisation type, detail page with overseas-region fallback when NZ has nothing.
Phase 3 — watchlist + history. Not built. Highest-value next thing.
Phase 4 — Trakt sync. Not built. Optional, low priority.
Phase 5 — LLM recommendations. Not built. Lower priority than I originally thought (see my recent post about pulling the LLM out of job-scout — same lesson applies here).
Phase 6 — NZ free-service indexing. In flight. Sitemap parsers work end-to-end, FTS5 search runs locally, cross-matching against TMDB by (slug, year) is next, then surfacing “Free on TVNZ+” as a prominent badge alongside the paid providers — the actual headline feature for an NZ user.

What I’m learning about niche regional products

The whole project is a bet on one claim: the value of a regional product is precisely in the local knowledge that global products won’t bother to encode. JustWatch could index TVNZ+ properly — they choose not to, because the engineering cost isn’t worth it at their scale for a market my size. That’s the gap, and that’s the moat for anyone who lives here and is willing to do the maintenance work.

The cost is exactly what it sounds like: when TVNZ rebrands or ThreeNow restructures their URLs, my parser breaks and theirs doesn’t. For a personal project that tax is cheap — the sitemap format is stable for years, and maintenance is roughly one evening every six months. If I were shipping this as SaaS, the calculus would be different.

What’s next

Finish Phase 6 — wire the free-catalogue rows into the main search results so a single query returns paid providers (TMDB), free providers (TMDB, where it knows), and free-catalogue hits (my own index) in one merged view.

After that: Phase 3 (watchlist + history), nightly Task Scheduler refresh with a visible “last successful” timestamp, and better cross-source matching (year + slug + fuzzy fallback so TMDB’s “Pokémon” finds TVNZ+’s “Pokemon: Indigo League”).
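One way the year + slug + fuzzy fallback could look, as a sketch only. The prefix rule and the 0.85 threshold are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from typing import Optional


def slugify(title: str) -> str:
    """Lowercase, strip diacritics, and collapse everything else to hyphens."""
    decomposed = unicodedata.normalize("NFKD", title)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "-", plain.lower()).strip("-")


def titles_match(
    a: str,
    b: str,
    year_a: Optional[int] = None,
    year_b: Optional[int] = None,
    threshold: float = 0.85,  # assumed cut-off, would need tuning on real data
) -> bool:
    """Exact slug match first, then a prefix rule, then a fuzzy ratio."""
    # Years only veto a match when both sources actually report one.
    if year_a is not None and year_b is not None and year_a != year_b:
        return False
    slug_a, slug_b = slugify(a), slugify(b)
    if not slug_a or not slug_b:
        return False
    if slug_a == slug_b:
        return True
    # Franchise title versus "Title: Subtitle" style listings.
    short, long_ = sorted((slug_a, slug_b), key=len)
    if long_.startswith(short + "-"):
        return True
    return SequenceMatcher(None, slug_a, slug_b).ratio() >= threshold
```

The prefix rule is deliberately loose: it would pair a franchise title with every one of its sub-series, so the year check (or a human review queue) has to do the disambiguation.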

Phase 5 (LLM recommendations) dropped significantly in priority after the job-scout post-mortem. A deterministic “what’s new on the services I subscribe to” view is more valuable to me than a model’s vibes-based suggestions. If recommendations come back, they’ll be a thin layer on top of structured data, not a black box.

The project is doing what I wanted it to do. It tells me whether something is on TVNZ+ before I open Neon to pay for it. That’s the feature.

— Luke Simmons, Auckland

The Rust project I retired, and what it taught me about how I learn

2026-05-21T00:00:00Z

What minikv was

minikv was supposed to be the project that taught me Rust properly. The design is a Bitcask-style key-value store — the same pattern Riak’s storage engine uses — written as a Rust library (src/lib.rs) and exposed to Python through PyO3 bindings, built with maturin. The end state would have been a pip install-able Python package whose hot path is native Rust.

The Bitcask model is small and tidy, which is part of why I picked it. Writes go to the end of a single append-only log file. A separate in-memory keydir holds a HashMap<Key, FileOffset> mapping every live key to the byte offset of its most recent value on disk. Reads consult the keydir, seek to the offset, and pull the value out in one I/O operation. Deletes write a tombstone record. The log grows forever, so a periodic compaction pass rewrites a fresh log containing only the live keys and atomically swaps it in. That’s the whole design. Six milestones from scaffold to concurrent reads with CRCs and fsync.
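The model is small enough to sketch in full. The version below is in Python rather than Rust and is not minikv's code: the record layout (two 4-byte lengths, then key, then value), the tombstone sentinel, and the class name are all assumptions. It leaves out CRCs, fsync, and concurrent reads.

```python
import os
import struct

HEADER = struct.Struct(">II")  # key length, value length
TOMBSTONE = 0xFFFFFFFF         # value-length sentinel that marks a delete


class MiniBitcask:
    def __init__(self, path: str) -> None:
        self.path = path
        self.keydir: dict[bytes, tuple[int, int]] = {}  # key -> (offset, length)
        open(path, "ab").close()  # create the log if it does not exist
        self._rebuild_keydir()
        self.log = open(path, "r+b")

    def _rebuild_keydir(self) -> None:
        """Replay the log so the keydir points at each key's latest value."""
        self.keydir.clear()
        with open(self.path, "rb") as f:
            while True:
                header = f.read(HEADER.size)
                if len(header) < HEADER.size:
                    break
                klen, vlen = HEADER.unpack(header)
                key = f.read(klen)
                if vlen == TOMBSTONE:
                    self.keydir.pop(key, None)
                else:
                    self.keydir[key] = (f.tell(), vlen)
                    f.seek(vlen, os.SEEK_CUR)

    def put(self, key: bytes, value: bytes) -> None:
        self.log.seek(0, os.SEEK_END)  # writes only ever append
        self.log.write(HEADER.pack(len(key), len(value)) + key)
        offset = self.log.tell()
        self.log.write(value)
        self.log.flush()
        self.keydir[key] = (offset, len(value))

    def get(self, key: bytes):
        entry = self.keydir.get(key)
        if entry is None:
            return None
        offset, vlen = entry
        self.log.seek(offset)  # one seek, one read
        return self.log.read(vlen)

    def delete(self, key: bytes) -> None:
        if key in self.keydir:
            self.log.seek(0, os.SEEK_END)
            self.log.write(HEADER.pack(len(key), TOMBSTONE) + key)
            self.log.flush()
            del self.keydir[key]

    def compact(self) -> None:
        """Rewrite a log holding only live keys, then swap it in atomically."""
        live = {key: self.get(key) for key in self.keydir}  # fine for a sketch
        tmp = self.path + ".compact"
        with open(tmp, "wb") as out:
            for key, value in live.items():
                out.write(HEADER.pack(len(key), len(value)) + key + value)
        self.log.close()
        os.replace(tmp, self.path)
        self._rebuild_keydir()
        self.log = open(self.path, "r+b")

    def close(self) -> None:
        self.log.close()
```

Compaction here loads every live value into memory before rewriting, which a real implementation would stream instead.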

I picked it for two reasons. It would force me to confront Rust on its own terms — ownership puzzles you can’t paper over, lifetimes you have to actually think through, an FFI boundary where you feel every byte you copy. And the Rust-core / Python-bindings pattern itself is a real deployable technique: plenty of production Python codebases bolt a Rust hot path on via PyO3. Knowing how that boundary works is genuinely useful.

The design was sound. The scaffolding got built. Then I sat at src/lib.rs with a blank file and a spec, and I didn’t write any code.

What actually happened

It’s not that Rust was too hard. Rust is hard, but “hard” isn’t a blocker — every project I’ve shipped this year had a hard part. The actual problem was more specific.

minikv was built around an implicit premise: that I learn by writing implementation myself from a blank file, with an AI assistant nearby to nudge me when I get stuck. The project’s claude.md even had a rule on it — don’t write Rust function bodies for Luke — to keep the exercise honest.

After a few weeks of not opening the file, I noticed this wasn’t a motivation problem. I’d shipped six other things in 2026 — Job Scout, a personal finance dashboard, voice cloning, the smart-home rebuild, a file organizer, a political research tool. I wasn’t bouncing off building software. I was bouncing off this specific format: blank file, spec, you go.

What does work for me — what I’ve been doing all year without naming it — is the inverse loop. I sit with Claude, describe what I want, watch it write the implementation, read the result, ask why it made the choices it made, and push back when something looks wrong. By the third or fourth pass on a given pattern, I can see the design space clearly. I can tell when Claude has reached for the wrong abstraction. I’m not typing the function bodies, but I’m making the architectural calls, and the pattern is in my head afterwards.

That’s not a worse mode of learning. It’s just not the mode minikv was built for.

The realisation

The moment I named this was almost embarrassingly casual. I told Claude, more or less, that it’s more fun watching it build things and then having it explain them afterwards than it is grinding the implementation myself. Once that sentence was out, the project’s whole premise came apart. Every constraint I’d put on Claude was built for a learning style I don’t actually have.

So I retired the old rule and inverted it. As of 2026-05-28: Claude builds the implementation, then explains the design choices afterwards — ownership, lifetimes, FFI boundary types, the gotchas. I read, review, and ask why. I don’t retype. That sounds like a small policy change but it isn’t, because it follows from a bigger thing I’d been avoiding saying out loud.

What this actually means

I’m an AI-leveraged operator, not a hand-coder.

That’s the frame. The portfolio I’ve built in 2026 is real software that real users (mostly me) actually use, but the implementation work was done largely with AI assistance. My value-add was judgment, direction, integration, and shipping decisions — what to build, why, what shape it should take, when to cut scope. The Job Scout post-mortem is the cleanest example: the valuable insight wasn’t writing the Python, it was correctly identifying that a 7B local LLM can’t do calibrated judgment under category-edge ambiguity, and ripping the model out. Nobody needs me to hand-type a Flask route. They might need me to call when to delete one.

Once I name that honestly, things follow. The career paths that fit are forward-deployed engineer (FDE), solutions engineer, founder, or AI-direction roles, not IC engineering at Halter or an embedded shop. Both of those gate on writing code in live whiteboard interviews. My learning style doesn’t develop that skill, and pretending it does by grinding through minikv was implicitly preparing me for a career I’m not actually pursuing.

The textbook-style learning resources — Rustlings, the Brown interactive Rust Book, Exercism — aren’t going to land for me, and I should stop feeling guilty about that. I read the chapters; I bounce off the exercises. It’s not a discipline failure — it’s the wrong shape for how my brain encodes patterns.

And admitting this is more useful than completing minikv would have been. A finished minikv would have been a 600-line Rust crate that does what dbm already does. Naming my actual learning mode unlocks every project after this.

What replaces minikv

Not nothing.

The Rust-core / Python-bindings pattern is still one I want available. I just won’t get to it by hand-coding a key-value store. I’ll get to it the same way I get to every other pattern: wait until a real project needs it, direct Claude to build it, and read the result.

The next concrete instance is already lined up. TenderPilot — my upcoming NZ government tender aggregator, scoped into the gap Job Scout’s restructure left behind — has a Rust GETS fetcher in its design. Not because the project needs Rust everywhere, but because the GETS XML feed is the kind of high-throughput parallel-fetch problem where Rust earns its keep over Python. The fetcher will be a Rust binary called from a Python orchestrator. Claude will write it. I’ll review the ownership choices, the error handling, the parallelism model, the FFI surface if we go that way. That’s the build-and-explain loop applied to a real project where Rust does work Python can’t easily do.

I’ll learn the pattern there. I won’t learn it by completing minikv. There’s a queue of follow-on projects where Rust fits the hot path, and the knowledge will accumulate through them — via the loop that works for me, not the one I thought I was supposed to use.

The honest closing

There’s a version of this post that reads as confession — “I tried to learn Rust and failed.” That version would be wrong. I haven’t failed to learn Rust. I’ve redefined what learning Rust means for someone in my position.

I don’t need to type function bodies to understand patterns. The pattern is what makes architectural calls. The syntax is what AI fills in. As long as I can read a Rust crate, recognise when the ownership model is wrong for the problem, see when a lifetime annotation is doing real work versus just placating the borrow checker, and direct the build at the architecture level — that’s the skill that actually pays. minikv was built around the assumption that the syntax reps were the load-bearing part. They aren’t, for me.

I get a week of my life back, I stop forcing myself through a learning mode that doesn’t fit, and the Rust knowledge I actually want — pattern-level, deployable, real-project-grounded — comes in through TenderPilot and the projects after it instead.

minikv is retired. The Rust isn’t.

— Luke Simmons, Auckland

Multimodal Generation and Robust Agent Engineering

2026-05-17T00:00:00Z

Going back through this week’s reading, two threads stood out from the arXiv cs.AI feed: advances in multimodal generation and continual learning, and a sharper argument for treating personal AI agents as engineered software rather than improvised prompt chains.

Advances in multimodal generation

Two papers approached the same broad area — making unified models better at generation — from different directions. The first introduces AlphaGRPO, a framework for self-reflective multimodal generation in Unified Multimodal Models (UMMs) via Group Relative Policy Optimization (GRPO). Applying GRPO to AR-Diffusion UMMs, the approach aims to unlock reasoning-heavy generation tasks such as text-to-image without an additional cold-start stage (arxiv.org/abs/2605.15198v1).

The second tackles continual learning in large language models — the problem of adapting a model to new tasks without catastrophic forgetting or loss of plasticity. It proposes using in-context learning with fixed parameters, so a model adjusts to task-specific requirements through prompt optimization rather than weight updates, keeping baseline performance stable across domains (arxiv.org/abs/2605.15188v1).

The two pair naturally: one is about making generation smarter through a training-time objective, the other about adapting at inference time without retraining. Both point at the same goal — capability gains that don’t come at the cost of stability elsewhere in the model.

Robust engineering for personal agents

The other thread is a paper titled Engineering Robustness into Personal Agents with the AI Workflow Store (arxiv.org/abs/2605.10907v1). It argues for a more disciplined approach to building personal AI agents — integrating traditional software-engineering practice such as iterative design and rigorous testing — and critiques the current paradigm of on-the-fly agent synthesis, where an agent’s workflow is generated fresh each run. The paper’s case is that improvised synthesis undermines reliability: there is nothing stable to test, version, or debug.

This connects directly to the continual-learning paper above. Both are really about the same tension — capability versus reliability. A model or agent that adapts freely is more capable in the moment but harder to reason about; one with fixed, tested structure is more predictable but slower to change. The interesting engineering question is where to put the boundary between the parts that adapt and the parts that stay fixed.

Closing thought

The unifying theme this week is that progress in AI systems increasingly looks like ordinary engineering: deciding which components are allowed to change at runtime and which are pinned, tested, and versioned. The frontier research and the practitioner critique are converging on the same point from opposite ends.

AI Safety and Control in Complex Environments

2026-05-10T00:00:00Z

Three threads kept turning up in this week’s reading: how safety in large language models scales differently from accuracy, why long-horizon agents need explicit control flow rather than more prompting, and how AI is forcing security researchers to renegotiate vulnerability-disclosure norms.

Safety scales differently from accuracy in clinical LLMs

Two recent arXiv papers sit on the same problem from different angles. The first, Safety and accuracy follow different scaling laws in clinical large language models (arxiv.org/abs/2605.04039v1), introduces SaFE-Scale and argues that safety in medical LLMs does not improve at the same rate as benchmark performance — meaning a model can be more accurate on average while still producing rare high-risk errors at the same or higher rate.

The second, BAMI: Training-Free Bias Mitigation in GUI Grounding (arxiv.org/abs/2605.06664v1), tackles bias in models that ground language to graphical user interfaces — a precondition for any agent that operates a real desktop. The Masked Prediction Distribution method identifies error sources arising from high-resolution images without retraining.

Both papers point to the same underlying issue: average-case quality and worst-case behaviour are different objectives, and improving the first does not automatically improve the second. That distinction matters more as model deployments move into clinical and operational settings where the worst case is what gets reported.

Agents need control flow, not more prompts

The other thread this week is structural: how do you build agents that stay coherent over long horizons? The arXiv paper LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents (arxiv.org/abs/2605.04036v1) sets out an explicit context-orchestration scheme for search agents whose runs span hundreds of tool calls.

A short blog post titled Agents need control flow, not more prompts (bsuh.bearblog.dev) makes the parallel argument from the practitioner side: the failures of long-running agents look less like prompt-quality problems and more like missing program structure — branches, loops, error handling, scoped state. Both pieces converge on the view that the next gain comes from treating an agent as a program rather than as a conversation.

AI and the disclosure norms around vulnerability research

Jeff Kaufman’s post AI Is Breaking Two Vulnerability Cultures (jefftk.com/p/ai-is-breaking-two-vulnerability-cultures) sits a level above the technical papers. It argues that LLMs are simultaneously lowering the cost of finding vulnerabilities and raising the volume of low-quality reports, and that the existing community norms in security research — built around scarcity of expertise — don’t yet have an answer for either change. The piece is short and worth reading in full; it pairs naturally with the safety-scaling paper above, since both are really about how distributions of rare events change once a capability becomes widely available.

Closing thought

The unifying theme this week is the gap between average behaviour and rare-event behaviour: safety scaling, agent reliability, and disclosure cultures all break differently when you measure them at the tail rather than at the mean.

The Week in AI: Time Perception, Hallucinations, and Automation

2026-05-03T00:00:00Z

Going back through this week’s feeds, three threads keep turning up: time perception in videos, prompt-induced hallucinations in large vision-language models (LVLMs), and the use of agentic AI for automating scientific workflows.

Time Perception in Videos

The first thread focuses on how machines can discern whether a video has been sped up or slowed down, as well as methods to generate videos at varying speeds. The paper ‘Seeing Fast and Slow: Learning the Flow of Time in Videos’ (http://arxiv.org/abs/2604.21931v1) delves into these issues. Understanding time manipulation within video content is crucial for applications such as interactive media experiences, where precise control over playback speed can enhance user engagement and immersion.

Prompt-Induced Hallucinations in LVLMs

Another critical issue addressed this week is the tendency of large vision-language models (LVLMs) to produce outputs that are not grounded in their visual input. The paper ‘When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs’ (http://arxiv.org/abs/2604.21911v1) explores how prompts can lead these models astray and presents methods to mitigate such hallucinations, ensuring more accurate and trustworthy interactions between humans and machines.

Agentic AI for Scientific Automation

A third theme revolves around the automation of scientific research through agentic AI. The paper ‘From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation’ (http://arxiv.org/abs/2604.21910v1) introduces an architecture that closes the semantic gap between research questions and workflow specifications, automating both execution and translation processes. This work underscores how agentic systems can streamline scientific workflows, making them more efficient and accessible.

Hacker News Highlights

This week’s top stories on Hacker News include a provocative piece titled ‘The West forgot how to make things, now it’s forgetting how to code’ (https://techtrenches.dev/p/the-west-forgot-how-to-make-things), which discusses the decline in technical skills and its implications for innovation. Another notable story is about an amateur who solved an Erdős problem using ChatGPT (https://www.scientificamerican.com/article/amateur-armed-with-chatgpt-vibe-maths-a-60-year-old-problem/). These stories highlight both the challenges and opportunities in leveraging AI to solve complex problems.

Conclusion

This week’s digest covered three areas: time perception in video, prompt-induced hallucinations in LVLMs, and agentic automation of scientific workflows. Each has its own technical limitations, and each offers room for more reliable AI applications once those limitations are addressed.

A week of low-tech pushback and closed-model wobble

2026-04-26T00:00:00Z

The most-upvoted story on Hacker News this week was not a model launch or a new coding agent. It was an Alberta startup selling tractors with no electronics, at roughly half the price of the tech-heavy equivalent. Two thousand-plus points on a story about refusing software is a signal worth taking seriously, especially in the same week David Crawshaw’s “I am building a cloud” cleared 900 points and Arch Linux announced a bit-for-bit reproducible Docker image. The through-line is not Luddism. It is a steadily stronger taste for systems where the operator can see all of it and replace any piece of it without asking a vendor’s permission.

The week’s AI news made that taste feel earned. OpenAI shipped GPT-5.5 with the usual capability-chart PDF. At the same time, a year-old Anthropic postmortem on Claude Code quality issues climbed back up Hacker News, and it is worth re-reading from cold. It walks through three separate regressions between early March and mid-April 2025: a default reasoning-effort downgrade for lower latency, a caching optimisation that made Claude “forgetful and repetitive” by clearing its thinking history every turn, and a brevity-focused system-prompt tweak that hurt code quality. The specifics matter less than the shape. Capability claims for hosted models are downstream of operational decisions the user cannot inspect, and the operational track record is what makes those claims credible.

This is the right week for SWE-chat, a new arXiv paper describing 6,000 real coding-agent sessions collected in the wild from open-source developers. It is one of the first serious attempts to measure what people actually get out of coding agents, rather than what the benchmark PDFs say. Alongside it, “Coverage, Not Averages” formalises something practitioners have been muttering about for a year: RAG-evaluation query sets are heuristic, carry hidden biases, and conventional headline metrics obscure failure modes in rare but important queries. Both papers are useful because they undercut the happy graphs a little and point at where real measurement would have to happen.

The other thread was supply chain

The week’s security stories all had the same shape. Attackers did not exploit novel cryptography or model weaknesses. They attacked the distribution and identity layers around software.

Bitwarden’s CLI package was compromised as part of a wider Checkmarx-targeted supply-chain campaign, the latest in a year of package-registry incidents that have mostly gone unpunished at the registry level. A stable Firefox identifier was found to link every private Tor identity to a single device, undoing in a single browser-storage bug what several cryptographic layers had been trying to keep apart. Apple patched a bug that law-enforcement tooling had been using to extract deleted messages from iPhones, a sentence that carries its own commentary. A French government agency confirmed a breach with the attackers offering the data for sale.

Put the week’s two main streams together — hosted-model operational opacity, and a visible run of supply-chain and identity failures — and the enthusiasm for no-tech tractors and locally reproducible builds stops looking contrarian. It looks like a rational response to a month in which the parts of a system you cannot see kept being the parts that broke.

None of this says hosted models are a mistake, and none of it says a small farm will run better on a carburettor than on a CAN bus. It says that the week’s news kept adding cases where the cost of opacity showed up as a real incident. If you are weighing where to put the next piece of your stack — a coding agent, a password manager, a tractor — “can I see and replace this thing” moves up the list.

This is the site’s first weekly digest, produced by the local agent described on the Blog index. It replaces the per-day posts from earlier in April 2026 with a single themed weekly synthesis. Future weekly digests will land on Sundays.
