Author’s note: this post was drafted by Claude (Anthropic) from my project notes and source code, then reviewed and edited by me before publishing. The voice and judgments are mine; the typing isn’t.
University of Luke is a personal academic hub: a knowledge base spanning 132 fields across 11 faculties, with a tutor you can ask questions, course and curriculum generators, flashcards with spaced repetition, and a cross-domain concept map — all running offline on the RTX 3070 in my desktop. A static export of it is browsable at /labs/university/, and the read-along book it feeds — a CS textbook with a 3D avatar and a narrated audiobook excerpt — is at /labs/read-along/.
The interesting part isn’t the feature list. It’s that the answers are accurate, and why they’re accurate.
The problem with small local models
Local LLMs in the 7B–14B range are attractive for all the obvious reasons: private, free to run, no internet required. They’re also, out of the box, unreliable for knowledge work. Ask a 7B to write about a topic from memory and it invents plausible-but-wrong specifics — dates, names, citations. The standard fixes both defeat the point: a bigger model doesn’t fit (a 14B needs ~10 GB and my GPU has 8, so it spills 40–60% onto the CPU and takes minutes per task), and calling a frontier API gives up the privacy, the offline operation, and the zero marginal cost.
An education tool that invents facts is worse than useless. So the constraint set was: one 8 GB consumer GPU, fully offline in the hot path, and accuracy that isn’t negotiable.
The insight: synthesis, not recall
The observation the whole system rests on: local models are good at synthesis and bad at generation from thin air. Asked to recall, a 7B confabulates. Given verified source material and asked to reshape it, the same 7B is fast and accurate. So the architecture splits the two jobs:
- Authoring (one-time, high-effort, done ahead of time and outside the user's hot path): a frontier model with web verification writes a deep, fact-checked knowledge base, which serves as the textbook.
- Synthesis (local, on-demand, cheap): the small model only ever reshapes retrieved, verified passages into answers, courses, and quizzes. It never has to remember anything.
The one-line version: Claude writes the textbook; the local model teaches from it.
The build
The authoring pipeline ran as an agentic fan-out — on the order of 120 parallel agents, each web-verifying facts and citations for one field, each writing one structured module (a summary, nine sections, six key works). The output is ~289,000 words with 787 unique cited references, validated programmatically: schema checks, coverage checks against the catalogue, 132/132 modules present, zero gaps or orphans.
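The programmatic validation can be pictured roughly as follows. This is a minimal sketch, not the project's code: the field names `summary`, `sections`, and `key_works`, and both function names, are assumptions made for illustration. Only the shape of a module (a summary, nine sections, six key works) and the gap/orphan check against the catalogue come from the description above.

```python
REQUIRED_SECTIONS = 9   # per the module structure described above
REQUIRED_KEY_WORKS = 6


def validate_module(module: dict) -> list[str]:
    """Return schema problems for one module; an empty list means it passes."""
    problems = []
    if not str(module.get("summary", "")).strip():
        problems.append("missing summary")
    n_sections = len(module.get("sections", []))
    if n_sections != REQUIRED_SECTIONS:
        problems.append(f"expected {REQUIRED_SECTIONS} sections, found {n_sections}")
    n_works = len(module.get("key_works", []))
    if n_works != REQUIRED_KEY_WORKS:
        problems.append(f"expected {REQUIRED_KEY_WORKS} key works, found {n_works}")
    return problems


def coverage_report(catalogue: set[str], modules: dict[str, dict]) -> dict:
    """Compare written modules against the catalogue of fields."""
    written = set(modules)
    invalid = {}
    for field, module in modules.items():
        problems = validate_module(module)
        if problems:
            invalid[field] = problems
    return {
        "present": len(catalogue & written),    # fields with a module
        "gaps": sorted(catalogue - written),    # catalogued but never written
        "orphans": sorted(written - catalogue), # written but not catalogued
        "invalid": invalid,                     # modules failing the schema
    }
```

A clean run in this sketch is one where `present` equals the catalogue size and `gaps`, `orphans`, and `invalid` are all empty.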
Serving is a straightforward retrieval loop: section-level embeddings (~1,200 passages, nomic-embed-text, cached on disk), cosine retrieval across all domains at once — which is what makes cross-disciplinary questions work — and then the local model synthesises a grounded answer with inline citations. The grounding policy is KB-first: answer from the verified corpus when it covers the query, and only fall back to live open APIs (Wikipedia, OpenAlex, Crossref and friends — 18 keyless academic APIs in total) when it doesn’t.
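To make the retrieval step and the KB-first policy concrete, here is a small sketch under stated assumptions. Toy three-dimensional vectors stand in for real nomic-embed-text embeddings, and the passage layout is invented. The `min_score` threshold and `choose_source` are also hypothetical: the post does not say how "the corpus covers the query" is actually decided, so a similarity cut-off is used here as one plausible stand-in.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: list[float], passages: list[dict], k: int = 5) -> list[tuple[float, dict]]:
    """Score every passage from every domain and return the top k, best first."""
    scored = [(cosine(query_vec, p["embedding"]), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def choose_source(scored: list[tuple[float, dict]], min_score: float = 0.6) -> str:
    """KB-first: use the corpus if the best hit clears the (assumed) threshold."""
    if scored and scored[0][0] >= min_score:
        return "kb"
    return "fallback"  # i.e. go to the live open APIs


# Toy corpus: passages from different fields live in one flat list,
# so a single query can surface hits across disciplines.
passages = [
    {"id": "cs-ml-01", "field": "machine learning", "embedding": [1.0, 0.0, 0.0]},
    {"id": "phil-mind-04", "field": "philosophy of mind", "embedding": [0.7, 0.7, 0.0]},
    {"id": "hist-med-02", "field": "history of medicine", "embedding": [0.0, 1.0, 0.0]},
]
hits = retrieve([1.0, 0.0, 0.0], passages, k=2)
```

Searching one flat list rather than per-field indexes is the design choice that lets the second hit come from a different faculty than the first.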
One corpus, eight products on top of it: the tutor, course and curriculum generators, custom interdisciplinary “majors”, flashcards, quizzes, the concept map, and a bibliography. They all reuse the same retrieval and synthesis engine.
The 7B vs 14B decision, measured
The trade-off I care most about defending: the default model is the 7B, and that was decided by measurement, not vibes. Once grounded, the 7B produced specific, accurate output — an ML course with the correct 1956 Dartmouth date, transformers at 2017, real key works — in about 42 seconds. The 14B took 8–12 minutes for the same task on my hardware (that CPU spillover) and gave smoother prose but no better facts.
That’s the finding in one sentence: the architecture bought the accuracy, so the bigger model had nothing left to add except latency. On those numbers the 7B is roughly 11–17× faster (42 seconds against 8–12 minutes) at near-equal quality.
How accurate, honestly
The claim I’ll stand behind is “zero observed hallucination on spot-checked specifics” — dates, names, attributions checked by hand against the corpus and the world. That is not the same as a formal evaluation harness with faithfulness scoring, which the project doesn’t have. The honest statement is: grounding plus verified authoring removed every fabrication I went looking for, and I went looking in the places small models usually fail. If I were productionising this for anyone but me, an eval gate (faithfulness and citation-accuracy scoring in CI) is the first thing I’d add, along with a real vector store in place of the file caches.
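For a sense of what the smallest version of such a gate could look like, here is a hypothetical sketch; nothing like it exists in the project. It assumes answers cite passages with bracketed IDs such as `[ai-01]`, which is an invented convention, and it checks only that every cited ID was among the retrieved passages. That is a crude proxy for citation accuracy and says nothing about faithfulness, which would need a check of the claims themselves.

```python
import re

# Assumed citation convention: bracketed passage IDs like [ai-01].
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def citation_accuracy(answer: str, retrieved_ids: set[str]) -> float:
    """Fraction of cited IDs that point at a passage the model was actually given."""
    cited = CITATION.findall(answer)
    if not cited:
        return 0.0  # an uncited answer counts as a failure in this sketch
    valid = sum(1 for c in cited if c in retrieved_ids)
    return valid / len(cited)


def gate(cases: list[tuple[str, set[str]]], threshold: float = 0.95) -> tuple[bool, float]:
    """Return (passed, mean score) over a fixed set of (answer, retrieved IDs) cases."""
    scores = [citation_accuracy(answer, ids) for answer, ids in cases]
    mean = sum(scores) / len(scores)
    return mean >= threshold, mean
```

In CI, a failing `gate` result would block the change; the 0.95 threshold is a placeholder rather than a recommendation.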
There’s also a quieter robustness layer that mattered in practice: long local generations survive disconnects, build queues persist and resume after a crash, and the system degrades gracefully when a source API is down. A local box is a messy place to run a pipeline; the code assumes that.
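The "persist and resume" behaviour can be illustrated with a small sketch. This is one common way to do it, not necessarily how the project does: the JSON state file, its `pending`/`done` layout, and the function names are all assumptions. The idea is to write the queue state to disk after every finished job, using a temp file plus `os.replace` so a crash mid-write cannot leave a half-written state file behind.

```python
import json
import os
from pathlib import Path
from typing import Callable


def save_state(path: Path, state: dict) -> None:
    """Write state to a temp file, then swap it into place in one step."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    os.replace(tmp, path)  # replaces the old file atomically on the same filesystem


def load_state(path: Path, jobs: list[str]) -> dict:
    """Resume from disk if a state file exists, otherwise start a fresh queue."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"pending": list(jobs), "done": []}


def run_queue(path: Path, jobs: list[str], build: Callable[[str], None]) -> list[str]:
    """Run pending jobs in order, checkpointing after each one completes."""
    state = load_state(path, jobs)
    while state["pending"]:
        job = state["pending"][0]
        build(job)                 # may raise or be killed; state on disk stays valid
        state["pending"].pop(0)
        state["done"].append(job)
        save_state(path, state)    # checkpoint only after the job has finished
    return state["done"]
```

Because the checkpoint happens after `build` returns, a job interrupted halfway is simply rerun on the next start, so `build` should be safe to repeat.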
Try it
/labs/university/ is the static export — the faculties, fields, essays, and concept map, browsable as-is. The live tutor and generators need the local model running, so they stay on my machine, but the corpus they teach from is the thing you’re reading.
/labs/read-along/ is the same knowledge base in a different mode: a computer-science book authored from it, read along by a fully-offline 3D avatar, with a ten-minute excerpt of the audiobook narrated by a clone of my voice. The full audiobook is 80 chapters and 1.6 GB, built locally; only one excerpt ships because of file-size limits, and honesty about that beats pretending otherwise.
Stack, for the record: Python, Flask (stdlib-lean), Ollama running qwen2.5 7B/14B and nomic-embed-text, on-disk embedding and content caches, vanilla JS frontend, one RTX 3070 on Windows.
— Luke Simmons, Auckland