how the guides are built
Methods
Most AI content sites won't tell you how their pages were made. We will. The recreation guides on Apprentice are produced by a pipeline that combines semantic retrieval over classical art-instruction texts, a small reranker, and a larger generator constrained by an anti-hallucination prompt. The pipeline is deliberately conservative. It would rather omit a detail than invent one.
The corpus
Three layers of source material:
1. Ten classical art-instruction books spanning 1390–1920, listed in full on the sources page. Cennini through Bridgman. The technical lineage of Western painting from the Renaissance bench to the early-twentieth-century atelier.
2. ~2,200 artist Wikipedia biographies and ~200 Wikipedia pages for individually famous artworks. Wikipedia is the source for named facts: when a particular painter trained, what a specific work depicts.
3. Wikipedia pages for movements, techniques, and pigments. Background for the broader context a recreation has to fit into.
Everything was chunked, embedded with nomic-embed-text (768 dimensions, L2-normalized), and indexed for nearest-neighbor lookup. The resulting index contains roughly 14,000 passages.
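Because the embeddings are L2-normalized, cosine similarity between a query and a passage reduces to a dot product. The sketch below illustrates that lookup on toy 3-dimensional vectors rather than the real 768-dimensional ones. The passage ids and the brute-force scan are assumptions made for illustration; the production index may well use an approximate nearest-neighbor structure instead.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def top_k(query, passages, k=3):
    """Return the k (score, passage_id) pairs with the highest cosine similarity.

    `passages` maps a passage id to an already L2-normalized embedding,
    so cosine similarity is just the dot product with the normalized query.
    """
    q = l2_normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, emb)), pid)
        for pid, emb in passages.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical passage ids with toy embeddings (not real corpus data).
index = {
    "cennini-ch67": l2_normalize([1.0, 0.0, 0.0]),
    "bridgman-hands": l2_normalize([0.0, 1.0, 0.0]),
    "wiki-ultramarine": l2_normalize([1.0, 1.0, 0.0]),
}

results = top_k([2.0, 0.1, 0.0], index, k=2)
for score, pid in results:
    print(f"{pid}: {score:.3f}")
```

A brute-force scan like this is workable at roughly 14,000 passages; a dedicated index mainly matters once the corpus grows well beyond that.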
The pipeline
For each artwork, the pipeline runs four queries — one per aspect: composition, color, drawing, materials. For each query:
1. Retrieve. Top-K nearest passages from the index by cosine similarity.
2. Rerank. A small model (`qwen2.5:7b-instruct`) scores each passage for whether it would help a painter recreate this specific work. Embedding similarity finds passages that mention a technique; the rerank step weeds out passages that share keywords but don't actually carry information you can paint with.
3. Compose. The retained passages are concatenated as a labeled `SOURCE PASSAGES` block, then passed to the generator alongside the artwork metadata.
4. Generate. `qwen3.6:27b` produces structured JSON: overview, materials, palette, steps grouped by phase, critical techniques, common pitfalls, known gaps, source citations. Every named technique cites the source that supports it.
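The rerank and compose steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the reranker is replaced by precomputed scores in a hypothetical `score` field, the `[S1]`-style labels and the prompt layout are invented for the example, and the artwork and passages are placeholders.

```python
def rerank(passages, score_fn, keep=2):
    """Keep the `keep` passages that score highest under score_fn.

    score_fn is a stand-in for the small reranker model, which in the
    real pipeline would be a model call per passage.
    """
    return sorted(passages, key=score_fn, reverse=True)[:keep]

def compose_prompt(artwork, aspect, passages):
    """Build generator input: artwork metadata plus a labeled passage block."""
    lines = [
        f"ARTWORK: {artwork['title']} ({artwork['artist']})",
        f"ASPECT: {aspect}",
        "",
        "SOURCE PASSAGES",
    ]
    for i, passage in enumerate(passages, start=1):
        lines.append(f"[S{i}] {passage['source']}: {passage['text']}")
    return "\n".join(lines)

# Placeholder data; the scores imitate what a reranker might return.
candidates = [
    {"source": "Source A", "text": "Mentions glazing in passing.", "score": 0.2},
    {"source": "Source B", "text": "Lay the glaze thinly over a dry underpainting.", "score": 0.9},
    {"source": "Source C", "text": "Grind the pigment with oil before use.", "score": 0.7},
]
artwork = {"title": "Example Work", "artist": "Example Artist"}

kept = rerank(candidates, score_fn=lambda p: p["score"], keep=2)
prompt = compose_prompt(artwork, "materials", kept)
print(prompt)
```

Giving each passage a label in the prompt is what makes it possible to check later that every citation in the output points at a passage the generator actually saw.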
Anti-hallucination discipline
The generator's prompt is explicit about what's allowed:
- Specific visual details (objects on the wall, clothing patterns, facial expressions) may only be mentioned if a source passage explicitly describes them. No filling in from training-data stereotypes of the artist's other paintings.
- General artist practice (a painter's documented palette, signature methods) may be stated without an exact source quote, but must be phrased as such: "Vermeer characteristically…" not "in this painting…".
- Uncertainty has three paths, in preference order: (a) hedge with "likely" or "characteristically", (b) emit `null` for the field, (c) list the gap explicitly in the `knownGaps` array. Confident specifics that paper over uncertainty are not allowed.
The "what the sources don't tell us" block you'll see on every artwork page surfaces that `knownGaps` array directly. It is there on purpose. We would rather show you what the corpus doesn't cover than pretend it does.
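As a rough illustration of how a generated record might be read back, the sketch below pulls out the `knownGaps` array and the names of any fields the generator left as `null`. Only `knownGaps` is a name taken from the description above; the other keys and the sample record are hypothetical, and whether null fields are shown on the page is not something this sketch claims to know.

```python
import json

def surface_gaps(raw_json):
    """Return (known gaps, names of top-level fields set to null)."""
    record = json.loads(raw_json)
    gaps = record.get("knownGaps") or []
    null_fields = sorted(key for key, value in record.items() if value is None)
    return gaps, null_fields

# Hypothetical record showing all three uncertainty paths:
# a hedged statement, a null field, and an explicit gap.
raw = """
{
  "overview": "Likely built up in thin layers.",
  "palette": null,
  "knownGaps": ["No source passage describes the ground colour."]
}
"""

gaps, null_fields = surface_gaps(raw)
print("What the sources don't tell us:")
for gap in gaps:
    print(" -", gap)
print("Fields left empty:", null_fields)
```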
What we know about how well it works
The first production run covered 2,500 priority artworks chosen by fame, style, and beginner-friendliness. Of those:
| Metric | Value |
|---|---|
| Fully enriched | 2,491 / 2,500 (99.64%) |
| JSON parse failures | 9 (re-runnable) |
| Steps with a source citation | 13,401 / 14,044 (95.4%) |
| Records with non-empty `knownGaps` | 100% (all ≥3 entries) |
| Unique cited sources across corpus | 728 |
| Hallucinated source labels | 0 detected |
What this doesn't tell you: how factually correct each individual guide is on close reading by a domain expert. We're working on that. When users flag issues, we'll log them on a public errata page and update the affected records.
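The citation and source-label figures in the table are the kind of check that can be automated. A minimal sketch, assuming a simplified record shape in which each step carries an optional `source` label (the real schema groups steps by phase and is not shown here), and using placeholder records rather than production data:

```python
def citation_stats(records, known_sources):
    """Return (share of steps with a citation, cited labels not in the corpus)."""
    total = 0
    cited = 0
    unknown = set()
    for record in records:
        for step in record["steps"]:
            total += 1
            label = step.get("source")
            if label:
                cited += 1
                if label not in known_sources:
                    unknown.add(label)
    coverage = cited / total if total else 0.0
    return coverage, unknown

# Placeholder records: one uncited step and one label that is not in the corpus.
records = [
    {"steps": [
        {"text": "Block in the main shapes.", "source": "Source A"},
        {"text": "Adjust the edges.", "source": None},
    ]},
    {"steps": [
        {"text": "Glaze the shadows.", "source": "Source Z"},
    ]},
]
known_sources = {"Source A", "Source B"}

coverage, unknown = citation_stats(records, known_sources)
print(f"coverage: {coverage:.1%}, unknown labels: {sorted(unknown)}")

# The percentages reported in the table follow from the raw counts.
enriched_pct = 2491 / 2500 * 100
cited_pct = 13401 / 14044 * 100
print(f"{enriched_pct:.2f}% enriched, {cited_pct:.1f}% of steps cited")
```

A check like this only confirms that a cited label exists in the corpus. It says nothing about whether the cited passage actually supports the step, which is the part that still needs expert review.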
What this doesn't replace
A grounded LLM guide is a starting scaffold, not a substitute for studying with a working painter. The recreations are best understood as a way to focus attention on what to look for — the brushwork phase, the palette choices, the historical method — and as a way to cross-reference what the classical books actually say. The corpus itself is the source of truth. Read it.