How the guides are built

Methods

Most AI content sites won't tell you how their pages were made. We will. The recreation guides on Apprentice are produced by a pipeline that combines semantic retrieval over classical art-instruction texts, a small reranker, and a larger generator constrained by an anti-hallucination prompt. The pipeline is deliberately conservative. It would rather omit a detail than invent one.

The corpus

Three layers of source material:

01 Ten classical art-instruction books spanning 1390–1920, listed in full on the sources page. Cennini through Bridgman. The technical lineage of Western painting from the Renaissance bench to the early-twentieth-century atelier.
02 ~2,200 artist Wikipedia biographies and ~200 Wikipedia pages for individually famous artworks. Wikipedia is the source for named facts: when a particular painter trained, what a specific work depicts.
03 Wikipedia pages for movements, techniques, and pigments. Background for the broader context a recreation has to fit into.

Everything was chunked, embedded with nomic-embed-text (768 dimensions, L2-normalized), and indexed for nearest-neighbor lookup. The resulting index contains roughly 14,000 passages.
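Because the embeddings are L2-normalized, the cosine similarity between two passages reduces to a plain dot product. The sketch below illustrates that idea with a brute-force lookup in pure Python. It is not the production index: the passage ids, the toy 3-dimensional vectors, and the function names `l2_normalize` and `top_k` are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def top_k(query, index, k):
    """Return the k (passage_id, score) pairs with the highest cosine similarity.

    `index` maps a passage id to a unit-length embedding, so the dot
    product with the normalized query equals the cosine similarity.
    """
    q = l2_normalize(query)
    scored = [(pid, sum(a * b for a, b in zip(q, emb))) for pid, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical ids and toy 3-d vectors standing in for 768-d embeddings.
index = {
    "passage-a": l2_normalize([0.9, 0.1, 0.0]),
    "passage-b": l2_normalize([0.0, 1.0, 0.2]),
    "passage-c": l2_normalize([0.7, 0.0, 0.7]),
}
nearest = top_k([1.0, 0.0, 0.1], index, k=2)
```

At roughly 14,000 passages a linear scan like this is workable, though a real deployment would more likely use a vector library or an approximate nearest-neighbor index.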

The pipeline

For each artwork, the pipeline runs four queries — one per aspect: composition, color, drawing, materials. For each query:

i. Retrieve. Top-K nearest passages from the index by cosine similarity.
ii. Rerank. A small model (qwen2.5:7b-instruct) scores each passage for whether it would help a painter recreate this specific work. Embedding similarity finds passages that mention a technique; the rerank step weeds out passages that share keywords but don't actually carry information you can paint with.
iii. Compose. The retained passages are concatenated as a labeled SOURCE PASSAGES block, then passed to the generator alongside the artwork metadata.
iv. Generate. qwen3.6:27b produces structured JSON: overview, materials, palette, steps grouped by phase, critical techniques, common pitfalls, known gaps, source citations. Every named technique cites the source that supports it.
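The rerank, compose, and parse steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual pipeline code: the scoring function is passed in as a stand-in for the reranker model, and the 0.5 threshold, the `[S1]`-style labels, and the passage field names `source` and `text` are hypothetical.

```python
import json

def rerank(passages, score_fn, threshold=0.5):
    """Keep passages whose usefulness score meets the threshold, best first."""
    scored = [(score_fn(p), p) for p in passages]
    kept = [(s, p) for s, p in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in kept]

def compose_sources(passages):
    """Concatenate passages into one labeled SOURCE PASSAGES block."""
    lines = ["SOURCE PASSAGES"]
    for n, p in enumerate(passages, start=1):
        # Label format is an assumption; the source only says the block is labeled.
        lines.append(f"[S{n}] {p['source']}: {p['text']}")
    return "\n".join(lines)

def parse_guide(raw):
    """Parse generator output; return None on malformed JSON so the record can be re-run."""
    try:
        guide = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return guide if isinstance(guide, dict) else None

# Toy passages with hypothetical source names.
passages = [
    {"source": "Source A", "text": "Lay the ground in thin layers."},
    {"source": "Source B", "text": "The painting hangs in a museum."},
]

def toy_score(passage):
    # Stand-in for the model-based reranker: a keyword rule, for illustration only.
    return 1.0 if "ground" in passage["text"] else 0.0

block = compose_sources(rerank(passages, toy_score))
```

Returning `None` from `parse_guide` rather than raising keeps a malformed response from halting a batch; the failed record can simply be queued for another attempt.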

Anti-hallucination discipline

The generator's prompt is explicit about what's allowed:

· Specific visual details (objects on the wall, clothing patterns, facial expressions) may only be mentioned if a source passage explicitly describes them. No filling in from training-data stereotypes of the artist's other paintings.
· General artist practice (a painter's documented palette, signature methods) may be stated without an exact source quote, but must be phrased as such: "Vermeer characteristically…" not "in this painting…".
· Uncertainty has three paths, in preference order: (a) hedge with "likely" or "characteristically", (b) emit null for the field, (c) list the gap explicitly in the knownGaps array. Confident specifics that paper over uncertainty are not allowed.

The "what the sources don't tell us" block you'll see on every artwork page is the surfacing of that knownGaps array. It is on purpose. We would rather show you what the corpus doesn't cover than pretend it does.

What we know about how well it works

The first production run covered 2,500 priority artworks chosen by fame, style, and beginner-friendliness. Of those:

Metric	Value
Fully enriched	2,491 / 2,500 (99.64%)
JSON parse failures	9 (re-runnable)
Steps with a source citation	13,401 / 14,044 (95.4%)
Records with non-empty `knownGaps`	100% (all ≥3 entries)
Unique cited sources across corpus	728
Hallucinated source labels	0 detected
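Counts like these can be tallied with a short audit pass over the generated records. The sketch below is one way to do it, assuming a simplified record shape: a flat `steps` list where each step carries a `sources` list of labels. The real JSON groups steps by phase, so the field names, the `audit` function, and the toy data are hypothetical. A label counts as unknown here when it does not appear in the set of labels the corpus actually contains.

```python
def audit(records, known_labels):
    """Tally citation coverage, knownGaps presence, and unknown source labels."""
    steps = cited = with_gaps = 0
    unknown = set()
    for rec in records:
        if rec.get("knownGaps"):
            with_gaps += 1
        for step in rec.get("steps", []):
            steps += 1
            labels = step.get("sources") or []
            if labels:
                cited += 1
            # Any label absent from the corpus is a candidate hallucination.
            unknown.update(label for label in labels if label not in known_labels)
    return {
        "steps": steps,
        "cited_steps": cited,
        "cited_pct": round(100 * cited / steps, 1) if steps else 0.0,
        "records_with_gaps": with_gaps,
        "unknown_labels": sorted(unknown),
    }

# Toy records with hypothetical labels; "Source Z" is deliberately not in the corpus.
records = [
    {"knownGaps": ["g1", "g2", "g3"],
     "steps": [{"sources": ["Source A"]}, {"sources": []}]},
    {"knownGaps": ["g1"],
     "steps": [{"sources": ["Source B", "Source Z"]}]},
]
report = audit(records, known_labels={"Source A", "Source B"})
```

A check of this kind only confirms that a cited label exists in the corpus. It does not confirm that the cited passage supports the step.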

What this doesn't tell you: how factually correct each individual guide is on close reading by a domain expert. We're working on that. When users flag issues, we'll log them on a public errata page and update the affected records.

What this doesn't replace

A grounded LLM guide is a starting scaffold, not a substitute for studying with a working painter. The recreations are best understood as a way to focus attention on what to look for — the brushwork phase, the palette choices, the historical method — and as a way to cross-reference what the classical books actually say. The corpus itself is the source of truth. Read it.