RAG

This example runs retrieval-augmented generation on Zumik with the retrieved documents forming a stable prefix pinned to a session. Repeated queries over the same corpus reuse the compiled document context instead of resending it on every call; this is the reuse that Zumik measures and reports.

Source: examples/rag. Retrieval here is a toy keyword scorer so that the example is self-contained. Only the top-k document ids matter to the rest of the flow, so you can swap in your own vector store and feed in its results.

Run

export ZUMIK_API_KEY="zk_..."
pip install -r requirements.txt   # httpx==0.28.1
python rag.py

Walkthrough

Retrieve the relevant documents

Rank the corpus and take the top-k document ids. The example uses a tiny query-term overlap scorer; a real pipeline returns its vector store's top-k.

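def retrieve(question: str, k: int = 2) -> list[str]:
    # Lowercase each question term and strip surrounding punctuation.
    terms = {t.strip(".,?").lower() for t in question.split()}
    # Score each document by how many terms occur in its text (substring match).
    scored = sorted(CORPUS.items(),
        key=lambda kv: sum(term in kv[1].lower() for term in terms),
        reverse=True)
    # Keep only the ids of the k best-scoring documents.
    return [doc_id for doc_id, _ in scored[:k]]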
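To see what the scorer returns, here is a minimal runnable sketch with a stand-in corpus. The three CORPUS entries are invented for illustration and are not the ones shipped in examples/rag. Note that `term in text` is a substring test, so very short terms can match inside longer words.

```python
# Hypothetical corpus: ids and text are illustrative, not from examples/rag.
CORPUS = {
    "billing": "Invoices are issued monthly. Refunds take five business days.",
    "limits": "Rate limits apply per API key and reset every minute.",
    "sessions": "A session pins a base bundle and exposes a default branch.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    terms = {t.strip(".,?").lower() for t in question.split()}
    scored = sorted(CORPUS.items(),
        key=lambda kv: sum(term in kv[1].lower() for term in terms),
        reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# "refunds" and "take" both occur in the billing document, so it ranks first.
top = retrieve("How long do refunds take?", k=1)
print(top)
```

Documents with equal scores keep their original corpus order, because Python's sort is stable even with `reverse=True`.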

Upload each document as an immutable artifact

Each retrieved document becomes a document artifact, with the source id kept in metadata for traceability.

art = client.post("/v2/artifacts", headers=h, json={
    "artifact_type": "document",
    "content": CORPUS[doc_id],
    "metadata": {"source_id": doc_id},
})
items.append({"artifact_id": art.json()["id"], "role": "context"})
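The snippet above handles one document. A minimal sketch of the surrounding loop might look like the following. The helper name `upload_documents` and its parameters are hypothetical, and `client` is assumed to be an `httpx.Client`-like object configured with the Zumik base URL. The `raise_for_status()` call is a standard `httpx.Response` method, added here as an assumption about how you would want failures surfaced.

```python
from typing import Any

def upload_documents(client: Any, headers: dict, corpus: dict[str, str],
                     doc_ids: list[str]) -> list[dict]:
    """Upload each retrieved document and return the ordered bundle items."""
    items = []
    for doc_id in doc_ids:  # keep retrieval order so the bundle order is predictable
        art = client.post("/v2/artifacts", headers=headers, json={
            "artifact_type": "document",
            "content": corpus[doc_id],
            "metadata": {"source_id": doc_id},
        })
        art.raise_for_status()  # fail fast on a non-2xx response
        items.append({"artifact_id": art.json()["id"], "role": "context"})
    return items
```

Returning the list rather than mutating a global keeps the upload step easy to test with a fake client.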

Bundle the documents and open a session

The retrieved set becomes an ordered agent_prefix bundle; the session pins it as its base context.

bundle = client.post("/v2/bundles", headers=h, json={
    "bundle_type": "agent_prefix", "items": items,
})
session = client.post("/v2/sessions", headers=h, json={
    "base_bundle_ids": [bundle.json()["id"]],
})
sid, bid = session.json()["id"], session.json()["default_branch_id"]

Run queries against the session

Each query goes through the native /v2/responses surface with a retrieval template and an interactive QoS request. The pinned document prefix is reused across queries; the response reports the execution profile.

resp = client.post("/v2/responses", headers=h, json={
    "model": "code.fast",
    "input": RETRIEVAL_TEMPLATE.format(question=q),
    "session_id": sid,
    "branch_id": bid,
    "qos": {"class": "interactive", "degrade_policy": "allow_compatible_fallback"},
})
body = resp.json()
print(body["output_text"], body.get("execution_profile"))

Reusing the document prefix

The documents are uploaded once and pinned as the session's base bundle. Because each query references the session (session_id + branch_id), the compiled document context is reused rather than recompiled. The actual reuse is reported on the response rather than assumed.

The example retrieves once for the lead question and reuses that document set for the follow-ups. For a production retriever, re-run retrieval per query and rebuild the bundle only when the retrieved set changes; an unchanged set keeps the same stable prefix, so the reuse holds.
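One way to implement "rebuild only when the retrieved set changes" is to key the current bundle on the retrieved ids. The sketch below is an assumption about how you might structure this, not part of examples/rag: `BundleCache` and the `build_bundle` callback are hypothetical names, and the callback stands in for the upload-and-bundle steps shown earlier. Sorting the ids is a design choice made here so that a reshuffled ranking of the same documents still maps to the same prefix.

```python
from typing import Callable, Optional

class BundleCache:
    """Remember the bundle id built for the most recent retrieved set."""

    def __init__(self, build_bundle: Callable[[tuple], str]):
        # Hypothetical callback: uploads the documents and returns a bundle id.
        self._build_bundle = build_bundle
        self._key: Optional[tuple] = None
        self._bundle_id: Optional[str] = None

    def get(self, doc_ids: list[str]) -> tuple[str, bool]:
        """Return (bundle_id, rebuilt) for the given retrieved ids."""
        # Canonical order: the same set of ids always yields the same key.
        key = tuple(sorted(set(doc_ids)))
        if key == self._key and self._bundle_id is not None:
            return self._bundle_id, False
        self._bundle_id = self._build_bundle(key)
        self._key = key
        return self._bundle_id, True

# Stand-in builder that only counts how often a rebuild happens.
built = []
def fake_build(ids: tuple) -> str:
    built.append(ids)
    return f"bundle-{len(built)}"

cache = BundleCache(fake_build)
first = cache.get(["a", "b"])    # no bundle yet, so this builds one
second = cache.get(["b", "a"])   # same set in a different order, reused
third = cache.get(["a", "c"])    # set changed, rebuilt
print(first, second, third)
```

Only the first and third calls invoke the builder, which is the behaviour the paragraph above describes for an unchanged retrieved set.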

Tip

The retrieval template tells the model to answer using only the session's documents and to say so when the answer is not there. Keeping that instruction stable, with the question last, preserves the cacheable prefix; this is the same ordering the prompt linter checks.
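To illustrate why the question goes last, here is a small sketch with a stand-in template. The template wording is hypothetical (the real RETRIEVAL_TEMPLATE lives in examples/rag), and `shared_prefix_len` is a helper introduced only for this illustration.

```python
# Hypothetical template text; only the shape (stable instruction, question last) matters.
RETRIEVAL_TEMPLATE = (
    "Answer using only the documents attached to this session. "
    "If the documents do not contain the answer, say so.\n\n"
    "Question: {question}"
)

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

p1 = RETRIEVAL_TEMPLATE.format(question="What is the refund window?")
p2 = RETRIEVAL_TEMPLATE.format(question="How are rate limits reset?")

# Everything before the placeholder is identical across queries.
stable = RETRIEVAL_TEMPLATE.split("{question}")[0]
print(shared_prefix_len(p1, p2) == len(stable))
```

If the question were placed before the instruction, the two prompts would diverge at the first character of the question and share almost no prefix.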

LlamaIndex

The same documents-over-sessions pattern from a LlamaIndex pipeline.

Reuse metrics

How realized reuse is reported versus the opportunity available.
