Part 3: Process Appendix
How I collaborated with AI agents throughout this project — the working method, the prompting techniques, and where domain knowledge shaped the output.
This project was built with Claude as the primary collaborator, with NotebookLM and ChatGPT used for specific verification and cross-checking tasks. What follows is a short account of the working method I found productive, the techniques that made the collaboration useful rather than sloppy, and where my domain training cut across what the AI proposed.
What came from whom
- AI generated: initial literature search and parameter sourcing, code implementation of the simulator, first drafts of prose sections.
- I contributed: analytical framing, model architecture decisions (what to include and what to cut), all final interpretations and conclusions.
- I caught and corrected the AI on: dropping a decorative two-cohort split that added no analytical value; identifying that a Korean progression rate reflected screening intensity, not slower biology; discovering that the code had drifted from the outcome tree diagram during the vibe-coded v1 (details in Part 2, "Three Inspection Points").
Collaboration stance
- I led the analytical framing. AI was research, drafting, and coding support, not a source of conclusions. When I wanted an opinion I asked for one; the default was for Claude to improve on my thinking rather than generate its own.
- Step-by-step pacing. I did not information-dump. I kept each session focused on one or two decisions, confirmed understanding as I went, and carried a written project log across sessions so that continuity did not depend on the AI's memory.
Prompting techniques that worked
I asked Claude to return research outputs with explicit confidence tags — sourced, derived, judgement, hypothetical — next to every number, along with a named reference. This turned the research document into something I could audit at a glance rather than a wall of prose I had to trust.
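The tagging scheme can be sketched as a small data structure. This is a hypothetical illustration of the audit pattern, not the project's actual parameter list; the names, values, and references below are invented for the example.

```python
from dataclasses import dataclass

# Provenance tags mirroring the scheme described above.
TAGS = {"sourced", "derived", "judgement", "hypothetical"}

@dataclass
class Parameter:
    name: str
    value: float
    tag: str        # one of TAGS
    reference: str  # named source, or a note for non-sourced values

    def __post_init__(self):
        # Reject anything outside the agreed vocabulary so untagged
        # numbers cannot slip into the research document.
        if self.tag not in TAGS:
            raise ValueError(f"unknown tag: {self.tag}")

# Hypothetical entries for illustration only.
params = [
    Parameter("annual_progression_rate", 0.005, "sourced", "Example et al. 2019"),
    Parameter("screening_uptake", 0.60, "judgement", "author estimate"),
]

# The audit view: every non-sourced number is visible at a glance.
needs_scrutiny = [p.name for p in params if p.tag != "sourced"]
```

The point of the structure is not the code itself but the constraint: a number without a tag and a named reference cannot exist in the document.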
For citations that mattered, I pulled the original papers into NotebookLM and asked it to verify specific claims Claude had made. Running a second model against the same source material catches hallucinations that self-consistency checks within a single system would miss.
Before asking Claude to build the interactive demo, I had it generate the outcome tree as a static diagram. Inspecting the structure as a standalone artefact caught design problems that would have been buried inside running code.
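The same inspection habit can be reproduced without a diagramming tool: represent the tree as plain nested data and render it as indented text before any simulator code exists. The node names below are hypothetical placeholders, not the project's actual outcome tree.

```python
# Hypothetical outcome tree, expressed as nested dicts so the structure
# can be inspected as a standalone artefact before coding the simulator.
tree = {
    "at-risk state": {
        "regresses": {},
        "persists": {
            "progresses": {},
            "detected by screening": {},
        },
    }
}

def render(node, depth=0):
    """Return an indented text listing of the tree for eyeballing."""
    lines = []
    for name, children in node.items():
        lines.append("  " * depth + name)
        lines.extend(render(children, depth + 1))
    return lines

print("\n".join(render(tree)))
```

Reviewing this listing answers the structural questions (are the branches exhaustive, is anything double-counted) while the artefact is still cheap to change.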
For open-ended discussion I assigned Claude specific system-level roles — start-up owner, VC investor, sceptical reviewer — which constrained the feedback distribution and kept the sessions from drifting into vague affirmation.
Where domain judgement shaped the output
AI agents have reasonable defaults — but defaults are generic. Three categories of intervention mattered most:
- Scope control. Claude's default was to add structure — more branches, finer stratification, extra parameters. These suggestions were individually defensible, but several did not change the output. The useful question was "does this earn its complexity?" Knowing when a model is complete enough is a judgement call that requires understanding what the model is for, not just what it could include.
- Contextual interpretation of data. Numbers from the literature are not raw facts; they carry the fingerprints of how they were collected. Observational cohort rates may already include opportunistic eradication effects. Country-level progression rates may reflect screening intensity rather than biology. Claude treated these as face-value inputs by default — a reasonable behaviour for a system without epidemiological training. Flagging when a number means something different from what it says is where domain knowledge is irreplaceable.
- Editorial curation. Claude's default when building the demo was to add flexibility — more sliders, more options, more configurability. In a general-purpose tool that instinct is correct. In a demo designed to make one argument legible, extra controls dilute the signal. A cleaner surface does more work than a richer one.