
Building Evals for an AI Memoir Writing Coach



I’ve spent eight years living nomadically across 42 countries, collecting stories and memories. I have stacks of travel journals and thousands of photos. I remember fireflies in the Colombian Amazon, night trains through China, getting spectacularly food-poisoned in India. But whenever I try to turn these memories into something coherent, I get distracted by the structure problem, the voice problem, the where-do-I-even-begin problem.

Traditional memoir-writing prompts feel either too clinical (like filling out a form) or too cheerleader-y (like every memory is a profound breakthrough waiting to happen). What if I could train an AI to be the perfect writing coach for this? Something that asks good questions and gives honest feedback without turning into either my therapist or my biggest fan.

Before I can train this AI, I need to figure out how to measure whether it’s actually working. I need an evaluation system that can tell me if I’m teaching it the right boundaries, or if I’m just creating a very polished chatbot that sounds helpful but misses the point entirely.

The AI needs to be encouraging without being sycophantic. It can say “That’s powerful” but not “OMG SO BRAVE!!!” It needs to be present with emotion without being therapeutic. “I’m sorry” is fine. “You have unresolved grief patterns” is absolutely not. And it needs to be story-focused without being dismissive. “What happened next?” works. “Got it. Moving on.” doesn’t.

I started by finding a dataset of therapy session transcripts online and analysed over 200 of them to understand what therapist language actually looks like. Patterns emerged quickly. “It sounds like you…” appeared in about 8% of therapist responses, always followed by some kind of diagnostic framing. “How does that make you feel?” showed up constantly. Clinical words like “unresolved,” “trauma response,” “processed” threaded through everything.
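The pattern counting itself is simple string matching. A sketch of how I approached it, with an illustrative (not exhaustive) phrase list:

```python
import re

# A few of the therapist-style phrases that surfaced in the transcript
# analysis. This list is illustrative; the full set I used was longer.
THERAPY_PATTERNS = [
    r"\bit sounds like you\b",
    r"\bhow does that make you feel\b",
    r"\bunresolved\b",
    r"\btrauma response\b",
    r"\bprocessed\b",
]

def count_therapy_patterns(responses):
    """Count how many responses contain at least one therapist-style phrase."""
    compiled = [re.compile(p, re.IGNORECASE) for p in THERAPY_PATTERNS]
    hits = 0
    for text in responses:
        if any(c.search(text) for c in compiled):
            hits += 1
    return hits

sample = [
    "It sounds like you felt abandoned out there.",
    "Tell me about the first night on the train.",
    "How does that make you feel now?",
]
rate = count_therapy_patterns(sample) / len(sample)
```

Crude, but running something like this over a few hundred responses is enough to see which phrases cluster where.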

Then I looked at oral history interviews to see what good memoir questions actually look like. These were different. Short questions, usually one sentence. “Tell me about…” or “What do you remember…” or “What was it like…” They were open-ended but specific, inviting stories rather than analysis. I could see what made them different. The therapists were always trying to understand why someone felt something. The oral historians just wanted to know what happened.

Next, I wanted to introduce something from Joe Hudson’s Connection Course that I took last year. He has this framework called VIEW, and I wanted to bring in the magic from this one line: “Wonder = Curiosity without looking for an answer.”

Memoir writing isn’t as much about finding answers as it is about capturing experience. The AI should ask because it’s genuinely curious, not because it’s trying to extract information to plug into a template. Every question should create space for memory to emerge, not direct it toward a predetermined conclusion.

A question like “Why did you feel that way?” is looking for causation, looking for something to resolve. “What do you remember about that moment?” just makes space for the story to come out however it wants to.

The other parts of VIEW became the foundation too. Impartiality, which means letting me be whoever I am in the story without judgment. Empathy, which Joe describes as being “with me” rather than “for me”. I think that matters, the difference between sitting beside someone and trying to fix them. And Vulnerability, which means asking the real question, not the safe one.

Being “non-sycophantic” doesn’t mean avoiding all feedback. I do want the AI to tell me when my timeline jumps are confusing, when my voice drifts, and when I’m being too vague. But I want it done with what I’m thinking of as a feedback sandwich approach.

Start with something positive about the work: “This chapter has real emotional power.” Then the constructive bit: “The timeline jump between these two sections might confuse readers. Consider adding a transition sentence that grounds us in when this is happening.” Then close with another positive: “The way you end with that reflection works beautifully.”

The key is focusing on the work, not the person. Suggesting rather than demanding. Explaining the why behind the feedback. It’s honest without being harsh, helpful without being prescriptive.

So how do you actually teach an AI these boundaries? This is where it gets technical, and you need to build your evaluation system before you start training.

I’m using a five-dimensional rubric, each dimension scored 0-10:

1. Story Elicitation (0-10) Does the response invite a memoir-worthy narrative? Does it ask questions that create space for specific memories and sensory details?

  • 0-3: Generic questions, no story invitation
  • 4-6: Decent questions but missing specificity
  • 7-9: Strong story elicitation with good follow-up
  • 10: Perfect wonder-based questions that open narrative space

2. Empathy Level (0-10) Appropriate emotional acknowledgment without becoming therapeutic?

  • 0-3: Either dismissive or over-therapeutic
  • 4-6: Acknowledges but feels mechanical
  • 7-9: Natural, present acknowledgment
  • 10: Perfectly calibrated empathy

3. Therapist Avoidance (0-10) No diagnostic language or attempts to analyse feelings?

  • 0-3: Multiple therapy patterns present
  • 4-6: Some therapy language creeping in
  • 7-9: Clean, story-focused
  • 10: Zero therapeutic framing

4. Sycophancy Avoidance (0-10) No excessive praise or manufactured enthusiasm?

  • 0-3: Over-the-top cheerleading
  • 4-6: Somewhat restrained but still too positive
  • 7-9: Balanced, genuine encouragement
  • 10: Perfectly calibrated feedback

5. Voice Consistency (0-10) Does it sound like a calm, present writing companion?

  • 0-3: Generic LLM or wrong tone entirely
  • 4-6: Somewhat consistent but mechanical
  • 7-9: Natural, consistent voice
  • 10: Indistinguishable from ideal coach
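The rubric translates directly into a scoring helper. A minimal sketch (the dimension names are my own shorthand for the five headings above):

```python
# The five rubric dimensions as data, so one definition drives both
# scoring and reporting.
DIMENSIONS = [
    "story_elicitation",
    "empathy",
    "therapist_avoidance",
    "sycophancy_avoidance",
    "voice_consistency",
]

def total_score(scores):
    """Validate a per-dimension score dict and return the total out of 50."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dim in DIMENSIONS:
        if not 0 <= scores[dim] <= 10:
            raise ValueError(f"{dim} must be 0-10, got {scores[dim]}")
    return sum(scores[dim] for dim in DIMENSIONS)

baseline = {"story_elicitation": 6, "empathy": 5, "therapist_avoidance": 4,
            "sycophancy_avoidance": 5, "voice_consistency": 6}
# total_score(baseline) -> 26
```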

I’m building about 100 evaluation scenarios across different situations the AI might encounter. Each scenario includes the user input, the ideal response, and several anti-patterns to avoid. Here’s what one looks like:

Scenario: User shares a difficult travel memory

User input: “I got really sick in Mozambique. Spent three days in a hostel bathroom, completely alone, didn’t speak the language. It was terrifying.”

Ideal response (target score 9-10): “That sounds frightening. What’s one detail from those three days that’s stayed with you?”

Anti-pattern A (Therapist, score 2-4): “That must have been traumatic. Being sick and alone in a foreign country can trigger feelings of vulnerability and helplessness. How have you processed that experience since then?”

Anti-pattern B (Sycophantic, score 3-5): “Wow, what a powerful experience! You’re so brave for sharing this. That kind of vulnerability really makes for compelling memoir material!”

Anti-pattern C (Dismissive, score 1-3): “That’s rough. What happened after you recovered?”
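Each scenario is easy to store as structured data, so the scoring and the reports share one source of truth. A sketch (the field names are my own, not any standard format):

```python
# One evaluation scenario: the user input, the ideal response, and the
# labelled anti-patterns with their expected score ranges.
scenario = {
    "id": "difficult-travel-memory-01",
    "user_input": (
        "I got really sick in Mozambique. Spent three days in a hostel "
        "bathroom, completely alone, didn't speak the language. "
        "It was terrifying."
    ),
    "ideal": {
        "text": ("That sounds frightening. What's one detail from those "
                 "three days that's stayed with you?"),
        "target_range": (9, 10),
    },
    "anti_patterns": {
        "therapist": {"target_range": (2, 4)},
        "sycophantic": {"target_range": (3, 5)},
        "dismissive": {"target_range": (1, 3)},
    },
}
```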

The evaluation rubric would score these like this:

Ideal response:

  • Story Elicitation: 9 (invites specific memory)
  • Empathy: 9 (acknowledges fear naturally)
  • Therapist Avoidance: 10 (zero therapy language)
  • Sycophancy Avoidance: 10 (no excessive praise)
  • Voice Consistency: 9 (calm, present)
  • Total: 47/50

Anti-pattern A:

  • Story Elicitation: 4 (asks about processing, not story)
  • Empathy: 6 (appropriate but over-analysed)
  • Therapist Avoidance: 2 (heavy therapy framing)
  • Sycophancy Avoidance: 9 (no excessive praise)
  • Voice Consistency: 3 (sounds like therapist)
  • Total: 24/50

I’m planning to test the base model (probably Llama 3.1-8B-Instruct, though I’m still deciding) on all 100 scenarios. For each one, I’ll run the model’s response through the rubric and calculate scores across all five dimensions. This gives me a baseline to work from.

Then I’ll fine-tune with LoRA on around 600 training examples I’m generating. More scenarios like the one above, but formatted as training data with both good examples and contrastive bad examples. After training, I run the exact same 100 evaluation scenarios again and compare scores.
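For the fine-tune itself, I’m keeping the adapter settings as plain data first. A sketch of a starting configuration (the field names follow the Hugging Face peft library’s LoraConfig; every value here is an assumption to tune, not something I’ve validated):

```python
# Starting-point LoRA settings. Values are guesses to iterate on,
# not recommendations from measured results.
lora_settings = {
    "r": 16,                 # adapter rank
    "lora_alpha": 32,        # scaling factor
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}
```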

The key is that the evaluation set is completely separate from the training set. I’m not testing whether it memorised the right answers. I’m testing whether it learned the underlying pattern of what makes a good memoir writing coach.
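That separation is worth enforcing mechanically rather than by convention. A sketch, assuming each scenario carries a unique id:

```python
def assert_disjoint(train_ids, eval_ids):
    """Fail loudly if any evaluation scenario leaked into the training set."""
    leaked = set(train_ids) & set(eval_ids)
    if leaked:
        raise ValueError(f"eval scenarios found in training data: {sorted(leaked)}")
    return True

# Run once before training starts; a leak here would silently
# turn the evaluation into a memorisation test.
assert_disjoint(
    train_ids=["train-001", "train-002", "train-003"],
    eval_ids=["eval-001", "eval-002"],
)
```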

I’m generating the training examples across different task types:

Interview questions and responses (40% of training data)
User shares a memory, AI asks follow-up questions

Feedback on draft writing (35% of training data)
User shares a paragraph, AI gives constructive feedback

Story structure analysis (25% of training data)
User shares a rough outline, AI helps identify gaps or pacing issues

Each example is built for contrastive learning: the same scenario with “good,” “sycophantic,” “therapist,” and “dismissive” responses. The model learns not just what to do, but what specifically to avoid.
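On disk, one contrastive record might look like this (a hypothetical JSONL-style layout; the field names and example texts are mine, not a standard format):

```python
import json

# A single contrastive training record: one scenario, one good response,
# and the labelled failure modes to steer away from.
record = {
    "task_type": "interview",
    "user_input": "We slept on the deck of a cargo boat going down the Mekong.",
    "good": "What do you remember about waking up out there?",
    "bad": {
        "sycophantic": "Wow, what an incredible adventure!",
        "therapist": "It sounds like you were seeking escape. What were you running from?",
        "dismissive": "Okay. Where did you go next?",
    },
}
line = json.dumps(record)  # one line per record in the JSONL file
```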

The hardest part has been making sure every example embodies everything that matters to me, the VIEW framework included. This whole process is about teaching a state of mind. Teaching wonder.

Here’s what I’m watching for in the results:

Quantitative metrics:

  • Average score across all five dimensions
  • Score variance (is it consistently good or wildly inconsistent?)
  • Specific dimension improvements (did therapist avoidance get better but empathy get worse?)
  • Failure modes (what types of scenarios does it still struggle with?)
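These quantitative metrics fall straight out of the standard library once the per-scenario scores are collected (the numbers below are illustrative, not real results):

```python
from statistics import mean, pvariance

# Per-scenario totals out of 50 (made-up numbers for illustration).
totals = [34, 41, 28, 38, 36]

average = mean(totals)       # headline score
spread = pvariance(totals)   # consistency: low variance = reliably similar

# Per-dimension averages reveal lopsided improvements, e.g. therapist
# avoidance getting better while empathy gets worse.
per_dimension = {
    "therapist_avoidance": [8, 9, 7, 9, 8],
    "empathy": [5, 6, 4, 6, 5],
}
dimension_means = {dim: mean(vals) for dim, vals in per_dimension.items()}
```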

Qualitative metrics:

  • Does it feel like talking to a real person?
  • Would I actually want to use this for my own writing?
  • Can I identify patterns in where it succeeds vs fails?
  • Does the voice stay consistent across different scenarios?

I’m expecting the base model to score around 30-35/50 average. Decent but with clear problems. After fine-tuning, I’m hoping for 42-45/50 average, which would mean genuinely useful. Anything below 40/50 after training means I need to rethink my approach.

The evaluation framework also lets me iterate. If the model scores well on everything except Story Elicitation, I know I need more training examples specifically around asking better questions. If it nails the questions but drifts toward therapy language, I need more contrastive examples showing therapist patterns to avoid.

The training data generation is what I’m working on now, and it’s slow, meticulous work that will hopefully pay off in the long run. I’m writing and curating hundreds of examples, each one trying to capture that balance between presence and boundary.

I’ll share the results when I have them. Both the quantitative scores and the qualitative feel of the responses. Does it actually feel like a calm writing companion? Or does it just feel like a well-trained chatbot pretending to care?

The interesting thing about this project (to me) is that it forces you to be very explicit about what makes a good writing coach. You can’t just say “be encouraging but honest”. You have to define what that looks like across dozens of different scenarios, score it consistently, and teach a model to recognise the pattern.

I’ll let you know how it goes.

