Emotional Arc Systems: AI Storytelling Research Dossier 2026

The field is shifting from affect recognition to affect orchestration.

The interesting frontier is not asking a model whether a scene is happy or sad. It is asking whether the scene changes the audience's prediction, alters their allegiance, deepens a character contradiction, and earns the next emotional beat.

01

Emotion is temporal. A single scene label misses the useful signal: setup, escalation, reversal, delay, payoff, and aftertaste.

Tian et al. 2024

02

LLMs are fluent but flatten stakes. Recent narrative benchmarks find human stories are more suspenseful, arousing, and structurally diverse, while LLM stories tend to be overly positive and low-tension.

EMNLP 2024

03

Explicit arcs help. When discourse features such as story arcs, turning points, valence, and arousal are integrated, generated narratives improve substantially on diversity/suspense/arousal-style measures.

Narrative discourse

The 2026 research move: treat a story like a controllable affective dynamical system, not a long piece of text.

world

What is objectively true in the story world: events, resources, constraints, irreversible losses.

character

Goals, beliefs, wounds, secrets, social obligations, self-deception, and appraisal of events.

audience

What the audience knows, suspects, fears, hopes, mispredicts, and morally endorses.

affect

Valence, arousal, suspense, curiosity, dread, empathy, agency, intimacy, relief, shame, awe.

arc

The path through those states, with turning points that make the next state feel both surprising and inevitable.
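
A minimal sketch of how the five components above might be held as explicit state, assuming Python dataclasses. Every class and field name here (`StoryState`, `AffectVector`, `Arc`, the 0.3 threshold) is an illustrative assumption, not taken from any cited system.

```python
from dataclasses import dataclass, field


@dataclass
class AffectVector:
    # Assumed scales: valence in -1..1, the rest in 0..1.
    valence: float = 0.0
    arousal: float = 0.0
    suspense: float = 0.0
    curiosity: float = 0.0


@dataclass
class StoryState:
    world: dict[str, str] = field(default_factory=dict)      # objective facts
    character: dict[str, dict[str, str]] = field(default_factory=dict)  # per-character goals and beliefs
    audience: dict[str, str] = field(default_factory=dict)   # what the audience knows or suspects
    affect: AffectVector = field(default_factory=AffectVector)


@dataclass
class Arc:
    states: list[StoryState] = field(default_factory=list)

    def suspense_curve(self) -> list[float]:
        return [s.affect.suspense for s in self.states]

    def turning_points(self, threshold: float = 0.3) -> list[int]:
        """Indices where suspense rises or falls by at least `threshold`."""
        curve = self.suspense_curve()
        return [i for i in range(1, len(curve))
                if abs(curve[i] - curve[i - 1]) >= threshold]
```

Keeping the arc as a list of full states, rather than a list of labels, is what lets a later tool ask where the trajectory turns instead of only what each scene "is."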

A fuller map: emotional arcs sit at the intersection of eight active research fronts.

The frontier is not one paper family. It is an emerging stack that combines reader psychology, controllable generation, narrative planning, interactive agents, reward modeling, diversity metrics, multimodal affect, and mechanistic interpretability.

01 / Theory

What makes people feel a story?

Psychology

Transportation, empathy, suspense, curiosity, surprise, appraisal, forecast updates.

Key use

Defines what the system should optimize besides “positive emotion.”

02 / Representation

How do we encode an emotional arc?

State

Audience belief, character appraisal, world constraint, affect vector, unresolved question.

Key use

Turns vague story taste into editable beat cards and trajectory data.

03 / Generation

How can LLMs write toward that arc?

Methods

Long-form planning, memory, reasoning traces, multi-agent writers, hierarchical outlines.

Risk

Fluent prose can still flatten conflict, stakes, and negative valleys.

04 / Control

How do creators steer the arc?

Interfaces

TaleBrush, Elsewise, visual story graphs, storylets, Monte Carlo Tree Search (MCTS) branch exploration.

Key use

Creators edit tension, knowledge gaps, reversals, and payoffs instead of prompting blindly.

05 / Evaluation

Did the arc actually work?

Metrics

Psychological depth, story rewards, LLM judges, human panels, novelty/echo checks.

Hard part

Models often reward smoothness; humans reward earned feeling and surprise.

06 / Product

How does it become an entertainment system?

Loops

Drop-off, replay, self-report, comments, gameplay state, lab affect signals.

Guardrail

Optimize satisfaction and meaning, not compulsive arousal or manipulation.

Concrete reading

The field is becoming a pipeline: theory tells us what matters, representation makes it editable, generation fills scenes, evaluation calibrates taste, product data closes the loop.

Research gap

The missing primitive is a shared benchmark for target emotional trajectories: give a model an arc, then test whether humans actually move along it.

1. Arc-control interfaces

TaleBrush, Elsewise, WhatELSE, and visual story-writing tools show that writers want to manipulate trajectories, not just prompts. This is the UI layer for affective storytelling.

2. Planning/memory for long form

Long stories need hierarchy, durable character state, and causal memory. Research is moving from paragraph generation toward planning, outlining, and reasoning traces.

3. Agentic interactive narrative

Storylets, drama managers, character agents, and MCTS exploration let systems branch while preserving authorial constraints and target emotional pacing.

4. Story-specific rewards

Generic RLHF is not enough. StoryReward and theory-informed RL point toward rewards grounded in reader preference, narrative structure, and psychological impact.

5. Anti-homogenization

Plot diversity metrics are becoming core because high-fluency LLMs repeatedly produce the same satisfying-but-stale emotional moves.

6. Mechanistic affect

Interpretability work suggests emotion concepts may be causal internal features. That opens a path to latent steering of tone, empathy, or harmful emotional strategies.

Most mature

Scene/beat annotation, valence-arousal plotting, prompt-based arc planning, LLM-assisted critique, and writer-facing visualization.

Fastest moving

Story rewards, long-form planning, multi-agent writers, interactive narrative, and multimodal LLM (MLLM) affect interpretation.

Least solved

Human taste, originality, earned catharsis, cultural specificity, and whether model judges can detect manipulative emotional design.

Product frontier

Arc-aware copilots, short-drama retention systems, game drama managers, companion pacing, and recommendation by mood transition.

The useful literature is a braid: psychology, narratology, planning AI, LLMs, games.

Emotion classification is only one small branch. Emotional arc systems need older theories of story experience, symbolic narrative control, modern LLM generation, and product feedback loops.

Narrative Affect

Emotional arcs as trajectories, not tags

Reagan et al. computationally studied thousands of stories and popularized six broad valence-shape families. The method is simple — lexical sentiment over time — but the conceptual jump is important: story emotion can be normalized, plotted, compared, and designed.

Reagan et al., EPJ Data Science 2016
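
The "lexical sentiment over time" idea can be sketched in a few lines. This is a toy illustration only: `TOY_LEXICON`, the window sizes, and the sample text are assumptions made up for the example, and the published analysis relies on a much larger sentiment lexicon and real book-length texts.

```python
def sentiment_arc(text: str, lexicon: dict[str, float],
                  window: int = 4, step: int = 2) -> list[float]:
    """Mean lexicon score over a sliding window of words."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    arc = []
    for start in range(0, max(len(words) - window, 0) + 1, step):
        chunk = words[start:start + window]
        scores = [lexicon.get(w, 0.0) for w in chunk]  # unknown words count as neutral
        arc.append(sum(scores) / len(chunk) if chunk else 0.0)
    return arc


# Hypothetical six-word lexicon, for illustration only.
TOY_LEXICON = {"joy": 1.0, "love": 1.0, "hope": 0.5,
               "loss": -1.0, "fear": -1.0, "grief": -1.0}

story = "joy love hope hope fear loss grief grief hope joy love joy"
arc = sentiment_arc(story, TOY_LEXICON, window=4, step=4)
# arc is [0.75, -1.0, 0.875]: a fall followed by a rise
```

Once a story is reduced to a list like this, it can be resampled to a fixed length, plotted, and compared against other stories or against a target curve.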

Reader Psychology

Narrative transportation

Green & Brock frame story immersion as attention, imagery, and emotion being absorbed into the story world. For AI products, this means “engagement” is not just retention; it is cognitive-emotional relocation.

Green & Brock, 2000

Suspense

Reduce escape routes

Xie & Riedl's suspense-generation work explicitly uses cognitive psychology and narratology: suspense increases as the number and plausibility of protagonist escape routes decrease. This is a control variable, not a prose style.

Creating Suspenseful Stories, 2024
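
To make "escape routes as a control variable" concrete, here is a rough proxy, not the formulation from the cited paper. It assumes each route has a plausibility in [0, 1], treats routes as independent, and scores suspense as the chance that every route fails. The function name and the route labels are hypothetical.

```python
def suspense_proxy(escape_routes: dict[str, float]) -> float:
    """Probability that every escape route fails.

    `escape_routes` maps a route description to an assumed plausibility
    in [0, 1]. Routes are treated as independent, which is a simplification.
    """
    p_all_fail = 1.0
    for plausibility in escape_routes.values():
        if not 0.0 <= plausibility <= 1.0:
            raise ValueError("plausibility must be in [0, 1]")
        p_all_fail *= 1.0 - plausibility
    return p_all_fail


before = {"call for help": 0.5, "back door": 0.5}
after = {"back door": 0.2}  # the phone is dead and the door is now watched
assert suspense_proxy(after) > suspense_proxy(before)
```

A planner can use a score like this as a target: remove a route or lower its plausibility until the beat reaches the desired level, without touching prose style at all.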

Psychological Depth

Measure reader impact

The Psychological Depth Scale shifts evaluation from surface coherence to authenticity, empathy, engagement, narrative complexity, and emotional provocation — closer to what people actually mean by “this story moved me.”

Harel-Canada et al., 2024

Old AI, New Relevance

Planning beats next-token prose

TALE-SPIN, narrative planning, and drama management were never obsolete; LLMs made their missing language layer cheap. The old problem — believable character goals balanced against plot shape — is exactly the emotional arc problem.

Riedl & Young, Narrative Planning

LLM Narrative Benchmarks

The best recent result is diagnostic: LLMs lack tension unless you force the discourse layer.

Tian et al. benchmarked LLM and human stories through story arcs, turning points, and affective dimensions such as arousal and valence. Their finding is the north star for this topic: vanilla LLM stories are often structurally homogeneous and emotionally too positive; explicit discourse features improve storytelling.

Are LLMs Capable of Generating Human-Level Narratives?

Diversity Failure

The echo problem

Microsoft's “Echoes in AI” work quantifies repeated plot elements across LLM outputs with a Sui Generis score. This matters because emotional arcs need freshness: surprise collapses when the model keeps rediscovering the same “twist.”

Xu et al., PNAS/arXiv 2025
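
A crude way to see the echo problem in practice is to compare plot elements across several generations for the same prompt. The sketch below is not the Sui Generis score from the cited work; it is a simple pairwise Jaccard overlap, and it assumes the plot elements have already been extracted into sets by some upstream step.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_echo(stories: list[set[str]]) -> float:
    """Average pairwise overlap of plot elements; higher means more echo."""
    pairs = list(combinations(stories, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical extracted plot elements from three generations.
samples = [
    {"hidden letter", "estranged sibling", "storm"},
    {"hidden letter", "estranged sibling", "lighthouse"},
    {"heist", "double agent", "storm"},
]
echo = mean_echo(samples)
```

A rising `mean_echo` across a batch is a cheap warning that the generator keeps rediscovering the same moves, even if each story reads well alone.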

Reward Models

Story preference is not generic helpfulness

StoryAlign / StoryReward argues that current reward models poorly capture human story preferences and can favor LLM-generated over human-written stories. A story reward model needs narrative preference data, not just chat preference data.

StoryReward, ICLR 2026

Theory-Driven RL

Reward the narrative theory, not only the prose

“Retell, Reward, Repeat” uses Todorov's narrative equilibrium ideas to guide reinforcement learning for more diverse, narrative-convention-aligned stories. This is an early template for theory-informed post-training.

Liu et al., 2026

Creator Interfaces

Sketch the arc, then let the model fill it

TaleBrush is an early but important HCI signal: writers can guide generation by drawing story trajectories. It points toward affect-curve editors rather than prompt boxes.

TaleBrush, 2022

Possibility-Space UI

Elsewise makes possibility spaces visible

Elsewise is closer to the actual product thesis: show authors the narrative possibility space so they can compare player-experienced variants against the intended canonical arc.

Elsewise, 2026 / WhatELSE, CHI 2025

Long-Form Generation

The next jump is memory, hierarchy, and reasoning traces.

Recent long-form story generation work pushes beyond one-shot prompting: dynamic hierarchical outlining, memory enhancement, reasoning for long-form story generation, and multi-step collaboration. Emotional arcs require the same machinery because a payoff only works if the earlier wound, clue, promise, or humiliation is remembered.

Dynamic outlining + memory / Learning to reason

Interactive Search

Explore possible arcs, don't accept the first one

Narrative Studio-style systems combine visual exploration with search such as Monte Carlo Tree Search. This matters because emotional design is option search: find the branch with the best pressure, not the first coherent branch.

Narrative Studio, 2025

Storylets

Authorable responsiveness

LLM storylet frameworks such as Dramamancer/Drama Llama point toward responsive stories where authors specify constraints and arcs while the system adapts local events.

Drama Llama, 2025

Character Agents

Emergence needs a drama manager

BookWorld and agent-society approaches turn novels into interactive character environments. The emotional arc question becomes: how much autonomy can characters have before story shape collapses?

BookWorld, 2025

The clearest way to think about this is as three coupled tracks.

A strong emotional arc model tracks audience state, character state, and plot constraints separately. The affect often comes from their misalignment.

Multi-track arc, not one sentiment curve

audience / character / plot

The generator's job is not to maximize one curve. It must coordinate track differences: the audience may feel dread while the character feels hope; the plot may tighten while dialogue feels calm.
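
One way to make track misalignment measurable, assuming per-beat dread scores already exist for the audience and for the character (the numbers below are invented):

```python
def track_gap(audience: list[float], character: list[float]) -> list[float]:
    """Per-beat difference: audience dread minus character dread."""
    if len(audience) != len(character):
        raise ValueError("tracks must cover the same beats")
    return [a - c for a, c in zip(audience, character)]


def peak_irony_beat(audience: list[float], character: list[float]) -> int:
    """Index of the beat where the audience is furthest ahead of the character."""
    gaps = track_gap(audience, character)
    return max(range(len(gaps)), key=lambda i: gaps[i])


audience_dread = [0.1, 0.4, 0.9, 0.6]
character_dread = [0.1, 0.2, 0.1, 0.6]
beat = peak_irony_beat(audience_dread, character_dread)  # beat index 2
```

The gap, not either curve alone, is the quantity a planner would steer when it wants dramatic irony to peak at a chosen beat.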

System stack

from theory to product

Intent	genre promise, target audience journey, ethical boundaries	human-led
Affect state	valence, arousal, suspense, empathy, agency, intimacy, relief	explicit
Appraisal ledger	character goals, blame, control, secrets, self-deception	causal
Planner	beats, turns, constraints, possible branches, search/rerank	hybrid
LLM prose	scene drafting, voice, implication, subtext, dialogue	generative
Evaluator	arc fit, depth, novelty, safety, reader response	calibrated
Telemetry	completion, drop-off, replay, self-report, physiological/lab data	product

What is frontier?

maturity matrix

Research front	Known	2026 push	Bottleneck	Product use
Valence/arousal arcs	mature proxy	multi-track arcs	audience ground truth	script diagnostics
LLM long-form writing	fluent scenes	planning + memory	global causality	draft generation
Interactive narrative	drama managers	LLM agents + constraints	authorial control	games/AI shows
Story evaluation	LLM judges help	story-specific rewards	taste calibration	rewrite loops
Originality/diversity	echo metrics	anti-template search	novelty vs coherence	content slate design
Mechanistic affect	early evidence	latent steering	interpretability safety	tone/empathy control

Why “just ask the LLM to write a moving story” underperforms.

Modern models can produce emotionally fluent paragraphs. The failure mode is not bad sentences; it is bad affective causality.

01

RLHF positivity pressure

Helpful chat behavior tends to be polite, safe, agreeable, and conflict-minimizing. Good stories often need delayed gratification, morally uncomfortable choices, betrayal, ambiguity, irreversible loss, and negative valleys.

02

Local coherence beats global rhythm

Next-token generation is excellent at making the current paragraph plausible. It does not inherently preserve a target affect curve 30 scenes later unless the control layer externalizes the curve.

03

Models imitate trope averages

LLMs converge toward high-probability plot grammar. That creates readable but expected arcs: sudden redemption, tidy confession, convenient rescue, sentimental closure.

04

Evaluators are weak at taste

LLM-as-judge can help at system-level story evaluation, but it still struggles with explanations and can share the same taste distortions as the generator. Reward models trained on generic preferences can reward AI-ish smoothness.

The practical conclusion: emotional arc generation needs explicit state, not because humans write from spreadsheets, but because machines otherwise lose the difference between beautiful prose and earned feeling.

The central design problem is a three-way tension: authorial shape, player agency, and generative freedom.

A good choose-your-own-adventure (CYOA) or roleplay system should not have to choose between a railroaded plot and an incoherent sandbox. It needs a runtime bargain: the creator locks the dramatic spine, the player controls values and tactics, and the LLM improvises local texture inside auditable boundaries.

Authorial spine

What cannot break

Genre promise, premise, canon facts, irreversible losses, turning points, end-state themes, safety boundaries, and the emotional question of the session.

Narrative planning calls this the plot/character balance problem: plans must satisfy author goals while preserving character believability.

Player agency

What must be truly theirs

Choice of motive, social stance, risk appetite, relationship investment, problem-solving route, sacrifice, and interpretation of what the story means.

Meaningful choice research asks whether players can foresee enough consequences to make intentional decisions.

Generative surface

What may improvise

Dialogue, sensory detail, NPC phrasing, optional complications, soft callbacks, scene variants, and pacing moves that do not mutate protected state.

Dramamancer frames LLMs as transformers of author schemas into player-driven playthroughs, not as unconstrained authors.

Design thesis

Do not let the LLM decide reality. Let it propose emotionally useful possibilities. A drama manager/critic chooses among them against state, rules, target arc, player intent, and author constraints.

Improv radius

How far the model can invent beyond canon: flavor only, local obstacle, new NPC, side quest, or branch-level consequence.

Convergence pressure

How strongly branches are pulled toward planned bottlenecks: low for sandbox, high for episodic drama and CYOA production budgets.

Emotional debt

How much prior fear, guilt, tenderness, betrayal, or humiliation must be paid off before the system can change tone.

Agency budget

How many irreversible, remembered player-authored changes the episode can afford without exploding production/evaluation cost.
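
The four dials above could be carried as a small config object that the drama manager consults before each turn. This is a sketch under assumptions: the level names, default values, and the 0.2 debt threshold are invented for illustration.

```python
from dataclasses import dataclass, field

# Ordered from least to most invention, following the improv radius list above.
IMPROV_LEVELS = ["flavor", "local_obstacle", "new_npc",
                 "side_quest", "branch_consequence"]


@dataclass
class Dials:
    improv_radius: str = "local_obstacle"  # highest invention level allowed
    convergence_pressure: float = 0.7      # 0 = sandbox, 1 = strict bottlenecks
    agency_budget: int = 2                 # irreversible player-authored changes left
    emotional_debt: dict[str, float] = field(default_factory=dict)

    def allows(self, invention: str) -> bool:
        return IMPROV_LEVELS.index(invention) <= IMPROV_LEVELS.index(self.improv_radius)

    def can_shift_tone(self, max_open_debt: float = 0.2) -> bool:
        """Tone may change only once every tracked debt is mostly paid off."""
        return all(v <= max_open_debt for v in self.emotional_debt.values())

    def spend_agency(self) -> bool:
        """Consume one irreversible change; refuse when the budget is empty."""
        if self.agency_budget <= 0:
            return False
        self.agency_budget -= 1
        return True
```

Exposing these as explicit values, rather than burying them in a prompt, is what would let a creator tool show why a proposed scene was allowed or refused.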

Layer 1

Rules engine

Owns dice, inventory, HP, cooldowns, clocks, unlocks, permissions, and hard contradictions. This layer is deterministic.

no prose

Layer 2

Canonical state ledger

Stores world facts, branch history, NPC beliefs, relationship debt, player values inferred from choices, and unresolved emotional questions.

auditable

Layer 3

Drama manager

Chooses the next beat type: escalate, reveal, delay, reverse, reward, punish, repair, cooldown, or converge. Classic drama management already treated interactive story as search over experience quality.

survey

Layer 4

LLM proposer

Generates 3-8 candidate affordances/scenes conditioned on state, target affect delta, allowed improv radius, and forbidden mutations.

creative

Layer 5

Shadow critic

Rejects railroading, bland false choices, rule drift, safety violations, tonal whiplash, and branches that erase emotional residue.

protective

Layer 6

Renderer + interface

Shows the creator a graph, state diffs, arc fit, branch explosion, and why each recommendation preserves or spends agency.

explainable
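
A minimal sketch of how Layers 3 to 5 might interact, assuming the LLM proposer has already returned candidates and an upstream evaluator has attached an `arc_fit` score. `Candidate`, `PROTECTED`, `shadow_critic`, and `choose` are hypothetical names, and the critic here checks only two of the failure modes listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    text: str
    beat_type: str           # e.g. "escalate", "reveal", "cooldown"
    mutates: frozenset       # state keys this scene would change
    arc_fit: float           # assumed score from an upstream evaluator


# State owned by the deterministic rules engine and canonical ledger.
PROTECTED = frozenset({"hp", "inventory", "canon_facts"})


def shadow_critic(c: Candidate, wanted_beat: str) -> list[str]:
    """Return reasons to reject; an empty list means the candidate passes."""
    reasons = []
    if c.mutates & PROTECTED:
        reasons.append("rule drift: touches protected state")
    if c.beat_type != wanted_beat:
        reasons.append("wrong beat type for the drama manager's request")
    return reasons


def choose(candidates: list[Candidate], wanted_beat: str) -> Optional[Candidate]:
    """Pick the best-fitting candidate that survives the critic, if any."""
    passing = [c for c in candidates if not shadow_critic(c, wanted_beat)]
    return max(passing, key=lambda c: c.arc_fit, default=None)


proposals = [
    Candidate("The ghost hands you the key.", "reveal", frozenset({"inventory"}), 0.9),
    Candidate("The mirror fogs over with a name.", "reveal", frozenset({"npc_beliefs"}), 0.7),
    Candidate("A quiet walk home.", "cooldown", frozenset(), 0.8),
]
picked = choose(proposals, "reveal")
```

The highest-scoring proposal loses here because it would change inventory, which only the rules engine may do. The model proposes; it does not decide reality.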

Pattern 1

Spine + pearls

Keep the sequence of major dramatic beads fixed, but let players choose how they approach each bead and what emotional cost they carry into it.

Pattern 2

Branch, bottleneck, residue

Branches can reconverge, but must keep different scars: allies, secrets, guilt, injuries, rumors, or NPC appraisal. Same event; different accusation.

Pattern 3

Storylets over chapters

Author small narrative units with preconditions/effects. The model selects and renders them based on state instead of inventing an unbounded plot.

Pattern 4

Affordance grammar

Offer verbs tied to values: confess, bargain, threaten, comfort, investigate, flee, sacrifice, flirt, deceive. This makes choices emotionally interpretable.

Pattern 5

Consequence ladder

Every action can change surface text, local resource, NPC belief, world clock, future route, or ending. Reserve high-tier changes for peak agency beats.

Pattern 6

Cooldown rights

For roleplay, the player must be able to step down intensity: humor, ritual, quiet travel, intimacy repair, fade-to-black, or out-of-character boundary checks.
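
Pattern 3 and Pattern 4 combine naturally: a storylet carries preconditions, effects, and the value-laden verbs it offers. The sketch below assumes a flat dictionary of boolean state flags; the storylet names and flags are invented, and a real system would let the LLM render the selected unit into prose.

```python
from dataclasses import dataclass, field


@dataclass
class Storylet:
    name: str
    requires: dict[str, bool]                              # preconditions on state flags
    effects: dict[str, bool] = field(default_factory=dict)
    affordances: list[str] = field(default_factory=list)   # verbs offered to the player


def available(storylets: list[Storylet], state: dict[str, bool]) -> list[Storylet]:
    """Storylets whose preconditions all hold; missing flags count as False."""
    return [s for s in storylets
            if all(state.get(k, False) == v for k, v in s.requires.items())]


def apply(storylet: Storylet, state: dict[str, bool]) -> dict[str, bool]:
    """Return a new state; the input dict is left untouched."""
    return {**state, **storylet.effects}


library = [
    Storylet("attic_confession", {"found_locket": True},
             {"ghost_trusts_player": True}, ["confess", "comfort"]),
    Storylet("cellar_chase", {"alarm_raised": True},
             {"injured": True}, ["flee", "threaten"]),
]
state = {"found_locket": True}
options = available(library, state)
new_state = apply(options[0], state)
```

Because selection is a plain lookup over authored units, the plot stays bounded while the rendering of each unit can still vary.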

LLM branching

GENEVA generates branching and reconverging narrative graphs from designer constraints, which is exactly the graph-shaped authoring primitive CYOA needs.

GENEVA

Emergence

Player-driven emergence work shows LLM NPCs can create fun unscripted nodes, but those nodes need selection, summarization, and reintegration into canon.

Player-driven emergence

Experience management

New AIIDE work on strong-story experience management, state-space visualization, and adversarial managers suggests the frontier is tooling for controllable state search.

State-space visualization

The useful system mental model: do not maximize freedom; preserve meaningful freedom inside a dramatic corridor.

The report now treats agency/coherence as an operating system problem. These visual models show where products fail and what a creator tool should make visible before generation happens.

Control Regimes

authorial control x player agency

High control / low agency

Railroad

The story stays coherent because the player is mostly decorating a fixed route. Emotional arc works once; replay collapses.

High control / high agency

Living spine

The creator protects bottlenecks and themes while the player authors motive, cost, route, relationship, and residue.

Low control / low agency

Chat drift

The model keeps talking, but choices do not compound. It feels free moment to moment and empty in retrospect.

Low control / high agency

Chaos sandbox

The player can do anything; the system cannot make it add up. Great anecdotes, weak dramatic memory.

Vertical: protected shape

Horizontal: consequential player authorship

Emotional Corridor

pressure can rise only if trust survives

Residue Ledger

same bottleneck, different emotional truth

Player choice	World fact	NPC appraisal	Player emotion	Future callback
Comfort ghost: The locket remains sealed.	Mirror reveals truth slowly.	Ghost thinks you chose care over speed.	Tenderness plus dread.	Ghost shields you, but withholds one clue.
Steal locket: The locket opens early.	Truth is gained, sanctuary is broken.	Ghost thinks you treat grief as a tool.	Power plus shame.	Mirror accuses you with the stolen voice.

Branch Economics

compress paths into state, not amnesia

1 scene	Start	One premise, one emotional contract, one protected truth.
4 paths	Diverge	Choices test values: mercy, truth, power, loyalty.
16 variants	Explode	Dialogue and tactics multiply faster than humans can author.
5 state diffs	Compress	Track scars, debts, beliefs, resources, secrets.
1 bottleneck	Rejoin	Same mirror scene, different accusation and ally behavior.
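
The compress-then-rejoin step can be sketched with the ghost and locket example from the residue ledger above: each path is reduced to a small residue record, and the shared mirror scene reads from that record instead of from the full transcript. The `Residue` class, the ledger keys, and the scene template are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Residue:
    world_fact: str
    npc_appraisal: str
    player_emotion: str
    callback: str


# Entries copied from the residue ledger example in the text.
LEDGER = {
    "comfort_ghost": Residue(
        "Mirror reveals truth slowly.",
        "Ghost thinks you chose care over speed.",
        "Tenderness plus dread.",
        "Ghost shields you, but withholds one clue.",
    ),
    "steal_locket": Residue(
        "Truth is gained, sanctuary is broken.",
        "Ghost thinks you treat grief as a tool.",
        "Power plus shame.",
        "Mirror accuses you with the stolen voice.",
    ),
}


def mirror_scene(choice: str) -> str:
    """Same bottleneck for every path; the payoff line differs by residue."""
    residue = LEDGER[choice]
    return f"You reach the mirror. {residue.callback}"
```

Both paths arrive at the same scene, so authoring cost stays flat, but `mirror_scene("comfort_ghost")` and `mirror_scene("steal_locket")` pay off different emotional histories.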

Insight 01

False choice is worse than no choice.

If the system offers three options but normalizes them into the same emotional outcome, players learn not to care.

Insight 02

Agency is partly retrospective.

Players judge agency when a later scene remembers them. A callback can make an old choice feel larger than it was.

Insight 03

Coherence comes from forbidden moves.

The strongest systems are explicit about what the model may not invent: canon facts, rule state, consent boundaries, and earned payoffs.

Insight 04

Emotion is accounting.

Fear, guilt, trust, attraction, and resentment are not decorations; they are liabilities and assets that need later settlement.

Insight 05

NPCs should misread the player.

Perfectly accurate NPC memory is less dramatic than appraisal memory: what they think you meant, not only what you did.

Insight 06

Railroading often hides in pacing.

If the system escalates before acknowledging player intent, even a technically branching plot feels coercive.

Insight 07

Cool-down is a feature, not filler.

Quiet beats restore trust and make the next spike tolerable. Especially important for romance, shame, horror, and grief roleplay.

Insight 08

The model should explain its pressure move.

Creators need to see whether a suggestion spends agency, raises dread, repairs trust, compresses branches, or pays emotional debt.

Operating law

The product should optimize for “I could have acted otherwise, and the story remembers why I did not.” That is the emotional version of agency.

How an LLM can write emotional arcs in 2026.

The strongest architecture is a hybrid writer's room: narratology supplies control variables, LLMs supply semantic imagination, search supplies alternatives, reward/evaluation supplies pressure, and audience data supplies calibration.

Choose an affect grammar before plot.

Define the target emotional contract: revenge catharsis, dread-to-release, cozy safety, tragic recognition, erotic uncertainty, heroic awe. Then choose dimensions: valence, arousal, suspense, curiosity, empathy, agency, intimacy, moral outrage, relief.

Separate character emotion from audience emotion.

A character may feel safe while the audience feels dread because the audience knows the monster is behind the door. Most simple emotion pipelines collapse this distinction; suspense depends on preserving it.

Build an appraisal ledger.

For each character: goal, threat, blame, shame, obligation, perceived control, likely action. Appraisal models matter because they explain why a character emotion changes instead of merely naming the emotion.

Plan beats as state transitions, not synopsis.

Each beat should specify: before-state, event, new information, irreversible consequence, affect delta, audience question, and the next pressure. This keeps the LLM from writing pleasant filler.

Use iterative planning for tension.

For suspense, generate possible escape routes, then adversarially reduce their plausibility. For romance, generate intimacy opportunities, then introduce value conflict. For comedy, build expectation, then violate it safely.

Draft scene prose only after the arc skeleton is stable.

Generate multiple scene variants conditioned on the target delta: “raise dread without revealing the threat,” “increase empathy while lowering trust,” “convert shame into resolve.”

Evaluate with plural judges.

Use a specialized arc evaluator, a psychological depth rubric, a novelty/diversity checker, a safety/manipulation checker, and sampled human readers. Do not let the same model be sole writer and judge.

Search, rerank, and rewrite.

Use beam search, tree search, storylets, or multi-agent writer/editor loops to explore beat alternatives. Select for target-arc fit, causal coherence, novelty, and reader effect, not just fluency.
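
The final step, selecting on more than fluency, can be sketched as a weighted rerank. The weights, score names, and candidate scores below are assumptions for illustration; in practice each score would come from a separate evaluator or human panel.

```python
# Assumed weights; fluency is deliberately the smallest term.
WEIGHTS = {"arc_fit": 0.4, "coherence": 0.3, "novelty": 0.2, "fluency": 0.1}


def rerank(candidates: list[dict], weights: dict[str, float] = WEIGHTS) -> list[dict]:
    """Sort candidate beats by weighted score, best first."""
    def score(c: dict) -> float:
        return sum(w * c["scores"][k] for k, w in weights.items())
    return sorted(candidates, key=score, reverse=True)


candidates = [
    {"id": "tidy_confession",
     "scores": {"arc_fit": 0.3, "coherence": 0.9, "novelty": 0.2, "fluency": 0.95}},
    {"id": "rational_mistake",
     "scores": {"arc_fit": 0.9, "coherence": 0.8, "novelty": 0.7, "fluency": 0.7}},
]

best = rerank(candidates)[0]["id"]                          # "rational_mistake"
fluent = rerank(candidates, {"fluency": 1.0})[0]["id"]      # "tidy_confession"
```

The same two candidates swap places depending on what is rewarded, which is the practical meaning of "not just fluency."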

Key inversion

Most people prompt “write a sad scene.” Better: “write a scene that makes the audience move from confidence to dread while the protagonist misreads the situation as relief.”

Product translation

The emotional arc becomes a controllable interface: creators can drag the tension curve, lock a character's shame arc, or ask the system to explain why episode 4 feels flat.

CYOA, roleplay, and tabletop RPGs (TTRPGs) need emotional state machines, not just branching prose.

Linear stories optimize one intended sequence. CYOA and roleplay optimize a possibility space: many paths should feel agentic, emotionally legible, mechanically valid, and still converge toward satisfying dramatic pressure.

CYOA graph tools

Branching/reconverging narrative generation is now a concrete research object, not just a design wish. The practical question is how to make every reconvergence preserve emotional residue.

Converging narratives

Agentic GM

Solo roleplay studies comparing static prompt GMs with agentic ReAct-style GMs report gains in modularity, immersion, and curiosity, but still require explicit rules/state ownership.

Static vs agentic GM

Roleplay memory

RoleLLM, CharacterEval, MMRole, and Emotional RAG point to a simple lesson: roleplay quality depends on character memory, speaking style, multimodal/persona cues, and emotional retrieval.

Emotional RAG

TTRPG tools

A 2025 scoping review of computational TTRPG tools shows the domain is not only text generation: it includes encounter support, music, maps, logs, accessibility, and GM workload reduction.

TTRPG tools review

Player psychology

Recent TTRPG mental-health work treats roleplay as social rehearsal, identity play, emotion regulation, and support, which means affect systems need consent and cooldown design.

TTRPG intervention review

Creative ethos

TTRPG hobbyist research on generative AI matters for product design: many players care who authored the magic. Interfaces should expose provenance and keep creators in control.

Generative AI + TTRPG hobbyists

01 / Premise

Define the emotional contract

Before branches, define what the play session promises: dread, betrayal, cozy recovery, temptation, heroic agency, social embarrassment, romance, moral injury.

02 / Choice

Offer values, not menus

Good choices reveal player priority: safety vs loyalty, power vs mercy, truth vs belonging. Avoid choices that are just “left door / right door.”

03 / Consequence

Track affect deltas

Each choice changes trust, threat, guilt, hope, faction memory, NPC appraisal, and the player's perceived agency.

04 / Rejoin

Converge without cheating

Branches can rejoin, but the emotional residue must differ. Same boss fight; different ally, wound, secret, or shame.

05 / Payoff

Make earlier emotion matter

The climax should cash out prior player identity: cowardice, mercy, greed, tenderness, suspicion, curiosity, or sacrifice.

Hack 1

Branch on moral emotion

Represent branches as changes in guilt, trust, anger, debt, longing, fear, and pride. Plot state matters, but emotional accounting makes choices memorable.

Hack 2

Use bottleneck diamonds

Let players diverge, then reconverge at fixed dramatic bottlenecks. Preserve agency by carrying path-specific scars, allies, rumors, and NPC attitudes.

Hack 3

Separate canon from variant

Keep a canonical spine, then generate variants around it. This matches WhatELSE/Elsewise-style authoring: creators see the possibility space, not just one transcript.

Hack 4

Never let the LLM own rules

For TTRPG/DnD, the model can narrate, improvise, and appraise emotion; deterministic code should own dice, inventory, HP, constraints, and irreversible state.

Hack 5

NPCs need appraisal memory

Store what each NPC believes the player did, why they think it happened, and what emotion it caused. “Trust -2” is weaker than “she thinks you abandoned her.”

Hack 6

Run a shadow DM critic

A second model should audit railroading, rule drift, bland stakes, safety, and whether each choice changed the player's emotional situation.

Hack 7

Use convergence prompts

When two branches must rejoin, ask for a scene compatible with both histories while preserving emotional residue. This is exactly the point of converging-narrative work.

Hack 8

Design emotional cooldowns

Roleplay intensity needs pacing: after shame, fear, or conflict, offer repair, humor, ritual, loot, intimacy, or quiet travel. Otherwise the experience becomes exhausting.
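
Hack 5 is easy to state and easy to skip, so here is a small sketch of appraisal memory. It keeps the numeric trust delta but stores it alongside what the NPC believes happened and why. The class names, the NPC, and the event are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Appraisal:
    event: str              # what the NPC believes happened
    attributed_motive: str  # why the NPC thinks the player did it
    emotion: str
    trust_delta: int


@dataclass
class NpcMemory:
    name: str
    appraisals: list[Appraisal] = field(default_factory=list)

    def trust(self) -> int:
        return sum(a.trust_delta for a in self.appraisals)

    def prompt_context(self) -> str:
        """Render beliefs as sentences a narrator model can condition on."""
        return "\n".join(
            f"{self.name} believes {a.event} because {a.attributed_motive}, "
            f"and feels {a.emotion}."
            for a in self.appraisals
        )


mara = NpcMemory("Mara")
mara.appraisals.append(Appraisal(
    event="you abandoned her at the bridge",
    attributed_motive="you valued the relic more than her",
    emotion="betrayed",
    trust_delta=-2,
))
```

The rules layer can still read `mara.trust()`, while the narrator receives `mara.prompt_context()`, which carries the misreading that makes a later confrontation specific.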

Suggested runtime state for CYOA / roleplay

Player vector	agency, curiosity, dread, guilt, attachment, frustration, mastery, trust in system.
Character bond	NPC appraisal, debt, affection, fear, respect, suspicion, jealousy, loyalty.
World truth	facts, clocks, resources, wounds, factions, secrets, unlocked routes, forbidden contradictions.
Branch memory	choice rationale, emotional cost, public consequence, private consequence, future callback.
DM policy	genre promise, safety boundaries, target arc, difficulty, allowed improvisation radius.

Choice quality rubric

Agency	Does the player understand enough to make an intentional decision?
Emotion	Does each option imply a different feeling, not merely a different location?
Cost	Is there a price, tradeoff, risk, or relationship consequence?
Memory	Will the system remember and reference the choice later?
Convergence	If branches rejoin, does the emotional residue remain path-specific?
Safety	Can the player modulate intensity, consent, and boundaries?
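
Parts of this rubric can be checked mechanically before any prose is written. The sketch below covers only the Emotion, Cost, and Memory rows, and it assumes each proposed option arrives as a dictionary with `action`, `expected_emotion`, `cost`, and `callback_seed` keys (field names chosen for this example).

```python
def audit_choices(options: list[dict]) -> list[str]:
    """Return rubric warnings for a set of proposed player options."""
    warnings = []
    emotions = {o["expected_emotion"] for o in options}
    if len(options) > 1 and len(emotions) == 1:
        warnings.append("false choice: every option implies the same emotion")
    for o in options:
        if not o.get("cost"):
            warnings.append(f"no cost: {o['action']}")
        if not o.get("callback_seed"):
            warnings.append(f"not remembered: {o['action']}")
    return warnings


doors = [
    {"action": "left door", "expected_emotion": "curiosity",
     "cost": "", "callback_seed": ""},
    {"action": "right door", "expected_emotion": "curiosity",
     "cost": "", "callback_seed": ""},
]
values = [
    {"action": "confess to the ghost", "expected_emotion": "shame",
     "cost": "lose the ally's respect", "callback_seed": "ghost repeats your words"},
    {"action": "steal the locket", "expected_emotion": "power",
     "cost": "sanctuary is broken", "callback_seed": "mirror uses the stolen voice"},
]
```

`audit_choices(doors)` flags the "left door / right door" pattern on all three counts, while `audit_choices(values)` returns no warnings. Agency, Convergence, and Safety still need human or model judgment.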

CYOA Branch Generator

prompt pattern

Given the current scene, propose 3 player choices.
For each choice return:
- surface action
- hidden value tested: loyalty / curiosity / power / mercy / self-preservation / truth
- expected player emotion
- NPC appraisal changes
- world-state changes
- future callback seed
- convergence strategy: unique branch / soft rejoin / hard bottleneck
- risk of railroading or false choice
Do not write prose yet. Design the choice.

DM / Roleplay Turn

runtime prompt

You are the narrator, not the rules engine.
Input: player action, dice result, canonical world state, NPC appraisal ledger, target emotional beat.
Output:
1. vivid narration grounded in the dice result
2. one emotional consequence
3. one world-state consequence
4. one NPC belief update
5. 2-3 possible next affordances
Never alter inventory, HP, DC, or facts unless provided by the rules engine.

Prompting should externalize the hidden affect machine.

Test-time prompting is useful when it acts like a control protocol, not when it merely asks for “more emotion.”

Arc Contract

planning prompt

Define the emotional contract for this story before writing plot.
Return:
- target audience affect at beginning / midpoint / ending
- primary affect dimensions: valence, arousal, suspense, empathy, curiosity, relief
- forbidden shortcuts: coincidence, sudden confession, unearned rescue, melodrama
- central wound and desire for each major character
- audience knowledge gap: what the audience knows that characters do not
- 8-12 beat arc with affect deltas, not prose

Beat Card

scene control

For beat N, produce a scene plan with:
1. Before-state: world, character belief, audience belief
2. Event: what changes externally
3. Appraisal: why each character emotionally changes
4. Audience delta: what emotion should rise/fall
5. Suspense variable: what possible escape route narrows
6. New question created
7. Irreversible consequence
8. Exact affect vector: valence, arousal, tension, intimacy, agency
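
If the model's answer to this prompt is parsed into a structure, incomplete beat cards can be caught before drafting. A minimal sketch, assuming the eight fields map onto a dataclass and the affect vector is a dictionary keyed by the five dimensions named in item 8; the class, method, and sample content are illustrative.

```python
from dataclasses import dataclass

AFFECT_KEYS = ("valence", "arousal", "tension", "intimacy", "agency")


@dataclass
class BeatCard:
    before_state: str
    event: str
    appraisal: str
    audience_delta: str
    suspense_variable: str
    new_question: str
    irreversible_consequence: str
    affect: dict[str, float]

    def problems(self) -> list[str]:
        """List empty text fields and missing affect dimensions."""
        issues = [f"empty field: {name}" for name, value in vars(self).items()
                  if isinstance(value, str) and not value.strip()]
        missing = [k for k in AFFECT_KEYS if k not in self.affect]
        if missing:
            issues.append("affect vector missing: " + ", ".join(missing))
        return issues


card = BeatCard(
    before_state="She trusts the ferryman.",
    event="The ferry turns upriver.",
    appraisal="She reads the turn as a shortcut.",
    audience_delta="dread rises",
    suspense_variable="the shore route is now out of reach",
    new_question="Who paid the ferryman?",
    irreversible_consequence="",
    affect={"valence": -0.4, "arousal": 0.7, "tension": 0.8, "intimacy": 0.1},
)
```

Here `card.problems()` reports the blank irreversible consequence and the missing `agency` value, which are exactly the gaps that tend to produce pleasant filler.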

Evaluator

anti-slop rubric

Evaluate the scene as an emotional arc component, not as prose.
Score 1-7 with evidence:
- target affect delta achieved
- character appraisal is psychologically plausible
- audience knowledge gap is clear
- causal consequence is irreversible
- tension is earned rather than asserted
- novelty: does this avoid common LLM plot echoes?
- aftertaste: what emotional residue remains for the next beat?
Recommend one rewrite operation only.
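The "one rewrite operation only" rule can be enforced outside the model by picking the single weakest criterion. This is a sketch under stated assumptions: the criterion keys paraphrase the rubric above, and the `REWRITE_OPS` mapping is hypothetical, not taken from any cited paper.

```python
RUBRIC = (
    "target_affect_delta",
    "appraisal_plausibility",
    "knowledge_gap_clarity",
    "irreversible_consequence",
    "earned_tension",
    "novelty",
    "aftertaste",
)

# Hypothetical mapping from the weakest criterion to one rewrite operation.
REWRITE_OPS = {
    "target_affect_delta": "re-plan the affect delta before touching prose",
    "appraisal_plausibility": "restate each character's goal and belief, then re-derive the reaction",
    "knowledge_gap_clarity": "make explicit what the audience knows that the characters do not",
    "irreversible_consequence": "add a cost that cannot be undone in the next beat",
    "earned_tension": "narrow one escape route through a character's own choice",
    "novelty": "replace the most predictable plot move with a less common one",
    "aftertaste": "end on an unresolved feeling instead of a summary",
}


def review(scores: dict) -> dict:
    """Validate 1-7 scores and return the mean plus a single rewrite operation."""
    for criterion in RUBRIC:
        if criterion not in scores:
            raise ValueError(f"missing score: {criterion}")
        if not 1 <= scores[criterion] <= 7:
            raise ValueError(f"score out of range for {criterion}")
    # min() keeps the first criterion in rubric order when scores tie.
    weakest = min(RUBRIC, key=lambda c: scores[c])
    mean_score = round(sum(scores[c] for c in RUBRIC) / len(RUBRIC), 2)
    return {"mean": mean_score, "weakest": weakest, "rewrite": REWRITE_OPS[weakest]}


scores = {
    "target_affect_delta": 5,
    "appraisal_plausibility": 6,
    "knowledge_gap_clarity": 4,
    "irreversible_consequence": 2,
    "earned_tension": 3,
    "novelty": 5,
    "aftertaste": 4,
}
print(review(scores))
```

Restricting the output to one operation keeps revisions attributable: if the next score moves, it is clear which change moved it.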

Rewrite Operatorcontrolled revision

Rewrite the beat mechanics first, then the prose.
Operation: increase suspense by reducing perceived escape routes while preserving character agency.
Constraints:
- no new villain reveal
- no coincidence
- protagonist makes a rational choice that worsens the situation
- audience realizes the danger two paragraphs before the protagonist does
- end with arousal high, valence negative, curiosity unresolved

EmotionPrompt-style emotional stimuli can change LLM behavior, but for story work the bigger win is structural prompting: make the model reason about stakes, appraisal, information asymmetry, and affect deltas. Emotional prompting says “this matters.” Arc prompting says “here is the machine that makes it matter.”

EmotionPrompt

Measurement should ask whether the arc changed the reader, not whether the text contained emotion words.

The best evaluation stack combines theory-based metrics, model-assisted annotation, human panels, and product telemetry.

Layer	What to measure	Why it matters
Text / script	Valence-arousal curves, turning points, sentiment volatility, discourse role, emotion-cause pairs.	Fast proxy for arc shape; useful for drafts but not sufficient for audience response.
Narrative mechanics	Goal conflict, causal chain, reversals, narrowing options, information gaps, irreversible consequences.	Separates earned emotion from decorative emotional language.
Reader psychology	Transportation, empathy, suspense, curiosity, psychological depth, surprise, aftertaste.	Closer to the entertainment experience than classification accuracy.
Model evaluation	LLM-as-judge correlations, PDS-style (Psychological Depth Scale) scoring, novelty checks, Sui Generis-like echo detection, reward model ranking.	Cheap iteration; must be calibrated because judges can prefer AI smoothness.
Product telemetry	Completion, rewatches, skips, scene drop-off, comments, shares, binge continuation, save/replay, explicit mood response.	Turns arc design into product learning while guarding against raw arousal optimization.
Lab / multimodal	Facial action, voice, gaze, heart rate, GSR, EEG/fNIRS, continuous affect sliders.	Useful for high-value studies, trailers, games, and embodied entertainment; never a mind-reading oracle.
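The text/script layer in the table is the easiest to prototype. As a rough sketch, assume each scene already has a valence score in [-1, 1] from some upstream annotator. The functions `turning_points` and `volatility`, and the `min_swing` threshold, are illustrative assumptions rather than metrics defined in the cited work.

```python
from statistics import mean


def turning_points(curve, min_swing=0.1):
    """Indices where the curve reverses direction with at least min_swing on both sides."""
    points = []
    for i in range(1, len(curve) - 1):
        left = curve[i] - curve[i - 1]
        right = curve[i + 1] - curve[i]
        if left * right < 0 and min(abs(left), abs(right)) >= min_swing:
            points.append(i)
    return points


def volatility(curve):
    """Mean absolute scene-to-scene change; 0.0 for curves shorter than two scenes."""
    if len(curve) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(curve, curve[1:]))


# Hypothetical per-scene valence scores for a six-scene draft.
valence = [0.2, 0.5, 0.1, -0.4, 0.3, 0.6]
print(turning_points(valence))
print(volatility(valence))
```

These numbers describe the shape of the text only. As the table notes, they are a fast proxy for drafts, not evidence of audience response.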

Product opportunity: emotional arcs become an interface.

The strongest products will not say “AI writes stories.” They will let creators and systems sculpt audience experience with visible affect controls.

Writer's Room Copilot

Arc debugger

Upload a script, get scene-level valence/arousal/tension, character appraisal, missing reversals, flat middle, tonal whiplash, and rewrite operators.

Short Drama / Series

Episode retention by emotional question

Model the unresolved question at each episode boundary: revenge, romantic uncertainty, threat, secret, humiliation, justice. Optimize cliffhangers without collapsing into cheap shock.

Games

Drama manager + affect loop

Games already have an affective loop: sense player state, infer experience, adapt mechanics/content. LLMs add flexible narrative surface; drama managers preserve authorial control.

AI Companions

Relational pacing

Track trust, intimacy, boundary, repair, playfulness, and user vulnerability over time. Explicit modeling is required here for safety, not just quality.

Recommendation

Recommend the next mood transition

Instead of “similar title,” recommend “after dread, give relief,” “after sadness, give agency,” or “for this user, avoid stacking humiliation beats.”

Music / Trailer / Ads

Micro-arcs

Trailers, hooks, choruses, ads, and TikToks operate as compressed affective journeys. The same machinery applies at 20 seconds or 20 episodes.

An ideal creator interface is an emotional control room for possibility spaces.

For CYOA, roleplay, and TTRPGs, creators need to see branches, state, emotional intensity, NPC appraisals, and convergence pressure at the same time. The best UI is not a chat box; it is a story graph plus an affect debugger plus a DM copilot.

Arrival
curiosity 0.62

Choice A
comfort the ghost

Choice B
steal the locket

Bottleneck
the mirror speaks

Payoff
truth or belonging

System recommendation

Raise dread without removing agency

Reveal that the exit still exists, but every route requires betraying a different NPC promise. Keep the player choosing; make each choice emotionally expensive.

Affect target: arousal +0.28, valence -0.18, agency stable.
NPC update: ghost interprets hesitation as abandonment.
Convergence: all paths reach the mirror; each path changes what the mirror accuses them of.
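The affect target in this recommendation reads naturally as a delta applied to a tracked state and then checked against what players actually report. The sketch below assumes a [-1, 1] scale and a tolerance of 0.05; both values, and the function names, are hypothetical.

```python
def apply_target(state: dict, target: dict, lo: float = -1.0, hi: float = 1.0) -> dict:
    """Return the planned state after adding each target delta, clamped to [lo, hi]."""
    planned = dict(state)
    for dim, change in target.items():
        planned[dim] = max(lo, min(hi, planned.get(dim, 0.0) + change))
    return planned


def target_met(before: dict, after: dict, target: dict, tolerance: float = 0.05) -> bool:
    """True when every observed change is within tolerance of its target delta."""
    return all(abs((after[dim] - before[dim]) - change) <= tolerance
               for dim, change in target.items())


before = {"arousal": 0.40, "valence": 0.10, "agency": 0.70}
# "agency stable" is encoded as a zero delta.
target = {"arousal": 0.28, "valence": -0.18, "agency": 0.0}

planned = apply_target(before, target)
# Hypothetical player ratings after the beat: dread landed, but agency collapsed.
rated = {"arousal": 0.65, "valence": -0.05, "agency": 0.45}

print(planned)
print(target_met(before, rated, target))
```

Encoding "stable" as an explicit zero delta matters: it turns an unmeasured side effect, such as lost agency, into a failed check.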

Possibility-space map

Shows branch explosion, bottlenecks, dead branches, emotional coverage, and which player choices actually matter.
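Bottlenecks and dead branches can be computed directly from the story graph. The sketch below reuses the nodes from the example above and adds one hypothetical dead branch (`locked_cellar`) for illustration. It assumes an acyclic graph stored as an adjacency dictionary.

```python
def all_paths(graph, start, endings):
    """Enumerate every path from start to an ending; assumes the graph is acyclic."""
    stack = [[start]]
    paths = []
    while stack:
        path = stack.pop()
        node = path[-1]
        if node in endings:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            stack.append(path + [nxt])
    return paths


def bottlenecks(graph, start, endings):
    """Nodes that every complete path passes through, excluding start and endings."""
    paths = all_paths(graph, start, endings)
    if not paths:
        return set()
    shared = set(paths[0])
    for path in paths[1:]:
        shared &= set(path)
    return shared - {start} - set(endings)


def dead_branches(graph, endings):
    """Non-ending nodes with no outgoing edges."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    return {n for n in nodes if n not in endings and not graph.get(n)}


graph = {
    "arrival": ["comfort_ghost", "steal_locket"],
    "comfort_ghost": ["mirror_speaks"],
    "steal_locket": ["mirror_speaks", "locked_cellar"],  # locked_cellar is hypothetical
    "mirror_speaks": ["payoff"],
    "locked_cellar": [],
}
endings = {"payoff"}

print(bottlenecks(graph, "arrival", endings))
print(dead_branches(graph, endings))
```

Path enumeration grows quickly with branching, so a real tool would likely switch to dominator analysis or sampling for large graphs.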

Affect mixer

Lets creators drag tension, intimacy, dread, comedy, agency, and relief targets per beat or per branch.

NPC appraisal ledger

Tracks what each character believes, feels, wants, fears, owes, and misunderstands after each player action.
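One plausible shape for such a ledger is an append-only history per NPC, keyed by the player action that caused each update. The class and field names below are assumptions that follow the six verbs in the description; they are not an existing API.

```python
from dataclasses import dataclass, field


@dataclass
class Appraisal:
    believes: str = ""
    feels: str = ""
    wants: str = ""
    fears: str = ""
    owes: str = ""
    misunderstands: str = ""


@dataclass
class AppraisalLedger:
    # npc name -> list of (player_action, Appraisal), oldest first
    history: dict = field(default_factory=dict)

    def record(self, npc: str, player_action: str, appraisal: Appraisal) -> None:
        self.history.setdefault(npc, []).append((player_action, appraisal))

    def current(self, npc: str):
        entries = self.history.get(npc)
        return entries[-1][1] if entries else None

    def changed_fields(self, npc: str) -> list:
        """Field names that differ between the NPC's last two appraisals."""
        entries = self.history.get(npc, [])
        if len(entries) < 2:
            return []
        previous, latest = entries[-2][1], entries[-1][1]
        return [name for name in vars(latest)
                if getattr(previous, name) != getattr(latest, name)]


ledger = AppraisalLedger()
ledger.record("ghost", "enter the attic",
              Appraisal(believes="the player came to help", feels="hope",
                        fears="being left again"))
ledger.record("ghost", "hesitate at the door",
              Appraisal(believes="the player is abandoning me", feels="grief",
                        fears="being left again"))
print(ledger.changed_fields("ghost"))
```

Keeping the full history, rather than only the latest state, is what allows later callbacks such as "you hesitated at the door" to be grounded in a recorded action.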

DM safety + consent panel

Intensity caps, lines/veils, emotional cooldown suggestions, and “fade to black” controls for roleplay-heavy scenarios.
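These controls can sit as a gate between the beat planner and the narrator. The sketch below is illustrative only: the 0 to 1 intensity scale, the default cap, the cooldown rule, and all names are assumptions, and a real deployment would need table-specific configuration.

```python
from dataclasses import dataclass, field


@dataclass
class SafetySettings:
    intensity_cap: float = 0.8                 # assumed 0..1 intensity scale
    lines: set = field(default_factory=set)    # topics that never appear
    veils: set = field(default_factory=set)    # topics handled off-screen
    cooldown_after: int = 3                    # capped beats in a row before a cooldown hint


def gate_beat(settings, topics, intensity, recent_intensities):
    """Decide how a planned beat may be played under the table's settings."""
    topics = set(topics)
    if topics & settings.lines:
        return {"action": "block", "intensity": 0.0, "suggest_cooldown": False}
    action = "fade_to_black" if topics & settings.veils else "play"
    capped = min(intensity, settings.intensity_cap)
    streak = list(recent_intensities)[-settings.cooldown_after:]
    suggest = (len(streak) == settings.cooldown_after
               and all(level >= settings.intensity_cap for level in streak))
    return {"action": action, "intensity": capped, "suggest_cooldown": suggest}


settings = SafetySettings(lines={"harm to children"}, veils={"torture"})
decision = gate_beat(settings, ["torture", "betrayal"], 0.95, [0.8, 0.9, 0.85])
print(decision)
```

The gate returns a decision instead of rewriting content itself, so the human DM or the drama manager keeps the final call.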

The fascinating intersections are not obvious “emotion AI.”

A

Mechanistic emotion concepts

Anthropic's 2026 interpretability work suggests emotion concepts can be internal, causal features affecting model behavior, preferences, and misaligned behaviors. For story systems, this raises a new question: can we control affect through latent steering rather than only through prompts?

B

Multi-agent character societies

Character agents can simulate conflicting goals and generate emergent plot pressure, but still need authorial constraint. The interesting architecture is not autonomous agents alone; it is agents plus a drama manager plus target emotional arc.

C

Narrative theory as reward

Instead of generic “good story” labels, 2026 work points toward reward functions grounded in narrative theories: equilibrium/disruption/repair, turning points, suspense, psychological depth, character agency.

D

Anti-homogenization metrics

If everyone uses the same models, entertainment risks converging on the same emotional grammar. Plot diversity metrics and surprise measures may become as important as coherence metrics.

E

Ethics of affect optimization

Emotional design is the essence of entertainment; affect exploitation is the risk. Future systems need guardrails against maximizing compulsive arousal, loneliness bonding, humiliation loops, or outrage retention.

Where the frontier still feels genuinely open.

The most interesting work is not another emotion classifier. It is a set of missing primitives for affective narrative systems: benchmarks, state representations, controllable decoders, reader simulators, and safety-aware product loops.

Breakthrough 1

ArcBench: evaluate trajectories, not outputs

A benchmark where models receive a target multi-track arc and must produce scenes whose human-rated audience affect matches the curve. This would expose the difference between “emotional prose” and controlled emotional movement.
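A scoring function for such a benchmark could compare target and human-rated curves track by track. The sketch below is one possible design, not a published metric: it reports mean absolute error for level and Pearson correlation for shape, and the names `arc_match` and `pearson` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def arc_match(target: dict, rated: dict) -> dict:
    """Per-track level error (mae) and shape agreement (Pearson r)."""
    report = {}
    for track, goal in target.items():
        seen = rated[track]
        if len(seen) != len(goal):
            raise ValueError(f"length mismatch on track {track}")
        mae = sum(abs(g - s) for g, s in zip(goal, seen)) / len(goal)
        report[track] = {"mae": round(mae, 3), "shape": round(pearson(goal, seen), 3)}
    return report


# Hypothetical five-scene target and averaged human ratings.
target = {"suspense": [0.2, 0.4, 0.7, 0.9, 0.3]}
rated = {"suspense": [0.3, 0.4, 0.5, 0.6, 0.5]}
print(arc_match(target, rated))
```

Reporting level and shape separately distinguishes a story that follows the right curve too timidly from one that moves in the wrong direction.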

Breakthrough 2

An affective story compiler

A small DSL for story states: goals, appraisal, audience knowledge, tension, possible escape routes, and required affect deltas. The LLM becomes the renderer; the compiler guards structure.
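To make the compiler idea concrete, here is a toy version in plain Python rather than a real DSL. Everything in it is an assumption: the `BeatSpec` fields follow the list above, and the rule that escape routes may never widen is one illustrative structural check, not a claim about how stories must work.

```python
from dataclasses import dataclass


@dataclass
class BeatSpec:
    goal: str
    appraisal: str
    audience_knows: str
    escape_routes: int
    tension: float
    affect_delta: dict


def compile_story(beats):
    """Validate structure first, then emit one rendering instruction per beat."""
    errors = []
    for i, beat in enumerate(beats, start=1):
        if not beat.affect_delta:
            errors.append(f"beat {i}: no required affect delta")
        # Illustrative rule: suspense beats should not reopen escape routes.
        if i > 1 and beat.escape_routes > beats[i - 2].escape_routes:
            errors.append(f"beat {i}: escape routes widen")
    if errors:
        raise ValueError("; ".join(errors))
    return [
        f"Beat {i}: goal={b.goal}; appraisal={b.appraisal}; "
        f"audience knows={b.audience_knows}; routes left={b.escape_routes}; "
        f"shift={b.affect_delta}"
        for i, b in enumerate(beats, start=1)
    ]


beats = [
    BeatSpec("find the locket", "hopeful", "the ghost is watching", 3, 0.3,
             {"curiosity": 0.2}),
    BeatSpec("leave the house", "cornered", "the door is already sealed", 1, 0.7,
             {"arousal": 0.28, "valence": -0.18}),
]
instructions = compile_story(beats)
for line in instructions:
    print(line)
```

The instructions are what a renderer model would receive; a structural error stops the pipeline before any prose is generated.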

Breakthrough 3

Reader-state simulation

Instead of predicting “emotion,” predict the reader's changing beliefs: what they expect, fear, hope, and misunderstand. Surprise, dread, and catharsis are forecast-update phenomena.
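One way to make "forecast-update" operational is to model the reader as holding probabilities over hypotheses and to measure how far an event moves them. The sketch below uses a Bayes update and KL divergence as the size of the belief shift. This is a modeling assumption chosen for illustration, not a method taken from the cited work, and the hypotheses and numbers are invented.

```python
from math import log2


def update(prior: dict, likelihood: dict) -> dict:
    """Bayes update over a shared set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("event is impossible under every hypothesis")
    return {h: value / total for h, value in unnormalized.items()}


def belief_shift(prior: dict, posterior: dict) -> float:
    """KL divergence D(posterior || prior) in bits; 0.0 means no update."""
    return sum(p * log2(p / prior[h]) for h, p in posterior.items() if p > 0)


# What the reader believes about a character before the scene.
prior = {"ally": 0.7, "traitor": 0.2, "unaware": 0.1}
# How likely the event "she hides the letter" is under each hypothesis.
likelihood = {"ally": 0.1, "traitor": 0.8, "unaware": 0.3}

posterior = update(prior, likelihood)
print(posterior)
print(belief_shift(prior, posterior))
```

Under this framing, a twist is a large shift toward a hypothesis the reader already held with low probability, which is one reading of "surprising and inevitable."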

Breakthrough 4

Appraisal memory for characters

Persistent character emotion should be derived from goals, beliefs, shame, blame, social debt, and perceived control. This would make long-form emotional change less arbitrary.

Breakthrough 5

Anti-template decoding

Generation should penalize common LLM emotional shortcuts: sudden confession, tidy redemption, convenient rescue, moral sermon, sentimental closure. Diversity metrics become creative control.
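True decoding-time penalties require access to model internals, so the sketch below substitutes a simpler technique: reranking finished candidates. It assumes an upstream detector has already tagged each candidate with the shortcuts it uses. The penalty weights and the tag names are hypothetical.

```python
# Hypothetical penalty weights for the shortcuts named above.
SHORTCUT_PENALTY = {
    "sudden_confession": 1.0,
    "tidy_redemption": 1.0,
    "convenient_rescue": 1.0,
    "moral_sermon": 0.5,
    "sentimental_closure": 0.5,
}


def rerank(candidates):
    """Sort candidate beats by base score minus shortcut penalties, best first.

    candidates: list of (text, base_score, shortcut_tags) tuples.
    """
    scored = []
    for text, base_score, tags in candidates:
        penalty = sum(SHORTCUT_PENALTY.get(tag, 0.0) for tag in tags)
        scored.append((base_score - penalty, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


candidates = [
    ("She confesses everything at once.", 3.2, ["sudden_confession"]),
    ("She burns the letter and says nothing.", 2.9, []),
    ("A stranger arrives and saves them.", 3.0, ["convenient_rescue", "sentimental_closure"]),
]
print(rerank(candidates))
```

The candidate with the highest raw score loses here because its appeal depends on a shortcut, which is the behavior the penalty is meant to produce.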

Breakthrough 6

Latent affect steering with safeguards

If internal emotion concepts are causal, future systems may steer empathy, dread, intimacy, or aggression below the prompt layer. That is powerful for art and dangerous for manipulation.

The product moat is not “we use an LLM.” It is owning the data and theory layer that maps content changes to audience-state changes.

Selected research corpus.

Primary and high-signal sources used to build this dossier. Links were title-checked after the earlier incorrect-reference issue; source labels now avoid unsupported paper names.

Emotional arcs	Reagan et al. The emotional arcs of stories are dominated by six basic shapes
LLM narrative benchmark	Tian et al. Are Large Language Models Capable of Generating Human-Level Narratives?
Suspense generation	Xie & Riedl. Creating Suspenseful Stories: Iterative Planning with LLMs
Narrative theory survey	Liu, Joshi & Dawson. Narrative Theory-Driven LLM Methods
Theory-informed RL	Liu et al. Retell, Reward, Repeat
Story reward	Xia et al. StoryAlign / StoryReward
Plot diversity	Xu et al. Echoes in AI: Quantifying Lack of Plot Diversity
Psychological depth	Harel-Canada et al. Measuring Psychological Depth in Language Models
Story evaluation	Chhun et al. Do Language Models Enjoy Their Own Stories?
Creative writing	Ismayilzada et al. Evaluating Creative Short Story Generation in Humans and LLMs
Narrative transportation	Green & Brock. The role of transportation in public narratives
Narrative planning	Riedl & Young. Narrative Planning: Balancing Plot and Character
Drama management	Roberts & Isbell. Drama management survey
Foundation affect	Schuller et al. Affective computing has changed
Emotion prompting	Li et al. Large Language Models Understand and Can Be Enhanced by Emotional Stimuli
Mechanistic interpretability	Anthropic. Emotion concepts and their function in a large language model
Games	Yannakakis & Melhart. Affective Game Computing: A Survey
Arc UI	TaleBrush: Sketching Stories with Generative Pretrained Language Models
Possibility-space UI	Elsewise: Authoring AI-Based Interactive Narrative with Possibility Space Visualization
Long-form memory	Dynamic Hierarchical Outlining with Memory-Enhancement
Long-form reasoning	Learning to Reason for Long-Form Story Generation
Visual exploration	Narrative Studio: Visual Narrative Exploration using LLMs and MCTS
Storylets	Drama Llama / Dramamancer: LLM-Powered Storylets
Agent societies	BookWorld: From Novels to Interactive Agent Societies
Interactive narrative	Riedl & Bulitko. Interactive Narrative: An Intelligent Systems Approach
Meaningful choice	Foreseeing Meaningful Choices
Experience management	Adversarial Strong Story Experience Management
State visualization	State Space Visualization for Strong Story Experience Management Design
LLM branching	GENEVA: Generating and Visualizing Branching Narratives Using LLMs
Converging narratives	Generating Converging Narratives for Games with Large Language Models
Emergent game narrative	Player-Driven Emergence in LLM-Driven Game Narrative
Dramamancer design	Design Techniques for LLM-Powered Interactive Storytelling
AI game master	Static Vs. Agentic Game Master AI for Solo Role-Playing
Dungeon master LLM	Exploring the Potential of ChatGPT as a Dungeon Master in Dungeons & Dragons
Roleplay agents	RoleLLM / RoleBench: Benchmarking Role-Playing Abilities
Roleplay evaluation	CharacterEval: Role-Playing Conversational Agent Evaluation
Emotional memory	Emotional RAG: Enhancing Role-Playing Agents through Emotional Retrieval
TTRPG tools	Computational Tools for Table-Top Role-Playing Games: A Scoping Review
AI + TTRPG culture	How Do You Want to View This? Generative AI, Creative Ethos, and TTRPG Hobbyists
TTRPG therapy	Scoping Review of TTRPG as Psychological Intervention
D&D wellbeing	Can Playing Dungeons and Dragons Be Good for You?
D&D self-concept	Efficacy of Dungeons & Dragons for Improving Mental Health and Self-Concepts
TTRPG affect	Bardo: Emotion-Based Music Recommendation for Tabletop Role-Playing Games
Story generation survey	A Survey on LLMs for Story Generation. Findings EMNLP 2025