Emotion is temporal. A single scene label misses the useful signal: setup, escalation, reversal, delay, payoff, and aftertaste.
Tian et al. 2024
Emotional Arc Systems
A research dossier on how AI can plan, write, evaluate, and adapt stories that move people — beyond emotion classification and into affective narrative control.
The field is shifting from affect recognition to affect orchestration.
The interesting frontier is not asking a model whether a scene is happy or sad. It is asking whether the scene changes the audience's prediction, alters their allegiance, deepens a character contradiction, and earns the next emotional beat.
LLMs are fluent but flatten stakes. Recent narrative benchmarks find human stories are more suspenseful, arousing, and structurally diverse, while LLM stories tend to be overly positive and low-tension.
EMNLP 2024
Explicit arcs help. When discourse features such as story arcs, turning points, valence, and arousal are integrated, generated narratives improve substantially on diversity/suspense/arousal-style measures.
Narrative discourse
A fuller map: emotional arcs sit at the intersection of eight active research fronts.
The frontier is not one paper family. It is an emerging stack that combines reader psychology, controllable generation, narrative planning, interactive agents, reward modeling, diversity metrics, multimodal affect, and mechanistic interpretability.
What makes people feel a story?
Transportation, empathy, suspense, curiosity, surprise, appraisal, forecast updates.
Defines what the system should optimize besides “positive emotion.”
How do we encode an emotional arc?
Audience belief, character appraisal, world constraint, affect vector, unresolved question.
Turns vague story taste into editable beat cards and trajectory data.
How can LLMs write toward that arc?
Long-form planning, memory, reasoning traces, multi-agent writers, hierarchical outlines.
Fluent prose can still flatten conflict, stakes, and negative valleys.
How do creators steer the arc?
TaleBrush, Elsewise, visual story graphs, storylets, MCTS branch exploration.
Creators edit tension, knowledge gaps, reversals, and payoffs instead of prompting blindly.
Did the arc actually work?
Psychological depth, story rewards, LLM judges, human panels, novelty/echo checks.
Models often reward smoothness; humans reward earned feeling and surprise.
How does it become an entertainment system?
Drop-off, replay, self-report, comments, gameplay state, lab affect signals.
Optimize satisfaction and meaning, not compulsive arousal or manipulation.
The field is becoming a pipeline: theory tells us what matters, representation makes it editable, generation fills scenes, evaluation calibrates taste, product data closes the loop.
The missing primitive is a shared benchmark for target emotional trajectories: give a model an arc, then test whether humans actually move along it.
TaleBrush, Elsewise, WhatELSE, and visual story-writing tools show that writers want to manipulate trajectories, not just prompts. This is the UI layer for affective storytelling.
Long stories need hierarchy, durable character state, and causal memory. Research is moving from paragraph generation toward planning, outlining, and reasoning traces.
Storylets, drama managers, character agents, and MCTS exploration let systems branch while preserving authorial constraints and target emotional pacing.
Generic RLHF is not enough. StoryReward and theory-informed RL point toward rewards grounded in reader preference, narrative structure, and psychological impact.
Plot diversity metrics are becoming core because high-fluency LLMs repeatedly produce the same satisfying-but-stale emotional moves.
Interpretability work suggests emotion concepts may be causal internal features. That opens a path to latent steering of tone, empathy, or harmful emotional strategies.
Scene/beat annotation, valence-arousal plotting, prompt-based arc planning, LLM-assisted critique, and writer-facing visualization.
Story rewards, long-form planning, multi-agent writers, interactive narrative, and MLLM affect interpretation.
Human taste, originality, earned catharsis, cultural specificity, and whether model judges can detect manipulative emotional design.
Arc-aware copilots, short-drama retention systems, game drama managers, companion pacing, and recommendation by mood transition.
The useful literature is a braid: psychology, narratology, planning AI, LLMs, games.
Emotion classification is only one small branch. Emotional arc systems need older theories of story experience, symbolic narrative control, modern LLM generation, and product feedback loops.
Emotional arcs as trajectories, not tags
Reagan et al. computationally studied more than 1,300 Project Gutenberg stories and popularized six broad valence-shape families. The method is simple (lexical sentiment over time), but the conceptual jump is important: story emotion can be normalized, plotted, compared, and designed.
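A minimal sketch of the "lexical sentiment over time" idea, assuming a tiny hand-made lexicon. `TOY_LEXICON`, `valence_arc`, and the window settings are hypothetical illustrations, not the lexicon or procedure Reagan et al. used; the point is only that a story becomes a fixed-length curve that can be compared with other stories.

```python
from statistics import mean

# Hypothetical toy lexicon: word -> valence in [-1, 1]. Illustration only.
TOY_LEXICON = {"joy": 1.0, "love": 0.8, "hope": 0.6,
               "loss": -0.7, "fear": -0.8, "death": -1.0}

def valence_arc(words, window=4, points=3, lexicon=TOY_LEXICON):
    """Mean lexical valence in windows placed at evenly spaced story positions."""
    if not words:
        return []
    arc = []
    for i in range(points):
        # Evenly spaced positions make stories of different length comparable.
        centre = i * (len(words) - 1) // max(points - 1, 1)
        start = max(0, centre - window // 2)
        chunk = words[start:start + window]
        scores = [lexicon[w] for w in chunk if w in lexicon]
        arc.append(mean(scores) if scores else 0.0)
    return arc

story = "hope love joy joy fear loss death fear hope joy love joy".split()
arc = valence_arc(story)  # high start, negative middle, high end: a fall-then-rise shape
```

A real pipeline would swap in a validated sentiment lexicon and smooth over much longer windows; the normalization step is what makes arcs plottable side by side.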
Narrative transportation
Green & Brock frame story immersion as attention, imagery, and emotion being absorbed into the story world. For AI products, this means “engagement” is not just retention; it is cognitive-emotional relocation.
Reduce escape routes
Xie & Riedl's suspense-generation work explicitly draws on cognitive psychology and narratology: suspense increases as the number and plausibility of the protagonist's escape routes decrease. This is a control variable, not a prose style.
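One way to make that control variable concrete is sketched below. This is an assumed operationalization, not the scoring used by Xie & Riedl: it treats each escape route as an independent chance of success and defines suspense as the probability that every route fails. The function name `suspense` and the example numbers are hypothetical.

```python
from math import prod

def suspense(escape_plausibilities):
    """Probability that all escape routes fail, assuming routes are independent."""
    for p in escape_plausibilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("plausibility must lie in [0, 1]")
    return prod(1.0 - p for p in escape_plausibilities)

before = suspense([0.6, 0.5, 0.3])  # three open routes
after = suspense([0.2, 0.1])        # one route closed, the others made less plausible
```

Under this assumption the score rises both when a route is removed and when a remaining route becomes less plausible, which matches the direction of the claim above; with no routes left it reaches 1.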
Measure reader impact
The Psychological Depth Scale shifts evaluation from surface coherence to authenticity, empathy, engagement, narrative complexity, and emotional provocation — closer to what people actually mean by “this story moved me.”
Planning beats next-token prose
TALE-SPIN, narrative planning, and drama management were never obsolete; LLMs made their missing language layer cheap. The old problem — believable character goals balanced against plot shape — is exactly the emotional arc problem.
The best recent result is diagnostic: LLMs lack tension unless you force the discourse layer.
Tian et al. benchmarked LLM and human stories through story arcs, turning points, and affective dimensions such as arousal and valence. Their finding is the north star for this topic: vanilla LLM stories are often structurally homogeneous and emotionally too positive; explicit discourse features improve storytelling.
The echo problem
Microsoft's “Echoes in AI” work quantifies repeated plot elements across LLM outputs with a Sui Generis score. This matters because emotional arcs need freshness: surprise collapses when the model keeps rediscovering the same “twist.”
Story preference is not generic helpfulness
StoryAlign / StoryReward argues that current reward models poorly capture human story preferences and can favor LLM-generated over human-written stories. A story reward model needs narrative preference data, not just chat preference data.
Reward the narrative theory, not only the prose
“Retell, Reward, Repeat” uses Todorov's narrative equilibrium ideas to guide reinforcement learning for more diverse, narrative-convention-aligned stories. This is an early template for theory-informed post-training.
Sketch the arc, then let the model fill it
TaleBrush is an early but important HCI signal: writers can guide generation by drawing story trajectories. It points toward affect-curve editors rather than prompt boxes.
Elsewise makes possibility spaces visible
Elsewise is closer to the actual product thesis: show authors the narrative possibility space so they can compare player-experienced variants against the intended canonical arc.
The next jump is memory, hierarchy, and reasoning traces.
Recent long-form story generation work pushes beyond one-shot prompting: dynamic hierarchical outlining, memory enhancement, reasoning for long-form story generation, and multi-step collaboration. Emotional arcs require the same machinery because a payoff only works if the earlier wound, clue, promise, or humiliation is remembered.
Explore possible arcs, don't accept the first one
Narrative Studio-style systems combine visual exploration with search such as Monte Carlo Tree Search. This matters because emotional design is option search: find the branch with the best pressure, not the first coherent branch.
Authorable responsiveness
LLM storylet frameworks such as Dramamancer/Drama Llama point toward responsive stories where authors specify constraints and arcs while the system adapts local events.
Emergence needs a drama manager
BookWorld and agent-society approaches turn novels into interactive character environments. The emotional arc question becomes: how much autonomy can characters have before story shape collapses?
The clearest way to think about this is as three coupled tracks.
A strong emotional arc model tracks audience state, character state, and plot constraints separately. The affect often comes from their misalignment.
Multi-track arc, not one sentiment curve
audience / character / plot
The generator's job is not to maximize one curve. It must coordinate track differences: the audience may feel dread while the character feels hope; the plot may tighten while dialogue feels calm.
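A small sketch of the three-track idea, assuming each beat is reduced to one number per track. `TrackPoint`, `irony_gap`, and the value ranges are hypothetical names and conventions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class TrackPoint:
    audience: float   # audience valence in [-1, 1]
    character: float  # focal character's valence in [-1, 1]
    plot: float       # plot pressure in [0, 1]

def irony_gap(p: TrackPoint) -> float:
    """Distance between what the audience feels and what the character feels."""
    return abs(p.audience - p.character)

arc = [
    TrackPoint(audience=0.2, character=0.3, plot=0.1),
    TrackPoint(audience=-0.6, character=0.5, plot=0.7),   # audience dreads, character hopes
    TrackPoint(audience=-0.8, character=-0.7, plot=0.9),  # character catches up
]
gaps = [irony_gap(p) for p in arc]
peak_beat = max(range(len(arc)), key=gaps.__getitem__)
```

Keeping the tracks separate lets a planner ask for misalignment on purpose, for example "widen the gap at beat 2, close it at beat 3", instead of steering a single sentiment value.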
System stack
from theory to product
What is frontier?
maturity matrix
Why “just ask the LLM to write a moving story” underperforms.
Modern models can produce emotionally fluent paragraphs. The failure mode is not bad sentences; it is bad affective causality.
RLHF positivity pressure
Helpful chat behavior tends to be polite, safe, agreeable, and conflict-minimizing. Good stories often need delayed gratification, morally uncomfortable choices, betrayal, ambiguity, irreversible loss, and negative valleys.
Local coherence beats global rhythm
Next-token generation is excellent at making the current paragraph plausible. It does not inherently preserve a target affect curve 30 scenes later unless the control layer externalizes the curve.
Models imitate trope averages
LLMs converge toward high-probability plot grammar. That creates readable but expected arcs: sudden redemption, tidy confession, convenient rescue, sentimental closure.
Evaluators are weak at taste
LLM-as-judge can help at system-level story evaluation, but it still struggles with explanations and can share the same taste distortions as the generator. Reward models trained on generic preferences can reward AI-ish smoothness.
The practical conclusion: emotional arc generation needs explicit state, not because humans write from spreadsheets, but because machines otherwise lose the difference between beautiful prose and earned feeling.
The central design problem is a three-way tension: authorial shape, player agency, and generative freedom.
A good CYOA/roleplay system should not choose between railroaded plot and incoherent sandbox. It needs a runtime bargain: the creator locks the dramatic spine, the player controls values and tactics, and the LLM improvises local texture inside auditable boundaries.
What cannot break
Genre promise, premise, canon facts, irreversible losses, turning points, end-state themes, safety boundaries, and the emotional question of the session.
Narrative planning calls this the plot/character balance problem: plans must satisfy author goals while preserving character believability.
What must be truly theirs
Choice of motive, social stance, risk appetite, relationship investment, problem-solving route, sacrifice, and interpretation of what the story means.
Meaningful choice research asks whether players can foresee enough consequences to make intentional decisions.
What may improvise
Dialogue, sensory detail, NPC phrasing, optional complications, soft callbacks, scene variants, and pacing moves that do not mutate protected state.
Dramamancer frames LLMs as transformers of author schemas into player-driven playthroughs, not as unconstrained authors.
Do not let the LLM decide reality. Let it propose emotionally useful possibilities. A drama manager/critic chooses among them against state, rules, target arc, player intent, and author constraints.
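The propose-then-choose split can be sketched as follows. The candidates are hard-coded stand-ins for LLM output, and `Candidate`, `select_beat`, and the state keys are hypothetical; the sketch assumes affect deltas are plain dictionaries and that protected state is a set of keys the model may not mutate.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    text: str
    affect_delta: dict                          # e.g. {"dread": 0.3}
    mutates: set = field(default_factory=set)   # state keys the scene would change

def select_beat(candidates, target_delta, protected_keys):
    """Drop candidates that touch protected state, then pick the closest affect fit."""
    legal = [c for c in candidates if not (c.mutates & protected_keys)]
    if not legal:
        return None  # the caller should ask the proposer for new candidates

    def distance(c):
        keys = set(target_delta) | set(c.affect_delta)
        return sum(abs(target_delta.get(k, 0.0) - c.affect_delta.get(k, 0.0)) for k in keys)

    return min(legal, key=distance)

proposals = [
    Candidate("The ghost vanishes forever.", {"dread": 0.3}, {"ghost_present"}),
    Candidate("The candle gutters; something breathes.", {"dread": 0.25}),
    Candidate("A joke lands and everyone relaxes.", {"dread": -0.2}),
]
chosen = select_beat(proposals, target_delta={"dread": 0.3}, protected_keys={"ghost_present"})
```

The first proposal fits the target best but is rejected because it mutates protected state, which is the behavior the paragraph above argues for: the model suggests, the manager decides.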
How far the model can invent beyond canon: flavor only, local obstacle, new NPC, side quest, or branch-level consequence.
How strongly branches are pulled toward planned bottlenecks: low for sandbox, high for episodic drama and CYOA production budgets.
How much prior fear, guilt, tenderness, betrayal, or humiliation must be paid off before the system can change tone.
How many irreversible, remembered player-authored changes the episode can afford without exploding production/evaluation cost.
Rules engine
Owns dice, inventory, HP, cooldowns, clocks, unlocks, permissions, and hard contradictions. This layer is deterministic.
no prose
Canonical state ledger
Stores world facts, branch history, NPC beliefs, relationship debt, player values inferred from choices, and unresolved emotional questions.
auditable
Drama manager
Chooses the next beat type: escalate, reveal, delay, reverse, reward, punish, repair, cooldown, or converge. Classic drama management already treated interactive story as search over experience quality.
survey
LLM proposer
Generates 3-8 candidate affordances/scenes conditioned on state, target affect delta, allowed improv radius, and forbidden mutations.
creative
Shadow critic
Rejects railroading, bland false choices, rule drift, safety violations, tonal whiplash, and branches that erase emotional residue.
protective
Renderer + interface
Shows the creator a graph, state diffs, arc fit, branch explosion, and why each recommendation preserves or spends agency.
explainable
Spine + pearls
Keep the sequence of major dramatic beads fixed, but let players choose how they approach each bead and what emotional cost they carry into it.
Branch, bottleneck, residue
Branches can reconverge, but must keep different scars: allies, secrets, guilt, injuries, rumors, or NPC appraisal. Same event; different accusation.
Storylets over chapters
Author small narrative units with preconditions/effects. The model selects and renders them based on state instead of inventing an unbounded plot.
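A minimal storylet sketch, assuming state is a flat dictionary and preconditions are exact-match checks. `Storylet`, `available`, `apply_effects`, and the example content are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Storylet:
    name: str
    requires: dict  # state key -> value that must hold
    effects: dict   # state key -> value set when the storylet fires

def available(storylets, state):
    """Storylets whose preconditions all hold in the current state."""
    return [s for s in storylets
            if all(state.get(k) == v for k, v in s.requires.items())]

def apply_effects(storylet, state):
    """Return a new state; the input dict is left untouched so diffs stay auditable."""
    return {**state, **storylet.effects}

storylets = [
    Storylet("ghost confides", {"ghost_trust": "high"}, {"knows_secret": True}),
    Storylet("mirror accuses", {"has_locket": True}, {"ghost_trust": "low"}),
]
state = {"has_locket": False, "ghost_trust": "high"}
options = available(storylets, state)
next_state = apply_effects(options[0], state)
```

In this arrangement the model's job is limited to choosing among `options` and rendering the chosen unit as prose; the plot space stays bounded by what the author wrote.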
Affordance grammar
Offer verbs tied to values: confess, bargain, threaten, comfort, investigate, flee, sacrifice, flirt, deceive. This makes choices emotionally interpretable.
Consequence ladder
Every action can change surface text, local resource, NPC belief, world clock, future route, or ending. Reserve high-tier changes for peak agency beats.
Cooldown rights
For roleplay, the player must be able to step down intensity: humor, ritual, quiet travel, intimacy repair, fade-to-black, or out-of-character boundary checks.
GENEVA generates branching and reconverging narrative graphs from designer constraints, which is exactly the graph-shaped authoring primitive CYOA needs.
GENEVA
Player-driven emergence work shows LLM NPCs can create fun unscripted nodes, but those nodes need selection, summarization, and reintegration into canon.
Player-driven emergence
New AIIDE work on strong-story experience management, state-space visualization, and adversarial managers suggests the frontier is tooling for controllable state search.
State-space visualization
The useful system mental model: do not maximize freedom; preserve meaningful freedom inside a dramatic corridor.
This report treats agency/coherence as an operating-system problem. These visual models show where products fail and what a creator tool should make visible before generation happens.
Railroad
The story stays coherent because the player is mostly decorating a fixed route. Emotional arc works once; replay collapses.
Living spine
The creator protects bottlenecks and themes while the player authors motive, cost, route, relationship, and residue.
Chat drift
The model keeps talking, but choices do not compound. It feels free moment to moment and empty in retrospect.
Chaos sandbox
The player can do anything; the system cannot make it add up. Great anecdotes, weak dramatic memory.
The locket remains sealed.
Mirror reveals truth slowly.
Ghost thinks you chose care over speed.
Tenderness plus dread.
Ghost shields you, but withholds one clue.
The locket opens early.
Truth is gained, sanctuary is broken.
Ghost thinks you treat grief as a tool.
Power plus shame.
Mirror accuses you with the stolen voice.
Start
One premise, one emotional contract, one protected truth.
Diverge
Choices test values: mercy, truth, power, loyalty.
Explode
Dialogue and tactics multiply faster than humans can author.
Compress
Track scars, debts, beliefs, resources, secrets.
Rejoin
Same mirror scene, different accusation and ally behavior.
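The rejoin step can be sketched as a bottleneck scene that is fixed in position but parameterized by what the player carried in. The residue flags and the fallback accusation are hypothetical; the two main accusations reuse the locket example above.

```python
def mirror_scene(residue):
    """Same bottleneck for every path; accusation and ally behavior depend on residue."""
    if "stole_locket" in residue:
        accusation = "You treat grief as a tool."
    elif "comforted_ghost" in residue:
        accusation = "You chose care over speed."
    else:
        accusation = "You chose nothing."  # hypothetical fallback for untracked paths
    return {
        "scene": "mirror",
        "accusation": accusation,
        "ghost_shields_player": "comforted_ghost" in residue,
    }

path_a = mirror_scene({"comforted_ghost"})
path_b = mirror_scene({"stole_locket"})
```

Both paths reach the same scene, so authoring cost stays flat, while the compressed residue set keeps the earlier choice visible in what the scene says and who helps.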
False choice is worse than no choice.
If the system offers three options but normalizes them into the same emotional outcome, players learn not to care.
Agency is partly retrospective.
Players judge agency when a later scene remembers them. A callback can make an old choice feel larger than it was.
Coherence comes from forbidden moves.
The strongest systems are explicit about what the model may not invent: canon facts, rule state, consent boundaries, and earned payoffs.
Emotion is accounting.
Fear, guilt, trust, attraction, and resentment are not decorations; they are liabilities and assets that need later settlement.
NPCs should misread the player.
Perfectly accurate NPC memory is less dramatic than appraisal memory: what they think you meant, not only what you did.
Railroading often hides in pacing.
If the system escalates before acknowledging player intent, even a technically branching plot feels coercive.
Cool-down is a feature, not filler.
Quiet beats restore trust and make the next spike tolerable. Especially important for romance, shame, horror, and grief roleplay.
The model should explain its pressure move.
Creators need to see whether a suggestion spends agency, raises dread, repairs trust, compresses branches, or pays emotional debt.
How an LLM can write emotional arcs in 2026.
The strongest architecture is a hybrid writer's room: narratology supplies control variables, LLMs supply semantic imagination, search supplies alternatives, reward/evaluation supplies pressure, and audience data supplies calibration.
Choose an affect grammar before plot.
Define the target emotional contract: revenge catharsis, dread-to-release, cozy safety, tragic recognition, erotic uncertainty, heroic awe. Then choose dimensions: valence, arousal, suspense, curiosity, empathy, agency, intimacy, moral outrage, relief.
Separate character emotion from audience emotion.
A character may feel safe while the audience feels dread because the audience knows the monster is behind the door. Most simple emotion pipelines collapse this distinction; suspense depends on preserving it.
Build an appraisal ledger.
For each character: goal, threat, blame, shame, obligation, perceived control, likely action. Appraisal models matter because they explain why a character emotion changes instead of merely naming the emotion.
Plan beats as state transitions, not synopsis.
Each beat should specify: before-state, event, new information, irreversible consequence, affect delta, audience question, and the next pressure. This keeps the LLM from writing pleasant filler.
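The beat fields listed above map directly onto a record type. The sketch below assumes free-text fields and a dictionary affect delta; `Beat` and `filler_warnings` are hypothetical names, and the three checks are one possible lint, not an established rubric.

```python
from dataclasses import dataclass

@dataclass
class Beat:
    before_state: str
    event: str
    new_information: str
    irreversible_consequence: str
    affect_delta: dict
    audience_question: str
    next_pressure: str

def filler_warnings(beat):
    """Flag gaps that usually mean the beat changes nothing."""
    warnings = []
    if not beat.irreversible_consequence.strip():
        warnings.append("no irreversible consequence")
    if not any(abs(v) > 0 for v in beat.affect_delta.values()):
        warnings.append("affect delta is zero")
    if not beat.audience_question.strip():
        warnings.append("no open audience question")
    return warnings

pleasant = Beat("They arrive.", "They have tea.", "", "", {"dread": 0.0}, "", "")
charged = Beat("They arrive.", "The host locks the door.", "The host knew her mother.",
               "Leaving now means confrontation.", {"dread": 0.3},
               "What does the host want?", "The key is visible on the mantel.")
```

Running the lint before any prose is drafted is cheap, and it catches the "pleasant filler" case while the plan is still a handful of fields.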
Use iterative planning for tension.
For suspense, generate possible escape routes, then adversarially reduce their plausibility. For romance, generate intimacy opportunities, then introduce value conflict. For comedy, build expectation, then violate it safely.
Draft scene prose only after the arc skeleton is stable.
Generate multiple scene variants conditioned on the target delta: “raise dread without revealing the threat,” “increase empathy while lowering trust,” “convert shame into resolve.”
Evaluate with plural judges.
Use a specialized arc evaluator, a psychological depth rubric, a novelty/diversity checker, a safety/manipulation checker, and sampled human readers. Do not let the same model be sole writer and judge.
Search, rerank, and rewrite.
Use beam search, tree search, storylets, or multi-agent writer/editor loops to explore beat alternatives. Select for target-arc fit, causal coherence, novelty, and reader effect, not just fluency.
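A toy beam search over beat alternatives, assuming each alternative already carries a tension estimate and that fit is measured as absolute distance from a target curve. The beat names, tension values, and the `beam_search` helper are hypothetical; in practice the alternatives would come from a generator and the score would also include coherence and novelty terms that depend on the whole path.

```python
def beam_search(options_per_step, target, beam_width=2):
    """Keep the beam_width partial arcs whose tension values best track the target."""
    beams = [([], 0.0)]  # (chosen beat names, accumulated error)
    for step, options in enumerate(options_per_step):
        expanded = [
            (path + [name], err + abs(tension - target[step]))
            for path, err in beams
            for name, tension in options
        ]
        beams = sorted(expanded, key=lambda b: b[1])[:beam_width]
    return beams[0]

options = [
    [("quiet arrival", 0.1), ("ambush", 0.9)],
    [("odd noise", 0.4), ("picnic", 0.0)],
    [("door locks", 0.8), ("all is well", 0.1)],
]
best_path, best_error = beam_search(options, target=[0.1, 0.4, 0.8])
```

With a purely additive error like this one a greedy pick would do equally well; the beam earns its cost once the score depends on earlier choices, for example when a payoff only counts if its setup is on the path.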
CYOA, roleplay, and TTRPGs need emotional state machines, not just branching prose.
Linear stories optimize one intended sequence. CYOA and roleplay optimize a possibility space: many paths should feel agentic, emotionally legible, mechanically valid, and still converge toward satisfying dramatic pressure.
Branching/reconverging narrative generation is now a concrete research object, not just a design wish. The practical question is how to make every reconvergence preserve emotional residue.
Converging narratives
Solo roleplay studies comparing static prompt GMs with agentic ReAct-style GMs report gains in modularity, immersion, and curiosity, but still require explicit rules/state ownership.
Static vs agentic GM
RoleLLM, CharacterEval, MMRole, and Emotional RAG point to a simple lesson: roleplay quality depends on character memory, speaking style, multimodal/persona cues, and emotional retrieval.
Emotional RAG
A 2025 scoping review of computational TTRPG tools shows the domain is not only text generation: it includes encounter support, music, maps, logs, accessibility, and GM workload reduction.
TTRPG tools review
Recent TTRPG mental-health work treats roleplay as social rehearsal, identity play, emotion regulation, and support, which means affect systems need consent and cooldown design.
TTRPG intervention review
TTRPG hobbyist research on generative AI matters product-wise: many players care who authored the magic. Interfaces should expose provenance and keep creators in control.
Generative AI + TTRPG hobbyists
Define the emotional contract
Before branches, define what the play session promises: dread, betrayal, cozy recovery, temptation, heroic agency, social embarrassment, romance, moral injury.
Offer values, not menus
Good choices reveal player priority: safety vs loyalty, power vs mercy, truth vs belonging. Avoid choices that are just “left door / right door.”
Track affect deltas
Each choice changes trust, threat, guilt, hope, faction memory, NPC appraisal, and the player's perceived agency.
Converge without cheating
Branches can rejoin, but the emotional residue must differ. Same boss fight; different ally, wound, secret, or shame.
Make earlier emotion matter
The climax should cash out prior player identity: cowardice, mercy, greed, tenderness, suspicion, curiosity, or sacrifice.
Branch on moral emotion
Represent branches as changes in guilt, trust, anger, debt, longing, fear, and pride. Plot state matters, but emotional accounting makes choices memorable.
Use bottleneck diamonds
Let players diverge, then reconverge at fixed dramatic bottlenecks. Preserve agency by carrying path-specific scars, allies, rumors, and NPC attitudes.
Separate canon from variant
Keep a canonical spine, then generate variants around it. This matches WhatELSE/Elsewise-style authoring: creators see the possibility space, not just one transcript.
Never let the LLM own rules
For TTRPG/DnD, the model can narrate, improvise, and appraise emotion; deterministic code should own dice, inventory, HP, constraints, and irreversible state.
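A sketch of that ownership boundary, assuming a d20-style check. `skill_check`, `apply_damage`, and the state layout are hypothetical; the only library call is Python's standard `random.Random`, seeded so a session can be replayed.

```python
import random

def skill_check(modifier, dc, rng):
    """The code, not the model, decides success; the result is reproducible per seed."""
    roll = rng.randint(1, 20)
    total = roll + modifier
    return {"roll": roll, "total": total, "dc": dc, "success": total >= dc}

def apply_damage(state, amount):
    """HP changes only here; the narrator receives the result as read-only input."""
    return {**state, "hp": max(0, state["hp"] - amount)}

rng = random.Random(7)  # seeded so the session can be audited and replayed
check = skill_check(modifier=3, dc=15, rng=rng)
state = apply_damage({"hp": 12, "inventory": ["rope"]}, 0 if check["success"] else 5)
narrator_input = {"check": check, "state": state}  # what the LLM is allowed to see
```

The model is then prompted with `narrator_input` and asked to narrate the outcome; nothing it writes flows back into `state` without passing through functions like these.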
NPCs need appraisal memory
Store what each NPC believes the player did, why they think it happened, and what emotion it caused. “Trust -2” is weaker than “she thinks you abandoned her.”
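An appraisal entry can hold both the number and the belief behind it. The sketch below assumes a flat list as the ledger; `Appraisal`, `summarize`, and the NPC "Mara" are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Appraisal:
    npc: str
    observed_action: str  # what the NPC believes the player did
    inferred_motive: str  # why the NPC thinks it happened
    emotion: str          # what it caused
    trust_delta: int      # the bare number, kept for mechanics

def summarize(ledger, npc):
    """Prompt-ready belief lines for one NPC, plus the summed trust change."""
    entries = [a for a in ledger if a.npc == npc]
    lines = [
        f"{a.npc} thinks you {a.observed_action} because {a.inferred_motive}; feels {a.emotion}."
        for a in entries
    ]
    return lines, sum(a.trust_delta for a in entries)

ledger = [
    Appraisal("Mara", "left her at the bridge", "you valued the relic over her", "betrayal", -2),
    Appraisal("Mara", "returned her ring", "you felt guilty", "wary gratitude", 1),
]
lines, trust = summarize(ledger, "Mara")
```

The rules layer can use `trust`, while the narrator gets `lines`, so later dialogue can reference the misreading itself rather than a score.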
Run a shadow DM critic
A second model should audit railroading, rule drift, bland stakes, safety, and whether each choice changed the player's emotional situation.
Use convergence prompts
When two branches must rejoin, ask for a scene compatible with both histories while preserving emotional residue. This is exactly the point of converging-narrative work.
Design emotional cooldowns
Roleplay intensity needs pacing: after shame, fear, or conflict, offer repair, humor, ritual, loot, intimacy, or quiet travel. Otherwise the experience becomes exhausting.
Suggested runtime state for CYOA / roleplay
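One possible shape for that state is sketched here, assembled from the ledger contents named earlier (world facts, branch history, NPC beliefs, relationship debt, inferred player values, unresolved emotional questions). The field names, example values, and the JSON snapshot are assumptions, not a schema taken from any cited system.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class RuntimeState:
    world_facts: dict = field(default_factory=dict)        # canon; only rules code writes here
    branch_history: list = field(default_factory=list)     # ordered player choices
    npc_beliefs: dict = field(default_factory=dict)        # npc -> what they think happened
    relationship_debt: dict = field(default_factory=dict)  # npc -> unpaid emotional debt
    player_values: dict = field(default_factory=dict)      # inferred from choices
    open_questions: list = field(default_factory=list)     # unresolved emotional questions
    affect: dict = field(default_factory=dict)              # current target affect vector

state = RuntimeState(
    world_facts={"locket_sealed": True},
    branch_history=["comfort the ghost"],
    npc_beliefs={"ghost": "you chose care over speed"},
    relationship_debt={"ghost": 1},
    player_values={"mercy": 0.7, "truth": 0.3},
    open_questions=["What is the ghost withholding?"],
    affect={"dread": 0.4, "tenderness": 0.6},
)
snapshot = json.dumps(asdict(state), sort_keys=True)  # diffable between turns
```

Serializing the whole state each turn gives the creator interface something to diff, which is what makes branch behavior auditable.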
Choice quality rubric
Given the current scene, propose 3 player choices. For each choice return:
- surface action
- hidden value tested: loyalty / curiosity / power / mercy / self-preservation / truth
- expected player emotion
- NPC appraisal changes
- world-state changes
- future callback seed
- convergence strategy: unique branch / soft rejoin / hard bottleneck
- risk of railroading or false choice
Do not write prose yet. Design the choice.
You are the narrator, not the rules engine.
Input: player action, dice result, canonical world state, NPC appraisal ledger, target emotional beat.
Output:
1. vivid narration grounded in the dice result
2. one emotional consequence
3. one world-state consequence
4. one NPC belief update
5. 2-3 possible next affordances
Never alter inventory, HP, DC, or facts unless provided by the rules engine.
Prompting should externalize the hidden affect machine.
Test-time prompting is useful when it acts like a control protocol, not when it merely asks for “more emotion.”
Define the emotional contract for this story before writing plot. Return:
- target audience affect at beginning / midpoint / ending
- primary affect dimensions: valence, arousal, suspense, empathy, curiosity, relief
- forbidden shortcuts: coincidence, sudden confession, unearned rescue, melodrama
- central wound and desire for each major character
- audience knowledge gap: what the audience knows that characters do not
- 8-12 beat arc with affect deltas, not prose
For beat N, produce a scene plan with:
1. Before-state: world, character belief, audience belief
2. Event: what changes externally
3. Appraisal: why each character emotionally changes
4. Audience delta: what emotion should rise/fall
5. Suspense variable: what possible escape route narrows
6. New question created
7. Irreversible consequence
8. Exact affect vector: valence, arousal, tension, intimacy, agency
Evaluate the scene as an emotional arc component, not as prose. Score 1-7 with evidence:
- target affect delta achieved
- character appraisal is psychologically plausible
- audience knowledge gap is clear
- causal consequence is irreversible
- tension is earned rather than asserted
- novelty: does this avoid common LLM plot echoes?
- aftertaste: what emotional residue remains for the next beat?
Recommend one rewrite operation only.
Rewrite only the beat mechanics, then prose.
Operation: increase suspense by reducing perceived escape routes while preserving character agency.
Constraints:
- no new villain reveal
- no coincidence
- protagonist makes a rational choice that worsens the situation
- audience realizes the danger 2 paragraphs before protagonist
- end with arousal high, valence negative, curiosity unresolved
EmotionPrompt-style emotional stimuli can change LLM behavior, but for story work the bigger win is structural prompting: make the model reason about stakes, appraisal, information asymmetry, and affect deltas. Emotional prompting says “this matters.” Arc prompting says “here is the machine that makes it matter.”
EmotionPrompt
Measurement should ask: did the arc change the reader, not did the text contain emotion words?
The best evaluation stack combines theory-based metrics, model-assisted annotation, human panels, and product telemetry.
| Layer | What to measure | Why it matters |
|---|---|---|
| Text / script | Valence-arousal curves, turning points, sentiment volatility, discourse role, emotion-cause pairs. | Fast proxy for arc shape; useful for drafts but not sufficient for audience response. |
| Narrative mechanics | Goal conflict, causal chain, reversals, narrowing options, information gaps, irreversible consequences. | Separates earned emotion from decorative emotional language. |
| Reader psychology | Transportation, empathy, suspense, curiosity, psychological depth, surprise, aftertaste. | Closer to the entertainment experience than classification accuracy. |
| Model evaluation | LLM-as-judge correlations, PDS-style scoring, novelty checks, Sui Generis-like echo detection, reward model ranking. | Cheap iteration; must be calibrated because judges can prefer AI smoothness. |
| Product telemetry | Completion, rewatches, skips, scene drop-off, comments, shares, binge continuation, save/replay, explicit mood response. | Turns arc design into product learning while guarding against raw arousal optimization. |
| Lab / multimodal | Facial action, voice, gaze, heart rate, GSR, EEG/fNIRS, continuous affect sliders. | Useful for high-value studies, trailers, games, and embodied entertainment; never a mind-reading oracle. |
Product opportunity: emotional arcs become an interface.
The strongest products will not say “AI writes stories.” They will let creators and systems sculpt audience experience with visible affect controls.
Arc debugger
Upload a script, get scene-level valence/arousal/tension, character appraisal, missing reversals, flat middle, tonal whiplash, and rewrite operators.
Episode retention by emotional question
Model the unresolved question at each episode boundary: revenge, romantic uncertainty, threat, secret, humiliation, justice. Optimize cliffhangers without collapsing into cheap shock.
Drama manager + affect loop
Games already have an affective loop: sense player state, infer experience, adapt mechanics/content. LLMs add flexible narrative surface; drama managers preserve authorial control.
Relational pacing
Track trust, intimacy, boundary, repair, playfulness, and user vulnerability over time. Explicit modeling is required here for safety, not just quality.
Recommend the next mood transition
Instead of “similar title,” recommend “after dread, give relief,” “after sadness, give agency,” or “for this user, avoid stacking humiliation beats.”
Micro-arcs
Trailers, hooks, choruses, ads, and TikToks operate as compressed affective journeys. The same machinery applies at 20 seconds or 20 episodes.
An ideal creator interface is an emotional control room for possibility spaces.
For CYOA, roleplay, and TTRPGs, creators need to see branches, state, emotional intensity, NPC appraisals, and convergence pressure at the same time. The best UI is not a chat box; it is a story graph plus an affect debugger plus a DM copilot.
[Interface mockup: a story graph with labels “comfort the ghost”, “steal the locket”, “the mirror speaks”, and “truth or belonging”, plus an affect readout of curiosity .62.]
Raise dread without removing agency
Reveal that the exit still exists, but every route requires betraying a different NPC promise. Keep the player choosing; make each choice emotionally expensive.
- Affect target: arousal +0.28, valence -0.18, agency stable.
- NPC update: ghost interprets hesitation as abandonment.
- Convergence: all paths reach the mirror; each path changes what the mirror accuses them of.
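The affect target in this recommendation can be read as a delta applied to the current affect vector. A minimal sketch, assuming dimensions live in [-1, 1] and that "agency stable" means a zero delta; `apply_delta` and the starting values are hypothetical.

```python
def apply_delta(affect, delta, lo=-1.0, hi=1.0):
    """Add a beat's target delta to the current affect vector, clamped to [lo, hi]."""
    keys = set(affect) | set(delta)
    return {k: min(hi, max(lo, affect.get(k, 0.0) + delta.get(k, 0.0))) for k in keys}

current = {"arousal": 0.5, "valence": 0.1, "agency": 0.6}
target = apply_delta(current, {"arousal": 0.28, "valence": -0.18})  # agency left out: stable
```

The resulting `target` is what candidate scenes would be scored against; dimensions missing from the delta pass through unchanged, which is how the agency constraint is expressed here.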
Possibility-space map
Shows branch explosion, bottlenecks, dead branches, emotional coverage, and which player choices actually matter.
Affect mixer
Lets creators drag tension, intimacy, dread, comedy, agency, and relief targets per beat or per branch.
NPC appraisal ledger
Tracks what each character believes, feels, wants, fears, owes, and misunderstands after each player action.
DM safety + consent panel
Intensity caps, lines/veils, emotional cooldown suggestions, and “fade to black” controls for roleplay-heavy scenarios.
The fascinating intersections are not obvious “emotion AI.”
Mechanistic emotion concepts
Anthropic's 2026 interpretability work suggests emotion concepts can be internal, causal features affecting model behavior, preferences, and misaligned behaviors. For story systems, this raises a new question: can we control affect through latent steering rather than only through prompts?
Multi-agent character societies
Character agents can simulate conflicting goals and generate emergent plot pressure, but still need authorial constraint. The interesting architecture is not autonomous agents alone; it is agents plus a drama manager plus target emotional arc.
Narrative theory as reward
Instead of generic “good story” labels, 2026 work points toward reward functions grounded in narrative theories: equilibrium/disruption/repair, turning points, suspense, psychological depth, character agency.
Anti-homogenization metrics
If everyone uses the same models, entertainment risks converging on the same emotional grammar. Plot diversity metrics and surprise measures may become as important as coherence metrics.
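As a rough illustration of what a plot diversity score can look like, the sketch below averages pairwise Jaccard distance over sets of plot events. It is a stand-in chosen for simplicity, not the metric used in any of the cited papers, and it assumes plots have already been reduced to event labels by some upstream step.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the overlap ratio of two event sets; 0.0 means identical."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def mean_pairwise_diversity(plots):
    """Average Jaccard distance over all pairs of plots; 0.0 for fewer than two plots."""
    pairs = list(combinations(plots, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical event labels for three generated stories.
plots = [
    ["betrayal", "confession", "rescue"],
    ["betrayal", "confession", "rescue"],
    ["betrayal", "exile", "return"],
]
score = mean_pairwise_diversity(plots)
```

A score near 0 across many samples from the same prompt would be one signal of the convergence described above.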
Ethics of affect optimization
Emotional design is the essence of entertainment; affect exploitation is the risk. Future systems need guardrails against maximizing compulsive arousal, loneliness bonding, humiliation loops, or outrage retention.
Where the frontier still feels genuinely open.
The most interesting work is not another emotion classifier. It is a set of missing primitives for affective narrative systems: benchmarks, state representations, controllable decoders, reader simulators, and safety-aware product loops.
ArcBench: evaluate trajectories, not outputs
A benchmark where models receive a target multi-track arc and must produce scenes whose human-rated audience affect matches the curve. This would expose the difference between “emotional prose” and controlled emotional movement.
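Scoring for such a benchmark might compare the target curve with mean human ratings per beat. The sketch below is one hedged possibility rather than a proposed standard: it reports root-mean-square error for level and a direction-agreement rate for movement, and the function names and the example tension values are hypothetical.

```python
import math


def rmse(target, rated):
    """Root-mean-square error between the target curve and mean human ratings."""
    if len(target) != len(rated) or not target:
        raise ValueError("curves must be non-empty and the same length")
    return math.sqrt(sum((t - r) ** 2 for t, r in zip(target, rated)) / len(target))


def direction_agreement(target, rated):
    """Fraction of beat-to-beat moves whose sign matches the target's."""
    if len(target) != len(rated) or len(target) < 2:
        raise ValueError("need two same-length curves with at least two beats")

    def sign(x):
        return (x > 0) - (x < 0)

    t_moves = [sign(b - a) for a, b in zip(target, target[1:])]
    r_moves = [sign(b - a) for a, b in zip(rated, rated[1:])]
    return sum(t == r for t, r in zip(t_moves, r_moves)) / len(t_moves)


# One track of a hypothetical multi-track arc: tension per beat.
target_tension = [0.2, 0.5, 0.8, 0.4]
rated_tension = [0.3, 0.4, 0.6, 0.5]  # illustrative mean reader ratings
```

Reporting both numbers separates two failure modes: prose that sits at the right intensity but never moves, and prose that moves in the right direction at the wrong scale.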
An affective story compiler
A small DSL for story states: goals, appraisal, audience knowledge, tension, possible escape routes, and required affect deltas. The LLM becomes the renderer; the compiler guards structure.
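The compiler idea can be approximated with an embedded Python stand-in for the DSL: beats are declared as data, and a checker rejects plans that break structural rules before any prose is rendered. The `Beat` fields, the three rules, and the 0.5 jump limit are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Beat:
    """Hypothetical beat card; every field name here is illustrative."""
    name: str
    goal: str
    reveals: set = field(default_factory=set)    # facts the audience learns in this beat
    requires: set = field(default_factory=set)   # facts the audience must already know
    escape_routes: int = 1                       # options still open to the protagonist
    tension_delta: float = 0.0                   # required change in tension


def check(beats, max_jump=0.5):
    """Return a list of structural errors; an empty list means the plan may be rendered."""
    errors, known = [], set()
    for beat in beats:
        missing = beat.requires - known
        if missing:
            errors.append(f"{beat.name}: audience lacks {sorted(missing)}")
        if beat.escape_routes < 1:
            errors.append(f"{beat.name}: no escape route left")
        if abs(beat.tension_delta) > max_jump:
            errors.append(f"{beat.name}: tension jump {beat.tension_delta:+.2f} exceeds {max_jump}")
        known |= beat.reveals
    return errors


plan = [
    Beat("locket", goal="find the keepsake", reveals={"locket_is_cursed"}, tension_delta=0.2),
    Beat("mirror", goal="face the accusation",
         requires={"locket_is_cursed", "ghost_name"}, tension_delta=0.3),
]
errors = check(plan)
```

Here the plan fails because the mirror beat depends on a fact the audience was never given, which is exactly the kind of error an LLM renderer would paper over with fluent prose.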
Reader-state simulation
Instead of predicting “emotion,” predict the reader's changing beliefs: what they expect, fear, hope, and misunderstand. Surprise, dread, and catharsis are forecast-update phenomena.
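Read as forecast updating, a reader state can be modeled as a probability distribution over hypotheses about the story. The sketch below uses standard quantities (Bayes' rule, surprisal, and KL divergence) on a two-hypothesis toy example; the hypotheses and all probabilities are invented for illustration, and mapping these numbers onto felt surprise or dread is an assumption, not an established result.

```python
import math


def surprisal(belief, outcome):
    """Bits of surprise when `outcome` turns out true under the current belief."""
    return -math.log2(belief[outcome])


def update(prior, likelihood):
    """Bayes update: posterior is proportional to prior times evidence likelihood."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnorm.values())
    return {h: v / total for h, v in unnorm.items()}


def belief_shift_bits(posterior, prior):
    """KL divergence from prior to posterior: how far one beat moved the forecast."""
    return sum(p * math.log2(p / prior[h]) for h, p in posterior.items() if p > 0)


# Hypothetical reader forecast about the ghost before the scene.
prior = {"hostile": 0.25, "grieving": 0.75}
# How likely the new evidence (the ghost locks the door) is under each hypothesis.
likelihood = {"hostile": 0.8, "grieving": 0.2}
posterior = update(prior, likelihood)
shift = belief_shift_bits(posterior, prior)
```

In this toy case the scene more than doubles the reader's probability that the ghost is hostile, and `shift` gives a single number for how much the forecast moved.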
Appraisal memory for characters
Persistent character emotion should be derived from goals, beliefs, shame, blame, social debt, and perceived control. This would make long-form emotional change less arbitrary.
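A derivation of this kind could start from a small rule table over appraisal variables. The sketch below is a deliberately coarse simplification in the spirit of appraisal theories, not an implementation of any specific model: the three inputs, the thresholds, and the emotion labels are assumptions.

```python
def appraise(goal_congruence, responsible, control):
    """Map appraisal variables to a coarse emotion label.

    goal_congruence: -1..1, how much the event helps (+) or hurts (-) the character's goal
    responsible:     "self", "other", or "none" (who the character credits or blames)
    control:         0..1, how much the character believes they can still change things
    """
    if goal_congruence >= 0:
        if responsible == "self":
            return "pride"
        return "gratitude" if responsible == "other" else "relief"
    if responsible == "self":
        return "shame"
    if responsible == "other":
        return "anger" if control >= 0.5 else "resentment"
    return "frustration" if control >= 0.5 else "despair"


# The ghost reads the player's hesitation as abandonment and feels powerless to stop it.
ghost_feeling = appraise(goal_congruence=-0.8, responsible="other", control=0.2)
```

Because the label is computed from stored beliefs rather than assigned per scene, changing the emotion requires changing a belief, a goal, or perceived control, which is what makes long-form change traceable.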
Anti-template decoding
Generation should penalize common LLM emotional shortcuts: sudden confession, tidy redemption, convenient rescue, moral sermon, sentimental closure. Diversity metrics become creative control.
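The simplest approximation of this idea is reranking rather than true decoding-time control: generate several candidates, then subtract a penalty for each detected shortcut. The sketch below uses keyword cues purely as a placeholder for a real trope classifier; the cue phrases, the penalty weight, and the base scores are all hypothetical.

```python
# Shortcut names come from the text above; the cue phrases are illustrative placeholders.
SHORTCUTS = {
    "sudden confession": ("suddenly confessed", "blurted out the truth"),
    "tidy redemption": ("was forgiven", "made everything right"),
    "convenient rescue": ("arrived just in time",),
    "moral sermon": ("the lesson was",),
    "sentimental closure": ("everything would be okay",),
}


def shortcut_hits(text):
    """Names of shortcuts whose cue phrases appear in the text (crude stand-in for a classifier)."""
    lowered = text.lower()
    return [name for name, cues in SHORTCUTS.items() if any(c in lowered for c in cues)]


def rerank(candidates, penalty=0.3):
    """Sort (text, base_score) pairs by base score minus a penalty per shortcut hit."""
    return sorted(
        candidates,
        key=lambda c: c[1] - penalty * len(shortcut_hits(c[0])),
        reverse=True,
    )


candidates = [
    ("She suddenly confessed everything, and he was forgiven.", 0.9),
    ("She kept the letter folded in her coat and said nothing.", 0.7),
]
ranked = rerank(candidates)
```

The first candidate has the higher base score but triggers two shortcuts, so the quieter continuation ranks first.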
Latent affect steering with safeguards
If internal emotion concepts are causal, future systems may steer empathy, dread, intimacy, or aggression below the prompt layer. That is powerful for art and dangerous for manipulation.
Selected research corpus.
Primary and high-signal sources used to build this dossier. Links were title-checked after the earlier incorrect-reference issue; source labels now avoid unsupported paper names.
- Emotional arcs: Reagan et al., The emotional arcs of stories are dominated by six basic shapes
- LLM narrative benchmark: Tian et al., Are Large Language Models Capable of Generating Human-Level Narratives?
- Suspense generation: Xie & Riedl, Creating Suspenseful Stories: Iterative Planning with LLMs
- Narrative theory survey: Liu, Joshi & Dawson, Narrative Theory-Driven LLM Methods
- Theory-informed RL: Liu et al., Retell, Reward, Repeat
- Story reward: Xia et al., StoryAlign / StoryReward
- Plot diversity: Xu et al., Echoes in AI: Quantifying Lack of Plot Diversity
- Psychological depth: Harel-Canada et al., Measuring Psychological Depth in Language Models
- Story evaluation: Chhun et al., Do Language Models Enjoy Their Own Stories?
- Creative writing: Ismayilzada et al., Evaluating Creative Short Story Generation in Humans and LLMs
- Narrative transportation: Green & Brock, The role of transportation in public narratives
- Narrative planning: Riedl & Young, Narrative Planning: Balancing Plot and Character
- Drama management: Roberts & Isbell, Drama management survey
- Foundation affect: Schuller et al., Affective computing has changed
- Emotion prompting: Li et al., Large Language Models Understand and Can Be Enhanced by Emotional Stimuli
- Mechanistic interpretability: Anthropic, Emotion concepts and their function in a large language model
- Games: Yannakakis & Melhart, Affective Game Computing: A Survey
- Arc UI: TaleBrush, Sketching Stories with Generative Pretrained Language Models
- Possibility-space UI: Elsewise, Authoring AI-Based Interactive Narrative with Possibility Space Visualization
- Long-form memory: Dynamic Hierarchical Outlining with Memory-Enhancement
- Long-form reasoning: Learning to Reason for Long-Form Story Generation
- Visual exploration: Narrative Studio, Visual Narrative Exploration using LLMs and MCTS
- Storylets: Drama Llama / Dramamancer, LLM-Powered Storylets
- Agent societies: BookWorld, From Novels to Interactive Agent Societies
- Interactive narrative: Riedl & Bulitko, Interactive Narrative: An Intelligent Systems Approach
- Meaningful choice: Foreseeing Meaningful Choices
- Experience management: Adversarial Strong Story Experience Management
- State visualization: State Space Visualization for Strong Story Experience Management Design
- LLM branching: GENEVA, Generating and Visualizing Branching Narratives Using LLMs
- Converging narratives: Generating Converging Narratives for Games with Large Language Models
- Emergent game narrative: Player-Driven Emergence in LLM-Driven Game Narrative
- Dramamancer design: Design Techniques for LLM-Powered Interactive Storytelling
- AI game master: Static Vs. Agentic Game Master AI for Solo Role-Playing
- Dungeon master LLM: Exploring the Potential of ChatGPT as a Dungeon Master in Dungeons & Dragons
- Roleplay agents: RoleLLM / RoleBench, Benchmarking Role-Playing Abilities
- Roleplay evaluation: CharacterEval, Role-Playing Conversational Agent Evaluation
- Emotional memory: Emotional RAG, Enhancing Role-Playing Agents through Emotional Retrieval
- TTRPG tools: Computational Tools for Table-Top Role-Playing Games: A Scoping Review
- AI + TTRPG culture: How Do You Want to View This? Generative AI, Creative Ethos, and TTRPG Hobbyists
- TTRPG therapy: Scoping Review of TTRPG as Psychological Intervention
- D&D wellbeing: Can Playing Dungeons and Dragons Be Good for You?
- D&D self-concept: Efficacy of Dungeons & Dragons for Improving Mental Health and Self-Concepts
- TTRPG affect: Bardo, Emotion-Based Music Recommendation for Tabletop Role-Playing Games
- Story generation survey: A Survey on LLMs for Story Generation, Findings of EMNLP 2025