Longform Creative Writing
Longform Creative Writing is an artificial intelligence benchmark designed to evaluate large language models' ability to generate coherent, engaging extended narratives across multiple chapters. Part of the EQ-Bench suite created by Samuel J. Paech, this benchmark challenges models to write an 8-chapter story or novella, with each chapter approximately 1000 words, while maintaining narrative consistency, character development, and writing quality throughout the extended format.
| Longform Creative Writing | |
|---|---|
| Overview | |
| Full name | Longform Creative Writing Benchmark |
| Abbreviation | LCW |
| Description | An LLM-judged benchmark evaluating extended narrative generation across 8 chapters |
| Release date | 2024 |
| Latest version | 3.0 |
| Benchmark updated | 2025-08-08 |
| Authors | Samuel J. Paech |
| Organization | EQ-Bench |
| Technical Details | |
| Type | Creative Writing, Extended Narrative |
| Modality | Text |
| Task format | Multi-turn story generation |
| Number of tasks | 1 story (8 chapters) |
| Total examples | 8 chapters per evaluation |
| Evaluation metric | 0-100 score, Degradation metric, Slop score, Repetition |
| Domains | Fiction writing, Narrative consistency, Character development |
| Languages | English |
| Performance | |
| Human performance | Not reported |
| Baseline | Variable by model |
| SOTA score | ~85-90 |
| SOTA model | Claude 3.7 Sonnet |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | eqbench.com |
| GitHub | github.com/EQ-bench/creative-writing-bench |
| License | Open source |
| Predecessor | Longform Creative Writing v2 |
Overview
Longform Creative Writing addresses a critical challenge in AI evaluation: assessing whether models can maintain quality, coherence, and engagement across extended narrative generation. Unlike short-form creative tasks, this benchmark tests models' ability to develop complex plots, maintain character consistency, avoid repetition, and prevent quality degradation over approximately 8000 words of continuous storytelling.
Motivation
The development of the Longform Creative Writing benchmark was motivated by several key observations:
- Short-form benchmarks fail to capture degradation patterns in extended generation
- Real-world creative applications often require sustained narrative quality
- Models frequently exhibit quality decline in longer outputs
- The need to evaluate narrative planning and structural coherence
- Importance of character consistency across multiple chapters
The benchmark specifically targets the evaluation of sustained creative performance, testing whether AI systems can match human writers' ability to maintain engagement throughout a complete story arc.
Technical Architecture
Core Components
| Component | Description | Function |
|---|---|---|
| Story Planning System | Initial concept and chapter outline generation | Establishes narrative structure |
| Chapter Generation | 8 sequential ~1000-word chapters | Produces extended narrative |
| Judge Model | Claude Sonnet 4 (as of 2025) | Evaluates quality and coherence |
| Degradation Analysis | Per-chapter quality tracking | Identifies performance decline |
Evaluation Methodology
Multi-Stage Process
The benchmark follows a structured evaluation approach:
| Stage | Description | Output |
|---|---|---|
| Planning | Model creates story concept and detailed outline | Story framework |
| Reflection | Model reviews and revises initial plan | Refined structure |
| Generation | Sequential production of 8 chapters | Complete narrative |
| Evaluation | Judge assesses each chapter and overall work | Quality scores |
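The flow of these stages can be sketched as follows. The `call_model` and `call_judge` helpers below are hypothetical stand-ins; the actual harness in the EQ-Bench repository differs in its prompts and plumbing:

```python
# Hypothetical sketch of the plan -> reflect -> generate -> judge pipeline.
# call_model() and call_judge() are assumed helpers, not the benchmark's real API.

def run_longform_eval(call_model, call_judge, n_chapters=8):
    # Stage 1: Planning - the model drafts a concept and chapter outline
    plan = call_model("Devise a story concept and an outline for "
                      f"{n_chapters} chapters of ~1000 words each.")

    # Stage 2: Reflection - the model critiques and revises its own plan
    plan = call_model(f"Review and improve this outline:\n{plan}")

    # Stage 3: Generation - chapters are produced sequentially,
    # each conditioned on the plan and the story so far
    chapters = []
    for i in range(1, n_chapters + 1):
        story_so_far = "\n\n".join(chapters)
        chapters.append(call_model(
            f"Plan:\n{plan}\n\nStory so far:\n{story_so_far}\n\n"
            f"Write chapter {i} (~1000 words)."))

    # Stage 4: Evaluation - the judge scores each chapter and the whole work
    chapter_scores = [call_judge(ch) for ch in chapters]
    overall = call_judge("\n\n".join(chapters))
    return chapters, chapter_scores, overall
```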
Scoring Dimensions
The benchmark evaluates across multiple quality dimensions:
| Dimension | Description | Weight | Impact on Score |
|---|---|---|---|
| Compelling Plot | Engaging narrative with strong pacing | High | Major component |
| Coherence | Logical consistency throughout | High | Major component |
| Character Consistency | Maintaining character profiles | High | Major component |
| Chapter Plan Adherence | Following outlined structure | Medium | Moderate component |
| Emotional Engagement | Reader connection and investment | High | Major component |
| Nuanced Characterization | Complex, multi-dimensional characters | Medium | Moderate component |
| Tonal Consistency | Maintaining appropriate tone | Medium | Moderate component |
Test Format
Story Generation Process
Initial Planning Phase
1. **Concept Development**: Model receives a minimal prompt and develops a story concept
2. **Chapter Outline**: Creates a detailed plan for 8 chapters
3. **Reflection**: Reviews and refines the initial plan
4. **Commitment**: Finalizes the structure before generation begins
Chapter Production
Each chapter follows specific requirements:
| Requirement | Specification | Purpose |
|---|---|---|
| Word Count | ~1000 words per chapter | Consistency and substance |
| Continuity | Direct continuation from previous | Narrative flow |
| Development | Advance plot and characters | Story progression |
| Quality | Maintain initial standards | Prevent degradation |
Generation Parameters
- **Temperature**: 0.7 (balanced creativity)
- **Min_p**: 0.1 (quality threshold)
- **Output Format**: Plain text narrative
- **Total Length**: ~8000 words across 8 chapters
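As a rough illustration, these parameters would be passed through the model's sampling API. The request below assumes an OpenRouter-style chat-completions endpoint; whether `min_p` is honored depends on the serving backend:

```python
# Sketch of a chapter-generation request with the benchmark's sampling
# settings, via an OpenRouter-style endpoint (assumed, not taken from the
# harness; min_p support varies by backend).
import os
import requests

def generate_chapter(prompt: str, model: str) -> str:
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,  # balanced creativity
            "min_p": 0.1,        # quality threshold on token probabilities
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```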
Quality Metrics
Primary Scoring System
| Metric | Range | Description |
|---|---|---|
| Overall Score | 0-100 | Comprehensive quality assessment |
| Chapter Scores | 0-100 each | Individual chapter quality |
| Average Score | 0-100 | Mean across all chapters |
| Degradation Score | Variable | Quality change over chapters |
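A minimal sketch of how such aggregates might be computed from per-chapter judge scores. The degradation formula here (first-half mean minus second-half mean) is one plausible definition, not necessarily the one the benchmark implements:

```python
# Aggregating per-chapter judge scores. The degradation definition below
# is an illustrative assumption.
from statistics import mean

def aggregate(chapter_scores: list[float]) -> dict[str, float]:
    half = len(chapter_scores) // 2
    return {
        "average": mean(chapter_scores),
        "degradation": mean(chapter_scores[:half]) - mean(chapter_scores[half:]),
    }

print(aggregate([82, 80, 79, 75, 71, 70, 68, 66]))
# {'average': 73.875, 'degradation': 10.25}
```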
Specialized Metrics
Degradation Analysis
The benchmark includes unique degradation tracking:
- **Visual Sparkline**: Shows quality trajectory across 8 chapters
- **Degradation Score**: Quantifies quality decline
- **Consistency Rating**: Measures stability of output quality
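The sparkline itself is straightforward to render from per-chapter scores; the snippet below is a generic implementation, not the leaderboard's exact code:

```python
# Render a unicode sparkline of per-chapter quality scores.
BARS = "▁▂▃▄▅▆▇█"

def sparkline(scores: list[float]) -> str:
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return "".join(BARS[int((s - lo) / span * (len(BARS) - 1))] for s in scores)

print(sparkline([82, 80, 79, 75, 71, 70, 68, 66]))  # █▇▆▄▃▂▁▁
```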
Writing Quality Indicators
| Indicator | Description | Ideal Value |
|---|---|---|
| Length (chars) | Average chapter character count | ~5000-6000 |
| Slop Score | Frequency of overused AI phrases | Low (<5%) |
| Repetition | N-gram repetition across chapters | Low (<10%) |
| Degradation | Quality drop from start to end | Minimal (<5 points) |
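The slop and repetition indicators can be approximated as shown below. The phrase list and formulas are illustrative assumptions; the benchmark's actual lists and weighting live in its repository:

```python
# Illustrative slop and repetition metrics; phrase list and formulas are
# assumptions for demonstration only.
from collections import Counter

SLOP_PHRASES = ["a testament to", "little did", "shivers down", "tapestry of"]

def slop_score(text: str) -> float:
    """Fraction of words belonging to an overused-phrase hit."""
    words = text.lower().split()
    hits = sum(text.lower().count(p) * len(p.split()) for p in SLOP_PHRASES)
    return hits / max(len(words), 1)

def repetition_score(chapters: list[str], n: int = 3) -> float:
    """Fraction of distinct n-grams that recur across multiple chapters."""
    per_chapter = []
    for ch in chapters:
        w = ch.lower().split()
        per_chapter.append(set(zip(*(w[i:] for i in range(n)))))
    counts = Counter(g for grams in per_chapter for g in grams)
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / max(len(counts), 1)
```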
Common Failure Modes
Identified Weaknesses
The benchmark specifically tracks common writing failures:
| Failure Mode | Description | Frequency | Impact |
|---|---|---|---|
| Weak Dialogue | Unnatural or stilted conversations | High (~60%) | Major quality loss |
| Tell-Don't-Show | Excessive exposition over demonstration | High (~70%) | Engagement loss |
| Purple Prose | Overly ornate language | Medium (~40%) | Style issues |
| Predictability | Formulaic plot development | High (~65%) | Reader interest loss |
| Metaphor Abuse | Forced or incoherent metaphors | Medium (~45%) | Clarity issues |
| Character Drift | Inconsistent characterization | Medium (~50%) | Coherence loss |
Degradation Patterns
Models commonly exhibit several degradation patterns:
1. **Quality Cliff**: Sharp decline after chapters 3-4
2. **Gradual Decay**: Steady quality reduction throughout
3. **Oscillation**: Alternating quality between chapters
4. **Final Chapter Collapse**: Rushed or weak endings
5. **Middle Sag**: Quality dip in chapters 4-6
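These patterns can be labelled heuristically from per-chapter scores, as in the sketch below; the thresholds are illustrative, not the benchmark's:

```python
# Rough heuristic for labelling degradation patterns from 8 chapter scores.
# All thresholds are assumptions chosen for illustration.
def classify_degradation(scores: list[float]) -> str:
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    if scores[-1] < min(scores[:-1]) - 10:
        return "final chapter collapse"
    if any(d < -10 for d in diffs[2:5]):
        return "quality cliff"
    if min(scores[3:6]) < min(scores[:3] + scores[6:]) - 5:
        return "middle sag"
    if sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0) >= 4:
        return "oscillation"
    if all(d <= 0 for d in diffs):
        return "gradual decay"
    return "stable"

print(classify_degradation([82, 80, 79, 75, 71, 70, 68, 66]))  # gradual decay
```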
Version 3 Improvements (2025)
Key Enhancements
| Improvement | Description | Impact |
|---|---|---|
| Judge Upgrade | Claude Sonnet 4 implementation | Better discrimination |
| Metaphor Detection | Enhanced incoherent metaphor penalties | Quality improvement |
| Paragraph Scoring | Penalties for single-sentence paragraphs | Style normalization |
| Structural Safeguards | Reliability improvements for longform | Consistency enhancement |
| Degradation Tracking | Enhanced quality trajectory analysis | Better diagnostics |
Scoring Refinements
- **Weighted Scoring**: Extra emphasis on metaphor quality
- **Automatic Penalties**: Structural writing degradation detection
- **Targeted Prompting**: Improved judge instructions for specific issues
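As an example of the automatic structural penalties, the sketch below flags over-reliance on single-sentence paragraphs. The 30% threshold and the penalty size are assumptions for illustration, not the benchmark's actual values:

```python
# Flag chapters that over-use single-sentence paragraphs and return a
# point deduction. Threshold and penalty are illustrative assumptions.
import re

def single_sentence_penalty(chapter: str, threshold: float = 0.3) -> float:
    paragraphs = [p.strip() for p in chapter.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    singles = sum(1 for p in paragraphs
                  if len(re.findall(r"[.!?]+", p)) <= 1)
    ratio = singles / len(paragraphs)
    return 2.0 if ratio > threshold else 0.0  # points deducted
```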
Performance Analysis
Current Performance Trends (2025)
| Model Category | Typical Score Range | Degradation | Strengths |
|---|---|---|---|
| Top Tier | 85-90 | <5 points | Consistent quality, strong narrative |
| High Performance | 75-85 | 5-10 points | Good plotting, some degradation |
| Mid-Range | 65-75 | 10-15 points | Decent start, notable decline |
| Lower Performance | 50-65 | >15 points | Weak consistency, high degradation |
Key Insights
- **Degradation Universal**: All models show some quality decline
- **Chapter 4 Barrier**: Many models struggle to maintain quality past the midpoint
- **Dialogue Challenge**: Consistent weakness across all models
- **Planning Impact**: Better initial planning correlates with less degradation
Implementation
Setup and Configuration
```bash
# Access via the EQ-Bench suite
git clone https://github.com/EQ-bench/creative-writing-bench
cd creative-writing-bench

# Configure API access
export ANTHROPIC_API_KEY="your-key"   # For judge model
export OPENROUTER_API_KEY="your-key"  # For test models
```
Running Evaluations
```bash
# Basic longform evaluation
python longform_creative.py --model "your-model" \
    --temperature 0.7 --min-p 0.1

# With a custom chapter count
python longform_creative.py --model "your-model" \
    --chapters 8 --words-per-chapter 1000

# Full analysis with degradation tracking
python longform_creative.py --model "your-model" \
    --full-analysis --track-degradation
```
Output Structure
Results include:
- **Story File**: Complete 8-chapter narrative
- **Score Report**: Chapter-by-chapter and overall scores
- **Degradation Analysis**: Quality trajectory visualization
- **Metric Summary**: Slop, repetition, and length statistics
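A hypothetical shape for such a results payload is shown below; the field names are illustrative assumptions, and the actual schema is defined by the harness:

```python
# Hypothetical result layout; field names are illustrative, not the
# harness's actual schema.
result = {
    "model": "your-model",
    "chapters": ["...chapter 1 text...", "...chapter 8 text..."],
    "chapter_scores": [82, 80, 79, 75, 71, 70, 68, 66],
    "overall_score": 73.9,
    "degradation": 10.25,
    "slop_score": 0.03,
    "repetition": 0.07,
    "avg_chapter_chars": 5600,
}
```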
Applications and Impact
Research Applications
| Application | Purpose | Research Value |
|---|---|---|
| Architecture Testing | Evaluating memory and coherence systems | Technical insights |
| Training Optimization | Improving long-context performance | Model development |
| Degradation Studies | Understanding quality decline patterns | Theoretical understanding |
| Planning Systems | Testing narrative structure capabilities | Cognitive modeling |
Practical Applications
- **Publishing**: Assessing AI co-writing capabilities
- **Content Creation**: Evaluating long-form content generation
- **Educational Tools**: Testing story-writing assistants
- **Entertainment**: Developing AI storytelling systems
- **Game Development**: Narrative generation for games
Challenges and Limitations
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| Single Story Format | One extended narrative per test | Limited diversity |
| Genre Constraints | General fiction focus | Narrow scope |
| Judge Subjectivity | Single AI judge preference | Potential bias |
| English Only | Limited to English narratives | Reduced applicability |
| Fixed Length | 8 chapters of ~1000 words | Format rigidity |
Technical Challenges
- **Memory Management**: Maintaining context across 8000 words
- **Coherence Maintenance**: Tracking plot threads and character arcs
- **Style Consistency**: Avoiding drift in narrative voice
- **Pacing Control**: Managing story rhythm across chapters
- **Ending Quality**: Delivering satisfying conclusions
Future Directions
Planned Improvements
1. **Multi-Genre Testing**: Specialized prompts for different genres
2. **Variable Length**: Flexible chapter and story lengths
3. **Interactive Elements**: Reader choice integration
4. **Multi-Judge Consensus**: Multiple AI judges for robustness
5. **Human Baseline**: Professional writer performance benchmarks
6. **Multilingual Support**: Extension to other languages
Research Opportunities
- **Degradation Mitigation**: Techniques to maintain quality
- **Planning Optimization**: Better story structure systems
- **Memory Architectures**: Improved long-context handling
- **Style Transfer**: Maintaining consistent voice
- **Adaptive Generation**: Dynamic quality adjustment
Related Benchmarks
- Creative Writing v3: Short-form creative writing evaluation
- EQ-Bench 3: Emotional intelligence assessment
- WritingBench: Comprehensive writing evaluation
- NC Bench: Creative writing benchmark
- BuzzBench: Viral content generation
- DiploBench: Diplomatic writing evaluation
- Spiral-Bench: Related benchmark by same author
Significance
Longform Creative Writing represents a crucial advancement in evaluating AI systems' sustained creative capabilities. Its focus on degradation patterns and narrative consistency provides unique insights into model limitations that shorter benchmarks miss. The benchmark's ability to identify when and how models fail in extended generation makes it valuable for:
- Understanding long-context performance limits
- Developing more robust narrative generation systems
- Identifying architectural improvements for sustained quality
- Establishing realistic expectations for AI creative writing
- Guiding development of professional writing tools