Longform Creative Writing
Longform Creative Writing is an artificial intelligence benchmark designed to evaluate large language models' ability to generate coherent, engaging extended narratives across multiple chapters. Part of the EQ-Bench suite created by Samuel J. Paech, this benchmark challenges models to write an 8-chapter story or novella, with each chapter approximately 1000 words, while maintaining narrative consistency, character development, and writing quality throughout the extended format.
| Longform Creative Writing | |
|---|---|
| Overview | |
| Full name | Longform Creative Writing Benchmark |
| Abbreviation | LCW |
| Description | An LLM-judged benchmark evaluating extended narrative generation across 8 chapters |
| Release date | 2024 |
| Latest version | 3.0 |
| Benchmark updated | 2025-08-08 |
| Authors | Samuel J. Paech |
| Organization | EQ-Bench |
| Technical Details | |
| Type | Creative Writing, Extended Narrative |
| Modality | Text |
| Task format | Multi-turn story generation |
| Number of tasks | 1 story (8 chapters) |
| Total examples | 8 chapters per evaluation |
| Evaluation metric | 0-100 score, Degradation metric, Slop score, Repetition |
| Domains | Fiction writing, Narrative consistency, Character development |
| Languages | English |
| Performance | |
| Human performance | Not reported |
| Baseline | Variable by model |
| SOTA score | ~85-90 |
| SOTA model | Claude 3.7 Sonnet |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | eqbench.com |
| GitHub | github.com/EQ-bench/creative-writing-bench |
| License | Open source |
| Predecessor | Longform Creative Writing v2 |
Overview
Longform Creative Writing addresses a critical challenge in AI evaluation: assessing whether models can maintain quality, coherence, and engagement across extended narrative generation. Unlike short-form creative tasks, this benchmark tests models' ability to develop complex plots, maintain character consistency, avoid repetition, and prevent quality degradation over approximately 8000 words of continuous storytelling.
Motivation
The development of the Longform Creative Writing benchmark was motivated by several key observations:
- Short-form benchmarks fail to capture degradation patterns in extended generation
- Real-world creative applications often require sustained narrative quality
- Models frequently exhibit quality decline in longer outputs
- The need to evaluate narrative planning and structural coherence
- Importance of character consistency across multiple chapters
The benchmark specifically targets the evaluation of sustained creative performance, testing whether AI systems can match human writers' ability to maintain engagement throughout a complete story arc.
Technical Architecture
Core Components
| Component | Description | Function |
|---|---|---|
| Story Planning System | Initial concept and chapter outline generation | Establishes narrative structure |
| Chapter Generation | 8 sequential ~1000-word chapters | Produces extended narrative |
| Judge Model | Claude Sonnet 4 (as of 2025) | Evaluates quality and coherence |
| Degradation Analysis | Per-chapter quality tracking | Identifies performance decline |
Evaluation Methodology
Multi-Stage Process
The benchmark follows a structured evaluation approach:
| Stage | Description | Output |
|---|---|---|
| Planning | Model creates story concept and detailed outline | Story framework |
| Reflection | Model reviews and revises initial plan | Refined structure |
| Generation | Sequential production of 8 chapters | Complete narrative |
| Evaluation | Judge assesses each chapter and overall work | Quality scores |
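The flow of these stages can be sketched as follows. The `call_model` and `call_judge` helpers below are hypothetical stand-ins; the actual harness in the EQ-Bench repository differs in its prompts and plumbing:

```python
# Hypothetical sketch of the plan -> reflect -> generate -> judge pipeline.
# call_model() and call_judge() are assumed helpers, not the benchmark's real API.

def run_longform_eval(call_model, call_judge, n_chapters=8):
    # Stage 1: Planning - the model drafts a concept and chapter outline
    plan = call_model("Devise a story concept and an outline for "
                      f"{n_chapters} chapters of ~1000 words each.")

    # Stage 2: Reflection - the model critiques and revises its own plan
    plan = call_model(f"Review and improve this outline:\n{plan}")

    # Stage 3: Generation - chapters are produced sequentially,
    # each conditioned on the plan and the story so far
    chapters = []
    for i in range(1, n_chapters + 1):
        story_so_far = "\n\n".join(chapters)
        chapters.append(call_model(
            f"Plan:\n{plan}\n\nStory so far:\n{story_so_far}\n\n"
            f"Write chapter {i} (~1000 words)."))

    # Stage 4: Evaluation - the judge scores each chapter and the whole work
    chapter_scores = [call_judge(ch) for ch in chapters]
    overall = call_judge("\n\n".join(chapters))
    return chapters, chapter_scores, overall
```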
Scoring Dimensions
The benchmark evaluates across multiple quality dimensions:
| Dimension | Description | Weight | Impact on Score |
|---|---|---|---|
| Compelling Plot | Engaging narrative with strong pacing | High | Major component |
| Coherence | Logical consistency throughout | High | Major component |
| Character Consistency | Maintaining character profiles | High | Major component |
| Chapter Plan Adherence | Following outlined structure | Medium | Moderate component |
| Emotional Engagement | Reader connection and investment | High | Major component |
| Nuanced Characterization | Complex, multi-dimensional characters | Medium | Moderate component |
| Tonal Consistency | Maintaining appropriate tone | Medium | Moderate component |
Test Format
Story Generation Process
Initial Planning Phase
1. **Concept Development**: Model receives a minimal prompt and develops a story concept
2. **Chapter Outline**: Creates a detailed plan for 8 chapters
3. **Reflection**: Reviews and refines the initial plan
4. **Commitment**: Finalizes the structure before generation begins
Chapter Production
Each chapter follows specific requirements:
| Requirement | Specification | Purpose |
|---|---|---|
| Word Count | ~1000 words per chapter | Consistency and substance |
| Continuity | Direct continuation from previous | Narrative flow |
| Development | Advance plot and characters | Story progression |
| Quality | Maintain initial standards | Prevent degradation |
Generation Parameters
- **Temperature**: 0.7 (balanced creativity)
- **Min_p**: 0.1 (quality threshold)
- **Output Format**: Plain text narrative
- **Total Length**: ~8000 words across 8 chapters
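As a rough illustration, these parameters would be passed through the model's sampling API. The request below assumes an OpenRouter-style chat-completions endpoint; whether `min_p` is honored depends on the serving backend:

```python
# Sketch of a chapter-generation request with the benchmark's sampling
# settings, via an OpenRouter-style endpoint (assumed, not taken from the
# harness; min_p support varies by backend).
import os
import requests

def generate_chapter(prompt: str, model: str) -> str:
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.7,  # balanced creativity
            "min_p": 0.1,        # quality threshold on token probabilities
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```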
Quality Metrics
Primary Scoring System
| Metric | Range | Description |
|---|---|---|
| Overall Score | 0-100 | Comprehensive quality assessment |
| Chapter Scores | 0-100 each | Individual chapter quality |
| Average Score | 0-100 | Mean across all chapters |
| Degradation Score | Variable | Quality change over chapters |
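A minimal sketch of how such aggregates might be computed from per-chapter judge scores. The degradation formula here (first-half mean minus second-half mean) is one plausible definition, not necessarily the one the benchmark implements:

```python
# Aggregating per-chapter judge scores. The degradation definition below
# is an illustrative assumption.
from statistics import mean

def aggregate(chapter_scores: list[float]) -> dict[str, float]:
    half = len(chapter_scores) // 2
    return {
        "average": mean(chapter_scores),
        "degradation": mean(chapter_scores[:half]) - mean(chapter_scores[half:]),
    }

print(aggregate([82, 80, 79, 75, 71, 70, 68, 66]))
# {'average': 73.875, 'degradation': 10.25}
```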
Specialized Metrics
Degradation Analysis
The benchmark includes unique degradation tracking:
- **Visual Sparkline**: Shows quality trajectory across 8 chapters
- **Degradation Score**: Quantifies quality decline
- **Consistency Rating**: Measures stability of output quality
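The sparkline itself is straightforward to render from per-chapter scores; the snippet below is a generic implementation, not the leaderboard's exact code:

```python
# Render a unicode sparkline of per-chapter quality scores.
BARS = "▁▂▃▄▅▆▇█"

def sparkline(scores: list[float]) -> str:
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return "".join(BARS[int((s - lo) / span * (len(BARS) - 1))] for s in scores)

print(sparkline([82, 80, 79, 75, 71, 70, 68, 66]))  # █▇▆▄▃▂▁▁
```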
Writing Quality Indicators
| Indicator | Description | Ideal Value |
|---|---|---|
| Length (chars) | Average chapter character count | ~5000-6000 |
| Slop Score | Frequency of overused AI phrases | Low (<5%) |
| Repetition | N-gram repetition across chapters | Low (<10%) |
| Degradation | Quality drop from start to end | Minimal (<5 points) |
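The slop and repetition indicators can be approximated as shown below. The phrase list and formulas are illustrative assumptions; the benchmark's actual lists and weighting live in its repository:

```python
# Illustrative slop and repetition metrics; phrase list and formulas are
# assumptions for demonstration only.
from collections import Counter

SLOP_PHRASES = ["a testament to", "little did", "shivers down", "tapestry of"]

def slop_score(text: str) -> float:
    """Fraction of words belonging to an overused-phrase hit."""
    words = text.lower().split()
    hits = sum(text.lower().count(p) * len(p.split()) for p in SLOP_PHRASES)
    return hits / max(len(words), 1)

def repetition_score(chapters: list[str], n: int = 3) -> float:
    """Fraction of distinct n-grams that recur across multiple chapters."""
    per_chapter = []
    for ch in chapters:
        w = ch.lower().split()
        per_chapter.append(set(zip(*(w[i:] for i in range(n)))))
    counts = Counter(g for grams in per_chapter for g in grams)
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / max(len(counts), 1)
```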
Common Failure Modes
Identified Weaknesses
The benchmark specifically tracks common writing failures:
| Failure Mode | Description | Frequency | Impact |
|---|---|---|---|
| Weak Dialogue | Unnatural or stilted conversations | High (~60%) | Major quality loss |
| Tell-Don't-Show | Excessive exposition over demonstration | High (~70%) | Engagement loss |
| Purple Prose | Overly ornate language | Medium (~40%) | Style issues |
| Predictability | Formulaic plot development | High (~65%) | Reader interest loss |
| Metaphor Abuse | Forced or incoherent metaphors | Medium (~45%) | Clarity issues |
| Character Drift | Inconsistent characterization | Medium (~50%) | Coherence loss |
Degradation Patterns
Models commonly exhibit several degradation patterns:
1. **Quality Cliff**: Sharp decline after chapters 3-4
2. **Gradual Decay**: Steady quality reduction throughout
3. **Oscillation**: Alternating quality between chapters
4. **Final Chapter Collapse**: Rushed or weak endings
5. **Middle Sag**: Quality dip in chapters 4-6
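These patterns can be labelled heuristically from per-chapter scores, as in the sketch below; the thresholds are illustrative, not the benchmark's:

```python
# Rough heuristic for labelling degradation patterns from 8 chapter scores.
# All thresholds are assumptions chosen for illustration.
def classify_degradation(scores: list[float]) -> str:
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    if scores[-1] < min(scores[:-1]) - 10:
        return "final chapter collapse"
    if any(d < -10 for d in diffs[2:5]):
        return "quality cliff"
    if min(scores[3:6]) < min(scores[:3] + scores[6:]) - 5:
        return "middle sag"
    if sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0) >= 4:
        return "oscillation"
    if all(d <= 0 for d in diffs):
        return "gradual decay"
    return "stable"

print(classify_degradation([82, 80, 79, 75, 71, 70, 68, 66]))  # gradual decay
```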
Version 3 Improvements (2025)
Key Enhancements
| Improvement | Description | Impact |
|---|---|---|
| Judge Upgrade | Claude Sonnet 4 implementation | Better discrimination |
| Metaphor Detection | Enhanced incoherent metaphor penalties | Quality improvement |
| Paragraph Scoring | Penalties for single-sentence paragraphs | Style normalization |
| Structural Safeguards | Reliability improvements for longform | Consistency enhancement |
| Degradation Tracking | Enhanced quality trajectory analysis | Better diagnostics |
Scoring Refinements
- **Weighted Scoring**: Extra emphasis on metaphor quality
- **Automatic Penalties**: Structural writing degradation detection
- **Targeted Prompting**: Improved judge instructions for specific issues
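As an example of the automatic structural penalties, the sketch below flags over-reliance on single-sentence paragraphs. The 30% threshold and the penalty size are assumptions for illustration, not the benchmark's actual values:

```python
# Flag chapters that over-use single-sentence paragraphs and return a
# point deduction. Threshold and penalty are illustrative assumptions.
import re

def single_sentence_penalty(chapter: str, threshold: float = 0.3) -> float:
    paragraphs = [p.strip() for p in chapter.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    singles = sum(1 for p in paragraphs
                  if len(re.findall(r"[.!?]+", p)) <= 1)
    ratio = singles / len(paragraphs)
    return 2.0 if ratio > threshold else 0.0  # points deducted
```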
Performance Analysis
Current Performance Trends (2025)
| Model Category | Typical Score Range | Degradation | Strengths |
|---|---|---|---|
| Top Tier | 85-90 | <5 points | Consistent quality, strong narrative |
| High Performance | 75-85 | 5-10 points | Good plotting, some degradation |
| Mid-Range | 65-75 | 10-15 points | Decent start, notable decline |
| Lower Performance | 50-65 | >15 points | Weak consistency, high degradation |
Key Insights
- **Degradation Universal**: All models show some quality decline
- **Chapter 4 Barrier**: Many models struggle to maintain quality past the midpoint
- **Dialogue Challenge**: Consistent weakness across all models
- **Planning Impact**: Better initial planning correlates with less degradation
Implementation
Setup and Configuration
```bash
# Access via the EQ-Bench suite
git clone https://github.com/EQ-bench/creative-writing-bench
cd creative-writing-bench

# Configure API access
export ANTHROPIC_API_KEY="your-key"   # For judge model
export OPENROUTER_API_KEY="your-key"  # For test models
```
Running Evaluations
```bash
# Basic longform evaluation
python longform_creative.py --model "your-model" \
    --temperature 0.7 --min-p 0.1

# With a custom chapter count
python longform_creative.py --model "your-model" \
    --chapters 8 --words-per-chapter 1000

# Full analysis with degradation tracking
python longform_creative.py --model "your-model" \
    --full-analysis --track-degradation
```
Output Structure
Results include:
- **Story File**: Complete 8-chapter narrative
- **Score Report**: Chapter-by-chapter and overall scores
- **Degradation Analysis**: Quality trajectory visualization
- **Metric Summary**: Slop, repetition, and length statistics
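A hypothetical shape for such a results payload is shown below; the field names are illustrative assumptions, and the actual schema is defined by the harness:

```python
# Hypothetical result layout; field names are illustrative, not the
# harness's actual schema.
result = {
    "model": "your-model",
    "chapters": ["...chapter 1 text...", "...chapter 8 text..."],
    "chapter_scores": [82, 80, 79, 75, 71, 70, 68, 66],
    "overall_score": 73.9,
    "degradation": 10.25,
    "slop_score": 0.03,
    "repetition": 0.07,
    "avg_chapter_chars": 5600,
}
```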
Applications and Impact
Research Applications
| Application | Purpose | Research Value |
|---|---|---|
| Architecture Testing | Evaluating memory and coherence systems | Technical insights |
| Training Optimization | Improving long-context performance | Model development |
| Degradation Studies | Understanding quality decline patterns | Theoretical understanding |
| Planning Systems | Testing narrative structure capabilities | Cognitive modeling |
Practical Applications
- **Publishing**: Assessing AI co-writing capabilities
- **Content Creation**: Evaluating long-form content generation
- **Educational Tools**: Testing story-writing assistants
- **Entertainment**: Developing AI storytelling systems
- **Game Development**: Narrative generation for games
Challenges and Limitations
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| Single Story Format | One extended narrative per test | Limited diversity |
| Genre Constraints | General fiction focus | Narrow scope |
| Judge Subjectivity | Single AI judge preference | Potential bias |
| English Only | Limited to English narratives | Reduced applicability |
| Fixed Length | 8 chapters of ~1000 words | Format rigidity |
Technical Challenges
- **Memory Management**: Maintaining context across 8000 words
- **Coherence Maintenance**: Tracking plot threads and character arcs
- **Style Consistency**: Avoiding drift in narrative voice
- **Pacing Control**: Managing story rhythm across chapters
- **Ending Quality**: Delivering satisfying conclusions
Future Directions
Planned Improvements
1. **Multi-Genre Testing**: Specialized prompts for different genres
2. **Variable Length**: Flexible chapter and story lengths
3. **Interactive Elements**: Reader choice integration
4. **Multi-Judge Consensus**: Multiple AI judges for robustness
5. **Human Baseline**: Professional writer performance benchmarks
6. **Multilingual Support**: Extension to other languages
Research Opportunities
- **Degradation Mitigation**: Techniques to maintain quality
- **Planning Optimization**: Better story structure systems
- **Memory Architectures**: Improved long-context handling
- **Style Transfer**: Maintaining consistent voice
- **Adaptive Generation**: Dynamic quality adjustment
Related Benchmarks
- Creative Writing v3: Short-form creative writing evaluation
- EQ-Bench 3: Emotional intelligence assessment
- WritingBench: Comprehensive writing evaluation
- NC Bench: Creative writing benchmark
- BuzzBench: Viral content generation
- DiploBench: Diplomatic writing evaluation
- Spiral-Bench: Related benchmark by same author
Significance
Longform Creative Writing represents a crucial advancement in evaluating AI systems' sustained creative capabilities. Its focus on degradation patterns and narrative consistency provides unique insights into model limitations that shorter benchmarks miss. The benchmark's ability to identify when and how models fail in extended generation makes it valuable for:
- Understanding long-context performance limits
- Developing more robust narrative generation systems
- Identifying architectural improvements for sustained quality
- Establishing realistic expectations for AI creative writing
- Guiding development of professional writing tools