Creative Writing v3

Creative Writing v3
Overview
Full name Creative Writing Benchmark Version 3
Abbreviation CW v3
Description An LLM-judged creative writing benchmark using hybrid rubric and Elo scoring for enhanced discrimination
Release date 2025
Latest version 3.0
Benchmark updated 2025
Authors Samuel J. Paech
Organization Independent Research
Technical Details
Type Creative Writing, Text Generation
Modality Text
Task format Generative writing prompts
Number of tasks 32 prompts (96 iterations total)
Total examples 96
Evaluation metric Elo rating, Rubric scoring, Repetition metric, Slop score
Domains Fiction writing, Humor, Romance, Spatial awareness
Languages English
Performance
Human performance Not reported
Baseline Variable by model
SOTA score ~1500 (normalized Elo)
SOTA model DeepSeek V3
SOTA date 2025
Saturated No
Resources
Website Official website
GitHub Repository
Dataset Download
License Open source
Predecessor Creative Writing v2


Creative Writing v3 is an artificial intelligence benchmark designed to evaluate the creative writing capabilities of large language models (LLMs) through a comprehensive assessment framework. Released in 2025 by Samuel J. Paech, Creative Writing v3 employs a hybrid approach combining rubric scoring with Elo rating systems, using Claude 3.7 Sonnet as the judge model to assess generated creative content across multiple dimensions of writing quality.

Overview

Creative Writing v3 addresses the challenge of objectively evaluating subjective creative output from AI systems. The benchmark tests models' abilities to generate engaging, coherent, and creative text across diverse writing prompts, including scenarios that require humor, romance, spatial awareness, and unique first-person perspectives, all areas where language models traditionally struggle to match human writers.

Motivation

The development of Creative Writing v3 was driven by several key factors:

  • The need for better discrimination between high-performing models in creative tasks
  • Limitations of previous benchmarks in detecting subtle writing quality differences
  • The importance of addressing known LLM judge biases in creative evaluation
  • The goal of exposing specific weaknesses in AI-generated creative content

The benchmark specifically targets areas where language models struggle, creating a steeper gradient for meaningful evaluation of creative capabilities.

Technical Architecture

Core Components

| Component | Description | Function |
|---|---|---|
| Prompt Dataset | 32 distinct creative writing prompts | Provides diverse creative challenges |
| Generation System | Temperature 0.7, min_p 0.1 settings | Balances creativity with consistency |
| Judge Model | Claude 3.7 Sonnet | Evaluates creative output quality |
| Scoring Framework | Hybrid rubric + Elo system | Comprehensive quality assessment |

Evaluation Methodology

Dual Scoring System

Creative Writing v3 employs a sophisticated dual-scoring approach:

| Scoring Method | Description | Output |
|---|---|---|
| Rubric Assessment | 36 criteria scored 0-10 | Individual quality metrics |
| Elo Rating | Pairwise comparisons using Glicko-2 | Overall ranking with uncertainty |
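
The benchmark's pairwise rankings use Glicko-2, which additionally tracks a rating deviation and volatility for each model. As a simplified illustration of the underlying idea, the sketch below applies a plain Elo update to judged pairwise verdicts; the function names and toy verdicts are illustrative and not part of the benchmark code.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, outcome_a: float, k: float = 32.0):
    """Update both ratings after one judged comparison.

    outcome_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    return (r_a + k * (outcome_a - e_a),
            r_b + k * ((1.0 - outcome_a) - (1.0 - e_a)))

# Toy usage: three models start at 1200 and each pair is judged once.
ratings = {"model_a": 1200.0, "model_b": 1200.0, "model_c": 1200.0}
judged_outcomes = {("model_a", "model_b"): 1.0,   # hypothetical judge verdicts
                   ("model_a", "model_c"): 0.5,
                   ("model_b", "model_c"): 0.0}
for (a, b), outcome in judged_outcomes.items():
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
```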

Key Metrics

The benchmark tracks four primary metrics:

| Metric | Description | Purpose |
|---|---|---|
| Rubric Score | Aggregate score across 36 writing criteria | Quality assessment |
| Elo Score (Normalized) | Relative ranking from pairwise comparisons | Competitive positioning |
| Repetition Metric | Frequency of repeated words/phrases | Diversity measurement |
| Slop Score | Tracking of overused "GPT-isms" | Cliché detection |
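
The benchmark does not publish an exact formula for the repetition metric in this article; as an illustration only, a metric of this kind can be approximated as the share of word n-grams that repeat an earlier n-gram, as in the sketch below.

```python
import re
from collections import Counter

def repetition_rate(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an earlier n-gram.

    0.0 means every n-gram is unique; higher values mean more repetition.
    """
    words = re.findall(r"[a-z']+", text.lower())
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeated / len(ngrams)
```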

Test Structure

Prompt Categories

Creative Writing v3 includes prompts designed to challenge models in specific areas:

| Category | Description | Example Challenge |
|---|---|---|
| Humor | Comedy and wit generation | Writing genuinely funny content |
| Romance | Emotional and romantic scenarios | Creating authentic emotional connection |
| Spatial Awareness | Physical space descriptions | Accurate spatial reasoning in narrative |
| Unique Perspectives | Unusual first-person viewpoints | Non-standard narrator voices |
| Character Development | Complex character creation | Multi-dimensional personalities |
| Plot Construction | Narrative structure | Coherent story progression |

Generation Parameters

  • **Iterations**: 3 iterations per prompt (96 total generations)
  • **Temperature**: 0.7 (encourages creativity)
  • **Min_p**: 0.1 (maintains quality threshold)
  • **Output Length**: Truncated to 4000 characters (controls length bias)
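
These parameters can be summarized as a small sampling configuration plus a length cap applied before judging. The sketch below is illustrative: how min_p is passed depends on the inference backend (many open-source servers accept it as a sampling parameter; hosted APIs may not), and the simple hard cut is an assumption of this sketch rather than the benchmark's exact truncation rule.

```python
# Sampling settings described above, plus the length control applied before judging.
GENERATION_CONFIG = {
    "temperature": 0.7,   # encourages creative variation
    "min_p": 0.1,         # filters very low-probability tokens
    "n_iterations": 3,    # 3 runs per prompt -> 32 prompts * 3 = 96 generations
}
MAX_CHARS = 4000          # outputs are truncated to control length bias

def truncate_for_judging(text: str, limit: int = MAX_CHARS) -> str:
    """Clip a generation to the judging limit (simple hard cut)."""
    return text if len(text) <= limit else text[:limit]
```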

Evaluation Criteria

36-Point Rubric

The comprehensive rubric evaluates writing across multiple dimensions:

| Category | Criteria Examples | Weight |
|---|---|---|
| Coherence | Logical flow, consistency, clarity | High |
| Creativity | Originality, unexpected elements, imagination | High |
| Style | Voice, tone, prose quality | Medium |
| Technical | Grammar, punctuation, structure | Medium |
| Engagement | Hook, pacing, reader interest | High |
| Character | Depth, believability, development | Medium |
| Dialogue | Natural speech, distinct voices | Medium |
| Description | Vivid imagery, sensory details | Medium |
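
The published rubric scores 36 individual criteria; the table above only summarizes them by category. As an illustration of how weighted aggregation of criterion scores could work (the benchmark's actual aggregation may differ), the sketch below uses hypothetical criterion names and illustrative weights.

```python
# Illustrative weights for the category labels above; not official values.
WEIGHTS = {"High": 1.5, "Medium": 1.0}

def rubric_score(scores: dict[str, float], categories: dict[str, str]) -> float:
    """Weighted mean of per-criterion scores (0-10), normalized back to 0-10."""
    total = sum(WEIGHTS[categories[name]] * value for name, value in scores.items())
    weight_sum = sum(WEIGHTS[categories[name]] for name in scores)
    return total / weight_sum

# Hypothetical per-criterion scores from the judge model.
scores = {"coherence": 8.0, "originality": 6.5, "prose_quality": 7.0}
categories = {"coherence": "High", "originality": "High", "prose_quality": "Medium"}
print(round(rubric_score(scores, categories), 2))  # 7.19
```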

Bias Mitigation

Controlled Biases

Creative Writing v3 implements specific controls for known biases:

| Bias Type | Mitigation Strategy | Implementation |
|---|---|---|
| Length Bias | Output truncation | 4000 character limit |
| Position Bias | Bidirectional comparison | A/B and B/A averaging |
| Verbosity Bias | Penalty for excessive prose | Targeted judge prompting |
| Poetic Incoherence | Detection and punishment | Forced metaphor penalties |
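
To illustrate the bidirectional-comparison control: each pair is judged in both presentation orders and the two verdicts are combined, so a judge that favors whichever sample appears first gains no systematic advantage. In the sketch below, `judge` is a placeholder for the actual judge-model call, and the averaging scheme is an assumption of the sketch.

```python
def debiased_comparison(sample_a: str, sample_b: str, judge) -> float:
    """Return A's win probability averaged over both presentation orders.

    `judge(first, second)` is assumed to return the probability (0-1) that
    the FIRST sample shown is the better piece of writing.
    """
    a_first = judge(sample_a, sample_b)          # A shown first
    b_first = judge(sample_b, sample_a)          # B shown first
    return (a_first + (1.0 - b_first)) / 2.0     # average out position bias
```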

Uncontrolled Biases

The benchmark acknowledges certain biases remain:

  • Judge self-bias (potential preference for similar style)
  • Positivity/negativity preference
  • NSFW content aversion ("smut bias")
  • Stylistic preferences
  • "Slop" bias (overused tropes)

Version 3 Improvements

Key Enhancements from v2

| Improvement | Description | Impact |
|---|---|---|
| Judge Upgrade | Claude 3.7 Sonnet replacing previous version | Better discrimination |
| Metaphor Detection | Targeted prompting for incoherent metaphors | Quality improvement |
| Paragraph Scoring | Automatic scaling for single-sentence paragraphs | Style normalization |
| Elo Integration | Pairwise comparisons added | Enhanced discrimination |
| Glicko-2 System | Rating uncertainty and volatility tracking | Robust rankings |

Slop Detection

Creative Writing v3 introduces sophisticated "slop" detection:

  • Master word list of overused AI phrases
  • Tracking of "GPT-isms" and clichés
  • Penalty system for formulaic writing
  • Encouragement of fresh, original expression
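
As a sketch of how a phrase-list metric of this kind can be computed, the example below counts hits from a tiny stand-in word list and normalizes per 1,000 words; the real master list is much longer and lives in the repository, and the normalization scheme here is illustrative rather than the benchmark's exact formula.

```python
import re

# Tiny excerpt standing in for the benchmark's master list of overused phrases.
SLOP_PHRASES = ["tapestry", "a testament to", "delve into", "shivers down", "palpable"]

def slop_score(text: str) -> float:
    """Overused-phrase hits per 1,000 words (illustrative normalization)."""
    lowered = text.lower()
    word_count = len(re.findall(r"[a-z']+", lowered))
    if word_count == 0:
        return 0.0
    hits = sum(lowered.count(phrase) for phrase in SLOP_PHRASES)
    return 1000.0 * hits / word_count
```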

Performance Analysis

Current Leaders (2025)

| Rank | Model | Elo Score (Normalized) | Notable Strengths |
|---|---|---|---|
| 1 | DeepSeek V3 | ~1500 | Exceptional creativity and coherence |
| 2 | Claude 3.7 Sonnet | ~1400 | Natural, human-like prose |
| 3 | GPT-4o | ~1350 | Versatile across genres |
| 4 | Gemini 2.5 Pro | ~1300 | Strong technical writing |
| 5 | Grok 3 | ~1200 | Unique voice and humor |

Performance Insights

  • **Wide Performance Spread**: Significant variation between top and bottom performers
  • **Style Differentiation**: Models show distinct writing personalities
  • **Weakness Patterns**: Consistent struggles with humor and spatial reasoning
  • **Improvement Trajectory**: Newer models showing marked creative improvements

Implementation

Installation and Setup

```bash
# Clone the repository
git clone https://github.com/EQ-bench/creative-writing-bench
cd creative-writing-bench

# Install dependencies
pip install -r requirements.txt

# Configure judge model API
export ANTHROPIC_API_KEY="your-key-here"
```

Running Evaluations

```bash
# Basic evaluation
python creative_writing_v3.py --model "your-model" --iterations 3

# Full benchmark with Elo
python creative_writing_v3.py --model "your-model" --full-benchmark

# Custom temperature settings
python creative_writing_v3.py --model "your-model" --temperature 0.7 --min-p 0.1
```

Output Format

Results are stored in multiple formats:

  • **Raw Outputs**: Individual generated texts
  • **Rubric Scores**: Detailed scoring breakdowns
  • **Elo Results**: Pairwise comparison outcomes
  • **Aggregate Metrics**: Overall performance summary
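
The exact on-disk layout is defined by the repository; the snippet below only sketches how per-generation records of the kind listed above might be rolled up into a summary, assuming a hypothetical JSON file with one record per generation and illustrative field names.

```python
import json
from statistics import mean

def summarize(path: str) -> dict:
    """Aggregate per-generation records into overall metrics.

    Assumes each record carries 'rubric_score', 'repetition', and 'slop'
    fields; the real output files may use different names and structure.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return {
        "mean_rubric": mean(r["rubric_score"] for r in records),
        "mean_repetition": mean(r["repetition"] for r in records),
        "mean_slop": mean(r["slop"] for r in records),
        "n_generations": len(records),
    }
```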

Applications and Impact

Research Applications

| Application | Use Case | Research Value |
|---|---|---|
| Model Development | Training creative capabilities | Performance optimization |
| Architecture Comparison | Evaluating design choices | Technical insights |
| Prompt Engineering | Optimizing generation techniques | Methodology refinement |
| Bias Studies | Understanding AI writing patterns | Fairness research |

Practical Applications

  • **Content Generation**: Assessing suitability for creative writing tasks
  • **Educational Tools**: Evaluating AI writing assistants
  • **Entertainment**: Testing story generation capabilities
  • **Marketing**: Assessing creative copywriting abilities
  • **Publishing**: Screening AI co-writing tools

Challenges and Insights

Key Challenges for Models

| Challenge | Description | Success Rate |
|---|---|---|
| Genuine Humor | Creating actually funny content | <30% |
| Emotional Depth | Authentic romantic/emotional scenes | ~40% |
| Spatial Consistency | Maintaining accurate spatial descriptions | ~35% |
| Original Voice | Avoiding formulaic patterns | ~45% |
| Complex Metaphors | Creating coherent extended metaphors | ~25% |

Common Failure Modes

1. **Formulaic Structure**: Overreliance on standard narrative patterns
2. **Cliché Overuse**: Heavy use of common phrases and tropes
3. **Emotional Shallowness**: Surface-level emotional expression
4. **Forced Creativity**: Awkward attempts at being original
5. **Inconsistent Tone**: Shifts in voice and style mid-narrative

Limitations and Considerations

Current Limitations

| Limitation | Description | Impact |
|---|---|---|
| Subjective Nature | Creative quality is inherently subjective | Evaluation variance |
| Judge Dependency | Relies on Claude 3.7 Sonnet preferences | Potential bias |
| English Only | Limited to English language prompts | Reduced applicability |
| Genre Constraints | Focus on specific creative genres | Limited scope |
| Length Limits | 4000 character truncation | May penalize longer narratives |

Future Directions

1. **Multi-Judge Systems**: Using multiple AI judges for consensus
2. **Human Baseline**: Establishing human writer performance benchmarks
3. **Genre Expansion**: Adding specialized prompts for different genres
4. **Multilingual Support**: Extension to other languages
5. **Interactive Writing**: Multi-turn creative collaboration testing

Significance

Creative Writing v3 represents a significant advancement in evaluating AI systems' creative capabilities. Its hybrid scoring approach and sophisticated bias controls provide nuanced assessment of creative output quality. The benchmark's ability to discriminate between models with similar technical capabilities but different creative strengths makes it valuable for:

  • Identifying models suitable for creative applications
  • Guiding development of more creative AI systems
  • Understanding the relationship between technical and creative capabilities
  • Establishing standards for AI-generated creative content
