After 150 real-world tasks across six categories, one model edges ahead — but the difference depends entirely on your use case.
## The Test Setup
We ran identical prompts for 150 tasks across:
- Code generation (25 tasks — Python, TypeScript, SQL)
- Long-form writing (25 tasks — articles, reports, emails)
- Reasoning & logic (25 tasks — math, puzzles, analysis)
- Summarization (25 tasks — papers, transcripts, documents)
- Creative tasks (25 tasks — storytelling, brainstorming)
- Instruction following (25 tasks — multi-step, constrained)
Outputs were graded blind by three independent reviewers on accuracy, helpfulness, and format.
## Code Generation

**Winner: Claude Sonnet 4.6** (64 vs 58 points)
Both models are exceptional coders. Claude’s edge comes from:
- More idiomatic TypeScript type inference
- Better handling of complex async patterns
- Fewer hallucinated library APIs in Python
- Superior edge-case handling in SQL queries
GPT-4o responded slightly faster and excelled at one-liners and quick utility functions.
```typescript
// Claude's solution to: "Write a debounce utility with TypeScript generics"
function debounce<T extends (...args: any[]) => unknown>(
  fn: T,
  delay: number
): (...args: Parameters<T>) => void {
  let timeoutId: ReturnType<typeof setTimeout>;

  return (...args: Parameters<T>) => {
    // Restart the timer on every call; fn runs only after `delay` ms of quiet.
    clearTimeout(timeoutId);
    timeoutId = setTimeout(() => fn(...args), delay);
  };
}
```
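To show what the generics buy you, here is a minimal usage sketch; the `logScroll` handler and the 250 ms delay are illustrative, not part of the benchmark output:

```typescript
// Hypothetical handler: the generic preserves its parameter types,
// so the debounced wrapper is fully type-checked.
function logScroll(position: number, source: string): void {
  console.log(`scrolled to ${position} via ${source}`);
}

const debouncedLog = debounce(logScroll, 250);

debouncedLog(120, "wheel");      // OK: matches (position: number, source: string)
// debouncedLog("120", "wheel"); // Type error: string is not assignable to number
```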
## Long-Form Writing

**Winner: GPT-4o** (61 vs 57 points)
GPT-4o’s writing flows more naturally for marketing copy and storytelling. Claude, however, produces more structured and factually grounded research content. For technical documentation and white papers, Claude wins narrowly.
## Reasoning & Logic

**Winner: Claude Sonnet 4.6** (69 vs 63 points)

This is Claude’s strongest category. The model shows its work, catches edge cases, and rarely hallucinates numerical answers. GPT-4o is close behind but made two notable arithmetic errors across its 25 tasks, versus zero for Claude.
## Summarization

**Tie** (60 vs 60 points)

Both models produce excellent summaries. Claude’s tend to be slightly shorter and denser; GPT-4o’s include more context and examples. Which is better comes down to preference.
## Creative Tasks

**Winner: GPT-4o** (63 vs 58 points)
GPT-4o is more playful and takes more creative risks. Claude’s creative output is polished but safer. For marketing campaigns, scripts, and brand voices, GPT-4o is the better choice.
## Instruction Following

**Winner: Claude Sonnet 4.6** (72 vs 65 points)

This is where Claude Sonnet 4.6 outperforms GPT-4o most dramatically. It follows complex, multi-step prompts with constraints (word limits, format, tone, persona) more precisely; GPT-4o occasionally drifts off-spec or truncates its output.
## Overall Scores
| Category | Claude 4.6 | GPT-4o |
|---|---|---|
| Code | 64 | 58 |
| Writing | 57 | 61 |
| Reasoning | 69 | 63 |
| Summarization | 60 | 60 |
| Creative | 58 | 63 |
| Instructions | 72 | 65 |
| **Total** | **380** | **370** |
## Context Window & Pricing
| Model | Context | Input (per 1M) | Output (per 1M) |
|---|---|---|---|
| Claude Sonnet 4.6 | 200K tokens | $3.00 | $15.00 |
| GPT-4o | 128K tokens | $5.00 | $15.00 |
Claude wins on both context length and input cost; output pricing is identical.
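To make the pricing concrete, here is a quick back-of-the-envelope sketch. The workload (a 100K-token input with a 1K-token response) is hypothetical, chosen only to illustrate the per-request math:

```typescript
// Prices in USD per 1M tokens, taken from the table above.
const claude = { input: 3.0, output: 15.0 };
const gpt4o = { input: 5.0, output: 15.0 };

function requestCost(
  price: { input: number; output: number },
  inputTokens: number,
  outputTokens: number
): number {
  return (inputTokens / 1_000_000) * price.input +
         (outputTokens / 1_000_000) * price.output;
}

// Hypothetical long-context request: 100K tokens in, 1K tokens out.
console.log(requestCost(claude, 100_000, 1_000).toFixed(3)); // "0.315"
console.log(requestCost(gpt4o, 100_000, 1_000).toFixed(3));  // "0.515"
```

Because output pricing is identical, the gap grows with input volume, which is exactly the long-context territory where Claude's 200K window also matters.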
## Verdict

- **For developers and technical work:** Claude Sonnet 4.6
- **For creative and marketing work:** GPT-4o
- **For cost-sensitive applications:** Claude Sonnet 4.6 (lower input cost, larger context)
Neither model is universally superior. The smart move is to benchmark both on your specific use case — free tiers exist for both.
*Testing conducted March 2026. Model behavior may change with future updates from Anthropic and OpenAI.*