Is Claude Code Actually Getting Better? A Look at Recent Performance & User Complaints
Zack Saadioui
8/10/2025
What’s up, everyone? If you're in the coding world & you’ve been using AI assistants, you’ve probably been on the same rollercoaster I have with Anthropic's Claude. One minute, it feels like a coding prodigy, the next, it’s like it has short-term memory loss & you want to throw your monitor out the window.
So, what's the real story? Is Claude Code actually improving, or is it just hype? Turns out, it’s a bit of both. There have been some SERIOUSLY impressive upgrades lately, but the user experience can still be... well, let's call it "inconsistent."
Let's dive deep into what’s been going on with Claude, from the latest model releases to the everyday frustrations & genius workarounds I’ve seen fellow devs sharing.
The Big News: Claude 3.5 Sonnet & Opus 4.1 Hit the Scene
Honestly, the biggest reason for this conversation is Anthropic’s recent model drops. They haven't been quiet, & the releases of Claude Opus 4.1 & especially Claude 3.5 Sonnet have made some major waves.
Anthropic is making some bold claims, positioning Sonnet 3.5 as setting a new "industry standard" for intelligence. And you know what? The benchmarks seem to back it up, especially for us coders.
Here’s the breakdown of the good stuff:
Beating GPT-4o in Coding: This is the headline for many. Multiple sources & benchmarks show Claude 3.5 Sonnet outperforming GPT-4o in coding tasks. One internal evaluation from Anthropic showed Sonnet 3.5 solving 64% of problems in an agentic coding test, compared to just 38% for Anthropic's previous flagship, Claude 3 Opus. That's a HUGE leap. It seems to be particularly good at fixing bugs & adding new features to existing codebases.
Speed & Cost-Effectiveness: Sonnet 3.5 is reportedly twice as fast as the previous top-tier model, Claude 3 Opus. This is a game-changer when you're in the zone & don't want to wait for an AI to spit out a response. It’s also more cost-effective, which is always a plus.
Impressive Benchmark Scores: The numbers are pretty telling. On the HumanEval code generation benchmark, Claude 3.5 Sonnet scored 92.0%, giving it an edge over GPT-4o Mini's 87.2%. On SWE-bench Verified, which tests real-world coding problems from GitHub issues, Sonnet 3.5's score jumped from 33.4% to 49%, outperforming all other public models, including OpenAI's o1-preview.
Real-World Developer Feedback: It's not just about the numbers. I’ve seen a ton of positive chatter from developers. One Reddit user, a programmer who subscribes to both ChatGPT & Claude, said, "I've found Claude to be exceptionally impressive. In my experience, it consistently produces nearly bug-free code on the first try, outperforming GPT-4 in this area." Another user on Reddit mentioned that for C# coding, "Claude seems WAY superior to 4o."
So, on paper & in many real-world scenarios, Claude is not just getting better; it’s starting to lead the pack in coding capabilities. It seems to have a more nuanced understanding of code & refactors complex functions with an impressive level of detail.
"Why Does Claude Feel So Broken Then?" - The User Complaints Are Real
Okay, so if Claude is so great, why do I see so many threads on Reddit with titles like "The Real Reason Claude Code Feels Broken"? This is where the conversation gets interesting. The benchmarks are one thing, but the day-to-day user experience can be a different story.
Here are some of the most common complaints I’ve seen, & honestly, I’ve experienced a few of these myself:
Context Leakage & Short-Term Memory: This is probably the biggest one. You're deep into a project, feeding Claude code across multiple files, & suddenly it seems to forget a key instruction from five minutes ago. You get duplicated files, broken context, & weirdly named, hallucinated files that make you question its sanity. It’s like trying to work with an intern who has severe short-term memory loss.
Janky UI Output & Formatting Issues: Sometimes, the way Claude presents information can be a bit… off. It might refuse to end files with newlines unless you practically beg it, or the formatting of its output can be inconsistent. These are small things, but they add up & disrupt your workflow.
The "Drunk" AI: One Reddit user described GPT-4o as seeming "drunk" because it ignores important details & just spews out code. This feeling can definitely apply to Claude at times, too. You give it a clear correction, & it seems to ignore it, forcing you to repeat yourself over & over.
Message Limits: For Pro subscribers, the message limits can be a real pain point, especially for complex workflows that involve uploading large files like PDFs. Some users report hitting their message limit after just 5-8 messages, which doesn't align with the advertised limits.
It's these kinds of issues that lead to the frustration & the feeling that the tool is "broken." It’s not that it can’t perform the task; it’s that it requires a very specific, almost ritualistic, way of interacting with it to get the best results.
The "Claude Whisperer" Manual: How to Make it Work for You
Here’s the thing: it turns out a lot of the frustration with Claude comes from how we’re using it. A Reddit user had a breakthrough that I think is key: they stopped just "vibe-coding" & started treating Claude like a structured-thinking machine.
When they started documenting everything first (what each file is for, how functions connect, `README` blurbs in folders, & even checklist-style `TASKS.md` files), Claude's performance skyrocketed. The bugs dropped, multi-step plans were actually followed, & it started reusing code instead of rewriting it.
This suggests that Claude isn’t broken; it just has a different operating model. It thrives on structure. If you just throw code at it & hope for the best, you’re probably going to have a bad time. But if you give it "breadcrumbs," as one user put it, it can build your gingerbread house with plumbing.
Here are some practical tips that have been floating around the community:
Context Engineering is Key: A Wordfence article put it perfectly: if you're noticing a drop in quality, check what's in your context. Is your `CLAUDE.md` file cluttered? Have you imported a bunch of irrelevant files? Are you using command-line tools that are adding "tainted" data to your context window? Being mindful of what you're feeding Claude is crucial.
Use `/clear` Frequently: This is a simple but powerful command. During long coding sessions, the context window can fill up with all sorts of conversational cruft. Using `/clear` between tasks helps reset the context & keeps the AI focused on the current problem.
Treat it Like a Pair of Programmers: Anthropic itself suggests an interesting workflow: use one Claude instance to write code, then `/clear` or open a new instance to have a second Claude review the first one's work. This is like having a fresh pair of eyes on your code, & it can help catch errors & improve quality.
Master Your `CLAUDE.md` File: This file is prime real estate for influencing Claude's behavior. Don't just dump a bunch of instructions in there & forget about it. Experiment with it, iterate on it, & see what kind of instructions produce the best results for your specific workflow. For a starting point, see the sketch below.
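Here's a minimal, hypothetical `CLAUDE.md` to illustrate. The sections & rules are my assumptions about what a typical web project might want, not an official Anthropic template; swap in your own stack & conventions:

```markdown
# CLAUDE.md (hypothetical sketch, adapt to your project)

## Project map
- `api/`: Express routes, one file per resource
- `web/`: React front end; shared helpers live in `web/src/lib`

## Conventions
- TypeScript strict mode; no `any`
- End every file with a trailing newline
- Reuse existing helpers before writing new ones

## Workflow
- Read `TASKS.md` first & work on one checklist item at a time
- Run `npm test` after each change; stop & report if tests fail
```

Keep it short & ruthlessly pruned. Per the context-engineering tip above, a bloated `CLAUDE.md` is itself context pollution.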
So, Is It Worth Switching? Claude vs. GPT-4o for a Developer
This is the million-dollar question, isn't it? Based on everything I’ve seen, here’s my take:
For pure coding tasks, especially complex refactoring, bug fixing, & generating new features, Claude 3.5 Sonnet seems to have the edge right now. The benchmarks & the experiences of many developers point to it producing higher-quality, more accurate code. One reviewer even went so far as to say that after using Claude, going back to GPT-4 feels like a step backward to GPT-3.5.
GPT-4o still holds its own in some areas. It seems to be faster in terms of raw token generation speed & some benchmarks show it having an advantage in mathematical reasoning. Its multimodal capabilities are also a strong point.
The "best" tool really depends on your workflow. If you're a developer who thrives on structure & is willing to put in a little extra effort to manage context & provide clear, detailed instructions, Claude will likely reward you with incredible results. If you prefer a more conversational, "do-it-all" tool, you might still find GPT-4o more comfortable.
Honestly, the rise of powerful AI assistants like Claude is changing how businesses & individuals approach development & customer interaction. This is where tools like Arsturn come into the picture. Imagine taking the power of these advanced AI models & training them on your own business data. With Arsturn, you can build no-code AI chatbots that provide instant, personalized customer support, answer questions, & engage with website visitors 24/7. It's about taking this raw coding power & channeling it into a polished, customer-facing solution, which is a pretty cool application of this tech.
My Final Verdict
So, is Claude Code getting better? ABSOLUTELY. The leap from previous versions to Claude 3.5 Sonnet is significant, & in many ways, it has set a new standard for AI-powered coding assistants. It's faster, smarter, & in the right hands, incredibly powerful.
However, the user complaints are also valid. It’s not a perfect tool, & it has its quirks. The key to unlocking its full potential seems to lie in understanding its "personality" – its need for structure, clear context, & detailed instructions.
If you’ve been on the fence, I’d say now is a GREAT time to give Claude another try, especially with the latest models. Just go in with the right expectations. Don't treat it like a magical black box. Treat it like a junior developer who is brilliant but needs clear guidance. If you can do that, you might just find that it becomes an indispensable part of your toolkit.
Hope this was helpful! I’d love to hear what you think. Have you tried the latest Claude models? What have your experiences been like? Let me know in the comments.