I tested GPT-5’s coding skills, and it was so bad that I’m sticking with GPT-4o (for now)

Vaselena/Getty Images

ZDNET’s key takeaways

  • OpenAI’s new GPT-5 flagship failed half of my programming tests.
  • Previous OpenAI releases have had just about perfect results.
  • Now that OpenAI has enabled fallbacks to other LLMs, there are options.

So GPT-5 happened. It’s out. It’s released. It’s the talk of the virtual town. And it’s got some problems. I’m not gonna bury the lede. GPT-5 has failed half of my programming tests. That’s the worst that OpenAI’s flagship LLM has ever done on my carefully designed tests.

Also: The best AI for coding in 2025 (and what not to use)

Before I get into the details, let’s take a moment to discuss one other little feature that’s also a bit wonky. Check out the new Edit button on the top of the code dumps it generates.

edit-button

Screenshot by David Gewirtz/ZDNET

Clicking the Edit button takes you into a nice little code editor. Here, I replaced the Author field, right in ChatGPT’s results.

editor

Screenshot by David Gewirtz/ZDNET

That seemed nice, but it ultimately proved futile. When I closed the editor, it asked me if I wanted to save. I did. Then this unhelpful message showed up.

wonky-save

Screenshot by David Gewirtz/ZDNET

I never did get back to my original session. I had to submit my original prompt again, and let GPT-5 do its work a second time.

But wait. There’s more. Let’s dig into my test results…

1. Writing a WordPress plugin

This was my very first test of coding prowess for any AI. It’s what gave me that first “the world is about to change” feeling, and it was done using GPT-3.5.

Subsequent tests, using the same prompt but with different AI models, generated mixed results. Some AIs did great, some didn’t. Some AIs, like those from Microsoft and Google, improved over time.

Also: How I test an AI chatbot’s coding ability – and you can, too

ChatGPT’s model has been the gold standard for this test since the very beginning. That makes the GPT-5 results all the more curious.

So, look, the actual coding with GPT-5 was partially successful. GPT-5 generated a single block of code, which I pasted into a file and was able to run. It provided the requisite UI.

When I pasted in the test names, it dynamically updated the line count, although it described it as “Line to randomize” instead of “Lines to randomize.”

plugin

Screenshot by David Gewirtz/ZDNET

But then, when I clicked Randomize, it didn’t. Instead, it redirected me to tools.php. What?? ChatGPT has never had a problem with this test, whether GPT-3.5, GPT-4, or GPT-4o. You mean to tell me that OpenAI’s much-anticipated GPT-5 is failing right out of the gate? Ouch.

I then gave GPT-5 this prompt.

When I click randomize, I’m taken to http://testsite.local/wp-admin/tools.php. I do not get a list of randomized results. Can you fix?

The result was a single line to patch. I’m not thrilled with that approach, because it requires the user to dig through the code and make no mistakes when replacing the line.

patch

Screenshot by David Gewirtz/ZDNET

So, I asked GPT-5 for a full plugin. It gave me the full text of the plugin to copy and paste. This time, it worked.

plugin2

Screenshot by David Gewirtz/ZDNET

This time, it did randomize the lines. When it encountered duplicates, it separated them from each other, as it was instructed. Finally.

Also: I found 5 AI content detectors that can correctly identify AI text 100% of the time

I’m sorry, OpenAI. I have to fail you on this test. You would have passed if the only error had been failing to use the plural of “line” when appropriate. But handing back a non-working plugin on the first try is fail territory, even if the AI did eventually make it work on the second attempt.

No matter how you spin it, this is a step back.

2. Rewriting a string function

This second test asks the AI to rewrite a string function to better check for dollars and cents. The original code that GPT-5 was asked to rewrite did not allow for cents (it only checked for integer dollar amounts).

test2

Screenshot by David Gewirtz/ZDNET

GPT-5 did fine on this test. It returned a minimal result, with no error checking: it didn’t check for non-string input, extra whitespace, thousands separators, or currency symbols.

But that’s not what I asked for. I told it to rewrite a function, which itself did not have any error checking. GPT-5 did exactly what I asked with no embellishment. I’m kind of glad of that because it doesn’t know whether or not code prior to this routine already did that work.
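The article doesn’t reproduce the actual function, so here is a minimal Python sketch of the kind of check involved; the function name and exact validation rule are my assumptions, not GPT-5’s output. It accepts integer dollars with optional two-digit cents and, deliberately, nothing else — mirroring the test’s narrow scope.

```python
import re

def is_dollar_amount(value):
    """Return True for strings like '42' or '42.99': integer dollars
    with optional two-digit cents. No trimming, no currency symbols,
    no thousands separators -- matching the test's minimal scope."""
    return re.fullmatch(r"\d+(\.\d{2})?", value) is not None
```

Note that, like the result described above, this sketch does no error handling: passing a non-string raises an exception, and `" 42"` or `"$42"` are simply rejected.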

GPT-5 passed this test.

3. Finding an annoying bug

This test came about because I was struggling with a less-than-obvious bug in my code. Without going into the weeds about how the WordPress framework works, the obvious answer is not the right answer.

You need some fairly arcane knowledge about how WordPress filters pass their information. This test has been a stumbling block for more than a few AI LLMs.

Also: Gen AI disillusionment looms, according to Gartner’s 2025 Hype Cycle report

GPT-5, however, like GPT-4 and GPT-4o before it, did understand the problem. It articulated a clear solution.

GPT-5 passed this test.

4. Writing a script

This test asks the AI to incorporate a fairly obscure Mac scripting tool called Keyboard Maestro, as well as Apple’s scripting language AppleScript, and Chrome scripting behavior.

It’s really a test of the reach of the AI in terms of knowledge, its understanding of how web pages are constructed, and the ability to write code across three interlinked environments.

Quite a few AIs have failed this test, but the failure point is usually a lack of knowledge about Keyboard Maestro. GPT-3.5 didn’t know about Keyboard Maestro. But ChatGPT has been passing this test since GPT-4. Until now.

Where should we start? Well, the good news is that GPT-5 handled the Keyboard Maestro part of the problem just fine. But it got the AppleScript coding wrong, and then doubled down on its misunderstanding of how case works in AppleScript.

gpt5-applescript

Screenshot by David Gewirtz/ZDNET

It actually invented a property. This is one of those cases where an AI confidently presents an answer that is completely wrong.

Also: ChatGPT comes with personality presets now – and other upgrades you might have missed

AppleScript is natively case-insensitive. If you want AppleScript to pay attention to case, you need to use a “considering case” block. So, this happened.

lowercase

Screenshot by David Gewirtz/ZDNET

The reason the error message referred to the title of one of my articles is that the article was open in the front window in Chrome. This function checks the front window and does stuff based on its title.

search-term

Screenshot by David Gewirtz/ZDNET

But misunderstanding how case works wasn’t the only AppleScript error GPT-5 generated. It also referenced a variable named searchTerm without ever defining it — an error-creating practice in pretty much any programming language.
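For readers who don’t live in AppleScript, the case rule GPT-5 tripped over can be shown in a few lines (this is an illustrative sketch, not GPT-5’s code):

```applescript
-- By default, AppleScript string comparisons ignore case:
if "Hello" is "hello" then
	-- this branch runs
end if

-- To make case significant, wrap the comparison in a block:
considering case
	if "Hello" is "hello" then
		-- this branch does NOT run
	end if
end considering
```

There is no property you can set to make a single comparison case-sensitive; the `considering case` block is the language’s mechanism, which is why inventing a property fails outright.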

Fail, fail, fail, McFaildypants.

The internet hath spoken

OpenAI seemed to suffer from the same hubris that its AIs do. It confidently moved everyone to GPT-5 and burned the bridges back to GPT-4o. I’m paying $200 a month for a ChatGPT Pro account. On Friday, I couldn’t move back to GPT-4o for coding work. Neither could anyone else.

There was, however, just a tiny bit of user pushback on the whole bridge-burning thing. And by tiny, I mean the entire frickin’ internet. So, by Saturday, ChatGPT had a new option.

revert

Screenshot by David Gewirtz/ZDNET

To get to this, go to your ChatGPT settings and turn on “Show legacy models.” Then, as it has always been, just drop down the model menu and choose the one you want. Note: this option is only available to those on paid tiers. If you’re using ChatGPT for free, you’ll take what you’re given, and you’ll love it.

Ever since the whole generative AI thing kicked off at the beginning of 2023, ChatGPT has been the gold standard of programming tools, at least according to my LLM testing.

Also: Microsoft rolls out GPT-5 across its Copilot suite – here’s where you’ll find it

Now? I’m really not sure. This is only a day or so after GPT-5 has been released, so its results will probably get better over time. But for now, I’m sticking with GPT-4o for coding, although I do like the deep reasoning capabilities in GPT-5.

What about you? Have you tried GPT-5 for programming tasks yet? Did it perform better or worse than previous versions like GPT-4o or GPT-3.5? Were you able to get working code on the first try, or did you have to guide it through fixes? Are you going to use GPT-5 for coding or stick with older models? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.



Original Source: zdnet
