The returns diminished

“It’s a lemon”—OpenAI’s largest AI model ever arrives to mixed reviews

GPT-4.5 offers marginal gains in capability and poor coding performance despite 30x the cost.

Benj Edwards – Feb 28, 2025 4:35 PM

The verdict is in: OpenAI's newest and most capable traditional AI model, GPT-4.5, is big, expensive, and slow, providing marginally better performance than GPT-4o at 30x the cost for input and 15x the cost for output. The new model seems to prove that longstanding rumors of diminishing returns in training unsupervised-learning LLMs were correct and that the so-called "scaling laws" cited by many for years have possibly met their natural end.

An AI expert who requested anonymity told Ars Technica, "GPT-4.5 is a lemon!" when comparing its reported performance to its dramatically increased price, while frequent OpenAI critic Gary Marcus called the release a "nothing burger" in a blog post (though to be fair, Marcus also seems to think most of what OpenAI does is overrated).

Former OpenAI researcher Andrej Karpathy wrote on X that GPT-4.5 is better than GPT-4o but in ways that are subtle and difficult to express. "Everything is a little bit better and it's awesome," he wrote, "but also not exactly in ways that are trivial to point to."

OpenAI is well aware of these limitations, and it took steps to soften the potential letdown by framing the launch as a relatively low-key "Research Preview" for ChatGPT Pro users and spelling out the model's limitations in a GPT-4.5 release post published Thursday.

"GPT-4.5 is a very large and compute-intensive model, making it more expensive than and not a replacement for GPT-4o," the company wrote. "Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models."

According to OpenAI's own benchmark results, GPT-4.5 scored significantly lower than OpenAI's simulated reasoning models (o1 and o3) on tests like AIME math competitions and GPQA science assessments, with GPT-4.5 scoring just 36.7 percent on AIME compared to o3-mini's 87.3 percent. Additionally, GPT-4.5 costs five times more than o1 and over 68 times more than o3-mini for input processing.

And GPT-4.5 is terrible for coding, relatively speaking, with an October 2023 knowledge cutoff that may leave out knowledge about updates to development frameworks.

A graph of "Performance vs. Cost" assembled by X user Enrico using data from Paul Gauthier. Credit: Enrico - big-AGI / X

Tech investor Paul Gauthier performed independent testing of GPT-4.5's coding ability using Aider's Polyglot Coding benchmark and found that GPT-4.5 ranked 10th in overall ability (Claude 3.7 Sonnet with extended thinking is on top, with o1 and o3 also ahead). It also ranked poorly in performance versus cost, meaning that for coding tasks, GPT-4.5 isn't worth the price you'd have to pay through the API.

According to OpenAI's benchmarks, GPT-4.5 does show some improvements over GPT-4o in specific areas. The model scored higher on the multilingual MMMLU (general knowledge) test with 85.1 percent compared to GPT-4o's 81.5 percent, suggesting better performance on knowledge-based tasks across multiple languages. OpenAI also reports that GPT-4.5 demonstrated improved performance in reducing confabulations (hallucinations), with the company claiming it generates fewer false or misleading responses than previous versions.

OpenAI's testing also indicated that human evaluators preferred GPT-4.5's responses over GPT-4o in about 57 percent of interactions, suggesting modest but measurable improvements in overall user experience. However, these incremental gains come with significantly higher computational demands and costs.

High on vibes, low on reasoning

Upon 4.5's release, OpenAI CEO Sam Altman did some expectation tempering on X, writing that the model is strong on vibes but low on analytical strength. "It is the first model that feels like talking to a thoughtful person to me," he wrote. He then added further down in his post, "a heads up: this isn't a reasoning model and won't crush benchmarks. it's a different kind of intelligence and there's a magic to it i haven't felt before."

GPT-4.5 is so massive and inefficient that, as Altman also wrote on X, the company would have liked to release it to everyone but is "out of GPUs." More are on the way, he said.

Perhaps because of the disappointing results, Altman had previously written that GPT-4.5 will be the last of OpenAI's traditional AI models, with GPT-5 planned to be a dynamic combination of "non-reasoning" LLMs and simulated reasoning models like o3.

A stratospheric price and a tech dead-end

And about that price: it's a doozy. GPT-4.5 costs $75 per million input tokens and $150 per million output tokens through the API, compared to GPT-4o's $2.50 per million input tokens and $10 per million output tokens. (Tokens are chunks of data used by AI models for processing.) For developers using OpenAI models, this pricing makes GPT-4.5 impractical for many applications where GPT-4o already performs adequately.

By contrast, OpenAI's flagship reasoning model, o1, costs $15 per million input tokens and $60 per million output tokens, significantly less than GPT-4.5 despite offering specialized simulated reasoning capabilities. Even more striking, the o3-mini model costs just $1.10 per million input tokens and $4.40 per million output tokens, making it cheaper than even GPT-4o while providing much stronger performance on specific tasks.
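
To make the price gap concrete, here is a minimal Python sketch that reproduces the multiples quoted in this article from the per-million-token prices above. The dictionary keys, the function names, and the 2,000-input/500-output request size are illustrative assumptions, not anything OpenAI publishes in this form, and the prices are simply the figures reported here at the time of writing.

```python
# Prices in USD per million tokens, as quoted in this article.
# The keys are just labels for this sketch, not guaranteed API model IDs.
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "o1": {"input": 15.00, "output": 60.00},
    "o3-mini": {"input": 1.10, "output": 4.40},
}


def price_ratio(model: str, baseline: str, kind: str) -> float:
    """How many times more `model` costs than `baseline` for 'input' or 'output'."""
    return PRICES[model][kind] / PRICES[baseline][kind]


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request with the given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


if __name__ == "__main__":
    # Multiples of GPT-4.5's price relative to the other models.
    for baseline in ("gpt-4o", "o1", "o3-mini"):
        for kind in ("input", "output"):
            ratio = price_ratio("gpt-4.5", baseline, kind)
            print(f"GPT-4.5 vs {baseline} ({kind}): {ratio:.1f}x")

    # Hypothetical request size, chosen only for illustration.
    for model in PRICES:
        cost = request_cost(model, input_tokens=2_000, output_tokens=500)
        print(f"{model}: ${cost:.4f} per request")
```

Under these assumptions, the hypothetical request costs a few tenths of a dollar on GPT-4.5 and about a cent on GPT-4o, which is the gap that makes the larger model hard to justify wherever GPT-4o is already good enough.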

OpenAI has likely known about diminishing returns in training LLMs for some time. As a result, the company spent most of last year working on simulated reasoning models like o1 and o3, which use a different inference-time (runtime) approach to improving performance instead of throwing ever-larger amounts of training data at GPT-style AI models.

OpenAI's self-reported benchmark results for the SimpleQA test, which measures confabulation rate. Credit: OpenAI

While this seems like bad news for OpenAI in the short term, competition is thriving in the AI market. Anthropic's Claude 3.7 Sonnet has demonstrated vastly better performance than GPT-4.5, with a reportedly more efficient architecture. It's worth noting that Claude 3.7 Sonnet is likely a system of AI models working together behind the scenes, although Anthropic has not provided details about its architecture.

For now, it seems that GPT-4.5 may be the last of its kind—a technological dead-end for an unsupervised learning approach that has paved the way for new architectures in AI models, such as o3's inference-time reasoning and perhaps even something more novel, like diffusion-based models. Only time will tell how things end up.

GPT-4.5 is now available to ChatGPT Pro subscribers, with rollout to Plus and Team subscribers planned for next week, followed by Enterprise and Education customers the week after. Developers can access it through OpenAI's various APIs on paid tiers, though the company is uncertain about its long-term availability.

Benj Edwards Senior AI Reporter

Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.