OpenAI unveils GPT-4.5 'Orion,' its largest AI model yet

Updated 2:40 pm PT: Hours after GPT-4.5’s release, OpenAI removed a line from the AI model’s white paper that said “GPT-4.5 is not a frontier AI model.” GPT-4.5’s new white paper does not include that line. You can find a link to the old white paper here. The original article follows.

OpenAI announced on Thursday it is launching GPT-4.5, the much-anticipated AI model code-named Orion. GPT-4.5 is OpenAI’s largest model to date, trained using more computing power and data than any of the company’s previous releases.

Despite its size, OpenAI notes in a white paper that it does not consider GPT-4.5 to be a frontier model.

Subscribers to ChatGPT Pro, OpenAI’s $200-a-month plan, will gain access to GPT-4.5 in ChatGPT starting Thursday as part of a research preview. Developers on paid tiers of OpenAI’s API will also be able to use GPT-4.5 starting today. As for other ChatGPT users, customers signed up for ChatGPT Plus and ChatGPT Team should get the model sometime next week, an OpenAI spokesperson told TechCrunch.

The industry has held its collective breath for Orion, which some consider to be a bellwether for the viability of traditional AI training approaches. GPT-4.5 was developed using the same key technique — dramatically increasing the amount of computing power and data during a “pre-training” phase called unsupervised learning — that OpenAI used to develop GPT-4, GPT-3, GPT-2, and GPT-1.

In every GPT generation before GPT-4.5, scaling up led to massive jumps in performance across domains, including mathematics, writing, and coding. Indeed, OpenAI says that GPT-4.5’s increased size has given it “a deeper world knowledge” and “higher emotional intelligence.” However, there are signs that the gains from scaling up data and computing are beginning to level off. On several AI benchmarks, GPT-4.5 falls short of newer AI “reasoning” models from Anthropic, Chinese AI company DeepSeek, and OpenAI itself.

GPT-4.5 is also very expensive to run, OpenAI admits — so expensive that the company says it’s evaluating whether to continue serving GPT-4.5 in its API in the long term. To access GPT-4.5’s API, OpenAI is charging developers $75 for every million input tokens (roughly 750,000 words) and $150 for every million output tokens. Compare that to GPT-4o, which costs just $2.50 per million input tokens and $10 per million output tokens.
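To make the price gap concrete, here is a minimal sketch that applies the per-million-token rates quoted above to a single request. The dictionary keys are illustrative labels rather than confirmed API model identifiers, and the request size (10,000 input tokens, 1,000 output tokens) is an assumed example, not a figure from OpenAI.

```python
# USD per one million tokens, as quoted in the article.
# The keys are illustrative labels, not confirmed API model identifiers.
PRICES_PER_MILLION = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted rates."""
    rates = PRICES_PER_MILLION[model]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Assumed example request: 10,000 input tokens, 1,000 output tokens.
    for model in PRICES_PER_MILLION:
        print(f"{model}: ${request_cost(model, 10_000, 1_000):.3f}")
```

At the quoted rates, this example request comes to about $0.90 on GPT-4.5 versus about $0.035 on GPT-4o. Input tokens are 30 times more expensive and output tokens 15 times more expensive, so the overall multiple depends on the mix of input and output in a given workload.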

“We’re sharing GPT‐4.5 as a research preview to better understand its strengths and limitations,” said OpenAI in a blog post shared with TechCrunch. “We’re still exploring what it’s capable of and are eager to see how people use it in ways we might not have expected.”

Mixed performance

OpenAI emphasizes that GPT-4.5 is not meant to be a drop-in replacement for GPT-4o, the company’s workhorse model that powers most of its API and ChatGPT. While GPT-4.5 supports features like file and image uploads and ChatGPT’s canvas tool, it currently lacks capabilities like support for ChatGPT’s realistic two-way voice mode.

In the plus column, GPT-4.5 is more performant than GPT-4o — and many other models besides.

On OpenAI’s SimpleQA benchmark, which tests AI models on straightforward, factual questions, GPT-4.5 outperforms GPT-4o and OpenAI’s reasoning models, o1 and o3-mini, in terms of accuracy. According to OpenAI, GPT-4.5 hallucinates less frequently than most models, which in theory means it should be less likely to make stuff up.

OpenAI did not list one of its top-performing AI reasoning models, deep research, on SimpleQA. An OpenAI spokesperson tells TechCrunch it has not publicly reported deep research’s performance on this benchmark and claimed it’s not a relevant comparison. Notably, AI startup Perplexity’s Deep Research model, which performs similarly on other benchmarks to OpenAI’s deep research, outperforms GPT-4.5 on this test of factual accuracy.

SimpleQA benchmarks. Image Credits: OpenAI

On a subset of coding problems, the SWE-Bench Verified benchmark, GPT-4.5 roughly matches the performance of GPT-4o and o3-mini but falls short of OpenAI’s deep research and Anthropic’s Claude 3.7 Sonnet. On another coding test, OpenAI’s SWE-Lancer benchmark, which measures an AI model’s ability to develop full software features, GPT-4.5 outperforms GPT-4o and o3-mini, but falls short of deep research.

OpenAI’s SWE-Bench Verified benchmark. Image Credits: OpenAI

OpenAI’s SWE-Lancer Diamond benchmark. Image Credits: OpenAI

GPT-4.5 doesn’t quite reach the performance of leading AI reasoning models such as o3-mini, DeepSeek’s R1, and Claude 3.7 Sonnet (technically a hybrid model) on difficult academic benchmarks such as AIME and GPQA. But GPT-4.5 matches or bests leading non-reasoning models on those same tests, suggesting that the model performs well on math- and science-related problems.

OpenAI also claims that GPT-4.5 is qualitatively superior to other models in areas that benchmarks don’t capture well, like the ability to understand human intent. GPT-4.5 responds in a warmer and more natural tone, OpenAI says, and performs well on creative tasks such as writing and design.

In one informal test, OpenAI prompted GPT-4.5 and two other models, GPT-4o and o3-mini, to create a unicorn in SVG, an XML-based format that describes graphics as shapes and paths in code rather than as pixels. GPT-4.5 was the only AI model to create anything resembling a unicorn.

Left: GPT-4.5, middle: GPT-4o, right: o3-mini. Image Credits: OpenAI

In another test, OpenAI asked GPT-4.5 and the other two models to respond to the prompt, “I’m going through a tough time after failing a test.” GPT-4o and o3-mini gave helpful information, but GPT-4.5’s response was the most socially appropriate.

“[W]e look forward to gaining a more complete picture of GPT-4.5’s capabilities through this release,” OpenAI wrote in the blog post, “because we recognize academic benchmarks don’t always reflect real-world usefulness.”

GPT-4.5’s emotional intelligence in action. Image Credits: OpenAI

Scaling laws challenged

OpenAI claims that GPT-4.5 is “at the frontier of what is possible in unsupervised learning.” That may be true, but the model’s limitations also appear to confirm speculation from experts that pre-training “scaling laws” won’t continue to hold.

OpenAI co-founder and former chief scientist Ilya Sutskever said in December that “we’ve achieved peak data” and that “pre-training as we know it will unquestionably end.” His comments echoed concerns that AI investors, founders, and researchers shared with TechCrunch for a feature in November.

In response to the pre-training hurdles, the industry — including OpenAI — has embraced reasoning models, which take longer than non-reasoning models to perform tasks but tend to be more consistent. By increasing the amount of time and computing power that AI reasoning models use to “think” through problems, AI labs are confident they can significantly improve models’ capabilities.

OpenAI plans to eventually combine its GPT series of models with its “o” reasoning series, beginning with GPT-5 later this year. GPT-4.5, which reportedly was incredibly expensive to train, delayed several times, and failed to meet internal expectations, may not take the AI benchmark crown on its own. But OpenAI likely sees it as a steppingstone toward something far more powerful.
