
AI models can now solve complex programming tasks in hours while still failing at simple everyday questions. According to Andrej Karpathy, that is not a contradiction but a reflection of how progress in AI is uneven across domains.

Karpathy says there are currently two very different perspectives on AI progress. One group has tried the free version of ChatGPT or the voice mode and formed an opinion based on obvious mistakes, weak reasoning, and hallucinations. In his view, however, those older or less capable models no longer reflect the current frontier.

The second group uses the latest professional-grade systems such as OpenAI Codex or Claude Code in technical domains like programming, mathematics, and research. There, Karpathy argues, progress this year has been dramatic: these models can independently refactor entire codebases or identify security vulnerabilities. As a result, the two groups are often talking past each other.

“It is simultaneously true that OpenAI’s free and, in my opinion, somewhat neglected Advanced Voice Mode fails at the dumbest questions in Instagram Reels, while at the same time OpenAI’s most expensive paid Codex model can spend an hour coherently restructuring an entire codebase or finding and exploiting vulnerabilities in computer systems.” (Andrej Karpathy, via X)

Behind that observation is a deeper point: fields such as coding and mathematics, where outcomes can be clearly verified and reinforced through feedback, are currently benefiting far more from AI progress than areas without clean evaluation metrics, such as writing, consulting, or open-ended advice.

Verifiability as the key to progress

Karpathy’s argument touches on one of the central questions in AI research today: can language models develop into a more general intelligence, or can they only be optimized to perform efficiently in specific domains with well-defined feedback loops?

He addressed this structural issue in an earlier essay on what he called the “Software 2.0” paradigm. In that framework, the critical factor is not whether a task can be precisely specified, but whether it can be verified. Only when a system can receive automated feedback, such as right-or-wrong judgments or clear reward signals, can it be effectively improved through reinforcement learning. As Karpathy put it, “the more verifiable a task is, the better it can be automated in this new programming paradigm.”
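The distinction can be made concrete with a minimal sketch (not from Karpathy's essay; the task, tests, and candidate solutions are hypothetical illustrations): a coding task yields an automatic right-or-wrong reward signal that reinforcement learning can optimize against, whereas an essay or piece of advice offers no such check.

```python
# Minimal sketch of a "verifiable" reward signal for reinforcement learning.
# The toy task (implement an `add` function) and its tests are hypothetical.

def verifiable_reward(candidate_code: str) -> float:
    """Return 1.0 if the generated solution passes automated tests, else 0.0."""
    namespace = {}
    try:
        exec(candidate_code, namespace)      # load the model's proposed function
        assert namespace["add"](2, 3) == 5   # automated right-or-wrong checks
        assert namespace["add"](-1, 1) == 0
        return 1.0
    except Exception:
        return 0.0

# A correct candidate earns the full reward; a buggy one earns none.
print(verifiable_reward("def add(a, b): return a + b"))  # 1.0
print(verifiable_reward("def add(a, b): return a - b"))  # 0.0
```

No comparable function exists for grading an essay or a piece of consulting advice, which is the gap Karpathy's argument turns on.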

Last summer, rumors circulated about a possible “Universal Verifier” at OpenAI that could extend reinforcement learning across all areas of knowledge. So far, however, nothing concrete has emerged. Meanwhile, Jerry Tworek, one of the leading figures behind OpenAI’s reinforcement learning strategy, has left the company and recently said that “deep learning research is essentially complete.” (via X)
