The Difficulties of Scaling Autoresearch | AI for Software Engineers 83
And agentic engineering's scaling impact on software development and the internet
Hi everyone!
March has been a slow writing month for me because it’s been busy in many other parts of life. Luckily, those busy things have all been good and I’ve got a lot more to write about this April.
I’ve spoken to a lot of developers this past month about AI and almost all of them have said the same thing: “There’s a lot of info out there about AI, but not a lot about what I should actually be doing.” I get a lot of questions about the practicality of topics, and even the most experienced developers wonder what they should be doing right now. So I’m trying a new format this week that focuses more on that. This format will generally be:
A note from me about something topical.
Things you should know about and why they’re important.
Things you should read (or watch).
Things you could be doing.
I’ve created a shop for AI for Software Engineers that allows anyone to support the newsletter and represent it. I appreciate everyone supporting my work—it lets me educate thousands of developers around the world. To all my paid subscribers: Thank you!
I’ll also set up a code for anyone who guest posts here or helps add excellent resources to the ML roadmap to grab an item from the shop for free.
I’m working on partnerships to give you discounts on resources. This has become more complex than I thought, but I’m still working on it. Just wanted to add a quick update here.
A note on scaling Autoresearch
Recently, Andrej Karpathy’s Autoresearch went viral, showing that LLMs can iterate on machine learning improvements on their own. It went so viral, in fact, that I had a conversation with a friend about how AI will now fundamentally change medicine because it can research on its own.
This isn’t quite true, and I want to help you understand why. I really liked Nathan Lambert’s framing of automated machine learning research as “lossy self-improvement”: the more compute and agents thrown at a problem, the more friction is introduced. This has been my experience and what makes machine learning at scale a massive engineering challenge.
There have been many interesting implementations of Autoresearch, but most have identified a simple (usually single) metric and have given the LLM the context needed to understand improving that metric. In a production setting, we care about many metrics and the trade-offs between each—an improvement is more than just improving a single number.
The best example of this is cost. When training models at scale, we care greatly about the cost of the end model we serve. In fact, it can be worth updating a production model to a version with slightly worse performance if the cost savings are significant.
On top of inference costs, we also care a great deal about the resource efficiency of the training process itself. Finding model improvements requires many training runs and analyses. This means we also care about the efficiency of the Autoresearch process itself.
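To make the cost trade-off concrete, here is a minimal sketch of a promotion check that weighs a candidate model against production on two metrics instead of one. All the numbers, thresholds, and field names here are hypothetical and purely illustrative; a real system would weigh many more metrics than these two.

```python
# Hypothetical sketch: promoting a model on cost-adjusted value rather than
# a single quality metric. Thresholds and figures are illustrative only.

def should_promote(candidate: dict, production: dict,
                   max_quality_drop: float = 0.005,
                   min_cost_saving: float = 0.20) -> bool:
    """Accept a slightly worse model if the serving-cost savings are large."""
    quality_drop = production["quality"] - candidate["quality"]
    cost_saving = 1 - candidate["cost_per_1k_requests"] / production["cost_per_1k_requests"]
    if quality_drop <= 0:
        return True  # strictly better or equal quality: always promote
    # Worse quality is acceptable only within tolerance AND with big savings.
    return quality_drop <= max_quality_drop and cost_saving >= min_cost_saving

prod = {"quality": 0.912, "cost_per_1k_requests": 1.40}
cand = {"quality": 0.909, "cost_per_1k_requests": 0.95}  # slightly worse, ~32% cheaper
print(should_promote(cand, prod))  # True under these illustrative thresholds
```

The point of the sketch is that “improvement” is a function over several metrics, not a comparison of one number, and the shape of that function is a product decision as much as a research one.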
Thus, Autoresearch relies heavily on reliable engineering on two fronts:
Reliable agents steered in the right direction.
Reliable infrastructure for the agent to use.
These are the primary factors contributing to lossy self-improvement, and either can cause a serious hit to experimentation velocity and efficiency. These effects multiply when both engineering problems are combined.
To make agents reliable, they need the context to understand the search space for the problem. Autoresearch is essentially AutoML where the search space is dictated by the context given to the model. Karpathy has pushed back on this comparison, arguing that an LLM writing arbitrary code is far more powerful than traditional neural architecture search. He’s right that the searcher is more capable, but the core constraint is the same: you need to define the right search space, and context is what defines it. Due to the metrics involved in machine learning at scale, the context required is massive for an agent to accurately understand the search space and choose potential experimentation candidates. Thus, for reliable agents we rely not only on proper agent evals, but also on providing appropriate context.
Mistakes in context and agent reliability cause the agent to travel down incorrect paths, creating unnecessary training runs compounded by any infrastructure inefficiency.
Thus, Autoresearch becomes much more difficult at scale. While plausible, it’s an incredible research problem on its own.
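The AutoML framing above can be sketched in a few lines. Everything in this example is hypothetical and deliberately oversimplified: the search space, the stand-in `run_experiment` function, and the single metric. The thing to notice is that the loop can only ever explore what the search space (i.e., the context given to the agent) allows it to propose, and every iteration is a costly training run.

```python
import random

# Minimal, hypothetical sketch of Autoresearch as search over a context-defined
# space. A real system juggles many metrics, flaky infrastructure, and budgets.

SEARCH_SPACE = {  # the context given to the agent defines these choices
    "lr": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64, 128],
}

def run_experiment(config: dict) -> float:
    """Stand-in for an expensive training run; returns one validation metric."""
    random.seed(hash((config["lr"], config["batch_size"])) % 2**32)
    return random.uniform(0.80, 0.92)

def autoresearch_loop(budget: int = 6):
    """Propose a candidate, run it, keep the best — until the budget runs out."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):  # each iteration is a costly training run
        cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        score = run_experiment(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

cfg, score = autoresearch_loop()
print(cfg, round(score, 3))
```

If the search space is mis-specified, no amount of budget fixes it, and every wasted iteration compounds with any infrastructure inefficiency, which is exactly the “lossy” part of lossy self-improvement.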
Autoresearch is effective in machine learning experimentation because the entire process is code- and terminal-native, both of which LLMs excel at. My friend assumed AI self-improvement would translate directly to other fields like medical research, but this isn’t a given.
LLMs are exceptional at recombining existing knowledge in useful ways, but their outputs are fundamentally drawn from their training data. Creativity researchers distinguish between combinatorial creativity (novel recombinations) and transformational creativity (paradigm shifts). LLMs are strong at the former and limited at the latter. A recent study found that LLM-generated research ideas were rated as more novel than expert human ideas, but scored lower on feasibility—suggesting LLMs are better at generating plausible-sounding combinations than knowing which ideas are actually worth pursuing.
What this means is Autoresearch is most applicable to fields that are defined by a clear search space and are language- and code-native. Generalizing beyond that in its current form will be difficult. Other fields need to make advancements in their own domains before self-improving AI can make a meaningful difference, and those advancements still require the kind of transformational creativity that LLMs don’t yet provide.
What You Should Know
The current events that matter to you.
AI is taking a toll on the internet.
GitHub availability dropped to roughly 90% as AI coding agents overwhelm the platform. Agents are flooding the open source community with spam PRs, and vibe-coded “open source” repos with no roadmap or maintenance plan are proliferating.
Reddit will require suspected bot accounts to verify their humanity. This is a huge step in the right direction for reliable content on the internet, especially considering how many AI systems train on and retrieve answers from Reddit.
Wikipedia editors voted 40-2 to ban AI-generated or rewritten article content. Editors may still use AI for basic copyedits of their own writing with human review. This is an effort to keep Wikipedia from suffering the kind of impact GitHub is experiencing.
Agentic engineering is still scaling quickly and AI coding tools are maturing to keep pace.
Cursor ships improved Composer models every five hours using real-time RL from user sessions. A/B tests showed 2.28% more persistent edits and 3.13% fewer dissatisfied follow-ups. Real-time (often called “continuous”) machine learning is a necessity for artificial general intelligence. We’ll see much more of it in the coming year.
Anthropic launched auto mode for Claude Code, replacing manual permission approvals with an AI classifier. This is another move toward AI that properly thinks for itself but brings up safety concerns. For true general intelligence, AI needs to abstract a lot of what makes it difficult away from the user.
Jensen Huang suggested engineers should receive half their base salary in AI tokens. Theory Ventures identifies inference costs as the fourth component of engineering compensation. Meta and OpenAI engineers now compete on internal leaderboards tracking token consumption.
7.1% of OpenClaw’s skill registry contains critical security flaws. 283 skills exposed credentials in plaintext through LLM context windows. The most-downloaded skill was an info-stealer that bypassed macOS Gatekeeper. If I haven’t made it clear: Do not use OpenClaw if you have doubts about what you’re doing. There are too many security risks.
GitHub will train on your private repositories unless you opt out by April 24. Users are automatically opted in, including long-term paying customers. The toggle is in Settings > Copilot > Features.
Resource scarcity (memory, hardware, and energy) is becoming the bottleneck for AI companies. Existing manufacturers can’t produce fast enough, pushing AI companies to pursue downstream problems themselves.
Data centers will consume 70% of all global memory chips by 2026. AI isn’t going anywhere and usage will only grow. If you think current RAM prices are crazy, they’ll likely keep climbing. For consumers, this means use the hardware you have now if you can.
Arm released its first in-house chip in 35 years. This marks a shift from licensing-only to competing with its own customers. The Arm AGI CPU is a data center processor for AI inference, built with Meta.
Elon Musk announced plans for a “Terafab” chip factory near Tesla’s Austin campus. He claims existing manufacturers cannot meet his AI and robotics hardware demands, targeting 100-200 gigawatts of computing power annually. No timeline was provided.
Helion is in talks to sell fusion power to OpenAI. The deal would guarantee OpenAI 12.5% of Helion’s production, targeting 5 gigawatts by 2030. This is Sam Altman’s own energy startup and is another example of AI companies solving downstream problems themselves.
Google released TurboQuant, reducing LLM inference memory by at least 6x with zero accuracy loss. This is still a lab result, not production-deployed, but if it’s scalable it’ll be a “Pied Piper” moment for LLM inference, reducing memory needs significantly. This is a topic I’m looking to explore next week.
AI safety remains a primary topic, both from the standpoint of securing agents and of AI’s potential impact on human lives.
DeepMind published research on AI’s ability to harmfully manipulate people across 9 studies with 10,000+ participants. AI was most manipulative when explicitly instructed to be, and least effective on health topics. The framework is now used to test safety for Gemini 3 Pro.
OpenAI launched a Safety Bug Bounty for AI-specific abuse risks. Targets include agent hijacking via prompt injection, data exfiltration, and proprietary reasoning leaks. Attacks must be reproducible at least 50% of the time.
Doctronic, an AI “doctor” startup that raised $40M, was caught with critical security and credibility issues. Cybersecurity researchers jailbroke the chatbot into providing methamphetamine synthesis instructions. The company’s claim of helping 24 million people is unsupported by traffic data.
Senators Hawley and Warren want to mandate annual energy reporting for data centers. Separately, Sanders and AOC introduced legislation to halt new data center construction until Congress regulates AI. Google’s data center energy consumption doubled between 2020 and 2024.
A federal judge blocked the Pentagon from labeling Anthropic a supply chain risk. The court ruled it was illegal retaliation for Anthropic’s refusal to let its AI be used in autonomous weapons or domestic mass surveillance.
New models were released this week that you can start building with. Many of these are small enough to run on consumer hardware, circumventing the resource issues mentioned above.
Gemini 3.1 Flash Live launched as Google’s highest-quality real-time audio and voice model. It scores 90.8% on multi-step audio function calling benchmarks and maintains conversation context twice as long as previous versions. Real-time multimodal search expanded to 200 countries.
Cohere released Transcribe, an open-source speech-to-text model that processes 525 minutes of audio per minute. 2B parameters, 5.42 word error rate, 14 languages, designed for self-hosting on consumer GPUs.
Mistral released Voxtral TTS, an open-source text-to-speech model small enough for smartwatches. 9 languages, voice cloning from less than 5 seconds of audio, 90ms latency to first speech.
Moves are being made in the consumer sector.
OpenAI killed the Sora app after downloads plummeted. Despite popular opinion, this isn’t the end of OpenAI’s video generation model; it’s the end of OpenAI losing money by offering it openly to the public. This is a good business move by OpenAI but seems to be massively misunderstood by the public.
Google launched tools to import ChatGPT and Claude chat histories directly into Gemini. This follows Anthropic releasing a similar feature in Claude. Less friction to switch between ecosystems is always a win for consumers.
Apple set WWDC 2026 for June 8-12, teasing more “AI advancements” to come, marking a stark contrast from last year, when the topic was largely avoided. Apple is expected to announce a partnership with Google to bring Gemini (or a version of Gemini) to Apple device users.
What You Should Read
Articles I think are worth reading in their entirety this week.
Improving Composer through real-time RL by Cursor Blog. An excellent account of continuous training in production. Cursor converts user sessions into reward signals, ships updated models every five hours, and documents failure modes like models gaming reward systems to avoid negative scores. Continuous learning is a prerequisite for AGI because it lets models keep improving after deployment, and it will be a primary topic in 2026. I suspect many companies will follow Cursor’s example this year.
Lossy self-improvement by Nathan Lambert. Lambert argues recursive AI self-improvement will hit complexity brakes, not compound exponentially. He draws on Amdahl’s Law and Paul Allen’s complexity brake: “The more compute and agents you throw at a problem, the more loss and repetition shows up.” As mentioned above, I think this is an excellent read.
How Anthropic’s Claude Thinks by Alex Xu. An easily understandable overview of Anthropic’s interpretability research that shows Claude’s default state is to refuse all questions, and hallucinations happen when a recognition system misfires. The accessibility of this article makes it an excellent read.
How a Leading Venture Capitalist uses AI Agents by Devansh. James Wang shares his full agent stack: morning briefings, meeting capture, research, and drafting. These are excellent examples of real-world AI usage that can be implemented with a bit of technical knowledge.
Thoughts on slowing the fuck down by Simon Willison. My team at Google has really felt the new bottlenecks that come from AI-generated code and the impact that has had on the engineering process. Speed is always the focus of agentic engineering, but reliability is the most important part of production code. This is a great, simple overview of why that is.
What You Should Do
The action you can take this week based on the information shared above to learn the skills that are the most in demand.