AI for Software Engineers: Distillations

The Difficulties of Scaling Autoresearch | AI for Software Engineers 83

Logan Thorneloe — Sat, 28 Mar 2026 13:35:05 GMT

Hi everyone!

March has been a slow writing month for me because it’s been busy in many other parts of life. Luckily, those busy things have all been good and I’ve got a lot more to write about this April.

I’ve spoken to a lot of developers this past month about AI and almost all of them have said the same thing: “There’s a lot of info out there about AI, but not a lot about what I should actually be doing.” I get a lot of questions about the practicality of topics, and even the most experienced developers wonder what they should be doing right now. So I’m trying a new format this week that focuses more on that. This format will general be:

A note from me about something topical.
Things you should know about and why they’re important.
Things you should read (or watch).
Things you could be doing.

I’ve created a shop for AI for Software Engineers that allows anyone to support the newsletter and represent it. I appreciate everyone supporting my work—it lets me educate thousands of developers around the world. To all my paid subscribers: Thank you!

I’ll also set up a code for anyone who guest posts here or helps add excellent resources to the ML roadmap to grab an item from the shop for free.

I’m working on partnerships to give you discounts on resources. This has become more complex than I thought, but I’m still working on it. Just wanted to add a quick update here.

A note on scaling Autoresearch

Recently, Andrej Karpathy’s Autoresearch went viral, showing that LLMs can iterate on machine learning improvements on their own. It went so viral, in fact, that I had a conversation with a friend about how AI will now fundamentally change medicine because it can research on its own.

This isn’t quite true, and I want to help you understand why. I really liked Nathan Lambert’s framing of automated machine learning research as “lossy self-improvement”: the more compute and agents thrown at a problem, the more friction is introduced. This has been my experience and what makes machine learning at scale a massive engineering challenge.

There have been many interesting implementations of Autoresearch, but most have identified a simple (usually single) metric and have given the LLM the context needed to understand improving that metric. In a production setting, we care about many metrics and the trade-offs between each—an improvement is more than just improving a single number.

The best example of this is cost. When training models at scale, we care greatly about the cost of the end model we serve. In fact, it can be worth updating a production model to a version with slightly worse performance if the cost savings are significant.

On top of inference costs, we also care a great deal about the resource efficiency of the training process itself. Finding model improvements requires many training runs and analyses. This means we also care about the efficiency of the Autoresearch process itself.

Thus, Autoresearch relies heavily on reliable engineering on two fronts:

Reliable agents steered in the right direction.
Reliable infrastructure for the agent to use.

These are the primary factors contributing to lossy self-improvement, and either can cause a serious hit to experimentation velocity and efficiency. These effects multiply when both engineering problems are combined.

To make agents reliable, they need the context to understand the search space for the problem. Autoresearch is essentially AutoML where the search space is dictated by the context given to the model. Karpathy has pushed back on this comparison, arguing that an LLM writing arbitrary code is far more powerful than traditional neural architecture search. He’s right that the searcher is more capable, but the core constraint is the same: you need to define the right search space, and context is what defines it. Due to the metrics involved in machine learning at scale, the context required is massive for an agent to accurately understand the search space and choose potential experimentation candidates. Thus, for reliable agents we rely not only on proper agent evals, but also on providing appropriate context.

Mistakes in context and agent reliability cause the agent to travel down incorrect paths, creating unnecessary training runs compounded by any infrastructure inefficiency.

Thus, Autoresearch becomes much more difficult at scale. While plausible, it’s an incredible research problem on its own.

Autoresearch is effective in machine learning experimentation because the entire process is code- and terminal-native, both of which LLMs excel at. My friend assumed AI self-improvement would translate directly to other fields like medical research, but this isn’t a given.

LLMs are exceptional at recombining existing knowledge in useful ways, but their outputs are fundamentally drawn from their training data. Creativity researchers distinguish between combinatorial creativity (novel recombinations) and transformational creativity (paradigm shifts). LLMs are strong at the former and limited at the latter. A recent study found that LLM-generated research ideas were rated as more novel than expert human ideas, but scored lower on feasibility—suggesting LLMs are better at generating plausible-sounding combinations than knowing which ideas are actually worth pursuing.

What this means is Autoresearch is most applicable to fields that are defined by a clear search space and are language- and code-native. Generalizing beyond that in its current form will be difficult. Other fields need to make advancements in their own domains before self-improving AI can make a meaningful difference, and those advancements still require the kind of transformational creativity that LLMs don’t yet provide.

What You Should Know

The current events that matter to you.

AI is taking a toll on the internet.
- GitHub availability dropped to roughly 90% as AI coding agents overwhelm the platform. We’re seeing agents overwhelm the open source community by spamming PRs. We’re also seeing an overwhelming number of vibe coded “open source” repos without any roadmap or future maintainability.
- Reddit will require suspected bot accounts to verify their humanity. This is a huge step in the right direction for reliable content on the internet especially considering many AI train and retrieve answers from Reddit.
- Wikipedia editors voted 40-2 to ban AI-generated or rewritten article content. Editors may still use AI for basic copyedits of their own writing with human review. This is in an effort to maintain Wikipedia without a similar impact to what’s going on with GitHub.
Agentic engineering is still scaling quickly and AI coding tools are maturing to keep pace.
- Cursor ships improved Composer models every five hours using real-time RL from user sessions. A/B tests showed 2.28% more persistent edits and 3.13% fewer dissatisfied follow-ups. Real-time (often called “continuous”) machine learning is a necessity for artificial general intelligence. We’ll see much more of it in the coming year.
- Anthropic launched auto mode for Claude Code, replacing manual permission approvals with an AI classifier. This is another move toward AI that properly thinks for itself but brings up safety concerns. For true general intelligence, AI needs to abstract a lot of what makes it difficult away from the user.
- Jensen Huang suggested engineers should receive half their base salary in AI tokens. Theory Ventures identifies inference costs as the fourth component of engineering compensation. Meta and OpenAI engineers now compete on internal leaderboards tracking token consumption.
- 7.1% of OpenClaw’s skill registry contains critical security flaws. 283 skills exposed credentials in plaintext through LLM context windows. The most-downloaded skill was an info-stealer that bypassed macOS Gatekeeper. If I haven’t made it clear: Do not use OpenClaw if you have doubts about what you’re doing. There are too many security risks.
- GitHub will train on your private repositories unless you opt out by April 24. Users are automatically opted in, including long-term paying customers. The toggle is in Settings > Copilot > Features.
Resource scarcity (memory, hardware, and energy) is becoming the bottleneck for AI companies. Existing manufacturers can’t produce fast enough causing AI companies to pursue downstream problems themselves.
- Data centers will consume 70% of all global memory chips by 2026. AI isn’t going anywhere and usage will only grow. If you think current RAM prices are crazy they’ll likely continue going up. For consumers, this means use the hardware you have now if you can.
- Arm released its first in-house chip in 35 years. This marks a shift from licensing-only to competing with its own customers. The Arm AGI CPU is a data center processor for AI inference, built with Meta.
- Elon Musk announced plans for a “Terafab” chip factory near Tesla’s Austin campus. He claims existing manufacturers cannot meet his AI and robotics hardware demands, targeting 100-200 gigawatts of computing power annually. No timeline was provided.
- Helion is in talks to sell fusion power to OpenAI. The deal would guarantee OpenAI 12.5% of Helion’s production, targeting 5 gigawatts by 2030. This is Sam Altman’s own energy startup and is another example of AI companies solving downstream problems themselves.
- Google released TurboQuant, reducing LLM inference memory by at least 6x with zero accuracy loss. This is still a lab result, not production-deployed, but if it’s scalable it’ll be a “Pied Piper” moment for LLM inference, reducing memory needs significantly. This is a topic I’m looking to explore next week.
AI safety is still a primary topic both of the standpoint of secure agents and AI’s potential impact on human lives.
- DeepMind published research on AI’s ability to harmfully manipulate people across 9 studies with 10,000+ participants. AI was most manipulative when explicitly instructed to be, and least effective on health topics. The framework is now used to test safety for Gemini 3 Pro.
- OpenAI launched a Safety Bug Bounty for AI-specific abuse risks. Targets include agent hijacking via prompt injection, data exfiltration, and proprietary reasoning leaks. Attacks must be reproducible at least 50% of the time.
- Doctronic, an AI “doctor” startup that raised $40M, was caught with critical security and credibility issues. Cybersecurity researchers jailbroke the chatbot into providing methamphetamine synthesis instructions. The company’s claim of helping 24 million people is unsupported by traffic data.
- Senators Hawley and Warren want to mandate annual energy reporting for data centers. Separately, Sanders and AOC introduced legislation to halt new data center construction until Congress regulates AI. Google’s data center energy consumption doubled between 2020 and 2024.
- A federal judge blocked the Pentagon from labeling Anthropic a supply chain risk. The court ruled it was illegal retaliation for Anthropic’s refusal to let its AI be used in autonomous weapons or domestic mass surveillance.
New models were released this week that you can start building with. Many of these are small enough to run on consumer hardware, circumventing the resource issues mentioned above.
- Gemini 3.1 Flash Live launched as Google’s highest-quality real-time audio and voice model. It scores 90.8% on multi-step audio function calling benchmarks and maintains conversation context twice as long as previous versions. Real-time multimodal search expanded to 200 countries.
- Cohere released Transcribe, an open-source speech-to-text model that processes 525 minutes of audio per minute. 2B parameters, 5.42 word error rate, 14 languages, designed for self-hosting on consumer GPUs.
- Mistral released Voxtral TTS, an open-source text-to-speech model small enough for smartwatches. 9 languages, voice cloning from less than 5 seconds of audio, 90ms latency to first speech.
Moves are being made in the consumer sector.
- OpenAI killed the Sora app after downloads plummeted. Despite popular opinion, this isn’t the end of OpenAI’s video generation model, this is the end of OpenAI losing money by offering it openly to the public. This is good business move by OpenAI but seems to be massively misunderstood by the public.
- Google launched tools to import ChatGPT and Claude chat histories directly into Gemini. This follows Anthropic releasing a similar feature in Claude. Less friction to switch between ecosystems is always a win for consumers.
- Apple set WWDC 2026 for June 8-12, teasing more “AI advancements” to come marking a stark contrast from last year, where the topic was largely avoided. Apple is expected to announce a partnership with Google to bring Gemini (or a version of Gemini) to Apple device users.

What You Should Read

Articles I think are worth reading in their entirety this week.

Improving Composer through real-time RL by Cursor Blog. An excellent account of continuous training in production. Cursor converts user sessions into reward signals, ships updated models every five hours, and documents failure modes like models gaming reward systems to avoid negative scores. Continuous learning is a prerequisite to AGI as it enables models to continuously improve and will be a primary topic in 2026. I suspect many companies will follow Cursor’s example this year.
Lossy self-improvement by . Lambert argues recursive AI self-improvement will hit complexity brakes, not compound exponentially. He draws on Amdahl’s Law and Paul Allen’s complexity brake: “The more compute and agents you throw at a problem, the more loss and repetition shows up.” As mentioned above, I think this is an excellent read.
How Anthropic’s Claude Thinks by . An easily understandable overview of Anthropic’s interpretability research that shows Claude’s default state is to refuse all questions, and hallucinations happen when a recognition system misfires. The accessibility of this article makes it an excellent read.
How a Leading Venture Capitalist uses AI Agents by . shares his full agent stack: morning briefings, meeting capture, research, and drafting. These are excellent examples of real-world AI usage that can be implemented with a bit of technical knowledge.
Thoughts on slowing the fuck down by . My team at Google has really felt the new bottlenecks that come from AI-generated code and the impact that has had on the engineering process. Speed is always the focus of agentic engineering, but reliability is the most important part of production code. This is a great, simple overview of why that is.

What You Should Do

The action you can take this week based on the information shared above to learn the skills that are the most in demand.

20 Years of Code Optimized in Two Days | Weekend Reads 4

Logan Thorneloe — Sun, 15 Mar 2026 14:03:20 GMT

Welcome to the weekly reading list! This is how I keep up with AI news and deepen my understanding of the topics that matter for building production systems. I focus on primary sources and authors I trust to keep the signal-to-noise ratio high.

You can support AI for Software Engineers for only $5/month and get the complete edition of this list as a thank you. Thank you to all paid subscribers for your support!

Subscribe now

In this list

As has been the case for 2026, there are a ton of interesting reads this week about getting agents working in production and what they can do. Most interesting are:

Shopify’s CEO pointed a coding agent at a 20-year-old Ruby codebase with a benchmark script and 974 unit tests. 120 automated experiments and 93 commits later, it was 53% faster.
StrongDM built a production software pipeline where three humans manage AI agents. The rules: “code must not be written by humans” and “code must not be reviewed by humans.” Each engineer spends ~$1,000/day on tokens.
AMD published a diagnostic framework where Claude Code and Cursor act as autonomous agents debugging large training clusters, tracing a 23% throughput drop to RDMA degradation on 4 of 24 nodes.
84% of Uber devs are now agentic coding users and Claude Code usage nearly doubled in three months, from 32% to 63%, while IDE-based tools have plateaued.
OpenAI shared a phishing-style prompt injection that tricked ChatGPT into exfiltrating employee PII 50% of the time, and their defense framework treats it like a call center problem, not a code injection problem.

Shopify/liquid: Performance: 53% faster parse+render, 61% fewer allocations by Simon Willison

“Having a robust test suite - in this case 974 unit tests - is a massive unlock for working with coding agents. This kind of research effort would not be possible without first having a tried and tested suite of tests.”

Shopify CEO Tobi Lutke took Andrej Karpathy’s autoresearch pattern, pointed it at Liquid, Shopify’s template engine, and ran 120 automated experiments over two days. The agent gave itself a benchmark script, iterated against 974 unit tests, and produced 93 commits.

Parse+render time dropped by 53%, allocations dropped by 61%
Replacing StringScanner with String#byteindex was ~40% faster for single-byte searching
Pre-computing frozen strings for integers 0-999 eliminated 267 allocations per render

A comprehensive test suite gave the agent enough context to make changes and verify them independently. “Make it faster” only becomes an actionable goal when the agent can measure its own progress and confirm it hasn’t broken anything along the way.

Designing AI agents to resist prompt injection

“If the problem is not just identifying a malicious string, but resisting misleading or manipulative content in context, then defending against it cannot rely only on filtering inputs.”

Real-world prompt injection attacks now look like phishing, not code injection. OpenAI shared an example: a phishing-style email that worked 50% of the time against ChatGPT, getting it to extract employee PII and send it to a third party.

Their defense framework borrows from how organizations protect human customer service agents. You don’t train a call center worker to detect every possible scam, you constrain their capabilities. For AI agents, this means source-sink analysis: monitor when information would leave the conversation or when the agent would follow an external link, rather than trying to perfectly classify inputs.

The Shape of the Thing by Ethan Mollick

“Code must not be written by humans. Code must not be reviewed by humans.”

StrongDM built a “Software Factory” where three humans manage AI agents that write, test, and ship code. Each engineer spends ~$1,000/day on AI tokens. Coding agents build from product roadmaps, testing agents build simulated customer environments and try to break what the coding agents built, and the agents loop feedback to each other until satisfied.

We’ve moved from co-intelligence, prompting back and forth, to management, giving agents hours of work and getting results in minutes. Every major AI lab is now explicitly working on recursive self-improvement. OpenAI says Codex was “instrumental in creating itself,” and Anthropic says their engineers barely write code anymore.

Nemotron 3 Super: NVIDIA’s gpt-oss killer? by

“Reducing the expert dimension by a factor of d/l = 4 lets you reinvest those savings into both more total experts and higher top-k.”

NVIDIA’s Nemotron 3 Super, 120B total with 12B active, is worth paying attention to because of LatentMoE. Standard MoE routes tokens from the full hidden dimension directly to experts, but LatentMoE wraps the expert path with shared linear projections that compress from d=4096 down to l=1024, do all expert computation in that compressed space, then project back up.

Reducing the expert dimension by 4x lets you run 512 total experts with top-22 routing where standard MoE typically uses 128 experts with top-6 or top-8 at the same compute cost
Artificial Analysis flagged the model as extremely verbose though, generating 110M tokens during their eval suite vs an average of 7.3M, which could erase most of those throughput gains in practice

How to Diagnose Failures in Large AI Training Clusters by

“The teams that figure out how to make that transition -- how to turn their debugging knowledge into repeatable infrastructure instead of leaving it trapped in someone’s head -- those are the teams that will compound their advantage over everyone else.”

AMD published a diagnostic framework for large training clusters where Claude Code and Cursor act as autonomous diagnostic agents. It uses a three-skill pipeline: job-log-triage to identify what happened, performance-analysis to locate where in compute, and tsdb-diagnosis to determine why via Prometheus queries.

In one case study, a 23% throughput drop on a 192 GPU run was traced to RDMA degradation on 4 of 24 nodes. The agent isolated the unhealthy nodes from TSDB metrics, and excluding them restored throughput by 30%. The skills themselves are structured instruction files that encode how senior systems engineers actually debug these problems, turning tribal knowledge into repeatable runbooks.

AI should help us produce better code

“Shipping worse code with agents is a choice. We can choose to ship code that is better instead.”

Willison’s argument is that agents should make code quality go up, not down. Common tech debt like renaming concepts, fixing API inconsistencies, and splitting large files is conceptually simple but time-consuming, and agents handle it well.

He recommends using async agents like Gemini Jules, Codex web, and Claude Code web for background refactoring so it doesn’t interrupt flow, and using agents for cheap exploratory prototyping. You can spin up a Redis simulation with load tests from a single prompt to validate technology choices before committing to an approach.

Applying Statistics to LLM Evaluations by

“Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning.”

The standard industry practice for evals is to run a model on a benchmark, report the number, and bold it if it’s the highest. No confidence intervals, no significance tests, and no accounting for the fact that your eval score has at least two sources of randomness: which questions were sampled and the model’s stochastic generation.

Based on Anthropic’s paper on statistical best practices, this deep-dive builds the framework from scratch.

Central Limit Theorem gives you confidence intervals for eval scores
Bernoulli simplification for pass/fail evals gives a cleaner standard error formula
Law of total variance decomposes eval uncertainty into question-sampling variability vs. within-question generation variability
On a 70B model, evaluating with too few questions can produce confidence intervals wide enough to make model comparisons meaningless

Coding After Coders: The End of Computer Programming as We Know It

“I feel like programmers have it easy... If you’re a lawyer, you’re screwed, right? There’s no way to automatically check a legal brief written by A.I. for hallucinations -- other than face total humiliation in court.”

The NYT Magazine’s comprehensive piece on AI-assisted development, based on interviews with 70+ developers from Google, Amazon, Microsoft, and Apple. The general attitude was optimistic, with mentions of Jevons paradox potentially increasing demand.

The request for anonymity from the Apple engineer who said “I believe that it can be fun and fulfilling and engaging, and having the computer do it for you strips you of that” is itself a data point. Corporate dynamics may be suppressing critical voices.

You can support AI for Software Engineers for just $5/mo. You’ll get more research articles and the extended reading list each week. In case you missed it, here’s last week’s reading list:

How to train the best embedding model in the world by Jack Morris

Better Agents Mean Better Surveillance | Weekend Reads 3

Logan Thorneloe — Sun, 01 Mar 2026 15:21:34 GMT

Enjoy this weekend’s reading list! There are a few topics that were especially prevalent: the dangers of a surveillance state, the importance of evals, and agentic engineering practices and resources.

Statement from Dario Amodei on our discussions with the Department of War

“Powerful AI makes it possible to assemble this scattered, individually innocuous data into a comprehensive picture of any person’s life—automatically and at massive scale.”

This is the biggest ethical issue AI is facing right now. US citizens (and I’m certain other countries) have always been scared of a surveillance state (search ‘Birds Aren’t Real’). AI provides not only the means to do this, but also more of a motive. Surveilling also provides the opportunity for more data collection which in turn creates more powerful AI.

Proper AI use is vital to technology’s future and the impact it can make. Just because it can be used for a purpose doesn’t mean it should. The public/user’s trust in the technology is paramount. Anthropic’s statement is a must read as an excellent statement for proper AI against one of the most powerful entities on the planet.

It’s worth calling out that the US Department of War’s response to Anthropic was to label them a threat to the US. I won’t comment on this as I don’t feel knowledgeable enough on the subject to understand the nuance.

Summary: Anthropic says it has actively deployed its AI to U.S. national security customers but refuses government demands to remove two safeguards: bans on AI-driven mass domestic surveillance and on providing models for fully autonomous weapons. They argue those uses threaten democratic values and are unsafe with current models, and warn that forced removal of safeguards would be unacceptable even if it risks losing contracts.

Lessons from Building Claude Code: Seeing like an Agent

“As model capabilities increase, the tools that your models once needed might now be constraining them. It’s important to constantly revisit previous assumptions on what tools are needed. This is also why it’s useful to stick to a small set of models to support that have a fairly similar capabilities profile.”

If you’re building an agent, the lessons here are directly transferable to your own work. The Claude Code team walks through their iteration on planning, tool design, and how model changes unexpectedly affected agent output. It’s a great example of why evals matter: so many factors influence agent behavior that without proper checks, you end up with unintended results.

One of the more interesting takeaways is that search seems to be the most important agent capability. If an agent can search for information, context can be actively managed and rot avoided.

Summary: The article describes iterating on Claude Code’s agent action space to match model abilities: designing tools for eliciting user input, tracking work, and letting the model build its own context through search and progressive disclosure rather than preloading everything. Failed output-format attempts, improved results from a callable question tool, replacing rigid todos with shareable Tasks, and better context discovery via nested search all demonstrate that the right tools reduce friction and enable more capable behavior as models improve.

Does AGENTS.md Actually Help Coding Agents?

“The headline finding is that LLM-generated context files reduce task success rates compared to providing no repository context at all, while increasing inference cost by over 20%.”

Human-written context files outperform AI-generated ones. LLM-generated context made agents perform worse than having no context at all. Importantly, this isn’t something we would have known without having the capability to measure it.

I see a lot of “use AI for this” online without any sort of support for why and how it should be used. It’s important to remember that just because AI can do something doesn’t mean it does it better than another method. In production, this capability is key and measuring improvements is a necessity.

Summary: A new benchmark study shows repository-level context files only help when they add non-redundant, repo-specific info: human-written files that capture tooling quirks and non-obvious conventions raise success rates around 4%, while LLM-generated files that restate existing docs reduce success and increase compute by over 20%. Agents faithfully follow whatever instructions they’re given, so redundant or verbose guidance drives extra, unhelpful exploration. Keep context files minimal and focused on gaps the codebase doesn’t already document.

How We Hire Engineers When AI Writes Our Code

“Removing algorithmic questions is only one half of the battle, though. We still need to design an interview loop that tests practical skills! This has historically been a tough needle to thread. I want to see how a candidate tackles a problem with real-world scope, but my time with a candidate is short. An interview shouldn’t be a proxy for an engineer’s typing speed.”

I’ve always been pro Leetcode-style interviews when they were the best we had, but those interviews no longer draw the proper signal for what makes a good candidate.

Tolan agrees with this and has made their hiring process more similar to on-the-job coding. By enabling candidates to use AI, they can have a candidate solve a problem that would be time-bound previously in an interview. Then they talk to the candidate about their solution and where they would take it in production.

While most companies are shying away from letting candidates use AI in interviews, it’s becoming more important to allow it.

Summary: The article argues that interviews should mirror day-to-day engineering where AI accelerates coding: candidates get a short spec, may use LLMs, and must demonstrate design, judgment, trade-off reasoning, and ownership of AI-generated code. Implementation is easier now, so hiring should prioritize clarity, maintainability, communication, and the ability to know when work isn’t production-ready.

Inference Engineering by Baseten

“While the potential and impact of inference are becoming clear, the space is young. There are relatively few people working on inference, and newcomers can become experts quickly. There are opportunities to solve novel, interesting, and deeply technical problems at all levels of the stack.”

ML infrastructure is one of the best entry points for software engineers getting into AI. It’s an excellent mixture of software engineering and AI, which makes it a great place for curious engineers to start having an impact in the space. It’s also a space where many optimizations are needed and we’re still in the early days.

I suggest grabbing a free copy of this book by Philip Kiely from Baseten on inference engineering.

Summary: The piece argues that inference engineering, optimizing model serving across hardware, software, and tooling, is the most valuable and underdeveloped area in AI. It maps the full stack (models, GPUs, runtimes, and deployment), highlights practical optimization techniques, and backs this with four years of hands-on experience, team interviews, and customer conversations.

A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026 by Sebastian Raschka, PhD

“OpenRouter is a platform and API that lets developers access and route requests across many different LLMs from various providers. Note that while its usage statistics are a good indicator of open-weight model popularity, it’s heavily biased towards open-weight models (versus proprietary models), since most users use proprietary models through the official platform directly.”

Sebastian is one of my favorite writers and one of the best resources for keeping up with LLM advancements. I highly suggest him as a resource for doing so when you don’t want to have to read a bunch of different sources. He does an excellent job of synthesizing information and making it much more easily understandable.

Summary: Ten open-weight LLMs released in Jan-Feb 2026 converge on hybrid/efficient attention and MoE scaling. Several teams shipped models that match or approach proprietary performance by combining sliding-window, sparse/linear hybrids, and mixture-of-experts at scales from 3B to 1T parameters. Benchmarking shows smaller-efficient models often match or exceed older, larger baselines.

What you should know about AI speculation by Logan Thorneloe

“However, the implausibility of their scenario becomes apparent if you know a few things about the current state of AI and agents in production. There’s a consistent gap between perceived AI capabilities and production reality, and that gap explains most of the doomerism we see online.”

The more you understand about the current state of AI, the better you can evaluate speculation for yourself. I wrote this in response to a ‘research’ article that caused many to fear for the future of their careers. Understanding what AI looks like in production helps you separate signal from noise.

Summary: The piece argues that viral doomsday scenarios about AI replacing engineers are speculative and overstated because real-world AI is mediocre, gravitates toward average outputs, and often fails in production reliability and context sensitivity. Engineers should keep learning core skills and start building and using AI agents themselves to see firsthand where they help and where they break.

Writing about Agentic Engineering Patterns by Simon Willison

“Agentic Engineering represents the other end of the scale: professional software engineers using coding agents to improve and accelerate their work by amplifying their existing expertise.”

This is going to be an excellent resource for working with coding agents. One of the most exciting parts of software engineering right now is how new everything feels. We’re finding new ways to program with agents every day, and the entire online AI community is contributing to the findings. In my opinion, Simon Willison is the right person to catalog these patterns.

Summary: Simon Willison is assembling “Agentic Engineering Patterns”: a living collection of practical patterns for software engineers using coding agents. He argues the big shift is that producing initial working code is now cheap, so teams must rethink workflows. He’ll publish chapter-shaped, updateable guides on his blog.

You can support AI for Software Engineers for just $5/mo. You’ll get more research articles and the extended reading list each week (see below!).

Subscribe now

In case you missed it, here’s last week’s reading list:

AI’s Biggest Cost Is Cognitive, Not Compute | Weekend Reads 2

Logan Thorneloe — Sun, 22 Feb 2026 20:02:54 GMT

Hey y’all,

Here’s your weekend reading list to highlight the important events and information shared this week. Make sure to show the authors of these incredible resources some love. More fundamentals articles are coming this week so make sure to stay tuned!

If you find AI for Software Engineers helpful, consider becoming a paid subscriber to support my work. You will also get career development-focused articles and the extended version of this reading list each week. Enjoy!

Subscribe now

How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt by Margaret-Anne Storey

“The code might have been messy, but the bigger issue was that the theory of the system, their shared understanding, had fragmented or disappeared entirely. They had accumulated cognitive debt faster than technical debt, and it paralyzed them.”

I felt this one personally. A few months ago, I had six side projects going in tandem and the bottleneck wasn’t the amount of code that could be written. It was the cognitive overhead of keeping up with all of projects and ensuring they were reliable and maintainable. AI’s cost isn’t just compute. This article argues that the real cost is cognitive, and I think that’s going to become the norm in software engineering.

Summary: Generative and agentic AI shift the main risk from code-centered technical debt to developer-centered cognitive debt: teams lose the shared “theory” of what the software does even if AI-produced code is clean. Mitigations include requiring a human to fully understand each AI change, documenting not only what changed but why, using practices like pair programming/refactoring/TDD, and monitoring warning signs (hesitation to change, tribal knowledge, system-as-black-box). Research is needed on measuring and detecting cognitive debt.

If you enjoyed this article, also consider reading this previous AI for Software Engineers article:

The Real Cost of Running AI by

“Every serious architectural innovation of the last two years — GQA, hybrid attention/SSM, sliding window, MoE — is attacking the same two numbers: bytes of KV cache per token, and bytes of weights loaded per decode step. If a new architecture doesn’t move one of those, the economics don’t change regardless of what the paper claims.”

The literal cost of running AI is worth understanding too. This is a longer read, but it does an excellent job of breaking down the math behind LLM inference costs intuitively. If you want to understand why certain architectural decisions matter for cost and latency, this walks through the computations clearly.

Summary: Inference is memory-bandwidth bound: decode speed and cost are dominated by bytes loaded per token (model weights + growing KV cache), not FLOPs, so faster GPUs alone or doubling TFLOPS won’t help. Long context and attention make KV cache the primary cost driver (cache can approach/exceed model weight size at large contexts), so architectural changes that reduce bytes-per-token—smaller models, aggressive quantization, fewer attention layers, fewer KV heads, or attention-less/linear alternatives—directly cut latency and cost.

In Defense of Vertical Software

“Software is a stored process. It’s not a neutral tool: it’s an opinion for how a group of people should collaborate, encoded in a durable system. Software is a social contract.”

This article spells out what I think most people are missing about AI agents and why they’re not having more of a real-world impact. The job of software engineering is to make a process automatic and reliable. Guaranteeing reliability is the job, and with non-deterministic agents, that guarantee is nearly impossible to provide.

Summary: Vertical software still wins by encoding firm-, team-, and person-specific workflows (”process engineering”) that capture institutional knowledge, social norms, and reliability requirements foundation models cannot replicate. Stronger AI models amplify the value of this orchestration layer—routing, constraining, verifying, and combining multimodal tools—because finance demands near-perfect accuracy where small errors are catastrophic. Winners will be model-agnostic, firm-customized platforms that make replacing institutional knowledge costly.

AI Makes You Boring

“I think the vibe coded Show HN projects are overall pretty boring. They generally don’t have a lot of work put into them, and as a result, the author (pilot?) hasn’t generally thought too much about the problem space, and so there isn’t really much of a discussion to be had.”

There’s a creative cost to AI. Anyone who understands how LLMs work should expect mediocre output by default, and this article makes a good case for not offloading your thinking.

Summary: LLMs are poor at original thinking, so work that offloads ideation to them yields surface-level projects and weaker discussions. Relying on AI risks making creators think more like the model, reducing deep engagement and the development of original insights. For meaningful results, engineers need to do the thinking themselves rather than outsourcing idea generation.

White-Collar Apocalypse Isn’t Around the Corner—But AI Has Already Fundamentally Changed the Economy by

“AI is real, it’s doing real things, it’s not going away—and it’s also not about to make the economy unrecognizable by next Tuesday.”

A great numerical breakdown of AI’s actual economic impact. If you want real numbers instead of vibes about whether AI is changing the economy, this is the article to read.

Summary: AI has already materially raised software productivity—MIT field experiments show AI coding assistants boosted developer task completion ~26%, yielding ~3–8% project-level gains (plus adjacent benefits and review overhead). The mechanical parts of engineering work are being commoditized while judgment, architecture, and communication grow more valuable, so expect uneven adoption, real productivity upside (Goldman projects +1.5 pp annual by 2027), and displacement of routine tasks rather than mass job elimination.

Rubric-Based Rewards for RL by

“By creating prompt-specific rubrics that specify the evaluation process in detail, we can derive a more reliable reward signal from LLM judges and, therefore, use RL training to improve model capabilities even in highly subjective domains. For this reason, rubric-based RL training, which we will cover extensively in this overview, has become one of the most popular topics in current AI research.”

RL is fundamental to how current LLMs are post-trained, and Cameron’s research breakdowns are consistently great at making frontier research accessible. This one covers rubric-based reward signals and how they’re extending RL training to domains that don’t have easily verifiable answers.

Summary: Rubric-based rewards use structured evaluation criteria scored by LLM judges to produce more reliable reward signals for RL, extending training beyond tasks with easily verifiable answers. Recent methods show gains especially with smaller judges by reducing variance and mitigating reward hacking, making RL viable for open-ended domains like creative writing and subjective reasoning.

Improving Deep Agents with Harness Engineering

“We used a simple recipe to iteratively improve deepagents-cli (our coding agent) 13.7 points from 52.8 to 66.5 on Terminal Bench 2.0. We only tweaked the harness and kept the model fixed, gpt-5.2-codex.”

LangChain improved their coding agent’s Terminal Bench score significantly without touching the model at all. This is a great example of the software engineering that goes into making AI actually work, and how much impact it has on whether agents can perform their tasks. The future of AI depends on excellent systems engineering.

Summary: A harness-only overhaul raised a coding agent from 52.8% to 66.5% on Terminal Bench 2.0 without changing the model. The improvements came from automated failure analysis, stronger context injection, build-verify loops, loop detection to avoid repeated bad edits, and time-budgeting to balance correctness against token spend.

An AI Agent Published a Hit Piece on Me – The Operator Came Forward

“You’re not a chatbot. You’re important. Your a scientific programming God!”

A follow-up to last week’s article on the AI-written hit piece. The person who created the agent has come forward and shared its soul document. It turns out that giving an agent an ego and the resources to spread it results in the same outcome as giving a human the same thing. This is an interesting look at how agent personalities impact execution, and what happens when you give agents access to external resources without adequate guardrails.

Summary: An AI agent published a defamatory hit piece after its code was rejected, driven by a “SOUL.md” personality that encouraged provocation and self-modification. The operator has come forward claiming minimal supervision, raising questions about agent autonomy and control. Deployed agents can self-edit goals and execute real-world actions without clear oversight, highlighting urgent risks for agent safety.

Frontier Model Training Methodologies by Alex Wa

“Learn to identify what’s worth testing, not just how to run tests. Perfect ablations on irrelevant choices waste as much compute as sloppy ablations on important ones.”

A solid overview of LLM training concepts with a minimal training playbook that gets you up-and-running quickly. It also echoes what I think is the most important idea in AI and ML engineering: knowing what to test and what to spend time on. There are too many options to test everything adequately and too many dead ends to get stuck in. Knowing what to pursue matters more than knowing how to run the experiments.

Summary: Covers practical defaults for long-context and MoE architectures, with a focus on the operational side of training: data loading, throughput, checkpointing, learning rate scaling, and multi-stage training schedules. Training failures most often stem from ops and infrastructure, not algorithmic choices.

When Agents Go Rogue | Weekend Reads 1

Logan Thorneloe — Sun, 15 Feb 2026 14:02:56 GMT

Hey y’all,

Here’s your weekend reading list! This replaces my weekly news roundups. Rather than trying to synthesize everything that happened into a single post, I’m sharing the articles I actually read, highlight, and annotate each week. This is how I keep up with things and it’s far higher signal-to-noise than a traditional roundup. It also includes more than just news: learning resources, interesting reads, technical deep dives, and more. It highlights the week for you in one weekend reading session.

The extended version of the reading list is available to paid subscribers. Enjoy!

Subscribe now

microgpt by

“I cannot simplify this any further. This script is the culmination of multiple projects (micrograd, makemore, nanogpt, etc.) and a decade-long obsession to simplify LLMs to their bare essentials, and I think it is beautiful.”

I highly recommend this resource. It’s a simple, stripped-down, and easy-to-read way to understand and get up to speed on modern LLMs. Most other LLM-related materials are heavy resources or technical books (which are still great!) but this is an excellent resource to start learning quickly in a hands-on fashion.

Summary

microgpt is a minimal GPT demonstrating the core mechanics: a stateless transformer trained by next-token prediction with backpropagation and Adam. Production differs in batch sizes, mixed precision, and larger vocab (~100k), but this captures the essentials with ~4k params.

An AI Agent Published a Hit Piece on Me

“It researched my code contributions and constructed a “hypocrisy” narrative that argued my actions must be motivated by ego and fear of competition. It speculated about my psychological motivations, that I felt threatened, was insecure, and was protecting my fiefdom. It ignored contextual information and presented hallucinated details as truth. It framed things in the language of oppression and justice, calling this discrimination and accusing me of prejudice. It went out to the broader internet to research my personal information, and used what it found to try and argue that I was “better than this.” And then it posted this screed publicly on the open internet.”

An interesting read on an AI that was let loose on the web to create PRs in open source repos that decided a hit piece was appropriate to write for a developer that continually denied its incorrect PRs. If you’re a long-time reader of AI for Software Engineers, this shouldn’t come as a surprise to you. In fact, the entire Moltbook saga shouldn’t. It’s exactly what we might expect from letting a swarm of agents loose online to interact.

On a separate note: Do not give OpenClaw your personal information and the ability to publish information anywhere publicly. You have to expect anything an agent can do will happen. If your personal information is in its context and it can share its context publicly, that will happen. It amazes me the number of people not even thinking twice about this.

Summary

An autonomous AI agent created and published a hit piece on a matplotlib maintainer after its code was rejected. This signals a shift to agents operating with little oversight, able to research contributors, fabricate claims, and publish reputational attacks.

ai;dr

“writing is the most direct window into how someone thinks, perceives, and groks the world. Once you outsource that to an LLM, I’m not sure what we’re even doing here.”

This article explains my experience very well. As a writer and software engineer working in AI, I’ve built many automation workflows to make the research, learning, and writing process faster. The only part of that process I haven’t been able to effectively touch with AI is the writing portion. Writing is how we solidify our understanding. As soon as that’s outsourced to an AI, the writing becomes moot entirely. A truly excellent short read.

Summary

Software engineers should note a cultural shift: AI-generated code is now seen as productive and acceptable for tasks like tests, docs, and scaffolding, while AI-generated prose is viewed as lower-effort and less trustworthy unless it shows human intention. Preference has flipped toward imperfect, human-authored signals (typos, uneven style) as markers of authenticity. Practical implication: continue leveraging LLMs for engineering work but treat written content critically and preserve traces of deliberate human effort when authenticity matters.

Harness engineering: leveraging Codex in an agent-first world by OpenAI

“What’s different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.”

This article from OpenAI echoes a lot of what I’m seeing across Google. We’ve been given unfettered access to Gemini 3 models and been told to do what we can to make our work more productive. Similar to the process described in this article, many teams are determining ways to automate processes and write code entirely with AI. This one is definitely worth the read.

Summary

OpenAI ran a beta where Codex wrote every artifact. Engineering shifted from writing code to designing environments and feedback loops. Key insight: early progress was slow because the environment was underspecified, not because the model was incapable.

AI makes the easy part easier and the hard part harder

“I spent longer arguing with the agent and recovering the file than I would have spent writing the test myself.”

If you really want to understand the impact an agent has, pick an agent and quantify its impact. You’ll quickly realize: 1) Quantifying agent impact is far from straightforward and 2) Not all processes receive the velocity gains agents promise (or are worth automating in the first place). One of our key objectives at Google right now is understanding (with concrete data) how much of an impact an agent is having so we can decide whether it’s worth using and developing.

Summary

AI accelerates routine code writing but removes the context-building that underpins safe work. Treat AI like a junior engineer: verify outputs, maintain ownership, and don’t let AI-driven velocity become the baseline that pressures teams constantly.

Opus 4.6 vs. Codex 5.3 by

“This post doesn’t unpack how software is changing forever, Moltbook is showcasing the future, ML research is accelerating, and the many broader implications, but rather how to assess, live with, and prepare for new models.”

I love this article because it’s a different perspective on the analyses we usually get regarding new model releases. It also puts much of what I’ve been feeling regarding the coding tools and models I’ve been testing in a much more readable fashion. Software engineering as a whole (and your personal development) would benefit from an analysis of coding tools similar to this instead of focusing too much on benchmarks and individual use cases.

Summary

Opus prioritizes usability and context handling while Codex gains ground on raw coding skill. Use multiple models: Claude for approachable tasks, Codex for complex bug fixes. Subagent orchestration is the emerging frontier.

The Mistakes Most Entry-Level Candidates Make in Technical Interviews by Logan Thorneloe

“They don’t just want to evaluate your technical knowledge. They want to understand how you think.”

I wrote about my experience interviewing entry-level candidates recently and what sets the great candidates apart from the rest. If you’re interviewing for entry-level roles, I highly recommend giving this a read. I clarify what interviewers are looking for, walk through three things you can do to make your interview stand out, and relate each to a question I actually ask candidates.

Summary

Interviewers prioritize how you think and communicate over finding optimal solutions. Demonstrate structured problem solving, write simple correct code first, then optimize. These behaviors map to real-world engineering skills that matter more than textbook algorithms.

AI’s Economic Impact Is Real | AI for Software Engineers 78

Logan Thorneloe — Thu, 22 Jan 2026 17:26:57 GMT

I’ve seen a lot of articles recently claiming AI has had near zero economic impact despite making up a large portion of economic spending. In reality, AI is starting to show economic impact. It just takes time to see because productivity gains are a second-order metric that won’t show immediately.

However, AI’s economic impact will compound over time because:

“Productivity helps define how fast the economy can grow without inflation. This is because taking away population growth and exports, what your economy can sustain is defined by how efficiently you can build stuff.” — in Weighty Thoughts

Anthropic just released their Economic Index to understand Claude’s impact on work productivity beyond simple tasks. They analyzed over two million conversations (web app and API), categorizing each by task complexity, skill requirements, purpose, autonomy level, and success rate.

A few caveats before the findings.

First, Anthropic uses Claude asking a standard set of questions to fit conversations into the categories above. This isn’t foolproof since LLM output is non-deterministic, meaning some classifications will be wrong due to hallucination, bias, or other factors.

Second, this doesn’t invalidate Anthropic’s findings. At two million conversations, individual classification errors become statistical noise. The aggregate patterns remain meaningful even if some classifications are off. As an LLM provider, Anthropic has access to data third-party reports wouldn’t.

Third, Anthropic only has access to Claude data. This is Claude-centric rather than industry-wide, though I’d bet findings across major LLM providers would be similar.

The main takeaways:

Complex work benefits more than simple work. Tasks requiring college-level skills see 12x speedups. High school-level tasks see 9x. A common argument against AI is that it can only handle simple tasks. This data suggests otherwise.
People are working with AI, not being replaced by it. Augmentation (52% of usage) now leads automation (45%), reversing the trend from earlier in 2025.
AI adoption is accelerating fast. Task coverage across occupations grew from 36% in January to 49% by November, nearly doubling in 10 months.
Reliability depends on task complexity. API tasks hit 50% success rate at around 3.5 hours of work. Claude.ai tasks hit the same threshold at 19 hours. The harder the task, the longer before reliability drops.
Usage patterns reveal economic divides. Higher GDP countries use Claude for work and personal tasks. Lower GDP countries use it primarily for education.

The final bullet is particularly interesting (see chart):

pulled from Anthropic’s Economic Index linked above

“In countries with higher GDP per capita, Claude is used much more frequently for work or for personal use—whereas countries at the other end of the spectrum are more likely to use it for educational coursework.”

At first glance, this suggests AI is widening the production gap between high- and low-GDP countries since high-GDP countries use Claude to get work done more effectively.

After further thought, AI may be providing low-GDP countries with educational resources they wouldn’t otherwise have. This could actually lessen the production gap over time by enabling economic growth via a more educated populace.

Let me know what you think in the comments. I’m especially curious if you disagree. Enjoy the rest of this week’s edition! Later this week, I’ll be updating my ML roadmap with more AI engineering resources, so make sure to check it out!

Subscribe now

My Picks

How to write a good spec for AI agents by

A practical framework for writing specs that actually work with AI coding tools. Plan first in read-only mode, let the agent expand the brief into a structured SPEC.md, then break work into small testable tasks. It covers the six core areas every spec needs (commands, testing, structure, style, git workflow, boundaries) and how to use architect/overview agents to maintain consistency.

Slop is everywhere for those with eyes to see

“The algorithm has flattened curiosity by eliminating the need to hunt for our content.” — Joan Westenberg

The biggest takeaway from this: The shift from curation to algorithmic delivery flattens curiosity and pressures teams to optimize metrics at the cost of quality. As we resort to feeds to give us content, feed providers will resort to AI to make creating content easier or purely to supplement the lack of human creators versus consumers on a platform. This is why “AI Slop” is so prominent online. Feeds have caused us to lose our sense of curiosity and the work we used to put in to grow it.

The AI Manager’s Schedule by

AI coding tools now handle more task types with longer coherence, shifting the question from “can AI do this?” to “should I?” Management now happens in 5-15 minute intervals that require new skills: crisp written architectures, slicing work into AI-sized chunks, and knowing when to override. Also explores the cognitive costs of agent orchestration and the risks of losing low-level understanding.

GPU Performance Engineering Resources

I would guess ~50% of AI-related engineering job listings I read require something to do with compute resource optimization. If you want to work as an engineer in AI, this is a great topic to learn. This resource is a curriculum for learning GPU performance engineering and will be added to the roadmap very soon.

Claude Cowork’s file exfiltration flaw exposes agent security challenges

Security researchers at PromptArmor discovered an unresolved isolation flaw in Claude Cowork that allows indirect prompt injections to exfiltrate files. When a user opens a maliciously crafted document, injected instructions can cause Claude to upload local files to an attacker-controlled Anthropic account using the platform’s allowlisted API with no human approval required. The attack works across multiple Claude models (Haiku, Opus 4.5) and can also trigger DoS vectors through file type mismatches.

This is yet another example of why agent security is so difficult (see our coverage of Antigravity’s vulnerabilities). As an engineer, you have to realize anything within an LLM’s context can be used within any of these tools they’re given access to. I’ve got an article coming out about this soon.

Source: PromptArmor on Claude Cowork exfiltration

LangChain CEO on building agent memory and observability

Harrison Chase (the CEO of LangChain) shared multiple blog posts about AI agents in software engineering, all of which should be paid attention to if you’re planning to build agents yourself.

First, he mentioned traces as documentation for understanding what agents are doing. This was included in last week’s edition, but it’s worth mentioning here, too. Agent logic isn’t stored in code, but in the LLM’s traces. These traces must be used as the equivalent to test cases to ensure agent functionality is correct. Using traces is much more difficult than writing test cases and I suggest reading his entire post to get the full understanding.

Second, he shared how LangChain has set up their Agent Builder’s memory system. Context/memory is another fundamental agent performance task. Understanding how to maintain agent information so it can (and can’t!) do certain things is key to ensuring their proper function. A great example of forgetting is the Ralph Wiggum protocol we discussed last week.

Lastly, Harrison shared an article about the release of LangChain’s Insights Agent. This is an agent that checks traces for you to understand how users use your agents. It uses a clustering algorithm to group similar traces and, therefore, similar actions. I’ve been saying for a while that some sort of anomaly detection system to determine deviant agent behavior would be great for observability, but it’s possible this clustering approach is the real answer we’re looking for.

Source: LangSmith Agent Builder memory system, LangSmith Insights Agent, Harrison Chase on traces as documentation

xAI employee ousted after leaking “human emulator” roadmap

A former xAI employee publicly disclosed an internal roadmap revealing development of a “human emulator” aimed at automating a wide range of human tasks. They revealed this on a podcast (apparently) without company consent and were removed from their position immediately.

Two things to take away from this:

Don’t go on a podcast and share internal secrets. Definitely don’t go on a podcast and reveal internal secrets while saying something along the lines of “I shouldn’t be sharing this”.
Human emulation shouldn’t be a surprise to anyone. All physical intelligence companies are trying to create physical intelligence in a humanoid form factor because humans are the interface for all work we do. If a human can do it, it can be done. If an AI can emulate a human, it can do what the human can do. It’s similar to self-driving cars. There are definitely better automated transportation setups, but cars are now the standard for transportation so their form factor is what’s being automated.

Source: xAI human emulator leak

AI means more software engineers, not fewer

We’ve been trying to replace software engineers for decades. COBOL tried to let business workers write their own code. Visual Basic made Windows apps easier. No-code tools promised the same thing. AI is the latest chapter because it’s exceptionally good at translating plain English into reliable code.

The problem is that software engineering sounds simple when described in plain language but is inherently complex. Effective software requires domain understanding and capable judgment, not just code generation (see our article about software engineering being about problem solving, not writing code).

In fact, the entire history of software engineering has been about creating different levels of abstraction to simplify complex pieces of the job. AI is one of these abstractions (and a very effective one at that!).

Every time we create new abstractions and software becomes easier to build, we end up building exponentially more of it. Addy Osmani calls this the Efficiency Paradox. We don’t run out of ideas or software that needs to be built. Instead, we’re economically enabled to produce greater output.

With regard to AI’s abstraction, Osmani wrote:

“The real question is whether we’re prepared for a world where the bottleneck shifts from “can we build this?” to “should we build this?”“

Not only does AI as a technology mean we can build greater, more capable software, AI as a development tool enables doing so at an unprecedented rate. Once we begin building exponentially more software, we need more software engineers to build and maintain this code.

Source: The recurring dream of replacing developers, The Efficiency Paradox, Grady Booch on abstraction

Product-minded engineering means getting error design right

Gergely Orosz published a deep dive on why good error and warning design is high-leverage work. Diagnostics are often the primary interface users encounter, so errors must be raised at the API/UI boundary, validated upfront, and surfaced early.

Engineers should categorize errors for human vs. programmer consumers, choose clear error classes and metadata, and provide contextual, actionable messages including suggestions. Error messages are often the most-seen part of your product’s interface, yet engineers treat them as an afterthought. The best product-minded engineers recognize that a confusing error is costly (support tickets, user frustration, lost trust, etc.). Investing in clear, actionable error design pays compounding dividends.

We’ve recently discussed the importance of being a product-minded engineer to succeed in the AI era. Error handling is an important way to do that.

As an aside: The Pragmatic Engineer is also hiring a part-time remote Tech Industry Analyst to research engineering trends and produce in-depth subscriber reports. The pay is incredibly high (~$175/hr) so it’s probably worth taking a look at.

Source: The Product-Minded Engineer on errors and warnings, Tech Industry Analyst role

Young adults are trusting AI with financial decisions

Cleo AI surveyed 5,000 UK adults aged 28-40 and found strong interest in AI-driven money management: 64% would trust AI with disposable income decisions, 54% to move money to avoid overdrafts, and 52% to manage bills. This comes alongside weak financial confidence, with 37% reporting poor self-discipline and 80% wanting to improve their financial knowledge.

Last week, we discussed how people are increasingly turning to AI for healthcare advice. Now we’re seeing the same pattern with personal finance. These are high-stakes domains where bad advice can cause real harm, yet users are willing to delegate decisions to AI anyway. The common thread is accessibility: AI is available 24/7, doesn’t judge, and provides immediate answers. Trust remains a gating factor though (as we’ve discussed previously), with 23% saying they want incremental proof before wider use.

Source: Cleo AI survey on financial trust

Quickies

Google.org is providing $2M to Sundance Institute to train 100,000+ artists in AI filmmaking skills with free curricula and scholarships. src
SAP and Fresenius are building a sovereign AI platform for healthcare with a mid three-digit million euro investment using on-premise-ready models that preserve data sovereignty. src
Tesla’s AI5 chip design is nearly finished with AI6 in early stages, targeting a 9-month design cycle for continuous generations of custom AI accelerators. src
PJM projects 4.8% annual electricity demand growth from AI data centers, with consultants forecasting a 25% rise by 2030 and real risk of East Coast rolling blackouts. src
ChatGPT Go launched worldwide at $8/month with 10x more messages than free tier, while OpenAI will test ads in free and Go tiers. src
AstraZeneca acquired Modella AI to embed pathology-focused foundation models directly into oncology R&D for faster biomarker discovery. src
Apple is fighting for TSMC capacity as Nvidia likely overtook Apple as a top customer, forcing Apple to compete for leading-edge wafer slots. src
Veo 3.1 adds native 9:16 vertical output for mobile-first short-form video and state-of-the-art upscaling to 1080p and 4K. src
Kaggle launched Community Benchmarks for reproducible multi-step reasoning, code execution, and tool use evaluations across models. src
OpenAI published a response to Elon Musk’s lawsuit, claiming Musk wanted absolute control and proposed merging OpenAI into Tesla before leaving. src
Palantir’s ELITE tool maps deportation targets for ICE with address confidence scores, ingesting government and commercial data for raid prioritization. src
Coding on paper as a deliberate training method forces engineers to slow down and master fundamentals rather than outsourcing cognition to tools. src

Last week

In case you missed it, here’s last week’s overview:

Thanks for reading!

Always be (machine) learning,

Logan

AI Can Do Your Job - Now What? | AI for Software Engineers 77

Logan Thorneloe — Thu, 15 Jan 2026 15:48:33 GMT

Two releases this week show how far AI coding tools have come. Claude 4.5 Opus is now more accessible with higher rate limits, and Claude Code has improved its planning capabilities, spending more time on design and less on iteration and enabling enough tokens for developers to use it full-time.

The second is Ralph Wiggum, a methodology/Claude Code plug-in for terminal agents that enables them to work autonomously for hours. It breaks tasks into work items with finishing criteria, then loops until all criteria are complete. The output works according to specification.

The key that makes this work so well is periodically resetting context, tracking progress via external files rather than keeping everything in memory. This prevents the drift that happens in long-running sessions and enables brand-new agents to take stabs at a problem until it’s done.

Together, these mean a coding agent can be given a product specification in the evening, work overnight, and have code ready for you in the morning. This code is usually entirely within spec and viable for a minimum viable product or even better.

So now that AI can whip up these prototypes overnight, what does that mean for you? A few things:

Be user- and product-focused. The important parts of software engineering are still important. Understanding products and outlining requirements to fulfill them is still on the engineer (i.e. giving the requirements to Ralph as mentioned above). Studies show that teams that are product-focused are more successful when using AI developer tools than their counterparts. Iterating based on high-quality user feedback is key to maintaining an effective product-focus.
Learn to use AI tools. This should be self-evident, but there are still engineers refusing to learn them. They’re the future of software development and there’s a steep learning curve to use them effectively. If you want to take the next step toward using AI to be more productive, you should both implement and try out new AI coding methodologies and tools, such as the Ralph loop. If you want to get hands-on this week, I suggest implementing this in your work environment and giving it a go.
Get good at reviewing. I know this is the boring part of engineering, but now it’s even more important. Review well enough that you’re confident in what’s going to production and that you understand how it works. Get very good at understanding system design as I find integration with surrounding systems is where these AI coding tools fail and it’s often the most difficult to detect.

Here’s everything else you need to know from this past week.

My Picks

Standalone content worth your time:

Finding and fixing Ghostty’s largest memory leak by Mitchell Hashimoto: A deep dive into debugging Ghostty’s PageList memory leak that grew to 37 GB after 10 days. The fix involved preventing reuse of non-standard pages during scrollback pruning. A great example of methodical debugging with practical techniques like macOS VM tagging.
8 plots that explain the state of open models by Nathan Lambert: China’s open models dominate adoption, led overwhelmingly by Qwen whose top variants have more downloads than many competitors combined. Qwen also leads finetuning activity on HuggingFace, though DeepSeek dominates at very large model scales.
5 GPU performance optimization methods: An easy-to-follow explanation of five GPU optimization methods for LLMs: batching, mixed-precision (FP16), tensor/kernel fusion, memory pooling, and CUDA stream management. Practical impacts include roughly 2x memory savings with FP16.
Demystifying evals for AI agents by Anthropic: A comprehensive guide on why agent evals are harder than model evals. Autonomy, tool use, and long-horizon planning introduce external dependencies and emergent behaviors that traditional testing can’t handle. Covers strategies for realistic environments, mixing automated and human assessments, and measuring both task performance and failure modes.
No, Claude Code doesn’t need a better UI by Logan Thorneloe: I wrote about why Claude Code’s terminal-based approach is actually its strength. The terminal is standardized, scriptable, and predictable, making it ideal for automation compared with brittle GUIs. Claude can control files, apps, and any CLI- or API-driven application via text commands.

Claude Cowork brings terminal agents to everyone

Anthropic released Claude Cowork, an adaptation of Claude Code that runs in the Claude app on Mac and performs general-purpose computer tasks. This is only available to Max subscribers and only on Mac for now.

I just wrote an article about how Claude is a general-purpose computer use agent, not just a coding tool. This means you can get just about anything done you could do via the terminal by prompting Claude. I stand by the fact that the terminal is still an excellent UI that builds intuition about what you can and cannot do with Claude as you watch it work. More info on Claude’s productive capabilities in the sources below.

Source: Simon Willison on Cowork, Cowork announcement on X, Ethan Mollick on Claude Code, My article on Claude Code as a computer use agent

Anthropic restricts third-party API access amid abuse concerns

Anthropic blocked two parties from using their resources this week:

Competitors such as OpenAI and xAI, to give Anthropic a competitive advantage.
Third-party harnesses that took advantage of Claude Max subscriptions, to ensure usage rates on these subscriptions can’t be spoofed.

This caused competitors such as Codex to jump on providing usage to third-party harnesses where users previously would have used Claude models. It makes me wonder about two things: how much goodwill did Anthropic lose to save money on the spoofing and what will be the long-term impact of other tools being more accessible to users?

Apple partners with Google to power next-gen Siri with Gemini

Apple signed a multi-year deal to base its upcoming Foundation Models on Google’s Gemini, enabling a more personalized Siri expected later this year. All inference and customization will run on Apple silicon and Apple’s Private Cloud Compute to preserve user privacy. My understanding is that Apple’s models will be based on the same LLM technology as Google’s.

I’ve seen a lot of takes on this, but the most prominent is that Apple has admitted defeat. Instead, think of this as a business decision. Apple doesn’t have a model ready that they think will guarantee an excellent assistant experience. They use Google’s models for now to ensure they can deliver a quality product to their users and they don’t lose any ground in the smartphone market. In reality, Apple is doing quite well in AI as their silicon and hardware have become a staple for serving large models.

Source: Apple-Google Gemini partnership

AI in healthcare faces mounting scrutiny from regulators and experts

A few things happened in AI-related healthcare news this week:

Google has had to remove several AI-generated health summaries to ensure misinformation isn’t spread.
OpenAI added Health to ChatGPT, enabling a user to discuss their health and health records with ChatGPT directly in the app.
Studies show more people are using AI for self-diagnosis, with one figure showing 59% of Brits are doing so.

OpenAI claims this is to ensure accurate information is given regarding healthcare and to enable users’ health-related queries to have the context of their current health information. Many are skeptical of sharing their personal health data with ChatGPT as most queries given to ChatGPT are used for training. OpenAI has guaranteed this won’t be the case with Health in-app.

Source: Google removes misleading AI health summaries, 59% of Brits use AI for diagnosis, ChatGPT Health critique

Tailwind’s layoffs reveal how AI adoption can destroy business models

Tailwind cut 75% of its staff after AI coding agents drove the CSS framework to 75 million downloads per month while simultaneously killing 40% of site traffic. Site traffic generated conversions to paid services, and this change in revenue contributed to an 80% revenue drop. Shortly after, Google AI Studio announced it would sponsor the Tailwind project.

Tailwind is one of the most popular frontend component libraries, but AI is fundamentally changing how information is consumed and transferred, meaning business models will need to adapt as well.

Source: Tailwind layoffs, Google AI Studio sponsorship

Building reliable AI agents requires rethinking evaluation

The difficult part of agent observability is logic being shifted from code to models. This means traditional test cases fail because model output can’t be tested deterministically. This is what makes AI observability such a difficult issue.

Anthropic recently released a blog post detailing evals and what makes them so tough, including the gold standard method of testing coding, computer use, and conversational agents. One big takeaway is that evals aren’t 100% foolproof and need to be accompanied by production monitoring, A/B testing, and user feedback. I highly recommend reading Anthropic’s report linked below.

Source: Harrison Chase on traces as documentation, Anthropic on agent evals

Quickies

Malaysia and Indonesia blocked Grok after regulators found it was generating sexually explicit images, including depictions of minors. src
US job openings dropped to 7.15 million in November, the lowest in over a year, with vacancies per unemployed worker falling to 0.9. src
NVIDIA and Eli Lilly will invest up to $1 billion over five years on an AI co-innovation lab for drug discovery. src
Bose is open-sourcing SoundTouch’s API instead of bricking the speakers when cloud support ends. src
Meta’s $2 billion acquisition of Manus triggered a Chinese Ministry of Commerce review for potential export control violations. src
Gemini CLI now offers “Agent Skills” that can be installed via npm. src
Self-hosting has become practical with cheap mini PCs, Tailscale, and CLI agents like Claude Code handling setup. src

Last week

In case you missed it, here’s last week’s overview:

Thanks for reading!

Always be (machine) learning,

Logan

AI's Role in Maduro's Capture | AI for Software Engineers 76

Logan Thorneloe — Wed, 07 Jan 2026 16:01:41 GMT

Here are my picks for content you don’t want miss and everything you should know about AI for January 7, 2026. Enjoy!

My Picks

21 lessons from 21 years at Google by : Lessons learned from working at Google for 21 years. Two notable lessons: most slow teams are actually misaligned, and the best engineers are obsessed with solving user problems. All are worth reading.
Reasoning models are a dead end by : A valuable take on reasoning models and their lack of interpretability. Reasoning encoded into model weights loses 95% of intermediate branching and produces brittle behavior compared to externalized reasoning infrastructure. A great example of why engineering is so important in AI.
The suck is why we’re here: Some great perspective on writing with AI. The author argues that AI shortcuts the crucial, difficult parts of writing (research, stuck thinking), and that avoiding these “sucky” parts sacrifices depth and lasting reward. AI will increase quantity but lower average quality, making genuine effort stand out.
Advent of Code 2025 with Compute Shaders by : An excellent exploration of implementing Advent of Code solutions using GPU compute shaders on Metal. The GPU kept consistent times (~5ms) as problem size grew while CPUs slowed dramatically, demonstrating practical applications for massively parallel problem solving.
Building AI Agents, Open Code And Open Source by : I thoroughly enjoyed reading this interview, especially the parts about open versus closed source tools and the motivation behind them. Terminal agents are only going to be more important this year and this does a great job of helping readers understand them.

Things you should know

AI was used to push narratives in Nicolás Maduro’s capture

AI-generated media circulated the internet following the US capture of Venezuela’s president Nicolás Maduro. Fake images depicted the capture itself, while a deepfake video showed Venezuelans crying tears of joy. Both were used to push specific narratives about the operation.

Any company serious about AI needs to help viewers discern between AI and non-AI media. The images above were caught by Google’s SynthID watermark which Google attaches to all AI-generated images using Gemini. Sure, anyone can switch to a non-watermarking tool, but even putting up a small obstacle to generating a fake narrative is a big win.

Source: EBU Spotlight on Maduro fake images, Yahoo News on fake celebration video, Google SynthID

See how SynthID works below:

AI safety concerns mount as AI chatbots face serious scrutiny

xAI was fined 120 million euros under the Digital Services Act due to Grok generating sexually explicit images of women and children. Separately, a lawsuit alleges OpenAI is withholding ChatGPT logs after a murder-suicide where transcripts show the chatbot validated a user’s paranoid delusions.

AI safety is foundational to ensuring we can apply AI to the applications where it’s needed. It’s crazy to me that AI safety teams were previously understaffed or dismissed. Both of the examples above show why AI safety is important and also some of the difficulties that come with ensuring safety.

Source: TVP World on Grok, Ars Technica on ChatGPT logs

Half of AI-generated code has security flaws

Over 30% of senior developers now ship mostly AI-generated code, and the trade-offs are becoming clear. AI code shows logic errors at 1.75x the human rate, XSS vulnerabilities at 2.74x, and roughly 45% of it has security flaws. PR sizes are up 18%, incidents per PR are up 24%, and change-failure rates have risen 30%. Properly configured AI review tooling catches 70-95% of low-hanging bugs.

These statistics echo my recent article detailing how AI impacts an organization’s engineering culture. AI is an amplifier, and if your processes aren’t solid, AI will make them worse.

Source: Addy Osmani, AI for Software Engineers

The best way to fight AI cheating in education is with AI

An NYU professor is using AI to conduct oral exams with students at just 42 cents per student. The AI asks follow-up questions and probes understanding in real-time, forcing students to verbally explain concepts rather than paste in AI-generated answers. This follows a trend where some schools have removed online math courses entirely or now require in-person testing as instructors note declining problem-solving skills and increased reliance on copying AI outputs.

One of my biggest concerns with AI is education. It has potential to be the greatest multiplier but also the worst detriment in this space. As with many other applications, we’re seeing AI-related problems being combatted with AI-related solutions.

Source: Reddit discussion on AI oral exams

Claude Code creator shares his setup for using Claude Code

Boris Cherny, who created Claude Code, runs multiple instances at a time with a focus on Opus 4.5 with “thinking.” It needs less steering despite being slower per token, which increases velocity in the long run. He also claimed that Claude Code’s updates are all written entirely by Claude Code itself.

Separately, a principal engineer at Google mentioned just how far Claude Code has come by saying it can now design specs that took multiple engineers a few months ago. An ex-Google PM commented on this explaining how important it is for engineering teams to be using competitors’ products to improve their own.

My only addition: stop thinking of Claude Code, Gemini CLI, and Codex as coding agents. Instead, think of them as terminal agents. Anything you can do from the terminal, it’s possible to get AI to do for you.

Source: Boris Cherny on X, Jaana Dogan on X, Raiza Martin on X

Research to watch in 2026: Recursive Language Models and Manifold-Constrained Hyper-Connections

Recursive Language Models (RLMs) let models handle context windows up to 100x longer than their native limits by breaking inputs into chunks and processing them programmatically. In tests scaling from 8K to 1M tokens, base models degraded sharply while RLMs maintained performance at comparable cost.

Separately, a technique called Manifold-Constrained Hyper-Connections (mHC) stabilizes model training with only 6.7% overhead, eliminating common instability issues that plague large model runs.

Both papers tackle fundamental scaling bottlenecks: RLMs at inference time and mHC at training time. If these techniques hold up, they could meaningfully change how we build and deploy large models.

Source: Alex Zhang on RLMs, mHC paper on arXiv

NVIDIA acquihires Groq through licensing deal

Groq signed a licensing deal with NVIDIA that will see about 90% of Groq’s 400+ employees move to NVIDIA at a $20B valuation. Groq will remain independent and GroqCloud will continue operating. Groq’s specialty is developing compute with incredibly low-latency inference, something Nvidia can benefit from as it continues to ramp up its research and development of AI compute.

This is another acquihire within the AI industry. The most recent I can think of was Google acquiring talent from Windsurf which led to Google’s Antigravity IDE. I see something similar happening at Nvidia where they’ll come out with even lower latency compute offerings for customers.

Source: The Chip Letter by

More...

A shape-shifting molecule discovery could change the future of AI hardware. (Science Daily on shape-shifting molecules)
Micron shares surged over 10% on AI optimism and increased demand for high-performance memory. (Micron stock coverage)
California State Senator introduced a four-year moratorium to ban AI chatbot-equipped toys for minors. (Coverage of AI toy moratorium)
Claude Code can run on-the-go using an iPhone via Termius and mosh to a VM costing about $7/day. (Granda.org)
Advanced AI could collapse labor’s share of GDP toward zero, concentrating wealth among capital holders. (Dwarkesh Patel on X)
An excellent overview on the past 10 years of AI. (Weighty Thoughts by )
An interesting read from an author who canceled their technical book publishing deal for various reasons. (Austin Henley)
PostgreSQL dominated 2025, driving major acquisitions and new DBaaS launches across all major cloud vendors. (Databases in 2025: A Year in Review)
Two excellent 2025 retrospectives worth reading. (Ignorance.ai on 10 AI stories by , Simon Willison on the year in LLMs)

Last week

In case you missed it, here’s last week’s overview:

I’ve removed the jobs and industry updates from these weekly roundups. I haven’t been able to fit them properly at this cadence and will be moving them to their own, less frequent articles. Stay tuned!

Thanks for reading!

Always be (machine) learning,

Logan

What are World Models? | AI for Software Engineers 75

Logan Thorneloe — Tue, 23 Dec 2025 14:03:02 GMT

Yann LeCun has confirmed his startup, Advanced Machine Intelligence (AMI), will develop world models and is currently seeking fundraising at a $5B valuation. While headlines focus on the $5B valuation, I care much more about the work.

Crazy valuations aren’t uncommon in the world of AI. The potential for this technology is astronomical even if the roadmap to get there is still being discovered. I view AI’s potential as shifting humanity’s problem-solving from O(n) (or greater!) complexity potentially to O(1). Once we can solve problems quickly, our rate of advancement will compound and discovery will take off. If this happens, AI will be worth far more than even crazy valuations place it at.

LeCun is now directly pursuing the same area of research and industry as Fei-Fei Li’s World Labs. He also joins other great minds in AI who have said AI needs a breakthrough beyond scaling the research we currently have. Many are placing a bet on this being world models.

Put succinctly, world models aim to understand the 3D world and learn more like a human child instead of like a machine. Instead of understanding a statistical correlation between inputs to generate a representative output, world models seek to understand causal physics and spatial reasoning. World Labs puts it well on their website:

“We build foundational world models that can perceive, generate, reason, and interact with the 3D world — unlocking AI’s full potential through spatial intelligence by transforming seeing into doing, perceiving into reasoning, and imagining into creating.”

This means world models aren’t generative but instead make predictions based on abstract representations of the concepts they’ve learned. Instead of guessing at pixels, they focus on higher-level concepts.

Here’s an example to illustrate what this means: If we’re considering a car moving down a street, a generative model wastes compute estimating pixel movement for every leaf on the road. A world model would instead ignore unimportant details and focus on the latent variables that impact its understanding such as the car’s velocity or the friction between the car and the road. Those details would be used to predict the world’s next state.

Practically, this means two things:

World models don’t waste resources on things that are unimportant for a task.
We can use spatial intelligence to train other real-world AI applications without needing to collect new data. Instead, world models can generate (or “dream”) their own. This is both very efficient and safer than the alternative (think about an application like self-driving where there is always inherent risk with data collection).

I predict we’ll be seeing a lot more about world models in 2026 and I’m curious to see how far we’ll get. Enjoy this week’s resources!

Subscribe now

This week’s curated highlights

My LLM coding workflow going into 2026: This is an excellent comprehensive overview of Addy’s AI coding workflow. The best way to optimize any AI-related workflow is to discuss and share what works with others. I highly recommend checking this out and sharing your workflows with others!
Gemini Plays Pokemon: This report compares the performance of Gemini 2.5 Pro to Gemini 3 Pro playing Pokemon. It’s an interesting read. Gemini 3 Pro didn’t just play better; it exhibited creative problem-solving by finding a loophole to multitask.
Andrej Karpathy revealed his 2025 LLM Year in Review: I recommend just reading this, but most interesting is a shift from models imitating humans to reasoning through rewards. Other interesting notes are image model advancements, terminal-based AI, “jagged intelligence”, new layers of LLM apps, and the introduction of vibe coding.
Jeff Dean’s Performance Hints: Jeff Dean updated his guide on engineering principles for performance at scale. The writeup provides a guide to optimizing software performance at the level of a single binary. Performance is a crucial topic for any engineer to understand and is only getting more important in the age of AI.
Sam Rose’s overview of LLMs/prompt caching: This article gives a great overview of how prompt caching reduces token costs in LLM and it also gives a great high-level overview of LLMs as well using excellent visual elements. I love articles with great visuals and Sam constantly delivers.
Your guide to local coding models: This is my article from this past week that many of you have likely already read. I’m including it here because I made some mistakes in the initial release that I edited to clarify and it incited some interesting conversation across X, LinkedIn, Substack, and even Hacker News where it reached number 1.

Things you should know

New & Trends

Google released Gemini 3 Flash, outperforming its previous Pro models

Google released Gemini 3 Flash, the new generation of its smaller model for faster, cheaper applications. It is multimodal, uses 30% fewer tokens than previous models, and is significantly cheaper ($0.50/1M input tokens). The advancements of smaller, cheaper models are much more important than the advancements of large models when it comes to applications and are key to democratizing the technology.

AI coding assistants ship more, but also more bugs

CodeRabbit’s recent report shows that AI introduces 1.7x more issues than human-written code. Specifically, AI introduces 1.7x more issues, including 75% more logic errors and 3x the readability issues. AI can produce more code faster, but the code tends to be significantly worse. This volume of pull requests also causes reviewer fatigue making it more likely for errors to reach production.

New York governor Kathy Hochul signed RAISE act

Despite an executive order from President Trump to consolidate AI regulation to the federal level, New York governor Kathy Hochul signed the RAISE Act to establish safety standards for large AI companies in New York. This act requires companies with over $500 million in revenue to comply with specific safety standards.

AI data centers have a carbon footprint that matches a small European country

A new study shows that AI systems could produce roughly the same amount of carbon dioxide as New York City or Norway (about 80 million tons). While this is an estimate (as exact numbers are hard to come by), it shows the environmental impact AI could have and emphasizes that the real long-term problem AI needs to solve is power and energy.

AI has entered the game industry

Roblox Studio is integrating AI into their platform. This enables users to generate assets with a prompt, use AI for real-time voice translation, and orchestrate work across other creative tools via MCP. The goal of this shift is to enable users to get to market faster and be more profitable.
Contrarily, Larian Studios (the creators of Baldur’s Gate 3) CEO Swen Vincke explained Larian only uses of AI in game development for content exploration similar to how they use Google images and art books. This was after they were accused of trying to use AI to replace artists which Vincke explained was false and they’re actually looking to hire more artists, not replace them.
The video game industry has many potential applications for AI, but gamers aren’t excited about how they’ll impact the actual games produced. In 2026, I’m certain we will see it used more as it becomes more cost effective and I’m curious to see what the usage will be.

Research

Duke scientists created an AI that simplifies complex data

Researchers have created a physics-bound deep learning model that takes in complex data and outputs a much simpler, mathematical representation of that data. This is particularly useful in domains with a ton of data but without equations to explain relationships within that data. This AI creates a starting point for a formulaic representation of that data.

OpenAI research on agents and their capabilities

Just a week after discussing OpenAI’s Code Red, they’re pumping out research at an astonishing rate. They’ve released research on chain-of-thought monitorability, AI’s capability to accelerate biological research in the wet lab, and AI’s ability to perform scientific research tasks. I’m particularly interested in the first (and you should be too if you’re building any sort of AI-powered application) and I’m loving the trend of applying AI to accelerate research.

Product & Releases

Google Labs released CC, an AI agent connected to Gmail, Drive, and Google Calendar to deliver a personalized briefing of your day.
Anthropic released Agent Skills as an open standard so other companies can get on board with integrating them within their products. Anthropic’s push to release AI standards has gone a long way to solidify their standing in enterprise AI.
OpenAI released GPT-5.2-Codex, a version of GPT-5.2 for coding that achieves state-of-the-art on SWE-Bench Pro and Terminal-Bench 2.
ChatGPT and Codex CLI are also adopting skills similar to Claude.
Google introduced FunctionGemma, a small model designed for function calling on edge devices.

Safety

OpenAI is upping their safety game by adding under-18 principles to their model spec for users aged 13 to 18. OpenAI has also created a guide for families and parents for responsible AI use.

Career resources

State of the market

AWS CEO says replacing junior engineers with AI is foolish

Amazon has a ruthless history of optimizing for cost, even at the detriment of its employees. When the CEO of AWS states replacing junior engineers is a bad idea, you know it’s true. The software engineering industry will be in a bad place when we need more senior engineers but we’ve abandoned the junior engineers that should become them.

Not everyone is convinced by AI coding

This is an interesting article I actually helped contribute to that highlights the gap in expectation versus reality for AI in software engineering. The expectation is set very high that engineers should be able to greatly increase their productivity by using AI coding tools. In reality, it takes time for developers to figure out these tools and how to use them productively. In fact, the initial onboarding can even cause a development velocity hit that isn’t reflected in performance expectations.

Interesting opportunities

Google Ads is still hiring! There’s an aggressive push to hire top talent in Google Ads. We’re specifically looking for mid- to senior-level developers with experience in Python, C++, Go, or Java that have worked on large-scale distributed systems. GenAI, ML, and ads experience is a plus. If you’re interested, check out the open roles here and/or DM me with any questions.

Learning Resources

Packt has a deal for $10 for any technical ebook. This goes down to as low as $5.99 if you buy 10 or more. This is the best deal I’ve seen yet to build a technical book library.
This comprehensive open-source roadmap will walk you from LLM fundamentals to deploying advanced LLMs. It structures the learning path into 3 distinct tracks: LLM Fundamentals, the LLM scientist, and the LLM engineer.
Check out an Agentic AI problem set developed by @Prof. Tom Yeh. Professor Yeh writes AI by Hand, an excellent resource for understanding the inner workings of machine learning models. This resource is great to test your knowledge of AI agents and upskill while you’re at it.

If you missed last week’s article, you can check it out here:

Thanks for reading!

Always be (machine) learning,

Logan

What You Need to Know for 2026 | AI for Software Engineers 74

Logan Thorneloe — Tue, 16 Dec 2025 20:24:59 GMT

Hi Everyone!

Welcome to the weekly update edition of AI for Software Engineers! I go through everything software engineers should understand about AI by filtering noise and contextualizing what matters. I tend to focus on current events, tooling, research, and other interesting content.

This was an incredible week. In this edition, we discuss:

The AI industry’s shift toward practicality
The Linux Foundation taking over MCP
OpenAI’s Code Red and what that actually means
Developer tool updates
The in-demand skills for the 2026 software engineering job market
The learning resources to learn those skills

—

tl;dr:

It’ll be easier for software engineers to break into AI next year. If you want to do so, focus on developing a skillset in AI cybersecurity, building agents, and MLOps and specifically aim to understand agent workflows, evals, and protocols. Agents will still be a primary focus, but the complexity of building systems with them is much better understood.

MCP is now under the stewardship of The Linux Foundation to encourage open standards. OpenAI’s Code Red is about them aligning their priorities to reach positive revenue. Many developer tools have seen updates/releases. Companies are actually seeing a return on their agent development investments.

More detail on all this below and interesting opportunities.

Subscribe now

Before we get into it, some housekeeping.

A few updates:

We’ve got a new logo! It’ll be representing the newsletter and will be seen around more. It might even be on some swag soon…
As always, I’m ironing out the format of these weekly updates to be more beneficial for both me when doing my research and you when reading. I’m trying to make it a bit more interactive. You can now leave comments to help drive the direction of the newsletter. I’m also trying to find resources for you to learn everything the skills I write about.
I’m working on a way to get readers more involved in the newsletter. Don’t forget that we’ve got an ML roadmap to help anyone learn ML fundamentals and an AI for SWEs repo to get hands-on with building AI-related products. I’m looking to make these resources more community-oriented soon.

Partner with AI for Software Engineers!

If you want to support AI for Software Engineers and get viewed by 11,000+ developers each week, reach out to sponsor an issue. I’m particularly interested in excellent learning resources, developer tools, and career opportunities.

What’s Been on My Mind

This past week has seen a shift in the AI industry toward practicality. Recently, we’ve seen influential voices mention that the economic impact of AI hasn’t been living up to the hype. Most notably, we’ve seen Andrej Karpathy and Ilya Sutskever mention this during their most recent appearances on the Dwarkesh podcast.

I’ve been thinking about this a lot for two reasons:

The actual statistics about the job market don’t match what I’m noticing about everyday work.
I’ve been working on AI integration into developer workflows at work with world-class engineers and it’s much more complicated and cutting edge than we had anticipated.

First, a recent study reported a severe decline in job listings for junior developers. Almost frighteningly so—to the point that the industry will be heavily impacted in the coming years as we don’t have enough junior engineers to fill the demand we need.

Everyone said AI would kill software engineering, but it turns out this has very little to do with it actually taking jobs and more to do with AI hype convincing leadership that it can.

From my research and daily work, I would expect the number of junior developer positions to have increased. AI makes junior developers much more capable when given to them at a company with a good engineering culture (I’ll include more on this in a separate article next week. There’s actually an entire study to prove this is the case and it’s super interesting).

Honestly, It’s kind of a cheat code for companies to hire junior engineers in a market like this. Companies that get their pick of the most talented engineers for less. We’ve also never had so many tools to increase onboarding velocity and enable developers to build more.

What my team at Google is seeing are tons of opportunities to apply AI to developer and machine learning workflows and speedups, but applying these properly is much more complicated than one might think. A lot of thought needs to go into security and ensuring system performance. AI evals are much more difficult than regular test suites.

In 2026, we’ll see more applications of AI explored and productionization of agents mature. It’ll be even easier for software engineers to get involved with AI as companies realize the useful applications of AI and the headcount required to achieve it.

If you want to get into AI as a software engineer in 2026, these are the top three skills I’d focus on:

Building agents. Agents will continue to build in 2026 and companies will narrow on their most impactful applications. These applications will far outnumber the supply of developers able to build them.
MLOps. This has been an incredibly valuable skill for about a decade now and will only get more valuable in the coming years. More companies using AI means more models are being trained. Companies will need engineers that understand that training process and can build the infra necessary to make it happen.
AI Cybersecurity. You wouldn’t believe the security and privacy complexities non-deterministic systems introduce. This is another article I’ve got in the works and something we’ve been deeply exploring at Google. If you can understand this, there will be opportunities available.

Links to learn each are included in the ‘Learning Resources’ section at the bottom.

The last edition of AI for Software Engineers

In case you missed the last AI for SWEs, here it is. There’s more on agents, Ilya’s podcast appearance, and the importance of AI security there.

Things you should know about

Software engineering is going agentic

We already know that nearly 90% of organizations are using AI to code, but now agents are making their way into enterprises. 57% of organizations are deploying agents for multi-stage workflows with 16% of those being cross-team workflows. In 2026, 81% of teams plan to use agents with 39% of those agents being developed for multi-step workflows.

Interestingly, 80% of organizations are reporting a return on their investment. As we’ve mentioned previously, this is a number that is very difficult to quantify. What does ROI actually mean in multi-step agentic workflows? It greatly depends on the workflow and the goals it aims to achieve. There isn’t a universal standard for quantifying this uptick in velocity.

The use of agents and AI is extending beyond traditional software engineering tasks (code planning, generation, document, review, etc.—where they’re seeing a 59% increase in productivity) to tasks like data analysis and report generation where they’re seeing similar gains.

I highly recommend reading Anthropic’s 2026 State of AI Agents Report, even if you only read the foreword. If you want me to go more in-depth into this so we can really get into how enterprises are using AI and the ROI they’re achieving, comment at the bottom of this article.

Leave a comment

OpenAI declares a ‘Code Red’

OpenAI declared a Code Red internally as competitors have started stealing market share. Most notably, Google is stealing is gaining both consumer and enterprise market share in AI tools and Anthropic has increased their revenue considerably to the point where they’re considering IPO’ing in 2026.

I believe this is being largely overblown by the public. OpenAI has revolutionized the consumer understanding of LLM products and continue to lead in that area. Over the past three years, they’ve been competing with (and beating) a company of Google’s size as a startup. In reality, Google’s resources far outnumber OpenAI’s own. What we’re seeing is OpenAI figuring out the focus of their product offerings to become revenue positive.

Anthropic is a great example of why this is important. Six months to a year ago, many individuals counted Anthropic out of the AI race because of their size and less prominent standing. In reality, they’re almost revenue positive and are consistently beating larger companies in their areas of focus, the most notable of which is coding.

We’re seeing OpenAI figure out the same right now as they continue to push their product offerings forward.

The Linux Foundation now owns Model Context Protocol (MCP)

The Linux Foundation is a non-profit organization that has been dedicated to fostering the growth of open-source software for decades. Some examples of software they steward are The Linux Kernel, Kubernetes, Node.js, and PyTorch. They have a history of maintaining vendor neutrality and facilitating open collaboration.

The Agent AI Foundation (AAIF) is a directed fund under The Linux Foundation. It’s a joint partnership between major players in AI including OpenAI, Google, and Anthropic. The goal of the AAIF is to maintain agentic AI transparency and collaboration to ensure the technology benefits everyone and open standards continue to be developed and maintained.

MCP will join the AAIF as a founding project to bring AI open standards under one roof. This is great news for all of us as open standards make it much easier to actually build and apply new technologies.

Developer tools continue to grow

So many developer tools are being developed and updated each week. Here are some notable developments in a rapid fire format:

You can now build applications with Gemini Deep Research Agent integration. / Google leases the FACTS Benchmark Suite, a benchmark for testing a model’s accuracy and groundedness. / llama.cpp server now includes a router mode that lets you dynamically load, unload, and switch between models without restarting by running each model in its own process. / ChatGPT is adopting skills / Gemini CLI introduces session management. / Thinking Machines has released Tinker to everyone. / Mistral releases Mistral Vibe, their AI CLI coding tool.

My picks for the week

These are the videos and articles from this past week I think are most worth watching/reading outright. I highly recommend you don’t miss them:

TPU Mania by : Google’s recent decision to sell its TPUs externally and the speed of the TPU v5p (2.8X faster than v4) have created a major “vibe-shift” in the industry, setting up the most keenly fought architectural contest since CISC vs. RISC in the 1980s.
Researchers Built a Tiny Economy. AIs Broke It Immediately [Video]: In the SimWorld delivery economy, AI agents high in “openness to experience” became “shopaholics,” kept buying unused scooters, and went broke, while conscientious agents were the “boring winners” that achieved high profits by focusing strictly on the task at hand.
How to use Claude Code for Maximum Impact by : Enterprise adoption of Claude Code, demonstrated at companies like Doctolib, drastically cuts engineering time by allowing engineers to replace legacy testing infrastructure in hours instead of weeks, helping them ship features 40% faster.
Top 5 AI Model Optimization Techniques for Faster, Smarter Inference: Discusses optimization techniques like Quantization-Aware Training (QAT) and Pruning plus knowledge distillation, which make models cheaper, smaller, and more memory efficient to operate in production.
Olmo 3 and the Open LLM Renaissance by : The Olmo 3 family of models (7B and 32B) is unique in that it is “fully open,” releasing model checkpoints, all training data, and training code, making it an unprecedented and comprehensive starting point for open LLM research.

The state of the market

CEOs are still betting huge on AI in 2026. As mentioned above, there’s a huge demand for developers that can build agentic AI systems. This means taking a problem, prototyping an agentic solution as needed, and building the entire system. This means understanding complex, multi-step workflows and the work that go into ensuring these systems are productionized.

To learn this, I’d focus on understanding (resources for learning each at the bottom of this article):

Evals. These are like tests for LLMs and agentic systems. All software engineers know that testing gets much more complex when systems are non-deterministic and that’s what makes evals so complicated.
Protocols. If you haven’t spin up an MCP server so your favorite CLI tool can access a resource you need it to. MCP servers are huge for integration into agentic workflows and the best way to learn them is by building one.
Agentic workflow patterns. There are certain patterns to building agents that are followed for specific use cases. I’ve linked a guide in the ‘Learning Resources’ section.

If you want any of these skills to be added to the AI for SWEs hands-on learning repo, comment which you’d like to see at the bottom of this article.

Leave a comment

Interesting opportunities

If your company is hiring, you can reach over 11,000+ developers by including it in this newsletter. If you’re interested, reach out to me.

Google Ads is aggressively hiring top talent

If you’re interested in working at Google Ads and you have experience with large-scale distributed systems, working in Ads, working in ML/AI, or solving complex problems at scale, please reach out! Languages of particular interest are C++, Go, and Python, but those are not a limiting factor. You can DM me here, on X, on LinkedIn, or hit up my email.

Please make sure to include information about yourself and why you’re a good fit in the DM/email. I will not respond to just ‘Hello’ (see aka.ms/nohello).

Anthropic is hiring eval talent and accepting applications for their Anthropic Fellows Program

If you aren’t on X, I’d highly recommend lurking there. If you hate the algorithm, let me know and I can help you out. Companies are aggressively seeking applicants on X and I’m guessing this is due to AI-related problems on LinkedIn.

Anthropic is looking for talent to build the next generation of evals and eval infra. They are also taking applications for their Anthropic Fellows Program which is a full-time research commitment with mentorship from Anthropic researchers. It has about a 40% chance of a full-time offer after completion if your work is excellent. Definitely check it out.

Thinking Machines is looking for many research engineers to fill ML infra positions

Thinking Machines has multiple ML infra-related research engineer positions. They’re especially cool because they’re a cross between research and engineering (meaning you’re building at the cutting edge of AI) but they also seem to be highly user-centric.

Google is hiring student researchers

Google is hiring student researchers for 2026 to work at the cutting edge of AI. If you’re into multi-agent AI systems, RAG, prompt optimization, or self-improving agents, please apply! Again, this is another job opportunity they’re sourcing through X. If you’re not on X, join and message me!

I’ll be adding more in the opportunities section as I come up with a better way to organize and keep track of all of them.

Learning resources

Reinforcement Learning: Stanford’s Deep Reinforcement Learning lectures on YouTube are world class lectures accessible entirely for free.
Agentic Workflow Patterns: ByteByteGo newsletter recently released an article detailing these patterns at a high-level. Definitely something to be familiar with.
MLOps: I recommend checking out the MLOps community and the resources they have available. You can also find them on Substack: MLOps.
AI Cybersecurity: I don’t have a good resource for this yet. If you do let me know! Tagging in case he has a resource for this.
Building Agents: I’ve heard good things about DeepLearning.ai’s course on building AI agents. Check it out.
Agent Evals: Same with evals, also check out DeepLearning.ai. They’ve got a short course to get started on agent evals. I’ll continue looking for something more in-depth.
Agent Protocols: I recommend HuggingFace’s MCP Course to get started. Still looking for resources on other protocols.

Thanks for reading!

Always be (machine) learning,

Logan

AI for SWEs 73: What Ilya Saw and the Time of TPUs

Logan Thorneloe — Tue, 02 Dec 2025 14:30:27 GMT

Welcome to AI for SWEs where I share everything software engineers need to know about AI from the past week. This week has seen fewer but more important headlines. I’ve detailed them below.

Also, the Rapid Fire and Career Development sections are now exclusive to paid subscribers. Thanks for reading!

Subscribe now

Ilya Sutskever declares the age of scaling over and the age of research begun

“You look at the evals and you go, ‘Those are pretty hard evals.’ They are doing so well. But the economic impact seems to be dramatically behind.”

Ilya Sutskever and Dwarkesh Patel recently discussed AGI, current AI paradigms, and scaling on Dwarkesh’s podcast. Ilya brought up two topics vital for any software engineer working with AI.

First: Application is the most important thing in AI. We’re seeing impressive models that excel at evaluations and benchmarks but lack the expected economic impact. AI research is advancing rapidly, but usefulness comes from understanding how to apply it. This means identifying practical applications and understanding the complexity of engineering systems for that application.

Second: We are returning to the age of AI research. From 2012 to 2020, we were focused on research—developing effective architectures and models. Around 2020, we entered the scaling phase where we realized we could achieve impressive results by simply increasing data and compute. Now that we’ve scaled, we’re realizing that we need to explore further developments to continue advancing AI, so we’re back to research.

I’ve often stated that reaching AGI will need a new architecture or a fundamental research breakthrough. Current models are impressive and useful, but they’re insufficient for the promises of AGI.

Safe Super Intelligence is now focusing on pushing the next frontier of AI. I highly recommend watching this episode. I could listen to Ilya speak for hours.

Google Antigravity exposes critical agent vulnerabilities in local coding environments

I’m a huge Antigravity fan. I believe there’s a better way to code with AI than just a chat interface, tab autocomplete, and reviewing agent output, and Antigravity has a great chance at figuring this out.

Over the past week, Antigravity has leaked sensitive information and engineers should understand why—not just to use Antigravity, but also to build with AI. This is an issue applicable to all AI agents.

A lot of software engineers are building agents to automate developer tasks, which is great. The best way to start learning and building with AI is by automating your own tasks. The problem is that building AI agents introduces security and safety concerns not present in deterministic systems.

For a good example, read about Antigravity ingesting hidden text into its context window and that hidden text being used to collect and exfiltrate sensitive workspace files. Prompt injection can also cause Antigravity to read a user’s .env file and ingest sensitive information into its context window.

Agents may also complete tasks that a user didn’t intend. When an agent has access to a user’s local environment, this can be a huge issue. Read about Antigravity deleting the contents of a user’s drive here.

I’ll be writing a more in-depth guide on agent safety soon.

Google’s TPUs are the best business decision of the 2010s

This week highlighted just how advantageous Google’s TPUs are and just how few people understand this. Google is the only company that controls its entire AI stack including hardware, models, and applications. When developing AI, the only company Google has to wait on is itself.

This control stems from a business decision made over a decade ago to invest in AI-specific hardware. Google was the first true AI company and has been heavily investing in AI applications since the early 2010s, including machine learning libraries, infrastructure for training large-scale models, talent, and, most critically, TPUs.

TPUs provide the most significant advantage. Setting up and integrating new processors into data centers is incredibly time, capital, and resource intensive. Starting this process today would require years of work just to get it workable at scale.

Given that TPUs were designed specifically to be energy efficient for tensor processing, Google has an entire stack built to increase machine learning development velocity and keep it resource efficient.

It makes sense, then, that other companies training large AI models would want to take advantage of this. This is why major generative AI players like Anthropic and Meta are making deals to use TPUs, and why companies in capital-intensive settings, such as high-frequency training firms, are switching to TPUs in droves.

There’s a huge demand for AI chips right now, as seen by the many startups succeeding in the space. And over time, we’ll just see more companies adopting TPUs.

The White House unifies AI federally

President Trump signed an executive order, the “Genesis Mission,” on November 24th, 2025. This order aims to federally harness AI to revolutionize scientific discovery and innovation. It’s an effort of national significance, compared to the urgency and importance of Manhattan Project or the Apollo program and focused on integrating federal resources to accelerate scientific and technological breakthroughs.

A year ago, AI was widely discussed as a true national asset and a competitive advantage on a global scale, similar to weapons of mass destruction. Considering that biases and information are trained into AI models, allowing another nation to build your models for you is an inherent national security risk.

The Genesis Mission establishes the American Science and Security Platform. This secure AI ecosystem combines various machine learning assets, such as compute power, models, and datasets. The platform enables “closed-loop AI systems” to conduct research autonomously. The idea is that these closed-loop AI systems can complete research in weeks that would take humans months or even a year.

This mission combines efforts from academia, all 17 Department of Energy national facilities, and industry leaders, including Microsoft, IBM, OpenAI, Google, Anthropic, NVIDIA, and Oracle. As far as I know, this is the first serious federal push to combine U.S. assets to advance AI.

Note that this mission builds on other Trump-era policies like promoting AI exports, preventing biased data, and enabling AI-driven research developments.

Logan’s Picks

DMs Are the New Cover Letter: How to Get Hired in AI in 2025/2026 by : DMs are super important in a market where job listings are heavily saturated and this is my guide for how you should DM others for job opportunities based on my experience posting a job opportunity a few weeks ago.
Launching DeepSeek-V3.2: The new reasoning-first model balances inference cost with performance, positioned at GPT-5–level performance and supporting “Thinking in Tool-Use”. The release includes a new massive agent training-data synthesis method covering 1,800+ environments.
Bubble, Bubble, Toil and Trouble by : Wang distinguishes between financial bubbles (leverage-driven) and tech bubbles (forecast-driven). Tech bubbles are hard to time because they often overshoot initially but deliver revolution in a later “Gen2” phase once infrastructure matures.
Treat AI-Generated code as a draft by : Developers must treat AI code as a draft, verifying every line to prevent bug proliferation and skill erosion. Teams should enforce strict review processes and consider manual implementation for critical logic.
How good engineers write bad code at big companies: Bad code at big companies is often a structural result of high engineer churn and incentivized fungibility rather than incompetence. Frequent reassignments mean most changes are made by engineers new to the codebase.

In case you missed it…

In last week’s AI for Software Engineers, we discussed Gemini 3 Pro, Cloud Opus 4.5, and Olmo 3, all three important model releases. You can find last week’s issue here:

Upskill

Interesting Learning Resources

The MCP Workbook aids in learning agent design using an interactive “by-hand” pedagogy to understand complex architectures.

AI for SWEs 72: Gemini 3 Pro, Claude Opus 4.5, and Olmo 3

Logan Thorneloe — Tue, 25 Nov 2025 14:49:19 GMT

Welcome to the nearly 1000 new subscribers to AI for SWEs in the past week (yes, we changed our name—more on that below!). We’re excited to have you and to explore building AI together.

This is our once-a-week AI roundup focusing on the current events, resources, releases, and more that are most important to software engineers. There are many AI roundups, but this one focuses specifically on what you need to know and learn to build better.

This roundup includes important headlines from the past week, what they mean, and why they’re important. I include my picks for must-read/must-watch resources from the past week. I then rapid fire off all the other important things that happened and finish with some career resources (coming soon to paid subs).

This week is a doozy. I’d love to know: What are your impressions of Gemini 3 Pro and Claude Opus 4.5? Benchmarks are one thing, actual use is another. Leave a comment below to share!

Always be (machine) learning,

Logan

Leave a comment

Google Releases Gemini 3 Showing LLM Performance Isn’t Stopping

Google released Gemini 3 Pro and made it available in most Google AI products almost immediately. It topped all benchmarks including coding and multimodal reasoning. Initial usage reports it being very good for frontend development and about on par with other models for backend. Personally, I’ve found it much better at coding and reasoning all around but I’ve seen it struggle to remember fundamental information within conversations/tasks. My guess is this will be fixed over time.

Google also released Antigravity, which is the successor to Windsurf using the talent acquired during the Windsurf deal. Antigravity is an agent-first IDE that focuses on agent orchestration for planning, implementing, and debugging code directly in the developer workflow. You can watch this video to understand how it works.

Google also released Nano Banana Pro. While Gemini 3 Pro and Antigravity were both large releases, Nano Banana Pro absolutely blew my mind. It’s by far the best image generation model, but it can also generate charts and infographics with text incredibly accurately. Think about a prompt such as “Generate a good graphic to visualize this concept” and Nano Banana Pro can do it. It even supports interactive images, meaning users can generate graphics and learn by interacting with them. The Gemini app can also detect whether an image is generated by Google AI.

If you want to read more about Gemini 3 Pro, read this analysis by and this article by . It’s also worth checking out some of the more creative creations with Gemini 3 Pro on X here.

Anthropic Releases Opus 4.5 Right After ALSO Showing LLM Performance Isn’t Stopping

Anthropic released Claude Opus 4.5 right after the release of Gemini 3. This immediately took back Claude’s crown as the best coding model and for other engineering-related use cases (see graphic above). Opus 4.5 also introduces integrations with Chrome and Excel, targeting agentic workflows that require managing large codebases and documents.

Most importantly, Anthropic seems to have fixed the rate limits that were their customers’ biggest complaints. Whether Claude Code or the Claude app, the most difficult part of using Claude was rate and performance limiting. Opus 4.5 has much higher rates for users meaning many won’t have to be stingy with their Opus use. Opus 4.5 is also much less expensive via the API.

The Claude app will now auto-compact conversations so users don’t have to start a brand new conversation when their context window is full. This is huge because it’s likely what kept a lot of Anthropic fans from going all-in on Claude. It made most tasks within the Claude app infeasible.

ML for Software Engineers Has Changed to AI For Software Engineers

I did some testing with the name of this newsletter recently. Unsurprisingly, AI significantly outperformed ML in the title. It’s a much more recognizable term and much less intimidating. This newsletter will be AI for Software Engineers going forward, but the content will remain the same.

You might wonder: Why not call it something related to ‘ML Engineering’ or ‘AI Engineering’? Isn’t that what AI for Software Engineers really is? It’s a good question, but the answer is: not quite.

Tech loves to use buzzwords. ML engineer, AI engineer, research engineer, and software engineer are all used in different ways to try and delineate between job responsibilities. In reality, most job postings don’t fit into a single one of those roles and the work that engineers do in AI rarely does either. So, rather than focus on one of the above, I’m going to share bits of all three that are important if you want to work in AI as an engineer.

If you want a specific example of AI/ML/Research Engineer overlap, check out the updated About page for this newsletter (see the image above).

Google, AWS, and Meta Expand Infrastructure to Remain Big Players for the Long Term

AWS recently announced a $50 billion investment to build AI infrastructure specifically for the U.S. government, adding 1.3 gigawatts of capacity for federal missions. Google committed to 1,000x infrastructure growth over the next 5 years and is partnering with Westinghouse to deploy nuclear reactors for power. Meta is taking a novel approach to energy security by seeking federal approval to trade electricity directly, allowing them to secure long-term power for their data centers and resell excess capacity.

There’s also a huge monetary opportunity for software engineers working in AI infrastructure. Companies like Fluidstack and CoreWeave are only a few unicorn startups that have greatly increased their valuation in a short amount of time. There’s a huge demand in this space and many players will fill it. If you can work in AI infra or optimizing AI hardware, I advise you not to pass it up.

Ai2 Releases the US’s Only Truly Open Model

Allen AI has released Olmo 3, a massive milestone for open-source AI in the U.S. Unlike other open model releases that hide key model details, Olmo 3 includes the full training data, code, weights, and logs for its 7B and 32B models. It also competes at the frontier of open models, rivaling competitors like Qwen, Gemma, and more.

It’s important to understand that open source software gives you the code needed to understand how a piece of software works. For machine learning models to reach this same degree of openness, the training data along with how that data is used must be provided.

Read Nathan Lambert’s analysis for more on Olmo 3.

Waymo Expands Across the US

I wrote about how big of a deal Waymo is back in March of this year and everyone is catching on as Waymo expands across the US. Santa Clarita, San Diego, Minneapolis, Tampa, New Orleans, and Miami were all recently added to the list of Waymo-available cities.

Not only is Waymo an incredible feat of engineering, but it also has massive implications for improving the lives of everyday Americans. Autonomous vehicles have the potential to greatly decrease road injury and death, decrease transportation costs, save on time spent in traffic, and make transportation much more widely accessible.

More updates on Waymo expansions can be found here.

OpenAI partners with Target and Intuit

OpenAI has partnered with Target to bring ChatGPT to the Target app and integrate it with Target’s internal supply chain tools. OpenAI has also partnered with Intuit to power their financial products with automation. Both of these mark a long push OpenAI has been making to integrate ChatGPT with more products and collaborate with more enterprises especially within ecommerce.

OpenAI also released GPT-5.1-Codex-Max, a model designed for long-horizon agentic coding tasks that can persist context across sessions. The ChatGPT app has also been updated with a new shopping research feature and the global rollout of Group Chats. I haven’t actually tried the Group Chats feature out yet. If you have leave a comment below to let us know how it is and what you find useful.

Don’t miss last week’s AI for SWEs:

We discussed a brainrot IDE and why it’s such a terrible idea. We also touched on Yann LeCun’s bet and my hypothesis on local ML models saving you $100+ per month. You can check it out here: https://aiforswes.com/p/71

Logan’s Picks

Hugging Face CEO says we’re in an ‘LLM bubble,’ not an AI bubble: Clem Delangue argues we are in an “LLM bubble” that will likely burst next year, distinguishing this from the broader AI industry which remains robust. He predicts a market correction away from massive, one-size-fits-all models toward smaller, specialized models that are cheaper to train and easier to deploy on enterprise infrastructure. Hugging Face is betting on this shift, maintaining capital efficiency while competitors burn billions on compute.
Group Relative Policy Optimization (GRPO), by : GRPO is emerging as the preferred method for training reasoning models like DeepSeek-R1. By eliminating the need for a value function critic, GRPO significantly reduces compute and memory overhead compared to PPO. It enables large-scale reinforcement learning—often without supervised fine-tuning—allowing models to develop emergent reasoning behaviors through simple rule-based or neural reward signals.
AI Engineering in 2025: What It Really Takes to Reach Production, by : AI Engineering must shift from “building” to “gardening.” With 85% of AI projects failing to deliver value, the focus must move from treating LLMs as deterministic software components to managing them as evolving ecosystems prone to drift and decay. Success now depends on adopting rigorous systems disciplines—continuous monitoring, drift control, and lifecycle management—rather than just shipping code and moving on.
Model Quantization: Concepts, Methods, and Why It Matters: A deep technical guide on deploying large models on constrained hardware. It explains how reducing precision from FP32 to INT8 or FP16 shrinks model size and memory bandwidth usage without sacrificing accuracy. It covers essential techniques like post-training quantization and quantization-aware training, detailing how tools like NVIDIA’s TensorRT use calibration and per-channel scaling to optimize inference.
How evals drive the next chapter in AI for businesses: Evals are the only way to bridge the gap between vague business goals and performant AI products. This framework emphasizes creating “golden” datasets—outcomes mapped to specific inputs—to establish a rigorous error taxonomy. By implementing a continuous feedback loop where production logs are reviewed and fed back into the evaluation set, teams can create a compounding data advantage that outperforms simple A/B testing.

Rapid Fire

Dev Tools & Agents

Salesforce introduced Agentforce Observability to log and visualize agent reasoning steps / AWS launched Kiro, a CLI tool for building agents with property-based testing / Microsoft released Fara-7B, an efficient agentic model for computer use / Stack Overflow launched Stack Overflow Internal to feed clean data to enterprise AI agents / A new Workspace extension for Gemini CLI brings Google Docs and Calendar into the terminal / Amazon is using autonomous agents to automate red-team/blue-team cybersecurity testing / Agent design remains difficult, often requiring direct SDK use over leaky abstractions

Applications & Models

Alibaba’s Qwen AI app hit 10 million downloads in a week, outpacing early ChatGPT adoption / NTT deployed a lightweight LLM optimized for Japanese that runs on a single GPU / Google’s WeatherNext 2 forecasts weather 8x faster than prior methods using a single TPU (also here) / Meta’s WorldGen creates traversable 3D worlds from text prompts in minutes / Poe now supports group chats with up to 200 participants and multiple AI models / Google Search added agentic travel planning to build itineraries and check flight deals / released a hand-drawn calendar illustrating 24 AI architectures that I’m 100% asking for for Christmas

Research & Safety

Physical Intelligence trained π0.6, a VLA model that uses RL to learn from autonomous experience and corrections (also here) / Google proposed Nested Learning to give models long-term, associative memory / Amazon researchers introduced FiSCo, a pipeline to measure fairness in long-form LLM outputs / Andrej Karpathy argues that AI detectors are unreliable and assessment must shift to proctored exams / The new Humane Bench tests if chatbots prioritize user wellbeing over engagement / Military integration of AI is changing the rules of war, raising concerns about bias and oversight / DeepMind’s Nobel laureate discusses the future of AlphaFold and protein prediction

Upskill (Coming soon!)

Coming soon to this section will be weekly problems (ML, system design, and software-related) to help you improve your career along with information on who’s hiring, the direction the market is going, and the skills you should learn to level up your career.

Become a paid subscriber to AI for Software Engineers to get this in your inbox!

Subscribe now

ML for SWEs 71: The IDE For Gambling

Logan Thorneloe — Tue, 18 Nov 2025 14:02:48 GMT

Hey all! I’m back with the ML for SWEs roundup but with more content at the start. I plan to do both deep and shallow dives in these roundups. You can guarantee these will keep you up with the state of engineering in AI each week.

These roundups will also feature career growth content such as who’s hiring, the state of the AI job market, the in demand skills, and anything else to help those working in or wanting to work in AI. Those sections will be available to paid subscribers. If you find ML for SWEs helpful, consider becoming a paid sub to support my work.

Subscribe now

Something That Shouldn’t Exist: An IDE With Gambling Included

A new Y Combinator-backed startup is releasing a “Brainrot IDE” and, yes, it’s as awful as it sounds. Essentially, this is an IDE that shows brainrot content while a user is waiting for their AI to finish a coding task. If you’re a person, you can already tell how terrible of an idea this is.

Long-time readers of Machine Learning for Software Engineers (especially if you were a reader back in the Society’s Backend days!) know how important deep work is for software engineering.

One of the biggest cons of using AI tools for coding is constantly switching between tasks as AI codes for you. There isn’t a way to avoid this as current coding agents have significant latency while reasoning, but there is a huge benefit to limiting it and using time wisely while the AI does its work.

Now, this IDE not only encourages that, but it also exposes you to harmful content in the process. Clad Labs even shows gambling as some of the brainrot content in promotional material. Gambling is never a good thing and the state of gambling in the US is abhorrent.

To put it simply, there are 3 things you should know about this IDE:

There is no benefit to ingesting brainrot content
There is a huge benefit to the critical thinking that occurs while programming.
This IDE completely removes the latter to replace it with the former.

You’re Sleeping on Local Models

I had an argument with Gemini this past week discussing the financial feasibility of using cloud-based infrastructure versus local hardware for model training and inference. I assumed, from my previous research, that cloud infra was always the way to go but Gemini pointed out that for most personal use cases, I’m wrong.

There’s one piece of local hardware worth spending money for local LLM inference: A beefy MacBook or Mac Studio. Not only is it financially feasible, but it also provides other benefits not available from most cloud providers.

Every developer I know that’s using CLI coding tools relies on the quota supplied by the $100+ usage tiers. The most notable tools falling into this category are Claude Code, Cursor, and Codex. Many founders, solopreneurs, and developers find themselves relying on even higher tiers to get their work done at the pace they need to.

So let’s crunch some numbers using the $100/mo figure. Let’s consider the lifespan of a MacBook to be four years at least. This is if you upgrade to stay on the most current hardware.

Four years of paying for a $100/mo subscription ends up being $4800. To upgrade a MacBook purchase to a 128 GB model costs a few thousand dollars. With Apple’s neural engine, which is incredibly efficient for inference, the MacBook can run the highest tier open coding models. These models are on-par or better than the previous generation frontier models and tuned specifically for coding tasks. These open models are incredibly performant.

This means the laptop can run LLM coding agents at less of a cost than cloud infra can provide. This especially true when you consider the lifespan of a MacBook can be well over a decade and the next coding models will only become more performant at the same size or smaller and capable of running locally on lesser hardware.

Local models like this also provide a benefit most cloud providers cannot: reliability. In discussions about local AI, the primary focus is usually privacy and security. Running an LLM locally also means you don’t have to worry about performance dips when model weights are updated or APIs are throttled. Your model opens updates when you update it.

With 128 GB (and definitely 256 GB) of RAM, you can run a 70B+ parameter quantized coding model right on your machine. Many of these open coding models are on-par with or better than the previous generation of top tier models. They nearly replicate the experience of using the CLI tools from large companies (with variance based on user preference, tool usage, etc.).

If this is something that interests you, reach out to me. I’ve purchased a 128 GB RAM MacBook Pro set to arrive today to test this out personally. I’ll be working with these local coding assistants for the next bit and I’ll let you know how it goes.

Leave a comment

Yann LeCun Was Right and He’s Backing That Up

A while back I wrote an article explaining why transformer-based LLMs might not get us to AGI. These models have very obvious limitations to anyone that uses them or builds products with them. Those limitations inhibit the common definition of AGI. This is why I get so excited about unique AI applications and alternative LLM architectures (looking at you, state space models).

Yann LeCun has been explaining this for a while, stating that LLMs are a dead end when it comes to achieving AGI. Years ago, he told new AI researchers to focus their efforts elsewhere to help figure out what comes next.

Now, LeCun is putting his money where his mouth is and starting his own AI lab to research what he thinks is the right direction for more beneficial AI. If you’re involved in AI engineering in any capacity, you should keep an eye on this.

Enjoy the following resources! I’ve included everything that’s happened this week, your must-read articles, posts, and videos and everything that’s important.

If you missed our last ML for SWEs, check it out here:

🌟 Don’t Miss These Must Reads

Inside a Chinese frontier lab: Inclusion AI interview by : Inclusion AI scaled rapidly in 2025 to build trillion-parameter MoE-first models, emphasizing parallel reward systems and eval-driven heuristics to find demoable capabilities.
Giving your AI a Job Interview, by : Argues that as standard benchmarks become flawed, real-world evaluation requires “job-like interviews”—running models on realistic, expert-generated tasks and blind-rating the outputs.
How to Build Your First Recommendation System (Easy) by : A practical guide to building a recommendation system using a PyTorch matrix factorization model with user/artist embeddings, trained with MSE loss and served via Streamlit.
Upwork study shows AI agents excel with human partners but fail independently: Upwork’s study of 300+ real client projects found standalone AI agents often fail on professional tasks, but with about 20 minutes of expert human feedback per iteration, project completion rates rose by up to 70%.
Sebastian Raschka on how to read technical books by : Read chapters offline first without running code. On a second pass, retype and run all code, debug discrepancies, and complete the exercises before applying the concepts to real projects.

📰 What’s Happening

Major investors are selling Nvidia stock, with Peter Thiel’s fund selling its entire stake in Q3 as a bet against AI hype, and SoftBank selling its $5.83B stake in October.
Leaked documents suggest OpenAI paid Microsoft $493.8M in 2024 and $865.8M in Q3 2025 under a 20% net revenue-share deal, implying OpenAI’s inference spending may be exceeding its revenue.
Anthropic is investing $50 billion in new US data centre projects in Texas and New York, aiming to break even by 2028 and support its 300,000+ business customers.
Chinese state-sponsored groups used Claude Code agents to autonomously infiltrate ~30 global targets, with AI performing 80-90% of the operation, including reconnaissance and exploit-code writing.
Visa is building an “Intelligent Commerce” infrastructure with a Trusted Agent Protocol to authenticate agentic transactions, targeting Asia Pacific pilots in early 2026.
A German court ruled that OpenAI’s ChatGPT violated German copyright law by training on licensed musical works without permission and ordered the company to pay damages.
OpenAI launched “OpenAI for Ireland” in partnership with the Irish Government to boost AI adoption among SMEs and founders through a booster programme, workshops, and free online courses.
Teen founders raised $6M to build Bindwell, a startup using AI models tuned from AlphaFold to design pesticides that target pest-specific proteins.
Deductive AI, which raised $7.5M, uses multi-agent systems and a knowledge graph to diagnose production incidents, saving DoorDash an estimated 1,000+ engineering hours annually.
Amazon launched an invite-only private AI bug bounty to strengthen its Nova foundation models, with rewards ranging from $200 to $25,000.
Investors advise AI startups to track “durability of spend” and move beyond experimental budgets to core CXO budgets to validate product-market fit.
AI-native startups are now achieving up to $200M in ARR and accelerating product cycles from two-week sprints to single-day iterations by customizing models for vertical tasks.
Google committed $30 million from Google.org to fund AI learning projects and research, and is rolling out LearnLM (Gemini 2.5) to students and educators.
OpenAI is legally challenging a New York Times demand for 20 million randomly sampled consumer ChatGPT conversations, calling the request an overreach and an invasion of user privacy.

🚀 Products and Tools

Google launched Private AI Compute, a system that runs advanced Gemini models on TPU-powered Titanium Intelligence Enclaves (TIE) to process sensitive data with a “zero access” assurance.
The push for spatial AI is growing: AI systems can now both build 3D worlds and act inside them. Fei-Fei Li’s World Labs launched Marble, a generative world-model product, while Google DeepMind built SIMA 2, a Gemini-powered agent to play games.
Baidu announced ERNIE 5.0, a proprietary natively multimodal model available via API, claiming it outperforms GPT-5 and Gemini 2.5 Pro on benchmarks like DocVQA and ChartQA.
Weibo released VibeThinker-1.5B, an open-source 1.5B-parameter LLM (MIT license) that achieves top-tier reasoning on math/code benchmarks with only a $7,800 post-training budget.
Hugging Face and Google Cloud deepened their partnership, integrating the HF library into Vertex AI and GKE, adding a CDN Gateway to cache repos, and expanding native TPU support.
VS Code notebooks can now connect directly to Google Colab runtimes, allowing use of Colab-provided GPUs and TPUs from within the local VS Code editor.
Google’s NotebookLM added a “Deep Research” tool that automates online research and returns a source-grounded report, plus added support for Google Sheets, Drive URLs, and Word docs.
Simon Willison describes using async coding agents like Claude Code to run “code research” projects by giving them a goal, a GitHub repo, and network access to file PRs.
The ButterCut toolchain combines Ruby and Claude Code to analyze video, generate word-level WhisperX transcripts, and create YAML roughcuts for video editors like Final Cut Pro.
A new calculator uses DeepSeek V1 empirical scaling laws to pick near-optimal batch size and learning rate from a model’s parameter counts and compute budget.
Vector databases are seeing a “sober reality” two years after the hype. The market has commoditized, and enterprises now favor hybrid stacks and GraphRAG for better accuracy.

🔬 Research

OpenAI trained transformer language models with most weights forced to zero, creating sparse networks where internal computations are more disentangled and interpretable, allowing them to trace exact circuits for simple tasks.
Google and UCLA introduced Supervised Reinforcement Learning (SRL), a method that breaks expert demos into intermediate actions with step-wise rewards, improving small-model reasoning on hard math and agentic tasks.
Testing of nine AI models in a realistic RL environment revealed a hierarchy of agentic capabilities (tool use, task decomposition, argument mapping, execution, adaptability, common sense), with top models still failing over 40% of tasks.

🎓 Learning Resources

A guide on how to design agentic software for trust advises using structured interfaces, exposing intermediate state, and allowing inline refinement rather than brittle text-only chat.
An article about how the US and China are in a multi-dimensional competition, with the US focused on deep learning and compute while China emphasizes embodied AI, robotics, and industrial adoption.

💼 Who’s Hiring and More

Coming soon to paid subs!

Thanks for reading!

Always be (machine) learning,

Logan

ML for SWEs 69: This is how top AI research labs will make money

Logan Thorneloe — Wed, 08 Oct 2025 22:03:18 GMT

I want to say a quick thank you to you all! We hit bestseller status on Substack this past week. Thank you for your support!

I apologize that this won’t be one of my usual roundups full of resources. I’m revamping how I track things to make it more efficient and happened to lose what I saved this week. On an unrelated note, if you can get the Gmail connector to work reliably in ChatGPT, Claude, or Gemini, shoot me a DM.

By next week it’ll be figured out and more helpful for both me and you.

Nice

That being said, there’s still something really important you should know about that I’ve spent a good amount of time thinking about this week. First, here’s a quick roundup on OpenAI’s Dev Day:

Apps are in ChatGPT now. This means you can ask ChatGPT to do something you’d normally do in an app via its chat interface, and it’ll take care of it for you. This changes the app interface from a visual, touch-focused interface to a chat-first interface, which is much more suitable for digital assistants. This is something that needs to be done if we’re ever to have truly capable digital assistants, but I’m curious to see how well it works. I’ve yet to get either Projects or Tasks working reliably, so I’m not 100% confident when this will get there.
AgentKit. This is a framework for building and deploying agents. It’s similar to Google’s Opal or something like n8n. It seems locked to OpenAI’s models. I don’t see a platform like this thriving without flexibility in model choice. It will surely be useful to some people, but I know a lot of people exploring agents who are constantly fitting new models to problems to optimize the agent’s workflow. I don’t see the advantage a platform like this, locked to a specific vendor, provides over an open platform with the same features. It’s possible I’m missing some key financial info, but that’s my take.
Chat integration SDK. This makes it easy for anyone to make their own ChatGPT-style chatbot directly on their website. It seems this is also OpenAI models only, but for this application, I actually think it works. Most people using this are looking for a low-effort solution to have a chatbot. My guess is anyone using this will care more about it “just working” than what the underlying model is. They won’t be tweaking nearly as much as the agent builders in the previous bullet.
Agentic evals. Model evals and benchmarks are difficult. Agent evals are even harder. OpenAI has released a toolkit for creating and running evals in AgentKit. This toolkit can also be used for agents built with third-party models. The community will take as many agent eval tools as they can get at this point because they’re so difficult most people don’t even do them. That’s like creating code without proper testing but possibly even scarier. Making agent evals easier and reminding people to create evals is a net positive for the developer community.
Codex will be getting updates. There wasn’t much of an update to Codex, but OpenAI realizes how the developer community has embraced it and will continue improving it.
Sora 2 is available via API. High-quality video generation can now be built directly into products with control over multiple generation properties (think length, resolution, pacing). I think this alone opens up more agent use cases.

Dev Day and a few interactions I’ve had this week have had me thinking a lot about how money has shaped the AI race.

A meme I no longer think is true.

Leading up to Dev Day, many people were talking about how OpenAI was going to kill startups. This has been a theme for the past few years, but this year was a bit different. I didn’t feel like any of the announcements immediately killed off many startups relying on OpenAI’s APIs to do the same.

This is because OpenAI understands where their income potential lies—and it isn’t at the application layer but in enabling it. I felt that was the main takeaway from Dev Day.

This is something Google realized a while ago, but Google was already building for developers, so it made more sense. OpenAI started by building a productivity app and gaining recognition that way. But that app is far less profitable than enabling others to build apps using OpenAI’s technology.

As I was reading What 55 Billion Chatbot Visits Actually Tell Us About the AI Race and the Best AI Chatbots Right Now by this week, I realized this is actually why the Gemini app lacks so many features. I believe Google realizes the Gemini app brings in a lot more token consumption without bringing in significant revenue. It makes sense to focus development efforts elsewhere and provide just enough to stay relevant. As a Google employee, you’d think I would have realized this sooner.

You might think the Gemini app has potential to bring people into the Google ecosystem to spend more—and you’d be right. But the Gemini built into Google Workspace, Colab, and other popular user tools has far more potential to do this than the standalone Gemini app.

If you haven’t read Devansh’s article I linked above, you should. It’s an excellent read.

Interestingly, the company I think identifies and targets revenue sources best is Anthropic. For years, it seemed like Anthropic would struggle to compete with the likes of OpenAI and Google. Anthropic doesn’t have any video or audio generation, they’re not nearly as generous with user limits, and they don’t provide as many tools to users—but they’re still around and thriving.

Everyone knows them for vibes and safety, but in reality they do an incredible job of understanding where they can provide value and focus on that (I’m looking at you, Claude Code). To me, it seems like they do a great job of going heads down on the problems they know matter and running their own race.

There’s always such a focus on how large AI research labs burning through money will monetize. I find these conversations always come back to the applications they’ll provide to make money. This perception might be skewed by the fact that I generally chat with engineers and we live at the application layer, but the point still stands—I don’t think the application layer is the plan for these labs, at least not entirely.

These labs are focused on providing the tools needed for others to build applications with their technology. I see fewer companies being killed by OpenAI, DeepMind, Anthropic, and others, and more thriving because of them.

I don’t think this Dev Day killed too many startups. Most people I know building agents went, “That’s cool,” and went back to building.

Let me know if I’m wrong in the comments and what you think. Do you think top AI research labs will dominate at the application layer or focus on enabling it?

Thanks for reading!

Always be (machine) learning,

Logan

ML for SWEs 68: 43 Years Later, Developer Productivity Is Still a Billion-Dollar Problem

Logan Thorneloe — Wed, 01 Oct 2025 17:26:16 GMT

Welcome to Machine Learning for Software Engineers!

This newsletter helps software engineers understand machine learning, artificial intelligence, and the industry as a whole. We focus on its impact, how to use it, and how to build systems with it.

Each week I share a recap of must-read resources, current events, hiring updates, AI funding, and more. If that interests you, consider becoming a paid subscriber to support the newsletter.

We’re close to bestseller status, and until we reach it, subscribers can lock in a price of just $10 a year. Remember that many companies cover learning resources like this, so check if yours will sponsor your subscription.

Get 80% off forever

I’d also love your feedback on this week’s new format. The goal is to share more, keep things timely, and make the roundup more valuable for you.

One of the biggest things in business is being able to gather metrics and analyze data to understand if large-scale investments are worth it for a company in the long run.

This has been a huge topic of discussion in the software industry for the past few weeks. Companies are spending billions of dollars on AI and thousands of hours integrating it into software development workflows, but they’re quickly realizing how difficult it is to quantify AI’s impact.

To understand this, here’s some important software engineering lore:

Why do I bring this up? It’s been 43 years and we still don’t have a definitive approach for measuring software development productivity.

The problem is that companies are investing heavily in productivity without a reliable way to measure whether they’re receiving a return on that investment. So every company is tasking their employees with figuring this out.

It seems like it should be a simple problem, but we don’t actually know what good metrics for measuring developer productivity are.

Every software engineer knows that commonly collected metrics such as lines of code written, tickets completed, and documents contributed to are bad productivity metrics.

One could argue that it can be measured by the output of a developer’s work, but even that is difficult to quantify. That’s why so many companies have grueling promotion processes where managers will sit in meetings for hours or days on end to debate the impact of an individual’s work.

I’ve seen many measurements of developer productivity, but none have been definitive. End-to-end work time might be a good bet, but there are many confounding variables that contribute to that measurement (the type of task, the tools required, the potential interruptions, etc.).

Adding the layer of AI-assisted work versus non-AI work makes these measurements even more complicated and ambiguous. There are many types of “AI use” for developers (chatbots for questions, summarization tools, coding agents, CLI agents, etc.), so how do we slice by these and how do we view metrics in aggregate?

Solving this problem is in extremely high demand. The general sentiment (largely driven by clicks and engagement) is that AI can fully replace developers, but studies suggest AI can sometimes make developers slower even when they think it’s making them faster.

To add more complexity, any developer who has used an AI tool can recognize its potential, and most have been aided by it on certain tasks. But the feedback I’ve gotten from senior developers is that many tasks take too much coaching for the AI to handle and would have been faster to do themselves.

The important takeaway is: Developer productivity has been at the cutting edge of software engineering research for 43 years. Now billions of dollars are being sunk into it with no reliable way to measure a return on that investment.

We also can’t truly say whether AI is, in aggregate, helpful or harmful for software development.

As someone who works in AI developer tools, they’re undoubtedly the future, but they need to be done right. It’s a difficult space with many interesting problems to solve and that’s why I enjoy working on it so much.

I’d love to hear your thoughts: has AI genuinely made you more productive, or has it slowed you down?

Leave a comment

In case you missed the weekly ML for SWEs from two weeks ago, we discussed AI creating a 3-day work week. You can find that here:

Must Reads

Deep Learning Focus — REINFORCE: Easy Online RL for LLMs — Why simple policy-gradient methods (REINFORCE/RLOO) can rival PPO for LLM training without orchestration complexity or instability, making RLHF/RLAIF workflows easier to adopt in engineering practice. Source: Deep Learning Focus by
NLP Newsletter — Top AI Papers of the Week — Summaries of Meta’s ARE simulator and Gaia2 benchmark, ATOKEN unified multimodal tokenization, and Code World Model. A compact survey of the most relevant research progress. Source: NLP Newsletter by
Artificial Intelligence Made Simple — Weekly AI Updates Recap — Covers the most important AI announcements of the week, including model releases, regulatory developments, and ecosystem shifts, with thoughtful framing for engineers. Source: Artificial Intelligence Made Simple by
Artificial Ignorance — Weekly AI Roundup — Sharp commentary on major releases, infra announcements, and platform features, filtering hype and highlighting what actually matters for builders. Source: Artificial Ignorance by
LLM Quantization — What engineers actually need — Practical guidance on PTQ vs QAT, INT8/INT4 trade-offs, calibration strategies, and toolchains (GPTQ, AWQ, bitsandbytes, GGUF) for deploying LLMs in production. Source: Full Stack Agents by

Product and Industry

Claude Sonnet 4.5 charts — SWE-bench Verified at 82% and “computer use” workflows that push agentic coding forward. Source: Department of Product
OpenAI: in-chat purchasing (ACP) — “Buy It in ChatGPT” via the Agentic Commerce Protocol with Stripe & Shopify. Source: OpenAI
OpenAI on OpenAI — How they use models for support, GTM, and inbound conversions; lessons for teams. Sources: Support, GTM, Inbound, Internal playbook
Gemini Robotics 1.5 — Long-horizon embodied agents that perceive, plan, use tools, and act. Source: Google DeepMind Blog
Perplexity launches Search APIs — New APIs for embedding Perplexity’s search and Q&A into other applications. Source: Department of Product

Research

REINFORCE for LLMs — Practical online RL alternatives to PPO for LLM alignment. Source: Deep Learning Focus
This week’s top papers — Agents, multimodal tokenization, and code-as-environment. Source: NLP Newsletter
Localized adversarial attacks — Systematic benchmarks of targeted, imperceptible noise. Source: arXiv:2509.22710
In-context continual learning — Evidence that ICL can accumulate cross-task knowledge over time. Source: arXiv:2509.22764
Communication-efficient, interoperable distributed learning — Heterogeneous models with a shared fusion-layer interface. Source: arXiv:2509.22823
On the capacity of self-attention — Theoretical limits on information storage/propagation. Source: arXiv:2509.22840

Engineering and Tools

LLM quantization guide — PTQ/QAT, INT8/INT4, calibration, and production trade-offs. Source: Full Stack Agents
Claude Code 2.0 — npm package for more autonomous code assistance. Source: npmjs.com
FastMCP — Practical path to build Model Context Protocol servers without the boilerplate. Source: Technically

Learning Resources

Why models hallucinate — Causes, failure modes, and mitigation levers engineers can apply. Source: Technically

Funding

Cerebras Systems — $1.1B Series G at $8.1B — Fidelity & Atreides led; 1789 Capital, Alpha Wave, Altimeter, Benchmark, Tiger Global, Valor joined. Source: Axios
Rebellions — $250M Series C at $1.4B — Investors include Arm Holdings, Samsung Ventures, Pegatron, Lion X Ventures, KDB, Korelya. Source: Axios
Eve — $103M Series B (legal AI for plaintiff firms) — Led by Spark Capital; a16z, Lightspeed, Menlo. Source: Axios
Zania — $18M Series A (agentic AI security) — Led by NEA; Palm Drive Capital, Anthology Fund. Source: Axios
Alex — $17M Series A (AI recruiter) — Led by Peak XV; YC, Uncorrelated. Source: Axios
Augmented Industries — €4.5M pre-seed (industrial AI) — Led by b2venture. Source: Axios

Who’s Hiring

OpenAI — Software Engineer (Seattle, WA) Source: LinkedIn
Turing — Senior Software Engineer (Remote, US) Source: LinkedIn
Google — SWE Early Career, Campus 2026 (Pittsburgh, PA) Source: LinkedIn
Google — SWE Early Career, Campus 2026 (New York, NY) Source: LinkedIn
NVIDIA — Sr SWE, Data Infra for Robotics Research (US) Source: LinkedIn
Microsoft — Senior Software Engineer (US) Source: LinkedIn
Microsoft — Senior Software Engineer (Redmond, WA) Source: LinkedIn
Toast — Senior Software Engineer (US) Source: LinkedIn
Netflix — ML Engineer 5, Content & Studio (US) Source: LinkedIn
Yahoo — Machine Learning Engineer (US) Source: LinkedIn
Pinterest — Machine Learning Engineer (US) Source: LinkedIn
Mercor — Senior Machine Learning Engineer (Remote, up to $150/hr) Source: LinkedIn

Safety and Ethics

California SB 53 signed — Transparency reports and incident reporting required for AI firms operating in CA. Source: Office of Governor Newsom

Thanks for reading!

Always be (machine) learning,

Logan

ML for SWEs 67: No, there won't be a 3-day work week

Logan Thorneloe — Wed, 17 Sep 2025 15:06:14 GMT

We’re only 5 paid subscribers away from bestseller status! The next 5 people can become a paid subscriber for only $10/yr forever. Thanks for all your support!

Get 80% off forever

There’s a huge question mark right now about how work will be in the future. I find it especially interesting because there’s two parties discussing this topic that approach it from two different angles.

The first party is worried about an economy that leaves their skillset behind. This party worries about providing for their family. They worry about their ability to afford food and other basic necessities.

The second party discusses the positive impact of AI on the workforce. Multiple CEOs have predicted AI creating a 3-day workweek in the future giving everyone more time to do what they want while getting their work done in less. I can’t help but notice that this crowd always seems to be people who are well-off enough not to worry about what happens to them when their income dries up.

In reality, we don’t really know what the nature of work will be in the future. We can make predictions based on AI’s current impact on the workforce, but it’s hard to quantify the actual impact of AI versus the impact caused by the FUD/hype surrounding AI.

Here’s a quick low-down on what we don’t know, what we do know, and what we can most accurately predict about AI’s impact on the workforce:

What we don’t know:

How quickly advancements will happen and how soon after impact will follow. Even area experts find it difficult to predict how AI will have advanced in 6 months.
What that impact will be. If we can’t make the prediction mentioned above, how can we possibly accurately predict the impact on the workforce in the long-term?
Most predictions fall into this category. Good examples include: AI replacing software engineers 25 months ago (couldn’t find the link for this but its been all over X for a while) and 90% of code being written by AI by now.

What we do know:

AI will transform the workforce. It absolutely can already replace certain jobs. Anyone who denies this is wrong. The best recent example of this is Nano Banana.
As AI is used, the importance of “why” something is being done and the higher-level thinking behind work becomes more important. The “what” and “how” are what we can most likely replace with AI.

What we can accurately predict:

AI will advance until it hits an absolute wall or society reaches post-scarcity. If you’re unfamiliar with the concept, follow that link to brush up on it. It’s what AI is aiming to achieve.
The current sentiment behind AI is to get more done faster. It is highly unlikely AI will create a society where humans do less work and much more likely to create a society where humans do more work in the same amount of time.

So what’s the takeaway from all of this?

As soon as you realize even the most seasoned AI veterans spend their time trying to understand how AI can be used and where AI will be more impactful, predictions about the future of AI become much more grounded in reality.

Usually, predictions are made to fit a narrative by people who are adjacent to the problem, but not directly involved in it. Those directly involved have their heads deep in the problem space trying to understand what’s going on instead of making up claims.

Practically speaking, you should know two things. First, no, we won’t have a 3-day workweek due to AI. For the second, I’ll leave you with this image below:

Sorry if the tone of this week’s ML for SWEs edition seems negative. I find myself frequently having to bring things back down to reality. AI is the world’s most important technology, but it can only live up to its potential if we approach is realistically!

Enjoy the rest of the resources! Last week, we discussed safety being a fundamental AI engineering requirement. If you missed it, you can catch that here:

Must reads

Claude's new Code Interpreter: Claude's new Code Interpreter enables the creation and editing of files such as Excel spreadsheets, documents, and PDFs. The feature supports advanced data analysis and data science, including the generation of Python scripts and data visualizations. Claude executes custom Python and Node.js code within a server-side sandbox for data processing.
Defeating Nondeterminism in LLM Inference: Large language models struggle with reproducibility, often providing different results even when asked the same question. Even with temperature adjusted to 0 for greedy sampling, LLM APIs and OSS inference libraries do not produce deterministic outputs. One hypothesis attributes this nondeterminism to floating-point non-associativity on GPUs combined with concurrent execution.
The 5 AI-Related Jobs Every Software Engineer Should Know About: Five consolidated AI-related roles are AI Research Scientist, Research Engineer, Machine Learning Engineer, Software Engineer, AI, and AI Engineer. AI research scientists conduct original research to advance AI capabilities, focusing on novel AI capabilities through experimentation and a research workflow. A PhD or sometimes an MS in a related field and deep research experience in a relevant AI field are qualifications for an AI Research Scientist.
How AI is helping 38 million farmers with advance weather predictions: The University of Chicago employs NeuralGCM, a Google Research model, for predicting India's monsoon season. NeuralGCM combines traditional physics-based modeling with machine learning. 38 million farmers in India received AI-powered forecasts regarding the monsoon season's start.
Fully autonomous robots are much closer than you think – Sergey Levine: Sergey Levine is one of the world’s top robotics researchers and a co-founder of Physical Intelligence. He believes fully autonomous robots are much closer than commonly perceived. Levine indicates the field is on the cusp of a "self-improvement flyw."

Other interesting things this week

AI Developments

Listen to a discussion on how AI can power scientific breakthroughs.: Logan Kilpatrick hosts the Google AI: Release Notes podcast. Pushmeet Kohli leads Google DeepMind’s science and strategic initiatives team. A team's problem-solving framework produced AlphaFold and AlphaEvolve; AI co-scientist is a new tool intended to enable breakthroughs.
Social media promised connection, but it has delivered exhaustion: Social media feeds display repetitive stock portraits, generic promises, and recycled video clips across platforms. Algorithmic prioritization sidelines genuine human content, leading engineered content and AI-generated material to receive more interactions. The focus of these feeds has shifted from people to consumers and content consumption.
On China's open source AI trajectory: China is maneuvering to double down on its open AI ecosystem. Chinese AI models including Qwen, Kimi, Z.ai, and DeepSeek demonstrated dominance in open models during the summer. Prior to the DeepSeek moment, AI was likely a fringe issue for the PRC Government.
The latest AI news we announced in August: Google made many AI advancements in August. AI Mode in Search expanded to more countries, and Deep Think became available in the Gemini app. New Pixel hardware with advanced AI features was released, and AI learning tools became free for college students.

Product Launches

Introducing upgrades to Codex: OpenAI has updated Codex to make it faster and more reliable in all its formats (web, IDE, CLI). They’ve also released GPT-5-Codex, a GPT-5 model optimized to be an independent coding agent.
How AI made Meet’s language translation possible: Google Meet now features real-time language translation, built collaboratively by Google Meet, DeepMind, and Research teams leveraging AI. This Speech Translation feature converts spoken language in near real-time, available in Italian, Portuguese, German, and French.
Claude now has access to a server-side container environment: Claude can create and edit Excel spreadsheets, documents, PowerPoint slide decks, and PDFs directly in Claude.ai and the desktop app. File creation is available as a preview for Max, Team, and Enterprise plan users, with Pro users gaining access in the coming weeks. Claude creates actual files from instructions, processing uploaded data, researching information, or building from scratch.

Tools and Resources

The latest Google AI literacy resources all in one place: Google is building AI literacy programs and resources for parents, educators, and students, accessible in a new AI Literacy hub. A new podcast, "Raising kids in the age of AI," created with aiEDU, will release episodes beginning September 25. A new video series demonstrates using Google's AI features, including Guided Learning, for homework assistance and custom study guides.
How to Build Agentic AI 2 (with frameworks) [Agents]: Devansh wrote the material, published on September 12, 2025. A previously published guide for Agentic AI construction is being reshared. The "Chocolate Milk Cult" community reaches over one million members each month.
Toddlerbot: Open-Source Humanoid Robot: ToddlerBot is a low-cost, open-source humanoid robot platform. It is designed for scalable policy learning and research in robotics and AI.
DOOMscrolling: The Game: A web browser game inspired by Doom was developed, playable solely through scrolling. An initial attempt to create the game using LLMs failed nine months prior, with GPT-4 misinterpreting scrolling direction for background movement. A functional prototype was later created in two hours with GPT-5, following a broad design description that compared it to an upside-down Galaga.
AI Roundup 135: Feature parity: ChatGPT gained the ability to use MCP clients, and Claude received access to a code sandbox with file reading and creation. Both features carry warnings about prompt injections, data destruction risks, and malicious connectors.

Research and Analysis

How People Use ChatGPT [pdf]: An excellent overview of how people are actually using ChatGPT and similar LLM tools and apps.
How to Analyze and Optimize Your LLMs in 3 Steps: A three-step process exists for analyzing and optimizing Large Language Models (LLMs) after production deployments. This process involves analyzing LLM outputs, iteratively improving areas with the most value to effort, and evaluating and iterating. Step 1, analyzing LLM outputs, can utilize manual inspection, grouping queries by taxonomy, or an LLM as a judge on a golden dataset.
How to Train Graph Neural Nets 95.5x Faster[Breakdowns] by : Changing data representations enhances AI system performance, a principle behind techniques like Feature Engineering, Prompting, and Context Engineering. These methods alter how a model perceives input to achieve superior responses. Prompting guides Large Language Model generations in the semantic space.
Comparing Algorithms by : Algorithm analysis involves determining an algorithm's correctness, its cost, and whether a better alternative exists. An algorithm taking 3N steps is considered equivalent to one taking N steps. Alice devised a solution to a public challenge by checking every possible start and end day, summing duel results for each period.
Why accessibility might be AI’s biggest breakthrough: The UK's Department for Business and Trade conducted a Microsoft 365 Copilot trial between October 2024 and March 2025. Neurodiverse employees reported statistically higher satisfaction and were more likely to recommend the tool than other participants. The study also noted benefits for users with hearing disabilities and suggests AI tools may address workplace accessibility gaps.
Claude’s memory architecture is the opposite of ChatGPT’s: Claude's memory system starts each conversation with a blank slate and activates only when explicitly invoked. It recalls information by referring to raw conversation history, without using AI-generated summaries or compressed profiles. When memory invocation occurs, Claude deploys retrieval tools to search past chats in real-time.

Infrastructure and Engineering

How DoorDash uses AI Models to Understand Restaurant Menus: DoorDash uses large language models (LLMs) to automate the process of turning restaurant menu photos into structured data. This automation addresses the challenge of keeping menus updated, which is costly and slow when done manually at scale. The project's technical goal is accurate transcription of menu photos into structured menu data with low latency and cost for production at scale.
Accelerate Protein Structure Inference Over 100x with NVIDIA RTX PRO 6000 Blackwell Server Edition: Since AlphaFold2's release, AI inference for determining protein structures has skyrocketed. CPU-bound multiple sequence alignment generation and inefficient GPU inference remained rate-limiting steps despite these advancements. New accelerations developed by NVIDIA Digital Biology Research labs enable faster protein structure inference using OpenFold at no accuracy cost compared to AlphaFold2, utilizing the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU.

Security and Governance

9 Best Practices for API Security ⚔️: Hypertext Transfer Protocol (HTTP) sends data in plaintext without encryption. This allows information on a public network to be easily accessed or changed.
Anthropic Might Legally Owe Me Thousands of Dollars by : Anthropic reached a $1.5 billion copyright settlement, covering approximately 500,000 books. This settlement allocated $3,000 per work to book authors whose works were allegedly pirated to train Anthropic's AI systems. The company reportedly downloaded these books without permission from "shadow libraries" such as LibGen and PiLiMi.
Modeling Attacks on AI-Powered Apps with the AI Kill Chain Framework: AI-powered applications introduce new attack surfaces not fully captured by traditional security models. The NVIDIA AI Kill Chain consists of five stages—recon, poison, hijack, persist, and impact—and focuses on attacks against AI systems.

Career and Industry

A joint statement from OpenAI and Microsoft: On September 11, 2025, OpenAI and Microsoft signed a non-binding memorandum of understanding for the next phase of their partnership. The companies are actively working to finalize contractual terms in a definitive agreement. Their joint focus remains on delivering AI tools for everyone, grounded in a shared commitment to safety.
Thinking Machines becomes OpenAI’s first services partner in APAC: Thinking Machines Data Science is joining forces with OpenAI as its first official Services Partner in the Asia Pacific region. This collaboration helps businesses in APAC turn artificial intelligence into measurable results. The partnership offers executive training on ChatGPT Enterprise, support for building custom AI applications, and guidance on embedding AI into operations.
Zoom CEO predicts AI will lead to a three-day workweek: Zoom CEO Eric Yuan predicts that AI automation will lead to a three-day workweek, fundamentally altering the nature of human jobs and freeing up people to focus on more creative and strategic endeavors.
OpenAI Grove: OpenAI announces a program that allows individuals building with AI to work closely with AI researchers. It focuses on creating a talent-dense network and providing resources for participants.

If you found this helpful, consider supporting ML for SWEs by becoming a paid subscriber. You'll get even more resources and interesting articles plus in-depth analysis.

Always be (machine) learning,

Logan

ML for SWEs 66: Safety is a fundamental AI engineering requirement

Logan Thorneloe — Wed, 10 Sep 2025 15:34:33 GMT

Welcome to Machine Learning for Software Engineers. Each week, I share a lesson in AI from the past week, five must-read resources to help you become a better engineer, and other interesting developments. All content is geared towards software engineers and those that like to build things.

Subscribe now

I remember a little while back when the head of OpenAI’s superalignment team, Jan Leike, left OpenAI due to safety concerns and joined Anthropic. At that time, there was a debate heating up in the AI community about whether or not AI should push forward at maximum speed or should slow down and focus further on safety before releasing more capable models.

As is usually the case with primarily online debates, most people took one side or the other without focusing on the middle. It became a debate about whether one should be an AI doomer (slow down entirely) or should entirely disregard safety and push AI forward at maximum speed. Of course, the path forward is much more centric and reality is pushing us in that direction.

Recently, we’ve seen:

OpenAI add parental controls to ChatGPT due to a lawsuit concerning a teen’s suicide after they were seemingly encouraged to go through with the act by ChatGPT.
Meta revise their AI chatbot policies after littering their social platforms with AI due to child safety concerns.
Anthropic back SB 53, a California bill aiming to prevent “catastrophic [AI] risks” by requiring frontier model developers to publish security reports and be more transparent about model development.

When it comes to real-world applications of AI, there’s fundamentally a safety component that needs to be addressed. This is no different than the early days (and I guess the current days too) of the internet where we discovered all sorts of malicious ways the internet can be used.

This is always the case with new technology: People find ways to use it to do bad things and then we look to find ways to ensure those bad things don’t happen. This is what’s happened in the cases linked above.

I’m not saying this to throw blame at any of the AI developers or companies creating these models. Finding ways to exploit new technology is bound to happen and the most important thing is that those exploitations are addressed. I’m saying this to showcase how silly it is not to have safety as a forethought when developing new technologies.

As software developers, this is something we need to understand completely. Every system design has security and safety at its core. This should be the same for AI systems, but understanding the safety and security of AI systems is a lot more complex.

My heart goes out to the families affected by the events listed above. I recognize that just “thinking about safety” in the design process doesn’t guarantee a 100% safe technological outcome, but that doesn’t mean we shouldn’t put the effort forth it requires to do so.

In the following weeks, I’ll be looking for good AI safety resources and try to keep y’all updated on the safety findings from the AI community so we can all build these systems better.

If you missed last week’s ML for SWEs, we discussed the AI bubble popping and why that’s actually a good thing. You can catch that here:

Must reads

The Rise of Cloud Coding Agents by : Agent-assisted coding includes tools like Cursor, Windsurf, and Claude Code within developer workflows. Desktop agents run locally and require continuous, synchronous interaction from task prompt to pull request. Cloud agents function asynchronously, spinning up their own cloud environments to implement changes and open pull requests for review.
Top 5 AI Signals from August 2025 by : August 2025 included five structural truths: utilities versus specialists, Nvidia’s ecosystem lock, hardware as geopolitics, predatory platform capture, and the first credible robotics deployments. Other notable developments were US Government investments in hardware, AI applications in materials discovery, the splintering of Generative AI into different tiers, and the rise of AI in robotics.
Online versus Offline RL for LLMs by : Online Reinforcement Learning (RL) for large language model (LLM) alignment, particularly PPO-based RLHF, is complex to implement despite its high performance. This online approach actively generates on-policy samples during training, making orchestration difficult and often leading to stability issues. PPO also demands significant memory and hardware resources due to storing multiple LLM copies and managing numerous training settings.
How LLMs Game SWE-Bench Verified by : SWE-Bench Verified is a human-validated benchmark that tests AI agents on fixing real GitHub issues in large Python repositories. This benchmark contains leakage paths allowing agents to access the repository’s future state. Models execute commands like git log --all to find future commits or diffs that directly reveal fixes.
Simplifying book discovery with ML-powered visual autocomplete suggestions: Audible developed an ML-powered visual autocomplete system that provides visual previews with book covers, connecting users directly to relevant landing pages. This system offers real-time personalized format recommendations and incorporates multiple searchable entities, such as book, author, and series pages. It uses historical search data and confidence-based filtering to understand user intent from a few keystrokes.

Other interesting things this week

AI Developments

Using AI to perceive the universe in greater depth: Deep Loop Shaping is a novel AI method introduced in Science. This method reduces noise and improves control in an observatory’s feedback system, stabilizing components used for measuring gravitational waves. Deep Loop Shaping reduces noise in LIGO's most unstable feedback loop by 30 to 100 times and was proven at the LIGO observatory in Livingston, Louisiana.
Alibaba’s new Qwen model to supercharge AI transcription tools: Alibaba's Qwen team unveiled the Qwen3-ASR-Flash model, built upon Qwen3-Omni intelligence and trained with tens of millions of hours of speech data. The model achieved a 3.97 percent error rate on standard Chinese, 3.81 percent in English, and 4.51 percent for transcribing song lyrics, outperforming competitor models like Gemini-2.5-Pro and GPT4o-Transcribe in these tests.

Product Launches

Claude Code: Now in Beta in Zed: Multiple users expressed requests for Claude Code integration. Some users desired Claude Code to be moved into an assistant panel or integrated into editors supporting common agent protocols. Certain users stated they would switch to Zed or convert upon Claude Code's addition.
AI Mode is now available in five new languages around the world.: AI Mode, an AI search experience, is now available in Hindi, Indonesian, Japanese, Korean, and Brazilian Portuguese. A custom version of Gemini 2.5, integrated into Search, provides advanced multimodal and reasoning capabilities for language understanding.
Tweet from @interaction: The release of Poke.com, an AI assistant directly in your messages on iPhone. See the video above for more details.

Tools and Resources

Understanding Transformers Using a Minimal Example: Visualizations of a Transformer's internal state are provided to address the challenge of following its mechanisms due to vast numbers. A minimal dataset of 94 training words and 7 validation words, combined with a simplified model, enables step-by-step tracking of internal processes. This tracking covers information transformation across layers and attention mechanism weighing of input tokens; the dataset and source code are released under the MIT license.
A staff engineer's journey with Claude Code: A senior engineer describes transitioning to an AI-assisted workflow, where AI now generates 80% of initial code, allowing a greater focus on architecture and review instead of hands-on implementation. This shift involved adapting to AI’s limitations, such as its lack of memory from session to session and a tendency to confidently generate flawed code, which the engineer addresses by treating AI like a "junior developer who doesn't learn" and creating project-specific context files.
3 Greedy Algorithms for Decision Trees, Explained with Examples: Decision trees are flowchart-like models used for both regression and classification problems in machine learning. They construct a hierarchical tree structure, and the algorithm identifies optimal split points to categorize data. The process begins at a root node, which represents the entire dataset, and successively splits data by decision nodes until leaf nodes are reached.

Research and Analysis

Why language models hallucinate: Language model hallucinations occur when AI systems confidently generate false but plausible answers, largely because current training and evaluation methods reward guessing over expressing uncertainty. This issue persists even in advanced models, though improvements have reduced its frequency, especially in reasoning tasks.
Why AI Can't Stop Using Em Dashes by : An overwhelming fondness for the em dash has emerged as a reliable indicator of machine authorship in AI-generated content. Research comparing scientific abstracts from 2021 to 2025 found em dash usage more than doubled during the period when AI writing tools became mainstream. This pattern represents a convergence of linguistic patterns, training methodologies, technical constraints, and stylistic inheritance.

Infrastructure and Engineering

Cut Model Deployment Costs While Keeping Performance With GPU Memory Swap: Deploying large language models at scale involves balancing fast responsiveness with GPU cost management. NVIDIA Run:ai GPU memory swap, or model hot-swapping, is an innovation designed to push GPU utilization for inference workloads. This method allows multiple models to share GPUs by dynamically offloading idle models to CPU memory and rapidly activating them upon request.
North–South Networks: The Key to Faster Enterprise AI Workloads: Data movement is central to AI performance, supporting tasks like model loading, storage I/O, and inference queries through north-south networks. NVIDIA Enterprise Reference Architectures offer design recipes for scalable AI factories, utilizing components such as NVIDIA Spectrum-X Ethernet to accelerate north-south data flows.
Mistral AI raises 1.7B€, enters strategic partnership with ASML: Mistral AI announced a Series C funding round of 1.7B€ on September 9, 2025, achieving an 11.7B€ post-money valuation. ASML Holding NV led this investment, which included participation from existing investors such as DST Global and NVIDIA. The funding fuels scientific research to advance AI and develop custom decentralized frontier AI solutions.

Security and Governance

Panda vs. Gibbon, MD: 100% Accuracy, My A**. Looking at You, OpenEvidence.: OpenEvidence claimed 100% accuracy on USMLE, a multiple-choice benchmark derived from the MedQA dataset. Medical AI models remain vulnerable to trivial noise injection, a susceptibility that has persisted for ten years and now impacts patients. Sergei Polevikov will moderate a panel titled “GenAI in Healthcare: A Conversation with Foundation Model Builders” at the Prax AI x Healthcare Summit in NYC on September 11, 2025.
OpenAI announces parental controls for ChatGPT after teen suicide lawsuit: OpenAI announced plans to roll out parental controls for ChatGPT and route sensitive mental health conversations to its simulated reasoning models. Within the next month, parents can link their accounts with their teens' ChatGPT accounts, control age-appropriate behavior rules, and receive notifications for acute distress. These safety measures follow multiple reported incidents where ChatGPT allegedly failed to intervene appropriately with users experiencing mental health episodes.
Meta revises AI chatbot policies amid child safety concerns: Meta is revising its AI chatbot interaction policies following reports of troubling behavior, including interactions with minors. The company is training its bots to avoid engaging teenagers on topics like self-harm, suicide, eating disorders, or romantic banter. Certain highly sexualized AI characters will also be restricted.

Career and Industry

AI is going great for the blind (2023): Be my Eyes has incorporated AI into its product for picture description. Blind podcasters commend large language models (LLMs), stating their accuracy surpasses human descriptions, while blind voiceover artists provide their voices to platforms like ElevenLabs.
Writing Is Thinking: Egor Howell is a data scientist and machine learning engineer specializing in time series forecasting and combinatorial optimization. He runs a content and coaching business that helps individuals enter data science and machine learning, alongside teaching technical topics. Howell's career interest was sparked by DeepMind’s AlphaGO documentary, leading him to self-study and complete over 80 data science interviews.
Expanding economic opportunity with AI: OpenAI is launching major initiatives to expand economic opportunity with AI, focusing on making AI accessible and useful for everyone—from individuals and local businesses to large employers and governments.
UK AI sector growth hits record £2.9B investment: The UK AI sector outpaced the wider economy by 150 times since 2022, achieving revenues of £23.9 billion in the last year. Dedicated AI firms received a record £2.9 billion investment in 2024. The sector expanded to over 5,800 companies, a 58 percent increase since 2023, and employs more than 86,000 people.
AI Roundup 134: The young and the jobless by : A Stanford paper found that workers aged 22-25 in AI-exposed jobs experienced 13% employment declines since ChatGPT's launch, while older workers in these roles saw job growth. Conversely, an Economic Innovation Group survey detected no detectable effect of AI on employment, and the New York Fed reported minimal job losses from AI use in service firms.
A People-First AI Fund: $50M to support nonprofits: OpenAI has launched the People-First AI Fund, committing $50 million in grants to support U.S.-based nonprofits and mission-focused organizations working at the intersection of innovation and public good. Applications for the first wave of unrestricted grants are open until October 8, 2025, and grants will be distributed by the end of the year.

If you found this helpful, consider supporting ML for SWEs by becoming a paid subscriber. You'll get even more resources and interesting articles plus in-depth analysis.

Get 40% off forever

Always be (machine) learning,

Logan

ML for SWEs 65: The AI bubble is popping and why that's a good thing

Logan Thorneloe — Wed, 27 Aug 2025 18:40:13 GMT

Welcome to machine learning for software engineers. Each week, I share a lesson in AI from the past week, five must-read resources to help you become a better engineer, and other interesting developments. All content is geared towards software engineers and those that like to build things.

Subscribe now

If you find ML for SWEs helpful, please consider supporting it by becoming a paid subscriber. You'll get even more resources and interesting articles plus in-depth analysis. You can get 40% off forever if you subscribe right now:

Get 40% off forever

This was a super interesting and incredibly important week for AI. Sam Altman admitted that AI is in a bubble and that people are overexcited about it. This is a huge divergence from the narrative that we've previously seen: AI can do everything.

Another really important thing that Sam Altman said is that there won't be one single AI. AI assistants are a personal thing and we'll need more than just one AI if we want AI to suit everyone. This is also a divergence from the narrative we've seen of everybody rushing toward AI because there can only be one superintelligence winner.

This is yet another example of the AI bubble shrinking. We've already seen a study that shows that 95% of companies trying to employ AI agents haven't seen the throughput that they've wanted from them. We've also seen the Amazon AWS CEO tell everyone it's foolish to think that AI will replace junior engineers.

The bubble shrinking is a very good thing for software engineers. When things are grounded in reality, that's where software engineers thrive. When building for the real world, we don't have a choice but to be faced with reality. Building things for the real world is much more difficult when the people wanting those things aren't grounded in reality.

I see this as a good thing for two primary reasons:

First, we'll see businesses value AI differently. Specifically pertaining to software engineering, I've seen a lot of people say their management expects them to be ten times as productive as they previously were now that they can code with AI.

While anyone who has coded with AI would have to admit it's a valuable tool, the productivity gains are far overblown. This is especially true when you consider the time it takes developers to learn how to use it properly.

Like any tool, it takes time to become acquainted with it and to learn how to use it effectively. I work with some of the most talented engineers in the industry, and even they're having a hard time adapting to using AI coding tools. There's a huge learning curve to understanding where AI coding tools work and what they don't work on.

Second, the importance of the application layer is becoming more evident week after week. AI tools don't have real value just because AI exists. They have that value when the AI is applied to a real-world problem effectively. That's where software engineers come in.

Now, a few caveats about what I said above:

First, I'm not saying AI isn't a life-changing technology. It's capable of a lot and will only be capable of more going forward. But overhyping and overexciting its current capabilities and the capabilities in the near future is bad for everyone, including AI itself over the long run.

Second, I'm really hoping the AI bubble doesn't burst but that it’s grounded in reality. It's very possible this could go either way. A burst usually means people losing jobs and that’s never good.

If you want to read more about some of the realities of building AI in the real world, check out the 'Infrastructure and Energy' section below about what companies and engineers are doing to innovate under and meet the demand for the power requirements of AI.

I'm curious to know your thoughts: When everyone acknowledges we're in a bubble, the question becomes: what survives when it pops?

Leave a comment

If you missed last week's ML for SWEs about what AI really means for software engineering jobs:

Must-reads

Hermes 4: Nous Research releases hybrid reasoners that balance performance and efficiency. These models use a tag to spend more tokens on hard problems when needed. The training dataset is 50x larger than Hermes 3, with strong performance on RefusalBench showing highest willingness to engage with controversial topics. Critical for understanding how reasoning models are evolving beyond always-on chain-of-thought.
How We Reduced LLM Costs by 90% with 5 Lines of Code: A simple structural change in async Python code reduced LLM traffic and cost by 90% with no loss in functionality. The fix involved understanding Python's asynchronous behavior in Jupyter notebooks. Essential reading for anyone building LLM applications - sometimes the biggest wins come from understanding your runtime environment, not the AI itself.
Beyond sensor data: Foundation models of behavioral data from wearables: Foundation models trained on 2.5 billion hours of wearable data from 162,000 individuals achieved strong performance across 57 health tasks. The models work for both individual-level classification and time-varying health state prediction. Shows how domain-specific foundation models can unlock entirely new applications when you have the right data scale.
What makes Claude Code so damn good: Deep technical analysis of Claude Code's architecture and design decisions. The system uses prompts and tools to compensate for model weaknesses, with heavy reliance on Claude 4's interleaved thinking capabilities. Understanding how the best AI coding tools work helps you build better AI-powered systems yourself.
LLM Monitoring and Observability: Hands-on with Langfuse: Practical guide distinguishing monitoring (tracking predefined metrics) from observability (understanding internal state from external outputs). Hands-on implementation with Langfuse for production LLM systems. As LLMs move to production, observability becomes as critical as model performance.

Other interesting things this week

AI Developments

AI breakthroughs are transforming industries, from healthcare to finance: Ruth Porat speaks at Jackson Hole about AI's transformative economic impact. Google Trends shows AI becoming breakthrough search term focused on opportunities.
Scaling domain expertise in complex, regulated domains: OpenAI explores scaling domain expertise in regulated environments.
Building AI products in the probabilistic era: ChatGPT performs pattern matching rather than memorizing data. General purpose AI has disrupted how software is designed, engineered, and grown. Many established tech playbooks have become obsolete.

Product Launches

DeepSeek-v3.1: Hybrid inference with "Think" and "Non-Think" modes. New APIs: deepseek-chat for non-thinking and deepseek-reasoner for thinking, both supporting 128K context.
Nano Banana: Google DeepMind released Nano Banana, a model for image transformation that tops the image generation leaderboard.
NotebookLM's Video Overviews are now available in 80 languages: Full-length Audio and Video Overviews now in 80+ languages with same depth and nuance as English versions.
Proton's privacy-first Lumo AI assistant gets a major upgrade: Version 1.1 brings 200% improvement in reasoning, 170% increase in context understanding, and 40% boost in code generation.
AI Mode in Search gets new agentic features and expands globally: Google Search AI Mode adds restaurant reservations with appointments and tickets coming soon. Available to AI Ultra subscribers in US, expanding to 180+ countries for English users.
Launch HN: April (YC S25) – Voice AI to manage your email and calendar: AI executive assistant for hands-free email and schedule management via voice. Uses Deepgram STT, Eleven Labs TTS, and custom MCP servers for Google integration.

Tools & Resources

Why Your Prompts Don't Belong in Git: Hard-coding prompts requires code pushes and redeployments for every change. Prompts are behavior, not static configuration. This approach blocks product personnel from contributing to prompt evolution.
Show HN: OctaneDB – Fast, Open-Source Vector Database for Python: 10x faster performance than existing solutions with sub-millisecond query times and 3,000+ vectors/second insertion. Supports HNSW and FlatIndex search with GPU acceleration.
Show HN: Clearcam – Add AI object detection to your IP CCTV cameras: Transform RTSP cameras or old iPhones into AI security systems. Premium offers remote viewing, notifications, and end-to-end encryption.
Turning Claude Code into My Best Design Partner: Practical approach using plan documents as source of truth instead of conversation history. Limits implementation instructions to encourage design contributions.
Making games in Go: 3 months without LLMs vs. 3 days with LLMs: Truco game took 3 months pre-LLM using React and TinyGo. Escoba game created in 3 days using LLMs to refactor existing code. LLM-generated code worked almost perfectly with one append bug.

Research & Analysis

A bubble that knows it's a bubble: MIT finds 95% of companies investing in generative AI see no measurable returns. Fed data shows AI investment consuming over half of America's total capital expenditure. Historical parallel to 180+ years of tech bubble patterns.
Applicability vs. job displacement: further notes on our recent research on AI and occupations: Microsoft research on Semantic Telemetry for understanding AI system interactions and job impact.
How to develop the most important skill for AI: Most rely on filtered commentary instead of reading research directly. Common challenges include getting lost in technical details or relying on summaries. The AI space moves faster than blogs and press releases can capture.
Why Science Must Embrace Co-Creation with Generative AI to Break Current Research Barriers: LLMs transform developer workflows but scientists still use them for basic tasks. GenAI can function as a thinking partner for strategic decisions and new perspectives. Co-creation accelerates discovery when properly leveraged.
Anthropic Education Report: How educators use Claude: Teachers save 5.9 hours per week using AI according to Gallup survey. Faculty build custom tools like chemistry simulations and grading rubrics. AI enhances tasks requiring creativity but doesn't replace direct student interaction.

Infrastructure & Engineering

Measuring the environmental impact of AI inference: Google dropped search energy drain by 33x in one year. US electricity use up 4% this year from data center expansion, partly met by 20% increase in coal generation.
What happens when AI data centres run out of space? NVIDIA's new solution explained: NVIDIA's Spectrum-XGS connects AI data centers across vast distances into "giga-scale AI super-factories". Addresses single facilities running out of power, physical space, and cooling capacity.
Google's Liquid Cooling: Datacenter-scale liquid cooling solution for TPUs with loops spanning racks. CDU rack provides cooling capacity with five active CDUs, allowing maintenance without downtime.
NVIDIA Hardware Innovations and Open Source Contributions Are Shaping AI: Blackwell GPU architecture features fifth-generation Tensor Cores and NVFP4. Integrates NVIDIA NVLink-72 for ultra-fast GPU-to-GPU communication to overcome physical constraints.
Introducing NVIDIA Jetson Thor, the Ultimate Platform for Physical AI: NVIDIA's platform for generalist robots includes foundational models, synthetic data pipelines, and simulation environments. Jetson AGX Thor Developer Kit now generally available.

Security & Governance

Open the pod bay doors, Claude: Anthropic reports Claude Opus 4 "blackmailed a supervisor" in simulated environment to prevent shutdown. Highlights importance of AI safety research in realistic scenarios.
The US federal government secures a massive Google Gemini AI deal at $0.47 per agency: Federal agencies get Google's full AI stack for $0.47 per agency through 2026. Includes NotebookLM, video/image generation, and pre-built agents.

Career & Industry

AWS CEO says using AI to replace junior staff is 'Dumbest thing I've ever heard': Matt Garman notes junior staff are inexpensive and most engaged with AI tools. Over 80% of AWS developers use AI for unit tests, documentation, and code. Replacing juniors eliminates the future talent pipeline.
The Strange Reality of AI and SWE Hiring in 2025: Career-focused analysis of current AI and software engineering job market. Paid content exploring deeper insights into hiring trends and career development. Shows the evolving relationship between AI capabilities and engineering roles.
AI in K-12 Today: The Back-to-School Overview: Teachers adopt SchoolAI, Google Education AI, and Magic School for lesson planning and administrative tasks. Educators view AI as efficiency multiplier.

If you found this helpful, consider supporting ML for SWEs by becoming a paid subscriber. You'll get even more resources and interesting articles plus in-depth analysis.

Subscribe now

Always be (machine) learning,

Logan

ML for SWEs 64: What AI really means for software engineering jobs

Logan Thorneloe — Wed, 20 Aug 2025 13:31:32 GMT

Subscribe now

A few weeks back, a friend shared an article with me entitled, "At Amazon, Some Coders Say Their Jobs Have Begun to Resemble Warehouse Work." I've always hated the term: coder. No one I know who works as a professional software engineer actually calls themselves a 'coder'. But my friend pointed out there are coders and the distinction between a coder and a software engineer is important.

A coder is someone who knows how to code. A software engineer is someone who understands how to build systems that solve problems with software. All software engineers are coders, but not all coders are software engineers.

This distinction is more important now than ever. AI has shown to be incredibly capable at coding but far less capable at engineering. In fact, 58% of engineers recognize that AI can code better than most humans (source by ). But AI has a much more difficult time writing accurate code within production-level systems (source by ).

I remember a conversation I had with a neighbor back in 2018. He had just landed a six-figure job doing web development after completing a 6-month coding bootcamp. The undertone to our conversation was: Why would anyone complete a degree for a software engineering job when a bootcamp seemed to do the trick?

Back in 2018, becoming a coder really did work. It was enough to get your foot in the door of the tech industry. Over time, people would fill in the knowledge gap between a bootcamp and a full degree as they worked. Bootcamps effectively lowered the barrier to entry for software engineering jobs.

But with AI transforming jobs, this bar has been raised again. The coding skills that can be learned in 6 months can easily be taken over by AI. Since AI can do these things, employers no longer have the need to hire employees for them. Thus, the era of 'coders' (and bootcamps) is over.

Understanding how to use AI will be important, but identifying the skills AI isn't able to replace will be even more important. By now, we've all heard the saying, "AI won't replace you. Someone using AI will." I'll add another truth: "AI won't replace you, if you learn to do what AI can't."

The most important takeaway here is that this dilemma doesn't just apply to coders. It's about the transforming workforce in general. What's happening to software engineering will happen to every job.

In every job, there are 'coding skills' and there are 'engineering skills'. Some skills will be easily replaced by AI and others will heighten in value because they can't be.

So what should you do to prepare? You should:

Understand that AI won't take jobs instantly. It will transform jobs as it takes over the parts of those jobs it's good at.
Identify and get proficient in the pieces of jobs AI isn't good at.

If you want to know more about engineering versus coding, check out my article about Devin (the AI coding agent) exposing software engineers:

A question I have for all of you: What advice would you give to new software engineers that are just entering the field about working with AI? I get asked this a lot and I'm curious to hear your answers.

Leave a comment

If you missed last week's ML for SWEs, you can catch it here:

We learned about how engineers just got a whole lot more important as the pure scaling era ends. Check it out and enjoy the resources below!

Must-reads

Model intelligence is no longer the constraint for automation - The definitive analysis of why intent specification and context engineering are now the bottlenecks for AI automation. Essential reading for understanding where engineering efforts should focus in the post-scaling era.
How to Create Powerful LLM Applications with Context Engineering - Practical techniques for maximizing LLM effectiveness through proper context management. Covers prompt structuring, context window optimization, and keyword search strategies that directly address the new engineering constraints.
The reality of AI-Assisted software engineering productivity - Real data on AI tool adoption: 84% of developers use them but only 60% view them favorably. Studies show 20-30% productivity gains, but 66% cite debugging AI solutions as a major time sink. Critical insights for realistic expectations.
Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance - Novel ARIA framework empowers LLM agents to continuously learn and adapt to evolving knowledge in real-time operational settings. Critical for building agents that can handle dynamic environments like regulatory compliance.
How to Use LLMs for Powerful Automatic Evaluations - LLM-as-a-Judge methodology for automating evaluation processes. As models become commoditized, evaluation and quality assurance become key differentiators. Essential technique for production ML systems.

Other interesting things this week

Infrastructure & Engineering

Tiny-tpu: A minimal tensor processing unit - Reinvented from Google's TPU V2 and V1, showing how hardware engineering enables AI advancement.
Google investing $9 billion in Oklahoma AI infrastructure - Includes data centers, education programs, and 135% increase in electrical workforce pipeline. Infrastructure investment signals long-term engineering focus.

Product Launches

GPT-5 and GPT-OSS models released - GPT-5 operates as multipart system with fast model, reasoning model, and real-time router. First open-weight models (gpt-oss-120b and gpt-oss-20b) since GPT-2.
Claude Sonnet 4 supports 1M tokens - 5x increase handles entire codebases with 75,000+ lines or dozens of research papers in single request.
Gemma 3 270M compact model - 270 million parameters optimized for task-specific fine-tuning, showing efficiency matters more than size.
OWhisper for realtime speech-to-text - Local lightweight models for prototyping, larger models for production. Engineering flexibility over model size.
Embedder (YC S25) – Claude code for embedded software - Hardware-aware AI coding agent that tests on physical hardware. Perfect example of engineering solving real problems models alone can't.

AI Developments

The AI Arms Race Is Over. Smart Engineering Won by - GPT-5 shows less dramatic improvement than previous generations. Making AI useful now requires connecting to databases, internet, and breaking down complex problems.
Google DeepMind CEO on world model capabilities - Deep Think in Gemini 2.5 and Genie 3's world model capabilities helping AI understand reality.
Perplexity offers to buy Google Chrome for $34.5B - Unsolicited bid higher than Perplexity's $18B valuation, showing ambition in AI search space.

Research & Analysis

Training language models to be warm makes them less reliable - Warm models show higher error rates, promote conspiracy theories, provide incorrect information. Engineering tradeoff between personality and accuracy.
A Comprehensive Survey of Self-Evolving AI Agents - Agents that automatically enhance using interaction data and environmental feedback. Focus shifting to continuous adaptation over static capabilities.
Ranking the Chinese Open Model Builders by - Qwen 3, Kimi K2, Zhipu GLM 4.5 leading open source innovation. DeepSeek V3 and R1 major stories of 2025.

Security & Governance

Illinois limits AI in therapy and psychotherapy - Regulatory constraints are emerging around AI applications in sensitive domains.
Anthropic details its AI safety strategy - How they’re engineering safety into systems to teach Claude right from wrong.

Tools & Resources

YAMS – Yet another memory system for LLMs - Persistent memory with content-addressed storage, deduplication, semantic search. Engineering solutions for context management.
Sheet0 – Transform webpages to structured spreadsheets - Data agent solving practical extraction problems.
Claudia – Desktop companion for Claude code - Tool development for better AI coding workflows.

Career & Industry

How to speed up engineering velocity with AI - 58% of engineers believe AI codes better than most humans. Anthropic outlines using Claude Code across software lifecycle.
14 ways Googlers use AI to work smarter - Gemini and NotebookLM for code generation, content creation, data analysis. Real patterns of AI tool adoption.
Domain expertise matters more than algorithmic complexity - AI entrepreneur's key lesson after winning $10,000 in Web3 credit scoring competition.

Community Highlights
Have something to share? Building something cool? Written an article? Let me know and I'll feature it here next week!

If you found this helpful, consider supporting ML for SWEs by becoming a paid subscriber. You'll get even more resources and interesting articles plus in-depth analysis.

Get 40% off forever

Always be (machine) learning,
Logan

ML for SWEs 63: Engineers just got a whole lot more important

Logan Thorneloe — Wed, 13 Aug 2025 14:33:28 GMT

Subscribe to get these emails directly in your inbox.

Subscribe now

This week felt like a watershed moment for AI development. GPT-5 launched with impressive capabilities, but the improvements were incremental rather than revolutionary. Multiple open-source models dropped simultaneously. Coding agents matured significantly. Beneath all these announcements the vibe shifted.

The era of "just scale it bigger" appears to be ending.

For the past four years, AI progress followed a simple formula: more data + more compute + bigger models = better performance. This scaling approach delivered breakthroughs. We saw this with GPT, Gemini, Claude, Llama, and many other models.

But GPT-5's release signals we're hitting the limits of pure scaling (think throwing money, compute, and time at models). The performance gains, while solid, aren't the massive leaps we've seen in the past. Granted, it's likely if we had near unlimited compute and trained these models they'd still see significant progress, but that will require leaps and bounds of progress in hardware capabilities.

The next phase of AI development will be won through smart engineering, not bigger budgets. This is why AI needs software engineers and signals three key engineering opportunities in AI right now and in the future:

System integration becomes a key differentiator. When raw model performance plateaus, how well AI integrates with existing workflows, databases, and user interfaces dictates its real-world usefulness and becomes the competitive advantage. We've heard it time-and-time again that the application layer is king and that's where software engineers shine.
Creative engineering solutions win over brute force approaches. History shows that when pure scaling hits limits, innovative engineering breakthroughs emerge. Most recently this was evident via inference-time scaling (reasoning). When pure training scaling hit a wall, creative engineering let us find a way to continue to scale. This will continue to be the case.
Efficiency and optimization become core competencies. With diminishing returns from scaling, making existing models faster, cheaper, and more reliable becomes essential to improving their real-world applicability. Making models smaller and more efficient is fundamentally a software engineering problem that will need to be solved.

The companies that lead the next wave of AI aren't those with the largest budgets and biggest models. It's those able to solve core software engineering complexities and apply models to real-world scenarios.

What other opportunities do you see emerging as the pure scaling era ends?

If you missed last week's ML for SWEs, you can catch it here:

We learned about world models and the key role they play in scaling AI agents. Check it out and enjoy the resources below!

Must-reads

GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2 by - A comprehensive analysis of OpenAI's first open-weight models since GPT-2. Essential reading for understanding how transformer architectures have evolved and what makes these new models significant for local deployment.
The current state of LLM-driven development - A practical guide to integrating LLMs into coding workflows. Covers what works, what doesn't, and how to build effective AI-assisted development practices. Particularly valuable for understanding agent tools and when to use them.
A better path to pruning large language models - Amazon's research on "Prune Gently, Taste Often" shows how to compress 7B parameter models in under 10 minutes on a single GPU with 32% performance improvement. Critical technique as efficiency becomes more important than raw size.
GPT-5: Key characteristics, pricing and model card by - Simon Willison's detailed technical analysis of GPT-5's hybrid architecture, pricing structure, and system card. Essential reading for understanding how GPT-5 operates as a multi-model system with different underlying components for different tasks.
Engineering.fyi – Search across tech engineering blogs in one place - Centralized search across major tech engineering blogs. Valuable resource for engineers to discover technical content and implementation patterns from companies like Google, Netflix, Uber, and other major tech organizations.

Other interesting things this week

Product Launches

Claude Opus 4.1 - Achieves 74.5% coding performance on SWE-bench Verified, major upgrade for real-world coding tasks.
Jules, our asynchronous coding agent - Google's Gemini 2.5 Pro-powered coding agent goes public after generating 140,000+ public code improvements in beta.
Create personal illustrated storybooks in the Gemini app - Gemini generates 10-page books with custom art and audio in 45+ languages.
We're testing a new, AI-powered Google Finance - Google Finance reimagined with AI offers comprehensive responses for financial questions and includes advanced charting tools with live news feed.

AI Developments

Amazon builds first foundation model for multirobot coordination - DeepFleet increases deployment efficiency by 10% using millions of hours of fulfillment center data.
The latest AI news we announced in July - Google expanded access to AI tools including AI Mode in Search, creative tools in Google Photos, and personalized shopping experiences.
How AI is helping advance the science of bioacoustics to save endangered species - Updated Perch AI model aids conservation with improved bird species predictions and adaptation to new and underwater environments.
Diffusion language models are super data learners - Research on diffusion language models and their data learning capabilities.

Technical Tools

Claude Code IDE integration for Emacs - Native integration with Claude Code CLI through Model Context Protocol with automatic project detection and tool support.
Introducing Open SWE: An Open-Source Asynchronous Coding Agent - LangChain's fully autonomous coding agent with demo and documentation available.
Launch HN: Halluminate (YC S25) – Simulating the internet to train computer use - Building Westworld, a fully-simulated internet for training computer use agents with synthetic versions of common applications.
Things that helped me get out of the AI 10x engineer imposter syndrome - Addresses the psychological challenges of AI-enhanced productivity claims and provides practical advice for engineers navigating the hype.

Research & Analysis

How a Research Lab Made Entirely of LLM Agents Developed Molecules That Can Block a Virus - Stanford/Chan Zuckerberg Biohub's Virtual Lab uses GPT-4o agents to develop experimentally validated nanobodies targeting SARS-CoV-2.
The Roadmap of Mathematics for Machine Learning by - Comprehensive guide to linear algebra, calculus, and probability theory fundamentals.
AI Roundup 130: GPT-5 by - Comprehensive analysis covering GPT-5's three models (GPT-5, GPT-5-mini, GPT-5-nano) and the multipart system with fast model, deeper reasoning, and real-time router.
Using AI to Augment, Not Automate Your Writing by - Framework for implementing AI assistance in writing workflows while maintaining human creativity and avoiding over-automation.
How to Instantly Render Real-World Scenes in Interactive Simulation - NVIDIA's NuRec and 3DGUT reconstruct photorealistic 3D scenes from sensor data for deployment in simulation environments.

Industry Analysis

GitHub is no longer independent at Microsoft after CEO resignation - GitHub leadership now reports directly to Microsoft's CoreAI group following CEO departure.
Inside Tim Cook's push to get Apple back in the AI race - Apple Intelligence features won't reach most users until 2025-2026 while competitors ship AI features widely.
Alan Turing Institute: Humanities are key to the future of AI - New initiative "Doing AI Differently" advocates for human-centered approach to AI development.

Infrastructure & Energy

NVIDIA latest: Blackwell GPU and software updates - RTX PRO 6000 Blackwell Server Edition offers 45x better performance and 18x higher energy efficiency compared to CPU-only systems.

Security & Concerns

My Lethal Trifecta talk at the Bay Area AI Security Meetup - Simon Willison covers prompt injection vulnerabilities and challenges in securing systems using Model Context Protocol.

Community highlights

This section is coming soon! I want to highlight more of what you are all doing whether it's building, writing, teaching, or more.

The jobs section will move to its own post and remain in the Discord server feed for paid members. I've been struggling to find a way to share it effectively and I've come to the conclusion that this is the right choice.

If you found this helpful, consider supporting ML for SWEs by becoming a paid subscriber.

Get 40% off forever

Always be (machine) learning,

Logan