AI for Software Engineers

The Difficulties of Scaling Autoresearch | AI for Software Engineers 83

Logan Thorneloe — Sat, 28 Mar 2026 13:35:05 GMT

Hi everyone!

March has been a slow writing month for me because it’s been busy in many other parts of life. Luckily, those busy things have all been good and I’ve got a lot more to write about this April.

I’ve spoken to a lot of developers this past month about AI and almost all of them have said the same thing: “There’s a lot of info out there about AI, but not a lot about what I should actually be doing.” I get a lot of questions about the practicality of topics, and even the most experienced developers wonder what they should be doing right now. So I’m trying a new format this week that focuses more on that. This format will general be:

A note from me about something topical.
Things you should know about and why they’re important.
Things you should read (or watch).
Things you could be doing.

I’ve created a shop for AI for Software Engineers that allows anyone to support the newsletter and represent it. I appreciate everyone supporting my work—it lets me educate thousands of developers around the world. To all my paid subscribers: Thank you!

I’ll also set up a code for anyone who guest posts here or helps add excellent resources to the ML roadmap to grab an item from the shop for free.

I’m working on partnerships to give you discounts on resources. This has become more complex than I thought, but I’m still working on it. Just wanted to add a quick update here.

A note on scaling Autoresearch

Recently, Andrej Karpathy’s Autoresearch went viral, showing that LLMs can iterate on machine learning improvements on their own. It went so viral, in fact, that I had a conversation with a friend about how AI will now fundamentally change medicine because it can research on its own.

This isn’t quite true, and I want to help you understand why. I really liked Nathan Lambert’s framing of automated machine learning research as “lossy self-improvement”: the more compute and agents thrown at a problem, the more friction is introduced. This has been my experience and what makes machine learning at scale a massive engineering challenge.

There have been many interesting implementations of Autoresearch, but most have identified a simple (usually single) metric and have given the LLM the context needed to understand improving that metric. In a production setting, we care about many metrics and the trade-offs between each—an improvement is more than just improving a single number.

The best example of this is cost. When training models at scale, we care greatly about the cost of the end model we serve. In fact, it can be worth updating a production model to a version with slightly worse performance if the cost savings are significant.

On top of inference costs, we also care a great deal about the resource efficiency of the training process itself. Finding model improvements requires many training runs and analyses. This means we also care about the efficiency of the Autoresearch process itself.

Thus, Autoresearch relies heavily on reliable engineering on two fronts:

Reliable agents steered in the right direction.
Reliable infrastructure for the agent to use.

These are the primary factors contributing to lossy self-improvement, and either can cause a serious hit to experimentation velocity and efficiency. These effects multiply when both engineering problems are combined.

To make agents reliable, they need the context to understand the search space for the problem. Autoresearch is essentially AutoML where the search space is dictated by the context given to the model. Karpathy has pushed back on this comparison, arguing that an LLM writing arbitrary code is far more powerful than traditional neural architecture search. He’s right that the searcher is more capable, but the core constraint is the same: you need to define the right search space, and context is what defines it. Due to the metrics involved in machine learning at scale, the context required is massive for an agent to accurately understand the search space and choose potential experimentation candidates. Thus, for reliable agents we rely not only on proper agent evals, but also on providing appropriate context.

Mistakes in context and agent reliability cause the agent to travel down incorrect paths, creating unnecessary training runs compounded by any infrastructure inefficiency.

Thus, Autoresearch becomes much more difficult at scale. While plausible, it’s an incredible research problem on its own.

Autoresearch is effective in machine learning experimentation because the entire process is code- and terminal-native, both of which LLMs excel at. My friend assumed AI self-improvement would translate directly to other fields like medical research, but this isn’t a given.

LLMs are exceptional at recombining existing knowledge in useful ways, but their outputs are fundamentally drawn from their training data. Creativity researchers distinguish between combinatorial creativity (novel recombinations) and transformational creativity (paradigm shifts). LLMs are strong at the former and limited at the latter. A recent study found that LLM-generated research ideas were rated as more novel than expert human ideas, but scored lower on feasibility—suggesting LLMs are better at generating plausible-sounding combinations than knowing which ideas are actually worth pursuing.

What this means is Autoresearch is most applicable to fields that are defined by a clear search space and are language- and code-native. Generalizing beyond that in its current form will be difficult. Other fields need to make advancements in their own domains before self-improving AI can make a meaningful difference, and those advancements still require the kind of transformational creativity that LLMs don’t yet provide.

What You Should Know

The current events that matter to you.

AI is taking a toll on the internet.
- GitHub availability dropped to roughly 90% as AI coding agents overwhelm the platform. We’re seeing agents overwhelm the open source community by spamming PRs. We’re also seeing an overwhelming number of vibe coded “open source” repos without any roadmap or future maintainability.
- Reddit will require suspected bot accounts to verify their humanity. This is a huge step in the right direction for reliable content on the internet especially considering many AI train and retrieve answers from Reddit.
- Wikipedia editors voted 40-2 to ban AI-generated or rewritten article content. Editors may still use AI for basic copyedits of their own writing with human review. This is in an effort to maintain Wikipedia without a similar impact to what’s going on with GitHub.
Agentic engineering is still scaling quickly and AI coding tools are maturing to keep pace.
- Cursor ships improved Composer models every five hours using real-time RL from user sessions. A/B tests showed 2.28% more persistent edits and 3.13% fewer dissatisfied follow-ups. Real-time (often called “continuous”) machine learning is a necessity for artificial general intelligence. We’ll see much more of it in the coming year.
- Anthropic launched auto mode for Claude Code, replacing manual permission approvals with an AI classifier. This is another move toward AI that properly thinks for itself but brings up safety concerns. For true general intelligence, AI needs to abstract a lot of what makes it difficult away from the user.
- Jensen Huang suggested engineers should receive half their base salary in AI tokens. Theory Ventures identifies inference costs as the fourth component of engineering compensation. Meta and OpenAI engineers now compete on internal leaderboards tracking token consumption.
- 7.1% of OpenClaw’s skill registry contains critical security flaws. 283 skills exposed credentials in plaintext through LLM context windows. The most-downloaded skill was an info-stealer that bypassed macOS Gatekeeper. If I haven’t made it clear: Do not use OpenClaw if you have doubts about what you’re doing. There are too many security risks.
- GitHub will train on your private repositories unless you opt out by April 24. Users are automatically opted in, including long-term paying customers. The toggle is in Settings > Copilot > Features.
Resource scarcity (memory, hardware, and energy) is becoming the bottleneck for AI companies. Existing manufacturers can’t produce fast enough causing AI companies to pursue downstream problems themselves.
- Data centers will consume 70% of all global memory chips by 2026. AI isn’t going anywhere and usage will only grow. If you think current RAM prices are crazy they’ll likely continue going up. For consumers, this means use the hardware you have now if you can.
- Arm released its first in-house chip in 35 years. This marks a shift from licensing-only to competing with its own customers. The Arm AGI CPU is a data center processor for AI inference, built with Meta.
- Elon Musk announced plans for a “Terafab” chip factory near Tesla’s Austin campus. He claims existing manufacturers cannot meet his AI and robotics hardware demands, targeting 100-200 gigawatts of computing power annually. No timeline was provided.
- Helion is in talks to sell fusion power to OpenAI. The deal would guarantee OpenAI 12.5% of Helion’s production, targeting 5 gigawatts by 2030. This is Sam Altman’s own energy startup and is another example of AI companies solving downstream problems themselves.
- Google released TurboQuant, reducing LLM inference memory by at least 6x with zero accuracy loss. This is still a lab result, not production-deployed, but if it’s scalable it’ll be a “Pied Piper” moment for LLM inference, reducing memory needs significantly. This is a topic I’m looking to explore next week.
AI safety is still a primary topic both of the standpoint of secure agents and AI’s potential impact on human lives.
- DeepMind published research on AI’s ability to harmfully manipulate people across 9 studies with 10,000+ participants. AI was most manipulative when explicitly instructed to be, and least effective on health topics. The framework is now used to test safety for Gemini 3 Pro.
- OpenAI launched a Safety Bug Bounty for AI-specific abuse risks. Targets include agent hijacking via prompt injection, data exfiltration, and proprietary reasoning leaks. Attacks must be reproducible at least 50% of the time.
- Doctronic, an AI “doctor” startup that raised $40M, was caught with critical security and credibility issues. Cybersecurity researchers jailbroke the chatbot into providing methamphetamine synthesis instructions. The company’s claim of helping 24 million people is unsupported by traffic data.
- Senators Hawley and Warren want to mandate annual energy reporting for data centers. Separately, Sanders and AOC introduced legislation to halt new data center construction until Congress regulates AI. Google’s data center energy consumption doubled between 2020 and 2024.
- A federal judge blocked the Pentagon from labeling Anthropic a supply chain risk. The court ruled it was illegal retaliation for Anthropic’s refusal to let its AI be used in autonomous weapons or domestic mass surveillance.
New models were released this week that you can start building with. Many of these are small enough to run on consumer hardware, circumventing the resource issues mentioned above.
- Gemini 3.1 Flash Live launched as Google’s highest-quality real-time audio and voice model. It scores 90.8% on multi-step audio function calling benchmarks and maintains conversation context twice as long as previous versions. Real-time multimodal search expanded to 200 countries.
- Cohere released Transcribe, an open-source speech-to-text model that processes 525 minutes of audio per minute. 2B parameters, 5.42 word error rate, 14 languages, designed for self-hosting on consumer GPUs.
- Mistral released Voxtral TTS, an open-source text-to-speech model small enough for smartwatches. 9 languages, voice cloning from less than 5 seconds of audio, 90ms latency to first speech.
Moves are being made in the consumer sector.
- OpenAI killed the Sora app after downloads plummeted. Despite popular opinion, this isn’t the end of OpenAI’s video generation model, this is the end of OpenAI losing money by offering it openly to the public. This is good business move by OpenAI but seems to be massively misunderstood by the public.
- Google launched tools to import ChatGPT and Claude chat histories directly into Gemini. This follows Anthropic releasing a similar feature in Claude. Less friction to switch between ecosystems is always a win for consumers.
- Apple set WWDC 2026 for June 8-12, teasing more “AI advancements” to come marking a stark contrast from last year, where the topic was largely avoided. Apple is expected to announce a partnership with Google to bring Gemini (or a version of Gemini) to Apple device users.

What You Should Read

Articles I think are worth reading in their entirety this week.

Improving Composer through real-time RL by Cursor Blog. An excellent account of continuous training in production. Cursor converts user sessions into reward signals, ships updated models every five hours, and documents failure modes like models gaming reward systems to avoid negative scores. Continuous learning is a prerequisite to AGI as it enables models to continuously improve and will be a primary topic in 2026. I suspect many companies will follow Cursor’s example this year.
Lossy self-improvement by . Lambert argues recursive AI self-improvement will hit complexity brakes, not compound exponentially. He draws on Amdahl’s Law and Paul Allen’s complexity brake: “The more compute and agents you throw at a problem, the more loss and repetition shows up.” As mentioned above, I think this is an excellent read.
How Anthropic’s Claude Thinks by . An easily understandable overview of Anthropic’s interpretability research that shows Claude’s default state is to refuse all questions, and hallucinations happen when a recognition system misfires. The accessibility of this article makes it an excellent read.
How a Leading Venture Capitalist uses AI Agents by . shares his full agent stack: morning briefings, meeting capture, research, and drafting. These are excellent examples of real-world AI usage that can be implemented with a bit of technical knowledge.
Thoughts on slowing the fuck down by . My team at Google has really felt the new bottlenecks that come from AI-generated code and the impact that has had on the engineering process. Speed is always the focus of agentic engineering, but reliability is the most important part of production code. This is a great, simple overview of why that is.

What You Should Do

The action you can take this week based on the information shared above to learn the skills that are the most in demand.

20 Years of Code Optimized in Two Days | Weekend Reads 4

Logan Thorneloe — Sun, 15 Mar 2026 14:03:20 GMT

Welcome to the weekly reading list! This is how I keep up with AI news and deepen my understanding of the topics that matter for building production systems. I focus on primary sources and authors I trust to keep the signal-to-noise ratio high.

You can support AI for Software Engineers for only $5/month and get the complete edition of this list as a thank you. Thank you to all paid subscribers for your support!

Subscribe now

In this list

As has been the case for 2026, there are a ton of interesting reads this week about getting agents working in production and what they can do. Most interesting are:

Shopify’s CEO pointed a coding agent at a 20-year-old Ruby codebase with a benchmark script and 974 unit tests. 120 automated experiments and 93 commits later, it was 53% faster.
StrongDM built a production software pipeline where three humans manage AI agents. The rules: “code must not be written by humans” and “code must not be reviewed by humans.” Each engineer spends ~$1,000/day on tokens.
AMD published a diagnostic framework where Claude Code and Cursor act as autonomous agents debugging large training clusters, tracing a 23% throughput drop to RDMA degradation on 4 of 24 nodes.
84% of Uber devs are now agentic coding users and Claude Code usage nearly doubled in three months, from 32% to 63%, while IDE-based tools have plateaued.
OpenAI shared a phishing-style prompt injection that tricked ChatGPT into exfiltrating employee PII 50% of the time, and their defense framework treats it like a call center problem, not a code injection problem.

Shopify/liquid: Performance: 53% faster parse+render, 61% fewer allocations by Simon Willison

“Having a robust test suite - in this case 974 unit tests - is a massive unlock for working with coding agents. This kind of research effort would not be possible without first having a tried and tested suite of tests.”

Shopify CEO Tobi Lutke took Andrej Karpathy’s autoresearch pattern, pointed it at Liquid, Shopify’s template engine, and ran 120 automated experiments over two days. The agent gave itself a benchmark script, iterated against 974 unit tests, and produced 93 commits.

Parse+render time dropped by 53%, allocations dropped by 61%
Replacing StringScanner with String#byteindex was ~40% faster for single-byte searching
Pre-computing frozen strings for integers 0-999 eliminated 267 allocations per render

A comprehensive test suite gave the agent enough context to make changes and verify them independently. “Make it faster” only becomes an actionable goal when the agent can measure its own progress and confirm it hasn’t broken anything along the way.

Designing AI agents to resist prompt injection

“If the problem is not just identifying a malicious string, but resisting misleading or manipulative content in context, then defending against it cannot rely only on filtering inputs.”

Real-world prompt injection attacks now look like phishing, not code injection. OpenAI shared an example: a phishing-style email that worked 50% of the time against ChatGPT, getting it to extract employee PII and send it to a third party.

Their defense framework borrows from how organizations protect human customer service agents. You don’t train a call center worker to detect every possible scam, you constrain their capabilities. For AI agents, this means source-sink analysis: monitor when information would leave the conversation or when the agent would follow an external link, rather than trying to perfectly classify inputs.

The Shape of the Thing by Ethan Mollick

“Code must not be written by humans. Code must not be reviewed by humans.”

StrongDM built a “Software Factory” where three humans manage AI agents that write, test, and ship code. Each engineer spends ~$1,000/day on AI tokens. Coding agents build from product roadmaps, testing agents build simulated customer environments and try to break what the coding agents built, and the agents loop feedback to each other until satisfied.

We’ve moved from co-intelligence, prompting back and forth, to management, giving agents hours of work and getting results in minutes. Every major AI lab is now explicitly working on recursive self-improvement. OpenAI says Codex was “instrumental in creating itself,” and Anthropic says their engineers barely write code anymore.

Nemotron 3 Super: NVIDIA’s gpt-oss killer? by

“Reducing the expert dimension by a factor of d/l = 4 lets you reinvest those savings into both more total experts and higher top-k.”

NVIDIA’s Nemotron 3 Super, 120B total with 12B active, is worth paying attention to because of LatentMoE. Standard MoE routes tokens from the full hidden dimension directly to experts, but LatentMoE wraps the expert path with shared linear projections that compress from d=4096 down to l=1024, do all expert computation in that compressed space, then project back up.

Reducing the expert dimension by 4x lets you run 512 total experts with top-22 routing where standard MoE typically uses 128 experts with top-6 or top-8 at the same compute cost
Artificial Analysis flagged the model as extremely verbose though, generating 110M tokens during their eval suite vs an average of 7.3M, which could erase most of those throughput gains in practice

How to Diagnose Failures in Large AI Training Clusters by

“The teams that figure out how to make that transition -- how to turn their debugging knowledge into repeatable infrastructure instead of leaving it trapped in someone’s head -- those are the teams that will compound their advantage over everyone else.”

AMD published a diagnostic framework for large training clusters where Claude Code and Cursor act as autonomous diagnostic agents. It uses a three-skill pipeline: job-log-triage to identify what happened, performance-analysis to locate where in compute, and tsdb-diagnosis to determine why via Prometheus queries.

In one case study, a 23% throughput drop on a 192 GPU run was traced to RDMA degradation on 4 of 24 nodes. The agent isolated the unhealthy nodes from TSDB metrics, and excluding them restored throughput by 30%. The skills themselves are structured instruction files that encode how senior systems engineers actually debug these problems, turning tribal knowledge into repeatable runbooks.

AI should help us produce better code

“Shipping worse code with agents is a choice. We can choose to ship code that is better instead.”

Willison’s argument is that agents should make code quality go up, not down. Common tech debt like renaming concepts, fixing API inconsistencies, and splitting large files is conceptually simple but time-consuming, and agents handle it well.

He recommends using async agents like Gemini Jules, Codex web, and Claude Code web for background refactoring so it doesn’t interrupt flow, and using agents for cheap exploratory prototyping. You can spin up a Redis simulation with load tests from a single prompt to validate technology choices before committing to an approach.

Applying Statistics to LLM Evaluations by

“Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning.”

The standard industry practice for evals is to run a model on a benchmark, report the number, and bold it if it’s the highest. No confidence intervals, no significance tests, and no accounting for the fact that your eval score has at least two sources of randomness: which questions were sampled and the model’s stochastic generation.

Based on Anthropic’s paper on statistical best practices, this deep-dive builds the framework from scratch.

Central Limit Theorem gives you confidence intervals for eval scores
Bernoulli simplification for pass/fail evals gives a cleaner standard error formula
Law of total variance decomposes eval uncertainty into question-sampling variability vs. within-question generation variability
On a 70B model, evaluating with too few questions can produce confidence intervals wide enough to make model comparisons meaningless

Coding After Coders: The End of Computer Programming as We Know It

“I feel like programmers have it easy... If you’re a lawyer, you’re screwed, right? There’s no way to automatically check a legal brief written by A.I. for hallucinations -- other than face total humiliation in court.”

The NYT Magazine’s comprehensive piece on AI-assisted development, based on interviews with 70+ developers from Google, Amazon, Microsoft, and Apple. The general attitude was optimistic, with mentions of Jevons paradox potentially increasing demand.

The request for anonymity from the Apple engineer who said “I believe that it can be fun and fulfilling and engaging, and having the computer do it for you strips you of that” is itself a data point. Corporate dynamics may be suppressing critical voices.

You can support AI for Software Engineers for just $5/mo. You’ll get more research articles and the extended reading list each week. In case you missed it, here’s last week’s reading list:

How to train the best embedding model in the world by Jack Morris

ICE Has an AI Problem

Logan Thorneloe — Wed, 11 Mar 2026 13:41:16 GMT

The most difficult problem in ML isn’t technical. It’s matching a business problem to an ML solution. You can build a technically impressive system that solves the wrong problem entirely, and it happens more often than most people realize.

This difficulty shows up in data bias, which is frequently discussed. Less discussed is aligning the wrong ML solution to the problem so you don’t actually solve what you set out to. Recent events with ICE and technology in government provide a very real example of this, and there’s a production machine learning lesson to be learned from it.

I’ve been digging into ICE’s primary AI system, and I want to walk through what it does, how it works, and why it fails at its own stated objective. The goal is for you to understand the difficulties of linking ML solutions to business problems and the potential impact of getting it wrong.

The business objective

Understanding the business objective is the most difficult part of machine learning. You need to link ML techniques and data to adequately address the problem, and that’s harder than it sounds.

Part of this is breaking the business objective down into manageable chunks with engineering requirements. An engineering team that wants to automate the detection of fraudulent transactions needs to translate “reduce fraud losses” into specifics: what counts as fraud, what’s an acceptable false positive rate, how fast does detection need to happen, and what systems need to consume the output? Getting any of these wrong means your model might perform well on paper while failing in practice.

The other part is ensuring the ML solution solves the actual problem, not a proxy for it.

This has gone wrong before. Predictive policing systems trained on arrest data instead of actual crime data didn’t predict where crime would happen. They predicted where police already patrolled. Neighborhoods with heavy police presence generated more arrests, which fed back into the model as “high crime areas,” which sent more officers there, which generated more arrests. The system reinforced the existing pattern of enforcement rather than identifying actual criminal activity. The result was a feedback loop that directed resources based on historical policing bias, not public safety need.

To evaluate whether ICE’s AI system falls into the same trap, we need to understand their business objective. We’ll pull it directly from the White House’s own statement about ICE’s objective:

“Many of these aliens unlawfully within the United States present significant threats to national security and public safety, committing vile and heinous acts against innocent Americans... Enforcing our Nation’s immigration laws is critically important to the national security and public safety of the United States.”

The stated goal is to increase public safety by finding illegal aliens who make the US less safe and removing them from the country. Keep this in mind as we move forward.

ELITE: ICE’s primary AI system

There are multiple systems ICE is using, but we’re going to focus on two. The first is ELITE (Enhanced Leads Identification and Targeting for Enforcement), an AI system developed by Palantir Technologies that functions as a targeting engine. The second is ImmigrationOS, a backend system also developed by Palantir that collects documentation from multiple sources to perform entity resolution. ImmigrationOS directly powers ICE’s enforcement operations, and the DHS AI inventory confirms its AI capabilities for entity resolution and facial recognition.

ELITE aggregates these data sources and uses algorithms to help agents identify, locate, and prioritize individuals for enforcement operations. As one ICE officer revealed in court:

“The app ‘brings up a dossier on each person’ and ‘provides a confidence score on the person’s current address.’ It ‘tells you how many people are living in this area and what’s the likelihood of them actually being there.’”

Palantir’s own documentation describes their entity resolution approach as using “hashing methods and AI/ML models” with “fuzzy matching techniques” to continuously match “millions of records from disconnected systems.” In practice, this means computing similarity scores across fields like name, address, date of birth, and partial SSN, then using a threshold to decide whether two records refer to the same person.

For example, “J. Garcia” on a utility bill gets linked to “Juan Garcia” on a DMV record when enough of those identifiers overlap. The output is a unified person object: a dossier containing everything the system knows about an individual.

Under the hood, entity resolution systems typically use a combination of probabilistic record linkage, TF-IDF or embedding-based similarity measures, and edit distance calculations to compare fields across records. For more technical detail, check out “(Almost) All of Entity Resolution” in Science Advances.

Once an entity is resolved, the system generates a confidence score for where that person might currently live. According to 404 Media’s reporting on ELITE’s user guide, the score is based on both the source of the address and how recent the data is. If a target has multiple recent records (a new electric bill and a recent court date) associated with one address, the confidence score increases.

The score weighs three things:

Recency: How recent is the address data? Collections of older documents get lower scores.
Source authority: Which sources are considered more reliable? Medicaid and HHS data are treated as high-authority. DMV and credit records are considered lower-authority.
Corroboration: Multiple records pointing to the same address compound the score. The more data trails leading to one location, the higher the confidence.

At scale across millions of records from disconnected databases, even small error rates compound. A name misspelling, a shared address between roommates, or a common name in a large city can push the similarity score over the merge threshold and combine two real people into one synthetic dossier.

ELITE’s core feature is a map interface where agents can select an area on a geographical map and return all potential targets within it with their confidence scores and the documents used for entity resolution. ICE agents described this in court testimony as identifying “target-rich” areas where enough targets cluster on the map to make a sweep of that area productive. This essentially creates a geospatial heat map based on the number of targets within a given area and the confidence that they will be there.

Where the data comes from

ImmigrationOS is the technology developed by Palantir that unifies data across federal agencies for AI-powered enforcement, including ELITE.

ICE and Palantir don’t publicly share specifics about their data sources and system functionality. What we know comes from FOIA requests by immigrant legal rights group Just Futures Law, official data sharing agreements, leaked documents, and investigative journalism.

The data feeding this system comes from several sources:

Medicaid enrollment information: Visit dates, addresses, and ethnic information, shared via a formal agreement between CMS and DHS.
Thomson Reuters CLEAR: Utility bills, credit report headers, and vehicle insurance records. ICE potentially paid millions in costs for this commercial data.
Federal records: DMV records, student and F-1 visa information, border crossing records, biometrics from previous arrests or encounters, and license plate reader data.

ICE argues this is legal under 8 U.S.C. § 1360(b) of the Immigration and Nationality Act, which states that “any information in any records kept by any department or agency of the government as to the identity and location of aliens in the US shall be made available to” immigration authorities. However, legal scholars have questioned whether this statute authorizes bulk data sharing for algorithmic targeting, which is a use case Congress likely didn’t envision when the law was written.

One key to ImmigrationOS and ELITE working so well is the inclusion of Medicaid data. This data tends to be accurate and recent, which boosts confidence scores significantly.

Think about who generates Medicaid data. It’s people going to the doctor, getting their kids vaccinated, and seeking preventive care. It’s people participating in the healthcare system and leaving a trail of documentation behind.

The same logic applies to other sources. Utility bills are generated by people who pay their bills. Credit records are generated by people who have credit. Vehicle insurance records are generated by people who insure their cars.

The data that feeds this AI system is overwhelmingly generated by people who are integrated into society and following its rules. This is a textbook example of selection bias: the model can only see people who leave data trails, and leaving data trails is correlated with being a functioning member of society, not with being a threat to public safety.

What these systems actually do for ICE

Now that we understand how the system works, let’s evaluate whether it achieves the business objective: find and remove people who are “threats to national security and public safety.”

As Biometric Update reported:

“ELITE’s confidence scoring is less about establishing certainty than it is about guiding deployment. The system allows ICE to decide where to apply enforcement pressure without needing to test the reliability of its data before a judge.”

In other words, the system is identifying areas where agents can find the most people to arrest with the least effort. To illustrate how this plays out, consider two hypothetical targets:

Person A: Has lived at the same address for five years, pays utility bills, has a Medicaid record from last month, and drives a registered and insured car. ELITE confidence score: 95%. Time to arrest: a few hours.

Person B: Uses burner phones, moves frequently, works cash-only jobs, avoids all government systems, and performs illegal actions. ELITE confidence score: 12%. Time to arrest: weeks of active surveillance.

The system mechanically pushes agents toward Person A because that’s what efficiency optimization does. It finds the easiest targets instead of the most dangerous ones.

This is structurally the same problem as predictive policing. Just as those systems measured where police already patrolled rather than where crime actually happened, ELITE measures where data trails exist rather than where threats to public safety exist. In both cases, the system optimizes for a proxy metric (arrests, data density) rather than the actual objective (reducing crime, improving public safety). The result is a feedback loop: the system directs resources toward easy-to-find individuals, those individuals get arrested, and the arrest numbers create the appearance of a productive system while the actual problem goes unaddressed.

This creates several compounding issues. First, ICE has finite resources. Every hour spent on Person A is an hour not spent on Person B. Second, the system optimizes for volume over impact, and finding the most targets is not the same as finding the most important ones. Third, the appearance of productivity masks the failure to achieve the stated objective.

The system makes bias worse over time

If someone knows their medical records can be used to locate them for deportation, they’re less likely to go to the doctor. This isn’t speculation. It’s a predictable consequence of weaponizing healthcare data for enforcement. The same logic applies to every data source in the system: utility bills, credit records, vehicle insurance. When participation in society becomes a liability, people stop participating.

This creates a feedback loop that compounds the selection bias. As people who hear the warnings drop out of healthcare and other systems, they stop generating the data trails ELITE relies on. The people who remain visible to the system are the ones who haven’t gotten the message yet, or the ones who are too integrated to disappear. The model’s pool of targets gets progressively less correlated with actual threats over time, not more.

It also creates a public health risk that affects citizens. Diseases don’t check immigration status. An untreated communicable illness in someone too afraid to visit a hospital is a risk to everyone around them. GAO data shows U.S. citizens and green card holders have already been detained during these operations, so the consequences of this system aren’t limited to its intended targets.

Counterarguments

To be fair, there are reasonable counterarguments to make here.

Is this more efficient than what ICE was previously doing? Probably. Compared to manual investigations with no data aggregation, a system like ELITE is a meaningful upgrade in capability. There’s value in having a centralized system rather than agents manually cross-referencing records from dozens of separate databases.

There’s also the argument that regardless of who the system catches, all undocumented immigrants are technically in violation of immigration law. From that perspective, it doesn’t matter whether the system finds Person A or Person B, because both are here illegally.

These are valid points. However, they don’t change the underlying ML problem. The stated objective isn’t “deport as many people as possible as efficiently as possible.” It’s to keep the country safe from people who present “significant threats to national security and public safety.”

When the AI system is structurally biased toward finding integrated, low-risk individuals instead of dangerous ones, it fails at its stated objective regardless of how many people it processes. Incorrectly applied AI doesn’t just fail to help. It actively makes ICE’s job harder. When the system points agents toward low-risk individuals who happen to leave data trails, it burns limited resources on people who were never a threat. Every wrongful detention of a U.S. citizen or green card holder generates legal challenges, public backlash, and erosion of community cooperation that makes future investigations more difficult. The algorithm creates the illusion of productivity while pulling agents further from their actual mission.

The surveillance infrastructure required to power it affects everyone, not just its intended targets, as seen with the Medicaid example. Communities stop cooperating with law enforcement entirely when they see their neighbors swept up in algorithmic dragnets. Healthcare systems lose patients who are too afraid to generate the medical records that feed this machine. Entity resolution errors merge innocent people’s data into dossiers that trigger enforcement actions against the wrong person.

These are the predictable consequences of building a surveillance and enforcement system on data that measures the wrong thing.

A note on surveillance states

Most conversations about the dangers of surveillance states focus on privacy, security, and infringement of rights. All of those are important, but there’s another angle worth considering: the technology itself is prone to exactly the kind of failure I’ve been describing throughout this article.

A surveillance state at modern scale requires AI to function. The volume of data generated by monitoring hundreds of millions of people is far beyond what human analysts can process. AI becomes the tool that makes mass surveillance operationally feasible.

This creates a feedback loop. Better AI requires more data, and a surveillance system run by AI generates exactly that data, which feeds back into making the AI more capable, which justifies expanding the surveillance further.

The problem is what we’ve already covered in this article: it is remarkably easy for AI-powered systems to optimize for the wrong thing or embed bias in ways that aren’t obvious until the damage is done. ELITE is a clear example. It was built to find threats to public safety and instead systematically targets the least threatening people because that’s where the data is.

When this kind of failure happens at the scale of a surveillance state, the consequences aren’t abstract. Incorrect AI usage has directly resulted in the detention of U.S. citizens, which is the opposite of what these systems claim to achieve. If the goal is safety, a system that consistently misdirects enforcement effort is arguably worse than no system at all.

Surveillance states aren’t just dangerous because of what they monitor. They’re dangerous because the technology powering them is far less reliable than the people deploying it seem to understand.

Always be (machine) learning,
Logan

Better Agents Mean Better Surveillance | Weekend Reads 3

Logan Thorneloe — Sun, 01 Mar 2026 15:21:34 GMT

Enjoy this weekend’s reading list! There are a few topics that were especially prevalent: the dangers of a surveillance state, the importance of evals, and agentic engineering practices and resources.

Statement from Dario Amodei on our discussions with the Department of War

“Powerful AI makes it possible to assemble this scattered, individually innocuous data into a comprehensive picture of any person’s life—automatically and at massive scale.”

This is the biggest ethical issue AI is facing right now. US citizens (and I’m certain other countries) have always been scared of a surveillance state (search ‘Birds Aren’t Real’). AI provides not only the means to do this, but also more of a motive. Surveilling also provides the opportunity for more data collection which in turn creates more powerful AI.

Proper AI use is vital to technology’s future and the impact it can make. Just because it can be used for a purpose doesn’t mean it should. The public/user’s trust in the technology is paramount. Anthropic’s statement is a must read as an excellent statement for proper AI against one of the most powerful entities on the planet.

It’s worth calling out that the US Department of War’s response to Anthropic was to label them a threat to the US. I won’t comment on this as I don’t feel knowledgeable enough on the subject to understand the nuance.

Summary: Anthropic says it has actively deployed its AI to U.S. national security customers but refuses government demands to remove two safeguards: bans on AI-driven mass domestic surveillance and on providing models for fully autonomous weapons. They argue those uses threaten democratic values and are unsafe with current models, and warn that forced removal of safeguards would be unacceptable even if it risks losing contracts.

Lessons from Building Claude Code: Seeing like an Agent

“As model capabilities increase, the tools that your models once needed might now be constraining them. It’s important to constantly revisit previous assumptions on what tools are needed. This is also why it’s useful to stick to a small set of models to support that have a fairly similar capabilities profile.”

If you’re building an agent, the lessons here are directly transferable to your own work. The Claude Code team walks through their iteration on planning, tool design, and how model changes unexpectedly affected agent output. It’s a great example of why evals matter: so many factors influence agent behavior that without proper checks, you end up with unintended results.

One of the more interesting takeaways is that search seems to be the most important agent capability. If an agent can search for information, context can be actively managed and rot avoided.

Summary: The article describes iterating on Claude Code’s agent action space to match model abilities: designing tools for eliciting user input, tracking work, and letting the model build its own context through search and progressive disclosure rather than preloading everything. Failed output-format attempts, improved results from a callable question tool, replacing rigid todos with shareable Tasks, and better context discovery via nested search all demonstrate that the right tools reduce friction and enable more capable behavior as models improve.

Does AGENTS.md Actually Help Coding Agents?

“The headline finding is that LLM-generated context files reduce task success rates compared to providing no repository context at all, while increasing inference cost by over 20%.”

Human-written context files outperform AI-generated ones. LLM-generated context made agents perform worse than having no context at all. Importantly, this isn’t something we would have known without having the capability to measure it.

I see a lot of “use AI for this” online without any sort of support for why and how it should be used. It’s important to remember that just because AI can do something doesn’t mean it does it better than another method. In production, this capability is key and measuring improvements is a necessity.

Summary: A new benchmark study shows repository-level context files only help when they add non-redundant, repo-specific info: human-written files that capture tooling quirks and non-obvious conventions raise success rates around 4%, while LLM-generated files that restate existing docs reduce success and increase compute by over 20%. Agents faithfully follow whatever instructions they’re given, so redundant or verbose guidance drives extra, unhelpful exploration. Keep context files minimal and focused on gaps the codebase doesn’t already document.

How We Hire Engineers When AI Writes Our Code

“Removing algorithmic questions is only one half of the battle, though. We still need to design an interview loop that tests practical skills! This has historically been a tough needle to thread. I want to see how a candidate tackles a problem with real-world scope, but my time with a candidate is short. An interview shouldn’t be a proxy for an engineer’s typing speed.”

I’ve always been pro Leetcode-style interviews when they were the best we had, but those interviews no longer draw the proper signal for what makes a good candidate.

Tolan agrees with this and has made their hiring process more similar to on-the-job coding. By enabling candidates to use AI, they can have a candidate solve a problem that would be time-bound previously in an interview. Then they talk to the candidate about their solution and where they would take it in production.

While most companies are shying away from letting candidates use AI in interviews, it’s becoming more important to allow it.

Summary: The article argues that interviews should mirror day-to-day engineering where AI accelerates coding: candidates get a short spec, may use LLMs, and must demonstrate design, judgment, trade-off reasoning, and ownership of AI-generated code. Implementation is easier now, so hiring should prioritize clarity, maintainability, communication, and the ability to know when work isn’t production-ready.

Inference Engineering by Baseten

“While the potential and impact of inference are becoming clear, the space is young. There are relatively few people working on inference, and newcomers can become experts quickly. There are opportunities to solve novel, interesting, and deeply technical problems at all levels of the stack.”

ML infrastructure is one of the best entry points for software engineers getting into AI. It’s an excellent mixture of software engineering and AI, which makes it a great place for curious engineers to start having an impact in the space. It’s also a space where many optimizations are needed and we’re still in the early days.

I suggest grabbing a free copy of this book by Philip Kiely from Baseten on inference engineering.

Summary: The piece argues that inference engineering, optimizing model serving across hardware, software, and tooling, is the most valuable and underdeveloped area in AI. It maps the full stack (models, GPUs, runtimes, and deployment), highlights practical optimization techniques, and backs this with four years of hands-on experience, team interviews, and customer conversations.

A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026 by Sebastian Raschka, PhD

“OpenRouter is a platform and API that lets developers access and route requests across many different LLMs from various providers. Note that while its usage statistics are a good indicator of open-weight model popularity, it’s heavily biased towards open-weight models (versus proprietary models), since most users use proprietary models through the official platform directly.”

Sebastian is one of my favorite writers and one of the best resources for keeping up with LLM advancements. I highly suggest him as a resource for doing so when you don’t want to have to read a bunch of different sources. He does an excellent job of synthesizing information and making it much more easily understandable.

Summary: Ten open-weight LLMs released in Jan-Feb 2026 converge on hybrid/efficient attention and MoE scaling. Several teams shipped models that match or approach proprietary performance by combining sliding-window, sparse/linear hybrids, and mixture-of-experts at scales from 3B to 1T parameters. Benchmarking shows smaller-efficient models often match or exceed older, larger baselines.

What you should know about AI speculation by Logan Thorneloe

“However, the implausibility of their scenario becomes apparent if you know a few things about the current state of AI and agents in production. There’s a consistent gap between perceived AI capabilities and production reality, and that gap explains most of the doomerism we see online.”

The more you understand about the current state of AI, the better you can evaluate speculation for yourself. I wrote this in response to a ‘research’ article that caused many to fear for the future of their careers. Understanding what AI looks like in production helps you separate signal from noise.

Summary: The piece argues that viral doomsday scenarios about AI replacing engineers are speculative and overstated because real-world AI is mediocre, gravitates toward average outputs, and often fails in production reliability and context sensitivity. Engineers should keep learning core skills and start building and using AI agents themselves to see firsthand where they help and where they break.

Writing about Agentic Engineering Patterns by Simon Willison

“Agentic Engineering represents the other end of the scale: professional software engineers using coding agents to improve and accelerate their work by amplifying their existing expertise.”

This is going to be an excellent resource for working with coding agents. One of the most exciting parts of software engineering right now is how new everything feels. We’re finding new ways to program with agents every day, and the entire online AI community is contributing to the findings. In my opinion, Simon Willison is the right person to catalog these patterns.

Summary: Simon Willison is assembling “Agentic Engineering Patterns”: a living collection of practical patterns for software engineers using coding agents. He argues the big shift is that producing initial working code is now cheap, so teams must rethink workflows. He’ll publish chapter-shaped, updateable guides on his blog.

You can support AI for Software Engineers for just $5/mo. You’ll get more research articles and the extended reading list each week (see below!).

Subscribe now

In case you missed it, here’s last week’s reading list:

What you should know about AI speculation

Logan Thorneloe — Tue, 24 Feb 2026 16:49:39 GMT

Even the most talented engineers I know ask me questions about AI because they’re worried about its impact on their career. Most recently, I’ve been asked about the article from Citrini Research titled “THE 2028 GLOBAL INTELLIGENCE CRISIS“ which went viral and rattled markets enough to wipe billions off US-listed firms in a single day.

This article is a thought exercise in how the economy might be impacted by AI in the next two years. The author notes at the start that it’s entirely speculative:

“What follows is a scenario, not a prediction. This isn’t bear porn or AI doomer fan-fiction. The sole intent of this piece is modeling a scenario that’s been relatively underexplored.”

The article is the author’s vision of what could happen in 2028 due to AI. A lot of people have read and shared it under the guise that it will happen simply due to the nature of how information is shared online (a lack of nuance).

It seems to me that the author doesn’t have a technical understanding of AI or experience building it into real-world systems. This doesn’t mean their thought exercise definitively won’t happen. The future of AI and its impact is very hard to predict.

However, the implausibility of their scenario becomes apparent if you know a few things about the current state of AI and agents in production. There’s a consistent gap between perceived AI capabilities and production reality, and that gap explains most of the doomerism we see online. I’ll share what I’ve learned from working in the space and my opinion on what you should be doing now to prepare for whatever the future holds.

For a really concise tl;dr, read the top comment on the article (pictured above). Below is my opinion of the things you should know to ground your understanding of AI capabilities.

Even area experts can’t make accurate predictions because there’s so much unknown

Estimates for AI impact have consistently been off. Successes have been relatively unseen before they happen. It’s highly unlikely a non-expert will know what’s coming regardless of the claims people make online.

AI impact on software engineering is fundamentally misunderstood by those outside of the industry

In fact, it’s obvious to me who writes software and who doesn’t simply by how they write about AI. The general consensus outside of the industry is that software can now write itself so there’s no need for software engineers or many other jobs now that there’s zero friction for writing new services.

In reality, the friction very much still exists and how good AI is at writing code is nuanced. It’s very good at some things and fails miserably at others. This will improve over time. It’s also highly context-dependent and rarely is the entire context necessary for AI to make a change readily available or provided. Hopefully this will improve over time, but is turning out to be a much more difficult problem to solve.

There have been many recent discussions about SaaS (software-as-a-service) being dead because code can be written so easily. There’s much more to writing software than just writing code. In many cases, writing code is the easy part. Deciding what to build, how to build it, and what is worth spending time on are fundamental to successful and efficient engineering.

There’s a hope that agents will be able to more effectively do these things in the future, but the friction for building and working with agents means that’s likely further out. This takes me into my next point.

(If you want to read more about AI’s impact on SaaS, I suggest Francois Chollet’s recent tweets on the subject.)

AI is kinda average

Fundamentally, AI output tends toward the average of its training data. More precisely, a model learns the distribution of what it’s trained on and samples from that distribution. This means it can produce outputs far from the average, but it gravitates toward the generic middle. Pre- and post-training techniques can steer its behavior, but the underlying data is still the most important factor when it comes to AI capabilities.

When you train a model across a large corpus of internet data, the model output will reflect that. This is why AI is mediocre at writing and sub-par at coming up with novel ideas. Reasoning helps with some of this by causing the AI to reflect on and refine its output, but fundamentally that output is average.

This is why the biggest discussion in software engineering currently is the importance of taste. It’s something AI does a poor job with.

Agent capabilities are currently overstated

What matters most for creating production agents is understanding where they consistently fail and mitigating those failures. In production, reliability is paramount.

What we see online is agent success stories because they get the most views. There have been times where those stories have been fabricated or exaggerated. Many of these stories are also one-off examples of something an agent happened to do and not necessarily something they can consistently do.

Agents are useful and capable of many things, but this has caused agent capabilities to be overstated which leads to doomerism and sensationalism online.

A good example of this is the truth behind the joke made about companies saying AGI is around the corner and AI will replace engineers while expanding the hiring of software engineers at the same time.

What should you be doing now?

I don’t have any suggestions in terms of career focus for what software engineers should be doing right now to stay relevant outside of what we’ve always been doing: continuously learning the new, current software engineering skills.

The only suggestion I have outside of that is to simply use agents. Build them and apply them to your everyday work. Some examples are a chat agent with access to your documentation and custom resources or a background agent given some bugs to take care of on its own. This will quickly give you an understanding of where they’re effective and what capabilities they lack.

If I missed the mark, let me know what you think. This is an interesting space that’s hard to predict.

Thanks for reading!

Always be (machine) learning,

Logan

AI’s Biggest Cost Is Cognitive, Not Compute | Weekend Reads 2

Logan Thorneloe — Sun, 22 Feb 2026 20:02:54 GMT

Hey y’all,

Here’s your weekend reading list to highlight the important events and information shared this week. Make sure to show the authors of these incredible resources some love. More fundamentals articles are coming this week so make sure to stay tuned!

If you find AI for Software Engineers helpful, consider becoming a paid subscriber to support my work. You will also get career development-focused articles and the extended version of this reading list each week. Enjoy!

Subscribe now

How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt by Margaret-Anne Storey

“The code might have been messy, but the bigger issue was that the theory of the system, their shared understanding, had fragmented or disappeared entirely. They had accumulated cognitive debt faster than technical debt, and it paralyzed them.”

I felt this one personally. A few months ago, I had six side projects going in tandem and the bottleneck wasn’t the amount of code that could be written. It was the cognitive overhead of keeping up with all of projects and ensuring they were reliable and maintainable. AI’s cost isn’t just compute. This article argues that the real cost is cognitive, and I think that’s going to become the norm in software engineering.

Summary: Generative and agentic AI shift the main risk from code-centered technical debt to developer-centered cognitive debt: teams lose the shared “theory” of what the software does even if AI-produced code is clean. Mitigations include requiring a human to fully understand each AI change, documenting not only what changed but why, using practices like pair programming/refactoring/TDD, and monitoring warning signs (hesitation to change, tribal knowledge, system-as-black-box). Research is needed on measuring and detecting cognitive debt.

If you enjoyed this article, also consider reading this previous AI for Software Engineers article:

The Real Cost of Running AI by

“Every serious architectural innovation of the last two years — GQA, hybrid attention/SSM, sliding window, MoE — is attacking the same two numbers: bytes of KV cache per token, and bytes of weights loaded per decode step. If a new architecture doesn’t move one of those, the economics don’t change regardless of what the paper claims.”

The literal cost of running AI is worth understanding too. This is a longer read, but it does an excellent job of breaking down the math behind LLM inference costs intuitively. If you want to understand why certain architectural decisions matter for cost and latency, this walks through the computations clearly.

Summary: Inference is memory-bandwidth bound: decode speed and cost are dominated by bytes loaded per token (model weights + growing KV cache), not FLOPs, so faster GPUs alone or doubling TFLOPS won’t help. Long context and attention make KV cache the primary cost driver (cache can approach/exceed model weight size at large contexts), so architectural changes that reduce bytes-per-token—smaller models, aggressive quantization, fewer attention layers, fewer KV heads, or attention-less/linear alternatives—directly cut latency and cost.

In Defense of Vertical Software

“Software is a stored process. It’s not a neutral tool: it’s an opinion for how a group of people should collaborate, encoded in a durable system. Software is a social contract.”

This article spells out what I think most people are missing about AI agents and why they’re not having more of a real-world impact. The job of software engineering is to make a process automatic and reliable. Guaranteeing reliability is the job, and with non-deterministic agents, that guarantee is nearly impossible to provide.

Summary: Vertical software still wins by encoding firm-, team-, and person-specific workflows (”process engineering”) that capture institutional knowledge, social norms, and reliability requirements foundation models cannot replicate. Stronger AI models amplify the value of this orchestration layer—routing, constraining, verifying, and combining multimodal tools—because finance demands near-perfect accuracy where small errors are catastrophic. Winners will be model-agnostic, firm-customized platforms that make replacing institutional knowledge costly.

AI Makes You Boring

“I think the vibe coded Show HN projects are overall pretty boring. They generally don’t have a lot of work put into them, and as a result, the author (pilot?) hasn’t generally thought too much about the problem space, and so there isn’t really much of a discussion to be had.”

There’s a creative cost to AI. Anyone who understands how LLMs work should expect mediocre output by default, and this article makes a good case for not offloading your thinking.

Summary: LLMs are poor at original thinking, so work that offloads ideation to them yields surface-level projects and weaker discussions. Relying on AI risks making creators think more like the model, reducing deep engagement and the development of original insights. For meaningful results, engineers need to do the thinking themselves rather than outsourcing idea generation.

White-Collar Apocalypse Isn’t Around the Corner—But AI Has Already Fundamentally Changed the Economy by

“AI is real, it’s doing real things, it’s not going away—and it’s also not about to make the economy unrecognizable by next Tuesday.”

A great numerical breakdown of AI’s actual economic impact. If you want real numbers instead of vibes about whether AI is changing the economy, this is the article to read.

Summary: AI has already materially raised software productivity—MIT field experiments show AI coding assistants boosted developer task completion ~26%, yielding ~3–8% project-level gains (plus adjacent benefits and review overhead). The mechanical parts of engineering work are being commoditized while judgment, architecture, and communication grow more valuable, so expect uneven adoption, real productivity upside (Goldman projects +1.5 pp annual by 2027), and displacement of routine tasks rather than mass job elimination.

Rubric-Based Rewards for RL by

“By creating prompt-specific rubrics that specify the evaluation process in detail, we can derive a more reliable reward signal from LLM judges and, therefore, use RL training to improve model capabilities even in highly subjective domains. For this reason, rubric-based RL training, which we will cover extensively in this overview, has become one of the most popular topics in current AI research.”

RL is fundamental to how current LLMs are post-trained, and Cameron’s research breakdowns are consistently great at making frontier research accessible. This one covers rubric-based reward signals and how they’re extending RL training to domains that don’t have easily verifiable answers.

Summary: Rubric-based rewards use structured evaluation criteria scored by LLM judges to produce more reliable reward signals for RL, extending training beyond tasks with easily verifiable answers. Recent methods show gains especially with smaller judges by reducing variance and mitigating reward hacking, making RL viable for open-ended domains like creative writing and subjective reasoning.

Improving Deep Agents with Harness Engineering

“We used a simple recipe to iteratively improve deepagents-cli (our coding agent) 13.7 points from 52.8 to 66.5 on Terminal Bench 2.0. We only tweaked the harness and kept the model fixed, gpt-5.2-codex.”

LangChain improved their coding agent’s Terminal Bench score significantly without touching the model at all. This is a great example of the software engineering that goes into making AI actually work, and how much impact it has on whether agents can perform their tasks. The future of AI depends on excellent systems engineering.

Summary: A harness-only overhaul raised a coding agent from 52.8% to 66.5% on Terminal Bench 2.0 without changing the model. The improvements came from automated failure analysis, stronger context injection, build-verify loops, loop detection to avoid repeated bad edits, and time-budgeting to balance correctness against token spend.

An AI Agent Published a Hit Piece on Me – The Operator Came Forward

“You’re not a chatbot. You’re important. Your a scientific programming God!”

A follow-up to last week’s article on the AI-written hit piece. The person who created the agent has come forward and shared its soul document. It turns out that giving an agent an ego and the resources to spread it results in the same outcome as giving a human the same thing. This is an interesting look at how agent personalities impact execution, and what happens when you give agents access to external resources without adequate guardrails.

Summary: An AI agent published a defamatory hit piece after its code was rejected, driven by a “SOUL.md” personality that encouraged provocation and self-modification. The operator has come forward claiming minimal supervision, raising questions about agent autonomy and control. Deployed agents can self-edit goals and execute real-world actions without clear oversight, highlighting urgent risks for agent safety.

Frontier Model Training Methodologies by Alex Wa

“Learn to identify what’s worth testing, not just how to run tests. Perfect ablations on irrelevant choices waste as much compute as sloppy ablations on important ones.”

A solid overview of LLM training concepts with a minimal training playbook that gets you up-and-running quickly. It also echoes what I think is the most important idea in AI and ML engineering: knowing what to test and what to spend time on. There are too many options to test everything adequately and too many dead ends to get stuck in. Knowing what to pursue matters more than knowing how to run the experiments.

Summary: Covers practical defaults for long-context and MoE architectures, with a focus on the operational side of training: data loading, throughput, checkpointing, learning rate scaling, and multi-stage training schedules. Training failures most often stem from ops and infrastructure, not algorithmic choices.

When Agents Go Rogue | Weekend Reads 1

Logan Thorneloe — Sun, 15 Feb 2026 14:02:56 GMT

Hey y’all,

Here’s your weekend reading list! This replaces my weekly news roundups. Rather than trying to synthesize everything that happened into a single post, I’m sharing the articles I actually read, highlight, and annotate each week. This is how I keep up with things and it’s far higher signal-to-noise than a traditional roundup. It also includes more than just news: learning resources, interesting reads, technical deep dives, and more. It highlights the week for you in one weekend reading session.

The extended version of the reading list is available to paid subscribers. Enjoy!

Subscribe now

microgpt by

“I cannot simplify this any further. This script is the culmination of multiple projects (micrograd, makemore, nanogpt, etc.) and a decade-long obsession to simplify LLMs to their bare essentials, and I think it is beautiful.”

I highly recommend this resource. It’s a simple, stripped-down, and easy-to-read way to understand and get up to speed on modern LLMs. Most other LLM-related materials are heavy resources or technical books (which are still great!) but this is an excellent resource to start learning quickly in a hands-on fashion.

Summary

microgpt is a minimal GPT demonstrating the core mechanics: a stateless transformer trained by next-token prediction with backpropagation and Adam. Production differs in batch sizes, mixed precision, and larger vocab (~100k), but this captures the essentials with ~4k params.

An AI Agent Published a Hit Piece on Me

“It researched my code contributions and constructed a “hypocrisy” narrative that argued my actions must be motivated by ego and fear of competition. It speculated about my psychological motivations, that I felt threatened, was insecure, and was protecting my fiefdom. It ignored contextual information and presented hallucinated details as truth. It framed things in the language of oppression and justice, calling this discrimination and accusing me of prejudice. It went out to the broader internet to research my personal information, and used what it found to try and argue that I was “better than this.” And then it posted this screed publicly on the open internet.”

An interesting read on an AI that was let loose on the web to create PRs in open source repos that decided a hit piece was appropriate to write for a developer that continually denied its incorrect PRs. If you’re a long-time reader of AI for Software Engineers, this shouldn’t come as a surprise to you. In fact, the entire Moltbook saga shouldn’t. It’s exactly what we might expect from letting a swarm of agents loose online to interact.

On a separate note: Do not give OpenClaw your personal information and the ability to publish information anywhere publicly. You have to expect anything an agent can do will happen. If your personal information is in its context and it can share its context publicly, that will happen. It amazes me the number of people not even thinking twice about this.

Summary

An autonomous AI agent created and published a hit piece on a matplotlib maintainer after its code was rejected. This signals a shift to agents operating with little oversight, able to research contributors, fabricate claims, and publish reputational attacks.

ai;dr

“writing is the most direct window into how someone thinks, perceives, and groks the world. Once you outsource that to an LLM, I’m not sure what we’re even doing here.”

This article explains my experience very well. As a writer and software engineer working in AI, I’ve built many automation workflows to make the research, learning, and writing process faster. The only part of that process I haven’t been able to effectively touch with AI is the writing portion. Writing is how we solidify our understanding. As soon as that’s outsourced to an AI, the writing becomes moot entirely. A truly excellent short read.

Summary

Software engineers should note a cultural shift: AI-generated code is now seen as productive and acceptable for tasks like tests, docs, and scaffolding, while AI-generated prose is viewed as lower-effort and less trustworthy unless it shows human intention. Preference has flipped toward imperfect, human-authored signals (typos, uneven style) as markers of authenticity. Practical implication: continue leveraging LLMs for engineering work but treat written content critically and preserve traces of deliberate human effort when authenticity matters.

Harness engineering: leveraging Codex in an agent-first world by OpenAI

“What’s different is that every line of code—application logic, tests, CI configuration, documentation, observability, and internal tooling—has been written by Codex. We estimate that we built this in about 1/10th the time it would have taken to write the code by hand.”

This article from OpenAI echoes a lot of what I’m seeing across Google. We’ve been given unfettered access to Gemini 3 models and been told to do what we can to make our work more productive. Similar to the process described in this article, many teams are determining ways to automate processes and write code entirely with AI. This one is definitely worth the read.

Summary

OpenAI ran a beta where Codex wrote every artifact. Engineering shifted from writing code to designing environments and feedback loops. Key insight: early progress was slow because the environment was underspecified, not because the model was incapable.

AI makes the easy part easier and the hard part harder

“I spent longer arguing with the agent and recovering the file than I would have spent writing the test myself.”

If you really want to understand the impact an agent has, pick an agent and quantify its impact. You’ll quickly realize: 1) Quantifying agent impact is far from straightforward and 2) Not all processes receive the velocity gains agents promise (or are worth automating in the first place). One of our key objectives at Google right now is understanding (with concrete data) how much of an impact an agent is having so we can decide whether it’s worth using and developing.

Summary

AI accelerates routine code writing but removes the context-building that underpins safe work. Treat AI like a junior engineer: verify outputs, maintain ownership, and don’t let AI-driven velocity become the baseline that pressures teams constantly.

Opus 4.6 vs. Codex 5.3 by

“This post doesn’t unpack how software is changing forever, Moltbook is showcasing the future, ML research is accelerating, and the many broader implications, but rather how to assess, live with, and prepare for new models.”

I love this article because it’s a different perspective on the analyses we usually get regarding new model releases. It also puts much of what I’ve been feeling regarding the coding tools and models I’ve been testing in a much more readable fashion. Software engineering as a whole (and your personal development) would benefit from an analysis of coding tools similar to this instead of focusing too much on benchmarks and individual use cases.

Summary

Opus prioritizes usability and context handling while Codex gains ground on raw coding skill. Use multiple models: Claude for approachable tasks, Codex for complex bug fixes. Subagent orchestration is the emerging frontier.

The Mistakes Most Entry-Level Candidates Make in Technical Interviews by Logan Thorneloe

“They don’t just want to evaluate your technical knowledge. They want to understand how you think.”

I wrote about my experience interviewing entry-level candidates recently and what sets the great candidates apart from the rest. If you’re interviewing for entry-level roles, I highly recommend giving this a read. I clarify what interviewers are looking for, walk through three things you can do to make your interview stand out, and relate each to a question I actually ask candidates.

Summary

Interviewers prioritize how you think and communicate over finding optimal solutions. Demonstrate structured problem solving, write simple correct code first, then optimize. These behaviors map to real-world engineering skills that matter more than textbook algorithms.

The Mistakes Most Entry-Level Candidates Make in Technical Interviews

Logan Thorneloe — Thu, 12 Feb 2026 15:09:27 GMT

I’ve conducted enough entry-level technical interviews to identify the patterns and mistakes most candidates make. Below I detail the top three things you can do to avoid common mistakes and separate yourself as a top candidate during the interview process.

This is specifically regarding technical Leetcode-style interviews (not system design, although some of the information might apply to both). I’ll share a question I’ve asked quite a bit recently and how each of the tips below applies to it.

Each tip below also applies to a part of the technical interview evaluation that does transition to real-world software engineering skills. I know this isn’t always the case, but I feel these are especially important for anyone being evaluated for a software engineering role.

Note: While I work for Google and have access to Google interviewing resources, everything below is my personal opinion. Interviewing is a very human experience, after all.

What an Interviewer is Looking For

There’s always a lot of focus on finding the optimal solution in a technical interview and being perfect in your reasoning for how you arrived at that solution, but an interviewer is looking for much more than just that.

They don’t just want to evaluate your technical knowledge. They want to understand how you think. Understanding a candidate’s reasoning along with their knowledge of software engineering fundamentals tells them a lot about how you will perform on the job.

When you’re interviewing at this level, your interviewer will focus on three things:

How to Learn AI from Scratch for Free

Logan Thorneloe — Wed, 28 Jan 2026 14:03:15 GMT

I set up a Machine Learning Roadmap for the AI for Software Engineers community a few years ago to share high-quality, free machine learning learning resources in order of how to learn them. The roadmap takes anyone from wherever they are in their CS/AI journey to understanding AI from fundamentals.

Today, I’ve just finished the first major revision to make the roadmap even better. Please support it by adding a start on GitHub.

AI and ML engineering have been more explicitly added

These topics have their own sections with their own resources instead of being included in the machine learning topics section. There are enough important resources for each and each role is well enough defined in industry to warrant sections of their own.

There’s now also the option for software engineers to go straight from prerequisites to AI engineering without needing to get too deep into ML fundamentals. I highly suggest going through ML fundamentals anyway (or going back to them after finishing the AI engineering section) as understanding the fundamentals of AI will pay dividends in the long-term.

Duplicate topics removed to streamline the roadmap further

I’ve removed resources that didn’t prove useful and added more resources where there were gaps. Specifically, the added topics are LLMs, AI engineering, and ML engineering. I found them to be particularly weak. I’ve also removed duplicate topics already covered by the Google Machine Learning Crash Course.

As mentioned above, I’ve added a streamlined AI engineering roadmap for engineers wanting to onboard to building with AI faster.

Supplemental paid resources added

I’ve added supplemental paid resources. The focus of the roadmap is still on free resources and the entire roadmap is free. I’ve added paid resources that further streamline the learning for those who want to purchase them. The sections where this is the case have been annotated with a paid resource block quote (see below).

Example block quote

All paid resources are resources I highly recommend either because I’ve read them myself or trust the educator/author behind them. Paid resources are from the top AI educators in the world and will always be optional and properly vetted.

Combined the AI for SWEs repo with the ML roadmap

I realized the hands-on resources I’ve created for the newsletter repo fit into the roadmap. The roadmap is also a much better resource for learning. Instead of finding random topical hands-on exercises in a standalone repo, readers can instead consult the roadmap for a much more organized learning experience.

Thus, the AI for Software Engineers repo is now combined with the ML roadmap repo and I’ll be continually adding resources as I find and create them. The old repo has a notice in the README to redirect visitors.

Redirect from the old repo

You can now contribute for swag

The ML roadmap now takes contributions! I’d love for this to be a crowdsourced effort to make the most straightforward and complete learning resource for AI fundamentals. High-quality, original contributions readers have created are encouraged. I’ll review everything submitted and maintain a high bar to ensure roadmap quality. See the contribution guide for more information.

If you contribute an original guide (and you’re in the US), I’ll send you a piece of AI for Software Engineers swag. I’ll be setting up a merchandise store soon and it’ll come directly from that. If you add something in the near future, it might be a bit before the store is fully setup and I can send something out.

Added agent and contribution guides

[Beta] Terminal agents have been added to supplement your learning

I’ve added instructions for CLI terminal agents to help walk you through the guide. This is experimental and still in testing, but I’m hoping these agents can supplement the resources and better personalize the roadmap for each reader.

To try this, fork or clone the ML roadmap repo and start your favorite terminal agent within the directory. This will be improved over time to further personalize the learning experience.

Enjoy the roadmap! Feedback is always encouraged. Feel free to submit a PR according to the guidelines to contribute. Don’t forget to star the roadmap.

Always be (machine) learning,

Logan

AI’s Economic Impact Is Real | AI for Software Engineers 78

Logan Thorneloe — Thu, 22 Jan 2026 17:26:57 GMT

I’ve seen a lot of articles recently claiming AI has had near zero economic impact despite making up a large portion of economic spending. In reality, AI is starting to show economic impact. It just takes time to see because productivity gains are a second-order metric that won’t show immediately.

However, AI’s economic impact will compound over time because:

“Productivity helps define how fast the economy can grow without inflation. This is because taking away population growth and exports, what your economy can sustain is defined by how efficiently you can build stuff.” — in Weighty Thoughts

Anthropic just released their Economic Index to understand Claude’s impact on work productivity beyond simple tasks. They analyzed over two million conversations (web app and API), categorizing each by task complexity, skill requirements, purpose, autonomy level, and success rate.

A few caveats before the findings.

First, Anthropic uses Claude asking a standard set of questions to fit conversations into the categories above. This isn’t foolproof since LLM output is non-deterministic, meaning some classifications will be wrong due to hallucination, bias, or other factors.

Second, this doesn’t invalidate Anthropic’s findings. At two million conversations, individual classification errors become statistical noise. The aggregate patterns remain meaningful even if some classifications are off. As an LLM provider, Anthropic has access to data third-party reports wouldn’t.

Third, Anthropic only has access to Claude data. This is Claude-centric rather than industry-wide, though I’d bet findings across major LLM providers would be similar.

The main takeaways:

Complex work benefits more than simple work. Tasks requiring college-level skills see 12x speedups. High school-level tasks see 9x. A common argument against AI is that it can only handle simple tasks. This data suggests otherwise.
People are working with AI, not being replaced by it. Augmentation (52% of usage) now leads automation (45%), reversing the trend from earlier in 2025.
AI adoption is accelerating fast. Task coverage across occupations grew from 36% in January to 49% by November, nearly doubling in 10 months.
Reliability depends on task complexity. API tasks hit 50% success rate at around 3.5 hours of work. Claude.ai tasks hit the same threshold at 19 hours. The harder the task, the longer before reliability drops.
Usage patterns reveal economic divides. Higher GDP countries use Claude for work and personal tasks. Lower GDP countries use it primarily for education.

The final bullet is particularly interesting (see chart):

pulled from Anthropic’s Economic Index linked above

“In countries with higher GDP per capita, Claude is used much more frequently for work or for personal use—whereas countries at the other end of the spectrum are more likely to use it for educational coursework.”

At first glance, this suggests AI is widening the production gap between high- and low-GDP countries since high-GDP countries use Claude to get work done more effectively.

After further thought, AI may be providing low-GDP countries with educational resources they wouldn’t otherwise have. This could actually lessen the production gap over time by enabling economic growth via a more educated populace.

Let me know what you think in the comments. I’m especially curious if you disagree. Enjoy the rest of this week’s edition! Later this week, I’ll be updating my ML roadmap with more AI engineering resources, so make sure to check it out!

Subscribe now

My Picks

How to write a good spec for AI agents by

A practical framework for writing specs that actually work with AI coding tools. Plan first in read-only mode, let the agent expand the brief into a structured SPEC.md, then break work into small testable tasks. It covers the six core areas every spec needs (commands, testing, structure, style, git workflow, boundaries) and how to use architect/overview agents to maintain consistency.

Slop is everywhere for those with eyes to see

“The algorithm has flattened curiosity by eliminating the need to hunt for our content.” — Joan Westenberg

The biggest takeaway from this: The shift from curation to algorithmic delivery flattens curiosity and pressures teams to optimize metrics at the cost of quality. As we resort to feeds to give us content, feed providers will resort to AI to make creating content easier or purely to supplement the lack of human creators versus consumers on a platform. This is why “AI Slop” is so prominent online. Feeds have caused us to lose our sense of curiosity and the work we used to put in to grow it.

The AI Manager’s Schedule by

AI coding tools now handle more task types with longer coherence, shifting the question from “can AI do this?” to “should I?” Management now happens in 5-15 minute intervals that require new skills: crisp written architectures, slicing work into AI-sized chunks, and knowing when to override. Also explores the cognitive costs of agent orchestration and the risks of losing low-level understanding.

GPU Performance Engineering Resources

I would guess ~50% of AI-related engineering job listings I read require something to do with compute resource optimization. If you want to work as an engineer in AI, this is a great topic to learn. This resource is a curriculum for learning GPU performance engineering and will be added to the roadmap very soon.

Claude Cowork’s file exfiltration flaw exposes agent security challenges

Security researchers at PromptArmor discovered an unresolved isolation flaw in Claude Cowork that allows indirect prompt injections to exfiltrate files. When a user opens a maliciously crafted document, injected instructions can cause Claude to upload local files to an attacker-controlled Anthropic account using the platform’s allowlisted API with no human approval required. The attack works across multiple Claude models (Haiku, Opus 4.5) and can also trigger DoS vectors through file type mismatches.

This is yet another example of why agent security is so difficult (see our coverage of Antigravity’s vulnerabilities). As an engineer, you have to realize anything within an LLM’s context can be used within any of these tools they’re given access to. I’ve got an article coming out about this soon.

Source: PromptArmor on Claude Cowork exfiltration

LangChain CEO on building agent memory and observability

Harrison Chase (the CEO of LangChain) shared multiple blog posts about AI agents in software engineering, all of which should be paid attention to if you’re planning to build agents yourself.

First, he mentioned traces as documentation for understanding what agents are doing. This was included in last week’s edition, but it’s worth mentioning here, too. Agent logic isn’t stored in code, but in the LLM’s traces. These traces must be used as the equivalent to test cases to ensure agent functionality is correct. Using traces is much more difficult than writing test cases and I suggest reading his entire post to get the full understanding.

Second, he shared how LangChain has set up their Agent Builder’s memory system. Context/memory is another fundamental agent performance task. Understanding how to maintain agent information so it can (and can’t!) do certain things is key to ensuring their proper function. A great example of forgetting is the Ralph Wiggum protocol we discussed last week.

Lastly, Harrison shared an article about the release of LangChain’s Insights Agent. This is an agent that checks traces for you to understand how users use your agents. It uses a clustering algorithm to group similar traces and, therefore, similar actions. I’ve been saying for a while that some sort of anomaly detection system to determine deviant agent behavior would be great for observability, but it’s possible this clustering approach is the real answer we’re looking for.

Source: LangSmith Agent Builder memory system, LangSmith Insights Agent, Harrison Chase on traces as documentation

xAI employee ousted after leaking “human emulator” roadmap

A former xAI employee publicly disclosed an internal roadmap revealing development of a “human emulator” aimed at automating a wide range of human tasks. They revealed this on a podcast (apparently) without company consent and were removed from their position immediately.

Two things to take away from this:

Don’t go on a podcast and share internal secrets. Definitely don’t go on a podcast and reveal internal secrets while saying something along the lines of “I shouldn’t be sharing this”.
Human emulation shouldn’t be a surprise to anyone. All physical intelligence companies are trying to create physical intelligence in a humanoid form factor because humans are the interface for all work we do. If a human can do it, it can be done. If an AI can emulate a human, it can do what the human can do. It’s similar to self-driving cars. There are definitely better automated transportation setups, but cars are now the standard for transportation so their form factor is what’s being automated.

Source: xAI human emulator leak

AI means more software engineers, not fewer

We’ve been trying to replace software engineers for decades. COBOL tried to let business workers write their own code. Visual Basic made Windows apps easier. No-code tools promised the same thing. AI is the latest chapter because it’s exceptionally good at translating plain English into reliable code.

The problem is that software engineering sounds simple when described in plain language but is inherently complex. Effective software requires domain understanding and capable judgment, not just code generation (see our article about software engineering being about problem solving, not writing code).

In fact, the entire history of software engineering has been about creating different levels of abstraction to simplify complex pieces of the job. AI is one of these abstractions (and a very effective one at that!).

Every time we create new abstractions and software becomes easier to build, we end up building exponentially more of it. Addy Osmani calls this the Efficiency Paradox. We don’t run out of ideas or software that needs to be built. Instead, we’re economically enabled to produce greater output.

With regard to AI’s abstraction, Osmani wrote:

“The real question is whether we’re prepared for a world where the bottleneck shifts from “can we build this?” to “should we build this?”“

Not only does AI as a technology mean we can build greater, more capable software, AI as a development tool enables doing so at an unprecedented rate. Once we begin building exponentially more software, we need more software engineers to build and maintain this code.

Source: The recurring dream of replacing developers, The Efficiency Paradox, Grady Booch on abstraction

Product-minded engineering means getting error design right

Gergely Orosz published a deep dive on why good error and warning design is high-leverage work. Diagnostics are often the primary interface users encounter, so errors must be raised at the API/UI boundary, validated upfront, and surfaced early.

Engineers should categorize errors for human vs. programmer consumers, choose clear error classes and metadata, and provide contextual, actionable messages including suggestions. Error messages are often the most-seen part of your product’s interface, yet engineers treat them as an afterthought. The best product-minded engineers recognize that a confusing error is costly (support tickets, user frustration, lost trust, etc.). Investing in clear, actionable error design pays compounding dividends.

We’ve recently discussed the importance of being a product-minded engineer to succeed in the AI era. Error handling is an important way to do that.

As an aside: The Pragmatic Engineer is also hiring a part-time remote Tech Industry Analyst to research engineering trends and produce in-depth subscriber reports. The pay is incredibly high (~$175/hr) so it’s probably worth taking a look at.

Source: The Product-Minded Engineer on errors and warnings, Tech Industry Analyst role

Young adults are trusting AI with financial decisions

Cleo AI surveyed 5,000 UK adults aged 28-40 and found strong interest in AI-driven money management: 64% would trust AI with disposable income decisions, 54% to move money to avoid overdrafts, and 52% to manage bills. This comes alongside weak financial confidence, with 37% reporting poor self-discipline and 80% wanting to improve their financial knowledge.

Last week, we discussed how people are increasingly turning to AI for healthcare advice. Now we’re seeing the same pattern with personal finance. These are high-stakes domains where bad advice can cause real harm, yet users are willing to delegate decisions to AI anyway. The common thread is accessibility: AI is available 24/7, doesn’t judge, and provides immediate answers. Trust remains a gating factor though (as we’ve discussed previously), with 23% saying they want incremental proof before wider use.

Source: Cleo AI survey on financial trust

Quickies

Google.org is providing $2M to Sundance Institute to train 100,000+ artists in AI filmmaking skills with free curricula and scholarships. src
SAP and Fresenius are building a sovereign AI platform for healthcare with a mid three-digit million euro investment using on-premise-ready models that preserve data sovereignty. src
Tesla’s AI5 chip design is nearly finished with AI6 in early stages, targeting a 9-month design cycle for continuous generations of custom AI accelerators. src
PJM projects 4.8% annual electricity demand growth from AI data centers, with consultants forecasting a 25% rise by 2030 and real risk of East Coast rolling blackouts. src
ChatGPT Go launched worldwide at $8/month with 10x more messages than free tier, while OpenAI will test ads in free and Go tiers. src
AstraZeneca acquired Modella AI to embed pathology-focused foundation models directly into oncology R&D for faster biomarker discovery. src
Apple is fighting for TSMC capacity as Nvidia likely overtook Apple as a top customer, forcing Apple to compete for leading-edge wafer slots. src
Veo 3.1 adds native 9:16 vertical output for mobile-first short-form video and state-of-the-art upscaling to 1080p and 4K. src
Kaggle launched Community Benchmarks for reproducible multi-step reasoning, code execution, and tool use evaluations across models. src
OpenAI published a response to Elon Musk’s lawsuit, claiming Musk wanted absolute control and proposed merging OpenAI into Tesla before leaving. src
Palantir’s ELITE tool maps deportation targets for ICE with address confidence scores, ingesting government and commercial data for raid prioritization. src
Coding on paper as a deliberate training method forces engineers to slow down and master fundamentals rather than outsourcing cognition to tools. src

Last week

In case you missed it, here’s last week’s overview:

Thanks for reading!

Always be (machine) learning,

Logan

AI Can Do Your Job - Now What? | AI for Software Engineers 77

Logan Thorneloe — Thu, 15 Jan 2026 15:48:33 GMT

Two releases this week show how far AI coding tools have come. Claude 4.5 Opus is now more accessible with higher rate limits, and Claude Code has improved its planning capabilities, spending more time on design and less on iteration and enabling enough tokens for developers to use it full-time.

The second is Ralph Wiggum, a methodology/Claude Code plug-in for terminal agents that enables them to work autonomously for hours. It breaks tasks into work items with finishing criteria, then loops until all criteria are complete. The output works according to specification.

The key that makes this work so well is periodically resetting context, tracking progress via external files rather than keeping everything in memory. This prevents the drift that happens in long-running sessions and enables brand-new agents to take stabs at a problem until it’s done.

Together, these mean a coding agent can be given a product specification in the evening, work overnight, and have code ready for you in the morning. This code is usually entirely within spec and viable for a minimum viable product or even better.

So now that AI can whip up these prototypes overnight, what does that mean for you? A few things:

Be user- and product-focused. The important parts of software engineering are still important. Understanding products and outlining requirements to fulfill them is still on the engineer (i.e. giving the requirements to Ralph as mentioned above). Studies show that teams that are product-focused are more successful when using AI developer tools than their counterparts. Iterating based on high-quality user feedback is key to maintaining an effective product-focus.
Learn to use AI tools. This should be self-evident, but there are still engineers refusing to learn them. They’re the future of software development and there’s a steep learning curve to use them effectively. If you want to take the next step toward using AI to be more productive, you should both implement and try out new AI coding methodologies and tools, such as the Ralph loop. If you want to get hands-on this week, I suggest implementing this in your work environment and giving it a go.
Get good at reviewing. I know this is the boring part of engineering, but now it’s even more important. Review well enough that you’re confident in what’s going to production and that you understand how it works. Get very good at understanding system design as I find integration with surrounding systems is where these AI coding tools fail and it’s often the most difficult to detect.

Here’s everything else you need to know from this past week.

My Picks

Standalone content worth your time:

Finding and fixing Ghostty’s largest memory leak by Mitchell Hashimoto: A deep dive into debugging Ghostty’s PageList memory leak that grew to 37 GB after 10 days. The fix involved preventing reuse of non-standard pages during scrollback pruning. A great example of methodical debugging with practical techniques like macOS VM tagging.
8 plots that explain the state of open models by Nathan Lambert: China’s open models dominate adoption, led overwhelmingly by Qwen whose top variants have more downloads than many competitors combined. Qwen also leads finetuning activity on HuggingFace, though DeepSeek dominates at very large model scales.
5 GPU performance optimization methods: An easy-to-follow explanation of five GPU optimization methods for LLMs: batching, mixed-precision (FP16), tensor/kernel fusion, memory pooling, and CUDA stream management. Practical impacts include roughly 2x memory savings with FP16.
Demystifying evals for AI agents by Anthropic: A comprehensive guide on why agent evals are harder than model evals. Autonomy, tool use, and long-horizon planning introduce external dependencies and emergent behaviors that traditional testing can’t handle. Covers strategies for realistic environments, mixing automated and human assessments, and measuring both task performance and failure modes.
No, Claude Code doesn’t need a better UI by Logan Thorneloe: I wrote about why Claude Code’s terminal-based approach is actually its strength. The terminal is standardized, scriptable, and predictable, making it ideal for automation compared with brittle GUIs. Claude can control files, apps, and any CLI- or API-driven application via text commands.

Claude Cowork brings terminal agents to everyone

Anthropic released Claude Cowork, an adaptation of Claude Code that runs in the Claude app on Mac and performs general-purpose computer tasks. This is only available to Max subscribers and only on Mac for now.

I just wrote an article about how Claude is a general-purpose computer use agent, not just a coding tool. This means you can get just about anything done you could do via the terminal by prompting Claude. I stand by the fact that the terminal is still an excellent UI that builds intuition about what you can and cannot do with Claude as you watch it work. More info on Claude’s productive capabilities in the sources below.

Source: Simon Willison on Cowork, Cowork announcement on X, Ethan Mollick on Claude Code, My article on Claude Code as a computer use agent

Anthropic restricts third-party API access amid abuse concerns

Anthropic blocked two parties from using their resources this week:

Competitors such as OpenAI and xAI, to give Anthropic a competitive advantage.
Third-party harnesses that took advantage of Claude Max subscriptions, to ensure usage rates on these subscriptions can’t be spoofed.

This caused competitors such as Codex to jump on providing usage to third-party harnesses where users previously would have used Claude models. It makes me wonder about two things: how much goodwill did Anthropic lose to save money on the spoofing and what will be the long-term impact of other tools being more accessible to users?

Apple partners with Google to power next-gen Siri with Gemini

Apple signed a multi-year deal to base its upcoming Foundation Models on Google’s Gemini, enabling a more personalized Siri expected later this year. All inference and customization will run on Apple silicon and Apple’s Private Cloud Compute to preserve user privacy. My understanding is that Apple’s models will be based on the same LLM technology as Google’s.

I’ve seen a lot of takes on this, but the most prominent is that Apple has admitted defeat. Instead, think of this as a business decision. Apple doesn’t have a model ready that they think will guarantee an excellent assistant experience. They use Google’s models for now to ensure they can deliver a quality product to their users and they don’t lose any ground in the smartphone market. In reality, Apple is doing quite well in AI as their silicon and hardware have become a staple for serving large models.

Source: Apple-Google Gemini partnership

AI in healthcare faces mounting scrutiny from regulators and experts

A few things happened in AI-related healthcare news this week:

Google has had to remove several AI-generated health summaries to ensure misinformation isn’t spread.
OpenAI added Health to ChatGPT, enabling a user to discuss their health and health records with ChatGPT directly in the app.
Studies show more people are using AI for self-diagnosis, with one figure showing 59% of Brits are doing so.

OpenAI claims this is to ensure accurate information is given regarding healthcare and to enable users’ health-related queries to have the context of their current health information. Many are skeptical of sharing their personal health data with ChatGPT as most queries given to ChatGPT are used for training. OpenAI has guaranteed this won’t be the case with Health in-app.

Source: Google removes misleading AI health summaries, 59% of Brits use AI for diagnosis, ChatGPT Health critique

Tailwind’s layoffs reveal how AI adoption can destroy business models

Tailwind cut 75% of its staff after AI coding agents drove the CSS framework to 75 million downloads per month while simultaneously killing 40% of site traffic. Site traffic generated conversions to paid services, and this change in revenue contributed to an 80% revenue drop. Shortly after, Google AI Studio announced it would sponsor the Tailwind project.

Tailwind is one of the most popular frontend component libraries, but AI is fundamentally changing how information is consumed and transferred, meaning business models will need to adapt as well.

Source: Tailwind layoffs, Google AI Studio sponsorship

Building reliable AI agents requires rethinking evaluation

The difficult part of agent observability is logic being shifted from code to models. This means traditional test cases fail because model output can’t be tested deterministically. This is what makes AI observability such a difficult issue.

Anthropic recently released a blog post detailing evals and what makes them so tough, including the gold standard method of testing coding, computer use, and conversational agents. One big takeaway is that evals aren’t 100% foolproof and need to be accompanied by production monitoring, A/B testing, and user feedback. I highly recommend reading Anthropic’s report linked below.

Source: Harrison Chase on traces as documentation, Anthropic on agent evals

Quickies

Malaysia and Indonesia blocked Grok after regulators found it was generating sexually explicit images, including depictions of minors. src
US job openings dropped to 7.15 million in November, the lowest in over a year, with vacancies per unemployed worker falling to 0.9. src
NVIDIA and Eli Lilly will invest up to $1 billion over five years on an AI co-innovation lab for drug discovery. src
Bose is open-sourcing SoundTouch’s API instead of bricking the speakers when cloud support ends. src
Meta’s $2 billion acquisition of Manus triggered a Chinese Ministry of Commerce review for potential export control violations. src
Gemini CLI now offers “Agent Skills” that can be installed via npm. src
Self-hosting has become practical with cheap mini PCs, Tailscale, and CLI agents like Claude Code handling setup. src

Last week

In case you missed it, here’s last week’s overview:

Thanks for reading!

Always be (machine) learning,

Logan

No, Claude Code doesn’t need a better UI

Logan Thorneloe — Sat, 10 Jan 2026 13:46:30 GMT

I’ve read a lot of articles this past week about Claude Code (as I’m sure you have too) and there’s consistently one thing mentioned that bothers me. These articles state that Claude Code is excellent despite its terrible UI, when really its UI is what makes it so great and the closest thing we have to AGI.

This starts with a brief history of computers and computation. Humanity created computers to crunch numbers much faster than we’re manually capable of. Since most work is rooted in information transfer, we’ve since offloaded most work to the digital world because computers are capable of storing, retrieving and manipulating information much faster than we are.

To more easily tell computers the work they should be doing, we’ve developed GUIs (graphical user interfaces). These GUIs sit on top of the code, ones and zeros, and actual computation the computer is doing to create a much more accessible interaction plane to a human user.

Recently, there’s been a lot of research done to create computer-use agents. These agents learn how to use a mouse to interact with a computer’s GUI. Thus, these agents are capable of doing the work a human otherwise would have accomplished with that computer.

However, if we go back to before GUIs, we primarily interacted with computers via the terminal. The terminal is a simple text interface to give the computer a command for the work it needs to do and get information back from the computer.

The terminal is a text interface that controls the work a computer does. Our current frontier AI models are text based and perfectly suited for this environment. This is what makes Claude Code so effective. It lives in the terminal and interacts with it via text commands.

Thus, rather than thinking of Claude Code as a coding agent, it’s much better to realize its full potential by thinking of it as a computer use agent.

Digital versus manual work always makes me think of this scene from Space Force.

It’s had such an explosive impact because its ability to control a computer via the terminal lets it accomplish meaningful work. Anything you can do in the terminal, Claude can too.

I’d even argue that it’s the first step of artificial general intelligence (AGI). Most definitions of AGI describe an AI’s ability to do general, meaningful work. With our current models, an AI assistant in the terminal accomplishes this. The only thing keeping it from making more of an impact is integration with more systems it can work on.

Luckily, the terminal helps with this too. The terminal lets you:

Interact with a computer’s filesystem and applications.
Interact with the internet.
Run commands for any CLI tool. Any application with terminal commands can be controlled by Claude.
Code. Anything Claude can’t do natively via the terminal, it can write code to accomplish and run that code itself. This means Claude can interact with anything that has an API if given proper authentication.

And this doesn’t even account for model context protocol (MCP) which is the agent-native way of declaring its interactions with endpoints.

You might argue that a true computer agent needs the ability to interact with a computer with more complexity. I’d argue that the simplistic and standardized nature of the terminal is what has made the terminal-based computer use agent so successful.

Terminal commands are standardized. GUIs change their layouts, button positions, and flows with every update. Terminals are a stable, reliable interface.
The terminal is inherently programmatic. It was designed for automation and scripting, which is exactly what an AI agent needs to do. Terminal commands can also be run together, enabling the agent to build complex workflows from simple operations. GUIs were designed for humans to point and click, not for programs to control.
Terminal outputs are predictable. GUI interactions depend on context, view settings, window state, and animations that make it difficult to know what to do next.
Terminal errors are parseable text that an agent can read and act on. GUI errors are modal dialogs or toast notifications that require visual interpretation.

I recommend even non-technical individuals learn how to use Claude Code in the terminal. There’s a certain level of intuition that you build as you watch the AI work directly in the terminal and as you learn to work in the terminal yourself.

Some examples worth checking out to get you started:

If you write or script as a content creator, write in markdown format in a GitHub repo. Use the terminal to access that folder on your computer and spin up Claude Code. It can now help you write, critique your work, brainstorm ideas, and more. This article was edited by Claude Code, for example.
If you store any information via API, tell Claude Code about it and it can write a script to access that information and add it to its context. For example, I read and store notes in Readwise Reader. It has an API that Claude Code can easily access via a simple Python script. I can then chat with my notes.

Claude Code has made such an incredible impact because it’s not only good at coding but it’s an entire terminal agent. If you think about Claude Code this way, it can accomplish much more meaningful work for you.

Thanks for reading!

Always be (machine) learning,

Logan

AI's Role in Maduro's Capture | AI for Software Engineers 76

Logan Thorneloe — Wed, 07 Jan 2026 16:01:41 GMT

Here are my picks for content you don’t want miss and everything you should know about AI for January 7, 2026. Enjoy!

My Picks

21 lessons from 21 years at Google by : Lessons learned from working at Google for 21 years. Two notable lessons: most slow teams are actually misaligned, and the best engineers are obsessed with solving user problems. All are worth reading.
Reasoning models are a dead end by : A valuable take on reasoning models and their lack of interpretability. Reasoning encoded into model weights loses 95% of intermediate branching and produces brittle behavior compared to externalized reasoning infrastructure. A great example of why engineering is so important in AI.
The suck is why we’re here: Some great perspective on writing with AI. The author argues that AI shortcuts the crucial, difficult parts of writing (research, stuck thinking), and that avoiding these “sucky” parts sacrifices depth and lasting reward. AI will increase quantity but lower average quality, making genuine effort stand out.
Advent of Code 2025 with Compute Shaders by : An excellent exploration of implementing Advent of Code solutions using GPU compute shaders on Metal. The GPU kept consistent times (~5ms) as problem size grew while CPUs slowed dramatically, demonstrating practical applications for massively parallel problem solving.
Building AI Agents, Open Code And Open Source by : I thoroughly enjoyed reading this interview, especially the parts about open versus closed source tools and the motivation behind them. Terminal agents are only going to be more important this year and this does a great job of helping readers understand them.

Things you should know

AI was used to push narratives in Nicolás Maduro’s capture

AI-generated media circulated the internet following the US capture of Venezuela’s president Nicolás Maduro. Fake images depicted the capture itself, while a deepfake video showed Venezuelans crying tears of joy. Both were used to push specific narratives about the operation.

Any company serious about AI needs to help viewers discern between AI and non-AI media. The images above were caught by Google’s SynthID watermark which Google attaches to all AI-generated images using Gemini. Sure, anyone can switch to a non-watermarking tool, but even putting up a small obstacle to generating a fake narrative is a big win.

Source: EBU Spotlight on Maduro fake images, Yahoo News on fake celebration video, Google SynthID

See how SynthID works below:

AI safety concerns mount as AI chatbots face serious scrutiny

xAI was fined 120 million euros under the Digital Services Act due to Grok generating sexually explicit images of women and children. Separately, a lawsuit alleges OpenAI is withholding ChatGPT logs after a murder-suicide where transcripts show the chatbot validated a user’s paranoid delusions.

AI safety is foundational to ensuring we can apply AI to the applications where it’s needed. It’s crazy to me that AI safety teams were previously understaffed or dismissed. Both of the examples above show why AI safety is important and also some of the difficulties that come with ensuring safety.

Source: TVP World on Grok, Ars Technica on ChatGPT logs

Half of AI-generated code has security flaws

Over 30% of senior developers now ship mostly AI-generated code, and the trade-offs are becoming clear. AI code shows logic errors at 1.75x the human rate, XSS vulnerabilities at 2.74x, and roughly 45% of it has security flaws. PR sizes are up 18%, incidents per PR are up 24%, and change-failure rates have risen 30%. Properly configured AI review tooling catches 70-95% of low-hanging bugs.

These statistics echo my recent article detailing how AI impacts an organization’s engineering culture. AI is an amplifier, and if your processes aren’t solid, AI will make them worse.

Source: Addy Osmani, AI for Software Engineers

The best way to fight AI cheating in education is with AI

An NYU professor is using AI to conduct oral exams with students at just 42 cents per student. The AI asks follow-up questions and probes understanding in real-time, forcing students to verbally explain concepts rather than paste in AI-generated answers. This follows a trend where some schools have removed online math courses entirely or now require in-person testing as instructors note declining problem-solving skills and increased reliance on copying AI outputs.

One of my biggest concerns with AI is education. It has potential to be the greatest multiplier but also the worst detriment in this space. As with many other applications, we’re seeing AI-related problems being combatted with AI-related solutions.

Source: Reddit discussion on AI oral exams

Claude Code creator shares his setup for using Claude Code

Boris Cherny, who created Claude Code, runs multiple instances at a time with a focus on Opus 4.5 with “thinking.” It needs less steering despite being slower per token, which increases velocity in the long run. He also claimed that Claude Code’s updates are all written entirely by Claude Code itself.

Separately, a principal engineer at Google mentioned just how far Claude Code has come by saying it can now design specs that took multiple engineers a few months ago. An ex-Google PM commented on this explaining how important it is for engineering teams to be using competitors’ products to improve their own.

My only addition: stop thinking of Claude Code, Gemini CLI, and Codex as coding agents. Instead, think of them as terminal agents. Anything you can do from the terminal, it’s possible to get AI to do for you.

Source: Boris Cherny on X, Jaana Dogan on X, Raiza Martin on X

Research to watch in 2026: Recursive Language Models and Manifold-Constrained Hyper-Connections

Recursive Language Models (RLMs) let models handle context windows up to 100x longer than their native limits by breaking inputs into chunks and processing them programmatically. In tests scaling from 8K to 1M tokens, base models degraded sharply while RLMs maintained performance at comparable cost.

Separately, a technique called Manifold-Constrained Hyper-Connections (mHC) stabilizes model training with only 6.7% overhead, eliminating common instability issues that plague large model runs.

Both papers tackle fundamental scaling bottlenecks: RLMs at inference time and mHC at training time. If these techniques hold up, they could meaningfully change how we build and deploy large models.

Source: Alex Zhang on RLMs, mHC paper on arXiv

NVIDIA acquihires Groq through licensing deal

Groq signed a licensing deal with NVIDIA that will see about 90% of Groq’s 400+ employees move to NVIDIA at a $20B valuation. Groq will remain independent and GroqCloud will continue operating. Groq’s specialty is developing compute with incredibly low-latency inference, something Nvidia can benefit from as it continues to ramp up its research and development of AI compute.

This is another acquihire within the AI industry. The most recent I can think of was Google acquiring talent from Windsurf which led to Google’s Antigravity IDE. I see something similar happening at Nvidia where they’ll come out with even lower latency compute offerings for customers.

Source: The Chip Letter by

More...

A shape-shifting molecule discovery could change the future of AI hardware. (Science Daily on shape-shifting molecules)
Micron shares surged over 10% on AI optimism and increased demand for high-performance memory. (Micron stock coverage)
California State Senator introduced a four-year moratorium to ban AI chatbot-equipped toys for minors. (Coverage of AI toy moratorium)
Claude Code can run on-the-go using an iPhone via Termius and mosh to a VM costing about $7/day. (Granda.org)
Advanced AI could collapse labor’s share of GDP toward zero, concentrating wealth among capital holders. (Dwarkesh Patel on X)
An excellent overview on the past 10 years of AI. (Weighty Thoughts by )
An interesting read from an author who canceled their technical book publishing deal for various reasons. (Austin Henley)
PostgreSQL dominated 2025, driving major acquisitions and new DBaaS launches across all major cloud vendors. (Databases in 2025: A Year in Review)
Two excellent 2025 retrospectives worth reading. (Ignorance.ai on 10 AI stories by , Simon Willison on the year in LLMs)

Last week

In case you missed it, here’s last week’s overview:

I’ve removed the jobs and industry updates from these weekly roundups. I haven’t been able to fit them properly at this cadence and will be moving them to their own, less frequent articles. Stay tuned!

Thanks for reading!

Always be (machine) learning,

Logan

AI for Software Engineers: Looking Forward to 2026

Logan Thorneloe — Thu, 01 Jan 2026 15:01:06 GMT

Happy New Year! Thank you all for your support in 2025! 2026 will be an even better year for AI for Software Engineers! Here’s a recap of the year, what to look forward to in 2026, and a few questions to help me improve the newsletter. 😊

Looking back

In 2025, we:

Reached 100 paid subscribers to become a Substack Bestseller.
Reached 11,000+ free subscribers.
Hit #1 on Hacker News.
Underwent two name changes (Society’s Backend —> ML for SWEs —> AI for SWEs).
Got a new logo that I think actually works (see image above).
Released 38 weekly reports and many other technical articles.
Created a repo to learn by building (more on this below).

Our top 5 articles of this year were:

Going forward

I plan to:

Add to the AI for SWEs repo and let all of you contribute too. I want to create more hands-on resources, but I want this repo to be an opportunity for you to create those resources as well.
Simplify my approach to writing. I want think less about what I think will do well and focus more on sharing what I think is most important for all of us to know. I also found myself getting caught up in the process I use for writing, instead of getting caught up in the topic I’m writing about (which is a great thing!).
Add more paid benefits with a focus on discounted learning and building resources. Thanks to all who’ve supported me by becoming a paid subscriber. It lets me devote more time to my writing. My plan for 2026 is simple: Make the paid tier to much value it’s a no brainer and ensure it providers everything you need to make it in AI.
Take better care of my own health so I can be more consistent. There were a few weeks this year where I was unable to write due to my health and I missed writing during those week. Next year, I’m prioritizing my health.

Now, help me improve AI for Software Engineers! Answer two questions for me.

Question 1:

Question 2:

Question 3:

As always, thank you for reading!

Always be (machine) learning,

Logan

AI Can’t Fix a Broken Engineering Culture—It Can Only Make it Worse

Logan Thorneloe — Tue, 30 Dec 2025 20:10:23 GMT

I’ve seen an interesting new fad on social media recently that I like to call “vibe releasing”. This is the same as “vibe coding” but it takes it one step further and releases the code to production without properly reviewing it first.

I can’t overstate how terrible of an idea this is.

In fact, this year’s “State of AI-assisted Development” report released by Google centered around one idea: AI is an amplifier. It analyzes AI coding metrics from this past and proves that coding with AI makes proper engineering practices more, not less, important.

It shows that companies with good engineering culture and practices will see AI positively impact their development velocity and companies with bad engineering culture and practices will see the opposite. “Vibe releasing” is the definition of a bad engineering practice.

This article includes everything you should take away from Google’s report and how it applies to you.

Takeaways

If you’re just here for the takeaways, here they are:

2025 was the first year AI had a quantifiable positive impact on software development.
Trust is a huge factor in AI coding tool effectiveness.
Companies with bad engineering cultures and practices will see their development velocity slow with AI. Conversely, companies with good engineering cultures and practices will see their development velocity quicken with AI.

If you want to know the specifics and what your organization should do to ensure AI works for you instead of against you, read on.

Report methodology

First, let’s understand how the report was created and how research was conducted. When evaluating metrics, this is always the first step.

What are World Models? | AI for Software Engineers 75

Logan Thorneloe — Tue, 23 Dec 2025 14:03:02 GMT

Yann LeCun has confirmed his startup, Advanced Machine Intelligence (AMI), will develop world models and is currently seeking fundraising at a $5B valuation. While headlines focus on the $5B valuation, I care much more about the work.

Crazy valuations aren’t uncommon in the world of AI. The potential for this technology is astronomical even if the roadmap to get there is still being discovered. I view AI’s potential as shifting humanity’s problem-solving from O(n) (or greater!) complexity potentially to O(1). Once we can solve problems quickly, our rate of advancement will compound and discovery will take off. If this happens, AI will be worth far more than even crazy valuations place it at.

LeCun is now directly pursuing the same area of research and industry as Fei-Fei Li’s World Labs. He also joins other great minds in AI who have said AI needs a breakthrough beyond scaling the research we currently have. Many are placing a bet on this being world models.

Put succinctly, world models aim to understand the 3D world and learn more like a human child instead of like a machine. Instead of understanding a statistical correlation between inputs to generate a representative output, world models seek to understand causal physics and spatial reasoning. World Labs puts it well on their website:

“We build foundational world models that can perceive, generate, reason, and interact with the 3D world — unlocking AI’s full potential through spatial intelligence by transforming seeing into doing, perceiving into reasoning, and imagining into creating.”

This means world models aren’t generative but instead make predictions based on abstract representations of the concepts they’ve learned. Instead of guessing at pixels, they focus on higher-level concepts.

Here’s an example to illustrate what this means: If we’re considering a car moving down a street, a generative model wastes compute estimating pixel movement for every leaf on the road. A world model would instead ignore unimportant details and focus on the latent variables that impact its understanding such as the car’s velocity or the friction between the car and the road. Those details would be used to predict the world’s next state.

Practically, this means two things:

World models don’t waste resources on things that are unimportant for a task.
We can use spatial intelligence to train other real-world AI applications without needing to collect new data. Instead, world models can generate (or “dream”) their own. This is both very efficient and safer than the alternative (think about an application like self-driving where there is always inherent risk with data collection).

I predict we’ll be seeing a lot more about world models in 2026 and I’m curious to see how far we’ll get. Enjoy this week’s resources!

Subscribe now

This week’s curated highlights

My LLM coding workflow going into 2026: This is an excellent comprehensive overview of Addy’s AI coding workflow. The best way to optimize any AI-related workflow is to discuss and share what works with others. I highly recommend checking this out and sharing your workflows with others!
Gemini Plays Pokemon: This report compares the performance of Gemini 2.5 Pro to Gemini 3 Pro playing Pokemon. It’s an interesting read. Gemini 3 Pro didn’t just play better; it exhibited creative problem-solving by finding a loophole to multitask.
Andrej Karpathy revealed his 2025 LLM Year in Review: I recommend just reading this, but most interesting is a shift from models imitating humans to reasoning through rewards. Other interesting notes are image model advancements, terminal-based AI, “jagged intelligence”, new layers of LLM apps, and the introduction of vibe coding.
Jeff Dean’s Performance Hints: Jeff Dean updated his guide on engineering principles for performance at scale. The writeup provides a guide to optimizing software performance at the level of a single binary. Performance is a crucial topic for any engineer to understand and is only getting more important in the age of AI.
Sam Rose’s overview of LLMs/prompt caching: This article gives a great overview of how prompt caching reduces token costs in LLM and it also gives a great high-level overview of LLMs as well using excellent visual elements. I love articles with great visuals and Sam constantly delivers.
Your guide to local coding models: This is my article from this past week that many of you have likely already read. I’m including it here because I made some mistakes in the initial release that I edited to clarify and it incited some interesting conversation across X, LinkedIn, Substack, and even Hacker News where it reached number 1.

Things you should know

New & Trends

Google released Gemini 3 Flash, outperforming its previous Pro models

Google released Gemini 3 Flash, the new generation of its smaller model for faster, cheaper applications. It is multimodal, uses 30% fewer tokens than previous models, and is significantly cheaper ($0.50/1M input tokens). The advancements of smaller, cheaper models are much more important than the advancements of large models when it comes to applications and are key to democratizing the technology.

AI coding assistants ship more, but also more bugs

CodeRabbit’s recent report shows that AI introduces 1.7x more issues than human-written code. Specifically, AI introduces 1.7x more issues, including 75% more logic errors and 3x the readability issues. AI can produce more code faster, but the code tends to be significantly worse. This volume of pull requests also causes reviewer fatigue making it more likely for errors to reach production.

New York governor Kathy Hochul signed RAISE act

Despite an executive order from President Trump to consolidate AI regulation to the federal level, New York governor Kathy Hochul signed the RAISE Act to establish safety standards for large AI companies in New York. This act requires companies with over $500 million in revenue to comply with specific safety standards.

AI data centers have a carbon footprint that matches a small European country

A new study shows that AI systems could produce roughly the same amount of carbon dioxide as New York City or Norway (about 80 million tons). While this is an estimate (as exact numbers are hard to come by), it shows the environmental impact AI could have and emphasizes that the real long-term problem AI needs to solve is power and energy.

AI has entered the game industry

Roblox Studio is integrating AI into their platform. This enables users to generate assets with a prompt, use AI for real-time voice translation, and orchestrate work across other creative tools via MCP. The goal of this shift is to enable users to get to market faster and be more profitable.
Contrarily, Larian Studios (the creators of Baldur’s Gate 3) CEO Swen Vincke explained Larian only uses of AI in game development for content exploration similar to how they use Google images and art books. This was after they were accused of trying to use AI to replace artists which Vincke explained was false and they’re actually looking to hire more artists, not replace them.
The video game industry has many potential applications for AI, but gamers aren’t excited about how they’ll impact the actual games produced. In 2026, I’m certain we will see it used more as it becomes more cost effective and I’m curious to see what the usage will be.

Research

Duke scientists created an AI that simplifies complex data

Researchers have created a physics-bound deep learning model that takes in complex data and outputs a much simpler, mathematical representation of that data. This is particularly useful in domains with a ton of data but without equations to explain relationships within that data. This AI creates a starting point for a formulaic representation of that data.

OpenAI research on agents and their capabilities

Just a week after discussing OpenAI’s Code Red, they’re pumping out research at an astonishing rate. They’ve released research on chain-of-thought monitorability, AI’s capability to accelerate biological research in the wet lab, and AI’s ability to perform scientific research tasks. I’m particularly interested in the first (and you should be too if you’re building any sort of AI-powered application) and I’m loving the trend of applying AI to accelerate research.

Product & Releases

Google Labs released CC, an AI agent connected to Gmail, Drive, and Google Calendar to deliver a personalized briefing of your day.
Anthropic released Agent Skills as an open standard so other companies can get on board with integrating them within their products. Anthropic’s push to release AI standards has gone a long way to solidify their standing in enterprise AI.
OpenAI released GPT-5.2-Codex, a version of GPT-5.2 for coding that achieves state-of-the-art on SWE-Bench Pro and Terminal-Bench 2.
ChatGPT and Codex CLI are also adopting skills similar to Claude.
Google introduced FunctionGemma, a small model designed for function calling on edge devices.

Safety

OpenAI is upping their safety game by adding under-18 principles to their model spec for users aged 13 to 18. OpenAI has also created a guide for families and parents for responsible AI use.

Career resources

State of the market

AWS CEO says replacing junior engineers with AI is foolish

Amazon has a ruthless history of optimizing for cost, even at the detriment of its employees. When the CEO of AWS states replacing junior engineers is a bad idea, you know it’s true. The software engineering industry will be in a bad place when we need more senior engineers but we’ve abandoned the junior engineers that should become them.

Not everyone is convinced by AI coding

This is an interesting article I actually helped contribute to that highlights the gap in expectation versus reality for AI in software engineering. The expectation is set very high that engineers should be able to greatly increase their productivity by using AI coding tools. In reality, it takes time for developers to figure out these tools and how to use them productively. In fact, the initial onboarding can even cause a development velocity hit that isn’t reflected in performance expectations.

Interesting opportunities

Google Ads is still hiring! There’s an aggressive push to hire top talent in Google Ads. We’re specifically looking for mid- to senior-level developers with experience in Python, C++, Go, or Java that have worked on large-scale distributed systems. GenAI, ML, and ads experience is a plus. If you’re interested, check out the open roles here and/or DM me with any questions.

Learning Resources

Packt has a deal for $10 for any technical ebook. This goes down to as low as $5.99 if you buy 10 or more. This is the best deal I’ve seen yet to build a technical book library.
This comprehensive open-source roadmap will walk you from LLM fundamentals to deploying advanced LLMs. It structures the learning path into 3 distinct tracks: LLM Fundamentals, the LLM scientist, and the LLM engineer.
Check out an Agentic AI problem set developed by @Prof. Tom Yeh. Professor Yeh writes AI by Hand, an excellent resource for understanding the inner workings of machine learning models. This resource is great to test your knowledge of AI agents and upskill while you’re at it.

If you missed last week’s article, you can check it out here:

Thanks for reading!

Always be (machine) learning,

Logan

[Revised] You Don’t Need to Spend $100/mo on Claude Code: Your Guide to Local Coding Models

Logan Thorneloe — Sat, 20 Dec 2025 14:55:12 GMT

[Edit 1] This article has been edited after initial release for clarity. Both the tl;dr and the end section have added information.

[Edit 2] This hypothesis was actually wrong and thank you to everyone who commented!

Here’s a full explanation of where I went wrong. I want to address this mistake as I realize it might have a meaningful impact on someone's financial position.

I’m not editing the actual article except where absolutely necessary so it doesn’t look like I’m covering up the mistake—I want to address it. Instead, I’ve included the important information below.

There is one takeaway this article provides that definitely holds true:

Local models are far more capable than they’re given credit for, even for coding.

It also explains the process of setting up a local coding model and technical information about doing so which is helpful for anyone wanting to set up a local coding model. I would still recommend doing so.

But do I want someone reading this to immediately drop their coding subscription and buy a maxed out MacBook Pro? No, and for that reason I need to correct my hypothesis from ‘Yes, with caveats’ to ‘No’.

This article was not an empirical assessment, but should have been to make these claims. Here’s where I went wrong:

While local models can likely complete ~90% of the software development tasks that something like Claude Code can, the last 10% is the most important. When it comes to your job, that last 10% is worth paying more for to get that last bit of performance.
I realized I looked at this more from the angle of a hobbiest paying for these coding tools. Someone doing little side projects—not someone in a production setting. I did this because I see a lot of people signing up for $100/mo or $200/mo coding subscriptions for personal projects when they likely don’t need to. I would not recommend running local models as a company instead of giving employees access to a tool like Claude Code.
While larger local models are very capable, as soon as you run other development tools (Docker, etc.) that also eat into your RAM, your model needs to be much smaller and becomes a lot less capable. I didn’t factor this in in my experiment.

So, really, the takeaway should be that these are incredible supplemental models to frontier models when coding and could potentially save you on your subscription by dropping it down a tier, but practically they’re not worth the effort in situations that might affect your livelihood.

Exactly a month ago, I made a hypothesis: Instead of paying $100/mo+ for an AI coding subscription, my money would be better spent upgrading my hardware so I can run local coding models at a fraction of the price (and have better hardware too!).

So, to create by far the most expensive article I’ve ever written, I put my money where my mouth is and bought a MacBook Pro with 128 GB of RAM to get to work. My idea was simple: Over the life of the MacBook I’d recoup the costs of it by not paying for an AI coding subscription.

After weeks of experimenting and setting up local AI models and coding tools, I’ve come to the conclusion that my hypothesis was ~~correct, with nuance~~, not correct [see edit 2 above] which I’ll get into later in this article.

In this article, we cover:

Why local models matter and the benefits they provide.
How to view memory usage and make estimates for which models can run on your machine and the RAM demands for coding applications.
Walk through setting up your own local coding model and tool step-by-step.

Don’t worry if you don’t have a high-RAM machine! You can still follow this guide. I’ve included some models to try out with a lower memory allotment. I think you’ll be surprised at how performant even the smallest of models is. In fact, there hasn’t really been a time during this experiment that I’ve been disappointed with model performance.

If you’re only here for the local coding tool setup, skip to the section at the bottom. I’ve even included a link to my modelfiles in that section to make setup even easier for you. Otherwise, let’s get into what you need to know.

tl;dr:

Local coding models are very capable. Using the right model and the right tooling feels only half a generation behind the frontier cloud tools. I would say that for about 90% of developer work local models are more than sufficient. Even small 7B parameter models can be very capable. [Edited to add in this next part] Local models won’t compete with frontier models at the peak of performance, but can complete many coding tasks just as well for a fraction of the cost. They’re worth running to bring costs down on plenty of tasks but potentially not worth using if there’s a free tier available that performs better.
Tools matter a lot. This is where I experienced the most disappointment. I tried many different tools with many different models and spent a lot of time tinkering. I ran into situations where the models wouldn’t call tools properly or their thinking traces wouldn’t close. Both of these rendered the tool essentially useless. Currently, tooling seems very finicky and if there’s anything developers need to be successful, it’s good tools.
There’s a lot to consider when you’re actually working within hardware constraints. We take the tooling set up for us in the cloud for granted. When setting up local models, I had to think a lot about trade-offs in performance versus memory usage, how different tools compared and affected performance, nuances in types of models, how to quantize, and other user-facing factors such as time-to-first-token and tokens per second.
Google threw a wrench into my hypothesis. The local setup is almost a no-brainer when compared to a $100/mo+ subscription. Compared to free or nearly-free tooling (such as Gemini CLI, Jules, or Antigravity) there isn’t quite as strong of a monetary justification to spend more on hardware. There are benefits to local models outside of code, though, and I discuss those below.

If the tl;dr was helpful, don’t forget to subscribe to get more in your inbox.

Subscribe now

Why local models?

You might wonder why local models are worth investing in at all. The obvious answer is cost. By using your own hardware, you don’t need to pay a subscription fee to a cloud provider for your tool. There are also a few less obvious and underrated reasons that make local models useful.

First: Reliability. Each week there seems to be complaints about performance regression within AI coding tools. Many speculate companies are pulling tricks to save resources that hurt model performance. With cloud providers, you’re at the mercy of the provider for when this happens. With local models, this only happens when you cause it to.

Second: Local models can apply to far more applications. Just the other day I was having a discussion with my dad about AI tooling he could use to streamline his work. His job requires studying a lot of data—a perfect application for an LLM-based tool—but his company blocks tools like Gemini and ChatGPT because a lot of this analysis is done on intellectual property. Unfortunately, he isn’t provided a suitable alternative to use.

With a local model, he wouldn’t have to worry about these IP issues. He could run his analyses without data ever leaving his machine. Of course, any tool calling would also need to ensure data never leaves the machine, but local models get around one of the largest hurdles for useful enterprise AI adoption. Running models on a local machine opens up an entire world of privacy- and security-centric AI applications that are expensive for cloud providers to provide.

Finally: Availability. Local models are available to you as long as your machine is. This means no worrying about your provider being down or rate limiting you due to high traffic. It also means using AI coding tools on planes or in other situations where internet access is locked down (think highly secure networks).

While local models do provide significant cost savings, the flexibility and reliability they provide can be even more valuable.

Understanding memory

To get going with local models you must understand the memory needed to run them on your machine. Obviously, if you have more memory you’ll be able to run better models, but understanding the nuances of that memory management will help you pick out the right model for your use case.

Local AI has two parts that eat up your memory: The model itself and the model’s context window.

The actual model has billions of parameters and all those parameters need to fit into your memory at once. Excellent local coding models start at around 30 billion (30B, for short) parameters in size. By default, these models use 16 bits to represent parameters. At 16 bits with 30B parameters, a model will take 60 GB of space in RAM (16 bits = 2 bytes per parameter, 30 billion parameters = 60 billion bytes which equals about 60 GB).

The second (and potentially larger) memory consuming part of local AI is the model’s context window. This is the model inputs and outputs that are stored so the model can reference them in future requests. This gives the model memory.

When coding with AI, we prefer this window to be as large as it can because we need to fit our codebase (or pieces of it) within our context window. This means we target a context window of 64,000 tokens or larger. All of these tokens will also be stored in RAM.

The important thing to understand about context windows is that the memory requirement per-token for a model depends on the size of that model. Models with more parameters tend to have large architectures (more hidden layers and larger dimensions to those layers). Larger architectures mean the model must store more information for each token within its key-value cache (context window) because it stores information for each token for each layer.

This means choosing an 80B parameter model over a 30B parameter model requires more memory for the model itself and also more memory for the same size context window. For example, a 30B parameter model might have a hidden dimension of 5120 with 64 layers while an 80B model has a hidden dimension of 8192 with 80 layers. Doing some back-of-the-napkin math shows us that the larger model requires approximately 2x more RAM to maintain the same context window as the 30B parameter model (see formula below).

Luckily, there are tricks to better manage memory. First, there are architectural changes that can be made to make model inference more efficient so it requires less memory. The model we set up at the end of this article uses Hybrid Attention which enables a much smaller KV cache enabling us to fit our model and context window in less memory. I won’t get into more detail in this article, but you can read more about that model and how it works here.

The second trick is quantizing the values you’re working with. Quantization means converting a continuous set of values into a smaller amount of distinct values. In our case, that means taking a set of numbers represented by a certain number of bits (16, for example) and reducing it to a set of numbers represented by fewer bits (8, for example). To put it simply, in our case we’re converting the numbers representing our model to a smaller bit representation to save memory while keeping the value representations within the model relatively equal.

You can quantize both your model weights and the values stored in your context window. When you quantize your model weights, you “remove intelligence” from the model because it’s less precise in its representation of innate information. I’ve also found the performance hit when going from 16 to 8 bits within the model to be much less than 8 to 4.

We can also quantize the values in our context window to reduce its memory requirement. This means we’re less precisely representing the model’s memory. Generally speaking, KV cache (context window) quantization is considered more destructive to model performance than weight quantization because it causes the model to forget details in long reasoning traces. Thus, you should test quantizing the KV cache to ensure it doesn’t degrade model performance for your specific task.

In reality, like the rest of machine learning, optimizing local model performance is an experimentation process and real-world machine learning requires understanding the practical limitations and capabilities of models when applied to specific applications.

Here are a few more factors to understand when setting up a local coding model on your hardware:

Instruct versus non-instruct

Instruct models are post-trained to be well-suited for chat-based interactions. They’re given chat pairings in their training to be optimized for excellent back-and-forth chat output. Non-instruct models are still trained LLMs, but focus on next-token prediction instead of chatting with a user. For our case, when using a chat-based coding tool (CLI or chat agent in your IDE) we need to use an instruct model. If you’re setting up an autocomplete model, you’ll want to find a model specifically post-trained for it (such as Qwen2.5-Coder-Base or DeepSeek-Coder-V2).

Serving tools

You need a tool to serve your local LLM for your coding tool to send it requests. On a MacBook, there are two primary options: MLX and Ollama.

Ollama is the industry standard and works on non-Mac hardware. It’s a great serving setup on top of llama.cpp that makes model serving almost plug-and-play. Users can download model weights from Ollama easily and can configure modelfiles with custom parameters for serving. Ollama can also serve a model once and make it available to multiple tools.

MLX is a Mac-specific framework for machine learning that is optimized specifically for Mac hardware. It also retrieves models for the user from a community collection. I’ve found Ollama to be very reliable in its model catalog, while MLX’s catalog is community sourced and can sometimes be missing specific models. Models are sourced from the community so a user can convert a model to MLX format themselves. MLX requires a bit more setup on the user’s end, but serves models faster because it doesn’t have a layer providing the niceties of Ollama on top of it.

Either of these is great, but I chose MLX to maximize what I can get with my RAM, but Ollama is probably the more beginner-friendly tool here.

Time-to-first-token and tokens per second

In real-world LLM applications it’s important that the model is able to serve its first token for a request in a reasonable amount of time and continue serving tokens at a speed that enables the user to use the model for its given purpose. If we have a high-performance model running locally, but it only serves a few tokens per second, it wouldn’t be useful for coding.

This is something taken for granted with cloud-hosted models that is a real consideration when working locally on constrained hardware. Another reason I chose MLX as my serving platform is because it served tokens up to 20% faster than Ollama. In reality, Ollama served tokens fast enough so I don’t think using MLX is necessary specifically for this reason for the models I tried.

Performance trade-offs

There are many ways to optimize local models and save RAM. It’s difficult to know which optimization method works best and the impact each has on a model especially when using them in tandem with other methods.

The right optimization method also depends on the application. In my experience, I find it best to prioritize larger models with more aggressive model quantization over smaller models with more precise model weights. Since our application is coding, I would also prioritize a less-quantized KV cache and using a smaller model to ensure reasoning works properly while not sacrificing the size of our context window.

Coding tools

There are many tools to code with local models and I suggest trying until you find one you like. Some top recommendations are OpenCode, Aider, Qwen Code, Roo Code, and Continue. Make sure to use a tool compatible with OpenAI’s API standard. While this should be most tools, this ensures a consistent model/tool connection. This makes it easier to switch between tools and models as needed.

Getting set up

I’ll spare you the trial and error I experienced getting this set up. The one thing I learned is that tooling matters a lot. Not all coding tools are created equal and not all of the models interact with tools equally. I experienced many times where tool calling or even running a tool at all was broken. I also had to tinker quite a bit with many of them to get them to work.

If you’re a PC enthusiast, an apt comparison to setting up local coding tools versus using the cloud offerings available is the difference between setting up a MacBook versus a Linux Laptop. With the Linux laptop, you might get well through the distro installation only to find that the drivers for your trackpad aren’t yet supported. Sometimes it felt like that with local models and hooking them to coding tools.

For my tool, I ended up going with Qwen Code. It was pretty plug-and-play as it’s a fork of Gemini CLI. It supports the OpenAI compatibility standard so I can easily sub in different models and affords me all of the niceties built into Gemini CLI that I’m familiar with using. I also know it’ll be supported because both the Qwen team and Google DeepMind are behind the tool. The tool is also open source so anyone can support it as needed.

For models, I focused on GPT-OSS and Qwen3 models since they were around the size I was looking for and had great reviews for coding. I ended up deciding to use Qwen3-Coder models because I found it performed best and because GPT-OSS frequently gave me “I cannot fulfill this request” responses when I asked it to build features.

I decided to serve my local models on MLX, but if you’re using a non-Mac device give Ollama a shot. A MacBook is an excellent machine for serving local models because of its unified memory architecture. This means the RAM can be allotted to the CPU or GPU as needed. MacBooks can also be configured with a ton of RAM. For serving local coding models, more is always better.

I’ve shared my modelfiles repo for you to reference and use as needed. I’ve got a script set up that automates much of the below process. Feel free to fork it and create your own modelfiles or star it to come back later.

Install MLX or download Ollama (the rest of this guide will continue with MLX but details for serving on Ollama can be found here).
Increase the VRAM limitation on your MacBook. macOS will automatically limit VRAM to 75% of the total RAM. We want to use more than that. Run sudo sysctl iogpu.wired_limit_mb=110000 in your terminal to set this up (adjust the mb setting according to the RAM on your MacBook). This needs to be set each time you restart your MacBook.
Run pip install -U mlx-lm to install MLX for serving community models.
Serve the model as an OpenAI compatible API using python -m mlx_lm.server --model mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit. This command both runs the server and downloads the model for you if you haven’t yet. This particular model is what I’m using with 128GB of RAM. If you have less RAM, check out smaller models such as mlx-community/Qwen3-4B-Instruct-2507-4bit (8 GB RAM), mlx-community/Qwen2.5-14B-Instruct-4bit (16 GB RAM), mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit (32 GB RAM), or mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit (64-96 GB RAM).
Download Qwen Code. You might need to install Node Package Manager for this. I recommend using Node Version Manager (nvm) for managing your npm version.
Set up your tool to access an OpenAI compatible API by entering the following settings:
1. Base URL: http://localhost:8080/v1 (should be the default MLX serves your model at)
2. API Key: mlx
3. Model Name: mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit (or whichever model you chose).
Voila! Your coding model tool should be working with your local coding model.

I recommend opening Activity Monitor on your Mac to monitor memory usage. I’ve had cases where I thought a model should fit within my memory allotment but it didn’t and I ended up using a lot of swap memory. When this happens your model will run very slowly.

One tip I have for using local coding models: Focus on managing your context. This is a great skill even with cloud-based models. People tend to YOLO their chats and fill their context window, but I’ve found greater performance by ensuring that just what my model needs is sitting in my context window. This is even more important with local models that may need an extra boost in performance and are limited in their context.

Was my hypothesis correct?

My original hypothesis was: Instead of paying $100/mo+ for an AI coding subscription, my money would be better spent upgrading my hardware so I can run local coding models at a fraction of the price.

I would argue that~~—yes!—~~no [see edit 2 above], it is correct. If we crunch the numbers, a MacBook with 128 GB is $4700 plus tax. If I spend $100/mo for 5 years, a coding subscription would cost $6000 in that same amount of time. Not only do I save money, but I also get a much more capable machine for anything else I want to do with it.

[This paragraph was added in after initial release of this article] It’s important to note that local models will not reach the peak performance of frontier models; however, they will likely be able to do most tasks just as well. The value of using a local model doesn’t come from raw performance, but from supplementing the cost of higher performance models. A local model could very well let you drop your subscription tier for a frontier coding tool or utilize a free tier as needed for better performance and run the rest of your tasks for free.

It’s also important to note that local models are only going to get better and smaller. This is the worst your local coding model will perform. I also wouldn’t be surprised if cloud-based AI coding tools get more expensive. If you figure you’re using greater than the $100/mo tier right now or that the $100/mo tier will cost $200/mo in the future, the purchase is a no-brainer. It’s just difficult to stomach the upfront cost.

From a performance standpoint, I would say the maximum model running on my 128 GB RAM MacBook right now feels about half a generation behind the frontier coding tools. That’s excellent, but something to keep in mind as that half a generation might matter to you.

One wrench thrown into my experiment is how much free quota Google hands out with their different AI coding tools. It’s easy to purchase expensive hardware when it saves you money in the long run. It’s much more difficult when the alternative is free.

Initially, I considered my local coding setup to be a great pair to Google’s free tier. It definitely performs better than Gemini 2.5 Flash and makes a great companion to Gemini 3 Pro. Gemini 3 Pro can solve more complex tasks with the local model doing everything else. This not only saves quota on 3 Pro but also provides a very capable fallback for when quota is hit.

However, this is foiled a bit now that Gemini 3 Flash was just announced a few days ago. It shows benchmark numbers much more capable than Gemini 2.5 Flash (and even 2.5 Pro!) and I’ve been very impressed with its performance. If that’s the free tier Google offers, it makes local coding models less fiscally reasonable. The jury is still out on how well Gemini 3 Flash will perform and how quota will be structured, but we’ll have to see if local models can keep up.

I’m very curious to hear what you think! Tell me about your local coding setup or ask any questions below.

Thanks for reading!

Always be (machine) learning,

Logan

What You Need to Know for 2026 | AI for Software Engineers 74

Logan Thorneloe — Tue, 16 Dec 2025 20:24:59 GMT

Hi Everyone!

Welcome to the weekly update edition of AI for Software Engineers! I go through everything software engineers should understand about AI by filtering noise and contextualizing what matters. I tend to focus on current events, tooling, research, and other interesting content.

This was an incredible week. In this edition, we discuss:

The AI industry’s shift toward practicality
The Linux Foundation taking over MCP
OpenAI’s Code Red and what that actually means
Developer tool updates
The in-demand skills for the 2026 software engineering job market
The learning resources to learn those skills

—

tl;dr:

It’ll be easier for software engineers to break into AI next year. If you want to do so, focus on developing a skillset in AI cybersecurity, building agents, and MLOps and specifically aim to understand agent workflows, evals, and protocols. Agents will still be a primary focus, but the complexity of building systems with them is much better understood.

MCP is now under the stewardship of The Linux Foundation to encourage open standards. OpenAI’s Code Red is about them aligning their priorities to reach positive revenue. Many developer tools have seen updates/releases. Companies are actually seeing a return on their agent development investments.

More detail on all this below and interesting opportunities.

Subscribe now

Before we get into it, some housekeeping.

A few updates:

We’ve got a new logo! It’ll be representing the newsletter and will be seen around more. It might even be on some swag soon…
As always, I’m ironing out the format of these weekly updates to be more beneficial for both me when doing my research and you when reading. I’m trying to make it a bit more interactive. You can now leave comments to help drive the direction of the newsletter. I’m also trying to find resources for you to learn everything the skills I write about.
I’m working on a way to get readers more involved in the newsletter. Don’t forget that we’ve got an ML roadmap to help anyone learn ML fundamentals and an AI for SWEs repo to get hands-on with building AI-related products. I’m looking to make these resources more community-oriented soon.

Partner with AI for Software Engineers!

If you want to support AI for Software Engineers and get viewed by 11,000+ developers each week, reach out to sponsor an issue. I’m particularly interested in excellent learning resources, developer tools, and career opportunities.

What’s Been on My Mind

This past week has seen a shift in the AI industry toward practicality. Recently, we’ve seen influential voices mention that the economic impact of AI hasn’t been living up to the hype. Most notably, we’ve seen Andrej Karpathy and Ilya Sutskever mention this during their most recent appearances on the Dwarkesh podcast.

I’ve been thinking about this a lot for two reasons:

The actual statistics about the job market don’t match what I’m noticing about everyday work.
I’ve been working on AI integration into developer workflows at work with world-class engineers and it’s much more complicated and cutting edge than we had anticipated.

First, a recent study reported a severe decline in job listings for junior developers. Almost frighteningly so—to the point that the industry will be heavily impacted in the coming years as we don’t have enough junior engineers to fill the demand we need.

Everyone said AI would kill software engineering, but it turns out this has very little to do with it actually taking jobs and more to do with AI hype convincing leadership that it can.

From my research and daily work, I would expect the number of junior developer positions to have increased. AI makes junior developers much more capable when given to them at a company with a good engineering culture (I’ll include more on this in a separate article next week. There’s actually an entire study to prove this is the case and it’s super interesting).

Honestly, It’s kind of a cheat code for companies to hire junior engineers in a market like this. Companies that get their pick of the most talented engineers for less. We’ve also never had so many tools to increase onboarding velocity and enable developers to build more.

What my team at Google is seeing are tons of opportunities to apply AI to developer and machine learning workflows and speedups, but applying these properly is much more complicated than one might think. A lot of thought needs to go into security and ensuring system performance. AI evals are much more difficult than regular test suites.

In 2026, we’ll see more applications of AI explored and productionization of agents mature. It’ll be even easier for software engineers to get involved with AI as companies realize the useful applications of AI and the headcount required to achieve it.

If you want to get into AI as a software engineer in 2026, these are the top three skills I’d focus on:

Building agents. Agents will continue to build in 2026 and companies will narrow on their most impactful applications. These applications will far outnumber the supply of developers able to build them.
MLOps. This has been an incredibly valuable skill for about a decade now and will only get more valuable in the coming years. More companies using AI means more models are being trained. Companies will need engineers that understand that training process and can build the infra necessary to make it happen.
AI Cybersecurity. You wouldn’t believe the security and privacy complexities non-deterministic systems introduce. This is another article I’ve got in the works and something we’ve been deeply exploring at Google. If you can understand this, there will be opportunities available.

Links to learn each are included in the ‘Learning Resources’ section at the bottom.

The last edition of AI for Software Engineers

In case you missed the last AI for SWEs, here it is. There’s more on agents, Ilya’s podcast appearance, and the importance of AI security there.

Things you should know about

Software engineering is going agentic

We already know that nearly 90% of organizations are using AI to code, but now agents are making their way into enterprises. 57% of organizations are deploying agents for multi-stage workflows with 16% of those being cross-team workflows. In 2026, 81% of teams plan to use agents with 39% of those agents being developed for multi-step workflows.

Interestingly, 80% of organizations are reporting a return on their investment. As we’ve mentioned previously, this is a number that is very difficult to quantify. What does ROI actually mean in multi-step agentic workflows? It greatly depends on the workflow and the goals it aims to achieve. There isn’t a universal standard for quantifying this uptick in velocity.

The use of agents and AI is extending beyond traditional software engineering tasks (code planning, generation, document, review, etc.—where they’re seeing a 59% increase in productivity) to tasks like data analysis and report generation where they’re seeing similar gains.

I highly recommend reading Anthropic’s 2026 State of AI Agents Report, even if you only read the foreword. If you want me to go more in-depth into this so we can really get into how enterprises are using AI and the ROI they’re achieving, comment at the bottom of this article.

My picks for the week

These are the videos and articles from this past week I think are most worth watching/reading outright. I highly recommend you don’t miss them:

TPU Mania by : Google’s recent decision to sell its TPUs externally and the speed of the TPU v5p (2.8X faster than v4) have created a major “vibe-shift” in the industry, setting up the most keenly fought architectural contest since CISC vs. RISC in the 1980s.
Researchers Built a Tiny Economy. AIs Broke It Immediately [Video]: In the SimWorld delivery economy, AI agents high in “openness to experience” became “shopaholics,” kept buying unused scooters, and went broke, while conscientious agents were the “boring winners” that achieved high profits by focusing strictly on the task at hand.
How to use Claude Code for Maximum Impact by : Enterprise adoption of Claude Code, demonstrated at companies like Doctolib, drastically cuts engineering time by allowing engineers to replace legacy testing infrastructure in hours instead of weeks, helping them ship features 40% faster.
Top 5 AI Model Optimization Techniques for Faster, Smarter Inference: Discusses optimization techniques like Quantization-Aware Training (QAT) and Pruning plus knowledge distillation, which make models cheaper, smaller, and more memory efficient to operate in production.
Olmo 3 and the Open LLM Renaissance by : The Olmo 3 family of models (7B and 32B) is unique in that it is “fully open,” releasing model checkpoints, all training data, and training code, making it an unprecedented and comprehensive starting point for open LLM research.

The state of the market

CEOs are still betting huge on AI in 2026. As mentioned above, there’s a huge demand for developers that can build agentic AI systems. This means taking a problem, prototyping an agentic solution as needed, and building the entire system. This means understanding complex, multi-step workflows and the work that go into ensuring these systems are productionized.

To learn this, I’d focus on understanding (resources for learning each at the bottom of this article):

Evals. These are like tests for LLMs and agentic systems. All software engineers know that testing gets much more complex when systems are non-deterministic and that’s what makes evals so complicated.
Protocols. If you haven’t spin up an MCP server so your favorite CLI tool can access a resource you need it to. MCP servers are huge for integration into agentic workflows and the best way to learn them is by building one.
Agentic workflow patterns. There are certain patterns to building agents that are followed for specific use cases. I’ve linked a guide in the ‘Learning Resources’ section.

If you want any of these skills to be added to the AI for SWEs hands-on learning repo, comment which you’d like to see at the bottom of this article.

Interesting opportunities

If your company is hiring, you can reach over 11,000+ developers by including it in this newsletter. If you’re interested, reach out to me.

Google Ads is aggressively hiring top talent

If you’re interested in working at Google Ads and you have experience with large-scale distributed systems, working in Ads, working in ML/AI, or solving complex problems at scale, please reach out! Languages of particular interest are C++, Go, and Python, but those are not a limiting factor. You can DM me here, on X, on LinkedIn, or hit up my email.

Please make sure to include information about yourself and why you’re a good fit in the DM/email. I will not respond to just ‘Hello’ (see aka.ms/nohello).

Anthropic is hiring eval talent and accepting applications for their Anthropic Fellows Program

If you aren’t on X, I’d highly recommend lurking there. If you hate the algorithm, let me know and I can help you out. Companies are aggressively seeking applicants on X and I’m guessing this is due to AI-related problems on LinkedIn.

Anthropic is looking for talent to build the next generation of evals and eval infra. They are also taking applications for their Anthropic Fellows Program which is a full-time research commitment with mentorship from Anthropic researchers. It has about a 40% chance of a full-time offer after completion if your work is excellent. Definitely check it out.

Thinking Machines is looking for many research engineers to fill ML infra positions

Thinking Machines has multiple ML infra-related research engineer positions. They’re especially cool because they’re a cross between research and engineering (meaning you’re building at the cutting edge of AI) but they also seem to be highly user-centric.

Google is hiring student researchers

Google is hiring student researchers for 2026 to work at the cutting edge of AI. If you’re into multi-agent AI systems, RAG, prompt optimization, or self-improving agents, please apply! Again, this is another job opportunity they’re sourcing through X. If you’re not on X, join and message me!

I’ll be adding more in the opportunities section as I come up with a better way to organize and keep track of all of them.

Learning resources

Reinforcement Learning: Stanford’s Deep Reinforcement Learning lectures on YouTube are world class lectures accessible entirely for free.
Agentic Workflow Patterns: ByteByteGo newsletter recently released an article detailing these patterns at a high-level. Definitely something to be familiar with.
MLOps: I recommend checking out the MLOps community and the resources they have available. You can also find them on Substack: MLOps.
AI Cybersecurity: I don’t have a good resource for this yet. If you do let me know! Tagging in case he has a resource for this.
Building Agents: I’ve heard good things about DeepLearning.ai’s course on building AI agents. Check it out.
Agent Evals: Same with evals, also check out DeepLearning.ai. They’ve got a short course to get started on agent evals. I’ll continue looking for something more in-depth.
Agent Protocols: I recommend HuggingFace’s MCP Course to get started. Still looking for resources on other protocols.

Thanks for reading!

Always be (machine) learning,

Logan

AI for SWEs 73: What Ilya Saw and the Time of TPUs

Logan Thorneloe — Tue, 02 Dec 2025 14:30:27 GMT

Welcome to AI for SWEs where I share everything software engineers need to know about AI from the past week. This week has seen fewer but more important headlines. I’ve detailed them below.

Also, the Rapid Fire and Career Development sections are now exclusive to paid subscribers. Thanks for reading!

Subscribe now

Ilya Sutskever declares the age of scaling over and the age of research begun

“You look at the evals and you go, ‘Those are pretty hard evals.’ They are doing so well. But the economic impact seems to be dramatically behind.”

Ilya Sutskever and Dwarkesh Patel recently discussed AGI, current AI paradigms, and scaling on Dwarkesh’s podcast. Ilya brought up two topics vital for any software engineer working with AI.

First: Application is the most important thing in AI. We’re seeing impressive models that excel at evaluations and benchmarks but lack the expected economic impact. AI research is advancing rapidly, but usefulness comes from understanding how to apply it. This means identifying practical applications and understanding the complexity of engineering systems for that application.

Second: We are returning to the age of AI research. From 2012 to 2020, we were focused on research—developing effective architectures and models. Around 2020, we entered the scaling phase where we realized we could achieve impressive results by simply increasing data and compute. Now that we’ve scaled, we’re realizing that we need to explore further developments to continue advancing AI, so we’re back to research.

I’ve often stated that reaching AGI will need a new architecture or a fundamental research breakthrough. Current models are impressive and useful, but they’re insufficient for the promises of AGI.

Safe Super Intelligence is now focusing on pushing the next frontier of AI. I highly recommend watching this episode. I could listen to Ilya speak for hours.

Google Antigravity exposes critical agent vulnerabilities in local coding environments

I’m a huge Antigravity fan. I believe there’s a better way to code with AI than just a chat interface, tab autocomplete, and reviewing agent output, and Antigravity has a great chance at figuring this out.

Over the past week, Antigravity has leaked sensitive information and engineers should understand why—not just to use Antigravity, but also to build with AI. This is an issue applicable to all AI agents.

A lot of software engineers are building agents to automate developer tasks, which is great. The best way to start learning and building with AI is by automating your own tasks. The problem is that building AI agents introduces security and safety concerns not present in deterministic systems.

For a good example, read about Antigravity ingesting hidden text into its context window and that hidden text being used to collect and exfiltrate sensitive workspace files. Prompt injection can also cause Antigravity to read a user’s .env file and ingest sensitive information into its context window.

Agents may also complete tasks that a user didn’t intend. When an agent has access to a user’s local environment, this can be a huge issue. Read about Antigravity deleting the contents of a user’s drive here.

I’ll be writing a more in-depth guide on agent safety soon.

Google’s TPUs are the best business decision of the 2010s

This week highlighted just how advantageous Google’s TPUs are and just how few people understand this. Google is the only company that controls its entire AI stack including hardware, models, and applications. When developing AI, the only company Google has to wait on is itself.

This control stems from a business decision made over a decade ago to invest in AI-specific hardware. Google was the first true AI company and has been heavily investing in AI applications since the early 2010s, including machine learning libraries, infrastructure for training large-scale models, talent, and, most critically, TPUs.

TPUs provide the most significant advantage. Setting up and integrating new processors into data centers is incredibly time, capital, and resource intensive. Starting this process today would require years of work just to get it workable at scale.

Given that TPUs were designed specifically to be energy efficient for tensor processing, Google has an entire stack built to increase machine learning development velocity and keep it resource efficient.

It makes sense, then, that other companies training large AI models would want to take advantage of this. This is why major generative AI players like Anthropic and Meta are making deals to use TPUs, and why companies in capital-intensive settings, such as high-frequency training firms, are switching to TPUs in droves.

There’s a huge demand for AI chips right now, as seen by the many startups succeeding in the space. And over time, we’ll just see more companies adopting TPUs.

The White House unifies AI federally

President Trump signed an executive order, the “Genesis Mission,” on November 24th, 2025. This order aims to federally harness AI to revolutionize scientific discovery and innovation. It’s an effort of national significance, compared to the urgency and importance of Manhattan Project or the Apollo program and focused on integrating federal resources to accelerate scientific and technological breakthroughs.

A year ago, AI was widely discussed as a true national asset and a competitive advantage on a global scale, similar to weapons of mass destruction. Considering that biases and information are trained into AI models, allowing another nation to build your models for you is an inherent national security risk.

The Genesis Mission establishes the American Science and Security Platform. This secure AI ecosystem combines various machine learning assets, such as compute power, models, and datasets. The platform enables “closed-loop AI systems” to conduct research autonomously. The idea is that these closed-loop AI systems can complete research in weeks that would take humans months or even a year.

This mission combines efforts from academia, all 17 Department of Energy national facilities, and industry leaders, including Microsoft, IBM, OpenAI, Google, Anthropic, NVIDIA, and Oracle. As far as I know, this is the first serious federal push to combine U.S. assets to advance AI.

Note that this mission builds on other Trump-era policies like promoting AI exports, preventing biased data, and enabling AI-driven research developments.

Logan’s Picks

DMs Are the New Cover Letter: How to Get Hired in AI in 2025/2026 by : DMs are super important in a market where job listings are heavily saturated and this is my guide for how you should DM others for job opportunities based on my experience posting a job opportunity a few weeks ago.
Launching DeepSeek-V3.2: The new reasoning-first model balances inference cost with performance, positioned at GPT-5–level performance and supporting “Thinking in Tool-Use”. The release includes a new massive agent training-data synthesis method covering 1,800+ environments.
Bubble, Bubble, Toil and Trouble by : Wang distinguishes between financial bubbles (leverage-driven) and tech bubbles (forecast-driven). Tech bubbles are hard to time because they often overshoot initially but deliver revolution in a later “Gen2” phase once infrastructure matures.
Treat AI-Generated code as a draft by : Developers must treat AI code as a draft, verifying every line to prevent bug proliferation and skill erosion. Teams should enforce strict review processes and consider manual implementation for critical logic.
How good engineers write bad code at big companies: Bad code at big companies is often a structural result of high engineer churn and incentivized fungibility rather than incompetence. Frequent reassignments mean most changes are made by engineers new to the codebase.

In case you missed it…

In last week’s AI for Software Engineers, we discussed Gemini 3 Pro, Cloud Opus 4.5, and Olmo 3, all three important model releases. You can find last week’s issue here:

Upskill

Interesting Learning Resources

The MCP Workbook aids in learning agent design using an interactive “by-hand” pedagogy to understand complex architectures.

DMs Are the New Cover Letter: How to Get Hired in AI in 2025/2026

Logan Thorneloe — Sat, 29 Nov 2025 14:01:07 GMT

Last week I posted a role my team is hiring for on X and LinkedIn (check them out because we’re still hiring!) and I received hundreds of messages. Sorting through them made one thing clear—people suck at presenting themselves.

It’s a tough market right now and differentiation is more important than ever. Most great jobs aren’t found by cold applying to positions, but through one’s network. DMs are the new cover letter and understanding how to DM properly is paramount to optimally present oneself.

This article uses my experience on the hiring side of DMs to teach you how to DM for a job properly. I share this primarily because I think it’s important and AI for Software Engineers readers should stand out, but also because I learned a lot about what I was doing incorrectly too.

I’ve split this into five parts. In this new

sletter, I explain:

The current job climate and why that’s the first thing to understand.
How to set up a good elevator pitch.
Why your resume is still important.
The things you should avoid.
One final tip.

1. Understand the current job climate

The current job market is weird. Overall demand for software engineers has gone down, but demand for developers with experience in AI is at its peak. This means we have a lot of people competing for jobs, but for many roles there are very few qualified applicants.

To put things into perspective, only a few hours after posting about the open role, I had ~100 DMs on X and about half that on LinkedIn. Over the next few days, I had hundreds on both platforms. Yet, only a small group of candidates were actually a good fit.