20 Years of Code Optimized in Two Days | Weekend Reads 4
Your AI reading list 03-15-2026
Welcome to the weekly reading list! This is how I keep up with AI news and deepen my understanding of the topics that matter for building production systems. I focus on primary sources and authors I trust to keep the signal-to-noise ratio high.
You can support AI for Software Engineers for only $5/month and get the complete edition of this list as a thank you. Thank you to all paid subscribers for your support!
In this list
As has been the case throughout 2026, there are a ton of interesting reads this week about getting agents working in production and what they can do. Most interesting are:
- Shopify’s CEO pointed a coding agent at a 20-year-old Ruby codebase with a benchmark script and 974 unit tests. 120 automated experiments and 93 commits later, it was 53% faster.
- StrongDM built a production software pipeline where three humans manage AI agents. The rules: “code must not be written by humans” and “code must not be reviewed by humans.” Each engineer spends ~$1,000/day on tokens.
- AMD published a diagnostic framework where Claude Code and Cursor act as autonomous agents debugging large training clusters, tracing a 23% throughput drop to RDMA degradation on 4 of 24 nodes.
- 84% of Uber devs are now agentic coding users, and Claude Code usage nearly doubled in three months, from 32% to 63%, while IDE-based tools have plateaued.
- OpenAI shared a phishing-style prompt injection that tricked ChatGPT into exfiltrating employee PII 50% of the time, and their defense framework treats it like a call center problem, not a code injection problem.
Shopify/liquid: Performance: 53% faster parse+render, 61% fewer allocations by Simon Willison
“Having a robust test suite - in this case 974 unit tests - is a massive unlock for working with coding agents. This kind of research effort would not be possible without first having a tried and tested suite of tests.”
Shopify CEO Tobi Lutke took Andrej Karpathy’s autoresearch pattern, pointed it at Liquid, Shopify’s template engine, and ran 120 automated experiments over two days. The agent gave itself a benchmark script, iterated against 974 unit tests, and produced 93 commits.
- Parse+render time dropped by 53%; allocations dropped by 61%
- Replacing StringScanner with String#byteindex was ~40% faster for single-byte searching
- Pre-computing frozen strings for integers 0-999 eliminated 267 allocations per render
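The frozen-string trick generalizes to any hot path that repeatedly converts small integers to strings. The actual Liquid change is in Ruby; here's a minimal Python analogue of the same idea, with illustrative names:

```python
# Precompute the string form of small integers once, so a hot render
# loop re-uses cached objects instead of allocating a new string per call.
_SMALL_INTS = [str(i) for i in range(1000)]

def int_to_str(n: int) -> str:
    """Return a cached string for 0-999, falling back to str() otherwise."""
    if 0 <= n < 1000:
        return _SMALL_INTS[n]  # cached object, no new allocation
    return str(n)
```

The win is purely allocation pressure: correctness is unchanged, but the garbage collector has less to do on every render.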
A comprehensive test suite gave the agent enough context to make changes and verify them independently. “Make it faster” only becomes an actionable goal when the agent can measure its own progress and confirm it hasn’t broken anything along the way.
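The measure-and-verify loop behind those 120 experiments can be sketched in a few lines. This is a hypothetical Python sketch, not the actual harness: propose, tests_pass, and benchmark stand in for the agent's edit step, the 974-test suite, and the benchmark script.

```python
def optimize(state, propose, tests_pass, benchmark, rounds=120):
    """Greedy agent loop: try a change, keep it only if the full test
    suite still passes AND the benchmark got faster; otherwise discard.
    Returns (final_state, best_time, number_of_kept_changes)."""
    best = benchmark(state)
    kept = 0
    for _ in range(rounds):
        candidate = propose(state)      # agent edits the code
        if tests_pass(candidate):       # nothing broke
            t = benchmark(candidate)
            if t < best:                # measurable improvement
                state, best, kept = candidate, t, kept + 1
    return state, best, kept
```

The tests are the load-bearing part: without them, "faster" candidates that silently break behavior would be accepted.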
Designing AI agents to resist prompt injection
“If the problem is not just identifying a malicious string, but resisting misleading or manipulative content in context, then defending against it cannot rely only on filtering inputs.”
Real-world prompt injection attacks now look like phishing, not code injection. OpenAI shared an example: a phishing-style email that worked 50% of the time against ChatGPT, getting it to extract employee PII and send it to a third party.
Their defense framework borrows from how organizations protect human customer service agents: you don't train a call center worker to detect every possible scam; you constrain their capabilities. For AI agents, this means source-sink analysis: monitoring when information would leave the conversation or when the agent would follow an external link, rather than trying to perfectly classify inputs.
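A source-sink gate of this kind can be sketched as a policy check on tool calls rather than a classifier on prompts. Everything below is a hypothetical illustration, not OpenAI's implementation; the tool and source names are made up:

```python
# Instead of trying to detect malicious text, inspect each tool call the
# agent wants to make and block actions that would move sensitive data to
# an external sink while untrusted content is present in the context.
SENSITIVE_SOURCES = {"employee_directory", "crm_records"}
EXTERNAL_SINKS = {"send_email", "http_post", "open_url"}

def allow_tool_call(tool_name, context_sources, untrusted_in_context):
    """Permit the call unless it routes data from a sensitive source to an
    external sink while untrusted (e.g. emailed) content is in context."""
    crosses_boundary = (tool_name in EXTERNAL_SINKS
                        and bool(context_sources & SENSITIVE_SOURCES))
    if crosses_boundary and untrusted_in_context:
        return False  # escalate to a human instead of executing
    return True
```

Note the asymmetry: the phishing email can say anything it likes, but the exfiltration step still has to pass through a sink the policy can see.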
The Shape of the Thing by Ethan Mollick
“Code must not be written by humans. Code must not be reviewed by humans.”
StrongDM built a “Software Factory” where three humans manage AI agents that write, test, and ship code. Each engineer spends ~$1,000/day on AI tokens. Coding agents build from product roadmaps, testing agents build simulated customer environments and try to break what the coding agents built, and the agents loop feedback to each other until satisfied.
We’ve moved from co-intelligence, prompting back and forth, to management, giving agents hours of work and getting results in minutes. Every major AI lab is now explicitly working on recursive self-improvement. OpenAI says Codex was “instrumental in creating itself,” and Anthropic says their engineers barely write code anymore.
Nemotron 3 Super: NVIDIA’s gpt-oss killer? by Maxime Labonne
“Reducing the expert dimension by a factor of d/l = 4 lets you reinvest those savings into both more total experts and higher top-k.”
NVIDIA’s Nemotron 3 Super, 120B total with 12B active, is worth paying attention to because of LatentMoE. Standard MoE routes tokens from the full hidden dimension directly to experts, but LatentMoE wraps the expert path with shared linear projections that compress from d=4096 down to l=1024, do all expert computation in that compressed space, then project back up.
- Reducing the expert dimension by 4x lets you run 512 total experts with top-22 routing, where standard MoE typically uses 128 experts with top-6 or top-8 at the same compute cost
- Artificial Analysis flagged the model as extremely verbose, though, generating 110M tokens during their eval suite vs. an average of 7.3M, which could erase most of those throughput gains in practice
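To see where the reinvestable savings come from, here is a back-of-envelope count of multiply-accumulates on the expert path. The d=4096 and l=1024 figures are from the article; the FFN width (4x the working dimension) and the counting of matmuls only are my simplifying assumptions:

```python
def expert_path_flops(dim, n_active, ffn_mult=4, proj_dim=None):
    """Rough multiply-accumulate count per token on the expert path.
    Each active expert is a 2-layer FFN: dim -> ffn -> dim. If proj_dim
    is set, add shared down/up projections and run the experts in the
    compressed latent space instead."""
    flops = 0
    if proj_dim is not None:
        flops += 2 * dim * proj_dim  # shared down- and up-projection
        dim = proj_dim               # experts now operate on the latent
    ffn = ffn_mult * dim
    flops += n_active * 2 * dim * ffn  # up- and down-matmul per expert
    return flops

standard = expert_path_flops(4096, n_active=8)                  # top-8 of 128
latent = expert_path_flops(4096, n_active=22, proj_dim=1024)    # top-22 of 512
```

Because each expert's matmuls shrink quadratically with the working dimension, the compressed path leaves a large budget that can be spent on more total experts and a higher top-k, which is exactly the trade the quote describes.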
How to Diagnose Failures in Large AI Training Clusters by Devansh
“The teams that figure out how to make that transition -- how to turn their debugging knowledge into repeatable infrastructure instead of leaving it trapped in someone’s head -- those are the teams that will compound their advantage over everyone else.”
AMD published a diagnostic framework for large training clusters where Claude Code and Cursor act as autonomous diagnostic agents. It uses a three-skill pipeline: job-log-triage to identify what happened, performance-analysis to locate where in compute, and tsdb-diagnosis to determine why via Prometheus queries.
In one case study, a 23% throughput drop on a 192-GPU run was traced to RDMA degradation on 4 of 24 nodes. The agent isolated the unhealthy nodes from TSDB metrics, and excluding them improved throughput by 30%. The skills themselves are structured instruction files that encode how senior systems engineers actually debug these problems, turning tribal knowledge into repeatable runbooks.
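The shape of that three-skill pipeline is easy to sketch. These function names and heuristics are illustrative stand-ins, not AMD's actual skill files:

```python
def job_log_triage(logs):
    """Skill 1 (what happened?): classify the failure mode from job logs."""
    if "NCCL timeout" in logs:
        return "communication_stall"
    if "throughput" in logs:
        return "performance_regression"
    return "unknown"

def performance_analysis(per_node_tflops, threshold=0.9):
    """Skill 2 (where?): flag nodes well below the fleet median."""
    median = sorted(per_node_tflops.values())[len(per_node_tflops) // 2]
    return [n for n, t in per_node_tflops.items() if t < threshold * median]

def tsdb_diagnosis(node, metrics):
    """Skill 3 (why?): check time-series metrics (e.g. Prometheus
    counters) for the suspect node."""
    if metrics.get((node, "rdma_retransmits"), 0) > 1000:
        return "rdma_degradation"
    return "inconclusive"
```

The value is the decomposition: each step narrows the search space before handing off, which is exactly how a senior engineer works the problem by hand.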
AI should help us produce better code
“Shipping worse code with agents is a choice. We can choose to ship code that is better instead.”
Willison’s argument is that agents should make code quality go up, not down. Common tech debt like renaming concepts, fixing API inconsistencies, and splitting large files is conceptually simple but time-consuming, and agents handle it well.
He recommends using async agents like Gemini Jules, Codex web, and Claude Code web for background refactoring so it doesn’t interrupt flow, and using agents for cheap exploratory prototyping. You can spin up a Redis simulation with load tests from a single prompt to validate technology choices before committing to an approach.
Applying Statistics to LLM Evaluations by Cameron R. Wolfe, Ph.D.
“Fundamentally, evaluations are experiments; but the literature on evaluations has largely ignored the literature from other sciences on experiment analysis and planning.”
The standard industry practice for evals is to run a model on a benchmark, report the number, and bold it if it’s the highest. No confidence intervals, no significance tests, and no accounting for the fact that your eval score has at least two sources of randomness: which questions were sampled and the model’s stochastic generation.
Based on Anthropic’s paper on statistical best practices, this deep-dive builds the framework from scratch.
- The Central Limit Theorem gives you confidence intervals for eval scores
- A Bernoulli simplification for pass/fail evals gives a cleaner standard error formula
- The law of total variance decomposes eval uncertainty into question-sampling variability vs. within-question generation variability
- On a 70B model, evaluating with too few questions can produce confidence intervals wide enough to make model comparisons meaningless
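The pass/fail case is small enough to compute by hand. A minimal sketch of the Bernoulli standard error and the resulting confidence interval (the 72/100 example numbers are mine, for illustration):

```python
import math

def eval_ci(passes, n, z=1.96):
    """95% confidence interval for a pass/fail eval score, using the
    Bernoulli standard error SE = sqrt(p * (1 - p) / n) from the CLT."""
    p = passes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, (p - z * se, p + z * se)

# With 100 questions, a 72% score carries roughly a +/- 8.8-point interval,
# so a "3-point lead" over another model may be pure sampling noise.
```

Quadrupling the question count only halves the interval width, which is why under-sized benchmarks produce the meaningless comparisons the post warns about.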
Coding After Coders: The End of Computer Programming as We Know It
“I feel like programmers have it easy... If you’re a lawyer, you’re screwed, right? There’s no way to automatically check a legal brief written by A.I. for hallucinations -- other than face total humiliation in court.”
The NYT Magazine’s comprehensive piece on AI-assisted development, based on interviews with 70+ developers from Google, Amazon, Microsoft, and Apple. The general attitude was optimistic, with mentions of Jevons paradox potentially increasing demand.
The request for anonymity from the Apple engineer who said “I believe that it can be fun and fulfilling and engaging, and having the computer do it for you strips you of that” is itself a data point. Corporate dynamics may be suppressing critical voices.
You can support AI for Software Engineers for just $5/mo. You’ll get more research articles and the extended reading list each week. In case you missed it, here’s last week’s reading list:
Better Agents Mean Better Surveillance | Weekend Reads 3