Slopification and Its Discontents

The Fall of Man

O sovran, virtuous, precious of all trees
In Paradise! of operation blest
To sapience, hitherto obscured, infamed.

A couple of weeks back Anthropic announced a promotion offering six free months of their maximum Claude plan to open-source developers. I submitted an application, and a few days later an invitation arrived in my inbox and I was up and running on the latest Opus 4.6 model. Previously I had been limited to Sonnet on whatever pittance of tokens is given to free accounts, so this was quite a shift from everything I had been accustomed to when using AI tools.

With the fruit of the tree plucked and in my hand, I plunged in. I began by having Claude review some of the work I'd been doing on Peewee to implement asyncio, finding ways to optimize performance in cysqlite, and creating a plan for re-structuring Peewee's documentation. All of these tasks were fairly self-contained but dealt with increasingly large amounts of data and context.

Claude's performance was mixed across these tasks. Several bugs were identified and fixed in the asyncio implementation, mostly around resource cleanup, which can be a fairly tricky problem with asyncio (destructors being synchronous, specifically). Ultimately this work led to a cleaner implementation of task-local storage, which comprises about 50% of the magic needed to make Peewee async. The implementation required very close review of the diffs and a lot of discernment to judge whether the proposed solutions were any good. Taken piecemeal, the code produced by Claude has a way of looking so correct that it is almost second-nature to me to glaze right over it. That is why reviewing it carefully is incredibly important - the first few iterations were terrible, though they had a certain plausibility that an unwary developer might mistake for inevitability.
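To make the "task-local storage" idea concrete: this is not Peewee's actual implementation, just a minimal sketch of the underlying technique using the standard-library contextvars module, which gives each asyncio task its own view of what looks like shared state. The class and variable names here are hypothetical.

```python
import asyncio
import contextvars

# Each asyncio task gets its own copy of the context, so values set
# here by one task are invisible to its siblings.
_conn_state = contextvars.ContextVar("conn_state", default=None)

class TaskLocalStorage:
    """Per-task storage: each task sees only its own dict."""
    def get(self):
        state = _conn_state.get()
        if state is None:
            state = {}
            _conn_state.set(state)
        return state

storage = TaskLocalStorage()

async def worker(name):
    # Each task writes to what looks like shared storage...
    storage.get()["conn"] = name
    await asyncio.sleep(0)  # yield, letting the other task interleave
    # ...but reads back only its own value.
    return storage.get()["conn"]

async def main():
    return await asyncio.gather(worker("a"), worker("b"))

results = asyncio.run(main())
# Each task sees its own value, not whatever the last writer stored.
```

For an ORM, the dict would hold a per-task database connection and transaction stack rather than a string, but the isolation mechanism is the same.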

When tasked to optimize performance of cysqlite, Claude did quite poorly. For example, it suggested using C-level constructions when the Python was already properly translated to optimized calls. Applying Claude's other suggestions to the codebase (modifications to sections it identified as "hot") decreased code quality and readability, and produced gains that were indistinguishable from noise when measured, so these were never committed. Claude performed much better at finding bugs in the implementation (including producing failing test cases, which is great), and I'm very grateful to have those bugs fixed before someone stumbled across them. It also helped me identify gaps in test coverage, ensuring that all kinds of tedious edge cases ended up covered. For these aspects it undoubtedly saved me time and led to a better library.
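The "indistinguishable from noise" judgment is worth spelling out. A hedged sketch of the kind of check I mean, using only the standard library: run each variant several times, compare best-of-N timings against the run-to-run spread, and treat any "gain" smaller than the spread as noise. The before/after functions here are stand-ins, not cysqlite code.

```python
import statistics
import timeit

def bench(fn, repeat=5, number=1000):
    """Return per-call times (in seconds) across several repeats."""
    times = timeit.repeat(fn, repeat=repeat, number=number)
    return [t / number for t in times]

# Hypothetical "before" and "after" versions of a hot function.
def before():
    return sum(i * i for i in range(100))

def after():
    return sum(map(lambda i: i * i, range(100)))

b = bench(before)
a = bench(after)

# Compare the least-noisy (minimum) timings against the within-run
# spread: if the difference between the two bests is smaller than the
# spread, the claimed speedup is indistinguishable from noise.
delta = min(b) - min(a)
spread = statistics.stdev(b + a)
print(f"delta={delta:.3e}s spread={spread:.3e}s")
```

On real workloads this should of course be run on a quiet machine with a larger `number`, but even this rough version filters out most illusory micro-optimizations.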

For the Peewee documentation reorganization, Claude did a great job when I restricted it to analyzing and planning - what the top-level main sections should be, where section X in the current documentation would fit most appropriately in the new hierarchy - things like that. There were all sorts of problems when I prompted Claude to apply the plan to the actual documentation, though. Entire sections of existing docs got dropped on the floor and never made it into the outputs, while other sections were partially duplicated. Claude would hallucinate usage examples that were faulty, inventing and calling APIs that did not exist. In the end I had Claude perform the reorganization simply to have something concrete to look at. I then put the Claude outputs to one side and went through document-by-document, section-by-section, comparing the current documentation with the Claude versions and doing a kind of manual merge. This took many hours. I think it may have been a wash in terms of time saved; I might have done as well taking the outline (which was good) and working from it directly myself, having Claude look for gaps or places where better examples were needed.

r8myclock

Claude: I need a big clock to put on my tower.

Claude: I need a gravity-powered clock capable of reliably driving four faces and bells over continuous multi-year use. Must provide continuous torque sufficient to rotate a 5m-long minute hand weighing ~100kg and a 3m-long hour hand weighing ~300kg. Account for torque variation due to changing lever arm as hands rotate, and wind forces on large dial surfaces. Operate reliably between 10 and 100 degrees, and account for thermal expansion, humidity, dust and lubrication breakdown. Must be robust against partial mechanical failure.

Claude: ... r8 my clock

An insight occurs

I realized that there is a huge gap between Claude's ability to read and analyze versus its ability to generate or modify in-place, which becomes more dramatic the larger the scope of work. This skill-gap was apparent even in the smaller tasks, such as analyzing the asyncio extension: Claude was able to skilfully identify areas where resource cleanup was fragile and find gaps in test-coverage. Much of this required deep thinking about lifetimes, API contracts, async-vs-sync behavior, race conditions, deadlocks, etc. But when tasked to produce new or novel implementations, it stumbled and kept gifting me the same turd in different colored wrapping paper. What happened to all the insight and ingenuity Claude had shown while reading the code?

Whether this is the nature of next-token generation or whether it indicates the need for multi-agent feedback loops (agent 1 writes code, agent 2 provides criticism), there is a palpable difference between Claude's performance in the two types of task that becomes more apparent as context-size increases. This raises the question of why the extended-thinking model does not have more robust internal feedback loops during code generation, but ultimately it forced me to re-evaluate my prompting strategy.
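The multi-agent idea above can be sketched as a simple generate-critique loop. Everything here is a hypothetical stand-in - the two functions would really be model calls, and a real critic would run tests or linters rather than string checks - but the control flow is the point.

```python
# A schematic writer/critic loop; the function bodies are stand-ins
# for model calls, and all names here are hypothetical.
def write_code(task, feedback=None):
    # "Agent 1": produce a candidate, incorporating any prior feedback.
    suffix = f" (revised: {feedback})" if feedback else ""
    return f"solution for {task!r}" + suffix

def critique(candidate):
    # "Agent 2": return None when satisfied, else a complaint.
    return None if "revised" in candidate else "needs resource cleanup"

def refine(task, max_rounds=3):
    feedback = None
    candidate = None
    for _ in range(max_rounds):
        candidate = write_code(task, feedback)
        feedback = critique(candidate)
        if feedback is None:
            return candidate  # the critic accepted this round
    return candidate  # give up after max_rounds and return the last attempt

result = refine("task-local storage")
```

The interesting engineering questions are all hidden in `critique`: a weak critic accepts the same turd in new wrapping paper, which is exactly the failure mode described above.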

I'm in the process of using Claude to help with reorganization and cleanup of the Peewee test suite, and am applying what I've learned from my failures and frustrations in the earlier tasks. Peewee has over 1MB of tests, so the dream of typing "refactor and clean these tests up" is never gonna happen. Many of the tests implicitly verify multiple subtle behaviors. To make matters worse, many tests share model classes, although sometimes they use their own identically-named local variations. Additionally, the SQL generated by Peewee depends on the names of the model classes and field instances (unless explicit names are provided), so any renaming requires finding and updating the corresponding SQL-generation assertions.

What has been working well for this project has been very different than what I anticipated. Up-front I needed to do a lot more specification.

Then I prompted Claude to propose a plan for reorganizing the tests in small, independent steps. After verifying the plan looked sound, I had Claude proceed. At each step, I verified the test-runner outputs and diffs against the current test-suite, to ensure the work was proceeding correctly. Claude ended up devising a system of using comments to serve as "anchors" within the test modules and was able to complete a mechanical reorganization without too many issues.
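The post doesn't show Claude's anchor scheme, so here is a hypothetical minimal version of the idea: sentinel comments delimit movable sections, so a script (or the model) can locate and shuffle them mechanically without parsing the code itself. The anchor syntax and function name below are my own invention.

```python
import re

# A toy test module using hypothetical anchor comments as section markers.
SOURCE = """\
# === anchor: fields ===
def test_char_field(): ...
def test_int_field(): ...
# === anchor: queries ===
def test_select(): ...
"""

def split_by_anchors(text):
    """Map each anchor name to the source lines that follow it."""
    sections, current = {}, None
    for line in text.splitlines():
        m = re.match(r"# === anchor: (\w+) ===", line)
        if m:
            current = m.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

sections = split_by_anchors(SOURCE)
# sections["fields"] holds the two field tests; sections["queries"]
# holds the select test. Reordering is now a dict operation, and the
# diff at each step stays small enough to review by eye.
```

The appeal of the scheme is that each step is verifiable: the anchors make the before/after diff trivially comparable, which is exactly what the step-by-step verification above requires.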

The work is still progressing, and some of it may be left undone as Claude's own analysis indicated that the risks may outweigh the benefits of attempting a deeper logical refactor.

Prompting converges with coding

On the whole, I believe Claude has saved me a lot of time, and many of the shortcomings were addressed by adapting my prompting strategy and breaking the work into smaller pieces. This seems in line with what Nicholas Carlini at Anthropic describes from working on a C compiler written in Rust:

So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

I went through several iterations of watching Claude attempt to refactor the test-suite, and based on the shortcomings of each run I adapted my prompting strategy to draw particular attention to each new class of problem that had to be avoided. This was fairly effective.

In general, Claude is absolutely helpful for:

- reading and analyzing code: spotting bugs, fragile resource-cleanup paths, and gaps in test coverage
- producing failing test cases that pin a bug down before it is fixed
- planning and outlining: proposing a documentation hierarchy or a step-by-step reorganization

Claude did poorly at:

- generating or modifying code at scale, where entire sections got dropped or duplicated
- hallucinating usage examples, inventing and calling APIs that do not exist
- micro-optimizations, where suggestions hurt readability for gains indistinguishable from noise

AI software development is treated with an almost religious reverence among founders, who believe they've found the silver bullet for converting ideas into profitability. Added to this intoxication is a parallel fear of obsolescence, of being beaten to market, or of missing out on the AI-Native revolution. I think both of these assumptions are flawed. My experience has been that AI can overwhelmingly increase my efficiency, both for writing code and for verifying the correctness of existing code. But when working on a larger, more complex problem, the prompt and the workflow become incredibly important, and take on many of the same shapes and aspects that coding itself requires. Taken to an extreme, prompting converges with coding.

In a similar way, I think the fears are misplaced. If AI were there, then any pure-software (or even mostly-software) B2B enterprise product would be under existential threat from hundreds of vibe-coded alternatives offering 80% of the features at a bargain price. Slack, Salesforce, Photoshop, Office 365, DocuSign - I don't believe vibe-coding AI-Native startups are going to displace these businesses any time soon. And I don't believe inertia or entrenchment are the main reasons for this. AI just isn't there yet.

I asked my wife, who is a nurse practitioner, her thoughts on all this, as AI has been a topic of discussion in her office and at conferences. The AI-Native vision for healthcare would be the elimination of all EMR work - the patient shows up with a complaint, an exam is performed while AI listens, AI orders any relevant tests, whose results are fed back, analyzed, and correlated with the patient's overall health, history, and medications. AI produces a diagnosis and develops a treatment plan, or processes an admission and schedules a surgery. AI bills insurance and handles that side as well. Would this work?

She answered with a story about a patient they have who is dealing with heart failure. He is non-compliant about taking his medicine, but does not like to be told he is non-compliant. He is not terribly old, nor is he in terrible health, but his body is failing. She worries that people like him - people who may be partially non-compliant, people who don't tolerate certain medicines but can take them in a pinch, and all the other "edge cases" - will not fit into an AI-run system. She related how a colleague who uses her phone during consultations to help outline notes finds that the AI-generated notes are generally good, but occasionally it inserts crazy diagnoses that were never even discussed. Nonetheless, her concern was that the human discretion and discernment that occurs during an exam, talking to the patient, would be lost. Forget the hallucinated diagnoses, forget the possibility of cutting off the wrong leg - the far riskier proposition was losing the interaction between patient and provider.

I'm guessing that none of this is news to any of us who've really spent time using AI, but I wanted to collect my thoughts here while they're fresh. Or perhaps that small-engine repair class I'm starting at the end of March is coming just in the nick of time.

It says it on the homepage, but it bears re-iterating here - no AI was used in any way to write or edit this post. There's enough slop out there already.
