Slopification and Its Discontents

March 23, 2026 21:33 / ai python / 0 comments

O sovran, virtuous, precious of all trees
In Paradise! of operation blest
To sapience, hitherto obscured, infamed.

A couple weeks back Anthropic announced a promotion offering six free months of their maximum Claude plan to open-source developers. I submitted an application, and a few days later an invitation arrived in my inbox and I was up-and-running on the latest Opus 4.6 model. Previously I had been limited to Sonnet on whatever pittance of tokens is given to free accounts, so this was quite a shift from everything I had been accustomed to when using AI tools.

With the fruit of the tree plucked and in my hand, I plunged in. I began by having Claude review some of the work I'd been doing on Peewee to implement asyncio, finding ways to optimize performance in cysqlite, and creating a plan for re-structuring Peewee's documentation. All of these tasks were fairly self-contained but dealt with increasingly large amounts of data and context:

asyncio extension to Peewee is under 800 lines of code.
cysqlite is several thousand lines of Cython, but the surface area is very well-defined in terms of both Python's DB-API, and the straightforward interaction between connections, cursors, and statements.
Peewee's prose and API reference documentation is organized in a hierarchy, and the content is structured (reStructuredText). It is about 750KB of text.

Claude's performance was mixed across these tasks. Several bugs were identified and fixed in the asyncio implementation, mostly around resource cleanup which can be a fairly tricky problem with asyncio (destructors being synchronous, specifically). Ultimately this work led to a cleaner implementation of task-local storage, which comprises about 50% of the magic needed to make Peewee async. The implementation required very close review of the diffs and a lot of discernment to judge whether the proposed solutions were any good. Taken piecemeal, the code produced by Claude has a way of looking so correct that it is almost second-nature to me to glaze right over it. That is why reviewing it carefully is incredibly important - the first few iterations were terrible, though they had a certain plausability that an unwary developer might mistake for inevitability.

When tasked to optimize performance of cysqlite, Claude did quite poorly. For example, it suggested using C-level constructions when the Python was already properly translated to optimized calls. Applying Claude's other suggestions to the codebase (modifications to sections it identified as "hot") decreased code quality and readability, and produced gains that were indistinguishable from noise when measured, so these were never committed. Claude performed much better at finding bugs in the implementation (including producing failing test cases, which is great), and I'm very grateful to have those bugs fixed before someone stumbled across them. It helped me as well to identify gaps in test coverage, ensuring that all kinds of tedious edge-cases ended up covered. For these aspects it undoubtedly saved me time and led to a better library.

For the Peewee documentation reorganization, Claude did a great job when I restricted it to analyzing and planning - what the top-level main sections should be, where section X in the current documentation would fit most appropriately in the new hierarchy - things like that. There were all sorts of problems when I prompted Claude to apply the plan to the actual documentation, though. Entire sections of existing docs got dropped on the floor and never made it into the outputs, while other sections were partially duplicated. Claude would hallucinate usage examples that were faulty, inventing and calling APIs that did not exist. In the end I had Claude perform the reorganization simply to have something concrete to look at. I then put the Claude outputs to one side and went through document-by-document, section-by-section comparing the current documentation with the Claude versions, and doing a kind of manual merge. This took many hours to accomplish. I think that it may have been a wash in terms of time-saved had I simply taken the outline (which was good) and worked directly from that myself, having Claude look for gaps or places where better examples were needed.

Claude: I need a big clock to put on my tower.

Claude: I need a gravity-powered clock capable of reliably driving four faces and bells over continuous multi-year use. Must provide continuous torque sufficient to rotate a 5m-long minute hand weighing ~100kg and a 3m-long hour hand weighing ~300kg. Account for torque variation due to changing lever arm as hands rotate, and wind forces on large dial surfaces. Operate reliably between 10 and 100 degrees, and account for thermal expansion, humidity, dust and lubrication breakdown. Must be robust against partial mechanical failure.

Claude: ... r8 my clock

An insight occurs

I realized that there is a huge gap between Claude's ability to read and analyze versus it's ability to generate or modify in-place, which becomes more dramatic the larger the scope of work. This skill-gap was apparent even in the smaller tasks, such as analyzing the asyncio extension: Claude was able to skilfully identify areas where resource cleanup was fragile and find gaps in test-coverage. Much of this required deep thinking about lifetimes, API contracts, async-vs-sync behavior, race conditions, deadlocks, etc. But when tasked to produce new or novel implementations, it stumbled and kept gifting me the same turd in different colored wrapping paper. What happened to all the insight and ingenuity Claude had shown while reading the code?

Whether this is the nature of next-token generation or whether it indicates the need for multi-agent feedback loops (agent 1 writes code, agent 2 provides criticism), there is a palpable difference between Claude's performance in two types of task that becomes more apparent as context-size increases. It begs the question why the extended-thinking model does not have more robust internal feedback loops during code gen, but ultimately it forced me to re-evaluate my prompting strategy.

I'm in the process of using Claude to help with reorganization and cleanup of the Peewee test suite, and am applying what I've learned from my failures and frustrations in the earlier tasks. Peewee has over 1MB of tests, so the dream of typing "refactor and clean these tests up" is never gonna happen. Many of the tests are implicitly verifying multiple subtle behaviors. To make matters worse, many tests share model classes, although sometimes they use their own identically-named local variations. Additionally, the SQL generated by Peewee is dependent upon the names of the model classes and field instances (unless explicit names are provided), so any renaming requires finding and updating corresponding SQL generation assertions.

What has been working well for this project has been very different than what I anticipated. Up-front I needed to do a lot more specification:

Provide thorough explanations of all the subtleties and pitfalls lurking in the current test suite (such as SQL being dependent on Model and Field names by default, or shared models vs local variations).
Thoroughly list all the high-level constructs and responsibilities in the core ORM, such as the database class, query-builder primitives, cursor handlers, model and field classes.
Instruct Claude to cross-reference these implementations with all their corresponding tests in the existing test suite and build a detailed report.
Warn against losing test cases or test methods, and instruct Claude to verify integrity of tests at each step.

Then I prompted Claude to propose a plan for reorganizing the tests in small, independent steps. After verifying the plan looked sound, I had Claude proceed. At each step, I verified the test-runner outputs and diffs against the current test-suite, to ensure the work was proceeding correctly. Claude ended up devising a system of using comments to serve as "anchors" within the test modules and was able to complete a mechanical reorganization without too many issues.

The work is still progressing, and some of it may be left undone as Claude's own analysis indicated that the risks may outweigh the benefits of attempting a deeper logical refactor.

Prompting converges with coding

On the whole, I believe Claude has saved me a lot of time, and many of the shortcomings were addressed by adapting my prompting strategy and breaking the work into smaller pieces. This seems in-line with what Nicholas Carlini at Anthropic describes from working on the Rust C Compiler:

So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

I went through several iterations of watching Claude attempt to refactor the test-suite, and based on the shortcomings of each run adapted my above prompting strategy to draw particular attention to the new class of problem which must be avoided. This was fairly effective.

In general, Claude is absolutely helpful for:

Identifying bugs in an existing implementation and proposing a minimal patch.
Writing failing test-cases and following the existing idioms of the test suite.
Identifying gaps in documentation and proposing ways to improve examples.
Organizing documentation along various axes (high-to-low level? topical? use-case? example-oriented vs reference manual? etc).
Proposing ways to reorganize test suite at a high-level and identifying gaps in test coverage.
Implementing refactoring plans in small steps with checks along the way (though the code produced often had minor faults or quirks).
Following-through on a detailed plan provided the instructions given ahead-of-time were clear enough. It took me a while to get the hang of this.

Claude did poorly at:

Coming up with a novel implementation that has a different "shape" than the existing one, and adapting all the places in the code to utilize this cleaner implementation. When prompted explicitly to use a particular approach, then it can accomplish this somewhat, but it does not seem to "find" these novel approaches on its own.
Refactoring without oversight. When I saw how thoroughly and accurately Claude could analyze problematic areas in code, I assumed that it could apply equal thoroughness and ingenuity to resolving issues through a refactor. Refactoring is tricky, because it's crucial to have both the big picture and the small details in-mind at all times. I found I had to be very explicit in order to get usable results. None of the actual test methods were even changed - this was merely moving existing sections around.

AI software development is treated with an almost religious reverence among founders, who believe they've found the silver bullet for converting ideas into profitability. Added to this intoxication is a parallel fear of obsolescence, being beaten to the market, or missing out on the AI Native revolution. I think both of these assumptions are flawed. My experience has been that AI overwhelmingly can increase my efficiency, both for writing code and verifying the correctness of existing code. When working on a larger, more complex problem, the prompt and the workflow begins to become incredibly important, and begins to take on many of the same shapes and aspects that coding itself requires. Taken to an extreme, prompting converges with coding.

In a similar way, I think the fears are misplaced. If AI were there then any pure-software or even mostly-software B2B type enterprise software would be under existential threat from hundreds of vibe-coded alternatives offering 80% of the features at a bargain price. Slack, SalesForce, PhotoShop, Office365, DocuSign - I don't believe vibe-coding AI-Native startups are going to displace these businesses any time soon. And I don't believe inertia or entrenchment are the main reasons for this. AI just isn't there yet.

I asked my wife, who is a nurse practitioner, her thoughts on all this, as AI has been a topic of discussion in her office and at conferences. The AI-Native vision for healthcare would be elimination of all EMR - the patient shows up with a complaint, an exam is performed while AI listens, AI orders any relevant tests which are fed back, analyzed and correlated with patient overall health, history and medications. AI correlates a diagnosis and develops a treatment plan or processes an admission and schedules a surgery. AI bills insurance and handles that side as well. Would this work?

She answered with a story about a patient they have who is dealing with heart failure. He is non-compliant about taking his medicine, but does not like to be told he is non-compliant. He is not terribly old, nor is he in terrible health, but his body is failing. She worries that people like him, people who may be partially non-compliant, people who don't tolerate certain medicines but can take them in a pinch, and all the other "edge cases" will not fit into an AI-run system. She related how a colleague who uses her phone during consultations to help outline notes finds that the AI-generated notes are generally good but occasionally they insert crazy diagnoses that were never even discussed. Nonetheless, her concern was that the human discretion and discernment that occurs during an exam, talking to the patient, would be lost. Forget the hallucinated diagnoses, forget the possibility of cutting off the wrong leg - the far more risky proposition was losing the interaction between patient and provider.

I'm guessing that none of this is news to any of us who've really spent time using AI, but I wanted to collect my thoughts here while they're fresh. Or perhaps that small-engine repair class I'm starting at the end of March is coming just in the nick of time.

It says it on the homepage, but it bears re-iterating here - no AI was used in any way to write or edit this post. There's enough slop out there already.

Slopification and Its Discontents

An insight occurs

Prompting converges with coding

Comments (0)

Post a comment