vibe engineering 1
long overdue draft. epistemic status: collection of hacks.
has anyone already tried to hand a huge task over to Claude and loop it so that it keeps working for a long, long time?
yes; tldr it works, but only in a quite narrow set of cases.
1. you must have a good verifier in advance. example of a problem with a good natural verifier: find a way/configuration to use an underdocumented api so that it does something you want.
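for concreteness, a minimal sketch of what such a verifier gate can look like in the loop. everything here is a hypothetical stand-in: `candidate.py` is whatever claude produced, `solve()` is the agreed entry point, and the expected string is your known-good answer.

```python
# minimal verifier gate for an unattended loop; all names are hypothetical.
import subprocess
import sys

def verify() -> bool:
    # run the candidate in a fresh process so leftover interpreter state
    # can't fake a pass between iterations
    result = subprocess.run(
        [sys.executable, "-c", "import candidate; print(candidate.solve())"],
        capture_output=True, text=True, timeout=30,
    )
    # the verdict is a plain exact-match check: no judgment calls,
    # which is exactly what makes it usable without a human in the loop
    return result.returncode == 0 and result.stdout.strip() == "expected output"
```

the point is that `verify()` is binary and mechanical: the loop either gets a green light or a red one, and claude never gets to argue with it.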
2. problems or sub-problems that require mechanism design are a deal-breaker, i.e. the problem must be decomposable into independently-verifiable components with independent subsets of requirements; otherwise you need an exhaustive test suite in advance. so composable platforms are out.
3. concurrency is a deal-breaker as well, eg fancy data processing pipelines are ok, a process scheduler or even leader election is not.
4. architecture can be a deal-breaker. in particular, limiting file/function sizes helps a lot because it both caps the complexity of any particular component and makes more dependencies explicit. any hidden state is complexity. functional programming somewhat helps here.
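a toy illustration of the hidden-state point (names made up): the first version's dependency on the config is invisible at the call site, the second moves everything into the signature, so reading the file gives the model the complete picture.

```python
# hidden state vs explicit dependencies, toy example with made-up names
_config = {"rate": 0.2}          # hidden module-level state

def price_with_hidden_state(amount: float) -> float:
    # depends on a mutable global: a deal-breaker for per-file reasoning
    return amount * (1 + _config["rate"])

def price_explicit(amount: float, rate: float) -> float:
    # everything the function reads arrives through its parameters,
    # so reading this file alone is enough to verify it
    return amount * (1 + rate)
```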
5. remember it’s a language model: when it names things itself during the decomposition, it helps, because the names are then closer in representation space to what claude imagines when it sees them.
6. dumb but working heuristic: hard-limit the file size in LOC. claude must only ever read files completely.
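the heuristic is easy to enforce mechanically, eg as a pre-commit or CI gate. a sketch, with a 300-line cap as an arbitrary pick, tune to taste:

```python
# LOC gate: flag every python file too big to be read completely in one go.
# the 300-line cap is an arbitrary choice, not a recommendation.
from pathlib import Path

MAX_LOC = 300

def oversized_files(root: str, max_loc: int = MAX_LOC) -> list[str]:
    bad = []
    for path in sorted(Path(root).rglob("*.py")):
        loc = len(path.read_text(encoding="utf-8").splitlines())
        if loc > max_loc:
            bad.append(f"{path}: {loc} lines (limit {max_loc})")
    return bad
```

wire it so a non-empty result fails the build; claude then gets a hard signal to split the file instead of a soft suggestion in the prompt.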
7. more caveats: the complexity meter includes all implicit behaviors of the language and the libraries; in particular, macros are a deal-breaker (often for humans as well, still). same for hooks, declarative composition of whatever, interpreters, and any other kind of indirect control/data flow.
8. once you exceed the complexity that fits into the model’s internal representation space via any of the ways mentioned above (the list is incomplete, there are other caveats), you need an exponential amount of tokens, i.e. 2-3x for every new distinct feature/interaction of features. at this point your verifier better be capital-F Fast (seconds, not minutes, for the exhaustive suite).
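one way to keep yourself honest about the speed requirement: treat verifier latency as a hard budget and fail the run when it is exceeded. a sketch; the command and the 10-second budget are placeholders for your own suite.

```python
# wrap the exhaustive suite in a latency budget; cmd and budget are placeholders
import subprocess
import sys
import time

def timed_suite(cmd: list[str], budget_s: float = 10.0) -> bool:
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True)
    elapsed = time.monotonic() - start
    if elapsed > budget_s:
        # a slow verifier silently kills the loop's iteration rate,
        # so make slowness itself a failure
        print(f"suite too slow: {elapsed:.1f}s > {budget_s}s budget", file=sys.stderr)
        return False
    return result.returncode == 0
```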
9. anthropic is literally out of compute, so they’re fighting fires and have set up a lot of artificial limits to save on token usage with claude code on subsidized subscriptions; unfortunately, at this point it is mandatory to work around/reconfigure/disable them.
10. xbow and other fancy harnesses really do help with divide-and-conquering the complexity, and with poking holes via good feedback to claude or other LLMs by sheer brute force.
11. making and cross-checking 4x-8x implementations of the same feature in parallel is one genuinely new technique that helps as well. but at this point we’re talking hundreds to thousands of $/day in API spend, not $20 or even $200/mo subscriptions.
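the cross-check itself is cheap once the implementations exist: run all of them on the same random inputs and flag any disagreement. a sketch; the implementations and the input generator here are hypothetical stand-ins for your own.

```python
# differential cross-check over N independent implementations; any
# disagreement means at least one of them is wrong, and the failing
# input doubles as a ready-made regression test.
import random
from typing import Any, Callable

def cross_check(impls: list[Callable[[Any], Any]],
                gen_input: Callable[[random.Random], Any],
                trials: int = 1000, seed: int = 0) -> list[tuple[Any, list[Any]]]:
    rng = random.Random(seed)   # seeded so failures are reproducible
    disagreements = []
    for _ in range(trials):
        x = gen_input(rng)
        outputs = [impl(x) for impl in impls]
        if any(o != outputs[0] for o in outputs[1:]):
            disagreements.append((x, outputs))
    return disagreements
```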
12. poking holes with claude is a good thing, useful for human-generated code as well; think of it as a very glorified fuzzer. just don’t believe the fancy explanations and don’t rush to fix: a fancier explanation does not imply the problem is more real. the price caveat still applies.
to be continued i guess.
