Tech / Criticism

The agentic hype cycle is just the expert system hype cycle in a different font

Every few years the AI field rediscovers that you can chain decisions together and forgets that it tried this before.

The current thing is "agentic AI." The pitch is that instead of one model answering one question, you have a network of models calling each other, routing tasks, and accomplishing goals autonomously. This is described as a breakthrough. It is also, depending on how old you are, a description of expert systems circa 1985, or workflow automation circa 2015, or "just use a for-loop" circa always.1

I want to be fair. The underlying models are better now. The things being chained together are more capable than the rule-based nodes they are being compared to. That matters. It is not nothing.

But the framing does something specific and annoying: it treats the orchestration as the innovation. Most of what is being sold as "agentic" is a task queue with GPT-4 at each node. The nodes are good. The queue is a queue. The press release calls the queue "autonomous multi-agent reasoning" and the demo shows it booking a calendar event.2

The press release calls the queue "autonomous multi-agent reasoning" and the demo shows it booking a calendar event.

There was a paper, or a post, I cannot remember which, by someone at a serious institution (not a startup), arguing that the reliability requirements for actual autonomy scale nonlinearly with the length of the task chain.3 The argument was roughly: a 90% success rate per step sounds good and becomes catastrophic by step seven. He had a table. The table was clarifying. I want to find this and will link it when I do.


Buried in the footnotes of the eval

1. This is the actual problem with the agentic framing.

2. The benchmark showing 72% task completion on a curated eval.

2.5. The eval was designed by the company running the benchmark.

3. The real number, in production, on arbitrary tasks, is lower. Nobody publishes this.


What this means in practice is that most "agentic" products work in demos and degrade in production in ways that are specifically hard to debug, because the failure is somewhere in the middle of a chain and there is no clean error state. The user experience is: it almost worked, and now it is not clear where to start.

Anyway. I do not think this technology is without value. I think the claims are running several years ahead of the reliability, which is what hype cycles do, and the hangover is going to be proportional.4 The boring version of this technology, a model that reliably answers one question well, is already useful. The exciting version, a model that autonomously manages your business processes, is mostly a demo.

Footnotes
1

The expert system comparison is not original to me. There is a version of this argument I encountered somewhere that named specific systems from that era with the specific claims made about them. I cannot place it now. The pattern is accurate even if I have the genealogy slightly wrong.

2

This is not a hypothetical. The actual demo at the actual conference was a calendar event.

3

I think this was a LessWrong post. It may have been a substack. The institution was real. The table was real. I am not confident about anything else.

4

The hangover metaphor is tired and also accurate. Both things can be true.

— Pout

Comments

No comments yet.