
> If you asked me six months ago what I thought of generative AI, I would have said

It’s always this tired argument. “But it’s so much better than six months ago, if you aren’t using it today you are just missing out.”

I’m tired of the hype, boss.



This point about "people were already saying six months ago that it was better than it had been six months before" is regularly trotted out in threads like this as if it were some sort of trump card that proves AI is just hype. It doesn't make sense to me. What else do you expect people to be saying about a rapidly-improving technology? How does it help you distinguish technologies that are hype from those that are not?

I'm sure people were saying similar things about, say, aviation all through the first decades of the 20th century, "wow, those planes are getting better every few years"... "Until recently planes were just gimmicks, but now they can fly across the English channel!"... "I wouldn't have got in one of those death traps 5 years ago, but now I might consider it!" And different people were saying things like that at different times, because they had different views of the technology, different definitions of usefulness, different appetites for risk. It's just a wide range of voices talking in similar-sounding terms about a rapidly-developing technology over a span of time.

This is just how people are going to talk about rapidly-improving technologies for which different people have different levels of adoption at different times. It's not a terribly interesting point. You have to engage with the specifics, I'm afraid.


The second half of that argument was not in this article. The author was just relating his experience.

For what it is worth, I have also gone from a "this looks interesting" to "this is a regular part of my daily workflow" in the same 6 month time period.


"The challenge isn’t choosing “AI or not AI” - that ship has sailed."


I’m a light LLM user myself, and I still write most of the important code by hand.

Even I can see there has been a clear advancement in performance in the past six months. There will probably be another incremental step 6 months from now.

I use LLMs in a project that generates suggestions for what was previously a manual data entry job. Six months ago the LLM suggestions were hit or miss. Using a recent model, they’re over 90% accurate. Everything is still manually reviewed by humans, but having a recent model handle the grunt work has been game changing.
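Roughly, the shape of it (a minimal sketch using the OpenAI Python client; the model name, prompt, and fields are illustrative stand-ins, not details from the real project):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def suggest_entry(raw_text: str) -> str:
        """Draft a structured suggestion for one record."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; any recent model
            messages=[
                {"role": "system",
                 "content": "Extract name, date, and amount from this record as JSON."},
                {"role": "user", "content": raw_text},
            ],
        )
        return response.choices[0].message.content

    def review_queue(records: list[str]) -> list[str]:
        """Human-in-the-loop pass: every suggestion is still reviewed."""
        results = []
        for raw in records:
            draft = suggest_entry(raw)
            if input(f"Suggested: {draft}\nAccept? [y/N] ").lower() == "y":
                results.append(draft)
            else:
                results.append(f"NEEDS MANUAL ENTRY: {raw}")  # the hit-or-miss tail
        return results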

If people are drinking a firehose of LinkedIn-style influencer hype posts, I can see why it’s tiresome. I ignore those, and I think everyone else should too. There is real progress being made, though.


I think the rapid iteration and lack of consistency from the model providers is really killing the hype here. You see HN stories all the time about how things are getting worse, and it seems folks' success with the major models is starting to diverge widely.

The model providers should really start having LTS (at least 2 years) offerings that deliver consistent results regardless of load, IMO. Folks are tired of the treadmill and just want some stability here, and if the providers aren't going to offer it, llama.cpp will...
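If the providers won't offer that, the local route looks roughly like this (a sketch using the llama-cpp-python bindings; the model path and settings are placeholders): pin one exact model file and library version, and the results stop shifting under you.

    from llama_cpp import Llama

    # Pin an exact local model file: behavior stays consistent regardless of
    # provider load, silent model swaps, or upstream quantization changes.
    llm = Llama(
        model_path="./models/pinned-model-q8_0.gguf",  # placeholder path
        n_ctx=8192,  # fixed context size
        seed=42,     # fixed seed for reproducible sampling
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize this support ticket."}],
        temperature=0.0,  # greedy-ish decoding for stable outputs
    )
    print(out["choices"][0]["message"]["content"])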


There is a difference between a quantized SOTA model and an old model. People want non-quantized SOTA models, not old models.
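(For anyone unfamiliar with the distinction: quantization compresses a model's weights into fewer bits, trading some accuracy for size and speed. A toy numpy sketch of symmetric int8 weight quantization, not any provider's actual scheme:)

    import numpy as np

    def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
        """Symmetric int8 quantization: map floats onto [-127, 127] with one scale."""
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_int8(w)
    print("max round-trip error:", np.abs(w - dequantize(q, scale)).max())
    # The error is small but nonzero -- that loss is why people want
    # the non-quantized SOTA model rather than a quantized one.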


Put that all aside. Why can’t they demo a model on max load to show what it’s capable of…?

Yeah, exactly.


Yeah, I hear this a lot. Do people genuinely dismiss that there has been step-change progress on a 6-12 month timescale? I mean, it’s night and day, look at the benchmark numbers… “yeah, I don’t buy it” - ok, but then don’t pretend you’re objective.


I think I'd be in the "don't buy it" camp, so maybe I can explain my thinking at least.

I don't deny that there's been huge improvements in LLMs over the last 6-12 months at all. I'm skeptical that the last 6 months have suddenly presented a 'category shift' in terms of the problems LLMs can solve (I'm happy to be proved wrong!).

It seems to me like LLMs are better at solving the same problems that they could solve 6 months ago, and the same could be said comparing 6 months to 12 months ago.

The argument I'd dismiss isn't the improvement, it's that there's a whole load of sudden economic factors, or use cases, that have been unlocked in the last 6 months because of the improvements in LLMs.

That's kind of a fuzzier point, and a hard one to know until we all have hindsight. But I think OP is right that people have been claiming "LLMs are fundamentally in a different category to where they were 6 months ago" for the last 2 years - and so far, none of those big improvements has unlocked a whole new category of use cases for LLMs.

To be honest, it's a very tricky thing to weigh in on, because the claims being made about LLMs range from "we're 2 months away from all disease being solved" to "LLMs are basically just a bit better than old-school Markov chains". I'd argue that clearly neither of those is true, but it's hard to orient yourself when both sides are being claimed at the same time.


The improvement in LLMs has come in the form of more successful one shots, more successful bug finding, more efficient code, less time hand-holding the model.

"Problem solving" (which definitely has improved, but maybe has a spikey domain improvement profile) might not be the best metric, because you could probably hand hold the models of 12 months ago to the same "solution" as current models, but you would spend a lot of time hand holding.


> The argument I'd dismiss isn't the improvement, it's that there's a whole load of sudden economic factors, or use cases, that have been unlocked in the last 6 months because of the improvements in LLMs.

Yes, I agree in principle here, at least in some cases: I think there are certainly problems that LLMs are now better at, but that don't reach the critical reliability threshold where you can say "it can do this". E.g. hallucinations, handling long context well (it's still best practice to reset the context window frequently), long-running tasks, etc.
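(The "reset the context window" practice is simple enough to sketch; here as a message-history cap in Python, with the threshold purely illustrative:)

    MAX_TURNS = 10  # illustrative; tune to the model's effective context

    def trim_history(messages: list[dict]) -> list[dict]:
        """Keep the system prompt plus only the most recent turns.

        Long contexts still degrade answer quality, so periodically
        resetting or compacting the window remains best practice.
        """
        system, rest = messages[:1], messages[1:]
        return messages if len(rest) <= MAX_TURNS else system + rest[-MAX_TURNS:]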

> That's kind of a fuzzier point, and a hard one to know until we all have hindsight. But I think OP is right that people have been claiming "LLMs are fundamentally in a different category to where they were 6 months ago" for the last 2 years - and so far, none of those big improvements has unlocked a whole new category of use cases for LLMs.

This is where I disagree (but again you are absolutely right for certain classes of capabilities and problems).

- Claude Code did not exist until 2025

- We have gone from people using coding agents for like ~10% of their workflow to like 90-100%, pretty typically. Like code completion --> a reasonably good SWE (with caveats and pain points I know all too well). This is a big step change in what you can actually do; it's not like we're still doing only code completion and it's marginally better.

- Long-horizon task success rates have now gotten good enough to basically enable the above (a good SWE) for things like refactors and complicated debugging with competing hypotheses, looping attempts until success (a minimal sketch of this loop follows below)

- We have nascent UI agents now; they are fragile, but they will likely follow a similar path as coding, which opens up yet another universe of things you can only do with a UI

- Enterprise voice agents (for like frontline support) now have a low enough bounce rate that you can actually deploy them

So we've gone from "this looks promising" to production deployment and very serious usage. This may be, as you say, "the same capabilities, just getting gradually better", but at some point that becomes a step change. Above a certain failure rate (which may be hard to pin down explicitly) it's not tolerable to deploy, but as evidenced by adoption alone, we've crossed that threshold, especially for coding agents. Even sonnet 4 -> opus 4.5 has for me personally (beyond just benchmark numbers) made full project loops possible, in a way that sonnet 4 would have convinced you it could do and then wasted two whole days of your time banging your head against the wall. The same is true for opus 4.5, just at much larger task sizes.
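The looping pattern from the list above is simple to state in code. A minimal sketch, where run_agent and the pytest check stand in for whatever agent and verification step you actually use:

    import subprocess

    def run_agent(task: str, feedback: str) -> None:
        """Placeholder for one attempt: a coding-agent call would go here."""
        print(f"agent attempt: {task} ({feedback or 'first try'})")

    def tests_pass() -> bool:
        """Verification step: here, just run the project's test suite."""
        return subprocess.run(["pytest", "-q"]).returncode == 0

    def loop_until_success(task: str, max_attempts: int = 5) -> bool:
        """Attempt, verify, feed the failure back, retry.

        This only pays off once per-attempt success rates are high enough;
        below that threshold the loop just burns time and tokens.
        """
        feedback = ""
        for attempt in range(1, max_attempts + 1):
            run_agent(task, feedback)
            if tests_pass():
                return True
            feedback = f"attempt {attempt} failed the test suite; fix and retry"
        return False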

> To be honest, it's a very tricky thing to weigh in on, because the claims being made about LLMs range from "we're 2 months away from all disease being solved" to "LLMs are basically just a bit better than old-school Markov chains". I'd argue that clearly neither of those is true, but it's hard to orient yourself when both sides are being claimed at the same time.

Precisely. Lots and lots of hyperbole, some with varying degrees of underlying truth. But I would say: the true underlying reality here is somewhat easy to follow along with hard numbers if you look hard enough. Epoch.ai is one of my favorite sources for industry analysis, and e.g. Dwarkesh Patel is a true gift to the industry. Benchmarks are really quite terrible and shaky, so I don't necessarily fault people "checking the vibes", e.g. like Simon Willison's pelican task is exactly the sort of thing that's both fun and also important!



