It’s hard to elaborate just how wild this model might be if it performs as claim...

Aurornis · 2026-02-03T20:20:51 1770150051

I experimented with the Q2 and Q4 quants. First impression is that it's amazing we can run this locally, but it's definitely not at Sonnet 4.5 level at all.

Even for my usual toy coding problems it would get simple things wrong and require some poking to get to it.

A few times it got stuck in thinking loops and I had to cancel prompts.

This was using the recommended settings from the unsloth repository. It's always possible that there are some bugs in early implementations that need to be fixed later, but so far I don't see any reason to believe this is actually a Sonnet 4.5 level model.

Kostic · 2026-02-03T20:53:19 1770151999

I would not go below q8 if comparing to sonnet.

anon373839 · 2026-02-04T12:30:18 1770208218

Yeah. Q2 in any model is just severely damaged, unfortunately. Wish it weren’t so.

cubefox · 2026-02-03T20:58:45 1770152325

> I experimented with the Q2 and Q4 quants.

Of course you get degraded performance with this.

Aurornis · 2026-02-03T22:40:15 1770158415

Obviously. That's why I led with that statement.

Those are the quant thresholds where people with mid-high end hardware can run this locally at reasonable speed, though.

In my experience Q2 is flakey, but Q4 isn't dramatically worse.

cubefox · 2026-02-04T09:48:31 1770198511

> Obviously. That's why I led with that statement.

Then why did you write this?

> It's always possible that there are some bugs in early implementations that need to be fixed later, but so far I don't see any reason to believe this is actually a Sonnet 4.5 level model.

margalabargala · 2026-02-03T20:59:37 1770152377

Wonder where it falls on the Sonnet 3.7/4.0/4.5 continuum.

3.7 was not all that great. 4 was decent for specific things, especially self contained stuff like tests, but couldn't do a good job with more complex work. 4.5 is now excellent at many things.

If it's around the perf of 3.7, that's interesting but not amazing. If it's around 4, that's useful.

Computer0 · 2026-02-04T01:50:24 1770169824

I still have yet to find a "Small" model that can use function calls consistently enough to not be frustrating. That is the most noticeable difference I consistently see between even older "SOTA" models and the best performing "SMALL" models (<70b).

cmrdporcupine · 2026-02-03T21:51:19 1770155479

It feels more like Haiku level than Sonnet 4.5 from my playing with it.

cirrusfan · 2026-02-03T16:45:25 1770137125

If it sounds too good to be true…

theshrike79 · 2026-02-03T17:01:33 1770138093

Should be possible with optimised models, just drop all "generic" stuff and focus on coding performance.

There's no reason for a coding model to contain all of ao3 and wikipedia =)

jstummbillig · 2026-02-03T18:31:37 1770143497

There is: It works (even if we can't explain why right now).

If we knew how to create a SOTA coding model by just putting coding stuff in there, that is how we would build SOTA coding models.

moffkalast · 2026-02-03T18:22:15 1770142935

That's what Meta thought initially too, training codellama and chat llama separately, and then they realized they're idiots and that adding the other half of data vastly improves both models. As long as it's quality data, more of it doesn't do harm.

Besides, programming is far from just knowing how to autocomplete syntax, you need a model that's proficient in the fields that the automation is placed in, otherwise they'll be no help in actually automating it.

theshrike79 · 2026-02-03T21:13:08 1770153188

But as far as I know, that was way before tool calling was a thing.

I'm more bullish about small and medium sized models + efficient tool calling than I'm about LLMs too large to be run at home without $20k of hardware.

The model doesn't need to have the full knowledge of everything built into it when it has the toolset to fetch, cache and read any information available.

noveltyaccount · 2026-02-03T17:29:04 1770139744

I think I like coding models that know a lot about the world. They can disambiguate my requirements and build better products.

regularfry · 2026-02-03T17:52:05 1770141125

I generally prefer a coding model that can google for the docs, but separate models for /plan and /build is also a thing.

noveltyaccount · 2026-02-03T18:31:00 1770143460

> separate models for /plan and /build

I had not considered that, seems like a great solution for local models that may be more resource-constrained.

regularfry · 2026-02-03T18:49:11 1770144551

You can configure aider that way. You get three, in fact: an architect model, a code editor model, and a quick model for things like commit messages. Although I'm not sure if it's got doc searching capabilities.

MarsIronPI · 2026-02-03T17:34:18 1770140058

But... but... I need my coding model to be able to write fanfiction in the comments...

wongarsu · 2026-02-03T22:25:19 1770157519

Now I wonder how strong the correlation between coding performance and ao3 knowledge is in human programmers. Maybe we are on to something here /s

FuckButtons · 2026-02-03T20:12:01 1770149521

There have been advances recently (last year) in scaling deep rl by a significant amount, their announcement is in line with a timeline of running enough experiments to figure out how to leverage that in post training.

Importantly, this isn’t just throwing more data at the problem in an unstructured way, afaik companies are getting as many got histories as they can and doing something along the lines of, get an llm to checkpoint pull requests, features etc and convert those into plausible input prompts, then run deep rl with something which passes the acceptance criteria / tests as the reward signal.

Der_Einzige · 2026-02-03T20:16:51 1770149811

It literally always is. HN Thought DeepSeek and every version of Kimi would finally dethrone the bigger models from Anthropic, OpenAI, and Google. They're literally always wrong and average knowledge of LLMs here is shockingly low.

cmrdporcupine · 2026-02-03T21:52:52 1770155572

Nobody has been saying they'd be dethroned. We're saying they're often "good enough" for many use cases, and that they're doing a good job of stopping the Big Guys from creating a giant expensive moat around their businesses.

Chinese labs are acting as a disruption against Altman etcs attempt to create big tech monopolies, and that's why some of us cheer for them.

Der_Einzige · 2026-02-04T05:05:28 1770181528

"Nobody says X" is as presumptuous and wrong (both metaphorically and literally) as "LLMs can't do X". It is one of the worst thought terminating cliches.

Thousands have been saying this, you aren't paying attention.

cmrdporcupine · 2026-02-04T05:25:29 1770182729

As thought terminating as "HN Thought [insert strawman here]"

C'mon.