

Unsloth have just released benchmarks on how their dynamic quants perform for Qwen 3.5

https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks


I'm aware of that, but that's not what the post links to. The post is linking to their UD 2.0 quants from a few months back.

Also, the benchmarks exist because they messed up the first version of their Qwen 3.5 XL quants by quantizing some tensors to mxfp4 that should have been kept at higher precision, and this is their bugfix. The post literally starts out with "We updated Qwen3.5-35B Unsloth Dynamic quants being SOTA on nearly all bits" without explaining WHY they needed to update from the original version.


Didn't expect this to be on HN haha - but HN does have older posts come up sometimes.

No, your conclusion is false - only the old Q4_K_XL had slightly higher perplexity; all other quants are fine. We uploaded 9TB of research artifacts to https://huggingface.co/unsloth/Qwen3.5-35B-A3B-Experiments-G... for the community.

If you read our blog, it says KLD and PPL are actually sometimes counterintuitive - for example on MiniMax, some of our quants do worse on PPL and KLD vs AesSedai's, but AesSedai's does worse on LiveCodeBench by a lot; see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-3-...

This is because (see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-...) although bitwidths are in general monotonic, i.e. q2_k < q3_k < q4_k < q5_k etc, we find KLD and PPL are actually not monotonic, i.e. q3_k can actually have BETTER PPL than q4_k.

So the main point is bad luck on quantization - sometimes lower bits can get lower PPL and KLD, but that's misleading, since on actual real-world tasks they do worse.


The Q4_K_XL is easily the most popular quant for the model, though.

So then why was Q4_K_XL having issues? Is it just a PPL issue that doesn't show up in real-world usage? If so, why not just say that? "The Q4_K_XL had worse PPL, but don't worry, PPL can be misleading, and other benchmarks show it's fine." If it was a real quality issue, then what caused it?

The blog post says "Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for pure MXFP4_MOE" but doesn't say why. The easy assumption most people would make is "oh, you quanted attention or ssm or something to mxfp4 and that turned out to be bad, so you retire mxfp4", but if you say that's not it, then what's the actual issue?


Each layer is made up of various weights, and each weight (tensor) gets its own type when quantizing. A pure q8 will have all the weights as q8, and a q4 the same, but some are kept as f32, etc. Here's an example of q3_k_xl - https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF/tree/ma... - we can see certain weights are f32, q8, q5, q3, etc. They used mxfp4 in some weights, and mxfp4 doesn't seem to play nicely in quants, so that's why they are retiring it. Read their publication again and it should make more sense.
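
If you want to check this yourself, the gguf Python package (the one maintained alongside llama.cpp) can list the per-tensor types without loading the weights; roughly something like this (the file name is just a placeholder):

  # pip install gguf
  from gguf import GGUFReader

  # placeholder path - point it at whichever quant you downloaded
  reader = GGUFReader("Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf")

  for tensor in reader.tensors:
      # tensor_type is a GGMLQuantizationType enum (F32, Q8_0, Q4_K, ...)
      print(f"{tensor.name:50s} {tensor.tensor_type.name}")

That makes it easy to see which tensors in a given _XL quant got which bit width.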

I am aware of all that.

They literally never say “they used mxfp4 in some weights”. What you’re claiming they said doesn’t exist.

This isn’t a postmortem, it’s PR fluff without actually addressing the issue.


It's right there: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks - I looked at the weights before. It's not PR fluff; they made it clear by showing how badly it really affected various tensors.

"MXFP4 is much worse on many tensors - attn_gate, attn_q, ssm_beta, ssm_alpha using MXFP4 is not a good idea, and rather Q4_K is better - also MXFP4 uses 4.25 bits per weight, whilst Q4_K uses 4.5 bits per weight. It's better to use Q4_K than MXFP4 when choosing between them."

The Q4 quants had a mixture of mxfp4 leading to worse outcomes.
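
For reference, those bits-per-weight numbers fall straight out of the block layouts - this is my understanding of the formats (worth double-checking against the llama.cpp source), but the arithmetic matches the 4.25 vs 4.5 figures quoted above:

  # MXFP4: blocks of 32 4-bit values sharing one 8-bit exponent scale
  mxfp4_bpw = (32 * 4 + 8) / 32
  # Q4_K: super-blocks of 256 weights = 8 sub-blocks of 32;
  # 4 bits per weight, a 6-bit scale and a 6-bit min per sub-block,
  # plus two fp16 values (d, dmin) per super-block
  q4_k_bpw = (256 * 4 + 8 * (6 + 6) + 2 * 16) / 256
  print(mxfp4_bpw, q4_k_bpw)  # 4.25 4.5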


I’m curious how NVFP4 compares to their Q4.

Looking at their benchmarks, there doesn't appear to be a meaningful difference between their quants and bartowski's quants.

No, our new Qwen3.5 ones show the opposite; see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

Am I misreading the table?

  Unsloth Q4_K_M           PPL: 6.6053   KLD 99.9%: 0.5478   KLD mean: 0.0192
  bartowski Qwen_Q4_K_M    PPL: 6.6097   KLD 99.9%: 0.5771   KLD mean: 0.0182

Barely noticeable drop in PPL; noticeable KLD drop (good, 5%); but worse KLD mean (bad, 5%).

You forgot to check the disk space - _M and _XL are not the same size across quants:

Unsloth Q4_K_M     18.49GB   KLD 99.9%: 0.5478   KLD mean: 0.0192
Unsloth Q4_K_XL    19.17GB   KLD 99.9%: 0.4097   KLD mean: 0.0137
bartowski Q4_K_M   19.77GB   KLD 99.9%: 0.5771   KLD mean: 0.0182


The table doesn't have a bartowski Q4_K_XL to compare against, and given that the _M metrics aren't universally better, it's unclear whether the smaller size comes at a cost.
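
For anyone else squinting at those columns: as I understand it, KLD here is the per-token KL divergence between the quant's output distribution and the full-precision model's, reported as a mean and a 99.9th percentile. An illustrative numpy sketch (not the actual eval script):

  import numpy as np

  def kld_stats(p_full, p_quant, eps=1e-12):
      """p_full, p_quant: (n_tokens, vocab_size) probability arrays."""
      # per-token KL(full || quant)
      kld = np.sum(p_full * (np.log(p_full + eps) - np.log(p_quant + eps)), axis=-1)
      return kld.mean(), np.percentile(kld, 99.9)  # "KLD mean", "KLD 99.9%"

The percentile captures tail behaviour, which is presumably why both numbers get reported.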

The tweet storm has a bit more substance

e.g. it talks about running NVIDIA's systems (?) on AWS

> NVIDIA has long been one of our most important partners, and their chips are the foundation of AI computing. We are grateful for their continued trust in us, and excited to run their systems in AWS. Their upcoming generations should be great.


Probably something like NVLink Fusion. AWS has been doing deals with suppliers for which the smallest unit of deployable compute is a 44U rack (e.g. Oracle), so this is more of the same.

https://www.nvidia.com/en-us/data-center/nvlink-fusion/


> We continue to have a great relationship with Microsoft. Our stateless API will remain exclusive to Azure, and we will build out much more capacity with them.

This sounds a bit like, going forward, (some) OpenAI APIs will also run on platforms other than Azure (e.g. AWS)?

Does anyone know more?


I guess Amazon would have a hard time justifying their investment if OpenAI remained Azure-exclusive...

https://openai.com/index/amazon-partnership/


Curious what is meant by "stateless".

OpenAI desperately needs to be available outside Azure. We are exclusively using Anthropic atm because it is what is available in AWS Bedrock and it works. These things are solidifying fast.


Unless I'm mistaken, wasn't someone at Microsoft suggesting they would just develop their own models soon?

At least for me, Codex seems to write way more Python than Bash for general-purpose stuff.

Agreed! Very notable Codex behavior to prefer Python for scripting purposes.

I keep telling myself to make a good zx skills file or agents.md. I really like zx's ergonomics, & its output when it shells out is friendly.

Top comments are Lua. I respect it, and those look like neat tools, but please, not what I want to look at. It would be interesting to see how Lua fares for scripting purposes though; I haven't done enough IO to know what that would look like. Does it assume some uv wrapper too?


nb: there is an SBCL release at the end of every month: https://www.sbcl.org/all-news.html

We upgraded to 2.6.1 about a week ago and switched to using the new(ish) parallel(ish) garbage collector. I still can't tell what the impact has been.

Claude Code (which is a wizard at analyzing log files but also, I fear, an incorrigible curve-fitter) insisted that it was a real breakthrough and an excellent choice! On the other hand there was a major slowdown last night, ending in SBCL dying from heap exhaustion. I haven't had a chance to dig into that yet.


>SBCL dying from heap exhaustion

Due to hitting the cap, or to fragmentation? My understanding is the new parallel GC compacts the heap rather infrequently.


If by the cap you mean the heap size passed in as the --dynamic-space-size argument, it didn't hit the cap. It was using about 2/3 of that.

> My understanding is the new parallel GC compacts the heap rather infrequently

Can you explain more?


I'm going to caveat this by stating up front that obviously HN's source code is not public so I don't know what your hot path looks like, and that I'm not a domain expert on garbage collection, but I do write a fair amount of lisp for SBCL.

Immix-style collectors, like the new GC in SBCL, only compact on an opportunistic basis and so you get fragmentation pressure under load. In that situation, you might be well under the dynamic space size cap but if it can't find a large enough contiguous chunk of free heap it will still die.

So, fragmentation would be my prime suspect given what you described.


Sorry for suddenly clinging to you for support but might we be better off using the older GC in that case?

No problem. You might be better off moving back, yes.

My understanding of immix-style collection is that it divides the heap into blocks and lines. A block is only compacted/reused if every object in it is dead, and so if you mix lifetimes (i.e. lots of short-lived requests, medium-life sessions, long-life db connections/caches/interned symbols) then you tend to fill up blocks with a mix of short and long-lived objects as users log in and make requests.

When the requests get de-allocated the session remains (because the user closed the tab but didn't log out, for example, so the session is still valid) and so you end up with a bunch of blocks that are partially occupied by long-lived objects, and this is what drives fragmentation because live objects don't get moved/compacted/de-fragged very often. Eventually you fill up your entire heap with partially-allocated blocks and there is no single contiguous span of memory large enough to fit a new allocation and the allocator shits its pants.
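
A toy sketch of that pile-up effect (made-up numbers, purely illustrative - BLOCK_SLOTS standing in for an Immix block's capacity):

  import random

  BLOCK_SLOTS = 64
  blocks = []  # each block holds a list of object lifetimes

  # allocate mostly short-lived (requests) plus a few long-lived (sessions) objects
  for _ in range(10_000):
      if not blocks or len(blocks[-1]) == BLOCK_SLOTS:
          blocks.append([])
      blocks[-1].append("long" if random.random() < 0.05 else "short")

  # the short-lived objects die; a block is only reclaimable if everything in it died
  reclaimable = sum(all(obj == "short" for obj in block) for block in blocks)
  print(f"{reclaimable}/{len(blocks)} blocks fully free")  # usually only a handful

Even with only ~5% long-lived objects, almost every block ends up pinned by at least one survivor, which is exactly the fragmentation pressure I mean.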

So if that's what the HN backend looks like architecturally (mixed lifetimes), then you'd probably benefit from the old GC because when it collects, it copies all live objects into new memory and you get defragmentation "for free" as a byproduct. Obviously it's doing more writing so pauses can be more pronounced, but I feel like for a webapp that might be a good trade-off.

Alternatively you can allocate into dedicated arenas based on lifetime. That might be the best solution, at the expense of more engineering. Profiling and testing would tell you for sure.


I love HN. This is gold.


Hey, a different comment is put in highlights.

I might be wrong, but could it be that there’s an error?


Hmm let me check...

Edit: I just forgot to add it. eesh. added now thanks!


Hey it's totally possible that I'm actually a golden retriever who has no idea what he's talking about woof woof bark wag wag

Thank you!

You're welcome, good luck!

Totally. This kind of stuff is what keeps me coming back.

I have also seen some outright crashes on the new GC.

SBCL doesn't know when it's running low on available heap space? CLISP uses libsigsegv, so it knows when garbage collection is really needed and when it's not.

Ah, the enthusiasm to please from our AI minions. :)

Thanks. Your link gives more insight into "why submit now?" Appreciate it.

gForth [0] is great for getting started

If you are working with specific hardware (e.g. microcontrollers), it depends on which Forth dialects are available, but for the Raspberry Pi Pico and Pico 2 I recently found zeptoforth [1].

or you know you can always bootstrap your own :)

[0] https://gforth.org [1] https://github.com/tabemann/zeptoforth


not the author but afaiu r3 uses the "color" concept:

tokens are tagged by type via 8 bits (number literal, string, word call, word address, base word, …)

and the interpreter dispatches using these bits

it just doesn't use the colors visually in the editor and uses prefixes instead (" for string, : for code definition, ' for address of a word, …) which also means the representation in the editor matches that of the r3 source in files.
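
not r3's actual code, but a toy sketch of what dispatching on a per-token type tag can look like (tags, names, and the example are made up for illustration):

  # token = (tag, value); the tag plays the role of r3's 8-bit type byte
  NUM, STR, CALL, ADDR, PRIM = range(5)

  def interpret(tokens, words, stack):
      for tag, value in tokens:
          if tag in (NUM, STR):      # literals: push the value
              stack.append(value)
          elif tag == ADDR:          # ' word: push the word's definition itself
              stack.append(words[value])
          elif tag == CALL:          # word call: run its definition
              interpret(words[value], words, stack)
          elif tag == PRIM:          # base word: a built-in
              value(stack)
      return stack

  plus = lambda s: s.append(s.pop() + s.pop())
  print(interpret([(NUM, 2), (NUM, 3), (PRIM, plus)], {}, []))  # [5]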


It also means people with color vision deficiencies like me don't struggle distinguishing all the hues.

taped and transcribed by Jeff Fox

https://www.ultratechnology.com/1xforth.htm



Another one for the list; however, it hardly sounds like a killer application.


It's been my daily driver for close to a year now. It might not be a killer application, but it's certainly enough to prove Zig isn't vapourware.


If that is enough, there are plenty of languages around that fit the bill.

