I'm aware of that, but that's not what the post links to. The post is linking to their UD 2.0 quants from a few months back.
Also, the benchmarks are there because they messed up the first version of their Qwen 3.5 XL quants by quanting some tensors to mxfp4 that should have been kept at higher precision, and this is their bugfix. The post literally starts out with "We updated Qwen3.5-35B Unsloth Dynamic quants being SOTA on nearly all bits" without explaining WHY they needed to update from the original version.
If you read our blog, it says KLD and PPL are actually sometimes counterintuitive. For example, on MiniMax some of our quants do worse on PPL and KLD vs AesSedai's, but AesSedai's does worse on LiveCodeBench by a lot - see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-3-...
This is because (see https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-1-...), although bit-widths are in general monotonic, ie q2_k < q3_k < q4_k < q5_k etc, we find KLD and PPL are actually not monotonic, ie q3_k can actually have BETTER PPL than q4_k.
So the main point is bad luck in quantization: sometimes lower-bit quants might get lower PPL and KLD, but this is a ruse and wrong, since on actual real world tasks they're worse.
The Q4_K_XL is easily the most popular quant for the model, though.
So then why was Q4_K_XL having issues? Is it just a PPL issue that doesn't reflect in real world usage? If yes, why not just say that? "The Q4_K_XL had lower PPL, but don't worry, PPL can be wrong, and other benchmarks show it's fine". If it was a real quality issue, then what caused the issue?
The blog post says "Retiring MXFP4 from all GGUF quants: Q2_K_XL, Q3_K_XL and Q4_K_XL, except for pure MXFP4_MOE" but doesn't say why. The easy assumption that most people would make is "oh, you quanted attention or ssm tensors or something to mxfp4 and that turned out to be bad, so you retire mxfp4" but if you say that it's not that, then what's the actual issue?
each layer is made up of various weight tensors, and each tensor is quantized separately. a pure q8 will have all the weights as q8, likewise for a q4. but some are kept as f32, etc. here's an example of q3_k_xl - https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF/tree/ma... we can see certain weights are f32, q8, q5, q3, etc. They used mxfp4 in some weights and mxfp4 doesn't seem to play nicely in mixed quants, so that's why they are retiring it. read their publication again and it should make more sense.
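To make the "mixed quant" idea concrete, here's a toy sketch of a per-tensor quant plan. The tensor names and the exact type assignments are hypothetical (not taken from any real Unsloth file); the bits-per-weight figures for Q4_K (4.5) and MXFP4 (4.25) come from the quoted blog text, the rest are approximate GGUF values.

```python
# Illustrative sketch (hypothetical tensor names/assignments): a "Q3_K_XL"
# file is not uniformly 3-bit -- each tensor gets its own quant type.
quant_plan = {
    "token_embd.weight":      "Q8_0",  # embeddings kept near-lossless
    "blk.0.attn_q.weight":    "Q5_K",  # attention bumped up a notch
    "blk.0.ffn_down.weight":  "Q3_K",  # bulk of the FFN at the low bit-width
    "blk.0.attn_norm.weight": "F32",   # norms are tiny, left unquantized
}

def bits_per_weight(qtype: str) -> float:
    # rough effective bits per weight for a few GGUF types
    return {"F32": 32.0, "Q8_0": 8.5, "Q5_K": 5.5,
            "Q4_K": 4.5, "MXFP4": 4.25, "Q3_K": 3.4375}[qtype]

for name, qtype in quant_plan.items():
    print(f"{name:26s} {qtype:6s} ~{bits_per_weight(qtype)} bpw")
```

Swapping one entry in a plan like this (say, an attn_q tensor from Q4_K to MXFP4) is exactly the kind of change being discussed, which is why it can hurt quality without the headline quant name changing at all.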
"MXFP4 is much worse on many tensors - attn_gate, attn_q, ssm_beta, ssm_alpha using MXFP4 is not a good idea, and rather Q4_K is better - also MXFP4 uses 4.25 bits per weight, whilst Q4_K uses 4.5 bits per weight. It's better to use Q4_K than MXFP4 when choosing between them."
The Q4 quants had a mixture of mxfp4 leading to worse outcomes.
The table doesn't include bartowski's Q4_K_XL for comparison, and given that the _M metrics aren't universally better, it's unclear whether the smaller size comes without a cost.
e.g. it talks about running NVIDIA's systems (?) on AWS
> NVIDIA has long been one of our most important partners, and their chips are the foundation of AI computing. We are grateful for their continued trust in us, and excited to run their systems in AWS. Their upcoming generations should be great.
Probably something like NVLink Fusion. AWS has been doing deals with suppliers for which the smallest unit of deployable compute is a 44U rack (e.g. Oracle), so this is more of the same.
> We continue to have a great relationship with Microsoft. Our stateless API will remain exclusive to Azure, and we will build out much more capacity with them.
This sounds a bit like, going forward, (some) OpenAI APIs will also run on platforms other than Azure (e.g. AWS)?
OpenAI desperately needs to be available outside Azure. We are exclusively using Anthropic atm because it is what is available in AWS Bedrock and it works. These things are solidifying fast.
Agreed! Very notable codex behavior to prefer python for scripting purposes.
I keep telling myself to make a good zx skill or agents.md. I really like zx's ergonomics, and its output when it shells out is friendly.
Top comments are lua. I respect it, and those look like neat tools. But please, not what I want to look at. It would be interesting to see how Lua fares for scripting purposes though; I haven't done enough I/O to know what that would look like. Does it assume some uv wrapper too?
We upgraded to 2.6.1 about a week ago and switched to using the new(ish) parallel(ish) garbage collector. I still can't tell what the impact has been.
Claude Code (which is a wizard at analyzing log files but also, I fear, an incorrigible curve-fitter) insisted that it was a real breakthrough and an excellent choice! On the other hand there was a major slowdown last night, ending in SBCL dying from heap exhaustion. I haven't had a chance to dig into that yet.
I'm going to caveat this by stating up front that obviously HN's source code is not public so I don't know what your hot path looks like, and that I'm not a domain expert on garbage collection, but I do write a fair amount of lisp for SBCL.
Immix-style collectors, like the new GC in SBCL, only compact on an opportunistic basis and so you get fragmentation pressure under load. In that situation, you might be well under the dynamic space size cap but if it can't find a large enough contiguous chunk of free heap it will still die.
So, fragmentation would be my prime suspect given what you described.
No problem. You might be better off moving back, yes.
My understanding of immix-style collection is that it divides the heap into blocks and lines. A block is only compacted/reused if every object in it is dead, and so if you mix lifetimes (i.e. lots of short-lived requests, medium-life sessions, long-life db connections/caches/interned symbols) then you tend to fill up blocks with a mix of short and long-lived objects as users log in and make requests.
When the requests get de-allocated the session remains (because the user closed the tab but didn't log out, for example, so the session is still valid) and so you end up with a bunch of blocks that are partially occupied by long-lived objects, and this is what drives fragmentation because live objects don't get moved/compacted/de-fragged very often. Eventually you fill up your entire heap with partially-allocated blocks and there is no single contiguous span of memory large enough to fit a new allocation and the allocator shits its pants.
So if that's what the HN backend looks like architecturally (mixed lifetimes), then you'd probably benefit from the old GC because when it collects, it copies all live objects into new memory and you get defragmentation "for free" as a byproduct. Obviously it's doing more writing so pauses can be more pronounced, but I feel like for a webapp that might be a good trade-off.
Alternatively you can allocate into dedicated arenas based on lifetime. That might be the best solution, at the expense of more engineering. Profiling and testing would tell you for sure.
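The mixed-lifetime failure mode above can be shown with a toy model. This is not SBCL's actual collector, just a deterministic simulation under the stated assumption: a block is only reclaimed when every object in it is dead, and live objects are never moved.

```python
# Toy model of mixed-lifetime fragmentation: blocks are reclaimed only when
# *every* object in them is dead (no compaction of live objects).
BLOCK_SLOTS = 16
blocks = []  # each block is a list of object lifetimes ("short" / "long")

def allocate(lifetime):
    # bump-allocate into the newest block, opening a new one when it's full
    if not blocks or len(blocks[-1]) == BLOCK_SLOTS:
        blocks.append([])
    blocks[-1].append(lifetime)

# interleave many short-lived request objects with occasional
# long-lived sessions (1 in 50)
for i in range(10_000):
    allocate("long" if i % 50 == 0 else "short")

# collect: short-lived objects die; a block survives if anything in it lives
live_blocks = [b for b in blocks if "long" in b]
survivors = sum(1 for b in live_blocks for x in b if x == "long")

print(f"blocks kept: {len(live_blocks)} / {len(blocks)}")
print(f"live objects: {survivors}, slots pinned: {len(live_blocks) * BLOCK_SLOTS}")
# -> blocks kept: 200 / 625
# -> live objects: 200, slots pinned: 3200
```

Here 200 live sessions pin 3200 slots' worth of heap, because each one landed in a block alongside dead request objects. A copying collector would squeeze those 200 objects into ~13 blocks; lifetime-segregated arenas would avoid the mixing in the first place.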
SBCL doesn't know when it's running low on available heap space? clisp uses libsigsegv, so it really knows when to garbage collect and when it's not needed.
if you are working with specific hardware (e.g. microcontrollers) it depends on which forth dialects are available, but for the Raspberry Pi Pico and Pico 2 I recently found zeptoforth [1]
not the author but afaiu r3 uses the "color" concept:
tokens are tagged by type via 8 bits (number literal, string, word call, word address, base word, …)
and the interpreter dispatches using these bits
it just doesn't use the colors visually in the editor and uses prefixes instead (" for string, : for code definition, ' for address of a word, …) which also means the representation in the editor matches that of the r3 source in files.
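a tiny sketch of what tag-based dispatch looks like (hypothetical tag values, not r3's actual encoding - just the shape of the idea):

```python
# Type-tagged token dispatch: each token carries a tag byte that tells the
# interpreter what to do with the payload, instead of re-parsing prefixes.
TAG_NUMBER, TAG_STRING, TAG_CALL, TAG_ADDRESS = 0, 1, 2, 3

stack = []

def interp(tokens, words):
    for tag, payload in tokens:
        if tag == TAG_NUMBER:        # number literal: push it
            stack.append(payload)
        elif tag == TAG_STRING:      # " prefix in the editor
            stack.append(payload)
        elif tag == TAG_ADDRESS:     # ' prefix: push the word itself
            stack.append(words[payload])
        elif tag == TAG_CALL:        # bare word: execute it
            words[payload]()

words = {"+": lambda: stack.append(stack.pop() + stack.pop())}
interp([(TAG_NUMBER, 2), (TAG_NUMBER, 3), (TAG_CALL, "+")], words)
print(stack)  # -> [5]
```

the editor prefixes (", :, ') would be consumed at tokenization time, leaving only the tag bits for the interpreter to switch on.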