
I also see, VERY interesting:

  - Linux container support (So overlayfs is ZFS aware)
  - BLAKE3 checksums (So faster checksums)
  - Zstd early abort (I thought Zstd already had early abort! This means you can enable Zstd compression by default, and if the data isn't compressible, it aborts the operation and stores it uncompressed, avoiding the extra CPU use.)
  - Fully adaptive ARC eviction (Separate adaptive ARC for data and for metadata. Until now, the metadata ARC was almost FIFO and not efficient)
  - Prefetch improvements (That's always nice)


> This means you can enable Zstd compression by default, and if the data isn't compressible, it aborts the operation and stores it uncompressed, avoiding the extra CPU use

This was already implemented for all compression algorithms since ZFS was released as open source in 2005, at least (and even though zstd didn't yet exist then, it still benefited from this feature when it was integrated into ZFS).

The feature you describe only saves CPU time on decompression, but still requires ZFS to try to compress all data before storing it on disk.
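
In rough, illustrative C (written against the userland zstd API, not the actual ZFS write path, and with a made-up helper name), that long-standing behavior looks something like this:

  /* Illustrative sketch only -- not the actual OpenZFS write path. */
  #include <string.h>
  #include <zstd.h>

  /* Every block gets handed to the compressor. It is stored compressed
   * only if that saves at least ~12.5% (expressed here by giving the
   * compressor a smaller output budget); otherwise the raw block is
   * stored, so reads skip decompression -- but the write has already
   * paid the full compression cost either way. */
  static size_t store_block(void *dst, const void *src, size_t lsize, int level)
  {
      size_t budget = lsize - (lsize >> 3);      /* must save at least 1/8 */
      size_t psize = ZSTD_compress(dst, budget, src, lsize, level);

      if (ZSTD_isError(psize)) {                 /* didn't fit: store raw */
          memcpy(dst, src, lsize);
          return lsize;
      }
      return psize;                              /* store compressed */
  }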

This new early abort feature actually seems to be even more clever/sophisticated, because it also saves a lot of CPU time when compressing, not just decompressing.

Usually you are forced to select a fast compression level so that performance doesn't degrade too much for little benefit: with a higher compression level, ZFS would always pay a high CPU cost trying to compress data that often isn't even compressible (a lot of data on a typical filesystem belongs to large files, which are often not compressible).

However, with early abort, you can select a higher compression level (zstd-3 or higher) without paying a penalty for spending a lot of time trying to compress uncompressible data.

The way it seems to work is that ZFS first tries to compress your data with LZ4, which is an extremely fast compression algorithm. If LZ4 succeeds in compressing the data, then ZFS discards the LZ4-compressed data and recompresses it with your chosen higher compression level, knowing that it will be worth it.

If LZ4 can't compress the data, then ZFS doesn't even try to compress with the slower algorithm; it just stores the data uncompressed.

This almost completely avoids paying the CPU penalty for trying to compress data that cannot be compressed, and only results in about 9% less compression savings than if you had simply chosen the higher compression level without early abort.

The actual algorithm is even more clever: in-between the LZ4 heuristic and a zstd-3 or higher compression level, there is an intermediate zstd-1 heuristic that is also done to see whether the data is compressible.

This intermediate heuristic seems to almost completely eliminate the 9% compression savings cost while still being much faster than trying to compress all data with zstd-3 or higher, since zstd-1 is also very fast.

In the end, you get almost all the benefits of a higher compression level without paying almost any cost for wasting CPU cycles trying to compress uncompressible data!


edit: I can't tell if you edited this since I read it, or I just skipped the last few paragraphs outright, my apologies. I'll leave this here for posterity anyway.

To add something amusing but constructive - it turns out that one pass of Brotli is even better than using LZ4+zstd-1 for predicting this, but integrating a compression algorithm _just_ as an early pass filter seemed like overkill. Heh.

It's actually even funnier than that.

It tries LZ4, and if that fails, then it tries zstd-1, and if that fails, then it doesn't try your higher level. If either of the first two succeed, it just jumps to the thing you actually requested.

Because it turns out, just using LZ4 is insanely faster than using zstd-1 is insanely faster than any of the higher levels, and you burn a _lot_ of CPU time on incompressible data if you're wrong; using zstd-1 as a second pass helps you avoid the false positive rate from LZ4 having different compression characteristics sometimes.
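
In rough, illustrative C (not the actual zfs_zstd code -- it uses the userland lz4/zstd APIs and a made-up helper name, but it follows the go/no-go decision just described, which in practice only applies when you've asked for zstd-3 or higher):

  /* Illustrative sketch only -- not the actual OpenZFS implementation. */
  #include <string.h>
  #include <lz4.h>
  #include <zstd.h>

  static size_t store_block_early_abort(void *dst, char *scratch,
      const void *src, size_t lsize, int level)
  {
      size_t budget = lsize - (lsize >> 3);      /* the usual ~12.5% rule */

      /* Pass 1: LZ4, which is extremely cheap. A zero return means it
       * couldn't fit the data into the budget: "looks incompressible". */
      int ok = LZ4_compress_default((const char *)src, scratch,
          (int)lsize, (int)budget) > 0;

      /* Pass 2: only if LZ4 said no, double-check with zstd-1, which is
       * still far cheaper than zstd-3 and up. */
      if (!ok)
          ok = !ZSTD_isError(ZSTD_compress(scratch, budget, src, lsize, 1));

      /* Both passes said no: early abort, store raw, skip the big level. */
      if (!ok) {
          memcpy(dst, src, lsize);
          return lsize;
      }

      /* Looks compressible: throw the probe output away and pay for the
       * level that was actually requested. */
      size_t psize = ZSTD_compress(dst, budget, src, lsize, level);
      if (ZSTD_isError(psize)) {
          memcpy(dst, src, lsize);
          return lsize;
      }
      return psize;
  }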

Source: I wrote the feature.


> I can't tell if you edited this since I read it, or I just skipped the last few paragraphs outright, my apologies.

I made some minor edits within 15-30 mins of posting the comment, but the LZ4+zstd-1 heuristic was there from the beginning (I think) :)

Since you wrote this feature, I've been wondering about something and I was hoping you could give some insights:

Is there a good reason why this early abort feature wasn't implemented in zstd itself rather than ZFS?

It seems like this feature could be useful even for non-filesystem use cases, right? It could make compression at higher levels substantially faster in some cases (e.g. if a significant part of your data is not compressible).

Having this feature also available as an option in the zstd CLI tool could be quite useful, I think. I was thinking it could be implemented in the core zstd code and then both the CLI tool and ZFS could just enable/disable it as needed?

Thank you for your great work!

Edit: I just now realized, based on your Brotli comment, that since these early passes can be done (in theory) with arbitrary algorithms, perhaps it's better implemented as an additional layer on top of them rather than implementing it in each and every compression algorithm.
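
Something like a thin generic wrapper, I imagine (completely hypothetical API and names, just to sketch the layering idea):

  /* Hypothetical sketch -- nothing here is from zstd, LZ4, or ZFS. */
  #include <stddef.h>

  /* A compressor returns the compressed size, or 0 if it couldn't fit
   * the data into dstcap. */
  typedef size_t (*compressor_t)(void *dst, size_t dstcap,
      const void *src, size_t srcsize);

  /* Run a cheap probe (LZ4, zstd-1, an entropy estimate, one Brotli
   * pass...) and only invoke the expensive compressor if the probe
   * thinks the data is compressible. Returns 0 to mean "store raw". */
  static size_t compress_with_probe(compressor_t probe, compressor_t expensive,
      void *dst, size_t dstcap, void *scratch,
      const void *src, size_t srcsize)
  {
      if (probe(scratch, dstcap, src, srcsize) == 0)
          return 0;                              /* early abort */
      return expensive(dst, dstcap, src, srcsize);
  }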


Briefly, because I initially looked at ripping out LZ4's early abort, but it just skips chunks in the middle of the compression algorithm, so it didn't come out easily.

So I just tried using LZ4 as a first pass, and that worked well enough, albeit with a high false positive rate, so zstd-1 was duct-taped onto it.

Trying to just glue an entropy calculator on is something I want to try, but I haven't, because this worked "well enough".

As far as why not do it in just zstd or LZ4 proper...well, this works very well for extremely quantized chunks, like ZFS records, but for things with the streaming mode, it becomes more exciting to generalize.

Plus it's much nicer for the ZFS codebase to keep it abstracted.

Also, as far as using it for, say, gzip...well, I should really open that PR from that branch I have, shouldn't I? ;)


Ah, you were the one who gave the "Refining ZFS compression" talk, weren't you? Thanks for all your work! It works great and it's pretty interesting.

I bought a super cheap QAT card with the idea of doing hardware-accelerated gzip compression on ZFS (just to play with it), but it seems like the drivers are kind of a mess, and since I'm not a developer who could debug them, maybe I should stick with the easy and fast zstd with early abort :P :) Fewer headaches.


I did, yes.

I thought Intel stopped selling QAT cards with their newer generations of it?

Personally I find that gzip is more or less strictly worse than zstd for any use case I can imagine, though if you can hardware offload it, then...it's free but still worse at its job. :)

Though zstd can apparently be hardware offloaded if you have like, Sapphire Rapids-era QAT and the right wiring...sometimes. (I believe the demo code says it doesn't work on the streaming mode and only works on levels 1-12.) [1]

[1] - https://github.com/intel/QAT-ZSTD-Plugin


Seems like they stopped selling dedicated QAT cards, but their new E2000 DPUs (kind of a smart NIC, which I think your employer uses for accelerating cloud workloads) have QAT instructions. But I'd say that's because those DPUs have a couple of Xeons inside :)

I think the QAT API used to be a mess in past generations, and IIRC that's one of the reasons drivers for older hardware weren't included in TrueNAS Scale. I think it was this ticket [1] where they said it was complicated to add support for previous hardware, but I can't confirm because JIRA's been acting funny since yesterday.

Feels like QAT has stabilized with this Gen4 hardware and interesting things are appearing again. Not only the ZSTD plugin you linked, but I just noticed QATlib added support for lz4/lz4s this past July [2]! That's interesting.

They are also publishing guides for accelerating SSL operations in SSL terminators / load balancers like HAProxy [3], so between this and cloud providers using DPUs to accelerate their workloads, it doesn't feel like they're shutting this down anytime soon.

I'd love it if they made this a bit more accessible to non-developers.

  [1] : https://ixsystems.atlassian.net/browse/NAS-107334
  [2] : https://github.com/intel/qatlib/tree/main#revision-history
  [3] : https://networkbuilders.intel.com/solutionslibrary/intel-quickassist-technology-intel-qat-accelerating-haproxy-performance-on-4th-gen-intel-xeon-scalable-processors-technology-guide


Cute, the former Ubuntu ZFS developer now works for Intel on QATlib, among other things I imagine.


This seems kind of wasteful; would it be better to estimate entropy across random blocks in a file and then compress with $algo?


Not random, right? ZFS compression doesn't (at present; there have been unmerged proposals to try otherwise) care about the other records for calculating whether to try compressing something, and it would be a much more invasive change to actually change what it compresses with rather than a go/no-go decision.

As far as wasteful - not really? It might be possible to be more efficient (as I kept saying in my talk about it, I really swear this shouldn't work in some cases), but LZ4 is so much cheaper than zstd-1 is so much cheaper than the higher levels of zstd, that I tried making a worst-case dataset out of records that experimentally passed both "passes" but failed the 12.5% compression check later, and still, the amount of time spent on compressing was within noise levels compared to without the first two passes, because they're just that much cheaper.

I tried this on Raspberry Pis, I tried this on the single core single thread SPARC I keep under my couch, I tried this on a lot of things, and I couldn't find one where this made the performance worse by more than within the error bars across runs.

(And you can't use zstd-fast for this, it's a much worse second pass or first pass.)



