So this is great if you're just looking to deduplicate read-only files. Less so if you intend to write to them: write to one and they're both updated.
Anyway. Offline/lazy dedup (not in the ZFS dedup sense) is something that could be done in userspace, at the file level, on any filesystem that supports reflinks. When a tool like rdfind finds a duplicate, instead of replacing it with a hardlink, create a copy of the file with `copy_file_range(2)` and let the filesystem create a reflink to it. Now you've got the space savings and they're two separate files, so if one is written to, the other remains the same.
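A rough sketch of that replacement step in Python, to make the idea concrete. `clone_file`, `replace_with_clone` and the `.dedup-tmp` suffix are made-up names for illustration, not anything rdfind provides. Whether the copy actually shares blocks is entirely up to the kernel and filesystem; this only asks nicely, and falls back to an ordinary copy if `copy_file_range` is unavailable.

```python
import os
import shutil


def clone_file(src_path, dst_path):
    """Copy src_path to dst_path, preferring copy_file_range(2).

    On a reflink-capable filesystem the kernel may share blocks instead of
    copying bytes, but nothing here guarantees that.
    """
    try:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            remaining = os.fstat(src.fileno()).st_size
            while remaining > 0:
                copied = os.copy_file_range(src.fileno(), dst.fileno(), remaining)
                if copied == 0:  # unexpected end of file
                    break
                remaining -= copied
        return
    except (AttributeError, OSError):
        # No os.copy_file_range (non-Linux, Python < 3.8) or the call failed:
        # fall through to a plain byte copy, which rewrites dst from scratch.
        pass
    shutil.copyfile(src_path, dst_path)


def replace_with_clone(original, duplicate):
    """Replace `duplicate` with an independent clone of `original`.

    Assumes the caller has already verified the two files are identical.
    Ownership, mode and timestamps of `duplicate` are NOT preserved here.
    """
    tmp = duplicate + ".dedup-tmp"  # hypothetical temp name, same directory
    clone_file(original, tmp)
    os.replace(tmp, duplicate)  # atomic rename over the old duplicate
```

Unlike the hardlink case, appending to the replaced file afterwards leaves the original untouched, because the two paths are separate inodes that (at best) share blocks copy-on-write.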
How would this work if I have snapshots? Wouldn’t the version of the file I just replaced still be in use there? And maybe I also need to store the copy again if I make another snapshot, because the “original” file isn’t part of the snapshot? So now I’m effectively storing more, not less?
AFAIK, yes. Blocks are reference counted, so if the duplicate file is in a snapshot then the blocks would be referenced by the snapshot and hence not be eligible for deallocation. Only once the reference count falls to zero would the block be freed.
This is par for the course with ZFS though. If you delete a non-duplicated file you don't get the space back until any snapshots referencing the file are deleted.
Yes, I know that snapshots incur a cost. But I’m wondering whether the act of deduplicating has now actually created an extra copy instead of saving one.
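To reason about that, here's a toy reference-count model. It is not how ZFS or any real filesystem does its accounting; `ToyPool` and all its methods are invented for illustration. It assumes the duplicate is replaced by a reflink/clone of the original's blocks, not by a fresh byte-for-byte copy.

```python
from collections import Counter


class ToyPool:
    """Toy block pool: a block stays allocated while anything references it."""

    def __init__(self):
        self.refs = Counter()   # block id -> reference count
        self.files = {}         # live file name -> list of block ids
        self.snapshots = {}     # snapshot name -> {file name: block ids}
        self._next_block = 0

    def _ref(self, blocks):
        for b in blocks:
            self.refs[b] += 1

    def _unref(self, blocks):
        for b in blocks:
            self.refs[b] -= 1
            if self.refs[b] == 0:
                del self.refs[b]  # last reference gone: block is freed

    def write_file(self, name, nblocks):
        """Create a file backed by freshly allocated blocks."""
        blocks = list(range(self._next_block, self._next_block + nblocks))
        self._next_block += nblocks
        self._ref(blocks)
        self.files[name] = blocks

    def snapshot(self, snap):
        """A snapshot takes one extra reference on every live block."""
        self.snapshots[snap] = {n: list(b) for n, b in self.files.items()}
        for blocks in self.snapshots[snap].values():
            self._ref(blocks)

    def delete_snapshot(self, snap):
        for blocks in self.snapshots.pop(snap).values():
            self._unref(blocks)

    def reflink(self, src, dst):
        """Replace dst with a clone that shares src's blocks."""
        old = self.files[dst]
        self.files[dst] = list(self.files[src])
        self._ref(self.files[dst])
        self._unref(old)  # old blocks survive if a snapshot still holds them

    def used(self):
        return len(self.refs)


def dedup_with_snapshot_demo():
    pool = ToyPool()
    pool.write_file("a", 4)
    pool.write_file("b", 4)        # same content as "a", stored separately
    before = pool.used()
    pool.snapshot("s1")
    after_snapshot = pool.used()
    pool.reflink("a", "b")         # the dedup step
    after_dedup = pool.used()
    pool.delete_snapshot("s1")
    after_snapshot_delete = pool.used()
    return before, after_snapshot, after_dedup, after_snapshot_delete
```

In this model the demo returns `(8, 8, 8, 4)`: the dedup step neither saves nor costs data blocks while the snapshot still pins the old copy, and the savings only show up once that snapshot is gone. A later snapshot just references the already-shared blocks, so it doesn't store the file again.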
copy_file_range already works on ZFS, but it doesn't guarantee that anything actually ends up shared; the kernel is free to do a plain copy.
Basically all modern dedupe tools use the `FIDEDUPERANGE` ioctl, which is meant to tell the FS which ranges should be sharing data, and let it take care of the rest.
(Btrfs, bcachefs, etc. support this ioctl, and ZFS will soon too)
Unlike copy_file_range, it is meant for exactly this use case, and will tell you how many bytes were dedup'd, etc.
However, the fact that editing one copy edits all of them still makes this a non-solution for me at least. I'd also strongly prefer deduping at the block level vs file level.
https://github.com/pauldreik/rdfind