
I run rdfind[1] as a cronjob to replace duplicates with hardlinks. Works fine!

[1] https://github.com/pauldreik/rdfind



So this is great if you're just looking to deduplicate read-only files. Less so if you intend to write to them: write to one and they're both updated.

Anyway. Offline/lazy dedup (not in the zfs dedup sense) is something that could be done in userspace, at the file level on any filesystem that supports reflinks. When a tool like rdfind finds a duplicate, instead of replacing with a hardlink, create a copy of the file with `copy_file_range(2)` and let the filesystem create a reflink to it. Now you've got space savings and they're two separate files so if one is written to the other remains the same.


How would this work if I have snapshots? Wouldn't the version of the file I just replaced still be in use there? But maybe I also need to store the copy again if I make another snapshot, because the "original" file isn't part of the snapshot? So now I'm effectively storing more, not less?


AFAIK, yes. Blocks are reference counted, so if the duplicate file is in a snapshot then the blocks would be referenced by the snapshot and hence not be eligible for deallocation. Only once the reference count falls to zero would the block be freed.

This is par for the course with ZFS though. If you delete a non-duplicated file you don't get the space back until any snapshots referencing the file are deleted.
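
To make the accounting concrete, here is a toy model of the reference counting described above. It is not how ZFS tracks blocks on disk; the `ToyPool` class and the block names are invented purely to illustrate the reasoning:

```python
from collections import Counter


class ToyPool:
    """Toy model: each file and each snapshot entry holds one reference per block."""

    def __init__(self):
        self.refs = Counter()   # block id -> reference count
        self.files = {}         # file name -> list of block ids
        self.snapshots = {}     # snapshot name -> {file name: block ids}

    def _hold(self, blocks):
        self.refs.update(blocks)

    def _release(self, blocks):
        for block in blocks:
            self.refs[block] -= 1
            if self.refs[block] == 0:
                del self.refs[block]  # count hit zero: block is freed

    def write(self, name, blocks):
        # Take the new references before dropping the old ones.
        self._hold(blocks)
        self._release(self.files.get(name, []))
        self.files[name] = list(blocks)

    def snapshot(self, snap):
        self.snapshots[snap] = {n: list(b) for n, b in self.files.items()}
        for blocks in self.snapshots[snap].values():
            self._hold(blocks)

    def destroy_snapshot(self, snap):
        for blocks in self.snapshots.pop(snap).values():
            self._release(blocks)

    def allocated(self):
        return len(self.refs)


pool = ToyPool()
pool.write("a", ["a1", "a2"])
pool.write("b", ["b1", "b2"])      # same content as "a", but separate blocks
pool.snapshot("s1")
before = pool.allocated()          # 4 blocks
pool.write("b", pool.files["a"])   # dedup: "b" now shares the blocks of "a"
after_dedup = pool.allocated()     # still 4: "s1" keeps b1 and b2 alive
pool.snapshot("s2")
after_s2 = pool.allocated()        # still 4: "s2" only references a1 and a2
pool.destroy_snapshot("s1")
after_destroy = pool.allocated()   # 2: the old duplicate blocks are finally freed
```

In this model, deduplicating under an existing snapshot saves nothing until that snapshot is destroyed, but it also never allocates anything extra.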


Yes, I know that snapshots incur a cost. But I'm wondering whether the act of deduplicating has now created an extra copy instead of saving one.


I don't fully understand the scenario you mentioned. Could you perhaps explain in a bit more detail?


copy_file_range already works on ZFS, but it doesn't guarantee anything interesting.

Basically all modern dupe tools use FIDEDUPERANGE, which is meant to tell the FS which things should be sharing data, and let it take care of the rest. (Btrfs, bcachefs, etc. support this ioctl, and ZFS will soon too.)

Unlike copy_file_range, it is meant for exactly this use case, and will tell you how many bytes were dedup'd, etc.
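
For the curious, a minimal sketch of calling the ioctl from Python. The request number and struct layouts follow `struct file_dedupe_range` and `struct file_dedupe_range_info` in `<linux/fs.h>`; the helper names `build_request` and `dedupe_range` are my own, and the call raises `OSError` on filesystems that don't support it:

```python
import fcntl
import struct

# _IOWR(0x94, 54, struct file_dedupe_range) from <linux/fs.h>
FIDEDUPERANGE = 0xC0189436
# src_offset, src_length, dest_count, reserved1, reserved2
HEADER = struct.Struct("=QQHHI")
# dest_fd, dest_offset, bytes_deduped, status, reserved
INFO = struct.Struct("=qQQiI")

FILE_DEDUPE_RANGE_SAME = 0
FILE_DEDUPE_RANGE_DIFFERS = 1


def build_request(src_offset, length, dest_fd, dest_offset):
    """Pack a request with exactly one destination range."""
    header = HEADER.pack(src_offset, length, 1, 0, 0)
    info = INFO.pack(dest_fd, dest_offset, 0, 0, 0)
    return bytearray(header + info)


def dedupe_range(src_fd, dest_fd, length, src_offset=0, dest_offset=0):
    """Ask the filesystem to share `length` bytes between the two files.

    Returns (bytes_deduped, status). A status of FILE_DEDUPE_RANGE_DIFFERS
    means the kernel compared the ranges and found they are not identical;
    a negative status is a negated errno for that destination.
    """
    buf = build_request(src_offset, length, dest_fd, dest_offset)
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)  # kernel fills in the result fields
    _, _, bytes_deduped, status, _ = INFO.unpack_from(buf, HEADER.size)
    return bytes_deduped, status
```

The kernel does the byte comparison itself, so a stale "these are duplicates" decision from the scanning tool can't corrupt anything; you just get FILE_DEDUPE_RANGE_DIFFERS back.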


Quite cool, though it doesn't save as much storage as deduplicating at the block level, e.g. in N-byte blocks.


But then you have to be careful not to remove the one which happens to be the "original" or the hardlinks will break, right?


No, pointing to an original is how soft links work.

Hard links are all equivalent. A file has any number of hard links, and at least in theory you can't distinguish between them.

The risk with hardlinks is that you might alter the file. Reflinks remove that risk, and also perform very well.
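
A quick way to convince yourself, as a small sketch (the function name `hardlink_demo` and the file names are arbitrary):

```python
import os


def hardlink_demo(workdir):
    """Show that two hard links are equal names for one inode."""
    first = os.path.join(workdir, "first")
    second = os.path.join(workdir, "second")

    with open(first, "w") as f:
        f.write("v1")
    os.link(first, second)  # second name for the same inode

    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    link_count = os.stat(first).st_nlink

    # Opening with "w" truncates and rewrites in place, so the inode is kept
    # and the change is visible through the other name as well.
    with open(second, "w") as f:
        f.write("v2")
    with open(first) as f:
        seen_via_first = f.read()

    # Removing the name that was created first does not break the other one.
    os.remove(first)
    with open(second) as f:
        survivor = f.read()

    return same_inode, link_count, seen_via_first, survivor
```

Run against a scratch directory, this should return `(True, 2, "v2", "v2")`: one inode, two names, the edit shows up under both, and deleting the "original" name leaves the data intact.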


Thank you, I was unaware of this.

However, the fact that editing one copy edits all of them still makes this a non-solution for me at least. I'd also strongly prefer deduping at the block level vs file level.


I would suspect a call to `chmod a-w` would fix that, or at least serve as a very fine reminder that there's something special about them.
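
Roughly what that looks like programmatically, as a sketch (the helper name `make_read_only` is made up). Since permissions live on the inode, doing it through one name covers every hard link:

```python
import os
import stat


def make_read_only(path):
    """Equivalent of `chmod a-w`: clear the write bits for user, group and other."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)
```

It's a reminder rather than a guarantee: root ignores the write bits, and the owner can always chmod them back.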



