File deduplication.

I use rsync for backups, and it has a very nice feature (the --link-dest option) where it hard-links unchanged files to a reference directory tree, in order to save space. That works well, but sometimes the chain of links is broken, and disk space gets used up unnecessarily. The most common cause is when I move a file: rsync can’t tell that it has moved, so the new location gets its own copy. That can be expensive if I’m renaming a load of videos!
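To check whether a file in one snapshot still shares storage with its counterpart in another, it's enough to compare device and inode numbers. This is only a minimal sketch; the function name `shares_storage` is made up for illustration.

```python
import os


def shares_storage(path_a, path_b):
    """Return True if both paths are hard links to the same inode."""
    a = os.lstat(path_a)
    b = os.lstat(path_b)
    # Two names refer to the same file only if device AND inode both match.
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

If this returns False for a file that hasn't changed between two snapshots, the link chain has been broken and the data is stored twice.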

So, it would be nice to find a tool that can spot these cases, and replace the duplicate files with hard links. There are a load of candidates, but none of them seem to quite meet my requirements.
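The replacement step itself is simple, as long as it's done carefully. Here is a rough sketch, assuming both files are on the same filesystem and have already been confirmed identical. The function name and the `.dedup-tmp` suffix are invented for this example. The link is made under a temporary name first and then renamed over the duplicate, so the duplicate is never missing if something fails halfway.

```python
import os


def replace_with_hard_link(keep, dup):
    """Replace `dup` with a hard link to `keep` (same filesystem assumed)."""
    tmp = dup + ".dedup-tmp"  # hypothetical temporary name
    os.link(keep, tmp)  # raises OSError if the link cannot be created
    try:
        # rename() over an existing file is atomic on POSIX filesystems.
        os.replace(tmp, dup)
    except OSError:
        os.unlink(tmp)  # clean up and leave the original duplicate alone
        raise
```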

Here’s what I need:

  1. Files must match metadata as well as content. Hard links merge the metadata, so if two files have different owners, then I can’t replace one of them with a hard link. I’m willing to accept some variations in metadata: access time, certainly. Probably modification time, and maybe even group. The ideal tool would let me choose what metadata is important to me.

  2. I probably want to set a minimum file size. The benefits of reducing duplication are pretty minimal when file sizes are small.

  3. I need to be able to blacklist/whitelist files by name, or by location. If a file is ever modified in place, then all of its link-brothers will also get modified. That’s not so bad for back-ups, which are hopefully immutable. But if I ever use this tool on files in use, it could be catastrophic.

  4. I’d like to restrict hardlinking to between directory trees, rather than just willy-nilly within a tree. That would preserve the integrity of each back-up snapshot, just like rsync does.

  5. The number of hard links per inode is limited on some filesystems (particularly ext4 – which is what I’m using). The tool must know about this limit.
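As a rough sketch of how requirements 1, 2 and 5 might combine into a single check, something like the following would do. It's not taken from any of the tools below. The function name, the default minimum size and the default metadata fields are all assumptions for illustration. The 65000 figure is ext4's per-inode link limit; other filesystems differ. Name filtering (3) and the between-trees rule (4) are left out.

```python
import filecmp
import os
import stat

# Assumption: ext4 allows at most 65000 hard links per inode.
EXT4_MAX_LINKS = 65000


def is_link_candidate(keep, dup, min_size=1024 * 1024,
                      fields=("st_uid", "st_gid", "st_mode"),
                      max_links=EXT4_MAX_LINKS):
    """Decide whether `dup` could safely become a hard link to `keep`."""
    a = os.lstat(keep)
    b = os.lstat(dup)
    # Only regular files; never follow or merge symlinks, devices, etc.
    if not (stat.S_ISREG(a.st_mode) and stat.S_ISREG(b.st_mode)):
        return False
    # Hard links cannot cross filesystems.
    if a.st_dev != b.st_dev:
        return False
    # Already the same inode: nothing to gain.
    if a.st_ino == b.st_ino:
        return False
    # Requirement 2: skip small files (and files of different size).
    if a.st_size != b.st_size or a.st_size < min_size:
        return False
    # Requirement 5: don't exceed the filesystem's link limit.
    if a.st_nlink >= max_links:
        return False
    # Requirement 1: the chosen metadata fields must match exactly.
    if any(getattr(a, f) != getattr(b, f) for f in fields):
        return False
    # Finally, a full byte-by-byte comparison of the contents.
    return filecmp.cmp(keep, dup, shallow=False)
```

Passing a different `fields` tuple (adding `"st_mtime"`, say, or dropping `"st_gid"`) is how the caller would choose which metadata matters.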

Frankly, I haven’t found anything that fits the bill, so I’m thinking of writing something myself. Here are the candidates…

  1. freedup: Overall, the description makes it sound like a solid tool, and the documentation seems relatively complete. However, when I built it, I noticed that the Makefile attempts to write new lines into /etc/services, and yes, the program does contain socket/server code, which is apparently triggered by undocumented options. Personally, I’m a bit leery of file-system level tools that contain undocumented server code, so I’ll not be using it.
  2. fdupes: This is a popular tool (packaged in Debian), but I don’t think it covers any of my metadata requirements.
  3. rmlink: This is a fairly new tool. It’s not packaged, so I had to get it from GitHub. It’s got good name filtering, but it doesn’t check metadata, or allow for size or directory tree limits. It also doesn’t natively support hard linking: you can give it a custom command to use, but that could not easily be taught about hard link limits. On the plus side, rmlink is reportedly very fast. Finally, the name is horrendously dangerous. I caught myself editing “rm” commands, because I’d not been paying attention when I did a reverse history search… Had I not noticed, and hit return, I might well have found myself needing those backups.
  4. fslint: This is a GUI tool. Not much use on my server.
  5. hardlink is a Python script. I’ve not investigated it too much since there’s no online documentation.
  6. rdfind is a very basic tool. It doesn’t have any of the features I’m looking for.
  7. duff seems like a solid tool. It can’t make hard links, only report duplicates. It does permit me to set a minimum file size. No coverage for any of my other requirements.

The fdupes Wikipedia page contains a useful list of other such tools. I may investigate more of them later.


  1. h6w said,

    10 December, 2014 @ 00:18

    Actually, according to fslint FAQ, fslint-gui is just a wrapper around fslint, so it’s a command-line tool.
