Archive for Linux

File deduplication.

I use rsync for backups, and it has a very nice feature (the --link-dest option), where it hard-links unchanged files to a reference directory tree in order to save space. That works well, but sometimes the chain of links is broken, and disk space gets used up unnecessarily. The most common cause is when I move a file – rsync can’t tell that it’s moved, so the new location gets its own copy. That can be expensive if I’m renaming a load of videos!

So, it would be nice to find a tool that can spot these cases, and replace the duplicate files with hard links. There are a load of candidates, but none of them seems to quite meet my requirements.
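As a rough illustration of the "replace the duplicate with a hard link" step, here is a minimal Python sketch. The function name `replace_with_hardlink` and the `.hardlink-tmp` suffix are made up for this example, and it assumes the caller has already checked that the two files have identical content and compatible metadata.

```python
import os


def replace_with_hardlink(keep: str, duplicate: str) -> bool:
    """Replace `duplicate` with a hard link to `keep`.

    Returns False, and changes nothing, if the two paths already share
    an inode or live on different filesystems. Assumes the caller has
    already verified that the contents are identical.
    """
    keep_stat = os.lstat(keep)
    dup_stat = os.lstat(duplicate)

    if keep_stat.st_dev != dup_stat.st_dev:
        return False  # hard links cannot cross filesystems
    if keep_stat.st_ino == dup_stat.st_ino:
        return False  # already hard-linked to each other

    # Create the link under a temporary name in the same directory,
    # then rename it over the duplicate. The duplicate's path is never
    # missing, even if the process is interrupted part-way through.
    tmp = duplicate + ".hardlink-tmp"  # hypothetical temporary name
    os.link(keep, tmp)
    try:
        os.replace(tmp, duplicate)
    except OSError:
        os.unlink(tmp)
        raise
    return True
```

Linking to a temporary name and renaming, rather than deleting the duplicate first, means a failure at any point leaves the original file in place.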

Here’s what I need:

  1. Files must match metadata as well as content. Hard links merge the metadata, so if two files have different owners, then I can’t replace one of them with a hard link. I’m willing to accept some variations in metadata – access time, certainly. Probably modification time, and maybe even group. The ideal tool would let me choose what metadata is important to me.

  2. I probably want to set a minimum file size. The benefits of reducing duplication are pretty minimal when file sizes are small.

  3. I need to be able to blacklist/whitelist files by name, or by location. If a file is ever modified, then all of its link-brothers will also get modified. That’s not so bad for back-ups, which are hopefully immutable. But if I ever use this tool on files in use, it could be catastrophic.

  4. I’d like to restrict hardlinking to between directory trees, rather than just willy-nilly within a tree. That would preserve the integrity of each back-up snapshot, just like rsync does.

  5. The number of hard links per inode is limited on some filesystems (particularly ext4, which is what I’m using, where the limit is 65,000). The tool must know about this limit.
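To make some of these requirements concrete, here is a minimal Python sketch of the candidate-finding step. It is not taken from any of the tools below. It covers only requirements 1, 2 and 5: a selectable metadata key, a minimum size, and the ext4 link limit. The names `find_duplicates`, `metadata_key`, `MIN_SIZE` and the 1 MiB default are assumptions made for illustration; name filtering and the between-trees restriction are left out.

```python
import hashlib
import os
import stat
from collections import defaultdict

MIN_SIZE = 1024 * 1024  # assumed default: ignore files under 1 MiB
MAX_LINKS = 65000       # ext4's hard link limit per inode


def metadata_key(st, fields):
    """Build a comparison key from the chosen stat fields."""
    return (st.st_dev, st.st_size) + tuple(getattr(st, f) for f in fields)


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()


def find_duplicates(roots, min_size=MIN_SIZE,
                    fields=("st_uid", "st_gid", "st_mode")):
    """Return lists of paths with equal content and equal chosen metadata.

    Paths that already share an inode are reported once. Files already
    at the link limit are skipped, which is a simplification.
    """
    by_meta = defaultdict(dict)  # metadata key -> {inode: first path seen}
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode):
                    continue  # skip symlinks, devices, sockets...
                if st.st_size < min_size or st.st_nlink >= MAX_LINKS:
                    continue
                by_meta[metadata_key(st, fields)].setdefault(st.st_ino, path)

    groups = []
    for inodes in by_meta.values():
        if len(inodes) < 2:
            continue  # nothing to compare against
        by_hash = defaultdict(list)
        for path in inodes.values():
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Calling `find_duplicates(["/backups/monday", "/backups/tuesday"])` (hypothetical paths) would return groups of candidate paths. Adding `"st_mtime"` to `fields` would make modification time significant as well. Content is only hashed for files whose size and metadata already match, so most files are never read.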

Frankly, I haven’t found anything that fits the bill, so I’m thinking of writing something myself. Here are the candidates…

  1. freedup: Overall, the description makes it sound like a solid tool, and the documentation seems relatively complete. However, when I built it, I noticed that the Makefile attempts to write new lines into /etc/services, and yes, the program does contain socket/server code – which is apparently triggered by undocumented options. Personally, I’m a bit leery of file-system level tools that contain undocumented server code, so I’ll not be using it.
  2. fdupes: This is a popular tool (packaged in Debian), but I don’t think it covers any of my metadata requirements.
  3. rmlink: This is a fairly new tool – it’s not packaged, so I had to get it from GitHub. It’s got good name filtering, but it doesn’t check metadata, and it doesn’t allow for size or directory tree limits. It also doesn’t natively support hard linking – you can give it a custom command to use, but that command could not easily be taught about hard link limits. On the plus side, rmlink is reportedly very fast. Finally, the name is horrendously dangerous. I caught myself editing “rm” commands, because I’d not been paying attention when I did a reverse history search… Had I not noticed, and hit return, I might well have found myself needing those backups.
  4. fslint: This is a GUI tool. Not much use on my server.
  5. hardlink is a Python script. I’ve not investigated it too much since there’s no online documentation.
  6. rdfind is a very basic tool. It doesn’t have any of the features I’m looking for.
  7. duff seems like a solid tool. It can’t make hard links, only report duplicates. It does permit me to set a minimum file size, but it doesn’t cover any of my other requirements.

The fdupes Wikipedia page contains a useful list of other such tools. I may investigate more of them later.



I’m dipping my toe into the world of IPv6. My home network now supports it (thanks to AAISP), and I’ve started to slowly shift my server addresses over. Inspiration from ipv6friday.


Remove Textile from a WordPress blog

Here’s a Python script to remove Textile mark-up from a WordPress blog.

Download the script here:

Read the rest of this entry »


Oracle’s ‘proc’ program leaks temporary files

proc is a program that ‘compiles’ Pro*C into C or C++. It’s shocking that a big company like Oracle could produce something so shoddy. Read the rest of this entry »


Services on AIX

I’ve just spent a rather dry afternoon working out how to create a service on AIX. In brief, AIX doesn’t provide any real support for SysV-style Unix services; instead, it has its own scheme, which does not use the familiar start/stop wrapper scripts. Read the rest of this entry »


Mapnik on Debian Etch.

Late last year I went to the Ordnance Survey to see a demonstration of their new OpenSpace mapping service. It was there that I met Artem Pavlenko, who wrote Mapnik. We talked briefly about a few things, one of which was the Boost C++ libraries. I suggested that using such libraries makes software harder to build and install… “Oh no,” he said, “Boost’s all in the header files, so there are no library dependency problems.”

Well that turned out to be a big fat lie.

Read the rest of this entry »


IBM ScrollPoint Pro Mouse

I have one of these mice and I love it. Instead of a scrollwheel, it has a little “joystick” just behind the middle button. And what’s more, it lights up blue!

Anyway, I keep having to set up my xorg.conf file for it, and I always have to puzzle out the correct configuration. (Yes, my hard drive failed and I’m having to rebuild my whole machine.) Well, here it is:

Read the rest of this entry »


Fix Iceweasel Sound on Debian

My sound disappeared after upgrading from Firefox to Iceweasel. Irritating, but the problem is easily solved:

Ensure that alsa-oss is installed. Then, edit /etc/iceweasel/iceweaselrc and set:


(Thanks to macewan.)



I’ve stupidly bought a VIA EPIA motherboard. It’s a tiny, low-power server board, with two built-in VIA-Rhine ethernet controllers. I’ve added a Prism-based wireless PCI card. I use it as a firewall and WiFi base-station. Read the rest of this entry »


NASA’s SRTM Elevation data

A simple C++ interface to NASA’s SRTM Elevation data. I use this code in my Flood Maps project. This code will only run on a Unix operating system.

Download it here: nasagrid.tgz

Read the rest of this entry »
