Nature abhors a vacuum, and this is especially true of our old platter-based hard disks, which crackle and squeal under the gigabytes of files we feed them. As far as Leeloo and I are concerned, our main worry is the loss of the numerous pictures we own. All those digital photos, videos, derivatives, modified, cropped and resized versions and web albums pile up on our disks.
In early December 2012, we realized that more than 600 GB were necessary to store all these images. And this amount of space is nothing but the visible tip of the iceberg! We fear the possible loss of these precious recordings like the apocalypse: losing them would irrevocably alter or destroy memories of many moments of the past (childhood, events, our wedding) and, soon enough, glimpses of the growth of our soon-to-be-born daughter Alice, which, I am sure, will require hundreds of gigabytes to satisfy the appetite of her pretty mum’s hungry lens.
To protect all these images, burning DVDs or CDs is out of the question: those prehistoric, unreliable and unusable pieces of junk could never be trusted with any digital content you would like to preserve. As a consequence, we use hard drives. The main partition hosting the source of the photos is on a RAID1 array (the data is duplicated on two hard drives), which means you have to double the needed storage, that is to say 1.2 TB.
Still, I can hardly forget that some years ago, a failing power supply died in sparks and smoke and took ALL the hard drives of the machine with it! Since then, the pictures have also been duplicated on another machine, again in RAID1, thus raising the needed storage space to 2.4 TB.
Being paranoid may be wise in these troubled times, so a third machine with a RAID5 array (meaning that only the equivalent of one disk in the set is used for redundancy, 1 out of 5 in our case) is also used to keep our pictures. Now, we are past 3.1 TB.
Of course, any local catastrophe, like a fire or a burglary, could wipe out all three machines at once, so one needs to have a remote copy of the data. We have two, because the remote backups are not on duplicated devices. But we still do not have a copy on the opposite side of the planet, which would be way safer if a meteor ever crashes around here (maybe we will have other things to think about by then, but still, in difficult times an old picture of somebody we love is of great help)…
So, 600 GB of pictures are only the visible part of a storage need rising above 4.3 TB, about 7 times more (this is not exactly an iceberg ratio, still we are close).
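For those who like to check the sums, here is a small Python sketch that redoes the arithmetic. It only uses the figures given above; the helper names `raid1` and `raid5` are mine, and I assume the RAID5 array spends exactly one disk out of five on parity and that each of the two remote copies holds the full 600 GB once.

```python
# Storage needed for 600 GB of pictures, following the setup described above.
PICTURES_GB = 600


def raid1(size_gb):
    # RAID1 mirrors the data on two drives, so raw usage doubles.
    return size_gb * 2


def raid5(size_gb, disks):
    # RAID5 spends the equivalent of one disk out of `disks` on parity.
    return size_gb * disks / (disks - 1)


copies = {
    "main machine (RAID1)": raid1(PICTURES_GB),
    "second machine (RAID1)": raid1(PICTURES_GB),
    "third machine (RAID5, 5 disks)": raid5(PICTURES_GB, 5),
    "two remote copies (no redundancy)": 2 * PICTURES_GB,
}

total_gb = sum(copies.values())
ratio = total_gb / PICTURES_GB

for name, size in copies.items():
    print(f"{name}: {size:.0f} GB")
print(f"total: {total_gb / 1000:.2f} TB, {ratio:.2f} times the original")
```

The same ratio applies in reverse: every gigabyte of duplicates removed from the source folders frees a bit more than seven gigabytes of raw disk space.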
With the raw images and the resized, retouched, cropped and altered versions, these 600 GB are split over 960,000 files. Trying to sort anything out is quite impossible, but we are pretty sure duplicates are swarming among all those files: when we set some pics aside to print them, when we created temporary copies to send to family, or to process them later, or to send to a web site, or more…
I wanted to write a little program to find duplicates by comparing image names, sizes and ultimately checksums. The checksum is needed because, as incredible as this may seem, modern cameras are unable to count past 10,000, so file names end up being reused, even though it is easy to take 20,000 pictures or more with any of them.
Well, just as I was nearly ready to start working, some Sunday slackness pushed me to run a little apt-cache search duplicate on my Debian wheezy first. And this is how I found the marvelous rdfind.
This small program does exactly what I was looking for: it recurses through the folders and finds duplicates. Quite efficiently, it first checks the size of the files, then compares their first and last bytes to discard all the unique files, and only then calculates an md5 checksum (or sha1 for the ultra-paranoid) on the remaining file list.
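To illustrate why this order of checks is efficient, here is a minimal Python sketch of the same idea. It is not rdfind's actual code, just my own approximation: the function names are made up, and the number of bytes compared at each end (`EDGE_BYTES`) is an arbitrary assumption. Each stage only looks at the files that survived the previous, cheaper one.

```python
import hashlib
import os
from collections import defaultdict

EDGE_BYTES = 64  # assumption: how many leading/trailing bytes to compare


def split_by(groups, key):
    """Split each group of paths by key(path); keep only groups of 2+ files."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(g for g in buckets.values() if len(g) > 1)
    return result


def edges(path):
    """Return the first and last EDGE_BYTES bytes of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(EDGE_BYTES)
        f.seek(max(0, size - EDGE_BYTES))
        tail = f.read(EDGE_BYTES)
    return head, tail


def checksum(path, algorithm="md5"):
    """Hash the whole file in chunks so large files are not loaded at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, algorithm="md5"):
    """Return lists of paths under root whose size, edges and checksum match."""
    paths = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                paths.append(path)
    groups = split_by([paths], os.path.getsize)  # cheapest test first
    groups = split_by(groups, edges)             # then first and last bytes
    groups = split_by(groups, lambda p: checksum(p, algorithm))  # full hash last
    return [sorted(g) for g in groups]
```

Calling `find_duplicates("/path/to/pictures")` (a placeholder path) returns one list per set of identical files, and passing `"sha1"` as the second argument switches the checksum. Most files are eliminated by the size test alone, which costs no file reads at all, so the expensive full checksum only runs on a small remainder.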
rdfind can list the duplicated files, or replace them either with hard links (the same file appears in two different directories but is physically stored only once on the underlying file system) or with symbolic links (roughly what other systems call shortcuts).
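If hard links are new to you, this little Python sketch shows what replacing a duplicate with a hard link amounts to. Again, this is my own illustration and not how rdfind does it internally; the function name and the temporary file suffix are hypothetical, and both files must live on the same file system.

```python
import os
import tempfile


def replace_with_hard_link(original, duplicate):
    """Replace `duplicate` with a hard link to `original` (same file system)."""
    temp = duplicate + ".tmp-link"  # hypothetical temporary name
    os.link(original, temp)         # new directory entry for the same data
    os.replace(temp, duplicate)     # swap it in place of the duplicate


with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "photo.jpg")
    duplicate = os.path.join(folder, "photo_copy.jpg")
    for path in (original, duplicate):
        with open(path, "wb") as f:
            f.write(b"same pixels")

    replace_with_hard_link(original, duplicate)

    # Both names now point at the same inode, stored once on disk.
    same_inode = os.stat(original).st_ino == os.stat(duplicate).st_ino
    link_count = os.stat(original).st_nlink
    print(same_inode, link_count)
```

Keep in mind that after such a replacement, editing the file through one name changes it under the other name too, since there is only one copy left.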
It’s super fast: it took us only a few minutes to free up 60 GB among the 960,000 files of our picture folders, 60 GB which turn into about 400 GB of saved space once the redundancy is included!
Thanks a lot, rdfind!