Thursday, 8 January 2009

Finding duplicate files

At work we have a problem. Our Storage Area Network (SAN) is expensive, and not purely because of the servers and disks: we also back up a lot of that data every single night.
Now, imagine that (like us) you are an ISV. We have a nightly build which produces a code base of around 900MB. Some of these builds make it to testing, some make it to release candidate, and some actually become a release.
Our main file server has its storage on the SAN so, even taking into account the fact that most releases never hit the customers, we have issues with duplicate files on the server/SAN. Consider an overnight build...

  1. We end up with a build (21390) in the build area. Good for the Devs.
  2. Testers then copy it to their own area.
  3. It gets released, so we maintain a "gold" copy of the released code.
  4. The Service Desk have their own library of released code.

With a typical release, we end up with at least four copies of the 900MB code: auto-build, test, gold-release and support-library. That is roughly 3.6GB of SAN storage for a single release.

SAN storage is one of the most expensive resources that I need to account for, so I looked around for a utility that could help me. I was hampered on a number of fronts:

  1. Vitally, most of the duplicate-finding utilities did not expect to run against shares with ~100,000 folders and millions of files.
  2. They seemed to be aimed at people who wanted to delete duplicates. I don't want to do that; I want to investigate failures in process and fix those.
  3. Their view of a duplicate did not match mine, which is: same suffix, same size and same MD5 hash.

So, I decided to write my own!

The app searches a folder or share, and then breaks the files down into the following sets:

  • All large files (as defined by you, when you run the program)
  • Sets of files that have the same suffix and are the same length
  • Sets of files that have the same suffix, length and MD5 hash
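For anyone who wants to experiment with the idea, the three sets above can be sketched in a few lines of Python. This is not the app's source, just a minimal illustration of the approach; the function names (md5_of, find_sets) and the large_bytes parameter are made up for this example. The point of the two-stage grouping is that only files which already share a suffix and a size ever get hashed, which matters a lot when the share holds millions of files.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_sets(root, large_bytes):
    """Walk root and build the three sets (hypothetical helper, not the app's code).

    Returns (large, candidates, duplicates):
      large      - list of (path, size) for files of at least large_bytes
      candidates - {(suffix, size): [paths]} with more than one path
      duplicates - {(suffix, size, md5): [paths]} with more than one path
    """
    large = []
    by_suffix_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            if size >= large_bytes:
                large.append((path, size))
            suffix = os.path.splitext(name)[1].lower()
            by_suffix_size[(suffix, size)].append(path)

    # Same suffix and same length: cheap to find, no file contents read yet.
    candidates = {k: v for k, v in by_suffix_size.items() if len(v) > 1}

    # Only the candidates are hashed, so most files are never opened.
    duplicates = {}
    for (suffix, size), paths in candidates.items():
        by_hash = defaultdict(list)
        for path in paths:
            try:
                by_hash[md5_of(path)].append(path)
            except OSError:
                continue
        for digest, same in by_hash.items():
            if len(same) > 1:
                duplicates[(suffix, size, digest)] = same
    return large, candidates, duplicates
```

Calling find_sets on a share with a threshold such as 100 * 1024 * 1024 would treat anything of 100MB or more as "large"; the threshold is whatever you choose at run time.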

These sets are then displayed to the user, who has the option to save the details to a CSV file.
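As a rough idea of what such an export could look like, here is a small Python sketch. The column names and the save_sets_to_csv function are assumptions made for illustration, not the app's actual file format. It expects a dictionary keyed by (suffix, size, MD5) with a list of paths as each value, and writes one row per file so the result can be sorted and filtered in a spreadsheet.

```python
import csv


def save_sets_to_csv(duplicates, csv_path):
    """Write one row per duplicate file (hypothetical layout, not the app's format).

    duplicates maps (suffix, size_in_bytes, md5_hex) to a list of file paths.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["suffix", "size_bytes", "md5", "copies", "path"])
        for (suffix, size, digest), paths in sorted(duplicates.items()):
            for path in sorted(paths):
                # "copies" repeats the set size so rows can be sorted by it.
                writer.writerow([suffix, size, digest, len(paths), path])
```

Sorting the resulting file by the copies and size_bytes columns is a quick way to see which duplicated files cost the most storage.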

Using this, we have been able to discover duplicate files (obviously) but also failures in process. For example, when files are sent to us as part of a support incident, we were copying them to several different folders. We were also storing builds in one folder and then testing by copying to another. This app identified where people were not following the correct process and were copying files to secondary folders.

You can download the app from here. I'll be posting the source to CodeProject shortly.

S.
