Right now I’m in the middle of a whole-house PC upgrade, and as a result my current home server will become an off-site backup server. That means I’ll have a new, much more powerful 32TB server at home, and a low-power 29TB server at a relative’s house. Ideally my home server would back up to the off-site server nightly, and for the most part it will. However, when it comes to large files (like my GoPro footage), nightly backups could end up exceeding my relative’s internet cap, which is set at 1TB a month.

Given that I don’t want to pay an extra $15 a month to upgrade them to unlimited data, I thought about how I might use an external hard drive to carry data over to the backup server whenever I visit (about twice a month). The difficulty with this plan, however, is figuring out how the heck I’m going to keep track of what’s already on the backup server when I prepare the drive. Not only that, but since every trip involves two file copies (source to drive, then drive to backup server), I want to make sure that nothing gets corrupted along the way.

I was feeling a tad motivated one evening trying to figure out how to do this when I decided, ‘screw it, I’ll write a quick Python script’. Well, 20+ hours of coding later, including a complete refactor of the code, we have what I call Waterlock: a Python script for *incremental, offline backups using external hard drives. Named after the marine ‘locks’ that boats use to navigate through river systems in separate stages, Waterlock moves data between a source, middle, and end device. You can download it from GitHub.

*I should note that I use ‘incremental backup’ perhaps a tad loosely here. It doesn’t back up actual deltas or let you restore to a certain point in time (though I may add this feature in the future). But it works great for those ‘write-once and hopefully read-never’ scenarios, like moving your movie collection or photo library to an off-site computer and keeping it up to date.

A Quick Note on Development

I’m still hammering away on Waterlock, so it is definitely ‘alpha’ software. It seems to work well and the tests I’ve written haven’t thrown any errors, but regardless, I am not responsible for any lost data. Additionally, one iteration of the script may be incompatible with another due to changes to the database structure and whatnot, so please keep this in mind when re-downloading the script for use with existing deployments. But most of all, feel free to contribute! I’m definitely just an amateur Python developer so any help or feedback would be much appreciated.

How Waterlock Works, Step by Step

A bit of warning: I’m going to go into quite a bit of detail with this, but know that overall the usage of this tool is fairly straightforward and makes maintaining your backups incredibly easy, as you’ll see in the following section. With that said, here’s roughly how it works under the hood:

First, you save the script to an external hard drive (i.e. the ‘middle’ location) and feed it the paths to your desired source and end directories, which live on two separate systems. Waterlock automatically detects how far along the transfer is based on whether it can see the source or the end directory, so it won’t work if it can see both at the same time.
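
To make the detection idea concrete, here is a minimal sketch. It is not Waterlock’s actual code: the function name `detect_stage` and the stage labels are made up for illustration, and it assumes that ‘can see’ simply means the directory exists on the current machine.

```python
import os


def detect_stage(source_directory: str, end_directory: str) -> str:
    """Guess which leg of the trip we are on from which directory is visible.

    Hypothetical helper: the name and the returned labels are illustrative only.
    """
    source_visible = os.path.isdir(source_directory)
    end_visible = os.path.isdir(end_directory)

    if source_visible and end_visible:
        # Ambiguous: we cannot tell whether to load or unload the drive.
        raise RuntimeError("Both source and end directories are visible.")
    if source_visible:
        return "source_to_middle"
    if end_visible:
        return "middle_to_end"
    raise RuntimeError("Neither source nor end directory is visible.")
```

Raising an error when both directories are visible mirrors the limitation described above, rather than guessing a direction.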

Second, Waterlock will create two folders in the same directory as the script: config/ and cargo/. In the config folder, it creates a SQLite database that stores the path of every file in the source directory, the time each was last modified, and whether each file has been moved to the middle or end location yet, with room left to store a hash of each file. Of course, if you’ve already run the script, it will skip this step.
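
For a sense of what such a database might look like, here is a rough sketch using Python’s built-in sqlite3 module. The table and column names are assumptions chosen to match the description above; the real script’s schema may well differ.

```python
import sqlite3


def init_db(db_path: str) -> sqlite3.Connection:
    """Create the tracking table if it does not exist yet (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS files (
            path      TEXT PRIMARY KEY,            -- file path on the source
            modified  REAL NOT NULL,               -- last modification time
            hash      TEXT,                        -- filled in later, NULL until hashed
            in_middle INTEGER NOT NULL DEFAULT 0,  -- 1 once copied to cargo/
            in_end    INTEGER NOT NULL DEFAULT 0   -- 1 once copied to the destination
        )
        """
    )
    conn.commit()
    return conn
```

Because of `CREATE TABLE IF NOT EXISTS`, calling `init_db()` on an existing database leaves it untouched, which matches the ‘skip this step’ behaviour.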

Third, Waterlock will generate a blake2 hash of each file to store in the database. If the script has been run before, it checks the database first to avoid re-hashing a file it already knows. Then (ignoring some string manipulation that proved to be quite a headache), it copies the file from the source to the cargo/ folder (again, the ‘middle’ step in the process) and hashes the copy to check that nothing got corrupted during the transfer. If the hashes match, it marks the file in the database as having been moved to the middle step and moves on to the next file. If they don’t match, it retries the copy and hash up to five times before quitting. It keeps moving files this way until the middle drive fills up. The default configuration leaves 1 GiB free on the drive, but you can change this in the script settings if you want to leave more space.
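
The hash, copy, re-hash loop can be sketched roughly as follows. This is an illustration rather than the script’s real implementation: the helper names are invented, blake2b is assumed as the blake2 variant, and the free-space check is a simple guess at how the reserve might be enforced.

```python
import hashlib
import os
import shutil

MAX_ATTEMPTS = 5   # the post mentions five tries; exact counting is an assumption
GIB = 1024 ** 3


def blake2_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large videos never sit in memory all at once."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def has_room(path, dest_dir, reserved_gib=1):
    """True if copying `path` would still leave the reserved space free."""
    free = shutil.disk_usage(dest_dir).free
    return free - os.path.getsize(path) >= reserved_gib * GIB


def copy_verified(src, dst, expected_hash=None):
    """Copy src to dst and confirm the copy's hash, retrying on mismatch."""
    expected = expected_hash or blake2_of(src)
    os.makedirs(os.path.dirname(os.path.abspath(dst)), exist_ok=True)
    for _attempt in range(MAX_ATTEMPTS):
        shutil.copy2(src, dst)          # copy2 also preserves modification times
        if blake2_of(dst) == expected:
            return expected
        os.remove(dst)                  # discard the bad copy before retrying
    raise OSError(f"{src}: hash mismatch after {MAX_ATTEMPTS} attempts")
```

Passing `expected_hash` lets a caller reuse a hash already stored in the database instead of reading the source file a second time.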

Finally, we move on to the stage where files are transferred to the destination. Waterlock checks the database for all files marked as moved to the middle drive but not yet to the destination, and transfers them in much the same way as in the third step. It won’t rehash the file on the middle drive; it just pulls the stored hash from the database. If everything goes smoothly, the files end up safely on the destination and the database is updated to reflect everything that made it across. At this point, a function in the script called dump_cargo() will run if you’ve enabled it, deleting all the data in the cargo/ folder. Note that you’ll need to confirm this by typing “Yes” (case sensitive).
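
As an illustration of the confirmation step, a cleanup function along these lines would only delete when the answer is exactly “Yes”. The name `dump_cargo_sketch` and its parameters are hypothetical; the real dump_cargo() may be structured quite differently.

```python
import os
import shutil


def dump_cargo_sketch(cargo_dir, ask=input):
    """Empty cargo_dir after an exact, case-sensitive "Yes" (illustrative only)."""
    answer = ask('Delete everything in cargo/? Type "Yes" to confirm: ')
    if answer != "Yes":
        return False  # anything else, including "yes" or "YES", aborts

    for name in os.listdir(cargo_dir):
        target = os.path.join(cargo_dir, name)
        if os.path.isdir(target) and not os.path.islink(target):
            shutil.rmtree(target)   # remove sub-folders recursively
        else:
            os.remove(target)       # remove plain files and symlinks
    return True
```

Taking the prompt function as a parameter keeps the sketch easy to test without typing at a terminal.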

The next time you run the script on the source folder, the following will occur (though not necessarily in this particular order):

  • Waterlock will once again scan all the files, adding anything that is new or that didn’t get moved last time to the database.
  • If a file is in the database but can no longer be found on the source, Waterlock will give you the option to mark it either to be skipped or to be removed from the destination. If you select the latter, the next time Waterlock is run on the destination, it will confirm whether you want to delete the file. If you select no, it will simply mark it to be skipped instead.
  • Waterlock will also scan the file modification times, and if a file has been updated it will be marked as unmoved and its hash will be recalculated.
  • If a file is already on the middle drive (or the destination, for that matter), Waterlock will check its size, and if it doesn’t match what’s on the source (or on the middle drive, when moving to the destination), the file will be deleted and replaced. This handles files whose copy was cancelled halfway through.
  • Finally, Waterlock will check for any files that made it onto the middle drive, but not the destination. This may happen if you forgot to run it on the end destination or if you deleted the data before it got transferred. If the files are no longer on the middle drive, it will again mark them for copying.
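
The bookkeeping in the list above boils down to comparing what the database remembers with what is on disk. Here is a small sketch of that comparison; the function names are invented, and it assumes the database can be read into a plain dict of path to modification time.

```python
import os


def scan_mtimes(root):
    """Map each file's path (relative to root) to its modification time."""
    found = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            found[os.path.relpath(full, root)] = os.path.getmtime(full)
    return found


def classify(db_records, on_disk):
    """Split paths into new, modified, and missing (each list sorted)."""
    new = sorted(p for p in on_disk if p not in db_records)
    missing = sorted(p for p in db_records if p not in on_disk)
    modified = sorted(
        p for p in on_disk if p in db_records and on_disk[p] != db_records[p]
    )
    return new, modified, missing


def is_partial_copy(src, dst):
    """True if dst exists but its size differs from src, e.g. an interrupted copy."""
    return os.path.exists(dst) and os.path.getsize(src) != os.path.getsize(dst)
```

In this sketch, new and modified files would be queued for copying, missing ones would trigger the skip-or-remove prompt, and partial copies would be deleted and replaced.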

Waterlock also has two additional functions you can call to verify all the files on the middle or destination drives: verify_middle() and verify_destination(). Each compares the hashes stored in the database with the files on the middle or destination drive, depending on which function you called.
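
A verification pass of this kind might look roughly like the sketch below. It is not the actual verify_middle() or verify_destination() code: the function name is made up, blake2b is assumed as the hash, and the stored hashes are passed in as a plain dict rather than read from SQLite.

```python
import hashlib
import os


def verify_tree(root, stored_hashes):
    """Return (relative_path, problem) pairs for files that fail verification."""
    problems = []
    for rel_path, expected in stored_hashes.items():
        full = os.path.join(root, rel_path)
        if not os.path.isfile(full):
            problems.append((rel_path, "missing"))
            continue

        # Re-hash the file on disk in 1 MiB chunks.
        h = hashlib.blake2b()
        with open(full, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)

        if h.hexdigest() != expected:
            problems.append((rel_path, "hash mismatch"))
    return problems
```

An empty list means every file recorded in the dict is present and matches its stored hash.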

Again, this sounds like a lot, but the tool is easy to use and takes care of all the difficult parts of the process for you. All you have to do is edit two lines, run it before you go to your off-site backup, and then run it again when you arrive.

Setting it up

With how it works out of the way, setting up Waterlock is fairly straightforward. Just download the script from GitHub and save it onto the external hard drive you plan to use. Then, open it in a text editor and enter the absolute file paths for your source and destination directories at the top of the script (see below). Do not use relative file paths. Note that you can add multiple paths, but make sure to do so in the same order between the source and destination directories. You can also configure how much reserved space you want at this time.

'''===== IF RUNNING AS SCRIPT CHANGE THE FOLLOWING FOLDERS ====='''
# Absolute File Paths Only! Add comma-separated paths (e.g. a list of strings) to support multiple directories
# If using multiple source and end directories, ensure they are in the same order! See example in comment below.
source_directory = ['/ABSOLUTE/PATH/TO/FOLDER/'] # ['/ABSOLUTE/PATH/ONE', '/ABSOLUTE/PATH/TWO']
end_directory = ['/ABSOLUTE/PATH/TO/FOLDER/'] # ['/ABSOLUTE/PATH/ONE', '/ABSOLUTE/PATH/TWO']
reserved_space = 1 # Enter value in Gibibytes

'''============================================================='''

If you want to enable additional functions like dump_cargo() or verify_destination(), then just scroll down to the bottom of the script and uncomment the corresponding line of code. For instance, in the example below I’ve enabled the dump_cargo() function:

if __name__ == "__main__":

    if len(source_directory) != len(end_directory):
        raise Exception("Error: different number of source and end directories.")

    for i in range(len(source_directory)):
        wl = Waterlock( source_directory=source_directory[i],
                        end_directory=end_directory[i], 
                        reserved_space=reserved_space
                        )
        wl.start()

        #wl.verify_middle()
        #wl.verify_destination()
        wl.dump_cargo()

        del wl

And you’re ready to go! Run the script with python waterlock.py and fire away!