<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>David Swanlund</title>
    <description>David is a software developer and GIScience researcher who focuses on geospatial privacy issues.</description>
    <link>http://swanlund.dev/</link>
    <atom:link href="http://swanlund.dev/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 25 Dec 2023 18:50:55 +0000</pubDate>
    <lastBuildDate>Mon, 25 Dec 2023 18:50:55 +0000</lastBuildDate>
    <generator>Jekyll v3.9.3</generator>
    
      <item>
        <title>Upgrading My Home Server: A Build Log</title>
        <description>&lt;p&gt;Over the past three weeks I’ve been completely overhauling my home server setup, and I’ve learned a lot throughout the process that I thought I’d document it here in a build log.&lt;/p&gt;

&lt;p&gt;For context, I’ve had my current home-server since 2015. It’s a low-powered i3-4160 running 32GB of ECC memory, with 27TB of usable storage spread across a hodge-podge of 3TB, 4TB, and 8TB drives all jammed into a cheap Rosewill 4U chassis. Until 2020 it was running FreeNAS, but like many others I made the switch over to Unraid and never looked back.&lt;/p&gt;

&lt;p&gt;Unraid was great for me, as its seamless Docker integration allowed me to add significant functionality to what was basically just a file server with Plex. With Unraid I was able to start self-hosting a lot of additional services, like Bookstack for keeping notes organized, a Unifi controller to manage my wifi access points, Teedy to organize my important documents, a few containers for archiving podcasts and Youtube channels, etc.&lt;/p&gt;

&lt;p&gt;With this growth in functionality, however, my poor little i3 was struggling to keep up. I had bought it initially because it was cheap and supported ECC memory, a feature you otherwise have to go Xeon to get. And so for a while now I’ve been looking to upgrade.&lt;/p&gt;

&lt;p&gt;Between some unexpected income, some timely sales, and some other PC upgrades I’ve done around the house, the time to upgrade has finally arrived. It certainly wasn’t cheap, but by shuffling around some components between my existing systems, I was able to not only build a significantly upgraded home server, but squeeze out an offsite backup as well! Here’s how it came together.&lt;/p&gt;

&lt;h2 id=&quot;the-specs&quot;&gt;The Specs&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: &lt;a href=&quot;https://www.amd.com/en/products/cpu/amd-ryzen-5-3600&quot;&gt;Ryzen 3600&lt;/a&gt;. This was pulled straight from my gaming PC, which I just upgraded to a 5900x. It gives me 6 hyperthreaded cores (i.e. 12 threads) of computational goodness. It’s not the fastest chip out there, but it’s certainly more than enough for a home server, supports ECC memory (more on this later), and doesn’t suck back a tonne of power.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cooler&lt;/strong&gt;: &lt;a href=&quot;https://www.amazon.ca/gp/product/B01N9X2YYN/ref=ppx_yo_dt_b_asin_title_o05_s00?ie=UTF8&amp;amp;psc=1&quot;&gt;Noctua NH-U12S&lt;/a&gt;. Can’t go wrong with Noctua, and at this point we’re a Noctua household. I probably could have gotten away with the stock cooler, but given that this system will run 24/7/365 I wanted to keep the CPU as frosty as possible.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Motherboard&lt;/strong&gt;: &lt;a href=&quot;https://www.amazon.ca/gp/product/B088VSTS9H/ref=ppx_yo_dt_b_asin_title_o04_s01?ie=UTF8&amp;amp;psc=1&quot;&gt;Asus B550-F&lt;/a&gt;. Solid board with exceptional &lt;a href=&quot;https://linustechtips.com/topic/1137619-motherboard-vrm-tier-list-v2-currently-amd-only/&quot;&gt;power delivery&lt;/a&gt;. Bonus is that it has 2.5Gb ethernet built right in, which goes well with the &lt;a href=&quot;https://www.amazon.ca/gp/product/B08XWK4HNT/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&amp;amp;psc=1&quot;&gt;2.5Gb switch&lt;/a&gt; I just picked up. This puts my whole home network at speeds fast enough to saturate the drives without having to fork over big money for 10Gb.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt;: &lt;a href=&quot;https://www.amazon.ca/gp/product/B08KTT4867/ref=ppx_yo_dt_b_asin_title_o09_s01?ie=UTF8&amp;amp;psc=1&quot;&gt;Timetec 2x16GB ECC&lt;/a&gt;. A bit slow and boring without any RGB, but this is a server dammit!&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Power Supply&lt;/strong&gt;: &lt;a href=&quot;https://www.newegg.ca/seasonic-focus-plus-650-gold-ssr-650fx-650w/p/N82E16817151186?Item=N82E16817151186&quot;&gt;Seasonic Focus GX-650W 80+ Gold&lt;/a&gt;. Like the motherboard, this gives great power delivery and comes with a 10 year warranty.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Case&lt;/strong&gt;: &lt;a href=&quot;https://www.newegg.ca/black-fractal-design-define-7-xl-atx-full-tower/p/N82E16811352120?Description=define%207%20xl&amp;amp;cm_re=define_7%20xl-_-11-352-120-_-Product&quot;&gt;Fractal Define 7 XL&lt;/a&gt;. An absolute chonker of a unit with the ability to store &lt;strong&gt;18&lt;/strong&gt; hard drives. Clean aesthetic as well with sound dampening to keep things quiet. I made sure to buy lots of extra hard drive sleds so that I can keep adding drives long after Fractal drops support for it.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Host Bus Adapter&lt;/strong&gt;: &lt;a href=&quot;https://www.ebay.ca/itm/163563898712&quot;&gt;LSI 9210-8i&lt;/a&gt; (used off eBay). The motherboard I bought only has 6 SATA ports, so this essentially allows me to add 8 extra hard drives using a couple of SAS to SATA breakout cables.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: &lt;a href=&quot;https://www.hp.com/us-en/shop/pdp/nvidia-quadro-p400-2gb-graphics&quot;&gt;HP Quadro P400&lt;/a&gt; (used off eBay, still waiting for it to arrive). At the time of ordering, I wasn’t sure if the motherboard I got supported headless boot. Coming in at around $120 and paired with a &lt;a href=&quot;https://www.amazon.ca/gp/product/B08C7XPZX2/ref=ppx_yo_dt_b_asin_title_o02_s00?ie=UTF8&amp;amp;psc=1&quot;&gt;dummy plug&lt;/a&gt;, this rather inexpensive GPU solves that problem while also helping significantly with Plex transcoding, all without drawing much extra power.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hard Drives&lt;/strong&gt;: &lt;a href=&quot;https://www.newegg.ca/seagate-ironwolf-st8000vn004-8tb/p/N82E16822184796?Item=N82E16822184796&quot;&gt;3x8TB Seagate Ironwolfs&lt;/a&gt; as well as 3x8TB WD Reds I had from the old server. Nothing special here, just some standard CMR NAS drives.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cache Drives&lt;/strong&gt;: &lt;a href=&quot;https://www.amazon.ca/Samsung-Internal-MZ-76E1T0B-AM-Version/dp/B078DPCY3T/ref=sr_1_3?crid=38HHMBF9QSKBW&amp;amp;keywords=860+evo&amp;amp;qid=1636162336&amp;amp;s=electronics&amp;amp;sprefix=860+evo%2Celectronics%2C123&amp;amp;sr=1-3&quot;&gt;2x1TB Samsung 860 Evo SSDs&lt;/a&gt;. I actually got these a couple years ago, but these are relatively affordable and have decent write endurance at 600TBW, so they’ll do just fine as a cache for Unraid.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Blu-Ray Drive&lt;/strong&gt;: &lt;a href=&quot;https://www.amazon.ca/gp/product/B00E7B08MS/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&amp;amp;psc=1&quot;&gt;LG WH16NS40&lt;/a&gt;. This actually came from my desktop PC, as the new case I got for it didn’t support an optical drive. I’ll be feeding it into a Windows VM so that I can archive data using BD-R HTL discs (which, contrary to popular opinion, are great for archival purposes due to their use of inorganic dyes).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using Unraid this gives me 32TB of usable storage, with the ability to lose two drives from the pool without losing any data. I also shucked an old external 10TB drive and bought a &lt;a href=&quot;https://www.newegg.ca/toshiba-mg07aca12te-12tb/p/1B0-0011-000E7?Item=1B0-0011-000E7&quot;&gt;12TB Toshiba Enterprise drive&lt;/a&gt; to add to all the 3TB and 4TB drives on my old server. This left me with 29TB of usable offsite storage, with the 12TB drive being used for parity. I went with a slightly larger 12TB drive as the parity so that when all those 6-year-old 3TB drives start dropping like flies I can replace several of them with a single drive and cut down the overall amount of rust I have spinning at once.&lt;/p&gt;
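
&lt;p&gt;For anyone puzzling over those numbers, Unraid’s parity math is simple: usable space is the sum of the data drives, and each parity drive just has to be at least as large as the largest data drive. Here’s a quick sketch of that arithmetic (the offsite drive mix below is a placeholder that happens to add up to the figures above):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def unraid_capacity(data_drives_tb, parity_drives_tb):
    # Usable space is just the sum of the data drives; each parity drive
    # has to be at least as large as the largest data drive.
    assert min(parity_drives_tb) &gt;= max(data_drives_tb), 'parity drive too small'
    return sum(data_drives_tb)

# New server: 6x8TB with dual parity leaves 4x8TB of usable space
print(unraid_capacity([8, 8, 8, 8], [8, 8]))        # 32

# Offsite server: placeholder mix of 3/4/10TB data drives behind the 12TB parity
print(unraid_capacity([10, 4, 4, 4, 4, 3], [12]))   # 29
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;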

&lt;p&gt;&lt;img src=&quot;/assets/images/servergore.jpg&quot; alt=&quot;Obligatory photo of the inside of the case&quot; /&gt;
(I’ve got an old Nvidia 660ti in there right now while I wait for the P400 to arrive)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/harddrives.jpg&quot; alt=&quot;Excuse the horrendous cable management&quot; /&gt;
(Excuse the horrendous cable management, but it seems wrong to include photos without showing the hard drives.)&lt;/p&gt;

&lt;h2 id=&quot;the-problems&quot;&gt;The Problems&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. ECC Memory:&lt;/strong&gt; I usually love building PCs, but this one was straight up &lt;em&gt;stressful&lt;/em&gt;. Let’s start with the memory. Ryzen desktop CPUs are wonderful because they support ECC memory, whereas Intel likes to lock that feature down to their more costly Xeon line. Great, right? Well, it’s not that simple. While ECC is technically supported by the chip, the implementation on the motherboard side is… inconsistent. There’s lots of mixed information out there about whether particular motherboards actually support ECC with Ryzen, and that’s muddied even more by the fact that many motherboards have the ECC &lt;em&gt;silently&lt;/em&gt; correcting errors without reporting them to the operating system. This isn’t great, since your RAM may be failing and correcting lots of errors, but you’d never know. In short, ECC on Ryzen seems great on paper, but in practice it’s an absolute shitshow.&lt;/p&gt;

&lt;p&gt;So after banging my head against the keyboard, I finally decided to just plunk the ECC memory that had already arrived into my partner’s new Asus B550-F motherboard. Unfortunately, she was running a 5700G which &lt;em&gt;does not&lt;/em&gt; support ECC, but the motherboard BIOS nevertheless had an option for enabling ECC. So I took a leap of faith and ordered the same board from Amazon, knowing I could return it if I had to. Once it arrived, I booted up &lt;a href=&quot;https://www.memtest86.com/&quot;&gt;memtest&lt;/a&gt; and literally started dancing when I saw that it reported ECC polling enabled!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. LSI &amp;amp; Ironwolf:&lt;/strong&gt; So I built the server, spent a day getting 18TB of data transferred over, and was in the process of moving a few hard drives from the old server to the new one. This involved recalculating parity a couple of times as I added the drives, which is always a bit nerve-racking. I was already anxious, so my stomach dropped like a rock when Unraid gave me an error message saying that one of my drives had reported read errors and had been taken offline. &lt;em&gt;Shit&lt;/em&gt;. If one more drive goes down I have to restart and move 18TB of data all over again. So I frantically started Googling and discovered that as of the most recent version of Unraid, LSI controllers had started to have problems with 8TB and 10TB Ironwolf drives, causing them to drop out of the array randomly. Fortunately, some users over on the &lt;a href=&quot;https://forums.unraid.net/topic/103938-69x-lsi-controllers-ironwolf-disks-disabling-summary-fix/&quot;&gt;Unraid forum had figured out a fix&lt;/a&gt; involving disabling low current spinup and EPC on the drive firmware itself. This turned out to be only about 20 minutes of work, but I certainly could have done without the stress it caused.&lt;/p&gt;

&lt;h2 id=&quot;the-backup-strategy&quot;&gt;The Backup Strategy&lt;/h2&gt;
&lt;p&gt;With those issues out of the way, I was ready to start figuring out how exactly I was going to do my backups. Up until very recently, I had used &lt;a href=&quot;https://www.duplicati.com/&quot;&gt;Duplicati&lt;/a&gt; to back up my data to &lt;a href=&quot;https://www.backblaze.com/b2/cloud-storage.html&quot;&gt;Backblaze B2&lt;/a&gt;, but there are two caveats here. The first is that at some point during the summer my Duplicati backup had become corrupted and was rendered useless. My experience with Duplicati hadn’t been great so far, but losing 1.5TB of backups made me start looking elsewhere. The second issue is that I had only backed up 1.5TB of data, when I had around 18TB overall. At $5 per terabyte per month, B2 certainly isn’t expensive, but paying $90 USD per month for backups was simply not an option for me. And so I had made the decision to only back up my critical data, such as photos and documents. This left a tonne of Bluray rips, GoPro footage, and other large datasets I’ve archived like various Youtube channels and &lt;a href=&quot;https://bluemaxima.org/flashpoint/downloads/&quot;&gt;Flashpoint&lt;/a&gt; at risk of being lost entirely should something happen to my server, such as theft or fire. I often told myself that most of this data &lt;em&gt;was&lt;/em&gt; replaceable; I could always redownload Flashpoint, for instance. But the reality is that it would take me &lt;em&gt;years&lt;/em&gt; to accumulate the same collection.&lt;/p&gt;

&lt;p&gt;This is why I was so eager to get that offsite backup server. It would allow me to finally back up &lt;em&gt;all&lt;/em&gt; of my data on a nightly basis. My plan was to leave it at a relative’s house; unfortunately, that relative only has 1TB of bandwidth a month. While I doubt my nightly backups would exceed that limit, there have been a few months here and there where I’ve added 1-2TB of data to my hoard. And so while I now had offsite storage available, I didn’t have a good way of getting data onto it. I thought about physically moving the server back and forth once in a while to update it, but knew that there was no way I’d actually follow through on that and could leave myself open to losing several months of data. So instead, I developed a three-pronged strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Big(ish) Data&lt;/strong&gt;: For all of those extra-large datasets that might exceed that 1TB bandwidth cap, I’ll use an external SSD to move data over whenever I visit. Fortunately, I go there about twice a month so it should stay fairly up to date. To make this easier, I developed a Python tool called Waterlock (&lt;a href=&quot;https://github.com/TheTinHat/Waterlock&quot;&gt;Github&lt;/a&gt;) that takes care of all the hard work involved. I wrote about it &lt;a href=&quot;https://swanlund.space/waterlock&quot;&gt;just recently&lt;/a&gt;, but essentially when it’s run on the source system it will fill up the external drive with as much data as it can, and when it’s rerun on the destination it will move all that over, keeping a record of everything that’s been transferred. It also uses checksums to verify file copies, and I’m in the process of adding versioning and a few other handy features as well. This allows me to easily and incrementally move all that data without touching their bandwidth cap whatsoever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Smaller Data&lt;/strong&gt;: With the extra-large datasets taken care of, I decided to use &lt;a href=&quot;https://rsnapshot.org/&quot;&gt;rsnapshot&lt;/a&gt; for nightly backups of everything else. I had initially considered rsync, but if my home-server were to be hit by ransomware this would end up just syncing the damage over. rsnapshot, on the other hand, gives me some versioning that solves this issue. In this case, I’ll be keeping nightly snapshots for 7 days, and then weekly snapshots for a month after that.&lt;/p&gt;
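
&lt;p&gt;For the curious, the retention side of that is only a couple of lines in rsnapshot.conf. This is just a rough sketch (fields are tab-separated, the hostname and paths are placeholders, and older rsnapshot versions use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;interval&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;retain&lt;/code&gt;), with cron on the offsite box deciding when each level actually runs and pulling the data over the Wireguard tunnel:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;snapshot_root   /mnt/user/backups/homeserver/

# Keep 7 nightly snapshots and 4 weekly ones
retain  daily   7
retain  weekly  4

# Pull everything except the extra-large datasets from the home server
backup  root@homeserver:/mnt/user/documents/    homeserver/
backup  root@homeserver:/mnt/user/photos/       homeserver/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;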

&lt;p&gt;&lt;strong&gt;The Failsafe&lt;/strong&gt;: Just to be safe, I’m also using Duplicacy (not Duplicati, whose name is annoyingly similar) to back up all of that core data to B2 every night as well, with versions gradually being pruned over the course of about 6 months. Again, this is about 1.5TB of data, so only $7.50 a month to store on B2. So far Duplicacy has been rock solid and dramatically faster than Duplicati; hopefully it is more stable and less prone to corruption as well. It &lt;em&gt;is&lt;/em&gt; paid software, but personal licenses are dirt cheap for what you get, which by the way includes deduplication, versioning, and client-side encryption. So far I’m loving it and have even bought extra licenses so that I can do more frequent backups of the computers in my house to the home-server as well. As for why I don’t use Duplicacy for backing up to the offsite server, well I want to keep at least one backup as standard, native files rather than cut into chunks, compressed, encrypted, etc. Suffice to say that corrupted Duplicati backup has put the fear of god into me.&lt;/p&gt;

&lt;h2 id=&quot;the-management-solution&quot;&gt;The Management Solution&lt;/h2&gt;
&lt;p&gt;Finally, I needed some way to manage the offsite server. I’m fairly adamant about using full-disk encryption, but that means that when the server reboots I need to enter a passphrase to start it back up properly, which I can’t easily do without being on-site. I am using &lt;a href=&quot;https://www.wireguard.com/&quot;&gt;Wireguard&lt;/a&gt; to connect the on-site and off-site servers together so the rsnapshot job can run, and I had considered extending this to my desktop so that I could access Unraid’s management interface, but I couldn’t figure out how to configure it to run exactly how I wanted without opening ports on their end. Moreover, if the server reset and I didn’t notice, it would mean backups might not run properly. So as a compromise I decided to use &lt;a href=&quot;https://www.youtube.com/watch?v=TSlHEBR1yfY&quot;&gt;SpaceInvaderOne’s tutorial&lt;/a&gt; for automatically fetching a keyfile when Unraid boots, only saving it to memory, and using that to unlock the encrypted disks. The upside here is that it allows the server to fully reboot on its own, and if it ever gets stolen I can just take down that keyfile. However, whereas SpaceInvaderOne uses an SFTP server to host the key, I decided to leverage that existing Wireguard connection so that I can keep the keyfile securely on my home-server and just rsync it over via the VPN connection.&lt;/p&gt;

&lt;p&gt;Great, the server now boots up on its own, but I still want &lt;em&gt;some&lt;/em&gt; way to access the management interface without driving over there. For this I am using &lt;a href=&quot;https://tailscale.com/&quot;&gt;Tailscale&lt;/a&gt;. Tailscale is a really incredible tool insofar as it’s dead simple to use. After installing the program, you just need to log in to your Google or Github account. Repeat this for each of your devices, and one by one they begin to form a mesh network powered by Wireguard. I was shocked at just how simple and effective it was, and it even has an iOS app, allowing me to manage all my systems even while I’m on the go. Best of all, it uses relay servers so that you don’t have to forward any ports. I couldn’t be happier with it.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;So after weeks of planning, smashing my face into my keyboard, drives randomly dropping out of the array, cut fingers from the heatsink fins slicing through my flesh, and endless data transfer and parity calculations, I’ve finally got a home server I can really take full advantage of (or at least I will once this P400 arrives). And with that, it’s time to pack up that old Rosewill chassis and officially retire it to the offsite-backup location.&lt;/p&gt;
</description>
        <pubDate>Sat, 06 Nov 2021 00:00:00 +0000</pubDate>
        <link>http://swanlund.dev/server-build-log</link>
        <guid isPermaLink="true">http://swanlund.dev/server-build-log</guid>
        
        <category>datahoarding</category>
        
        <category>unraid</category>
        
        <category>backups</category>
        
        
      </item>
    
      <item>
        <title>Introducing Waterlock: Making Incremental, Offsite &amp; Offline Backups Easy</title>
        <description>&lt;p&gt;Right now I’m in the middle of a whole-house PC upgrade, and the result of that process is that my current home-server will become an off-site backup server. That means I’ll have a new, much more powerful 32TB server at home, and a low-power 29TB server at a relatives house. Ideally my home-server would back up to the offsite server nightly, and for the most part it will. However, when it comes to large files (like my GoPro footage), nightly backups could end up exceeding my relatives’ internet cap, which is set at 1TB a month.&lt;/p&gt;

&lt;p&gt;Given that I don’t want to pay an extra $15 a month to upgrade their plan to unlimited data, I thought about how I might use an external hard drive to transfer data over to the backup server whenever I visit (about twice a month). The difficulty of this plan, however, is figuring out how the heck I’m going to keep track of what’s &lt;em&gt;already on&lt;/em&gt; the backup server when I prepare the drive. Not only that, but given that each move involves two file copies, I want to make sure that nothing gets corrupted as it’s being transferred.&lt;/p&gt;

&lt;p&gt;I was feeling a tad motivated one evening trying to figure out how to do this when I decided, ‘screw it, I’ll write a &lt;strong&gt;quick&lt;/strong&gt; Python script’. Well, about &lt;strong&gt;20+ hours&lt;/strong&gt; of coding later, including entirely refactoring all the code, and we have what I call &lt;a href=&quot;https://github.com/TheTinHat/Waterlock&quot;&gt;Waterlock&lt;/a&gt;: a Python script for *incremental, offline backups using external hard drives. Named after the marine ‘locks’ that boats use to navigate through river systems in separate stages, Waterlock moves data between a source, middle, and end device. You can download it &lt;a href=&quot;https://github.com/TheTinHat/Waterlock&quot;&gt;here from GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;*I should note that I use ‘incremental backup’ perhaps a tad loosely here. It doesn’t back up actual deltas or let you restore to a certain point in time (though I may add this feature in the future). But it works great for those ‘write-once and hopefully read-never’ scenarios, like moving your movie collection or photo library to an off-site computer and keeping it up to date.&lt;/p&gt;

&lt;h2 id=&quot;a-quick-note-on-development&quot;&gt;A Quick Note on Development&lt;/h2&gt;
&lt;p&gt;I’m still hammering away on Waterlock, so it is definitely ‘alpha’ software. It seems to work well and the tests I’ve written haven’t thrown any errors, but regardless, &lt;strong&gt;I am not responsible for any lost data&lt;/strong&gt;. Additionally, one iteration of the script may be incompatible with another due to changes to the database structure and whatnot, so please keep this in mind when re-downloading the script for use with existing deployments. But most of all, feel free to contribute! I’m definitely just an amateur Python developer so any help or feedback would be much appreciated.&lt;/p&gt;

&lt;h2 id=&quot;how-waterlock-works-step-by-step&quot;&gt;How Waterlock Works, Step by Step&lt;/h2&gt;
&lt;p&gt;A bit of warning: I’m going to go into quite a bit of detail with this, but know that overall the usage of this tool is fairly straightforward and makes maintaining your backups incredibly easy, as you’ll see in the following section. With that said, here’s roughly how it works under the hood:&lt;/p&gt;

&lt;p&gt;After saving the &lt;a href=&quot;https://github.com/TheTinHat/Waterlock&quot;&gt;script&lt;/a&gt; to an external hard drive (i.e. the ‘middle’ location), you then feed it the paths to your desired source and end directories, which of course live on two separate systems. Waterlock automatically detects how far along the transfer is based on whether it can see the source or end directory, so it won’t work if it can see both at the same time.&lt;/p&gt;
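
&lt;p&gt;To make that a bit more concrete, the detection logic boils down to something like this (a simplified sketch of the idea rather than Waterlock’s actual code):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import os

def detect_stage(source_directory, end_directory):
    # Whichever side of the transfer is visible determines the stage
    source_visible = os.path.isdir(source_directory)
    end_visible = os.path.isdir(end_directory)
    if source_visible and end_visible:
        raise Exception('Both directories are visible; run each stage on a separate system.')
    if source_visible:
        return 'fill'    # copy from the source into cargo/ on the external drive
    if end_visible:
        return 'unload'  # copy from cargo/ into the end directory
    raise Exception('Neither the source nor the end directory is reachable.')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;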

&lt;p&gt;Second, Waterlock will create two folders in the same directory as the script: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;config/&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cargo/&lt;/code&gt;. In the config folder, a SQLite database will be created storing the path of every file in the source directory, the last time it was modified, and a record of whether it has been moved to the middle or end directories yet, while also leaving room to store a hash of each file. Of course, if you’ve already run the script it will skip this step.&lt;/p&gt;
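
&lt;p&gt;A minimal version of such a table might be created like this (my own illustrative sketch; the column names are not Waterlock’s exact schema):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import os
import sqlite3

os.makedirs('config', exist_ok=True)
con = sqlite3.connect('config/waterlock.db')
con.execute('''CREATE TABLE IF NOT EXISTS files (
    filepath TEXT PRIMARY KEY,           -- path of the file on the source system
    modtime REAL,                        -- last modification time of the source file
    checksum TEXT,                       -- room for a blake2 hash, filled in before copying
    copied_to_middle INTEGER DEFAULT 0,  -- 1 once verified in cargo/
    copied_to_end INTEGER DEFAULT 0      -- 1 once verified at the destination
    )''')
con.commit()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;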

&lt;p&gt;Third, Waterlock will generate a blake2 hash of each file to store in the database. Note that if the script has already been run previously then it will check the database to avoid re-hashing the file. Then (ignoring some string manipulation that proved to be quite a headache), it will copy the file from the source to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cargo/&lt;/code&gt; folder (again, the ‘middle’ step in the process), before hashing the file once more to check that nothing got corrupted during the transfer. If everything went to plan, it will mark it in the database as having been moved to the middle step and move on to the next file. If, however, the hashes do not match, it will retry moving and hashing the file five times before quitting. Otherwise, it will keep moving files until the middle drive gets filled up. The default configuration will leave 1GB free on the drive, but you can change this in the script settings if you want to leave more space.&lt;/p&gt;
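
&lt;p&gt;The hash-copy-verify loop itself is straightforward to sketch out (again, a simplified stand-in for what Waterlock does, not its exact code):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import hashlib
import shutil

def blake2_hash(path, blocksize=2**20):
    # Hash the file in 1MB chunks so large files don't blow up memory usage
    hasher = hashlib.blake2b()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            hasher.update(block)
    return hasher.hexdigest()

def verified_copy(source, destination, attempts=5):
    # Copy the file and confirm the hashes match, retrying a few times if they don't
    expected = blake2_hash(source)
    for _ in range(attempts):
        shutil.copy2(source, destination)
        if blake2_hash(destination) == expected:
            return expected
    raise IOError(f'Copy of {source} failed verification {attempts} times')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;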

&lt;p&gt;Finally, we move on to the stage where we transfer files to the destination. Waterlock will check the database for all the files that have been marked as having been moved to the middle drive but not the destination, and will begin to transfer those in much the same way as the third step. It won’t rehash the file on the middle drive, instead just pulling the stored hash from the database. If everything goes smoothly, the files should end up safely on the destination and the database will be updated to reflect everything that safely made it across. Now at this point, there’s a function in the script called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dump_cargo()&lt;/code&gt; that will run if you’ve enabled it, which will delete all the data in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cargo/&lt;/code&gt; folder. Note that you’ll need to confirm this by typing “Yes” (case sensitive).&lt;/p&gt;

&lt;p&gt;The next time you run the script on the source folder, the following will occur (though not necessarily in this particular order):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Waterlock will once again scan all the files, adding anything that is new or that didn’t get moved last time to the database.&lt;/li&gt;
  &lt;li&gt;If a file is in the database but can no longer be found on the source, Waterlock will give you the option to mark it either to be skipped or to be removed from the destination. If you select the latter, the next time Waterlock is run on the destination, it will confirm whether you want to delete the file. If you select no, it will simply mark it to be skipped instead.&lt;/li&gt;
  &lt;li&gt;Waterlock will also scan the file modification times, and if a file has been updated it will be marked as unmoved and its hash will be recalculated.&lt;/li&gt;
  &lt;li&gt;If a file is already on the middle drive (or destination for that matter), Waterlock will check the size of the file and if it doesn’t match what’s on the source (or middle drive in case of moving to the destination) it will be deleted and replaced. This helps solve the issue of a copy being cancelled halfway through.&lt;/li&gt;
  &lt;li&gt;Finally, Waterlock will check for any files that made it onto the middle drive, but not the destination. This may happen if you forgot to run it on the end destination or if you deleted the data before it got transferred. If the files are no longer on the middle drive, it will again mark them for copying.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Waterlock also has two additional functions you can call to verify all the files on the middle or destination drives. These are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;verify_middle()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;verify_destination()&lt;/code&gt;. These will compare the hashes stored in the database against the files on the middle or destination drive, depending on which function you call.&lt;/p&gt;
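
&lt;p&gt;Conceptually, a verification pass just re-hashes whatever is on disk and compares it against the database. Reusing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;blake2_hash()&lt;/code&gt; function and the illustrative schema from the sketches above, it amounts to something like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import os

def verify(con, location_column, root):
    # location_column is 'copied_to_middle' or 'copied_to_end' in the sketch schema
    failures = []
    query = f'SELECT filepath, checksum FROM files WHERE {location_column} = 1'
    for filepath, checksum in con.execute(query):
        target = os.path.join(root, filepath.lstrip('/'))
        if not os.path.isfile(target) or blake2_hash(target) != checksum:
            failures.append(filepath)
    return failures
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;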

&lt;p&gt;Again, this sounds like a lot but using the tool is easy and takes care of all the difficult parts of the process for you. All you have to do is edit two lines, run it before you go to your off-site backup, and then again when you arrive.&lt;/p&gt;

&lt;h2 id=&quot;setting-it-up&quot;&gt;Setting it up&lt;/h2&gt;
&lt;p&gt;With how it works out of the way, setting up Waterlock is fairly straightforward. Just download the script &lt;a href=&quot;https://github.com/TheTinHat/Waterlock&quot;&gt;from GitHub&lt;/a&gt; and save it onto the external hard drive you plan to use. Then, open it in a text editor and enter the &lt;strong&gt;absolute&lt;/strong&gt; file paths for your source and destination directories at the top of the script (see below). Do not use relative file paths. Note that &lt;em&gt;you can add multiple paths&lt;/em&gt;, but make sure to do so in the same order between the source and destination directories. You can also configure how much reserved space you want at this time.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;'''===== IF RUNNING AS SCRIPT CHANGE THE FOLLOWING FOLDERS ====='''
# Absolute File Paths Only! Add comma-separated paths (e.g. a list of strings) to support multiple directories
# If using multiple source and end directories, ensure they are in the same order! See example in comment below.
source_directory = ['/ABSOLUTE/PATH/TO/FOLDER/'] # ['/ABSOLUTE/PATH/ONE', '/ABSOLUTE/PATH/TWO']
end_directory = ['/ABSOLUTE/PATH/TO/FOLDER/'] # ['/ABSOLUTE/PATH/ONE', '/ABSOLUTE/PATH/TWO']
reserved_space = 1 # Enter value in Gibibytes

'''============================================================='''
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you want to enable additional functions like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dump_cargo()&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;verify_destination()&lt;/code&gt;, then just scroll down to the bottom of the script and uncomment the corresponding line of code. For instance, in the example below I’ve enabled the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dump_cargo()&lt;/code&gt; function:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;if __name__ == &quot;__main__&quot;:

    if len(source_directory) != len(end_directory):
        raise Exception(&quot;Error: different number of source and end directories.&quot;)

    for i in range(len(source_directory)):
        wl = Waterlock( source_directory=source_directory[i],
                        end_directory=end_directory[i], 
                        reserved_space=reserved_space
                        )
        wl.start()

        #wl.verify_middle()
        #wl.verify_destination()
        wl.dump_cargo()

        del wl
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And you’re ready to go! Run the script with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python waterlock.py&lt;/code&gt; and fire away!&lt;/p&gt;

</description>
        <pubDate>Thu, 28 Oct 2021 00:00:00 +0000</pubDate>
        <link>http://swanlund.dev/waterlock</link>
        <guid isPermaLink="true">http://swanlund.dev/waterlock</guid>
        
        <category>python</category>
        
        <category>datahoarding</category>
        
        <category>backups</category>
        
        
      </item>
    
      <item>
        <title>The Cheap &amp; Easy Audio Upgrade</title>
        <description>&lt;p&gt;Welcome to Zoom University, where students listen to lectures recorded through a Pringles can stuffed with tinfoil and people yell “Sorry I missed that” every 30 seconds in meetings because someone’s dog started barking.&lt;/p&gt;

&lt;p&gt;Except, it doesn’t have to be like that. Some quick changes to your audio setup can really, really make your videos and meetings have a far more professional feel to them. In fact, audio quality makes a much bigger difference to the production quality of a video than the image quality. In other words, you’re far better off investing in a decent mic than a decent webcam if you want to make high quality videos.&lt;/p&gt;

&lt;p&gt;Rather than getting bogged down in audiophile nonsense, XLR interfaces, and $400 cables that “have better shielding”, here are some straightforward and generally affordable (if not free) ways to seriously upgrade your audio.&lt;/p&gt;

&lt;p&gt;Note: if you’re a sound engineer, look away now. This is not for you.&lt;/p&gt;

&lt;h2 id=&quot;1-position-your-mic&quot;&gt;1. Position Your Mic&lt;/h2&gt;

&lt;p&gt;This is the easiest fix you can make. Just position your mic better, about 3-4 inches away from the corner of your mouth, and the audio will sound far less distant and echo-ey than if it’s just sitting on your desk. In fact, even if you buy a nice $200 mic, it will probably still sound bad unless it’s positioned correctly. Take a listen &lt;a href=&quot;/assets/mic-sample.mp3&quot;&gt;to this mic sample I’ve recorded&lt;/a&gt; comparing just how much of a difference mic positioning can make.&lt;/p&gt;

&lt;p&gt;To get the right position, you’ll of course need to use a mic that’s external to your laptop or webcam. Personally, I run a &lt;a href=&quot;https://www.amazon.ca/gp/product/B06XQ39XCY/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&amp;amp;psc=1&quot;&gt;$40 Fifine USB mic&lt;/a&gt;, and mount it on a &lt;a href=&quot;https://www.amazon.ca/gp/product/B07JB68T1C/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&amp;amp;psc=1&quot;&gt;$40 boom arm&lt;/a&gt;, both purchased off Amazon. The arm lets me easily position the mic for meetings and video recording, and afterwards a quick push lets me get it out of the way. It also includes a pop filter, which helps mitigate those loud ‘P’ sounds.&lt;/p&gt;

&lt;h2 id=&quot;2-use-rtx-voice&quot;&gt;2. Use RTX Voice&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://www.nvidia.com/en-us/geforce/guides/nvidia-rtx-voice-setup-guide/&quot;&gt;RTX Voice&lt;/a&gt; is dark magic that would tempt even Albus Dumbledore himself to join Voldemort. It does an amazing job cancelling out background noise, to the point that you can literally &lt;a href=&quot;https://www.youtube.com/watch?v=uWUHkCgslNE&quot;&gt;run a vacuum in the background&lt;/a&gt; and maintain relatively clear audio. It requires an Nvidia graphics card, which is the only downside, but it’s a feature worth going Nvidia for. When using RTX Voice, you’ll direct your microphone audio into the software, and then select “RTX Voice” as your microphone on Zoom, OBS, or whatever other recording software you’re using. This will give your listeners crystal clear audio that’s free of barking dogs, screaming children, lawn mowers, or loud laundry machines. As an additional benefit, RTX Voice can also cut out &lt;em&gt;other people’s&lt;/em&gt; background noise so that you can hear them better as well.&lt;/p&gt;

&lt;h2 id=&quot;3-buy-a-better-mic&quot;&gt;3. Buy A Better Mic&lt;/h2&gt;

&lt;p&gt;This doesn’t mean buy an &lt;em&gt;expensive&lt;/em&gt; mic. As I mentioned, I achieve pretty decent results with just a &lt;a href=&quot;https://www.amazon.ca/gp/product/B06XQ39XCY/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&amp;amp;psc=1&quot;&gt;$40 Fifine mic&lt;/a&gt; and would highly recommend it (though the price seems to have gone up due to COVID). In fact, it is leaps and bounds better than the mic on my $300 &lt;a href=&quot;https://www.amazon.ca/Logitech-BRIO-Digital-Recording-Streaming/dp/B01N5UOYC4/ref=sr_1_2?dchild=1&amp;amp;keywords=brio+logitech&amp;amp;qid=1603829890&amp;amp;sr=8-2&quot;&gt;Logitech Brio&lt;/a&gt; webcam (in case you’re wondering, you should probably just get a &lt;a href=&quot;https://www.amazon.ca/Logitech-C920-Webcam-Pro-960-000764/dp/B006JH8T3S/ref=sr_1_5?dchild=1&amp;amp;keywords=logitech+c920&amp;amp;qid=1603830953&amp;amp;sr=8-5&quot;&gt;C920&lt;/a&gt; instead of the Brio). &lt;a href=&quot;/assets/mic-sample-brio.mp3&quot;&gt;Here’s an audio sample&lt;/a&gt; comparing the two microphones.&lt;/p&gt;

&lt;p&gt;If you’re willing to drop a little bit of extra cash, the &lt;a href=&quot;https://www.amazon.ca/dp/B07ZPBFVKK/?coliid=ICTCYX93WYVA6&amp;amp;colid=1L72ELLFKAPYD&amp;amp;psc=1&amp;amp;ref_=lv_ov_lig_dp_it_im&quot;&gt;$130 Audio Technica ATR2100x-USB&lt;/a&gt; is an excellent option that will do a much better job of not picking up background noise (though you really will need to position it close to your mouth as it is a &lt;a href=&quot;https://www.neumann.com/homestudio/en/what-is-a-dynamic-microphone#:~:text=Dynamic%20microphones%2C%20thus%2C%20are%20microphones,surrounded%20by%20a%20permanent%20magnet.&quot;&gt;dynamic microphone&lt;/a&gt;). The Blue Yeti is also very popular, though more expensive. Personally, I’d recommend just buying the Audio Technica and investing the savings into a &lt;a href=&quot;https://www.amazon.ca/gp/product/B07JB68T1C/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&amp;amp;psc=1&quot;&gt;boom arm&lt;/a&gt; because once again, mic positioning is everything.&lt;/p&gt;

&lt;h2 id=&quot;4-pay-attention-to-gain-and-clipping&quot;&gt;4. Pay Attention to Gain and Clipping&lt;/h2&gt;

&lt;p&gt;This is a big one. If you’ve ever heard audio that sounds loud and crunchy, it’s probably because of clipping. Take a listen to &lt;a href=&quot;/assets/mic-sample-gain.mp3&quot;&gt;this sample I’ve recorded comparing the difference gain makes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you don’t watch out for clipping, even an expensive mic will still sound horrible. Gain is essentially how much your microphone volume is boosted. If it’s boosted too much, then the louder parts of your speech will ‘clip’ and sound crunchy. In order to control this, talk into your mic and watch the little sound meter in your recording software. As you’re talking, dial back the gain until the sound meter just &lt;em&gt;barely&lt;/em&gt; hits the red (upper end) at the loudest parts of your speech. Some mics have built-in gain dials, so use that if you have one, otherwise use the microphone settings in Windows, MacOS, or whatever program you’re using to dial these levels back.&lt;/p&gt;

&lt;p&gt;One catch that I’ve noticed on Windows is that programs behave differently with regard to gain and clipping, so don’t assume that just because you’ve set it to sound good in Zoom that it will also sound good in OBS.&lt;/p&gt;

&lt;h2 id=&quot;5-fiddle-with-your-software&quot;&gt;5. Fiddle With Your Software&lt;/h2&gt;

&lt;p&gt;There are a lot of other optimizations you can make in various programs. Here are a handful of tips and suggestions:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Audacity: &lt;a href=&quot;https://gist.github.com/pgburt/d6917b5c827f907298cc&quot;&gt;follow this workflow&lt;/a&gt; to give your audio that podcasty feel.&lt;/li&gt;
  &lt;li&gt;OBS: experiment with filters, especially the compressor. See &lt;a href=&quot;https://www.youtube.com/watch?v=0m5jFdV9i4M&quot;&gt;this video&lt;/a&gt; for some great tips on how to do that.&lt;/li&gt;
  &lt;li&gt;Zoom: try turning on Original Audio, which is &lt;a href=&quot;https://support.zoom.us/hc/en-us/articles/115003279466-Enabling-option-to-preserve-original-sound&quot;&gt;well documented here&lt;/a&gt;, and allows you to bypass some of Zoom’s own processing (don’t do this unless you already have good audio).&lt;/li&gt;
  &lt;li&gt;Davinci Resolve: watch &lt;a href=&quot;https://www.youtube.com/watch?v=uUXG8XkhyEk&quot;&gt;this tutorial&lt;/a&gt; on improving the audio in Resolve. Pay particular attention to the compressor and de-esser.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Tue, 27 Oct 2020 00:00:00 +0000</pubDate>
        <link>http://swanlund.dev/audio-upgrade</link>
        <guid isPermaLink="true">http://swanlund.dev/audio-upgrade</guid>
        
        <category>audio</category>
        
        <category>zoom</category>
        
        <category>obs</category>
        
        <category>microphones</category>
        
        
      </item>
    
      <item>
        <title>Parallelizing GIS with Geopandas and Multiprocessing in Python</title>
        <description>&lt;p&gt;I recently found myself having to iteratively perform a complicated series of buffers, intersects, and joins over a large geodataframe. This isn’t necessarily a problem for one-off operations where you can afford to wait a while, but if you plan on running this script often or want to distribute it publicly, it helps to squeeze every ounce of performance you can out of it.&lt;/p&gt;

&lt;p&gt;One obvious way to do this is to parallelize it, meaning to run the program simultaneously across more than just one CPU core. Unfortunately, a lot of tools in GIS only utilize a single core, whereas most of us are now equipped with at least four. Hell, you can get what is effectively a &lt;a href=&quot;https://www.newegg.ca/amd-ryzen-5-3600/p/N82E16819113569&quot;&gt;12 core processor&lt;/a&gt; now for ~$200. With that said, I should briefly note that parallelizing code isn’t a silver bullet: many smaller tasks may actually run slower after parallelization, and you should always optimize code in other ways before just stretching it across more CPU cores.&lt;/p&gt;

&lt;p&gt;Unfortunately, for the cases where parallelizing makes sense, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;multiprocessing&lt;/code&gt; (the standard Python package for parallelizing code) can be a bit complicated to figure out (it certainly was for me). Hopefully this post helps illustrate how you might use it in a GIS context.&lt;/p&gt;

&lt;h2 id=&quot;getting-started&quot;&gt;Getting Started&lt;/h2&gt;

&lt;p&gt;Here’s what you’re going to need to run this tutorial. Start off by importing these five packages. They are all fairly standard for GIS, and multiprocessing should come already installed with any recent Python installation.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;geopandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;multiprocessing&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mp&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;statistics&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next, we need to fetch our data. We’re going to analyze the average distance from each intersection to every other intersection in Vancouver, a task that is relatively straightforward but requires lots of iteration. There are probably specific tools for doing this, but the actual analysis here doesn’t really matter since the main goal is parallelizing whatever real-world analysis we may have.&lt;/p&gt;

&lt;p&gt;Intersection data is freely available from the &lt;a href=&quot;https://data.vancouver.ca/datacatalogue/index.htm&quot;&gt;Vancouver Open Data Catalogue&lt;/a&gt;, so go ahead and &lt;a href=&quot;ftp://webftp.vancouver.ca/OpenData/shape/street_intersections_shp.zip&quot;&gt;download it&lt;/a&gt; as a shapefile and unzip it into wherever you’re running this Python code from.&lt;/p&gt;

&lt;p&gt;Next, load the data into geopandas as usual into a geodataframe:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;intersections&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'street_intersections.shp'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the data loaded, there are essentially three broad steps to analyzing it in parallel:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create a function to process the data&lt;/li&gt;
  &lt;li&gt;Create a function to parallelize our processing function. This will have to:
    &lt;ul&gt;
      &lt;li&gt;Split the geodataframe into chunks&lt;/li&gt;
      &lt;li&gt;Process each chunk&lt;/li&gt;
      &lt;li&gt;Reassemble the chunks back into a geodataframe&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Run the parallelizing function&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;creating-the-function-to-be-parallelized&quot;&gt;Creating the function to be parallelized&lt;/h2&gt;

&lt;p&gt;First we need to define the function that we want to parallelize. Essentially, we want to be able to call a single function that will house all the tasks that we want each CPU core to run. In this case, we want our function to take in a geodataframe and calculate the distance from each point to every other point, before averaging these measurements and saving that average in a new column.&lt;/p&gt;

&lt;p&gt;Since we are calculating the distance between &lt;em&gt;each point to every other point&lt;/em&gt;, our function will require two parameters:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;a smaller geodataframe (chunk) containing &lt;em&gt;which&lt;/em&gt; points each CPU core is responsible for processing&lt;/li&gt;
  &lt;li&gt;a geodataframe containing the entire set of points that each point in (1) will be measured against&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Don’t worry too much about how this function works, as it’s just a placeholder for whatever complicated thing you want to do.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;neighbour_distance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gdf_chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdf_complete&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdf_chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iterrows&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Iterate over the chunk
&lt;/span&gt;    
        &lt;span class=&quot;n&quot;&gt;distances&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdf_complete&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;distance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Calculate distances from each row in the complete geodataframe to each row in the chunked geodataframe.
&lt;/span&gt;        
        &lt;span class=&quot;n&quot;&gt;gdf_chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;at&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'distance'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;distances&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Enter the mean of the distances into a column called 'distances' in the chunked geodataframe.
&lt;/span&gt;        
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gdf_chunk&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;creating-the-parallelizing-function&quot;&gt;Creating the parallelizing function&lt;/h2&gt;

&lt;p&gt;Now we need to write a separate function that will run our first function in parallel. This isn’t strictly necessary on Linux, but Windows will spit out errors in a never-ending loop unless we do it this way.&lt;/p&gt;

&lt;p&gt;Our parallelizing function will need to split our geodataframe into smaller chunks, process those chunks, and reassemble them back into a single geodataframe. The whole function will look like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;parallelize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;cpus&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cpu_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;intersection_chunks&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array_split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;intersections&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cpus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;pool&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Pool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;processes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cpus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;chunk_processes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pool&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;apply_async&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;neighbour_distance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;intersections&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chunk&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;intersection_chunks&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;intersection_results&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chunk&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chunk_processes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;intersections_dist&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GeoDataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;intersection_results&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;intersections&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;intersections_dist&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The first line within our function uses multiprocessing’s &lt;a href=&quot;https://docs.python.org/2/library/multiprocessing.html#multiprocessing.cpu_count&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cpu_count()&lt;/code&gt;&lt;/a&gt; to tell us how many CPU cores our system has. This is how many chunks we will need to create, and how many CPU cores our program will be spread across. We &lt;em&gt;could&lt;/em&gt; use a fixed number (e.g. 4), but on systems with more cores we’d be underutilizing the hardware.&lt;/p&gt;

&lt;p&gt;Next, we need to actually split the geodataframe into chunks, and this is what &lt;a href=&quot;https://docs.scipy.org/doc/numpy/reference/generated/numpy.array_split.html#numpy.array_split&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;array_split&lt;/code&gt;&lt;/a&gt; does. It takes an array (the first argument, i.e. ‘intersections’) and splits it into a set number of chunks/buckets (the second argument, i.e. ‘cpus’). In this case, we’ve split the geodataframe into as many chunks as we have CPU cores. &lt;em&gt;(Note that there are numerous ways to actually do this, including ways that wouldn’t require entering two geodataframes into our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;neighbour_distance&lt;/code&gt; function. I just find this way to be the best and cleanest for most situations).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pool = mp.Pool(processes=cpus)&lt;/code&gt; constructs a “Pool” that contains all of our processes, and in this case we’ve specified that we want as many processes as we have CPU cores.&lt;/p&gt;

&lt;p&gt;The next two lines are important. What we’re doing here is telling multiprocessing to run our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;neighbour_distance&lt;/code&gt; function in a separate process for each of the chunks we created, using a &lt;a href=&quot;https://www.pythonforbeginners.com/basics/list-comprehensions-in-python&quot;&gt;list comprehension&lt;/a&gt;. Notice too that we specified our arguments separately using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;args=(arg1, arg2)&lt;/code&gt;. The next line after that retrieves the results of those processes using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.get()&lt;/code&gt; and adds them to a list that we’ve called ‘intersection_results’.&lt;/p&gt;

&lt;p&gt;The penultimate line reassembles each of the results back into a single geodataframe in two steps. First, it concatenates each of our chunks into a single &lt;strong&gt;pandas&lt;/strong&gt; dataframe (i.e. not a &lt;em&gt;geo&lt;/em&gt;dataframe), as geopandas doesn’t support concatenation. Second, it turns this &lt;strong&gt;dataframe&lt;/strong&gt; back into a &lt;strong&gt;geodataframe&lt;/strong&gt;, and sets the coordinate reference system (CRS) to the CRS of our original intersections layer. We have to specify the CRS because it was lost when we used pandas. Finally, we return the ‘intersections_dist’ geodataframe.&lt;/p&gt;

&lt;h2 id=&quot;running-the-parallelizing-function&quot;&gt;Running the Parallelizing Function&lt;/h2&gt;

&lt;p&gt;To string all of this together and execute it, we’ll do the following, which is again necessary to prevent Python on Windows from getting stuck:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__name__&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'__main__'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;intersections_dist&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parallelize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;intersections_dist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now just go to your terminal and execute the script: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python script_name.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If you open the task manager you should now see your CPU pinned at 100% as it crunches these distance calculations across all cores. In my case, it runs on all twelve CPU cores, which drastically decreases execution time.&lt;/p&gt;
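
&lt;p&gt;If you want to actually put a number on that speedup, you can time the serial and parallel runs with something like this (a quick sketch that reuses the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;neighbour_distance&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parallelize&lt;/code&gt; functions defined above):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import time

if __name__ == '__main__':
    start = time.perf_counter()
    serial_result = neighbour_distance(intersections, intersections)  # one core
    print('Serial:  ', round(time.perf_counter() - start, 1), 'seconds')

    start = time.perf_counter()
    parallel_result = parallelize()  # spread across every core
    print('Parallel:', round(time.perf_counter() - start, 1), 'seconds')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;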

&lt;hr /&gt;

&lt;p&gt;To adapt this to your own project, you just need to swap out the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;neighbour_distance&lt;/code&gt; function and respective geodataframes with your own.&lt;/p&gt;

&lt;p&gt;To learn more, definitely visit &lt;a href=&quot;https://sebastianraschka.com/Articles/2014_multiprocessing.html&quot;&gt;Sebastian Raschka’s blog on multiprocessing&lt;/a&gt;, which is where I learned most of this myself. And if you have any suggestions on how to improve this tutorial, go ahead and leave them in the comments.&lt;/p&gt;

</description>
        <pubDate>Mon, 14 Oct 2019 00:00:00 +0000</pubDate>
        <link>http://swanlund.dev/parallelizing-python</link>
        <guid isPermaLink="true">http://swanlund.dev/parallelizing-python</guid>
        
        <category>geopandas</category>
        
        <category>python</category>
        
        <category>performance</category>
        
        
      </item>
    
  </channel>
</rss>
