I recently found myself having to iteratively perform a complicated series of buffers, intersects, and joins over a large geodataframe. This isn’t necessarily a problem for one-off operations where you can afford to wait a while, but if you plan on running this script often or want to distribute it publicly, it helps to squeeze every ounce of performance you can out of it.
One obvious way to do this is to parallelize it, meaning to run the program simultaneously across more than just one CPU core. Unfortunately, a lot of tools in GIS only utilize a single core, whereas most of us are now equipped with at least four. Hell, you can get what is effectively a 12 core processor now for ~$200. With that said, I should briefly note that parallelizing code isn’t a silver bullet: many smaller tasks may actually run slower after parallelization, and you should always optimize code in other ways before just stretching it across more CPU cores.
Unfortunately, for the cases where parallelizing makes sense, multiprocessing (the standard Python package for parallelizing code) can be a bit complicated to figure out (it certainly was for me). Hopefully this post helps illustrate how you might use it in a GIS context.
Getting Started
Here’s what you’re going to need to run this tutorial. Start off by importing these five packages. Geopandas, numpy, and pandas are all fairly standard for GIS, while multiprocessing and statistics are part of the standard library and come with any recent Python installation.
import geopandas as gpd
import numpy as np
import pandas as pd
import multiprocessing as mp
from statistics import mean
Next, we need to fetch our data. We’re going to analyze the average distance from each intersection to every other intersection in Vancouver, a task that is relatively straightforward but requires lots of iteration. There are probably specific tools for doing this, but the actual analysis here doesn’t really matter since the main goal is parallelizing whatever real-world analysis we may have.
Intersection data is freely available from the Vancouver Open Data Catalogue, so go ahead and download it as a shapefile and unzip it into wherever you’re running this Python code from.
Next, load the data into a geodataframe with geopandas as usual:
intersections = gpd.read_file('street_intersections.shp')
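If you want a quick sanity check before moving on, printing the first few rows and the layer’s CRS will confirm the shapefile loaded as expected (the exact column names will depend on the version of the dataset you downloaded):

print(len(intersections))    # total number of intersections
print(intersections.head())  # first few rows, including the geometry column
print(intersections.crs)     # the layer's coordinate reference system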
With the data loaded, there are essentially three broad steps to analyzing it in parallel (a rough skeleton of this structure follows the list):
- Create a function to process the data
- Create a function to parallelize our processing function. This will have to:
  - Split the geodataframe into chunks
  - Process each chunk
  - Reassemble the chunks back into a geodataframe
- Run the parallelizing function
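Before filling in the details, here’s a rough skeleton of the structure we’re aiming for; the function names match the ones defined below, but the bodies are placeholders:

def neighbour_distance(gdf_chunk, gdf_complete):  # 1. the work each CPU core performs
    ...

def parallelize():                                # 2. split, process, reassemble
    ...

if __name__ == '__main__':                        # 3. run it (the guard matters on Windows)
    intersections_dist = parallelize()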
Creating the function to be parallelized
First we need to define the function that we want to parallelize. Essentially, we want to be able to call a single function that will house all the tasks that we want each CPU core to run. In this case, we want our function to take in a geodataframe and calculate the distance from each point to every other point, before averaging these measurements and saving that average in a new column.
Since we are calculating the distance from each point to every other point, our function will require two parameters:
- a smaller geodataframe (chunk) containing the points that each CPU core is responsible for processing
- a geodataframe containing the entire set of points that each point in (1) will be measured against
Don’t worry too much about how this function works, as it’s just a placeholder for whatever complicated thing you want to do.
def neighbour_distance(gdf_chunk, gdf_complete):
    for index, row in gdf_chunk.iterrows():  # Iterate over the rows of the chunk
        # Calculate the distance from this row's point to every point in the complete geodataframe
        distances = gdf_complete.geometry.apply(lambda x: x.distance(row.geometry))
        # Store the mean of those distances in a new column called 'distance' in the chunked geodataframe
        gdf_chunk.at[index, 'distance'] = mean(distances)
    return gdf_chunk
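Before parallelizing anything, it’s worth confirming the function behaves sensibly on a small slice of the data. Something like the following (the slice size of 20 is arbitrary) should fill in a ‘distance’ column with plausible values:

test_chunk = intersections.head(20).copy()                  # a small chunk to test with
test_result = neighbour_distance(test_chunk, intersections)
print(test_result['distance'].describe())                   # quick summary of the new column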
Creating the parallelizing function
Now we need to write a separate function that will run our first function in parallel. This isn’t strictly necessary on Linux, but Windows will spit out errors in a never-ending loop unless we do it this way.
Our parallelizing function will need to split our geodataframe into smaller chunks, process those chunks, and reassemble them back into a single geodataframe. The whole function will look like this:
def parallelize():
    cpus = mp.cpu_count()  # Number of CPU cores available on this machine
    intersection_chunks = np.array_split(intersections, cpus)  # Split the geodataframe into one chunk per core
    pool = mp.Pool(processes=cpus)  # Create a pool with one process per core
    # Run neighbour_distance on each chunk in its own process
    chunk_processes = [pool.apply_async(neighbour_distance, args=(chunk, intersections)) for chunk in intersection_chunks]
    # Collect the processed chunks once they finish
    intersection_results = [chunk.get() for chunk in chunk_processes]
    # Reassemble the chunks into a single geodataframe, restoring the original CRS
    intersections_dist = gpd.GeoDataFrame(pd.concat(intersection_results), crs=intersections.crs)
    return intersections_dist
The first line within our function uses multiprocessing’s cpu_count() to tell us how many CPU cores our system has. This is how many chunks we will need to create, and how many CPU cores our program will be spread across. We could use a fixed number (e.g. 4), but on systems with more cores we’ll underutilize their hardware.
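As an aside, if you want to keep your machine responsive while the script runs, one common variation (not used in the function above) is to leave a core free:

cpus = max(1, mp.cpu_count() - 1)  # use all but one core, but never fewer than one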
Next, we need to actually split the geodataframe into chunks, and this is what array_split does. It takes an array (the first argument, i.e. ‘intersections’) and splits it into a set number of chunks/buckets (the second argument, i.e. ‘cpus’). In this case, we’ve split the geodataframe into as many chunks as we have CPU cores. (Note that there are numerous ways to actually do this, including ways that wouldn’t require passing two geodataframes into our neighbour_distance function. I just find this way to be the best and cleanest for most situations.)
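If you’re curious what array_split actually hands back, you can try it on the intersections layer directly; it returns a plain Python list of smaller dataframes, one per core (the exact row counts will depend on your data and your CPU):

chunks = np.array_split(intersections, mp.cpu_count())
print(len(chunks))                        # one chunk per CPU core
print([len(chunk) for chunk in chunks])   # roughly equal numbers of rows in each chunk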
pool = mp.Pool(processes=cpus) constructs a “Pool” that contains all of our processes, and in this case we’ve specified that we want as many processes as we have CPU cores.
The next two lines are important. What we’re doing here is telling multiprocessing to run our neighbour_distance function in a separate process for each of the chunks we created, using a list comprehension. Notice too that we specified our arguments separately using args=(arg1, arg2). The next line after that retrieves the results of those processes using .get() and adds them to a list that we’ve called ‘intersection_results’.
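As an aside, the same pair of lines could also be written with pool.map if you first bind the second argument using functools.partial; this is just an equivalent sketch, not a change to the code above:

from functools import partial

# Fix the 'complete' geodataframe so the worker only needs the chunk argument,
# then map the chunks across the pool; the processed chunks come back in order.
worker = partial(neighbour_distance, gdf_complete=intersections)
intersection_results = pool.map(worker, intersection_chunks)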
The penultimate line reassembles the results back into a single geodataframe in two steps. First, it concatenates each of our chunks into a single pandas dataframe (i.e. not a geodataframe), since geopandas doesn’t provide its own concatenation function. Second, it turns this dataframe back into a geodataframe and sets the coordinate reference system (CRS) to the CRS of our original intersections layer. We have to specify the CRS because that information was lost in the pandas step. Finally, we return the ‘intersections_dist’ geodataframe.
Running the Parallelizing Function
To string all of this together and execute it, we’ll do the following. The if __name__ == '__main__': guard is again necessary on Windows, where each new process re-imports the script and would otherwise start spawning processes of its own in a never-ending loop:
if __name__ == '__main__':
    intersections_dist = parallelize()
    print(intersections_dist)
Now just go to your terminal and execute the script: python script_name.py
If you open the task manager you should now see your CPU pinned at 100% as it crunches these distance calculations across all cores. In my case, it runs on all twelve CPU cores, which drastically decreases execution time.
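If you want to put a number on that, you can time a plain serial run against the parallel one using the standard time module. A rough sketch (keep it inside the if __name__ == '__main__': block, and expect the serial run to take a while) might look like this:

import time

start = time.time()
serial_result = neighbour_distance(intersections.copy(), intersections)
print(f'Serial:   {time.time() - start:.1f} seconds')

start = time.time()
parallel_result = parallelize()
print(f'Parallel: {time.time() - start:.1f} seconds')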
To adapt this to your own project, you just need to swap out the neighbour_distance function and respective geodataframes with your own.
To learn more, definitely visit Sebastian Raschka’s blog on multiprocessing, which is where I learned most of this myself. And if you have any suggestions on how to improve this tutorial, go ahead and leave them in the comments.