Mapnik Tile Generation

Caution: extremely technical blog post ahead! Bonus points for readers who guess what’s in the picture above.

Pre-Prologue

In addition to building probably the best time tracking tool there is, we also have a few side/pet projects in the house. Quirks do come up from time to time, and this is one of them…

Prologue

A set of tiles needs to be generated from an 80MB+ ESRI Shapefile, then another set from a second 80MB+ file. A blur filter is applied to the first set, which is then merged into the second, producing a third, final set. There is a bounding box, and tiles are only rendered for four zoom levels (7–10) – about 7,200 images per shapefile.
The box where the test runs are made: an iMac, Core i5 2.7GHz, 20GB of RAM.
The ideas are somewhat inspired by TopOSM/Details and the Mapnik wiki, Ideas_composting.

Digging in

Step 1 – Out of the box, not really reading the manual

A standard Mapnik XML conf, a simple TileStache conf, run naively, 2 times (once for each shapefile):

python scripts/tilestache-seed.py -c conf.json -l layer -e png 7 8 9 10 -b 52.75 9 67 31.5
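
For reference, the ‘simple TileStache conf’ boils down to something like this – a minimal sketch in which the cache path and mapfile name are placeholders (the layer key matches the -l layer switch in the command above):

cat > conf.json <<'EOF'
{
  "cache":  {"name": "Disk", "path": "tiles/cache"},
  "layers": {
    "layer": {
      "provider": {"name": "mapnik", "mapfile": "style.xml"}
    }
  }
}
EOF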

Time for one file: ~40 minutes.
For the blur/merge, a simple command is run for every file in the directory and its subdirectories:

convert A -morphology Smooth Disk -blur 0x4 png:- | convert -alpha set -background transparent - B -layers flatten +repage OUTFILE
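
One way to wrap that around every tile of the first set – a sketch only, where tiles/set-a, tiles/set-b and tiles/merged are placeholder paths for the two tile caches and the output directory:

# blur every tile from the first set and composite it over the matching
# tile from the second set; directory names are placeholders
cd tiles/set-a
find . -name '*.png' | while read -r TILE; do
    mkdir -p "../merged/$(dirname "$TILE")"
    convert "$TILE" -morphology Smooth Disk -blur 0x4 png:- |
        convert -alpha set -background transparent - "../set-b/$TILE" \
                -layers flatten +repage "../merged/$TILE"
done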

Time: ~25 minutes.
Cleanup of ‘temporary’ files (removing both sets of tiles generated from the shapefiles – the actual result is the merged set from ImageMagick) – 2-3 minutes.

Step 2 – OK, 1:47 is ‘a bit’ slow

There is xargs. And there is actually an example in the header comments of tilestache-list.py. So the single tilestache-seed.py call evolves into two commands (the first one only needs to be run once):

python scripts/tilestache-list.py -b 52.75 9 67 31.5 7 8 9 10 | split -l 500 - template/list-
ls -1 template/list-* | xargs -n1 -P2 python scripts/tilestache-seed.py -c conf.json -l layer -q -p 1 -e png 7 8 9 10 --tile-list

What do we do here?

  1. Pre-generate the list of tiles that need to be rendered and split it into batches of 500.
  2. Run tilestache-seed in 2 parallel processes (yes, there are 4 cores on the i5, but more on that later).

Time for one file: ~20 minutes (actually, adding the -q, or quiet, switch cut a whopping 5 minutes).
The ImageMagick call remains the same.

1:07 in total, already quite a bit better, but…

Step 3 – Find out what takes time and why

Some simple monitoring of CPU and disk usage showed that the most time-consuming part was reading from and writing to disk (the machine has a regular spinning disk, not an SSD). Solutions? Swap the disk? Possible, but shipping and reconfiguring takes time (and I’m lazy). Another option – use a ramdisk.

Created a 2GB ramdisk, copied the shapefiles there, pointed the TileStache cache path at it and had the convert command write its output there as well. A copy back to the HDD was added at the end of the script.
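
Creating the ramdisk itself on OS X is a one-liner (the volume name ‘tilecache’ is arbitrary):

# 2GB ramdisk; ram:// takes the size in 512-byte sectors,
# 2 * 1024^3 / 512 = 4194304
diskutil erasevolume HFS+ 'tilecache' $(hdiutil attach -nomount ram://4194304)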

Result?

  • TileStache seed: ~17-18 minutes (OK, not a big boost, Mapnik still needs its time)
  • convert: ~15 minutes
  • copy: ~2 minutes
  • deletion: 10 seconds (just eject/unmount the ramdisk)
  • Total: about 53 minutes.

Step 3.1 – Parallel…

This set of tiles is part of a bigger batch (weather data at an hourly interval; the shapefile size is roughly the same for every hour). So, to keep the ‘conveyor’ moving, tiles are generated from one hour’s shapefiles, the merge process is spawned separately, and the next ‘hour’ is picked up right away. To avoid overloading the machine, TileStache seed is limited to 2 processes only (hence 2 cores), and the merge process currently loops through the files one by one…
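
In shell terms the conveyor is roughly the loop below; seed_hour.sh and merge_hour.sh are hypothetical wrappers around the seeding and blur/merge steps described above:

# sketch only: seed_hour.sh and merge_hour.sh are hypothetical wrappers,
# and data/hour-*.shp stands in for wherever the hourly shapefiles live
for SHP in data/hour-*.shp; do
    ./seed_hour.sh "$SHP"      # blocking: render both tile sets for this hour
    ./merge_hour.sh "$SHP" &   # blur/merge continues in the background
done
wait                           # let the last merge finish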

Future? Next up?

An hour is still a bit slow; a 20-30 minute generation time seems reasonable and hopefully achievable.

The idea pool currently:

  • Run multiple (2-4) convert processes in parallel to speed up the blur and merge (a rough sketch follows after this list).
  • See if using PostGIS yields better results (provided the shapefile-to-PostGIS conversion doesn’t eat up the possible speed boost).
  • Split the generation into multiple chunks across multiple machines – grid computing.
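
For the first idea, the per-tile convert pipeline would be wrapped into a small script and fanned out with the same xargs trick used for seeding – a rough, untested sketch, where merge_tile.sh is a hypothetical wrapper taking one tile path:

# untested: run the hypothetical per-tile blur/merge wrapper on 4 cores
find tiles/set-a -name '*.png' | xargs -n1 -P4 ./merge_tile.sh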

To be continued?