Spatial index performance tuning

Getting a really good spatial index on a geography column took a lot more effort and reading than I'd ever have dreamed of! It's really hard to find good resources on them so I thought I'd type up a few of the things I've learnt over the last few days trying to get some query times down. Our queries ended up running in about 1 tenth the time as they did with the first attempt at a spatial index. However, even that first index sped the queries up by about 10x. Which leads me to my first point: Any spatial index is always better than not having one.

Hang on Rob, what queries?! 

Good question! In the database we have a few thousand administrative boundaries that we want to ask questions about, for the sake of this post (and of tuning the index) our main question is: What boundaries is a point p in?
In SQL this looks like:

DECLARE @p geography
SET @p = geography::Point(-35, 139, 4326)

SELECT BoundaryName
FROM   dbo.Boundaries
WHERE  data.STIntersects(@p)

In the Boundaries table we have a "data" column which is the geography column defining the boundary.

The first index I created was a HHHH index with a cells per object of 16 because I thought more cells = better! To measure the performance of the queries I did 2 things:

  1. Used the sp_help_spatial_geography_index stored procedure (not available on SQL Azure!) 
  2. Wrote a test page to time some real queries
This first index resulted in a primary filter efficiency of about 16% and a query time of just over a second. Not great!

I proceeded to try every combination of Level 1, 2, 3, 4 grids by dropping and recreating the spatial index without ever going over a primary efficiency of  25%. 

I headed over to MSDN and had another read of the spatial index overview. Then something hit me:
If tessellating a cell would exceed the cells-per-object limit, the cell is counted and not tessellated. Otherwise, the cell is tessellated, and the lower-level cells that are touched by the object are counted. The tessellation process continues in this way, breadth-wise, across the level. This process is repeated recursively for the lower-level grids of the tessellated cells until the limit is reached or there are no more cells to count.
 (my emphasis).

Obviously with the grid being HHHH (256 cells per level) was meaning my objects probably weren't being tessellated at all because my cells per object limit was only 16!

For example, take the image below (using a grid of LLLL with level 4 omitted).

It's fairly obvious that tessellating below level 2 would result in more than 16 cells being stored (4 level 2, 16 level 3). So the index only stores that this object touches/covers 12 level 2 cells. That's not great at all! 

I then started experimenting with different cells per object limits. I eventually settled on 256 being a good compromise between speed and space. I also dropped the grid down to HMHH after more experimentation with timings and the stored proc results.

Some more lessons:
It's not all about grid size
Your data and the type of query makes a big difference
SQL Azure doesn't always have every thing you expect it to
Azure websites are awesome!

Test app running in Azure

Further/other reading:


Comments

Popular posts from this blog

Trimming strings in action parameters in ASP.Net Web API

Full text search in Entity Framework 6 using command interception

Composing Expressions in C#