Spatial queries on AWS SimpleDB

I would like to know what people suggest as efficient ways of doing a spatial query in an Amazon Web Services SimpleDB?
By spatial query I mean finding objects in a given radius of a latitude and longitude.

SimpleDB doesn't currently offer any built-in spatial search operations, but that doesn't mean it can't be done. There are several methods of implementing geospatial searches in non-geospatially-aware databases such as SimpleDB, and all of them center around the same idea: use the database to retrieve a rough first selection based on a geospatial bounding box, then filter the returned data in your application using a more accurate algorithm such as the Haversine formula.
You could store the latitude and longitude as (zero-padded and normalized) numeric attributes and then perform a double range query (lat >= minLat and lat <= maxLat and lon >= minLon and lon <= maxLon), but since neither of these predicates is selective (each one matches a lot of items) it's not ideal (see Tuning Queries).
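Because SimpleDB compares attribute values lexicographically as strings, coordinates have to be offset so they are never negative and zero-padded to a fixed width before such range queries behave numerically. A minimal Python sketch of such an encoder (the offsets and six-digit precision are illustrative choices, not anything SimpleDB prescribes):

def encode_coord(value, offset, precision=6):
    # Shift the coordinate so it is always non-negative, then zero-pad it
    # so lexicographic string comparison matches numeric ordering.
    # offset=90 for latitude (range 0..180), offset=180 for longitude (0..360).
    shifted = value + offset
    width = precision + 4  # 3 integer digits + '.' + fractional digits
    return "%0*.*f" % (width, precision, shifted)

lat_attr = encode_coord(-25.382708, offset=90)    # '064.617292'
lon_attr = encode_coord(-49.265506, offset=180)   # '130.734494'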
A better way would be using GeoHashes.
Geohashes offer properties like arbitrary precision, similar prefixes
for nearby positions, and the possibility of gradually removing
characters from the end of the code to reduce its size (and gradually
lose precision).
As a practical example, the Geohash 6gkzwgjzn820 decodes to the
coordinates -25.382708 and -49.265506, while the Geohash 6gkzwgjz will
decode to -25.383 and -49.266, and if we take a similar position in
the same region, such as -25.427 and -49.315, we can see it being
encoded as 6gkzmg1w (note the similar prefix).
From http://geohash.org/site/tips.html
With your item positions stored as GeoHashes you could use the like operator to search for a bounding box (where GeoHash like '6gkzmg1w%'). However, since the like operator is expensive (see Comparison Operators), a better way is to denormalize the data by storing each GeoHash prefix level (how many depends on your required search precision) as a separate attribute (GeoHash6, GeoHash8, etc.) and then use a simple equality predicate (where GeoHash8 = '6gkzmg1w').
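As a sketch of that denormalization in Python (assuming the third-party pygeohash package; any encoder with an encode(lat, lon, precision) function works, and the GeoHash6/GeoHash8 attribute names just follow the scheme above):

import pygeohash  # assumed third-party package: pip install pygeohash

def geohash_attributes(lat, lon, levels=(4, 6, 8)):
    # One attribute per prefix length, so a cheap equality predicate
    # can replace the expensive 'like' operator.
    full = pygeohash.encode(lat, lon, precision=12)
    attrs = {"GeoHash": full}
    for n in levels:
        attrs["GeoHash%d" % n] = full[:n]
    return attrs

# {'GeoHash': '6gkzwgjzn820', 'GeoHash4': '6gkz',
#  'GeoHash6': '6gkzwg', 'GeoHash8': '6gkzwgjz'}
print(geohash_attributes(-25.382708, -49.265506))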
Now on to the downside of GeoHashes. Since you can't assume that a GeoHash is centered within your search box, you have to search all neighboring prefixes as well. The process is excellently described by geohash-js:
Geohash also has the property that as the number of digits decreases
(from the right), accuracy degrades. This property can be used to do
bounding box searches, as points near to one another will share
similar Geohash prefixes.
However, because a given point may appear at the edge of a given
Geohash bounding box, it is necessary to generate a list of Geohash
values in order to perform a true proximity search around a point.
Because the Geohash algorithm uses a base-32 numbering system, it is
possible to derive the Geohash values surrounding any other given
Geohash value using a simple lookup table.
So, for example, 1600 Pennsylvania Avenue, Washington DC resolves to:
38.897, -77.036
Using the geohash algorithm, this latitude and longitude is converted
to: dqcjqcp84c6e
A simple bounding box around this point could be described by
truncating this geohash to: dqcjqc
However, 'dqcjqcp84c6e' is not centered inside 'dqcjqc', and searching
within 'dqcjqc' may miss some desired targets.
So instead, we can use the mathematical properties of the Geohash to
quickly calculate the neighbors of 'dqcjqc'; we find that they are:
'dqcjqf','dqcjqb','dqcjr1','dqcjq9','dqcjqd','dqcjr4','dqcjr0','dqcjq8'
This gives us a bounding box around 'dqcjqcp84c6e' roughly 2km x 1.5km
and allows for a database search on just 9 keys: SELECT * FROM table
WHERE LEFT(geohash,6) IN ('dqcjqc',
'dqcjqf','dqcjqb','dqcjr1','dqcjq9','dqcjqd','dqcjr4','dqcjr0','dqcjq8');
Translated to a SimpleDB query that would be where GeoHash6 in ('dqcjqc', 'dqcjqf', 'dqcjqb', 'dqcjr1', 'dqcjq9', 'dqcjqd', 'dqcjr4', 'dqcjr0', 'dqcjq8'), and then you do your Haversine filtering on the results in order to keep only the items that are within your search radius.
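For the second, precise pass, a standard Haversine implementation in Python looks like this (a sketch; the candidate tuples are whatever you pulled out of SimpleDB):

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points in kilometers.
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def within_radius(center, candidates, radius_km):
    # Second-pass filter over the items returned by the GeoHash query;
    # `candidates` is an iterable of (item, lat, lon) tuples.
    clat, clon = center
    return [item for item, lat, lon in candidates
            if haversine_km(clat, clon, lat, lon) <= radius_km]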

I'm going to leave this here because it might help you!
14 years ago we tried to build a geo lookup table of locations within a radius. There were obviously no geospatial indexes or anything like that back then.
There was literally only standard SQL and Oracle... anyway, we ended up converting all lat/lng values into kilometers from a fixed reference plane, which is essentially what geospatial indexes do these days.
To explain what it does: it flattens the world onto a plane, and with a bit of SQL trickery you can even select by radius and get back the distance between the two points you're selecting. Since the values are plain integers, the queries are blazing fast.
Here is a simple example in PHP, together with a complex-looking but (once you understand it) fairly easy SQL query:
https://gist.github.com/tobsn/899413
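The gist is PHP; here is a rough Python sketch of the same idea, an equirectangular approximation that maps lat/lng onto a flat plane measured in kilometers (the constant and reference latitude are illustrative, and the error grows near the poles and over large extents):

from math import cos, radians, hypot

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude

def to_flat_km(lat, lon, ref_lat):
    # Equirectangular approximation: project lat/lon onto a plane in km,
    # scaling longitude by the cosine of a reference latitude.
    x = lon * KM_PER_DEGREE * cos(radians(ref_lat))
    y = lat * KM_PER_DEGREE
    return x, y

# After projection, plain Cartesian math applies:
ax, ay = to_flat_km(52.5200, 13.4050, ref_lat=52.5)
bx, by = to_flat_km(52.5065, 13.1445, ref_lat=52.5)
print(hypot(ax - bx, ay - by))  # roughly 17.7 km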

Related

Django spatial calculations efficiency in PostGIS (RDS) vs. in object manager (EC2)?

From this question, I'd like to decide whether I should use GeoDjango, or roll my own with Python to filter Points within a certain radius of another Point.
There are two excellent answers that take different approaches to the question of how to perform such a calculation here: Django sort by distance
One of them uses GeoDjango to perform the distance calculation in PostGIS. I'm guessing that the compute would be done on the RDS instance?
The other uses a custom manager to implement the Great Circle distance formula. The compute would obviously be done on the EC2 instance.
I would imagine that the PostGIS implementation is more efficient because it's likely that people much smarter than I have optimized it. To what extent have they optimized it? Is there anything special about their implementation?
Assuming I am correct in assuming GeoDjango performs the distance compute using PostGIS on the RDS instance, I would imagine that RDS is not suited for heavy compute tasks, and may end up being slower or more expensive in the end. Are my assumptions correct?
What if I don't need a precise distance, where an octagon or even a square would suffice? In the case of a square, it would simply be a matter of filtering Points with latitude and longitude within a certain range. Is GeoDjango/PostGIS able to perform estimates like this?
If I do need a precise distance, I could calculate the furthest bounds that can be reached with the given radius, and only perform precise distance calculations on Points within those bounds. Does GeoDjango/PostGIS do this?
I'll try to address your questions:
One of them uses GeoDjango to perform the distance calculation in
PostGIS. I'm guessing that the compute would be done on the RDS
instance?
If you are bringing two Django models into memory and doing the calculation using Django, such as:
model_a = Foo.objects.get(id=1)
model_b = Bar.objects.get(id=1)
distance = model_a.geometry.distance(model_b.geometry)
This will be done in Python, using GEOS.
https://docs.djangoproject.com/en/1.9/ref/contrib/gis/geos/#django.contrib.gis.geos.GEOSGeometry.distance
There are distance lookups on Django, such as
foos = Foo.objects.filter(geometry__distance_lte=(Point(0, 0, srid=4326), D(km=1)))
This calculation will be done by the backend (aka database).
The other uses a custom manager to implement the Great Circle distance
formula. The compute would obviously be done on the EC2 instance.
I would imagine that the PostGIS implementation is more efficient because it's likely that people much smarter than I have optimized it.
To what extent have they optimized it? Is there anything special about
their implementation?
Django has methods to use GCD in queries. This requires a cast on the PostGIS side from your geometry field to a geography type. Only EPSG:4326 is supported for now. If that's all you need, I bet the PostGIS implementation is good enough for almost all applications (if not all).
Assuming I am correct in assuming GeoDjango performs the distance compute using PostGIS on the RDS instance, I would imagine that RDS is
not suited for heavy compute tasks, and may end up being slower or
more expensive in the end. Are my assumptions correct?
I don't know much about Amazon products, but without an estimate of the size (number of rows, types of calculations such as cross products, etc.), it's hard to help further.
What if I don't need a precise distance, where an octagon or even a square would suffice? In the case of a square, it would simply be a
matter of filtering Points with latitude and longitude within a
certain range. Is GeoDjango/PostGIS able to perform estimates like
this?
What kind of data do you have? There are several components in calculating distances and areas, mainly the spatial reference that you use (datum, ellipsoid, projection).
If you need accurate (or more accurate) distance measurements between two distant sides of the globe, the geography type is more precise and will yield good results. Doing that kind of measurement on a Cartesian plane will yield bad results.
If your data is local, like a few sq km, consider using a more local spatial reference. WGS84 4326 is more suitable for global data. Local spatial references can give you precise results, but only over much smaller extents.
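As a hedged GeoDjango illustration of that point (EPSG:32633, UTM zone 33N, is an arbitrary local projection for central Europe; transform() needs GDAL installed):

from django.contrib.gis.geos import Point

# Two points in WGS84 (EPSG:4326); note Point takes (lon, lat)
a = Point(13.405, 52.520, srid=4326)
b = Point(13.144, 52.506, srid=4326)

print(a.distance(b))  # ~0.26 -- degrees, not meters, so not useful

# Transform in place into a local projected CRS, then the plain
# Cartesian distance comes out in that CRS's units (meters here)
a.transform(32633)
b.transform(32633)
print(a.distance(b))  # roughly 17700 meters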
If I do need a precise distance, I could calculate the furthest bounds that can be reached with the given radius, and only perform
precise distance calculations on Points within those bounds. Does
GeoDjango/PostGIS do this?
I think you are optimizing too early. I know your question is a bit old, but this is something you should only worry about when it starts to hurt. PostGIS and Django have been grinding a lot of data for a long time for me in a government system that checks land registry parcels and runs tons of queries to check several parameters. It's been working for a few years without a hitch.

Finding objects within x miles of a point

I'm working on getting all events within 10 miles of the user's location. My models look something like this:
class User(models.Model):
    location = models.PointField()
    ...

class Event(models.Model):
    location = models.PointField()
    ...
In my tests, when I check the distance between the user and an event, I get the value 11.5122663513:
from geopy.distance import vincenty
print vincenty(request.user.location, event.location).miles # 11.5122663513
Yet, when I query for all events within 10 miles of the user's location, that event is returned:
Event.objects.filter(location__distance_lte=(request.user.location, D(mi=10))).count() # 1
Only when I drop the radius to less than 4 miles does the filter take effect:
Event.objects.filter(location__distance_lte=(request.user.location, D(mi=3))).count() # 0
I'm following the docs' example almost exactly, so I don't think my query is the problem.
What could be causing this discrepancy?
This very much depends on what type of database you are using.
Because Cartesian math is much faster than geospatial math, the query likely treats coordinates as if they were on a plane rather than on a sphere.
The docs explain it this way:
Most people are familiar with using latitude and longitude to
reference a location on the earth’s surface. However, latitude and
longitude are angles, not distances. In other words, while the
shortest path between two points on a flat surface is a straight line,
the shortest path between two points on a curved surface (such as the
earth) is an arc of a great circle. Thus, additional computation
is required to obtain distances in planar units (e.g., kilometers and
miles). Using a geographic coordinate system may introduce
complications for the developer later on. For example, Spatialite does
not have the capability to perform distance calculations between
geometries using geographic coordinate systems, e.g. constructing a
query to find all points within 5 miles of a county boundary stored as
WGS84.
Portions of the earth’s surface may be projected onto a two-dimensional,
or Cartesian, plane. Projected coordinate systems are especially
convenient for region-specific applications, e.g., if you know that
your database will only cover geometries in North Kansas, then you may
consider using a projection system specific to that region. Moreover,
projected coordinate systems are defined in Cartesian units (such as
meters or feet), easing distance calculations.
Furthermore, this may be influenced by your database choice. If you are using Postgres/PostGIS, it has the following note in the docs:
In PostGIS, ST_Distance_Sphere does not limit the geometry types
geographic distance queries are performed with. However, these
queries may take a long time, as great-circle distances must be
calculated on the fly for every row in the query. This is because the
spatial index on traditional geometry fields cannot be used.
For much better performance on WGS84 distance queries, consider using
geography columns in your database instead because they are able to
use their spatial index in distance queries. You can tell GeoDjango to
use a geography column by setting geography=True in your field
definition.
You can check this yourself by printing out the raw SQL:
qs = Event.objects.filter(location__distance_lte=(request.user.location, D(mi=10)))
print qs.query
Depending on your database type and the amount of data you plan to store, you have a couple of options:
Filter the points a second time in Python
Try setting geography=True (see the sketch after this list)
Set an explicit SRID
Take a point, buffer it out into a circle with the given radius, and then find points within that circle using contains
Use a different database type
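A minimal sketch of the geography=True option (real GeoDjango syntax, but the feature is PostGIS-specific; the model is illustrative):

from django.contrib.gis.db import models

class Event(models.Model):
    # geography=True makes this a PostGIS geography column, so WGS84
    # distance queries return meters and can still use the spatial index
    location = models.PointField(geography=True)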
If you share the raw query it'll be easier to figure out what is happening.

How to query the database to return all zip codes within a given distance (e.g. 5 miles) from a given zip code using geopy

Hi friends, I am using geopy to calculate latitudes and longitudes. Now I want to get the list of areas within a given distance from a zipcode. How do I get that?
As far as I can see, geopy doesn't have any built-in capability to get a list of areas around some coordinates.
But you can use a workaround. Take your geocode and calculate its coordinates (latitude and longitude). Then imagine a grid on the map with a cell size equal to the smallest area you need to find around your location.
Use geopy to get the area name belonging to each cell corner of your grid. Is that OK for you? It only gives an approximation, because a grid is not a circle and you may miss some small areas, but I think the solution will work fine in most cases.
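A rough sketch of that grid workaround in Python, assuming geopy's Nominatim geocoder (which postdates this answer; mind its rate limits) and assuming the reverse-geocode response includes a postcode in its raw address details:

from geopy.geocoders import Nominatim

def areas_around(lat, lon, radius_km, cell_km=2.0):
    # Reverse-geocode the corners of a grid laid over the search box and
    # collect the distinct postcodes. Approximate: a grid is not a circle
    # and very small areas can fall between corners.
    geolocator = Nominatim(user_agent="area-grid-demo")  # hypothetical agent string
    deg_radius = radius_km / 111.0  # rough degrees per km
    step = cell_km / 111.0
    found = set()
    lat_cur = lat - deg_radius
    while lat_cur <= lat + deg_radius:
        lon_cur = lon - deg_radius
        while lon_cur <= lon + deg_radius:
            place = geolocator.reverse((lat_cur, lon_cur), exactly_one=True)
            if place is not None:
                # assumes the server returned address details in place.raw
                postcode = place.raw.get("address", {}).get("postcode")
                if postcode:
                    found.add(postcode)
            lon_cur += step
        lat_cur += step
    return found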
It is much easier to locate zipcodes inside a rectangle than in a circle so I would recommend that you approximate your problem by looking for zipcodes inside a given rectangle.
Here are answers to the question of how to get a list of zipcodes in a given polygon: Find zipcodes inside polygon shape using google maps api
Summary
You need the geometry for each zipcode. Once you have that, you need to be able to query it using a database that supports geo queries. One such database is Google's Fusion Tables, and there is already a geometry data table for zipcodes available here: https://www.google.com/fusiontables/DataSource?docid=1AxB511DayCdtmyBuYOIPHIe_WM9iG87Q3jKh6EQ#rows:id=1
Here's the sample query for Fusion Table data.
Another approach is server-side code using PHP and CSV data. Here's a live demo: http://daim.snm.ku.dk/demo/zip/. The page also has a download for the code.
If you use any of the above techniques, please make sure to upvote the answers of the original authors :).

Is there a way to identify regions that are not very similar from a set of images?

Given an image, I would like to extract more subimages from it, but the resulting subimages must not be overly similar to each other. If the center of each ROI should be chosen randomly, then we must make sure that each subimage has at most only a small percentage of area in common with other subimages.
Alternatively, we could decompose the image into small regions over a regular grid and then randomly choose a subimage within each region. This option, however, does not ensure that all subimages are sufficiently different from each other. Obviously I have to choose a good way to compare the resulting subimages, as well as a similarity threshold.
The above procedure must be performed on many images: none of the extracted subimages should be too similar. Is there a way to identify regions that are not very similar across a set of images (e.g., by inspecting all the histograms)?
One possible way is to split your image into n x n squares (save for edge cases) as you pointed out, reduce each of them to a single value, and group them according to the k nearest values (relative to the other pieces). After you group them, you can select, for example, one image from each group. Something potentially better is to use a more relevant metric inside each group; see Comparing image in url to image in filesystem in python for two such metrics. By using such a metric, you can select more than one piece from each group.
Here is an example using some duck I found around. It uses n = 128. To reduce each piece to a single number, it calculates the Euclidean distance to a pure black n x n piece.
f = Import["http://fohn.net/duck-pictures-facts/mallard-duck.jpg"];
pieces = Flatten[ImagePartition[ColorConvert[f, "Grayscale"], 128]]
black = Image[ConstantArray[0, {128, 128}]];
dist = Map[ImageDistance[#, black, DistanceFunction -> EuclideanDistance] &, pieces];
nf = Nearest[dist -> pieces];
Then we can see the grouping by considering k = 2:
GraphPlot[
  Flatten[Table[
    Thread[pieces[[i]] -> nf[dist[[i]], 2]], {i, Length[pieces]}]],
  VertexRenderingFunction -> (Inset[#2, #, Center, .4] &),
  SelfLoopStyle -> None]
Now you could use a metric (better than the distance to black) inside each of these groups to select the pieces you want from there.
Since you would like to apply this to a large number of images, and you already suggested it, let's discuss how to solve this problem by selecting different tiles.
The first step could be to define what "similar" is, so a similarity metric is needed. You already mentioned the tiles' histogram as one source of metric, but there may be many more, for example:
mean intensity,
90th percentile of intensity,
10th percentile of intensity,
mode of intensity, as in peak of the histogram,
variance of pixel intensity in the whole tile,
granularity, which you could quickly approximate by the difference between the raw and the Gaussian-filtered image, or by calculating the average variance in small sub-tiles.
If your image has two channels, the above list leaves you already with 12 metric components. Moreover, there are characteristics that you can obtain from the combination of channels, for example the correlation of pixel intensities between channels. With two channels that's only one characteristic, but with three channels it's already three.
To pick different tiles from this high-dimensional cloud, you could exploit the fact that some, if not many, of these metrics will be correlated, so a principal component analysis (PCA) would be a good first step: http://en.wikipedia.org/wiki/Principal_component_analysis
Then, depending on how many sample tiles you would like to choose, you could look at the projection. For seven tiles, for example, I would look at the first three principal components, choose from the two extremes of each, and then also pick the one tile closest to the center (3 * 2 + 1 = 7).
If you are concerned that choosing from the very extremes of each principal component may not be robust, the 10th and 90th percentiles may be. Alternatively, you could use a clustering algorithm to find well-separated examples, but this would depend on what your cloud looks like. Good luck.
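A minimal numpy/scikit-learn sketch of that selection rule (the metrics matrix is a placeholder for whatever per-tile statistics you computed; in practice you would standardize the columns first):

import numpy as np
from sklearn.decomposition import PCA

def pick_distinct_tiles(metrics, n_components=3):
    # metrics: (n_tiles, n_features) array of per-tile statistics such as
    # mean intensity, percentiles, variance, granularity, ...
    proj = PCA(n_components=n_components).fit_transform(metrics)
    chosen = []
    for c in range(n_components):
        chosen.append(int(np.argmin(proj[:, c])))  # one extreme
        chosen.append(int(np.argmax(proj[:, c])))  # the other extreme
    # plus the tile closest to the center of the projected cloud
    chosen.append(int(np.argmin(np.linalg.norm(proj, axis=1))))
    return sorted(set(chosen))  # a tile can be extreme on two components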

Select all points in a matrix within 30m of another point

So if you look at my other posts, it's no surprise I'm building a robot that can collect data in a forest, and stick it on a map. We have algorithms that can detect tree centers and trunk diameters and can stick them on a cartesian XY plane.
We're planning to use certain 'key' trees as natural landmarks for localizing the robot, using triangulation and trilateration among other methods, but programming this and keeping data straight and efficient is getting difficult using just Matlab.
Is there a technique for sub-setting an array or matrix of points? Say I have 1000 trees stored over 1km (1000m), is there a way to say, select only points within 30m radius of my current location and work only with those?
I would just use a GIS, but I'm doing this in Matlab and I'm unaware of any GIS plugins for Matlab.
I forgot to mention, this code is going online, meaning it's going on a robot for real-time execution. I don't know if, as the map grows to several miles, using a different data structure will help or if calculating every distance to a random point is what a spatial database is going to do anyway.
I'm thinking of mirroring the array of trees into two arrays, one sorted by X and the other by Y, then bubble-sorting to determine the 30m range in each. I'd do the same for both arrays, X and Y, and then have a third cross-link table to select the individual values. But I don't know what that's called or how to program it, and I'm sure someone already has, so I don't want to reinvent the wheel.
You are looking for a spatial data structure like a quadtree or a k-d tree. I found two k-d tree implementations here and here, but didn't find any quadtree implementations for Matlab.
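For comparison, if you can call out of Matlab (or prototype elsewhere), this exact query is built into SciPy's k-d tree; a sketch with random data:

import numpy as np
from scipy.spatial import cKDTree

trees = np.random.rand(1000, 2) * 1000.0  # 1000 trees over a 1 km square
tree_index = cKDTree(trees)

current = np.array([500.0, 500.0])
nearby_idx = tree_index.query_ball_point(current, r=30.0)  # indices within 30 m
print(trees[nearby_idx])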
The simple solution of calculating all the distances and scanning through seems to run almost instantaneously:
lim = 1;
num_trees = 1000;
trees = randn(num_trees,2); %# list of trees as Nx2 matrix
cur = randn(1,2); %# current point as 1x2 vector
dists = hypot(trees(:,1) - cur(1), trees(:,2) - cur(2)); %# distance from all trees to current point
nearby = trees((dists <= lim),:); %# find the nearby trees, pull them from the original matrix
On a 1.2 GHz machine, I can process 1 million trees (1 MTree?) in < 0.4 seconds.
Are you running the Matlab code directly on the robot? Are you using the Real-Time Workshop or something? If you need to translate this to C, you can replace hypot with the squared distance (dx*dx + dy*dy) and compare it against lim*lim to avoid the square root. If you really only need to deal with 1 KTree, I don't know that it's worth your while to implement a more complicated data structure.
You can transform your Cartesian coordinates into polar coordinates with CART2POL. Then selecting points inside a certain radius is straightforward:
[THETA,RHO] = cart2pol(X-X0,Y-Y0);
selected = RHO < 30;
where X0, Y0 are coordinates of the current location.
My guess is that trees are distributed roughly evenly through the forest. If that is the case, simply use 30x30 (or 15x15) grid blocks as hash keys into a closed hash table. Look up the keys for all blocks intersecting the search circle, and check all hash entries starting at that key until one is flagged as the last in its "bucket."
0---------10---------20---------30--------40---------50----- address # line
(0,0) (0,30) (0,60) (30,0) (30,30) (30,60) hash key values
(1,3) (10,15) (3,46) (24,9.) (23,65.) (15,55.) tree coordinates + "." flag
For example, to get the trees in (0,0)…(30,30), map (0,0) to the address 0 and read entries (1,3), (10,15), reject (3,46) because it's out of bounds, read (24,9), and stop because it's flagged as the last tree in that sector.
To get trees in (0,60)…(30,90), map (0,60) to address 20. Skip (24, 9), read (23, 65), and stop as it's last.
This will be quite memory efficient as it avoids storing pointers, which would otherwise be of considerable size relative to the actual data. Nevertheless, closed hashing requires leaving some empty space.
The illustration isn't "to scale": in reality there would be space for several entries between the hash key markers. So you shouldn't have to skip any entries unless a local preceding sector holds more trees than average.
This does use hash collisions to your advantage, so it's not as random as a hash function typically is. (Not every entry corresponds to a distinct hash value.) However, as dense sections of forest will often be adjacent, you should randomize the mapping of sectors to "buckets," so a given dense sector will hopefully overflow into a less dense one, or the next, or the next.
Additionally, there is the issue of empty sectors and terminating iteration. You could insert a dummy tree into each sector to mark it as empty, or some other simple hack.
Sorry for the long explanation. This kind of thing is simpler to implement than to document. But the performance and the footprint can be excellent.
Use some sort of spatially partitioned data structure. A simple solution would be to create a 2D array of lists containing all objects within a 30m x 30m region. The worst case is then that you only need to compare against the objects in four of those lists.
Plenty of more complex (and potentially beneficial) solutions could also be used - something like bi-trees is a bit more complex to implement (not by much, though) but could yield better performance, especially where the density of objects varies considerably. A minimal sketch of the simple grid version follows.
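A minimal Python sketch of that grid-of-lists idea, with the cell size set to twice the query radius so the search circle overlaps at most four cells:

from collections import defaultdict
from math import hypot, floor

CELL = 60.0  # meters; twice the 30 m query radius

def build_grid(points):
    # Bucket each point by the grid cell that contains it.
    grid = defaultdict(list)
    for x, y in points:
        grid[(floor(x / CELL), floor(y / CELL))].append((x, y))
    return grid

def query_radius(grid, cx, cy, radius=30.0):
    # Scan only the cells the search circle can overlap (at most 2x2 here),
    # then do the exact distance test within them.
    i0, i1 = floor((cx - radius) / CELL), floor((cx + radius) / CELL)
    j0, j1 = floor((cy - radius) / CELL), floor((cy + radius) / CELL)
    return [(x, y)
            for i in range(i0, i1 + 1)
            for j in range(j0, j1 + 1)
            for x, y in grid.get((i, j), ())
            if hypot(x - cx, y - cy) <= radius]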
You could look at the Voronoi diagram support in Matlab:
http://www.mathworks.com/access/helpdesk/help/techdoc/ref/voronoi.html
If you base the Voronoi polygons on your key trees and cluster the neighbouring trees into those polygons, that partitions your search space by proximity (finding the enclosing polygon for a given non-key point is fast), but ultimately you're going to end up computing key-to-non-key distances by Pythagoras or trig and comparing them.
For a few thousand points (trees), brute force might be fast enough if you have a reasonable processor on board. Compute the distance of every other tree from tree n, then select those within 30m. This is the same as having all trees in the same Voronoi polygon.
It's been a few years since I worked in GIS, but I found the following useful: 'Computational Geometry in C', Joseph O'Rourke, ISBN 0-521-44592-2, paperback.