Calculation of accumulation area - c++

I'm looking for a GIS/Geometric algorithm:
I have 1000 points randomly distributed over a large area (such as a city). How can I find all the small areas that contain more than 15 points? Like this picture below:
Each point has its own latitude and longitude coordinates. Each small area is less than 200 m x 200 m.

You should take a look at RTREE structures.
See http://en.wikipedia.org/wiki/R-tree
Such algorithms are implemented, for example, in the SQLite3 engine.
See http://www.sqlite.org/rtree.html
Our Open Source version already includes the RTREE extension for Delphi 6 up to XE, compiled by default since rev. 1.8.

Not sure what your performance requirements are. But a naive implementation would be, for each point, to sum up the inverse of the distance to all other points:
for i := 0 to 999 do
for j := 0 to 999 do
if i<>j then
Point[i].Score := Point[i].Score + ( 1 / Distance(Point[i], Point[j]) );
The points near the center of each accumulation area will have the highest score.
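Since the question is tagged C++, the same naive scoring could be sketched as below. The `Point` struct and the `densityScores` name are illustrative only, the coordinates are assumed to be already projected to metres (raw latitude/longitude would need converting first), and coincident points are skipped to avoid dividing by zero.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical point type: x and y are assumed to be metres in a local projection.
struct Point {
    double x;
    double y;
};

// O(n^2) scoring: each point's score is the sum of 1/distance to every other point.
std::vector<double> densityScores(const std::vector<Point>& pts) {
    std::vector<double> score(pts.size(), 0.0);
    for (std::size_t i = 0; i < pts.size(); ++i) {
        for (std::size_t j = 0; j < pts.size(); ++j) {
            if (i == j) continue;
            const double d = std::hypot(pts[i].x - pts[j].x, pts[i].y - pts[j].y);
            if (d > 0.0) score[i] += 1.0 / d;  // skip coincident points
        }
    }
    return score;
}
```

With 1000 points this is about a million distance evaluations, which is usually fine for a one-off analysis.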

Related

Histogram Binning of Gradient Vectors

I am working on a project that has a small component requiring the comparison of distributions over image gradients. Assume I have computed the image gradients in the x and y directions using a Sobel filter and have for each pixel a 2-vector. Obviously getting the magnitude and direction is reasonably trivial and is as follows:
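The formula that followed here did not survive. The standard definitions for a gradient vector (g_x, g_y), which is presumably what was shown, are:

```latex
\[
  m = \sqrt{g_x^{2} + g_y^{2}}, \qquad
  \theta = \operatorname{atan2}(g_y,\, g_x)
\]
```

Here atan2 returns an angle in (-pi, pi]; adding 2pi to negative values maps it onto [0, 2pi).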
However, what is not clear to me is how to bin these two components in to a two dimensional histogram for an arbitrary number of bins.
I had considered something along these lines(written in browser):
//Assuming normalised magnitudes.
//Histogram dimensions are bins * bins.
int getHistIdx(float mag, float dir, int bins) {
const int magInt = reinterpret_cast<int>(mag);
const int dirInt = reinterpret_cast<int>(dir);
const int magMod = reinterpret_cast<int>(static_cast<float>(1.0));
const int dirMod = reinterpret_cast<int>(static_cast<float>(TWO_PI));
const int idxMag = (magInt % magMod) & bins;
const int idxDir = (dirInt % dirMod) & bins;
return idxMag * bins + idxDir;
}
However, I suspect that the mod operation will introduce a lot of incorrect overlap, i.e. completely different gradients getting placed in to the same bin.
Any insight in to this problem would be very much appreciated.
I would like to avoid using any off the shelf libraries as I want to keep this project as dependency light as possible. Also I intend to implement this in CUDA.
This is more of a "what is a histogram?" question than one about any of your tags. Two things:
In a 2D plane, two directions that are equal modulo 2pi are in fact the same, so it makes sense to take the modulus.
I see no practical or logical reason for taking the modulus of the norms.
Next, you say you want a "two dimensional histogram", but return a single number. A 2D histogram, and what would make sense in your context, is a 3D plot - the plane is theta/R, 2 indexed, while the 3D axis is the "count".
So first suggestion, return
return std::pair<int,int>(idxMag, idxDir);
Then you can make a 2D histogram, or 2 2D histograms.
Regarding the "number of bins"
this is use case dependent. You need to define the number of bins you want (maybe different for theta and R). Maybe just some constant 10 bins? Maybe it should depend on the amount of vectors? In any case, you need a function that receives either the number of vectors, or the total set of vectors, and returns the number of bins for each axis. This could be a constant (10 bins) initially, and you can play with it. Once you decide on the number of bins:
Determine the bins
For a bounded case such as 0<theta<2 pi, this is easy. Divide the interval equally into the number of bins, assuming a flat distribution. Your modulation actually handles this well - if you would have actually modulated by 2*pi, which you didn't. You would still need to determine the bin bounds though.
For R this gets trickier, as this is unbounded. Two options here, but both rely on the same tactic - choose a maximal bin. Either arbitrarily (Say R=10), so any vector longer than that is placed in the "longer than max" bin. The rest is divided equally (for example, though you could choose other distributions). Another option is for the longest vector to determine the edge of the maximal bin.
Getting the index
Once you have the bins, you need to search the magnitude/direction of the current vector in your bins. If bins are pairs representing min/max of bin (and maybe an index), say in a linked list, then it would be something like (for mag for example):
bin = histogram.first;
while ( mag > bin.max ) bin = bin.next;
magIdx = bin.index;
If the bin does not hold the index you can just use a counter and increase it in the while. Also, for the magnitude the final bin should hold "infinity" or some large number as a limit. Note this has nothing to do with modulation, though that would work for your direction - as you have coded. I don't see how this makes sense for the norm.
Bottom line though, you have to think a bit about what you want. In any case all the "objects" here are trivial enough to write yourself, or even use small arrays.
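Putting the points above together, one possible sketch with equal-width bins is shown below. The parameters `magBins`, `dirBins` and `maxMag` are assumptions introduced for illustration: the direction is wrapped into [0, 2*pi), while the magnitude is clamped so that anything at or above `maxMag` lands in the last bin instead of being wrapped.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>

// Returns (magnitude bin, direction bin) for equal-width bins.
// Assumptions: dir is in radians (any value); mag >= maxMag goes to the last bin.
std::pair<int, int> getHistIdx(float mag, float dir,
                               int magBins, int dirBins, float maxMag) {
    const float twoPi = 6.28318530718f;

    // Wrap the direction into [0, 2*pi); fmod keeps the sign of its argument.
    float wrapped = std::fmod(dir, twoPi);
    if (wrapped < 0.0f) wrapped += twoPi;
    // min() guards against rounding pushing the index up to dirBins.
    const int idxDir =
        std::min(static_cast<int>(wrapped / twoPi * dirBins), dirBins - 1);

    // Clamp, rather than wrap, the magnitude.
    const float clamped = std::min(std::max(mag, 0.0f), maxMag);
    const int idxMag =
        std::min(static_cast<int>(clamped / maxMag * magBins), magBins - 1);

    return {idxMag, idxDir};
}
```

If a flat index is needed, for example for a single array, `idxMag * dirBins + idxDir` gives one.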
I think you should arrange your bins in a square array, and then bin by vx and vy independently.
If your gradients are reasonably even you just need to scan the data first to accumulate the min and max in x and y, and then split the gradients evenly.
If the gradients are very unevenly distributed, you might want to sort the (eg) vx first and arrange that the boundaries between each bin exactly evenly divides the values.
An intermediate solution might be to obtain the min and max ignoring the (eg) 10% most extreme values.

Error in calculating exact nearest neighbors in radius with FLANN

I am trying to find the exact number of neighbour nodes in a big 3D point dataset. The goal is, for each point of the dataset, to retrieve all the neighbours within a given radius. FLANN is supposed to return exact neighbours for low-dimensional data, but a comparison with brute-force search suggests this is not the case. The neighbours are essential for further calculations, so I need the exact number. I tested increasing the radius a little, but that does not seem to be the problem. Is anyone aware of how to calculate the exact neighbours with FLANN or another C++ library?
The code:
// All nodes to be tested for inclusion in support domain.
flann::Matrix<double> query_nodes = flann::Matrix<double>(&nodes_pos[0].x, nodes_pos.size(), 3);
// Set default search parameters
flann::SearchParams search_parameters = flann::SearchParams();
search_parameters.checks = -1;
search_parameters.sorted = false;
search_parameters.use_heap = flann::FLANN_True;
flann::KDTreeSingleIndexParams index_parameters = flann::KDTreeSingleIndexParams();
flann::KDTreeSingleIndex<flann::L2_3D<double> > index(query_nodes, index_parameters);
index.buildIndex();
//FLANN uses L2 for radius search.
double l2_radius = (this->support_layer_*grid.spacing)*(this->support_layer_*grid.spacing);
double extension = l2_radius/10.;
l2_radius+= extension;
index.radiusSearch(query_nodes, indices, dists, l2_radius, search_parameters);
Try nanoflann. It is designed for low dimensional spaces and gives exact nearest neighbors. Furthermore, it is just one header file that you can either "install" or just copy to your project.
You should check page 6+ from the flann-manual, to fine-tune your search parameters, such as target_precision, which should be set to 1, for "maximum" accuracy.
That parameter is often found as epsilon (ε) in Approximate Nearest Neighbor Search (ANNS), which is used in high dimensional spaces, in order to (try) to beat the curse of dimensionality. FLANN is usually used in 128 dimensions, not 3, as far as I can tell, which may explain the bad performance you are experiencing.
A c++ library that works well in 3 dimensions is CGAL. However, it's much larger than FLANN, because it is a library for computational geometry, thus it provides functionality for many problems, not just NNS.
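Whichever library is used, a brute-force reference makes it possible to check the results on a subset of the data. The sketch below is independent of FLANN; `Vec3` and `bruteForceRadius` are hypothetical names, the radius is passed already squared (matching the squared L2 radius in the question), and counting points exactly on the boundary as neighbours is an assumption that may differ from a given library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 3D point type for this sketch.
struct Vec3 {
    double x;
    double y;
    double z;
};

// For every point, collect the indices of all points whose squared distance
// is <= radiusSq. Each point is reported as its own neighbour (distance 0),
// as happens when an index is queried with the data it was built from.
std::vector<std::vector<std::size_t>> bruteForceRadius(
        const std::vector<Vec3>& pts, double radiusSq) {
    std::vector<std::vector<std::size_t>> neighbours(pts.size());
    for (std::size_t i = 0; i < pts.size(); ++i) {
        for (std::size_t j = 0; j < pts.size(); ++j) {
            const double dx = pts[i].x - pts[j].x;
            const double dy = pts[i].y - pts[j].y;
            const double dz = pts[i].z - pts[j].z;
            if (dx * dx + dy * dy + dz * dz <= radiusSq) {
                neighbours[i].push_back(j);
            }
        }
    }
    return neighbours;
}
```

This is O(n^2), so it is only practical as a ground truth on a sample, not as a replacement for the index.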

Backpropagation 2-Dimensional Neuron Network C++

I am learning about Two Dimensional Neuron Network so I am facing many obstacles but I believe it is worth it and I am really enjoying this learning process.
Here's my plan: To make a 2-D NN work on recognizing images of digits. Images are 5 by 3 grids and I prepared 10 images from zero to nine. For Example this would be number 7:
Number 7 has indexes 0,1,2,5,8,11,14 as 1s (or 3,4,6,7,9,10,12,13 as 0s doesn't matter) and so on. Therefore, my input layer will be a 5 by 3 neuron layer and I will be feeding it zeros OR ones only (not in between and the indexes depends on which image I am feeding the layer).
My output layer however will be one dimensional layer of 10 neurons. Depends on which digit was recognized, a certain neuron will fire a value of one and the rest should be zeros (shouldn't fire).
I am done with implementing everything, I have a problem in computing though and I would really appreciate any help. I am getting an extremely high error rate and an extremely low (negative) output values on all output neurons and values (error and output) do not change even on the 10,000th pass.
I would love to go further and post my Backpropagation methods since I believe the problem is in it. However to break down my work I would love to hear some comments first, I want to know if my design is approachable.
Does my plan make sense?
All the posts talk about ranges (0->1, -1->+1, 0.01->0.5, etc.). Will it work with outputs of exactly 0 or 1 on the output layer, rather than a range? If yes, how can I control that?
I am using TanHyperbolic as my transfer function. Does it make a difference whether I use this, sigmoid, or some other function?
Any ideas/comments/guidance are appreciated and thanks in advance
Well, from the description given above, I think that the design and approach taken are correct! With respect to the choice of activation function, remember that these functions help to pick out the neurons with the largest activation, and that their algebraic properties, such as an easy derivative, help with the definition of backpropagation. Taking this into account, you should not worry about your choice of activation function.
The ranges that you mention above correspond to scaling of the input; it is better to have your input images in the range 0 to 1. This helps to scale the error surface and helps with the speed and convergence of the optimization process. Because your input set is composed of images, and each image is composed of pixels, the minimum and maximum values that a pixel can attain are 0 and 255, respectively. To scale your input in this example, divide each value by 255.
Now, with respect to the training problems: have you tried checking whether your gradient calculation routine is correct, i.e., by evaluating the cost function J? If not, try generating a toy vector theta that contains all the weight matrices involved in your neural network, and evaluate the gradient at each point using the definition of the gradient. Sorry for the Matlab example, but it should be easy to port to C++:
perturb = zeros(size(theta));
e = 1e-4;
for p = 1:numel(theta)
% Set perturbation vector
perturb(p) = e;
loss1 = J(theta - perturb);
loss2 = J(theta + perturb);
% Compute Numerical Gradient
numgrad(p) = (loss2 - loss1) / (2*e);
perturb(p) = 0;
end
After evaluating the function, compare the numerical gradient with the gradient calculated by backpropagation. If the difference between the two is less than 3e-9, your implementation is most likely correct.
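A possible C++ port of the Matlab loop is sketched below. The function name `numericalGradient` and the use of `std::function` for the cost J are choices made for this illustration, not part of any particular library; theta is assumed to be all weights flattened into one vector.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Central-difference estimate of dJ/dtheta[p] for every parameter p.
// theta is taken by value so each entry can be perturbed and then restored.
std::vector<double> numericalGradient(
        const std::function<double(const std::vector<double>&)>& J,
        std::vector<double> theta, double e = 1e-4) {
    std::vector<double> grad(theta.size());
    for (std::size_t p = 0; p < theta.size(); ++p) {
        const double saved = theta[p];
        theta[p] = saved - e;
        const double loss1 = J(theta);
        theta[p] = saved + e;
        const double loss2 = J(theta);
        grad[p] = (loss2 - loss1) / (2.0 * e);
        theta[p] = saved;  // restore before moving to the next parameter
    }
    return grad;
}
```

Each entry costs two evaluations of J, so this is meant for a small toy network during debugging and should be switched off for real training.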
I recommend checking out the UFLDL tutorials offered by the Stanford Artificial Intelligence Laboratory. There you can find a lot of information related to neural networks and their paradigms; they are worth a look!
http://ufldl.stanford.edu/wiki/index.php/Main_Page
http://ufldl.stanford.edu/tutorial/

Measure variation of data points from a line; To Catch a Dip

How can I measure this area in C++?
(update: I posted the solution and code as an answer rather than edit the question again)
The ideal line (dashed red) is the plot from starting point with the average rise added with each angle of measurement; this I obtain via average. I measured the test data in black. How can I quantify the area of the dip in blue? X-axis is unitized, so slopes and math are simplified.
I could determine a cutoff for the size of areas like this and then flag this part for retesting or failure. Rarely, there is another dip that appears closer to the right, but setting a cutoff value for standard deviation usually fails those parts.
Update
Diego's answer helped me visualize this. Now that I can see what I'm trying to do, I'll work on the algorithm to implement the "homemade dip detector". :)
Why?
I created a test bench to test throttle position sensors I'm selling. I'm trying to programmatically quantify how straight the plot is by analyzing the data collected. This one particular model is vexing me.
Sample plot of a part I prefer not to sell:
The X axis is evenly spaced angles of throttle opening. The stepper motor turns the input shaft, stopping every 0.75° to measure the output on a 10 bit ADC, which gets translated to the Y axis. The plot is the translation of data[idx] to idx,value mapped to (x,y) bitmap coordinates. Then I draw lines between the points within the bitmap using Bresenham's algorithm.
My other TPS products produce amazingly linear output.
The lower (left) portion of the plot is crucial to normal usage of any motor vehicle; it's when you're driving around town, entering parking lots, etc. This particular part has a tendency to develop a dip around 15° opening and I wish to use the program to quantify this "dip" in the curve and rely less upon the tester's intuition. In the above example, the plot dips but doesn't return to what an ideal line might be.
Even though this is an embedded application, printing the report takes 10 seconds, thus I do not consider stepping through an array of 120 points of data multiple times a waste of cycles. Also, since I'm using a uC32 PIC32 microcontroller, there's plenty of memory, so I have the luxury of being able to ponder this problem within the controller.
What I'm trying already
Array of rise between test points: I dismiss the X-axis entirely, considering it unitized, and then make an array of change from one reading to the next. This array is what contributes to the report's "Min rise between points: 0 Max: 14". I call this array deltas.
I've tried using standard deviation on deltas, however, during testing I have found that a low Std Dev is not a reliable measure for this part. If the dip quickly returns to the original line implied by early data points, the Std Dev can be deceptively low (observed to be as low as 2.3) but the part is still something I wouldn't want to use. I tried setting a cutoff at 2.6, but it failed too many parts with great plots. The other, more linear part linked to above can reliably count on Std Dev for quality.
Kurtosis seems not to apply for this situation at all. I learned of Kurtosis today and found a Statistics Library which includes Kurtosis and Skewness. During continued testing, I found that of these two measures, there was not a trend of positive, negative, or amplitude which would correspond to either passing or failing. That same gentleman has shared a linear regression library, but I believe Lin Reg is unrelated to my situation, as I am comfortable with the assumption of the AVG of deltas being my ideal line. Linear Regression and R^2 are more for finding a line from less ideal data or much larger sets.
Comparing each delta to AVG and Std Dev I set up a monitor to check each delta against final average of the deltas's data. Here, too, I couldn't find a reliable metric. Too many good parts would not pass a test restricting any delta to within 2x Std Dev away from the Average. Ultimately, the only variation from AVG I could settle on is to be within AVG+Std Dev difference from the AVG itself. Anything more restrictive would fail otherwise good parts. And the elusive dip around 15° opening can sneak through this test.
Homemade dip detector When feeding deltas to the serial monitor of the computer, I observed consecutive negative deltas during the dip, so I programmed in a dip detector, but it feels very crude to me. If there are 5 or more negative deltas in a row, I sum them. I have seen that if I take the sum of the dip's differences from AVG and then divide by the number of negative deltas, a value over 2.9 or 3 could mean a fail. I have observed dips lasting from 6 to 15 deltas. Readily observable dips would have their differences from AVG sum up to -35.
Trending accumulated variation from the AVG The above made me think watching the summation of deltas as it wanders away from AVG could be the answer. Meaning, I step through the array and sum the differences of each delta from AVG. I thought I was on to something until a good part blew this theory. I was seeing a trend of the fewer times the running sum varied from AVG by less than 2x AVG, the more straight the line appeared. Many ideal parts would only show 8 or less delta points where the sumOfDiffs would stray from the AVG very far.
float sumOfDiffs=0.0;
for( int idx=0; idx<stop; idx++ ){
float spread = deltas[idx] - line->AdcAvgRise;
sumOfDiffs = sumOfDiffs + spread;
...
testVal = 2*line->AdcAvgRise;
if( sumOfDiffs > testVal || sumOfDiffs < -testVal ){
flag = 'S';
}
...
}
And then a part with a fantastic linear plot came through with 58 data points where sumOfDiffs was more than twice the AVG! I find this amazing, as at the end of the ~120 data points, sumOfDiffs value is -0.000057.
During testing, the final sumOfDiffs result would often register as 0.000000 and only on exceptionally bad parts would it be greater than .000100. I found this quite surprising, actually: how a "bad part" can have accumulated great accuracy.
Sample output from monitoring sumOfDiffs This below output shows a dip happening. The test watches as the running sumOfDiffs is more than 2x the AVG away from the AVG for the whole test. This dip lasts from deltas idx of 23 through 49; starts at 17.25° and lasts for 19.5°.
Avg rise: 6.75 Std dev: 2.577
idx: delta diff from avg sumOfDiffs Flag
23: 5 -1.75 -14.05 S
24: 6 -0.75 -14.80 S
25: 7 0.25 -14.55 S
26: 5 -1.75 -16.30 S
27: 3 -3.75 -20.06 S
28: 3 -3.75 -23.81 S
29: 7 0.25 -23.56 S
30: 4 -2.75 -26.31 S
31: 2 -4.75 -31.06 S
32: 8 1.25 -29.82 S
33: 6 -0.75 -30.57 S
34: 9 2.25 -28.32 S
35: 8 1.25 -27.07 S
36: 5 -1.75 -28.82 S
37: 15 8.25 -20.58 S
38: 7 0.25 -20.33 S
39: 5 -1.75 -22.08 S
40: 9 2.25 -19.83 S
41: 10 3.25 -16.58 S
42: 9 2.25 -14.34 S
43: 3 -3.75 -18.09 S
44: 6 -0.75 -18.84 S
45: 11 4.25 -14.59 S
47: 3 -3.75 -16.10 S
48: 8 1.25 -14.85 S
49: 8 1.25 -13.60 S
Final Sum of diffs: 0.000030
RunningStats analysis:
NumDataValues= 125
Mean= 6.752
StandardDeviation= 2.577
Skewness= 0.251
Kurtosis= -0.277
Sobering note about quality: what started me on this journey was learning how major automotive OEM suppliers consider a 4 point test to be the standard measure for these parts. My first test bench used an Arduino with 8k of RAM, didn't have a TFT display nor a printer, and a mechanical resolution of only 3°! Back then I simply tested deltas being within arbitrary total bounds and choosing a limit of how big any single delta could be. My 120+ point test feels high class compared to that 30 point test from before, but that test had no idea about these dips.
Premises
The mean of a set of data has the mathematical property that the sum of the deviations from the mean is 0.
This explains why both bad and good datasets always give almost 0.
When the result differs from zero, it is essentially an accumulation of rounding errors in the diffs, which is why it unfortunately cannot hold useful information.
The thing that most clearly defines what you're looking for is your image: you're looking for an AREA, and this is why you're not finding the solution in these ways:
looking at a metric on single points is too local to extract that information
looking at global accumulations or parameters (global standard deviation) is too global, and you lose the data among too much information and too many sources of variation
kurtosis (you've already said so, I know, but this is for completeness) is out of its field of application, since this is not a probability distribution
In the end, the most suitable of the approaches you have already tried is the "Homemade dip detector", because it works in a way that is local, but not too local.
Last but not least:
Any algorithm you choose rests on tacit assumptions.
At one extreme, one might look for a super clever algorithm that, with no parametrization or tuning, automatically adapts to the problem and defines its own thresholds and so on.
At the other extreme is an algorithm that relies on the writer's knowledge of typical data behavior (good and bad) and that is itself stupid, in the sense that if another, unexpected behavior appears, the results are unpredictable.
The right way is one of these two, or somewhere in between, depending on the application. So if it works, the "Homemade dip detector" can also be a solution. There is no reason to call it crude; it may turn out to be insufficient for the application's needs, but that's another matter.
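The first premise can be verified in one line. Writing d_i for the n deltas and \bar{d} for their mean (notation introduced here for illustration):

```latex
\[
  \sum_{i=1}^{n} \bigl(d_i - \bar{d}\bigr)
  = \sum_{i=1}^{n} d_i - n\,\bar{d}
  = n\,\bar{d} - n\,\bar{d}
  = 0,
  \qquad \text{where } \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i .
\]
```

So a final sumOfDiffs such as 0.000030 reflects only floating-point rounding, whatever the shape of the curve.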
How to find the area
Once you have the data the first thing is to clearly define the "theoretical straight line". I give some options:
use RANSAC algorithm (formally the best option IMHO)
this gives you the best fit to the aligned points, disregarding the non-aligned ones
it is quite difficult and maybe oversized for this work (IMHO)
consider the line defined by the first and last point
you said that the dip is almost always in the same position, which is not near the boundaries, so the first and last points can be considered reliable
very easy to implement
this is an example of using knowledge about expected behavior, as I said before, so you need to decide how much confidence to give this assumption
consider a linear fit to the first 10 points and last 10 points
this is just a more reliable version of the previous option: with more points you need to worry less that the first or last point alone was affected by a measurement problem, which would make everything fail
also quite easy to implement
if I were you I would use this, or something inspired by it
calculate the Y value given by the straight line for each X
calculate the area between the two curves (or the areas under the function Y_dev = Y_data - Y_straight that is mathematically the same) with this procedure:
PositiveMax = 0; NegativeMax = 0;
start from first point (value can be positive or negative) and put in a temporary area accumulator tmp_Area
for each next point
if the sign is the same then accumulate the value
if it is different
stop accumulating
check whether the accumulated value is greater than PositiveMax or below NegativeMax, and if it is, store it as the new PositiveMax or NegativeMax
in any case, reset the accumulator to the current value with tmp_Area = Y_dev; thus starting a new accumulation
in the end you will have the values of the maximum contiguous area above the line and the maximum contiguous area below it, which I think are the scores you're looking for.
if you want, you can track only NegativeMax, based on observed and expected data behavior
you may find it useful to set a threshold so that if a value Y_dev is smaller than the threshold, you do not accumulate it.
this avoids large accumulations from many points close to the straight line, which could look similar to the accumulations from a few points far from the line
whether this is needed, and the proper threshold, must be evaluated on some sample data
you need to find an appropriate threshold for this contiguous area, and you can get it only from observation of sample data.
again: you can observe and decide the threshold yourself, or you can build a repository of good and bad samples and write a program that automatically learns which threshold to use. But this is not the algorithm; this is how to find its operating parameters, and there is nothing wrong with doing that by human brain. It only depends on whether we're looking for a method to separate good from bad, or for a self-adapting algorithm that does this. You decide the target.
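The accumulation procedure above could look roughly like this in C++. This is only a sketch under stated assumptions: it uses the simplest reference line (through the first and last readings), treats the X axis as unit-spaced so that a sum of deviations stands in for the area, and applies no noise threshold. `AreaScores` and `contiguousAreas` are names made up for the example.

```cpp
#include <cstddef>
#include <vector>

struct AreaScores {
    double positiveMax;  // largest contiguous area above the line (>= 0)
    double negativeMax;  // largest contiguous area below the line (<= 0)
};

// y holds evenly spaced readings (x = index).
// Reference line: straight line through the first and last reading.
AreaScores contiguousAreas(const std::vector<double>& y) {
    AreaScores s{0.0, 0.0};
    if (y.size() < 2) return s;
    const double slope =
        (y.back() - y.front()) / static_cast<double>(y.size() - 1);
    double acc = 0.0;  // area of the current same-sign run
    for (std::size_t i = 0; i < y.size(); ++i) {
        const double dev = y[i] - (y.front() + slope * static_cast<double>(i));
        // A sign change closes the current run and starts a new one.
        if ((acc > 0.0 && dev < 0.0) || (acc < 0.0 && dev > 0.0)) {
            if (acc > s.positiveMax) s.positiveMax = acc;
            if (acc < s.negativeMax) s.negativeMax = acc;
            acc = 0.0;
        }
        acc += dev;
    }
    // Close the final run.
    if (acc > s.positiveMax) s.positiveMax = acc;
    if (acc < s.negativeMax) s.negativeMax = acc;
    return s;
}
```

A part would then be flagged when `negativeMax` falls below a cutoff chosen from sample data, as discussed above.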
It turns out the result of my gut feeling and Diego's method is an average of the integral. I still don't like that name, so I have described the algorithm and have asked on Math.SE what to call this, which got migrated to "Cross Validated", Stats.SE .
I updated the graphs after a massive edit of my Math.SE question. It turns out I'm taking the average of a closed integral of the derivative of the data. :P First, we gather the data:
Next is the "derivative": step through the original data array to form the deltas array which is the rise of ADC values from one 0.75° step to the next. "Rise" or "slope" is what the derivative is: dy/dx.
With the "slope" or average leveled out, I can find multiple negative deltas in a row, sum them, then divide by the count at the end of the dip. The sum is an integral of the area between average and the deltas and when the dip goes back positive, I can divide the sum by the count of the dips.
During testing, I came up with a cutoff value for this average of the integral at 2.6. That was a great measure of my "gut instinct" looking at the plot thinking a part was good or bad.
In case someone else finds themselves trying to quantify this, here's the code I implemented. Note that it is only looking for negative dips. Also, dipCountLimit is defined elsewhere as 5. In addition to the dip detector/accumulator (ie Numerical Integrator) I also have a spike detector that arbitrarily flags the test as bad if any data points stray from the average by the amount of average + standard deviation. AVG+STD DEV as a spike limit was chosen arbitrarily based on the observed plots of the parts it would fail.
int dipdx=0;
// inDipFlag also counts the length of this dip
int inDipFlag=0;
float dips[140] = { 0.0 };
for( int idx=0; idx<stop; idx++ ){
const float diffFromAvg = deltas[idx] - line->AdcAvgRise;
// state machine to monitor dips
const int _stop = stop-1;
if( diffFromAvg < 0 && idx < _stop ) {
// check NEXT data point for negative diff & set dipFlag to put state in dip
const float nextDiff = deltas[idx+1] - line->AdcAvgRise;
if( nextDiff < 0 && inDipFlag == 0 )
inDipFlag = 1;
// already IN a dip, and next diff is negative
if( nextDiff < 0 && inDipFlag > 0 ) {
inDipFlag++;
}
// accumulate this dip
dips[dipdx]+= diffFromAvg;
// next data point is not negative: this dip (or lone negative delta) ends here
if( nextDiff >= 0 ) {
if( inDipFlag < dipCountLimit ){
// reset the accumulator, do not advance dipdx to next entry
dips[dipdx]=0.0;
} else {
// change this entry's value from dip sum to its ratio
dips[dipdx] = -dips[dipdx]/inDipFlag;
// advance dipdx to next entry
dipdx++;
}
// Next diff isn't negative, so the dip is done
inDipFlag = 0;
}
}
}

Designing a grid overlay based on longitudes and latitudes

I'm trying to figure out the best way to approach the following:
Say I have a flat representation of the earth. I would like to create a grid that overlays this with each square on the grid corresponding to about 3 square kilometers. Each square would have a unique region id. This grid would just be stored in a database table that would have a region id and then probably the long/lat coordinates of the four corners of the region, right? Any suggestions on how to generate this table easily? I know I would first need to find out the width and height of this "flattened earth" in kms, calculate the number of regions, and then somehow assign the long/lats to each intersection of vertical/horizontal line; however, this sounds like a lot of manual work.
Secondly, once I have that grid table created, I need to design a fxn that takes a long/lat pair and then determines which logical "region" it is in. I'm not sure how to go about this.
Any help would be appreciated.
Thanks.
Assume the Earth is a sphere with radius R = 6371 km.
Start at (lat, long) = (0, 0) deg. Around the equator, 3km corresponds to a change in longitude of
dlong = 3 / (2 * pi * R) * 360
= 0.0269796482 degrees
If we walk around the equator and put a marker every 3km, there will be about (2 * pi * R) / 3 = 13343.3912 of them. "About" because it's your decision how to handle the extra 0.3912.
From (0, 0), we walk North 3 km to (lat, long) = (0.0269796482, 0). We will walk around the Earth again on a path that is locally parallel to the first path we walked. Because it is a little closer to the N Pole, the radius of this circle is a bit smaller than that of the first circle we walked. Let's use lower case r for this radius, taking the cosine of the latitude in degrees:
r = R * cos(lat)
= 6371 * cos(0.0269796482 deg)
= 6370.99929 km
We calculate dlong again using the smaller radius,
dlong = 3 / (2 * pi * r) * 360
= 0.026979651 deg
We put down the second set of flags. This time there are about (2 * pi * r) / 3 = 13343.3897 of them, only a hair fewer than before. The shortfall grows quickly with latitude, though: at 60 degrees North, r = R * cos(60 deg) = 3185.5 km, and only about 6671.7 flags fit, half as many as at the equator.
How do we draw a ribbon of squares when the top line has fewer corners than the bottom line? In fact, as we walked around the Earth, we'd find that we started off with pretty good squares, but that the shape of the regions sheared out into pretty extreme parallelograms.
We need a different strategy that gives us the same number of corners above and below. If the lower boundary (SW-SE) is 3 km long, then the top should be a little shorter, to make a ribbon of trapeziums.
There are many ways to craft a compromise that approximates your ideal square grid. The Wikipedia article on map projections that preserve a metric property links to several dozen such strategies.
The specifics of your app may allow you to simplify things considerably, especially if you don't really need to map the entire globe.
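One simple compromise along these lines is to cut the sphere into latitude bands of fixed height and to give each band its own whole number of cells, so that every band is a ribbon of near-trapezoids. The sketch below is only an illustration of that idea, not a standard scheme: `Region` and `regionOf` are hypothetical names, the Earth is assumed spherical with R = 6371 km, and `cellKm` is the target edge length.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Region {
    std::int64_t row;  // latitude band, 0 at the south pole
    std::int64_t col;  // cell within the band, 0 at longitude -180
};

// Assumptions: latDeg in [-90, 90], lonDeg in [-180, 180).
Region regionOf(double latDeg, double lonDeg, double cellKm = 3.0) {
    const double pi = 3.14159265358979323846;
    const double R = 6371.0;
    // Band height in degrees: same arc length everywhere on a sphere.
    const double dlat = cellKm / (2.0 * pi * R) * 360.0;
    const std::int64_t rows =
        static_cast<std::int64_t>(std::ceil(180.0 / dlat));
    std::int64_t row =
        static_cast<std::int64_t>(std::floor((latDeg + 90.0) / dlat));
    row = std::min(std::max<std::int64_t>(row, 0), rows - 1);

    // Whole number of cells around the band, sized at its mid-latitude.
    const double midLat =
        std::min(-90.0 + (static_cast<double>(row) + 0.5) * dlat, 90.0);
    const double circumference = 2.0 * pi * R * std::cos(midLat * pi / 180.0);
    const std::int64_t cols = std::max<std::int64_t>(
        1, static_cast<std::int64_t>(std::floor(circumference / cellKm)));

    std::int64_t col = static_cast<std::int64_t>(
        std::floor((lonDeg + 180.0) / 360.0 * static_cast<double>(cols)));
    col = std::min(std::max<std::int64_t>(col, 0), cols - 1);
    return {row, col};
}
```

The pair (row, col) can serve as the region id directly, so the lookup needs no table at all; a table is only needed if the corner coordinates must be stored. Cells in adjacent bands do not share corners, which is the price of keeping every cell close to the target size.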
Microsoft has been investing in spatial data types in its SQL Server 2008 offering. It could help you out here, because it has data types to represent your flattened earth regions, operators to determine when a set of coordinates is inside a geometry, etc. Even if you choose not to use this, consider checking out the following links. The second one in particular has a lot of good background information on the problem and a discussion on some of the industry standard data formats for spatial data.
http://www.microsoft.com/sqlserver/2008/en/us/spatial-data.aspx
http://jasonfollas.com/blog/archive/2008/03/14/sql-server-2008-spatial-data-part-1.aspx
First, Paul is right. Unfortunately the earth is round which really complicates the heck out of this stuff.
I created a grid similar to this for a topographical mapping server many years ago. I just recorded the coordinates of the upper left corner of each region. I also used UTM coordinates instead of lat/long. If you know that each region covers 3 square kilometers, and since UTM is based on meters, it is straightforward to do a range query to discover the right region.
You do realize that because the earth is a sphere that "3 square km" is going to be a different number of degrees near the poles than near the equator, right? And that at the top and bottom of the map your grid squares will actually represent pie-shaped parts of the world, right?
I've done something similar with my database - I've broken it up into quad cells. So what I did was divide the earth into four quarters (-180,-90)-(0,0), (-180,0)-(0,90) and so on. As I added point entities to my database, if the "cell" got more than X entries, I split the cell into 4. That means that in areas of the world with lots of point entities, I have a lot of quad cells, but in other parts of the world I have very few.
My database for the quad tree looks like:
\d areaids;
Table "public.areaids"
Column | Type | Modifiers
--------------+-----------------------------+-----------
areaid | integer | not null
supercededon | timestamp without time zone |
supercedes | integer |
numpoints | integer | not null
rectangle | geometry |
Indexes:
"areaids_pk" PRIMARY KEY, btree (areaid)
"areaids_rect_idx" gist (rectangle)
Check constraints:
"enforce_dims_rectangle" CHECK (ndims(rectangle) = 2)
"enforce_geotype_rectangle" CHECK (geometrytype(rectangle) = 'POLYGON'::text OR rectangle IS NULL)
"enforce_srid_rectangle" CHECK (srid(rectangle) = 4326)
I'm using PostGIS to help find points in a cell. If I look at a cell, I can tell if it's been split because supercededon is not null. I can find its children by looking for ones that have supercedes equal to its id. And I can dig down from top to bottom until I find the ones that cover the area I'm concerned about by looking for ones with supercededon null and whose rectangle overlaps my area of interest (using the PostGIS '&&' operator).
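The splitting rule described above ("more than X entries, split into 4") can be sketched in memory as follows. This is a hypothetical stand-in for the database table, not the author's implementation: `QuadCell` is an invented class, `capacity` plays the role of X, and a minimum cell width is assumed so that many identical points cannot cause endless splitting.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct LonLat {
    double lon;
    double lat;
};

class QuadCell {
public:
    QuadCell(double minLon, double minLat, double maxLon, double maxLat,
             std::size_t capacity)
        : minLon_(minLon), minLat_(minLat), maxLon_(maxLon), maxLat_(maxLat),
          capacity_(capacity) {}

    void insert(const LonLat& p) {
        if (!children_.empty()) {  // already split: pass down to a child
            childFor(p).insert(p);
            return;
        }
        points_.push_back(p);
        // Assumed guard: stop splitting below about 1e-6 degrees of width.
        if (points_.size() > capacity_ && (maxLon_ - minLon_) > 1e-6) split();
    }

    bool isSplit() const { return !children_.empty(); }

    // Number of unsplit cells at or below this cell.
    std::size_t leafCount() const {
        if (children_.empty()) return 1;
        std::size_t n = 0;
        for (const auto& c : children_) n += c->leafCount();
        return n;
    }

private:
    QuadCell& childFor(const LonLat& p) {
        const double midLon = 0.5 * (minLon_ + maxLon_);
        const double midLat = 0.5 * (minLat_ + maxLat_);
        // Child order: 0 = SW, 1 = SE, 2 = NW, 3 = NE.
        const std::size_t idx =
            (p.lon >= midLon ? 1 : 0) + (p.lat >= midLat ? 2 : 0);
        return *children_[idx];
    }

    void add(double a, double b, double c, double d) {
        children_.push_back(
            std::unique_ptr<QuadCell>(new QuadCell(a, b, c, d, capacity_)));
    }

    void split() {
        const double midLon = 0.5 * (minLon_ + maxLon_);
        const double midLat = 0.5 * (minLat_ + maxLat_);
        add(minLon_, minLat_, midLon, midLat);  // SW
        add(midLon, minLat_, maxLon_, midLat);  // SE
        add(minLon_, midLat, midLon, maxLat_);  // NW
        add(midLon, midLat, maxLon_, maxLat_);  // NE
        std::vector<LonLat> old;
        old.swap(points_);  // this cell keeps no points once split
        for (const auto& p : old) childFor(p).insert(p);
    }

    double minLon_, minLat_, maxLon_, maxLat_;
    std::size_t capacity_;
    std::vector<LonLat> points_;
    std::vector<std::unique_ptr<QuadCell>> children_;
};
```

A cell with children corresponds to a row whose supercededon is set, and the four children to the rows whose supercedes points back at it.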
There's no way you'll be able to do this with rectangular cells, but I've just finished an R package dggridR which would make this easy to do using a grid of hexagonal cells. However, the 3km cell requirement might yield so many cells as to overload your machine.
You can use R to generate the grid:
install.packages('devtools')
install.packages('rgdal')
library(devtools)
devtools::install_github('r-barnes/dggridR')
library(dggridR)
library(rgdal)
#Construct a discrete global grid (geodesic) with cells of ~3 km^2
dggs <- dgconstruct(area=100000, metric=FALSE, resround='nearest')
#Get a hexagonal grid for the whole earth based on this dggs
grid <- dgearthgrid(dggs,frame=FALSE)
#Save the grid
writeOGR(grid, "grid_3km_cells.kml", "cells", "KML")
The KML file then contains the ids and edge vertex coordinates of every cell.
The grid looks a little like this:
My package is based on Kevin Sahr's DGGRID which can generate this same grid to KML directly, though you'll need to figure out how to compile it yourself.