Population-weighted center of a state - sas

I have a list of states, major cities in each state, their populations, and lat/long coordinates for each. Using this, I need to calculate the latitude and longitude that corresponds to the center of a state, weighted by where the population lives.
For example, if a state has two cities, A (population 100) and B (population 200), I want the coordinates of the point that lies 2/3rds of the way between A and B.
I'm using the SAS dataset that comes installed called maps.uscity. It also has some variables called "Projected Logitude/Latitude from Radians", which I think might allow me just to take a simple average of the numbers, but I'm not sure how to get them back into unprojected coordinates.
More generally, if anyone can suggest of a straightforward approach to calculate this it would be much appreciated.

The Census Bureau has actually done these calculations, and posted the results here: http://www.census.gov/geo/www/cenpop/statecenters.txt
Details on the calculation are in this pdf: http://www.census.gov/geo/www/cenpop/calculate2k.pdf

To answer the question that was asked, it sounds like you might be looking for a weighted mean. Just use PROC MEANS and take a weighted average of each coordinate:
/* data from http://www.world-gazetteer.com/ */
data AL;
input city $10 pop lat lon;
datalines;
Birmingham 242452 33.53 86.80
Huntsville 159912 34.71 86.63
Mobile 199186 30.68 88.09
Montgomery 201726 32.35 86.28
;
proc means data=AL;
weight pop;
var lat lon;
run;

Itzy's answer is correct. The US Census's lat/lng centroids are based on population. In constrast, the USGS GNIS data's lat/lng averages are based on administrative boundaries.
The files referenced by Itzy are the 2000 US Census data. The Census bureau is in the processing of rolling our the 2010 data. The following link is a segway to all of this data.
http://www.census.gov/geo/www/tiger/
I can answer a lot of geospatial questions. I am part of a public domain geospatial team at OpenGeoCode.Org

I believe you can do this using the same method used for calculating the center of gravity of an airplane:
Establish a reference point southwest of any part of the state. Actually it doesn't matter where the reference point is, but putting it SW will keep all numbers positive in the usual x-y send we tend to think of things.
Logically extend N-S and E-W lines from this point.
Also extend such lines from the cities.
For each city get the distance from its lines to the reference lines. These are the moment arms.
Multiply each of the distance values by the population of the city. Effectively you're getting the moment for each city.
Add all of the moments.
Add all of the populations.
Divide the total of the moments by the total of the populations and you have the center of gravity with respect for the reference point of the populations involved.

Related

How does CfsSubsetEva (Correlation-based Feature Selection) works in Weka

I have a dataset which is categorical dataset. I am using WEKA software for feature selection. I have used CfsSubsetEval as attribute evaluator with Greedystepwise method. I came to know this link that CFS uses Pearson correlation to find the strong correlation between the dataset. I also found out how to calculate Pearson correlation coefficient using this link. As per the link the data values need to be numerical for evaluation. Then how can WEKA did the evaluation on my categorical dataset?
The strange result is that Among 70 attributes CFS selects only 10 attributes. Is it because of the categorical dataset? Additionally my dataset is a highly imbalanced dataset where imbalanced ration 1:9(yes:no).
A Quick question
If you go through the link you can found the statement the correlation coefficient to measure the strength and direction of the linear relationship between two numerical variables X and Y. Now I can understand the strength of the correlation coefficient which is varied in between +1 to -1 but what about the direction? How can I get that? I mean the variable is not a vector so it should not have a direction.
The method correlate in the CfsSubsetEval class is used to compute the correlation between two attributes. It calls other methods, depending on the attribute types, which I've linked here:
two numeric attributes: num_num
numeric/nominal attributes: num_nom2
two nominal attributes: nom_nom

Finding objects within x miles of a point

I'm working on getting all events within 10 miles of the user's location. My models look something like this:
class User(models.Model):
location = models.PointField()
...
class Event(models.Model):
location = models.PointField()
...
In my tests, when I check the distance between the user and an event, I get the value 11.5122663513:
from geopy.distance import vincenty
print vincenty(request.user.location, event.location).miles # 11.5122663513
Yet, when I query for all events within 10 miles of the user's location, that event is returned:
Event.objects.filter(location__distance_lte=(request.user.location, D(mi=10))).count() # 1
Only when I drop the radius to less than 4 miles does the filter take effect:
Event.objects.filter(location__distance_lte=(request.user.location, D(mi=3))).count() # 0
I'm following the docs' example almost exactly, so I don't think my query is the problem.
What could be causing this discrepancy?
This very much depends on what type of database you are using.
Because cartesian math is much faster than geospatial math, the query likely treats coordinates as if they are on a plane rather than on a sphere.
The docs explain it this way:
Most people are familiar with using latitude and longitude to
reference a location on the earth’s surface. However, latitude and
longitude are angles, not distances. In other words, while the
shortest path between two points on a flat surface is a straight line,
the shortest path between two points on a curved surface (such as the
earth) is an arc of a great circle. Thus, additional computation
is required to obtain distances in planar units (e.g., kilometers and
miles). Using a geographic coordinate system may introduce
complications for the developer later on. For example, Spatialite does
not have the capability to perform distance calculations between
geometries using geographic coordinate systems, e.g. constructing a
query to find all points within 5 miles of a county boundary stored as
WGS84.
Portions of the earth’s surface may projected onto a two-dimensional,
or Cartesian, plane. Projected coordinate systems are especially
convenient for region-specific applications, e.g., if you know that
your database will only cover geometries in North Kansas, then you may
consider using projection system specific to that region. Moreover,
projected coordinate systems are defined in Cartesian units (such as
meters or feet), easing distance calculations.
Furthermore, this may be influenced by your database choice. If you are using Postgres/PostGIS, it has the following note in the docs:
In PostGIS, ST_Distance_Sphere does not limit the geometry types
geographic distance queries are performed with. However, these
queries may take a long time, as great-circle distances must be
calculated on the fly for every row in the query. This is because the
spatial index on traditional geometry fields cannot be used.
For much better performance on WGS84 distance queries, consider using
geography columns in your database instead because they are able to
use their spatial index in distance queries. You can tell GeoDjango to
use a geography column by setting geography=True in your field
definition.
You can check this yourself by printing out the raw SQL:
qs = Event.objects.filter(location__distance_lte=(request.user.location, D(mi=10))
print qs.query
Depending on your database type, and the amount of data you plan to store, you have a couple options:
Filter the points a second time in python
Try setting geography=True
Set an explicit SRID
Take a point, buffer it out into a circle with the given radius and then find points within that circle using contains
Use a different database type
If you share the raw query it'll be easier to figure out what is happening.

Finding the Distance Between Two Lines that represent GPS routes (MATLAB, Java, C++, or Python)

I have been researching and trying to figure this one out to no avail. I have found many ways not to solve this...
The gist of the problem: I am looking for a method to calculate the deviance from an original path traveled by way of GPS coordinates. I have multiple csv files that contain latitude, longitude, and UTC time. I have created KML files from this information for a visual viewing of the deviance and now would like to put a value on this deviation. I ahve chosen a route as a reference and would like to measure the other routes against the reference route. There are multiple routes each having it's own reference route, each of which has many runs. No two runs are the same, and some of the routes deviate more than the next. I cannot use time, only lat and lon since the runs were completed over many weeks of data collection.
What I have tried thus far:
Haversine and Equirectangular formulas (looping through and measuring point to point).
Outcome: The coordinates only line up for a short period of time and the difference in the number of points varies greatly.
Area under each curve: was going to find the difference of the two routes by this method.
Outcome: Really unsure how to proceed, nor find equations suitable for this calculation.
There were a couple more feeble attempts, but have been working on this for a few weeks now, with not much to show for and still unsure on how to proceed.
Any help or ideas would be greatly appreciated.
Possible solution 1: Instead of calculating the "sideways" deviation between the two routes, just compare the respective arc lengths (Matlab: arclength).
Possible solution 2: To compare two routes, each going from the same start A to the same end point B: Draw a straight line between A and B, place a number of equidistant points along AB, and then average the perpendicular distance from these points on AB to the paths you want to compare. The absolute difference between the cumulative deviations from the straight-line reference is your deviation.
Possible solution 3: Calculate the arc length of each route. Place a number of equidistant points along each route. Average the distance between these points.
Both solution 2 and 3 will depend on the number of points you place, but with a higher number of points, the average deviation will converge. Note that these solutions are both related to calculating the area under each curve.

When calculating a distance from a city, how can I factor in the approximate size (physical area) of the city?

I'm building a store locator based on in-house geocoding data. Effectively I need to query stories near City X or Zip Y within a certain radius. The data sets I'm working with are relatively comprehensive and include things such as population.
One issue is that large cities (Los Angeles for example) are many miles in radius so you could be within the city but miles from the coordinate we have loaded.
Is there a rule of thumb, or a free data feed which would list an approximate radius of a city, or perhaps even outlines of the city points?
Also, assuming I have a shape defining the city what calculation would I use to say "stores within X miles of this area"?
Why don't you use the zip codes and latitude/longitude of the stores, instead of the cities? You know the addresses of the stores, so use its zip code, look up the coordinates, and calculate the distance from the origin zip code. Then it wouldn't matter how big the city is, because big cities have many zip codes, but each store has its own zip code.
It would only be a problem for states with big zip codes like Texas, but then there is likely not more than 1 store per zip code anyways so not a big deal.
Ultimately we didn't implement this feature, but before it was cancelled I had a fair amount of success using the below approach:
Finding coordinates for the city itself, as well as all zip codes of the city
"Connecting the dots" of all the above coordinates to create a polygon of the (very rough shape of the city)
Checking if the user's input coordinate was within the given range of the polygon
The above approach worked relatively well and may have ultimately developed into a sound solution with some more enhancements and tuning.

How to exploit periodicity to reduce noise of a signal?

100 periods have been collected from a 3 dimensional periodic signal. The wavelength slightly varies. The noise of the wavelength follows Gaussian distribution with zero mean. A good estimate of the wavelength is known, that is not an issue here. The noise of the amplitude may not be Gaussian and may be contaminated with outliers.
How can I compute a single period that approximates 'best' all of the collected 100 periods?
Time-series, ARMA, ARIMA, Kalman Filter, autoregression and autocorrelation seem to be keywords here.
UPDATE 1: I have no idea how time-series models work. Are they prepared for varying wavelengths? Can they handle non-smooth true signals? If a time-series model is fitted, can I compute a 'best estimate' for a single period? How?
UPDATE 2: A related question is this. Speed is not an issue in my case. Processing is done off-line, after all periods have been collected.
Origin of the problem: I am measuring acceleration during human steps at 200 Hz. After that I am trying to double integrate the data to get the vertical displacement of the center of gravity. Of course the noise introduces a HUGE error when you integrate twice. I would like to exploit periodicity to reduce this noise. Here is a crude graph of the actual data (y: acceleration in g, x: time in second) of 6 steps corresponding to 3 periods (1 left and 1 right step is a period):
My interest is now purely theoretical, as http://jap.physiology.org/content/39/1/174.abstract gives a pretty good recipe what to do.
We have used wavelets for noise suppression with similar signal measured from cows during walking.
I'm don't think the noise is so much of a problem here and the biggest peaks represent actual changes in the acceleration during walking.
I suppose that the angle of the leg and thus accelerometer changes during your experiment and you need to account for that in order to calculate the distance i.e you need to know what is the orientation of the accelerometer in each time step. See e.g this technical note for one to account for angle.
If you need get accurate measures of the position the best solution would be to get an accelerometer with a magnetometer, which also measures orientation. Something like this should work: http://www.sparkfun.com/products/10321.
EDIT: I have looked into this a bit more in the last few days because a similar project is in my to do list as well... We have not used gyros in the past, but we are doing so in the next project.
The inaccuracy in the positioning doesn't come from the white noise, but from the inaccuracy and drift of the gyro. And the error then accumulates very quickly due to the double integration. Intersense has a product called Navshoe, that addresses this problem by zeroing the error after each step (see this paper). And this is a good introduction to inertial navigation.
Periodic signal without noise has the following property:
f(a) = f(a+k), where k is the wavelength.
Next bit of information that is needed is that your signal is composed of separate samples. Every bit of information you've collected are based on samples, which are values of f() function. From 100 samples, you can get the mean value:
1/n * sum(s_i), where i is in range [0..n-1] and n = 100.
This needs to be done for every dimension of your data. If you use 3d data, it will be applied 3 times. Result would be (x,y,z) points. You can find value of s_i from the periodic signal equation simply by doing
s_i(a).x = f(a+k*i).x
s_i(a).y = f(a+k*i).y
s_i(a).z = f(a+k*i).z
If the wavelength is not accurate, this will give you additional source of error or you'll need to adjust it to match the real wavelength of each period. Since
k*i = k+k+...+k
if the wavelength varies, you'll need to use
k_1+k_2+k_3+...+k_i
instead of k*i.
Unfortunately with errors in wavelength, there will be big problems keeping this k_1..k_i chain in sync with the actual data. You'd actually need to know how to regognize the starting position of each period from your actual data. Possibly need to mark them by hand.
Now, all the mean values you calculated would be functions like this:
m(a) :: R->(x,y,z)
Now this is a curve in 3d space. More complex error models will be left as an excersize for the reader.
If you have a copy of Curve Fitting Toolbox, localized regression might be a good choice.
Curve Fitting Toolbox supports both lowess and loess localized regression models for curve and curve fitting.
There is an option for robust localized regression
The following blog post shows how to use cross validation to estimate an optimzal spaning parameter for a localized regression model, as well as techniques to estimate confidence intervals using a bootstrap.
http://blogs.mathworks.com/loren/2011/01/13/data-driven-fitting/