Stata Scatter Sort by Y-Axis - stata

I want to plot the average weight (y-axis) by make (x-axis) and sort it so the heaviest make is the leftmost on the x-axis and the lightest is the rightmost on the x-axis. I thought the sort option would work.
sysuse auto, clear
keep if foreign
sort mpg
gen obsno = _n
scatter weight obsno, xla(1/22) sort(weight)

The sort() option is allowed with scatter because one of the possibilities of scatter is to connect points with a line. But it refers only to the order in which points are connected. The default is to connect points in the current sort order of the dataset. In practice the most common example is that observations are in some time order or follow some other sequence but even then scatter would respect the order of a time or other variable only if the data were in the same order -- unless, to complete the circle, a sort order were specified with this option.
You are not asking for any connection. In that circumstance, sort() remains legal but it is ignored as irrelevant to what you're asking for.
There is no circumstance in which sort() has another effect with scatter and it will not change either axis variable to something else.
A way to get what I think you want is with the undocumented vertical option of graph dot. Some of the small choices here are just my personal idea of what looks good. For example, I have found that the default dotted grid lines often copy poorly to other software, so I use thin light grey continuous lines as a grid.
sysuse auto, clear
keep if foreign
sort mpg
gen obsno = _n
graph dot (asis) weight, over(obsno, sort(1) descending) vertical ///
linetype(line) lines(lcolor(gs12) lw(vthin)) yla(, ang(h))
It's perfectly possible to get a similar graph using scatter, just more work as you have to arrange that the observation numbers become the value labels of a variable defining the sort order.
See also quantile and qplot (Stata Journal).

Related

Putting a Regression Line When Using Pandas scatter_matrix

I'm using scatter_matrix for correlation visualization and calculating correlation values using corr(). Is it possible to have the scatter_matrix visualization draw the regression line in the scatter plots?
I think this is a misleading question/thought process.
If you think of data in strictly 2 dimensions then a regression line on a scatter plot makes sense. But let's say you have 5 dimensions of data you are plotting in your scatter matrix. In this case the regression for each pair of dimensions is not an accurate representation of the global regression.
I would be wary presenting that to anyone as I can easily see where it could create confusion.
That being said if you don't care about a regression across all of your dimensions then you could write your own function to do this. A quick walk through of steps may be:
1. Identify number of dimensions N
2. Create figure
3. Double for loop on N, first will walk down rows, second will walk across columns
4. At each point add subplot, calculate regression (if not kde/hist position), plot scatter cloud and regression line or kde/hist
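The numerical part of those steps can be sketched as follows. This is only a minimal sketch in plain Python, not a pandas or matplotlib feature: `ols_line` and `pairwise_lines` are hypothetical helper names, and the sample data is made up for illustration. It fits one ordinary least-squares line per off-diagonal cell and leaves the diagonal free for the histogram or KDE; drawing each line on its subplot is left to whatever plotting code you use.

```python
from itertools import product


def ols_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pairwise_lines(columns):
    """Map (row_name, col_name) -> (slope, intercept) for every off-diagonal cell."""
    names = list(columns)
    lines = {}
    for row, col in product(names, repeat=2):
        if row == col:
            continue  # diagonal cell: reserved for a histogram or KDE
        # The cell at (row, col) shows the row variable on y and the column variable on x.
        lines[(row, col)] = ols_line(columns[col], columns[row])
    return lines


# Made-up sample data, one list per dimension.
data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
lines = pairwise_lines(data)
```

Each fitted line describes one pair of dimensions only, which is exactly the caveat above: none of them stands in for a regression across all dimensions at once.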

How do I remove the leftmost zero (on the x-axis) when graphing a categorical variable?

hist body, discrete freq xlabel(#5, labsize(small) angle(forty_five) valuelabel) produces:
I'm graphing a categorical variable, but I can't figure out how to drop the zero from the x-axis. I've tried the documentation for xlabel() and xscale() but didn't find any winners.
The short answer is to spell out that you only want xla(1/5, stuff ). How to spell out precisely which labels you want is documented.
Not the question, but this is in my view a poor graph. Go with a horizontal bar chart in which (1) the discreteness of the variable is respected; (2) the category labels are properly and readably horizontal, instead of using a most awkward device of text at 45 degrees. catplot (SSC) is one way to go. Also in Stata 13 (updated) upwards, graph hbar will do as well. You should also split the title in two lines. Even further off-topic: most consumers of this research should not care two hoots about the variable name or its question number in your survey.

How do I suppress markers in a Stata graph?

I'm using lowess to plot average cholesterol over time. Each participant had their cholesterol measured at random dates - usually not at the same time as others.
Anyway, I want the smoothed line, but I don't want the markers, especially since the markers seem to prevent scaling the y-axis from 0-500 to 0-250. Even when I go to the Graph Editor to remove the markers by hand, I still cannot rescale the y-axis.
How do I remove the markers using code only? And will doing this allow me to rescale the y-axis? Or, should I use a different command than lowess?
Graph commands have many, many options. It's a matter of going through them very carefully using help <command> and the manual. The following graph suppresses the markers.
clear all
set more off
sysuse auto
lowess mpg weight, mean msymbol(i)
Writing here more as a statistics user rather than a statistical programmer:
Suppressing the data sounds a very bad idea, regardless of your implication that it is what you need.
lowess isn't one thing: even with one implementation (Stata), there's still the question of what bandwidth was used. Note that there are several lowess (loess, locfit) algorithms around in different programs.
That said, the short answer is that twoway lowess rather than lowess does what you ask.

Shape-matching of plots using non-linear least squares

What would be the best way to implement a simple shape-matching algorithm to match a plot interpolated from just 8 points (x, y) against a database of similar plots (> 12 000 entries), each plot having >100 nodes? The database has 6 categories of plots (signals measured under 6 different conditions), and the main aim is to find the right category (so for every category there's around 2000 plots to compare against).
The 8-node plot would represent actual data from measurement, but for now I am simulating this by selecting a random plot from the database, then 8 points from it, then smearing it using a Gaussian random number generator.
What would be the best way to implement non-linear least-squares to compare the shape of the 8-node plot against each plot from the database? Are there any c++ libraries you know of that could help with this?
Is it necessary to find the actual formula (f(x)) of the 8-node plot to use it with least squares, or will it be sufficient to use interpolation in requested points, such as interpolation from the gsl library?
You can certainly use least squares without knowing the actual formula. If all of your plots are measured at the same x values, then this is easy -- you simply compute the sum in the normal way:
$$\chi^2 = \sum_i \frac{(y_i - Y(x_i))^2}{\sigma_i^2}$$
where y_i is a point in your 8-node plot, sigma_i is the error on the point and Y(x_i) is the value of the plot from the database at the same x position as y_i. You can see why this is trivial if all your plots are measured at the same x value.
If they're not, you can get Y(x_i) either by fitting the plot from the database with some function (if you know it) or by interpolating between the points (if you don't know it). The simplest interpolation is just to connect the points with straight lines and find the value of the straight lines at the x_i that you want. Other interpolations might do better.
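Here is a minimal sketch of that comparison in plain Python, using straight-line interpolation to get Y(x_i) and summing ((y_i - Y(x_i)) / sigma_i)^2 over the test points. The function names and the tiny two-entry database are made up for illustration, and it assumes each database plot is stored as x values in strictly increasing order with matching y values.

```python
from bisect import bisect_right


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation of (xs, ys) at x; xs must be strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the range of the database plot")
    # Index of the segment [xs[k], xs[k + 1]] that contains x.
    k = min(bisect_right(xs, x) - 1, len(xs) - 2)
    t = (x - xs[k]) / (xs[k + 1] - xs[k])
    return ys[k] + t * (ys[k + 1] - ys[k])


def chi_square(test_points, db_xs, db_ys):
    """Sum of squared, error-weighted residuals; test_points holds (x_i, y_i, sigma_i)."""
    return sum(
        ((y - interpolate(db_xs, db_ys, x)) / sigma) ** 2
        for x, y, sigma in test_points
    )


def best_match(test_points, database):
    """Return the key of the database plot with the smallest chi-square."""
    return min(database, key=lambda name: chi_square(test_points, *database[name]))


# Made-up database: name -> (x values, y values).
database = {
    "linear": ([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]),
    "flat": ([0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]),
}
# Made-up test plot: (x_i, y_i, sigma_i).
test_points = [(0.5, 0.6, 0.1), (1.5, 1.4, 0.1), (2.5, 2.5, 0.1)]
winner = best_match(test_points, database)
```

To pick a category rather than a single plot, you could run this over every entry and look at which category collects the lowest values. That is a design choice, not something the sum itself decides for you.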
In my field, we use ROOT for this kind of thing. However, scipy has a great collection of functions, and it might be easier to get started with -- if you don't mind using Python.
One major problem you could have would be that the two plots are not independent. Wikipedia suggests McNemar's test in this case.
Another problem you could have is that you don't have much information in your test plot, so your results will be affected greatly by statistical fluctuations. In other words, if you only have 8 test points and two plots match, how will you know if the underlying functions are really the same, or if the 8 points simply jumped around (inside their error bars) in such a way that it looks like the plot from the database -- purely by chance! ... I'm afraid you won't really know. So the plots that test well will include false positives (low purity), and some of the plots that don't happen to test well were probably actually good matches (low efficiency).
To solve that, you would need to either use a test plot with more points or else bring in other information. If you can throw away plots from the database that you know can't match for other reasons, that will help a lot.

Population-weighted center of a state

I have a list of states, major cities in each state, their populations, and lat/long coordinates for each. Using this, I need to calculate the latitude and longitude that corresponds to the center of a state, weighted by where the population lives.
For example, if a state has two cities, A (population 100) and B (population 200), I want the coordinates of the point that lies 2/3rds of the way between A and B.
I'm using the SAS dataset that comes installed called maps.uscity. It also has some variables called "Projected Longitude/Latitude from Radians", which I think might allow me just to take a simple average of the numbers, but I'm not sure how to get them back into unprojected coordinates.
More generally, if anyone can suggest a straightforward approach to calculate this it would be much appreciated.
The Census Bureau has actually done these calculations, and posted the results here: http://www.census.gov/geo/www/cenpop/statecenters.txt
Details on the calculation are in this pdf: http://www.census.gov/geo/www/cenpop/calculate2k.pdf
To answer the question that was asked, it sounds like you might be looking for a weighted mean. Just use PROC MEANS and take a weighted average of each coordinate:
/* data from http://www.world-gazetteer.com/ */
data AL;
input city :$10. pop lat lon;
datalines;
Birmingham 242452 33.53 86.80
Huntsville 159912 34.71 86.63
Mobile 199186 30.68 88.09
Montgomery 201726 32.35 86.28
;
proc means data=AL;
weight pop;
var lat lon;
run;
Itzy's answer is correct. The US Census's lat/lng centroids are based on population. In contrast, the USGS GNIS data's lat/lng averages are based on administrative boundaries.
The files referenced by Itzy are the 2000 US Census data. The Census Bureau is in the process of rolling out the 2010 data. The following link is a gateway to all of this data.
http://www.census.gov/geo/www/tiger/
I can answer a lot of geospatial questions. I am part of a public domain geospatial team at OpenGeoCode.Org
I believe you can do this using the same method used for calculating the center of gravity of an airplane:
Establish a reference point southwest of any part of the state. Actually it doesn't matter where the reference point is, but putting it SW will keep all numbers positive in the usual x-y sense we tend to think of things.
Logically extend N-S and E-W lines from this point.
Also extend such lines from the cities.
For each city get the distance from its lines to the reference lines. These are the moment arms.
Multiply each of the distance values by the population of the city. Effectively you're getting the moment for each city.
Add all of the moments.
Add all of the populations.
Divide the total of the moments by the total of the populations and you have the center of gravity, with respect to the reference point, of the populations involved.
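The steps above can be sketched in a few lines of Python. This is only a sketch under a simplifying assumption: it treats latitude and longitude as flat x-y coordinates, which is an approximation that gets worse over large areas. `population_center` is a hypothetical name, and the two-city input is the A/B example from the question placed on a made-up line.

```python
def population_center(cities, reference=(0.0, 0.0)):
    """Population-weighted center; cities holds (population, lat, lon) tuples."""
    ref_lat, ref_lon = reference
    total_pop = sum(pop for pop, _, _ in cities)
    # Moment = population times moment arm (distance from the reference line).
    lat_moment = sum(pop * (lat - ref_lat) for pop, lat, _ in cities)
    lon_moment = sum(pop * (lon - ref_lon) for pop, _, lon in cities)
    # Dividing by the total population gives the offset from the reference point.
    return (ref_lat + lat_moment / total_pop, ref_lon + lon_moment / total_pop)


# City A (population 100) at lon 0 and city B (population 200) at lon 3, same latitude.
two_cities = [(100, 0.0, 0.0), (200, 0.0, 3.0)]
center = population_center(two_cities)
shifted = population_center(two_cities, reference=(-10.0, -10.0))
```

The result sits two thirds of the way from A to B, and moving the reference point does not change it. So this is the same weighted mean that PROC MEANS computes with a weight statement.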