Modifying the scale of X axis in graph - stata

I'm trying to plot a graph in Stata.
This is my code:
scatter logpgp95 avexpr || lfit logpgp95 avexpr, ylabel(4(2)10) xscale(range(4 10))
It gives me a graph like this:
I want the x axis to start at 4 and run to 10, with the 4 sitting where the 2 is now.
Something like this (as an example):
If I use the xlabel code only:
scatter logpgp95 avexpr || lfit logpgp95 avexpr, ylabel(4(2)10) xlabel(4(2)10)
I get this:
The problem is that I don't want 4 to be so far from the starting point.

My guess is that you have an observation with avexpr below 2 and a missing value for logpgp95. In that case the graph still takes that value as the minimum for the x axis, but the point is not plotted because it has no y value.
Try this:
scatter logpgp95 avexpr if !missing(logpgp95, avexpr) || lfit logpgp95 avexpr, ylabel(4(2)10)

Your problem is not reproducible.
I simulated some data with x ranging from 3.5 to 10. Even without my asking, the x axis labels appear as 4 6 8 10 and there is no enormous space to the left.
clear
set obs 100
set seed 2803
range x 3.5 10
gen y = x + rnormal()
scatter y x || lfit y x
I have to guess that your real code differs from what we can see. Otherwise, we need your data to check what is happening.

How to simulate pairs from a joint distribution

I have two normal distributions X and Y and with a given covariance between them and variances for both X and Y, I want to simulate (say 200 points) of pairs of points from the joint distribution, but I can't seem to find a command/way to do this. I want to eventually plot these points in a scatter plot.
so far, I have
set obs 100
set seed 1
gen y = 64*rnormal(1, 5/64)
gen x = 64*rnormal(1, 5/64)
matrix D = (1, .5 | .5, 1)
drawnorm x, y, cov(D)
but this makes an error saying that x and y already exist.
Also, once I have a sample, how would I plot the drawnorm output as a scatter?
A related approach for generating correlated data is to use the corr2data command:
clear
set obs 100
set seed 1
matrix D = (1, .5 \ .5, 1)
drawnorm x1 y1, cov(D)
corr2data x2 y2, cov(D)
. summarize x*
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          x1 |        100    .0630304    1.036762  -2.808194   2.280756
          x2 |        100    1.83e-09           1  -2.332422   2.238905
. summarize y*
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          y1 |        100   -.0767662    .9529448  -2.046532   2.726873
          y2 |        100    3.40e-09           1  -2.492884   2.797518
It is important to note that unlike drawnorm, the corr2data approach does not generate data that is a sample from an underlying population.
You can then create a scatter plot as follows:
scatter x1 y1
Or to compare the two approaches in a single graph:
twoway scatter x1 y1 || scatter x2 y2
EDIT:
For specific means and variances you need to specify the mean vector μ and covariance matrix Σ in drawnorm. For example, to draw two random variables that are jointly normally distributed with means of 8 and 12, and variances 5 and 8 respectively, you type:
matrix mu = (8, 12)
scalar cov = 0.4 * sqrt(5 * 8) // assuming a correlation of 0.4
matrix sigma = (5, cov \ cov, 8)
drawnorm double x y, means(mu) cov(sigma)
The means() and cov() options of drawnorm are both documented in the help file.
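To make clearer what drawing from a joint normal with a given mean vector and covariance matrix involves, here is a minimal sketch in Python using only the standard library. It is not how drawnorm is implemented; it just builds the second variable from the first plus independent noise, which is the two-variable case of a Cholesky factorization. The function name draw_bivariate_normal and its parameters are made up for this illustration, and the numbers are the ones from the example above (means 8 and 12, variances 5 and 8, correlation 0.4).

```python
import math
import random


def draw_bivariate_normal(n, mean_x, mean_y, var_x, var_y, corr, seed=1):
    """Return n (x, y) pairs from a bivariate normal distribution.

    Hypothetical helper for illustration only. The covariance implied by
    the arguments is corr * sqrt(var_x * var_y).
    """
    rng = random.Random(seed)
    sd_x = math.sqrt(var_x)
    sd_y = math.sqrt(var_y)
    pairs = []
    for _ in range(n):
        # Two independent standard normal draws.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # x depends on z1 only; y mixes z1 and z2 so that corr(x, y) = corr.
        x = mean_x + sd_x * z1
        y = mean_y + sd_y * (corr * z1 + math.sqrt(1.0 - corr ** 2) * z2)
        pairs.append((x, y))
    return pairs


pairs = draw_bivariate_normal(200, 8, 12, 5, 8, 0.4)
```

As with drawnorm, the sample means, variances and correlation will only be close to the requested values, not exactly equal to them.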
Here is an almost minimal example:
. clear
. set obs 100
number of observations (_N) was 0, now 100
. set seed 1
. matrix D = (1, .5 \ .5, 1)
. drawnorm x y, cov(D)
As the help for drawnorm explains, you must supply new variable names. As x and y already exist, drawnorm threw you out. You also had a superfluous comma that would have triggered a syntax error.
help scatter tells you about scatter plots.

How to use pymc to parameterize a probabilistic graphical model?

How can one use pymc to parameterize a probabilistic graphical model?
Suppose I have a PGM with two nodes X and Y.
Let's say X->Y is the graph.
And X takes two values {0,1}, and
Y also takes two values {0,1}.
I want to use pymc to learn the parameters of the distribution and populate the
graphical model with it for running inferences.
The way I could think of is as follows:
X_p = pm.Uniform("X_p", 0, 1)
X = pm.Bernoulli("X", X_p, value=X_Vals, observed=True)
Y0_p = pm.Uniform("Y0_p", 0, 1)
Y0 = pm.Bernoulli("Y0", Y0_p, value=Y0Vals, observed=True)
Y1_p = pm.Uniform("Y1_p", 0, 1)
Y1 = pm.Bernoulli("Y1", Y1_p, value=Y1Vals, observed=True)
Here Y0Vals are values of Y corresponding to X values = 0
And Y1Vals are values of Y corresponding to X values = 1.
The plan is to draw MCMC samples from these and use the means of Y0_p and Y1_p
to populate the discrete bayesian network's probability... So the probability table
for P(X) = (X_p, 1-X_p) while that of P(Y|X) is:
        Y=0     Y=1
X=0     Y0_p    1-Y0_p
X=1     Y1_p    1-Y1_p
Questions:
Is this the correct way of doing this?
Doesn't this get clumsy, especially if X has hundreds of discrete values,
or if a variable has two parents X and Y with 10 discrete values each?
Is there something better I can do?
Are there any good books that detail how we can do this kind of interconnection?
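One possible simplification, offered as a sketch rather than as a PyMC recipe: a Uniform(0, 1) prior is a Beta(1, 1) prior, and with Bernoulli observations the posterior of each probability is Beta(1 + k, 1 + n - k), where k is the number of ones among n observations. Its mean is (k + 1)/(n + 2), so the point estimates for the table can be computed by counting, without sampling, and one dictionary keyed by the parent value avoids writing a separate variable per row. The function name posterior_mean_cpt and the toy data below are hypothetical. Note that a Bernoulli parameter is the probability of the value 1, so each estimate here is P(Y=1 | X=x).

```python
from collections import defaultdict


def posterior_mean_cpt(parent_values, child_values):
    """Posterior mean of P(child = 1 | parent) under a Uniform(0, 1) prior.

    Hypothetical helper: parent values may be any hashable objects (for
    several parents, pass tuples); child values must be 0 or 1.
    """
    # For each parent value, count how often the child was 0 and 1.
    counts = defaultdict(lambda: [0, 0])
    for parent, child in zip(parent_values, child_values):
        counts[parent][child] += 1
    # Mean of Beta(1 + n1, 1 + n0) is (n1 + 1) / (n0 + n1 + 2).
    return {
        parent: (n1 + 1) / (n0 + n1 + 2)
        for parent, (n0, n1) in counts.items()
    }


# Toy data, invented for the example.
x_vals = [0, 0, 0, 1, 1, 1, 1, 0]
y_vals = [0, 1, 0, 1, 1, 0, 1, 0]
cpt = posterior_mean_cpt(x_vals, y_vals)
```

Sampling would still be needed if the full posterior, a non-conjugate prior, or shared parameters across rows were wanted.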

Cycle doesn't work

In Stata, I have the following variables: latitude, longitude, avg_luminosity. For each observation (1547 total), I need to find a sum (let's call this variable sum_lum) of average luminosities of "neighbours" of this particular pair of latitude and longitude, those whose latitude and longitude lie within 0.5 radius. I have tried the following code:
tempvar sum_temp
forvalues i=1/1547 {
egen `sum_temp' = sum(avg_luminosity) if (latitude<latitude[_n]+0.5 & latitude>latitude[_n]-0.5 & longitude<longitude[_n]+0.5 & longitude>longitude[_n]+0.5)
replace sum_lum[_n]= sum_temp
drop `sum_temp'
}
But the code doesn't work (weights not allowed). Could anyone please help me on this issue?
This is not a very good question as it stands, because no sample data are given with which to run the code. See https://stackoverflow.com/help/mcve for how to ask a good question. All we know is that 1547 is the number of observations.
But nevertheless there are various problems identifiable with this code.
First, consider the if qualifier:
if (latitude<latitude[_n]+0.5 & latitude>latitude[_n]-0.5 & longitude<longitude[_n]+0.5 & longitude>longitude[_n]+0.5)
We need to correct a typo there: the last +0.5 should evidently be -0.5.
To focus on the main problem, replace latitude with y and longitude with x
if (y < y[_n]+0.5 & y > y[_n]-0.5 & x < x[_n]+0.5 & x > x[_n]-0.5)
The subscript [_n] just means the current observation and is superfluous:
if (y < y+0.5 & y > y-0.5 & x < x+0.5 & x > x-0.5)
from which it can be seen that the qualification is no qualification: it is always true that (using mathematical notation now) y - 0.5 < y < y + 0.5 and similarly for x.
The intent of this code is to compare any y and any x with the current y and x, but that is not what it does in Stata.
Otherwise put, the guess may be that [_n] has a different interpretation each time round a loop, but that is not the case.
Second, the effect of the loop 1/1547, if the code were otherwise correct, would be to repeat exactly the same calculation 1547 times. The intent of the code is no doubt otherwise, but nothing inside the loop uses the loop index i in any way.
Third, neither of these is the problem reported.
replace sum_lum[_n]= sum_temp
fails because of the subscript, which is not allowed with replace before the equals sign: the error message about weights is Stata's guess that you are trying to specify weights. The statement would also fail (to do what you want, or very likely to work at all), because the variable on the right-hand side should be the temporary variable you have just created.
Fourth, although this is style not syntax, using egen to calculate a sum is overkill. No new variable need be re-created 1547 times only to be dropped.
Here's a guess at what will work:
gen sum_lum = .
local y latitude
local x longitude
quietly forval i = 1/1547 {
summarize avg_luminosity if inrange(`y', `y'[`i'] - 0.5, `y'[`i'] + 0.5) & ///
inrange(`x', `x'[`i'] - 0.5, `x'[`i'] + 0.5), meanonly
replace sum_lum = r(sum) in `i'
}
That loop uses the current observation's latitude and longitude.
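For readers who find the logic easier to follow outside Stata, here is a rough Python sketch of the same idea: the outer index i plays the role of the loop index, and every other observation j is compared with observation i, which is exactly what the original [_n] subscript failed to do. The function name neighbour_sums and the sample values are invented for illustration. Like inrange() in the Stata code, the comparison is inclusive at both ends, and each observation counts as its own neighbour.

```python
def neighbour_sums(latitude, longitude, luminosity, radius=0.5):
    """For each observation i, sum luminosity over all observations whose
    latitude and longitude both lie within `radius` of observation i.

    Hypothetical helper; observation i itself is included in its own sum.
    """
    n = len(latitude)
    sums = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            close_lat = abs(latitude[j] - latitude[i]) <= radius
            close_lon = abs(longitude[j] - longitude[i]) <= radius
            if close_lat and close_lon:
                total += luminosity[j]
        sums.append(total)
    return sums


# Three made-up observations: the first two are neighbours, the third is not.
sum_lum = neighbour_sums([0.0, 0.3, 2.0], [0.0, 0.2, 2.0], [1.0, 2.0, 4.0])
```

If an observation should not count as its own neighbour, subtract its own luminosity afterwards.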

Calculate a difference in stata with if command

I want to calculate something like
by group: egen x if y==1 - x if y==2
Of course this is not real Stata code, but I'm kind of lost. In R this is done simply by putting a "[]" after the variable of interest, but I'm not sure how to do it in Stata.
R would be
x[y==1] - x[y==2]
I would use reshape.
clear
version 11.2
set seed 2001
* generate data
set obs 100
generate y = 1 + mod(_n - 1, 2)
generate x = rnormal()
generate group = 1 + floor((_n - 1) / 2)
list in 1/10
* reshape to wide and difference
reshape wide x, i(group) j(y)
generate x_diff = x1 - x2
list in 1/5
I would use reshape in R, also. Otherwise can you be sure that everything is properly ordered to give you the difference you want?
There is likely a neat Mata solution, but I know very little Mata. You may find preserve and restore helpful if you're averse to reshaping.
Richard Herron makes a good point that a reshape to a different structure might be worthwhile. Here I focus on how to do it with the existing structure.
Assuming that there are precisely two observations for each distinct value of group, one with y == 1 and one with y == 2, then
bysort group (y) : gen diff = x[1] - x[2]
gives the difference between values of x, necessarily repeated for both observations in a group. An assumption-free method is
bysort group: egen mean_1 = mean(x / (y == 1))
by group: egen mean_2 = mean(x / (y == 2))
gen diff = mean_1 - mean_2
Consider expressions such as x / (y == 1). Here the denominator y == 1 is 1 when y is indeed 1 and 0 otherwise. Division by 0 yields missing in Stata, but the egen command here ignores those. So the first command of the three commands above yields the mean of x for observations for which y == 1 and the second the mean of x for observations for which y == 2. Other values of y (even missings) will be ignored. This method should agree with the first method when the first method is valid.
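The logic of the assumption-free method can be sketched in Python as follows. This is only an illustration of what the three Stata commands compute, not a replacement for them: within each group, average x over observations with y equal to 1, average x over observations with y equal to 2, ignore everything else, and take the difference. The function name group_differences and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_differences(rows):
    """rows is an iterable of (group, y, x) tuples.

    Returns {group: mean of x where y == 1 minus mean of x where y == 2}.
    Rows with other values of y, or with x missing (None), are ignored.
    A group lacking either y == 1 or y == 2 gets None (missing).
    """
    # group -> {1: [sum of x, count], 2: [sum of x, count]}
    totals = defaultdict(lambda: {1: [0.0, 0], 2: [0.0, 0]})
    for group, y, x in rows:
        if y in (1, 2) and x is not None:
            totals[group][y][0] += x
            totals[group][y][1] += 1
    result = {}
    for group, by_y in totals.items():
        (sum1, n1), (sum2, n2) = by_y[1], by_y[2]
        result[group] = sum1 / n1 - sum2 / n2 if n1 and n2 else None
    return result


# Made-up data: group 2 has an extra row with y == 3 that is ignored.
rows = [(1, 1, 5.0), (1, 2, 3.0), (2, 1, 1.0), (2, 2, 4.0), (2, 3, 100.0)]
diff = group_differences(rows)
```

When each group has exactly one row with y equal to 1 and one with y equal to 2, the means are just those single values, which is why the two Stata methods agree in that case.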
For a review of similar problems, see http://stata-journal.com/article.html?article=dm0055
In Stata the if referred to here is a qualifier (not a command).

Calculate the gradient for a histogram in C++

I calculated the histogram (a simple 1D array) for a 3D grayscale image.
Now I would like to calculate the gradient of this histogram at each point. So this would actually mean I have to calculate the gradient of a 1D function at certain points. However, I do not have a function. So how can I calculate it with concrete x and y values?
For the sake of simplicity could you probably explain this to me on an example histogram - for example with the following values (x is the intensity, and y the frequency of this intensity):
x1 = 1; y1 = 3
x2 = 2; y2 = 6
x3 = 3; y3 = 8
x4 = 4; y4 = 5
x5 = 5; y5 = 9
x6 = 6; y6 = 12
x7 = 7; y7 = 5
x8 = 8; y8 = 3
x9 = 9; y9 = 5
x10 = 10; y10 = 2
I know that this is also a math problem, but since I need to solve it in C++ I thought you could help me here.
Thank you for your advice
Marc
I think you can calculate your gradient using the same approach used in image edge detection (which is a gradient calculation). If your histogram is in a vector you can calculate an approximation* of the gradient as:
for each point in the histogram (except the last) compute
gradient[x] = (hist[x+1] - hist[x])
This is a very simple way to do it, but I'm not sure if it is the most accurate.
* An approximation because you are working with discrete data instead of continuous data.
Edited:
Other operators may emphasize small differences (small gradients become more pronounced). The Roberts operator derives from the definition of the derivative:
f'(x) = lim delta -> 0 of (f(x + delta) - f(x)) / delta
delta tends to 0 but is never exactly zero (which avoids division by zero). That is impossible on discrete data, where the smallest delta we can use is 1 (because 1 is the smallest distance between two points in an image or histogram).
Substituting
delta = 1 for the limit delta -> 0
we get
(f(x + 1) - f(x)) / 1 = f(x + 1) - f(x) => hist[x+1] - hist[x]
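Here is a minimal sketch of that forward difference applied to the example histogram from the question. It is written in Python for brevity, but the loop translates line for line into a C++ loop over a std::vector. The function name forward_difference is made up for the illustration. Note that the result has one element fewer than the histogram, because the last bin has no right-hand neighbour.

```python
def forward_difference(hist):
    """Return hist[x + 1] - hist[x] for every bin except the last."""
    return [hist[x + 1] - hist[x] for x in range(len(hist) - 1)]


# Frequencies y1..y10 from the question (intensities 1..10, spacing 1).
hist = [3, 6, 8, 5, 9, 12, 5, 3, 5, 2]
gradient = forward_difference(hist)
```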
Two general approaches here:
a discrete approximation to the derivative
take the real derivative of a fitted function
In the first case try:
g = (y_(i+1) - y_(i-1))/(2*dx)
at all the points except the ends, or one of
g_left-end = (y_(i+1) - y_i)/dx
g_right-end = (y_i - y_(i-1))/dx
where dx is the spacing between x points. (Unlike the equally correct definition Andres suggested, this one is symmetric. Whether it matters or not depends on your use case.)
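As a sketch of the first case, the central difference with one-sided differences at the two ends can be written as below. Python is used for brevity and the function name central_gradient is hypothetical; the same three formulas carry over directly to C++. The data are the example frequencies from the question, with dx = 1.

```python
def central_gradient(y, dx=1.0):
    """Central differences in the interior, one-sided differences at the ends."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two points")
    g = [0.0] * n
    g[0] = (y[1] - y[0]) / dx        # left end: forward difference
    g[-1] = (y[-1] - y[-2]) / dx     # right end: backward difference
    for i in range(1, n - 1):
        g[i] = (y[i + 1] - y[i - 1]) / (2 * dx)
    return g


hist = [3, 6, 8, 5, 9, 12, 5, 3, 5, 2]
grad = central_gradient(hist)
# grad is [3.0, 2.5, -0.5, 0.5, 3.5, -2.0, -4.5, 0.0, -0.5, -3.0]
```

Unlike the forward difference, this returns one value per histogram bin.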
In the second case, fit a spline to your data[*], and ask the spline library the derivative at the point you want.
[*] Use a library! Do not implement this yourself unless this is a learning project. I'd use ROOT because I already have it on my machine, but it is a pretty heavy package just to get a spline...
Finally, if your data is noisy, you may want to smooth it before doing slope detection. That way you avoid chasing the noise and only look at large-scale slopes.
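One simple way to smooth, shown here only as a sketch, is a centred moving average. The choice of a moving average, the window of 3 and the function name moving_average are all assumptions made for the example; the answer above does not say which smoother to use. At the ends the window is truncated to the bins that exist.

```python
def moving_average(y, window=3):
    """Centred moving average; the window shrinks at the two ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(y)):
        lo = max(0, i - half)
        hi = min(len(y), i + half + 1)
        smoothed.append(sum(y[lo:hi]) / (hi - lo))
    return smoothed


hist = [3, 6, 8, 5, 9, 12, 5, 3, 5, 2]
smooth = moving_average(hist)
```

The smoothed values can then be passed to whichever difference formula is used for the gradient.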
1. Take some squared paper and draw your histogram on it. Also draw vertical and horizontal axes through the 0,0 point of your histogram.
2. Take a straight edge and, at each point you are interested in, rotate the straight edge until it accords with your idea of what the gradient at that point is. It is most important that you do this: your definition of gradient is the one you want.
3. Once the straight edge is at the angle you desire, draw a line at that angle.
4. Drop perpendiculars from any 2 points on the line you just drew. It will be easier to take the following step if the horizontal distance between the 2 points you choose is about 25% or more of the width of your histogram. From the same 2 points draw horizontal lines to intersect the vertical axis of your histogram.
5. Your lines now define an x-distance and a y-distance, i.e. the lengths marked out on the horizontal and vertical axes (respectively) by their intersections with the perpendiculars and the horizontal lines. The gradient you want is the y-distance divided by the x-distance.
Now, to translate this into code is very straightforward, apart from step 2. You have to define what the criteria are for determining what the gradient at any point on the histogram is. Simple choices include:
a) at each point, set down your straight edge to pass through the point and the next one to its right;
b) at each point, set down your straight edge to pass through the point and the next one to its left;
c) at each point, set down your straight edge to pass through the point to the left and the point to the right.
You may want to investigate more complex choices such as fitting a curve (such as a quadratic or higher-order polynomial) through a number of points on your histogram and using the derivative of that to represent the gradient.
Until you understand the question on paper avoid coding in C++ or anything else. Once you do understand it, coding should be trivial.