Cycle doesn't work - stata

In Stata, I have the following variables: latitude, longitude, avg_luminosity. For each of 1547 observations, I need to find the sum (call this variable sum_lum) of the average luminosities of the "neighbours" of that observation's pair of latitude and longitude: those observations whose latitude and longitude both lie within 0.5 of it. I have tried the following code:
tempvar sum_temp
forvalues i=1/1547 {
egen `sum_temp' = sum(avg_luminosity) if (latitude<latitude[_n]+0.5 & latitude>latitude[_n]-0.5 & longitude<longitude[_n]+0.5 & longitude>longitude[_n]+0.5)
replace sum_lum[_n]= sum_temp
drop `sum_temp'
}
But the code doesn't work (weights not allowed). Could anyone please help me on this issue?

We don't have a very good question here, as no sample data are given with which to run the code; see https://stackoverflow.com/help/mcve for how to ask a good question. We do at least know that 1547 is the number of observations.
Nevertheless, several problems are identifiable in this code.
First, consider the if qualifier:
if (latitude<latitude[_n]+0.5 & latitude>latitude[_n]-0.5 & longitude<longitude[_n]+0.5 & longitude>longitude[_n]+0.5)
We need to correct a typo there: the last +0.5 should evidently be -0.5.
To focus on the main problem, write y for latitude and x for longitude:
if (y < y[_n]+0.5 & y > y[_n]-0.5 & x < x[_n]+0.5 & x > x[_n]-0.5)
The subscript [_n] just means the current observation and is superfluous:
if (y < y+0.5 & y > y-0.5 & x < x+0.5 & x > x-0.5)
from which it can be seen that the qualification is no qualification: it is always true that (using mathematical notation now) y - 0.5 < y < y + 0.5 and similarly for x.
The intent of this code is to compare any y and any x with the current y and x, but that is not what it does in Stata.
Otherwise put, the guess may be that [_n] has a different interpretation each time round a loop, but that is not the case.
Second, the effect of the loop 1/1547, if the code were otherwise correct, would be to repeat exactly the same calculation 1547 times. The intent of the code is no doubt otherwise, but nothing inside the loop uses the loop index i in any way.
Third, neither of these is the problem reported.
replace sum_lum[_n]= sum_temp
fails because of the subscript, which is not allowed with replace before the equals sign: the error message about weights is Stata's guess that you are trying to specify weights. The statement would also fail to do what you want, and very likely fail to work at all, because the right-hand side refers to a variable literally named sum_temp; the temporary variable just created must be referenced as `sum_temp'.
Fourth, although this is a matter of style rather than syntax, using egen to calculate a sum is overkill: no new variable need be created 1547 times only to be dropped again.
Here's a guess at what will work:
gen sum_lum = .
local y latitude
local x longitude
quietly forval i = 1/1547 {
    * sum avg_luminosity over observations within 0.5 of observation i's position
    summarize avg_luminosity if inrange(`y', `y'[`i'] - 0.5, `y'[`i'] + 0.5) & ///
        inrange(`x', `x'[`i'] - 0.5, `x'[`i'] + 0.5), meanonly
    replace sum_lum = r(sum) in `i'
}
That loop uses the current observation's latitude and longitude: each time around, the [`i'] subscripts pick out observation i's values, which is what the original [_n] subscripts could not do. Note also that, despite its name, the meanonly option of summarize still leaves r(sum) behind, and it is faster than a full summarize.

Related

Pulp - LP Objective function formulation

I am working on solving a set covering problem for electric vehicle charging stations. My objective is to maximize the demand covered by the radius of a charging station.
I have two variables to make up the objective function.
Yij denotes whether demand location i is covered by the radius of charging station j.
Similarly, Xj denotes whether charging station j is open.
I am looking to create an objective function such as the following:
Maximize OF = ((Y11 + Y21 + Y31 + .... + Yn1) * X1) + ((Y12 + Y22 + Y32 + .... + Yn2) * X2) + ....
I tried the following, but am running into issues:
OptModel += lpSum(((Y[i,j] for i in range (I)) * X[j]) for j in range(J))
Any idea on how to formulate this?
It isn't clear from your description why Y is a variable. It sounds like it should be a parameter (a known value) if the demand location is within the radius of some source.... There must be some other nuance as to why this isn't known. (If it can be calculated, do so, make it a parameter, and your problem is solved.)
The statement you propose is illegal because you are multiplying variables together and that makes the statement non-linear. You need to reformulate....
You have an implicit "and" condition in there, in that you only want to receive credit if both Y and X are true, so you will need an additional variable or to be clever in how you relate X and Y, because you can't multiply them.
Why don't you just add this constraint:
Y[i, j] <= X[j] for each i and j
that would essentially change the meaning of Y to "in range of an operating charger".
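Putting that together, a sketch of the reformulated model in the question's own notation (a sketch of the reformulation suggested above, not verified against your full model) would be:
Maximize OF = (Y11 + Y21 + ... + Yn1) + (Y12 + Y22 + ... + Yn2) + ...
subject to Y[i, j] <= X[j] for each i and j
The X[j] factors disappear from the objective because the constraints already force every Y[i, j] to 0 whenever station j is closed, so nothing is credited for a closed station.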
Also realize that if you sum up all of these Y vars (as you appear to be trying to do in your objective), you will get double counting of any demand that can be charged from multiple stations; not sure if that is the intent or not.

Confusion about formula for linear regression with gradient descent (pseudocode)

I've made a program that calculates the line of best fit for a set of data points using gradient descent. I generate 1000 random points and then it calculates the line of best fit by training on these 1000 points. My confusion lies in the theory of my code.
In the part of my code where the training function is, the function uses the current m and b values for y = mx + b to guess a y value as it goes through the training points' x values. This is supervised learning, so I know what the actual y value is; the function calculates the error and uses that error to adjust the m and b values. <-- What is happening in the program when adjusting the line of best fit
I get everything above ^. What I'm confused about is the part of the code that calculates how to adjust these m and b values. Here it is:
guess = m * x + b;
error = y - guess;
m = m + (error * x) * learningrate;
b = b + error * learningrate;
I'm confused about why we add instead of subtract that delta-m part (the (error * x) * learningrate). Ignoring the learning rate, the error * x part is the partial derivative of the error with respect to m. But if we take the partial derivative of something with respect to something, doesn't it give us the direction of steepest ascent? Shouldn't we go in the opposite direction (subtract the delta m) to get the proper m value? Isn't our goal to reduce the error?
Surprisingly to me, the above code works: if you add the delta m, it adjusts the m and b values in the right direction. So basically my question is: why aren't we subtracting the delta-m part (error * x), given that it points in the direction of steepest ascent and we want the opposite of that?
Thanks!
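For what it's worth, the sign puzzle dissolves once the derivative is written out in full. A short check, assuming the code is implicitly minimizing the squared error E = (y - guess)^2:
E = (y - (m*x + b))^2
dE/dm = 2 * (y - (m*x + b)) * (-x) = -2 * error * x
dE/db = 2 * (y - (m*x + b)) * (-1) = -2 * error
So error * x is not the gradient of E with respect to m; it is half of its negative. Gradient descent updates m = m - learningrate * dE/dm, and substituting dE/dm = -2 * error * x gives m = m + 2 * learningrate * error * x, which is exactly the code's update with the factor of 2 absorbed into the learning rate. The code is subtracting the gradient after all.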

Modifying a value on a logarithmic curve

I have one value that is a floating point percentage from 0-100, x, and another value that is a floating point from 0-1, y. As y gets closer to zero, it should reduce the value of x on a logarithmic curve.
So for example, say x = 28.0f and y = 0.8f. Since 0.8f isn't that far from 1.0f it should only reduce the value of x by a small amount, say bringing it down to x = 25.0f or something like that. As y gets closer to zero it should more and more drastically reduce the value of x. The only way I can think of doing this is with a logarithmic curve. I know what I want it to do, but I cannot for the life of me figure out how to implement this in C++. What would this algorithm look like in C++?
It sounds like you want this:
new_x = x * ln((e - 1) * y + 1)
I'm assuming you have the natural log function ln and the constant e. The number multiplied by x is a logarithmic function of y which is 0 when y = 0 and 1 when y = 1.
Here's the logic behind that function (this is basically a math problem, not a programming problem). You want something that looks like the ln function, rising steeply at first and then leveling off. But you want it to start at (0, 0) and then pass through (1, 1), whereas ln starts at (1, 0) and passes through (e, 1). That suggests that before you take the ln, you apply a simple linear shift that takes 0 to 1 and 1 to e: (e - 1) * y + 1.
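For concreteness, here is a minimal C++ sketch of that formula (the function name reduce is made up; std::log is the natural log):
#include <cmath>

// Scale x (0-100) by a logarithmic factor of y (0-1):
// the factor log((e - 1) * y + 1) is 0 at y = 0 and 1 at y = 1.
float reduce(float x, float y)
{
    const float e = std::exp(1.0f); // Euler's number
    return x * std::log((e - 1.0f) * y + 1.0f);
}
With x = 28.0f and y = 0.8f this returns about 24.2, in the neighbourhood of the 25 the question asks for.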
We can try the following approach: we need a function f(y) such that f(0) = 0 and f(1) = 1 which follows some logarithmic curve, maybe something like f(y) = A*log(B + C*y), with A, B and C constants to be determined.
f(0)=0, so B=1
f(1)=1, so A=1/log(1+C)
So now we just need to find a C value such that f(0.8) is roughly equal to 25/28. A little experimenting shows that C = 4 is rather close; you can get closer if you want.
So one possibility would be: f(y) = log(1.0 + 4.0*y) / log(5.0)
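As a quick hedged check in C++ (again, f is just an illustrative name):
#include <cmath>

// f(y) = log(1 + 4y) / log(5): f(0) = 0, f(1) = 1, f(0.8) is about 0.892.
double f(double y)
{
    return std::log(1.0 + 4.0 * y) / std::log(5.0);
}
Multiplying x = 28 by f(0.8) gives about 25.0, matching the example in the question.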

Calculate a difference in stata with if command

I want to calculate something like
by group: egen x if y==1 - x if y==2
Of course this is not real Stata code, but I'm kind of lost. In R this is simply done with "[]" behind the variable of interest, but I'm not sure about Stata.
R would be
x[y==1] - x[y==2]
I would use reshape.
clear
version 11.2
set seed 2001
* generate data
set obs 100
generate y = 1 + mod(_n - 1, 2)
generate x = rnormal()
generate group = 1 + floor((_n - 1) / 2)
list in 1/10
* reshape to wide and difference
reshape wide x, i(group) j(y)
generate x_diff = x1 - x2
list in 1/5
I would use reshape in R, also. Otherwise can you be sure that everything is properly ordered to give you the difference you want?
There is likely a neat Mata solution, but I know very little Mata. You may find preserve and restore helpful if you're averse to reshaping.
Richard Herron makes a good point that a reshape to a different structure might be worthwhile. Here I focus on how to do it with the existing structure.
Assuming that there are precisely two observations for each group of group, one with y == 1 and one with y == 2, then
bysort group (y) : gen diff = x[1] - x[2]
gives the difference between values of x, necessarily repeated for each of the two observations in a group. An assumption-free method is
bysort group: egen mean_1 = mean(x / (y == 1))
by group: egen mean_2 = mean(x / (y == 2))
gen diff = mean_1 - mean_2
Consider expressions such as x / (y == 1). Here the denominator y == 1 is 1 when y is indeed 1 and 0 otherwise. Division by 0 yields missing in Stata, but egen ignores those missings here. So the first of the three commands above yields the mean of x over observations for which y == 1, and the second the mean of x over observations for which y == 2. Other values of y (even missings) will be ignored. This method should agree with the first method when the first method is valid.
For a review of similar problems, see http://stata-journal.com/article.html?article=dm0055
In Stata the if referred to here is a qualifier (not a command).

Probability density function from a paper, implemented using C++, not working as intended

So I'm implementing a heuristic algorithm, and I've come across this function.
I have an array of 1 to n (0 to n-1 in C, whatever). I want to choose a number of elements I'll copy to another array. Given a parameter y (0 < y <= 1), I want to have a distribution of numbers whose average is y * n. That means that whenever I call this function, it gives me a number between 0 and n, and the average of these numbers is y*n.
According to the author, "l" is a random number: 0 < l < n. My test code currently generates 0 <= l <= n. I had the right code, but I've been messing with this for hours now and am too lazy to code it back.
So I coded the first part of the function, for y <= 0.5.
I set y to 0.2 and n to 100. That means it should return numbers between 0 and 99, with average 20.
But the results aren't between 0 and n; they are small floats, and the bigger n is, the smaller these floats are.
This is the C test code. "x" is the "l" parameter.
int n = 100;
float y = 0.2;
float n_copy;
for (int i = 0; i < 20; i++)
{
    float x = (float) (rand() / (float) RAND_MAX); // 0 <= x <= 1
    x = x * n;                                     // 0 <= x <= n
    float p1 = (1 - y) / (n * y);
    float p2 = (1 - (x / n));
    float exp = (1 - (2 * y)) / y;
    p2 = pow(p2, exp);
    n_copy = p1 * p2;
    printf("%.5f\n", n_copy);
}
And here are some results (truncated to 5 decimals):
0.03354
0.00484
0.00003
0.00029
0.00020
0.00028
0.00263
0.01619
0.00032
0.00000
0.03598
0.03975
0.00704
0.00176
0.00001
0.01333
0.03396
0.02795
0.00005
0.00860
The article is http://www.scribd.com/doc/3097936/cAS-The-Cunning-Ant-System, pages 6 and 7; or search "cAS: cunning ant system" on Google.
So what am I doing wrong? I don't believe the author is wrong, because there are more than 5 papers describing this same function.
All my internets to whoever helps me. This is important to my work.
Thanks :)
You may misunderstand what is expected of you.
Given a (properly normalized) PDF, and wanting to throw a random distribution consistent with it, you form the cumulative distribution function (CDF) by integrating the PDF, then invert the CDF and use uniform random deviates as the argument of the inverted function.
A little more detail.
f_s(l) is the PDF, and has been normalized on [0,n).
Now you integrate it to form the CDF
g_s(l') = \int_0^{l'} dl f_s(l)
Note that this is a definite integral with an unspecified endpoint, which I have called l'; the CDF is accordingly a function of l'. Assuming we have the normalization right, g_s(n) = 1.0. If this is not so, we apply a simple coefficient to fix it.
Next invert the CDF and call the result G^{-1}(x). For this you'll probably want to choose a particular value of gamma (the y in your code).
Then throw uniform random numbers on [0,1), and use those as the argument, x, of G^{-1}. The results should lie in [0,n), and should be distributed according to f_s.
Like Justin said, you can use a computer algebra system for the math.
dmckee is actually correct, but I thought that I would elaborate more and try to explain away some of the confusion here. I could definitely fail. f_s(l), the function you have in your pretty formula above, is the probability density function. It tells you, for a given input l between 0 and n, the probability density of l being the segment length. The sum (integral) over all values between 0 and n should be equal to 1.
The graph at the top of page 7 confuses this point. It plots l vs. f_s(l), but you have to watch out for the stray factors it puts on the side. You'll notice that the values on the bottom go from 0 to 1, but there is a factor of "x n" on the side, which means that the l values actually go from 0 to n. Also, on the y-axis there is an "x 1/n", which means these values don't actually go up to about 3; they go up to 3/n.
So what do you do now? Well, you need to find the cumulative distribution function by integrating the probability density function over l, which actually turns out to be not too bad (I did it with the Wolfram Mathematica Online Integrator, using x for l and using only the equation for y <= .5). That, however, was an indefinite integral, and you really want to integrate along x from 0 to l. If we set the resulting equation equal to some variable (z, for instance), the goal now is to solve for l as a function of z, where z is a random number between 0 and 1. You can try using a symbolic solver for this part if you like (I would). Then you will not only have achieved your goal of being able to pick random l's from this distribution, you will also have achieved nirvana.
A little more work done
I'll help a little bit more. I tried doing what I said above for y <= .5, but the symbolic algebra system I was using wasn't able to do the inversion (some other system might be able to). I then decided to try the equation for .5 < y <= 1, which turns out to be much easier. If I change l to x in f_s(l) I get
y / n / (1 - y) * (x / n)^((2 * y - 1) / (1 - y))
Integrating this over x from 0 to l I got (using Mathematica's Online Integrator):
(l / n)^(y / (1 - y))
It doesn't get much nicer than that with this sort of thing. If I set this equal to z and solve for l I get:
l = n * z^(1 / y - 1) for .5 < y <= 1
One quick check is y = 1: in this case we get l = n no matter what z is. So far so good. Now you just generate z (a random number between 0 and 1) and you get an l that is distributed as you desired for .5 < y <= 1. But wait: looking at the graph on page 7, you notice that the probability density function is symmetric. That means we can use the above result to find the value for 0 < y <= .5. We just change l -> n-l and y -> 1-y and get
n - l = n * z^(1 / (1 - y) - 1)
l = n * (1 - z^(1 / (1 - y) - 1)) for 0 < y <= .5
Anyway, that should solve your problem unless I made some error somewhere. Good luck.
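For reference, those two closed forms drop straight into C-style code. A minimal sketch, assuming the derivation above is right (sample_l is a made-up name; rand() stands in for whatever generator you use):
#include <cstdlib>
#include <cmath>

// Draw one l in [0, n] from f_s by feeding a uniform z into the inverted CDF.
double sample_l(double y, double n)
{
    double z = rand() / (double) RAND_MAX;            // uniform on [0, 1]
    if (y > 0.5)
        return n * pow(z, 1.0 / y - 1.0);             // .5 < y <= 1
    return n * (1.0 - pow(z, 1.0 / (1.0 - y) - 1.0)); // 0 < y <= .5
}
Averaging many draws with y = 0.2 and n = 100 should come out near y * n = 20, which is the behaviour the question wants.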
Given that, for any values of l, y, n as described, the terms you call p1 and p2 are both in [0,1) and exp is in [1,..), pow(p2, exp) is also in [0,1); thus I don't see how you'd ever get an output in the range [0,n).