Calculating cumulative multivariate normal distribution - stata

I have 1000 observations and 3 variables in Stata, associated with 1000 people. Let's say the data look something like this (I just made up the numbers):
Observation    B1    B2    B3
1              -3     5     3
2               2    -3     2
3               6    -2     5
4               5     3     3
...            ...   ...   ...
1000           ...   ...   ...
The data have the following correlation matrix (again, made-up numbers):
R = (1,   0.5, 0.5
     0.5, 1,   0.5
     0.5, 0.5, 1)
I want to calculate the CDF of the multivariate normal distribution of variables B1, B2 and B3 for each of the 1000 persons, using the same correlation matrix every time. Basically, it is similar to Example 3 in this document: https://www.stata.com/manuals/m-5mvnormal.pdf, but with 3 variables, and with multiple limits and a single correlation matrix rather than multiple limits and multiple correlation matrices. So I will have 1000 CDF values for 1000 people. I have tried mvnormal(U,R). Specifically, I wrote:
mkmat B1 B2 B3, matrix(U)
matrix define R = (1, 0.5, 0.5 \
0.5, 1, 0.5 \
0.5, 0.5, 1)
gen CDF = mvnormal(U,R)
But this doesn't work; apparently this function is not available in Stata itself. I believe Stata has binormal() for calculating the CDF of the bivariate normal, but can it compute the CDF of more than 2 variables?
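Note that the linked document is a Mata manual entry, which suggests mvnormal() is a Mata function and cannot be called from a Stata expression like gen; that would explain why the code above fails. Below is a minimal sketch of doing the computation in Mata instead, assuming a Stata version whose Mata provides mvnormal() and using the variable names from the question; with 3 variables, the correlations are passed as the row vector of the lower triangle, here (0.5, 0.5, 0.5).
mata:
U = st_data(., ("B1", "B2", "B3"))            // 1000 x 3 matrix of upper limits
R = (0.5, 0.5, 0.5)                           // lower triangle of the correlation matrix
cdf = mvnormal(U, R)                          // one CDF value per observation
st_store(., st_addvar("double", "CDF"), cdf)  // save as a new Stata variable
end
This is a sketch rather than tested code; the manual entry linked above documents the exact layout mvnormal() expects for U and R.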


Random lottery numbers from macro to variable

I have the following lottery numbers in a macro:
global lottery 6 9 4 32 98
How can I simulate a variable with 50 observations, where each observation is randomly obtained from the numbers stored in the macro?
The code below produces an error:
set obs 50
global lottery 6 9 4 32 98
g lot=$lottery
invalid '9'
r(198);
Here are two similar methods:
clear
set obs 50
set seed 2803
local lottery 6 9 4 32 98
* method 1
generate x1 = ceil(5 * runiform())
tabulate x1
generate y1 = .
forvalues j = 1/5 {
    replace y1 = real(word("`lottery'", `j')) if x1 == `j'
}
* method 2
set seed 2803
generate x2 = runiform()
generate y2 = cond(x2 <= 0.2, 6, cond(x2 <= 0.4, 9, cond(x2 <= 0.6, 4, cond(x2 <= 0.8, 32, 98))))
tab1 y?
I am assuming that you want equal probabilities, which is not explicit. Setting the seed is crucial for reproducibility. As in @Pearly Spencer's answer, using a local macro is widely considered (much) better style.
To spell it out: in this answer the probabilities are equal, but the frequencies will fluctuate from sample to sample. In @Pearly's answer the frequencies are guaranteed equal, given that 50 is a multiple of 5, and randomness shows up only in the order in which the numbers arrive. This answer is like tossing a five-faced die 50 times; @Pearly's answer is like dealing a deck of 50 cards containing 5 distinct types.
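For comparison, here is a minimal sketch of the deck-of-cards approach (the details of @Pearly Spencer's actual answer may differ): make exactly 10 copies of each number, then shuffle the order.
clear
set obs 50
set seed 2803
local lottery 6 9 4 32 98
generate y3 = real(word("`lottery'", ceil(_n/10)))  // observations 1-10 get word 1, 11-20 get word 2, ...
generate u = runiform()
sort u                                              // randomize the order; each frequency stays exactly 10
drop u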

pulp shadow price difference with gurobi

I am comparing the shadow prices (pi) calculated with Gurobi and with PuLP. I get different values for the same input, and I am not sure how to do it correctly with PuLP. Here is the LP file that I use:
Minimize
x[0] + x[1] + x[2] + x[3]
Subject To
C[0]: 7 x[0] >= 211
C[1]: 3 x[1] >= 395
C[2]: 2 x[2] >= 610
C[3]: 2 x[3] >= 97
Bounds
End
For the above LP file, Gurobi gives me these shadow prices:
[0.14285714285714285, 0.3333333333333333, 0.5, 0.5]
and with PuLP I get:
[0.14285714, 0.33333333, 0.5, 0.5]
But if I execute the following LP model:
Minimize
x[0] + x[1] + x[2] + x[3] + x[4]
Subject To
C[0]: 7 x[0] + 2 x[4] >= 211
C[1]: 3 x[1] >= 395
C[2]: 2 x[2] + 2 x[4] >= 610
C[3]: 2 x[3] >= 97
Bounds
End
With Gurobi I get:
[0.0, 0.3333333333333333, 0.5, 0.5]
and with PuLP I get:
[0.14285714, 0.33333333, 0.5, 0.5]
The correct values are the ones that Gurobi returns (I think?).
Why do I get the same shadow prices with PuLP for different models? How can I get the same results as Gurobi?
(I did not supply the source code because the question would be too long; I think the LP models are enough.)
In the second example, there are two optimal dual solutions: the one PuLP gives you and the one you get by calling Gurobi directly. An optimal primal solution is [0.0, 131.67, 199.5, 48.5, 105.5], which makes the slacks of all the constraints 0. For C[0], if you reduce the right-hand side you get no reduction in the objective, but if you increase it, the cheapest way to keep the constraint feasible is to increase x[0]. Gurobi only guarantees that it will produce some optimal primal and dual solution; the specific optimal solution you get is arbitrary.
The first example is just a precision issue.
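For reference, here is a minimal sketch of how the second model might be built in PuLP, with the duals read from each constraint's pi attribute after solving (the variable and constraint names are my own, and the default CBC solver is assumed):
import pulp

prob = pulp.LpProblem("example", pulp.LpMinimize)
x = [pulp.LpVariable("x%d" % i, lowBound=0) for i in range(5)]
prob += pulp.lpSum(x)                       # objective: x0 + x1 + x2 + x3 + x4
prob += 7 * x[0] + 2 * x[4] >= 211, "C0"
prob += 3 * x[1] >= 395, "C1"
prob += 2 * x[2] + 2 * x[4] >= 610, "C2"
prob += 2 * x[3] >= 97, "C3"
prob.solve()
print([prob.constraints["C%d" % i].pi for i in range(4)])  # one optimal dual solution among several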

Sqlite (C API) and query (select) on cyclic/symmetric values with user defined functions

I'm using SQLite with C++ and have two similar problems:
1) I need to select 4 entries to make an interpolation.
For example, my table could look like this:
angle (double) | color (double)
0              | 0.1
30             | 0.5
60             | 0.9
90             | 1.5
...            | ...
300            | 2.9
330            | 3.5
If I want to interpolate the value corresponding to 95°, I will use the entries 60°, 90°, 120° and 150°.
To get those entries, my query will be SELECT color FROM mytable WHERE angle BETWEEN 60 AND 150, no big deal.
Now if I want 335°, I will need 300°, 330°, 360° (= 0°) and 390° (= 30°).
My query will then be SELECT color FROM mytable WHERE angle BETWEEN 300 AND 330 OR angle BETWEEN 0 AND 30.
I can't use SELECT color FROM mytable WHERE angle BETWEEN 300 AND 390 because this will only return 2 colors.
Can I use the C API and user-defined functions to include some kind of modulo meaning in my queries?
It would be nice if I could use a user-defined function so that the query [...] BETWEEN 300 and 390 returns the rows 300, 330, 0 and 30.
2) Another table looks like this:
speed (double) | color (double) | var (double)
0              | 0.1            | 0
10             | 0.5            | 1
20             | 0.9            | 2
30             | 1.5            | 3
...            | ...            | ...
In reality, due to symmetry, color(speed) = color(-speed), but var(-speed) = myfunc(var(speed)).
I would like to make queries such as SELECT * FROM mytable WHERE speed BETWEEN -20 AND 10 and be able to perform a few operations with the API on the "virtual" rows with a negative speed, returning them as part of a regular result.
For example, I would like the result of the query SELECT * FROM mytable WHERE speed BETWEEN -20 AND 10 to look like this:
speed (double) | color (double) | var (double)
-20            | 0.9            | myfunc(2)
-10            | 0.5            | myfunc(1)
0              | 0.1            | 0
10             | 0.5            | 1
Is that possible?
Thanks for your help :)
I would suggest using a query with two intervals:
SELECT * FROM mytable WHERE (angle >= MIN(?1,?2) AND angle <= MAX(?1,?2)) OR ((MAX(?1,?2) > 360) AND (angle >= 0 AND angle <= MAX(?1,?2) % 360));
This example works fine if ?1 and ?2 are positive.
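If you do want a user-defined function, here is a minimal sketch with the C API; mod360 is a made-up name for a scalar function that normalizes an angle into [0, 360):
#include <math.h>
#include <sqlite3.h>

/* normalize an angle into [0, 360) */
static void mod360(sqlite3_context *ctx, int argc, sqlite3_value **argv) {
    double a = fmod(sqlite3_value_double(argv[0]), 360.0);
    if (a < 0) a += 360.0;
    (void)argc;
    sqlite3_result_double(ctx, a);
}

/* after sqlite3_open(..., &db): */
sqlite3_create_function(db, "mod360", 1, SQLITE_UTF8, NULL, mod360, NULL, NULL);
With that registered, a query such as SELECT color FROM mytable WHERE mod360(angle - ?1) <= mod360(?2 - ?1) selects the arc from ?1 to ?2 even when it wraps past 360: for ?1 = 300 and ?2 = 390 it matches the rows 300, 330, 0 and 30 (assuming ?1 <= ?2 < ?1 + 360).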

Small python quirk: Finding the index of first and last number within boundaries in a list

I have encountered a small annoying problem. My problem is this:
I have a series of number between 0 and 1:
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
and two boundaries, say 0.25 and 0.75.
I need a quick and pretty way to find the index of the first number and last number in the series, that are within the boundaries, in this case (2, 6)
So far I have only come up with a clumsy way using for loops and the break statement.
Thanks in advance for any help!
If your series of numbers is always sorted, you can use the bisect module to perform binary search for the endpoints:
>>> a = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
>>> import bisect
>>> bisect.bisect_left(a, 0.25)
2
>>> bisect.bisect_right(a, 0.75) - 1
6
bisect_left(a, x) returns the position p such that every element of a[:p] is less than x, and every element of a[p:] is greater than or equal to x; this is exactly what you want for the lower bound.
bisect_right returns the position p such that every element of a[:p] is less than or equal to x, and a[p:] are all greater than x. So for the right bound, you need to subtract one to get the largest position <= x.
if you can use numpy:
import numpy as np
data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
max_b = .75
min_b = .25
wh = np.where((data < max_b)*(data > min_b))[0]
left, right = wh[0], wh[-1] + 1
or simply (thanks to dougal):
left, right = np.searchsorted(data, [min_b, max_b])
if you can't:
import bisect
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
max_b = .75
min_b = .25
left = bisect.bisect_left(data, min_b)
right = bisect.bisect_right(data, max_b)
plus or minus 1 on the right, depending on whether you want data[right] to be in the set or data[left:right] to give you the set.
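If the series is not guaranteed to be sorted, neither bisect nor searchsorted applies; here is a simple sketch for the general case:
def bounds_indices(seq, lo, hi):
    # indices of all elements that fall within [lo, hi]
    idx = [i for i, v in enumerate(seq) if lo <= v <= hi]
    return (idx[0], idx[-1]) if idx else None

print(bounds_indices([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 0.25, 0.75))  # (2, 6)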

Making a list of evenly spaced numbers in a certain range in python

What is a pythonic way of making a list of arbitrary length containing evenly spaced numbers (not just whole integers) between given bounds? For instance:
my_func(0,5,10) # ( lower_bound , upper_bound , length )
# [ 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5 ]
Note that the range() function only deals with integers. And this:
def my_func(low, up, leng):
    lst = []
    step = (up - low) / float(leng)
    for i in range(leng):
        lst.append(low)
        low = low + step
    return lst
seems too complicated. Any ideas?
Given numpy, you could use linspace:
Including the right endpoint (5):
In [46]: import numpy as np
In [47]: np.linspace(0,5,10)
Out[47]:
array([ 0. , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
2.77777778, 3.33333333, 3.88888889, 4.44444444, 5. ])
Excluding the right endpoint:
In [48]: np.linspace(0,5,10,endpoint=False)
Out[48]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
You can use the following approach:
[lower + x*(upper-lower)/length for x in range(length)]
lower and/or upper must be assigned as floats for this approach to work.
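For example, with the numbers from the question:
lower, upper, length = 0.0, 5.0, 10
print([lower + x * (upper - lower) / length for x in range(length)])
# [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]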
Similar to unutbu's answer, you can use numpy's arange function, which is analogous to Python's built-in function range. Notice that the end point is not included, as with range:
>>> import numpy as np
>>> a = np.arange(0,5, 0.5)
>>> a
array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
>>> a.tolist() # if you prefer it as a list
[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
f = 0.5
a = 0
b = 10
d = [x * f for x in range(a, b)]
would be a way to do it.
Numpy's r_ convenience function can also create evenly spaced lists with the syntax np.r_[start:stop:steps]. If steps is an imaginary number (ending in j), it is interpreted as the number of points and the end point is included, equivalent to np.linspace(start, stop, steps, endpoint=True); otherwise it is a step size and the end point is excluded.
>>> np.r_[-1:1:6j]
array([-1. , -0.6, -0.2,  0.2,  0.6,  1. ])
You can also directly concatenate other arrays and scalars:
>>> np.r_[-1:1:6j, [0]*3, 5, 6]
array([-1. , -0.6, -0.2, 0.2, 0.6, 1. , 0. , 0. , 0. , 5. , 6. ])
You can use the following code:
def float_range(initVal, itemCount, step):
    for x in range(itemCount):
        yield initVal
        initVal += step

[x for x in float_range(1, 3, 0.1)]
Similar to Howard's answer but a bit more efficient:
def my_func(low, up, leng):
    step = (up - low) * 1.0 / leng
    return [low + i * step for i in range(leng)]