I want to find extreme values produced by a macro as a function of its input parameters.
Here is some example data:
data Input_data;
input X Y;
cards;
10 15
20 18
30 27
33 41
;
run;
The following is not the actual formula (the minimum of this one is easily found analytically); I want a computational method.
For the sake of example, I have a semi-empirical rule which takes three parameters:
%macro Function(const1, const2, const3);
data output;
set Input_data;
/* each term switches on only above its threshold */
X_eff1=((X > &const1.)*(X - &const1.)**2);
X_eff2=((X > &const2.)*(X - &const2.)**2);
X_eff3=((X > &const3.)*(X - &const3.)**2);
Residual= Y - (1.3*X_eff1 - 2.7*X_eff2 + 3.1*X_eff3);
run;
%mend;
I want to find const1, const2, const3 which produce the minimum value of the variable 'Residual'. Can SAS do this?
A few options to consider:
a) Find the global minimum.
b) Find a local minimum within boundary conditions, for example: const1 in [0, 1], const2 in [10, 17], const3 in [20, 22].
I could generate a huge table and brute-force feed it to the function. However, this is not feasible as the number of input parameters rises.
I could program something from scratch?
I could do this in NumPy (fmin from scipy.optimize comes to mind) or R if it proves hard for SAS.
Any thoughts on how to solve it?
I have variables:
set obs 1000
g X= rnormal(0,1)
egen t=fill(1 2)
I need to generate a new variable consisting of one value: the first value of X. I tried:
separate X, by(_n <= 1)
and
gen X1 = X if t<=1
But these options give me a vector 1000x1 with the first value (the value I need) and 999 empty cells. How can I simply generate a one-value variable (1x1)?
You have to write two lines of code, my friend:
gen X1 = X if t<=1
replace X1 = X1[_n-1] if missing(X1)
and
local my_parameter=X1[1]
and then you can happily use your `my_parameter' macro in your ARMA regressions:
. di `my_parameter'
-.44087866
Remember: to use a macro (more usually called a parameter in other languages) in a regression in Stata, you need to wrap its name in `'.
I don't disagree with the other two helpful answers already posted, but when I read "How can I simply generate a one-value variable (1x1)?", I can't help but think you are looking for a scalar or a macro.
If that is true, then you might be better off with
sum X in 1
di r(mean)
From here, storing this value to use later is trivial:
sca MyVar = r(mean)
From help summarize, you will see that sum stores the mean, min and max among many other useful measures.
To check for yourself, run return list after the call to sum to see what is returned.
By using in 1 you are restricting the summarize command to only run for the first observation. Naturally then, many of the scalars returned by summarize will equal the value you desire.
If you wish, you may also precede sum with quietly to suppress the output, or add the meanonly option to calculate only the mean along with suppressing the display.
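For instance, a minimal sketch combining both suggestions (using the X from the question):
quietly summarize X in 1, meanonly
scalar MyVar = r(mean)    // r(mean) here is simply X in observation 1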
Perhaps this will point you in a helpful direction
generate X1 = X[1]
The point is that X[1] is the value of X in the first observation. Now having said that, what do you want to do with that value? Your dataset has 1000 observations. Do you want a local or global macro? A scalar? If you intend to use it in a formula applied to all 1000 observations, then perhaps a variable with the same value for every observation will be sufficient.
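For illustration, a quick sketch of each container just mentioned (the name first_x is my own):
gen X1 = X[1]            // a variable: X[1] repeated in all 1000 observations
scalar first_x = X[1]    // a scalar: one value, held outside the dataset
local first_x = X[1]     // a local macro: vanishes when the do-file ends
global first_x = X[1]    // a global macro: persists for the session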
(Stata/MP 13.1)
I am working with a set of massive datasets that take an extremely long time to load. I am currently looping through all the datasets to load each one every time.
Is it possible to tell Stata to load just the first 5 observations of each dataset (or, in general, the first n observations in each use command) without actually having to load the entire dataset? Otherwise, if I load the entire dataset and then keep only the first 5 observations, the process takes an extremely long time.
Here are two workarounds I have already tried:
use in 1/5 using mydata: I think this is more efficient than loading the data and then keeping the observations you want in a separate line, but I think it still reads in the entire dataset.
First load all the datasets, then save copies of each containing only the first 5 observations, and then use the copies: this is cumbersome as I have a lot of different files; I would much prefer a direct way to read in the first 5 observations without resorting to this method and without reading the entire dataset.
I'd say using in is the natural way to do this in Stata, but testing shows you are correct: it really makes no "big" difference, given the size of the dataset. An example (with 148,000,000 observations):
sysuse auto, clear
expand 2000000
tempfile bigfile
save "`bigfile'", replace
clear
timer on 1
use "`bigfile'"
timer off 1
clear
timer on 2
use "`bigfile'" in 1/5
timer off 2
timer list
timer clear
Resulting in
. timer list
1: 6.44 / 1 = 6.4400
2: 4.85 / 1 = 4.8480
I find that surprising, since in seems really efficient in other contexts.
I would contact Stata tech support (and/or search around, including www.statalist.com), if only to ask why in isn't much faster
(independently of you finding some other strategy to handle this problem).
It's worth using, of course, but it is not fast enough for many applications.
In terms of workflow, your second option might be the best. Leave the computer running while the smaller datasets are created (use a for loop, as sketched below), and get back to your regular coding/debugging once that's finished. This really depends on what you're doing, so it may or may not work.
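A minimal sketch of such a loop, reusing the in 1/5 form discussed here (file names are placeholders):
local files dataset1 dataset2 dataset3    // substitute your actual .dta names
foreach f of local files {
    use in 1/5 using `f', clear
    save `f'_small, replace
}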
Actually, I found the solution. If you run
use mybigdata if runiform() <= 0.0001
Stata will take a random sample of about 0.01% of the data set without reading the entire data set.
Thanks!
Vincent
Edit: 4/28/2015 (1:58 PM EST)
My apologies. It turns out the above was actually not a solution to the original question. It seems that on my system there was high variability in the speed of
use mybigdata if runiform() <= 0.0001
each time I ran it. When I posted the above as a solution, the run I timed just happened to be a faster instance. However, as I now repeatedly run
use mybigdata if runiform() <= 0.0001
vs.
use in 1/5 using mydata
I am actually finding that
use in 1/5 using mydata
is on average faster.
In general, my question is simply how to read in a portion of a Stata data set without having to read in the entire data set for computational purposes especially when the data set is really large.
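For what it's worth, use also accepts a varlist, so both restrictions can be combined in one command; a sketch with placeholder variable names, since I have not verified how much this saves:
use var1-var8 in 1/5 using mybigdata, clear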
Edit: 4/28/2015 (2:50 PM EST)
In total, I have 20 datasets, each with between 5 and 15 million observations. I only need to keep 8 of the variables (there are 58-65 variables in each dataset). Below is the output from the first four "describe, short" statements.
2004 action1
Contains data from 2004action1.dta
obs: 15,039,576
vars: 64 30 Oct 2014 17:09
size: 2,827,440,288
Sorted by:
2004 action2578
Contains data from 2004action2578.dta
obs: 13,449,087
vars: 59 30 Oct 2014 17:16
size: 2,098,057,572
Sorted by:
2005 action1
Contains data from 2005action1.dta
obs: 15,638,296
vars: 65 30 Oct 2014 16:47
size: 3,143,297,496
Sorted by:
2005 action2578
Contains data from 2005action2578.dta
obs: 14,951,428
vars: 59 30 Oct 2014 17:03
size: 2,362,325,624
Sorted by:
Thanks!
Vincent
I have been struggling to write efficient code to estimate the monthly weighted mean of portfolio returns.
I have the following variables:
firm stock returns (ret)
month1, year1 and date
portfolio (port1): this defines the portfolio of the firm stock returns
market capitalisation (mcap): to estimate weights (by month1 year1 port1)
I want to calculate weighted returns for each month and portfolio, weighted by the market cap (mcap) of each firm.
I have written the following code, which works without fail but takes ages and is highly inefficient:
foreach x in 11 12 13 21 22 23 {
    display `x'
    forvalues y = 1980/2010 {
        display `y'
        forvalues m = 1/12 {
            display `m'
            tempvar tmp_wt tmp_tm tmp_p
            egen `tmp_tm' = total(mcap) if month1==`m' & year1==`y' & port1==`x'
            gen `tmp_wt' = mcap/`tmp_tm' if month1==`m' & year1==`y' & port1==`x'
            gen `tmp_p' = ret*`tmp_wt' if month1==`m' & year1==`y' & port1==`x'
            gen port_ret_`m'_`y'_`x' = `tmp_p'
        }
    }
}
(Image: sample data for the value-weighted portfolio return calculation.)
This does appear to be a casebook example of how to do things as slowly as possible, except that naturally you are not doing that on purpose. All it lacks is a loop over observations to calculate totals. So, the good news is that you should indeed be able to speed this up.
It seems to boil down to
gen double wanted = .
* cumulative sum of mcap within each portfolio-month group
bysort port1 year1 month1 : replace wanted = sum(mcap)
* wanted[_N] is the group total of mcap; each observation gets its weighted contribution
by port1 year1 month1 : replace wanted = (mcap * ret) / wanted[_N]
Principle. To get a sum in a single scalar, use summarize, meanonly rather than using egen, total() to put that scalar into a variable repeatedly, but use sum() with by: to get group sums into a variable when that is what you need, as here. sum() returns cumulative sums, so you want the last value of the cumulative sum.
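A minimal illustration of the two idioms, using the variable names from the question (the variable tot is mine):
summarize mcap, meanonly    // one grand total in r(sum), no new variable
display r(sum)
bysort port1 year1 month1 : gen double tot = sum(mcap)    // running sum within group
by port1 year1 month1 : replace tot = tot[_N]             // spread the group total to every observation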
Principle. Loops (here using foreach) are not needed when a groupwise calculation can be done under the aegis of by:. That is a powerful construct which Stata programmers need to learn.
Principle. Creating lots of temporary variables, here 6 * 31 * 12 * 3 = 6696 of them, is going to slow things down and use more memory than is needed. Each time you execute tempvar and follow with generate commands, there are three more temporary variables, all the size of a column in a dataset (that's what a variable is in Stata), but once they are used they are just left in memory and never looked at again. It's a subtlety with temporary variables that a tempvar assigns a new name every time, but it should be clear that generate creates a new variable every time; generate will never overwrite an existing variable. The temporary variables would all be dropped at the end of a program, but by the end of that program, you are holding a lot of stuff unnecessarily, possibly the size of the dataset multiplied by about one thousand. If that temporarily expanded dataset cannot all fit in memory, you will slow Stata to a crawl.
Principle. Using if obliges Stata to check each observation in turn; in this case most are irrelevant to the particular intersection of loops being executed and you make Stata check almost all of the data set (a fraction of 2231/2232, almost 1) irrelevantly while doing each particular calculation for 1/2232 of the dataset. If you have more years, or more portfolios, the fraction looked at irrelevantly is even higher.
In essence, Stata will obey your instructions (and not try any kind of optimization -- your code is interpreted utterly literally) but by: would give the cross-combinations much more rapidly.
Note. I don't know how big or how close to zero these numbers will get, so I gave you a double. For all I know, a float would work fine for you.
Comment. I guess you are being influenced by coding experience in other languages where creating variables means something akin to x = 42 to hold a constant. You could do that in Stata too, with scalars or local or global macros, not to mention Mata. Remember that a new variable in Stata is an entire new column in the dataset, regardless of whether it is holding a constant or different values in each observation. You will get what you ask for, but it is more like getting an array every time. Again, it seems that you want as an end result just one new variable, and you do not in fact need to create any others temporarily at all.
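To make that concrete, a small sketch of the alternatives (names are mine):
scalar c = 42       // one number, held outside the dataset
local c = 42        // a macro: the text 42, substituted wherever `c' appears
gen answer = 42     // an entire new column: 42 in every observation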
I ran 500 simulations in Stata, i.e. I draw 500 samples, and each sample contains 10 observations. I want to generate a mean for each sample and combine all the 500 means into one variable, because I need to plot a histogram of the means. Currently I have 500 samples, named X1, X2, ... X500, where each X has 10 elements in it. I want to get a mean for each X and plot a histogram of the means. Can someone please show me how to do that? I tried to generate a new variable for the mean, i.e. X1mean = mean(X1), but this wouldn't work, because all 10 empty elements would be filled with the mean.
"Please tell me the code" questions are widely considered off-topic here. See https://stackoverflow.com/help/on-topic : "Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results."
There are various ways to do this. One is to collapse and then xpose or reshape long. In fact, you could have produced a combined sample of 500 x 10 in the first place.
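A sketch of the collapse-then-xpose route, assuming X1-X500 are contiguous in the dataset:
preserve
collapse (mean) X1-X500    // one observation holding the 500 means
xpose, clear               // transpose to 500 observations of one variable, v1
rename v1 mean
histogram mean
restore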
Another is to loop over variables like this
set obs 500
gen mean = .
quietly forval j = 1/500 {
    su X`j', meanonly
    replace mean = r(mean) in `j'
}
histogram mean
What you are presumably alluding to is code such as
egen X1mean = mean(X1)
That would be no use, but not for the reason you mention, as identical values can always be ignored: it would be no use because similar code would just produce 500 more variables. Note that mean() would not work with generate as mean() is an egen function.
The terminology you seek is observations, not elements.
Background: When we test the significance of a categorical variable that has been coded as dummy variables, we need to simultaneously test all dummy variables are 0. For example, if X takes on values of 0, 1, 2, 3 and 4, I would fit dummy variables for levels 1-4 (assuming I want 0 to be baseline), then want to simultaneously test B1=B2=B3=B4=0.
If this is the only variable in my data set, I can use the overall F-statistic to achieve this. However, if I have other covariates, the overall F-test doesn't work.
In Stata, for example, this is (very, very) simply carried out by the testparm command as:
testparm i.x (after fitting the desired regression model), where the i. prefix tells Stata X is a categorical data to be treated as dummy variables.
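For instance (with a hypothetical outcome y and covariate z):
regress y i.x z
testparm i.x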
Question/issue: I'm wondering how I can do this in SAS with a CONTRAST (or ESTIMATE?) statement while fitting a regression model with PROC GLM. Since I have scoured the internet and haven't found what I'm looking for, I'm guessing I'm missing something very obvious. However, all of the examples I've seen are NOT for categorical (class) variables, but rather two separate (say continuous) variables. The contrast statement in that case would simply be something like
CONTRAST 'Contrast1' y 1 z 1;
Otherwise, they're for calculating hypotheses like H_0: B1-B2=0.
I feel like I need to break down the hypothesis into smaller pieces and determine the set that defines the whole relationship, but I'm not doing it correctly. For example, for B1=B2=B3=B4=0, I thought I might say B1=B2=B3=-B4, then define (1) B1=-B4, (2) B2=-B4 and (3) B2=B3. I was trying to code this as a CONTRAST statement as (say X is in descending order in the data set: 4-0):
CONTRAST 'Contrast' x -1 0 0 1 0
x -1 0 1 0 0
x 0 1 1 0 0;
I know this is not correct, and I tried many, many variations and whatever random logic I could come up with. My problem is that I have relatively novice-level knowledge of CONTRAST (and unfortunately have not found great documentation to help with this), and also of how this hypothesis test should really be formulated for the sake of estimation (do I try to split it up into pieces as I did above, or...?).
From my note above, you actually can get SAS to do this for you with PROC GENMOD and the CLASS statement and a TYPE3 specification.
proc genmod data=input;
class classvar ;
model slope= classvar othervar/ type3;
run;
quit;
In the example above, my class levels are in the classvar variable. The othervar is my other covariate.
At the end of the output, you will see a table labeled LR Statistics For Type 3 Analysis. The row for classvar is the LR test that all the class effects are 0.
Another option: PROC REG with a TEST statement works (e.g., TEST x1=0, x2=0, x3=0, x4=0;). This doesn't answer my initial question for PROC GLM, but it is an option if PROC REG gets the job done for your type of model.