Saving percentiles from centile command - stata

I am using the centile command in Stata and want to store the results in a local matrix or vector.
I know I can get a single centile with r(c_#). But I am calculating 50 centiles and want to store all of them in one local vector/matrix without writing each one out.
Update:
I have found the following solution, but I would still appreciate a more elegant answer.
centile var, centile(0(5)100)
forval i = 1/21 {
    local cut_`i' = r(c_`i')
}
display `cut_1' " " `cut_2'

Here is one simple technique. Note that Stata does not treat a matrix or vector as "local", although its name could be held in a local macro.
sysuse auto, clear
centile mpg, centile(0(5)100)
matrix results = J(21, 1, .)
forval i = 1/21 {
    matrix results[`i', 1] = r(c_`i')
}
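A variant of the loop above that avoids hard-coding 21, as a sketch only. It assumes centile leaves the number of requested centiles in r(n_cent) alongside r(c_#); the matrix name results and the local k are just illustrative.

```stata
sysuse auto, clear
centile mpg, centile(0(5)100)

* Copy the count of requested centiles before any other command runs
local k = r(n_cent)

* One row per centile, filled from r(c_1) ... r(c_k)
matrix results = J(`k', 1, .)
forvalues i = 1/`k' {
    matrix results[`i', 1] = r(c_`i')
}
matrix list results
```

Changing the numlist in centile() then needs no other edit, since the matrix dimension follows from the stored count.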

Related

Iterate over list produced with levelsof

The example below reproduces my problem. There is a string variable which takes several values. I want to create a global list and iterate over it in a loop. But it does not work. I've tried several versions without success. Here is the example code:
webuse auto, clear
levelsof make // list of car makes
global MAKE r(levels) // save levels in global list
foreach i in $MAKE { // loop some command over saved list
sum if make == "`$MAKE'" // ERROR 198, invalid 'Concord'
}
Using "`$MAKE'" or $MAKE does not yield the desired output.
Any idea what I am doing wrong?
Normally, for lists to work, they should be saved as in A B C D [...]. In my case, levelsof produces a list of the following kind:
di $MAKE
`"AMC Concord"' `"AMC Pacer"' `"AMC Spirit"' `"Audi 5000"' `"Audi Fox"' `"BMW 320i"' [...]
That is clearly not what is needed, but I am not sure how to get what I need.
Here is a solution. Note that I am using a local instead of a global. The difference is only scope. Only use global if you need to reference the value across do-files. You can remove the display lines below.
*Sysuse reads this data from disk, it comes with all Stata installations
sysuse auto, clear
*Use levelsof, and assign the returned r(levels) using a = to the local
levelsof make
local all_makes = r(levels)
*Loop over the local like this. Note that foreach creates a local, in this
*case called this_make that stores the elements in the local one per iteration
foreach this_make of local all_makes {
    display "`this_make'"
    sum if make == "`this_make'"
}
If global is what you need, then you simply change it to this:
*Sysuse reads this data from disk, it comes with all Stata installations
sysuse auto, clear
*Use levelsof, and assign the returned r(levels) using a = to the global
levelsof make
global all_makes = r(levels)
*Loop over the global like this. Note that foreach creates a local, in this
*case called this_make that stores the elements in the global one per iteration
foreach this_make of global all_makes {
    display "`this_make'"
    sum if make == "`this_make'"
}
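As a further sketch, levelsof can also put the list straight into a local macro through its local() option, which avoids copying r(levels) afterwards. The macro names all_makes and n are illustrative, and the count command merely stands in for whatever should run per make.

```stata
sysuse auto, clear

* Store the distinct values of make directly in the local all_makes
levelsof make, local(all_makes)

* Each compound-quoted value counts as one word, spaces included
local n : word count `all_makes'
display "number of makes: `n'"

foreach this_make of local all_makes {
    quietly count if make == "`this_make'"
    display "`this_make': " r(N)
}
```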
There is a fine accepted answer but plenty more can be said. See for example this FAQ.
I am positive about levelsof as its original author, but for the purpose specified, to loop over the levels of a variable, it can be a lot cleaner to use egen, group() and loop over the integer levels of that variable. See the FAQ just linked for more. The example in the original question is a case in point, as looping over distinct string values can be tricky with a need to use double quotes " " and to watch out for spaces and so forth.
The underlying problem is not revealed but an extra comment is to underline that by: and its sibling commands such as statsby or commands similar in spirit such as rangestat from SSC offer, in effect, looping without looping.
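A minimal sketch of the egen, group() approach mentioned above, assuming the auto data; the names make_id, n_groups and this_make are illustrative. The loop runs over integers, so no string quoting is needed in the if condition.

```stata
sysuse auto, clear

* Map each distinct make to an integer 1, 2, ...; the label option keeps
* the original strings as value labels of make_id
egen make_id = group(make), label

summarize make_id, meanonly
local n_groups = r(max)

forvalues g = 1/`n_groups' {
    * Recover the text for display only
    local this_make : label (make_id) `g'
    quietly summarize price if make_id == `g', meanonly
    display "`this_make': mean price = " r(mean)
}
```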

Stata - Cohort Study - Display Crude Risk Ratio (like r(rr_crude))

I'm starting to use Stata 14. I'm trying to do some basic risk ratio analysis, but I don't know how to extract single results. Given the following code:
clear all
webuse ugdp
cs case exposed [fw=pop], by(age)
This gives output with four risk ratios: one for each of the two age categories, a crude one, and an M-H one. With
dis r(rr)
I get the last (?) ratio, but is it possible to specify it? Like
dis r(rr_crude)
dis r(rr_mh)
or something like that? Or is it possible to save the output in a matrix and refer to entries by row and column indices?
I haven't found a solution in the documentation.
Edit:
Just create scalars, which persist in memory
clear all
webuse ugdp
cs case exposed [fw=pop], by(age)
scalar rr_mh = r(rr)
Then use glm:
glm case exposed [fw = pop], family(binomial) link(log)
scalar rr_crude = exp(_b[exposed])
or
cs case exposed [fw = pop]
scalar rr_crude = r(rr)
In either case:
di rr_crude
di rr_mh

Stata: estimating monthly weighted mean for portfolio

I have been struggling to write efficient code to estimate monthly weighted mean portfolio returns.
I have the following variables:
firm stock returns (ret)
month1, year1 and date
portfolio (port1): this defines portfolio of the firm stock returns
market capitalisation (mcap): to estimate weights (by month1 year1 port1)
I want to calculate weighted returns for each month and portfolio, weighted by the market cap (mcap) of each firm.
I have written the following code, which works without fail but takes ages and is highly inefficient:
foreach x in 11 12 13 21 22 23 {
    display `x'
    forvalues y = 1980/2010 {
        display `y'
        forvalues m = 1/12 {
            display `m'
            tempvar tmp_wt tmp_tm tmp_p
            egen `tmp_tm' = total(mcap) if month1==`m' & year1==`y' & port1 ==`x'
            gen `tmp_wt' = mcap/`tmp_tm' if month1==`m' & year1==`y' & port1 ==`x'
            gen `tmp_p' = ret*`tmp_wt' if month1==`m' & year1==`y' & port1 ==`x'
            gen port_ret_`m'_`y'_`x' = `tmp_p'
        }
    }
}
The data look as shown in the image: [Data for value weighted portfolio return]
This does appear to be a casebook example of how to do things as slowly as possible, except that naturally you are not doing that on purpose. All it lacks is a loop over observations to calculate totals. So, the good news is that you should indeed be able to speed this up.
It seems to boil down to
gen double wanted = .
bysort port1 year1 month1 : replace wanted = sum(mcap)
by port1 year1 month1 : replace wanted = (mcap * ret) / wanted[_N]
Principle. To get a sum in a single scalar, use summarize, meanonly rather than using egen, total() to put that scalar into a variable repeatedly, but use sum() with by: to get group sums into a variable when that is what you need, as here. sum() returns cumulative sums, so you want the last value of the cumulative sum.
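To illustrate the contrast in this principle, here is a sketch on the auto data rather than the portfolio data; the names total_weight, cum_weight and group_weight are illustrative.

```stata
sysuse auto, clear

* One total for the whole dataset: keep it in a scalar, not a variable
summarize weight, meanonly
scalar total_weight = r(sum)
display total_weight

* Group totals as a variable: sum() is cumulative within each by-group,
* so the last observation of the group holds the group total
bysort foreign: generate double cum_weight = sum(weight)
by foreign: generate double group_weight = cum_weight[_N]
```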
Principle. Loops (here using foreach) are not needed when a groupwise calculation can be done under the aegis of by:. That is a powerful construct which Stata programmers need to learn.
Principle. Creating lots of temporary variables, here 6 * 31 * 12 * 3 = 6696 of them, is going to slow things down and use more memory than is needed. Each time you execute tempvar and follow with generate commands, there are three more temporary variables, all the size of a column in a dataset (that's what a variable is in Stata), but once they are used they are just left in memory and never looked at again. It's a subtlety with temporary variables that a tempvar assigns a new name every time, but it should be clear that generate creates a new variable every time; generate will never overwrite an existing variable. The temporary variables would all be dropped at the end of a program, but by the end of that program, you are holding a lot of stuff unnecessarily, possibly the size of the dataset multiplied by about one thousand. If that temporarily expanded dataset could not all fit in memory, you flip Stata into a crawl.
Principle. Using if obliges Stata to check each observation in turn; in this case most are irrelevant to the particular intersection of loops being executed and you make Stata check almost all of the data set (a fraction of 2231/2232, almost 1) irrelevantly while doing each particular calculation for 1/2232 of the dataset. If you have more years, or more portfolios, the fraction looked at irrelevantly is even higher.
In essence, Stata will obey your instructions (and not try any kind of optimization -- your code is interpreted utterly literally) but by: would give the cross-combinations much more rapidly.
Note. I don't know how big or how close to zero these numbers will get, so I gave you a double. For all I know, a float would work fine for you.
Comment. I guess you are being influenced by coding experience in other languages where creating variables means something akin to x = 42 to hold a constant. You could do that in Stata too, with scalars or local or global macros, not to mention Mata. Remember that a new variable in Stata is an entire new column in the dataset, regardless of whether it is holding a constant or different values in each observation. You will get what you ask for, but it is more like getting an array every time. Again, it seems that you want as an end result just one new variable, and you do not in fact need to create any others temporarily at all.
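If a single value-weighted return per portfolio and month is wanted, rather than each firm's weighted contribution, the same by: idea extends as in this sketch. The four observations are made up purely for illustration, and the names wsum, msum and vw_ret are hypothetical.

```stata
clear
input port1 year1 month1 ret mcap
11 1980 1  0.10 100
11 1980 1  0.20 300
12 1980 1  0.05 200
12 1980 1 -0.05 200
end

* Running sums within each portfolio-month; the last value is the total
bysort port1 year1 month1: generate double wsum = sum(mcap * ret)
by port1 year1 month1: generate double msum = sum(mcap)

* Value-weighted return, identical for every firm in the group
by port1 year1 month1: generate double vw_ret = wsum[_N] / msum[_N]
list port1 year1 month1 ret mcap vw_ret
```

No loop and no if qualifier are involved, so each observation is visited a fixed number of times regardless of how many portfolios or months there are.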

Create a set of continuous variables from a factor variable and a continuous variable

In Stata, I have a factor variable with 50 levels (state) and an integer-valued variable (year). I want to create 50 new variables: 50 interactions of state indicators with the year variable. Is there a way to do this without writing 50 lines of code?
I can produce the 50 state dummies with tabulate state, generate(state), but I don't know how to get further than that without writing a line to create each individual state-year variable.
I want to use the new state-year variables in a regression. Stata's factor notation makes it easy to include the state-year variables as regressors without creating them beforehand (e.g., with a command like regress y i.state#c.year), but some add-on functions don't support factor notation.
You can try using xi, both as a stand-alone command to create indicator and interaction terms, and as a command prefix. A nonsensical example:
clear all
set more off
sysuse auto
* stand-alone
xi i.rep78*mpg
* as prefix
xi: regress price i.rep78*mpg
Run help xi for all the details.
Edit
To make this a bit clearer, suppose the regress command did not admit the use of either factor variable notation or the xi: prefix. Then using the xi stand-alone syntax you could create the indicator and interaction terms (which answers your original question) and then use those terms with the regress command:
sysuse auto, clear
xi i.rep78*mpg
regress price mpg _Irep78* _IrepXmpg*
(Remember to use Stata's help capabilities. Running search interactions, for example, leads you to the entry "xi ... Interaction expansion".)
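For comparison, the interaction variables can also be built by hand with a short loop. This is only a sketch, with rep78 standing in for state and mpg for year; the variable names rep78_#_X_mpg are made up here and are not what xi produces.

```stata
sysuse auto, clear

* Distinct non-missing levels of the categorical variable
levelsof rep78, local(levels)

foreach l of local levels {
    * Indicator times the continuous variable; left missing when rep78 is missing
    generate rep78_`l'_X_mpg = (rep78 == `l') * mpg if !missing(rep78)
}
describe rep78_*_X_mpg
```

Unlike xi, this creates a variable for every level, so one of them would need to be left out of a regression that also includes the main effect.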

Correlation with one variable and a lot of others

In Stata, is there a quick way to show the correlation between a variable and a bunch of dummies? In my data I have an independent variable, goals_scored in a game, and a bunch of dummies for the stadium played in. How can I show the correlation between goals_scored and i.stadium in one table, without getting the correlations between stadiums, which I do not care about?
Here's one way:
#delimit ;
quietly tab stadium, gen(D); // create dummies
foreach var of varlist D* {;
    quietly corr goals_scored `var';
    di as text "`: variable label `var'': " as result r(rho);
};
drop D*; // get rid of dummies
#delimit cr
cpcorr from SSC (install with ssc inst cpcorr) supports minimal correlation tables, i.e. only the correlations between one set and another set, without the others. But it's an old program (2001) and doesn't support factor variables directly. The indicator variables (a.k.a. dummy variables) would have to exist first.
If you store all of the stadium variables in a local, you would probably loop through them to pull the correlations.
1. If all stadium variables are placed next to each other in the dataset:
foreach s of varlist stadium1-stadium150 {
    // do whatever
}
2a. If the stadium variables are not next to each other, use order to get there.
2b. If the variable names follow a pattern, there might be another workaround.
3. I would not use correlation for this. Depending on the distribution of goals, I would consider something else.