This may be a trivial question, but as an R user coming to Stata I have so far failed to find the correct Google terms to find the answer. I want to do the following steps:
Do a bunch of tests (e.g. lrtest results in a foreach loop)
Extract the p-value from each test and save them in a list of some kind
Have a list I can do further operations on (e.g. perform multiple comparison correction)
So I am wondering how to extract p-values (or similar) from command results and how to save them into a vector-like object that I can work with. Here is some R code that does something similar:
myData <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10)) ## generate some data
pValue <- c()
for (variableName in c("b", "c")) {
  myModel <- lm(as.formula(paste("a ~", variableName)), data=myData) ## fit model
  pValue <- c(pValue, coef(summary(myModel))[2, "Pr(>|t|)"]) ## extract p-value and save in vector
}
pValue * 2 ## do amazing multiple comparison correction
To me it seems like Stata has much less of a 'programming' mindset to it than R. If you have any general Stata literature recommendations for an R user who can program, that would also be appreciated.
Here is an approach that saves the p-values in a matrix, which you can then manipulate using Mata or Stata's standard matrix commands.
matrix storeMyP = J(2, 1, .) //create empty matrix with 2 (as many variables as we are looping over) rows, 1 column
matrix list storeMyP //look at the matrix
loc n = 0 //count the iterations
foreach variableName of varlist b c {
    loc n = `n' + 1 //each iteration, adjust the count
    reg a `variableName'
    test `variableName' //this does an F-test, but for one variable it's equivalent to a t-test (see -help test-; there is lots this can do)
    matrix storeMyP[`n', 1] = `r(p)' //save the p-value in the matrix
}
matrix list storeMyP //look at your p-values
matrix storeMyP_2 = 2*storeMyP //replicating your example above
What's going on is that Stata automatically stores certain quantities after estimation and test commands. When a help file says that a command stores values in r(), you can use them in an expression directly, as r(p), or expand them like a macro in single quotes, as `r(p)'.
You may also want to convert the matrix column(s) into variables using svmat storeMyP; see help svmat for more info.
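As a sketch of the "further operations" step, the loop below applies a Bonferroni correction (multiply each p-value by the number of tests, capped at 1) to a matrix of stored p-values. The auto dataset, the variable list, and the matrix names P and Padj are illustrative assumptions, not part of the question.

```stata
* Sketch only: the dataset, variables, and matrix names are assumptions
sysuse auto, clear
local vars mpg weight length
local k : word count `vars'

matrix P = J(`k', 1, .)          // one row per test
matrix rownames P = `vars'

local i = 0
foreach v of local vars {
    local ++i
    quietly regress price `v'
    quietly test `v'
    matrix P[`i', 1] = r(p)      // p-value left behind by -test-
}

* Bonferroni: multiply by the number of tests, cap at 1
matrix Padj = P
forvalues i = 1/`k' {
    local padj = min(1, `k' * P[`i', 1])
    matrix Padj[`i', 1] = `padj'
}
matrix list Padj
```

Going through a local macro for each adjusted value keeps the matrix assignment to a plain number. For longer lists of tests, svmat followed by generate or replace on the resulting variable may be more convenient.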
Related
I'm trying to write a loop to conduct individual t-tests for a list of variables (tab1) and export the means and p-values to Excel using the putexcel command. Right now my code looks like this:
putexcel set "Ttests.xlsx", sheet("t_test") replace
local n_models: word count `tab1'
forval i=1/`n_models' {
mat T=J(`n_models',4,.)
foreach x of tab1 {
ttest `x', by(var)
mat T[`i',1] = r(mu_1)
mat T[`i',2] = r(mu_2)
mat T[`'i,3] = r(mu_1) - r(mu_2)
mat T[`i',4] = r(p)
}
}
putexcel A1= matrix(T)
Unfortunately right now I'm only getting the means/p-values for the first variable of tab1. What am I doing wrong?
This isn't a minimal, complete or verifiable example. See https://stackoverflow.com/help/mcve for the standard here.
There is no data example.
The local macro tab1 isn't defined.
Even with fixes to those, the code is illegal: "of tab1" is not valid in the foreach loop, and the instruction to put results in the third column of the matrix is mangled because the punctuation is wrong. That comes from paraphrasing code rather than showing us a version that does what you say.
So, what's key is to show us code that does what you say.
All that said, your problems are conceptual:
Bigger deal 1 You have two nested loops, over a counter and over the variables you want as response. But you need just one loop. You want to loop over the variables and at the same time bump up a counter. Or, you want to loop over a counter and at the same time pick another variable. It doesn't matter which way you do it, at least to Stata.
Bigger deal 2 What happens with (what I guess is) your code? Trace it through. Suppose your number of variables is 4. So, last time around the outer loop, you reset the entire matrix to missings, and then in the inner loop cycle around your variables and each time try to put results for each variable in the 4th row of your matrix (i.e. the same row!). Contrary to your report, I guess that what you see in the spreadsheet is results for the last variable named, not the first.
I think you want code more like this:
sysuse auto, clear
local tab1 mpg price
local var foreign
putexcel set "Ttests.xlsx", sheet("t_test") replace
local n_models: word count `tab1'
mat T = J(`n_models', 4, .)
tokenize `tab1'
forval i=1/`n_models' {
    ttest ``i'', by(`var')
    mat T[`i',1] = r(mu_1)
    mat T[`i',2] = r(mu_2)
    mat T[`i',3] = r(mu_1) - r(mu_2)
    mat T[`i',4] = r(p)
}
putexcel A1 = matrix(T)
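The other single-loop variant mentioned above, looping over the variables and bumping a counter, could be sketched as follows. It reuses the same assumed dataset and macros; the row and column names are illustrative additions, and the putexcel step is left out so the block runs without writing a file.

```stata
* Sketch only: same assumed dataset and macros as the example above
sysuse auto, clear
local tab1 mpg price
local var foreign

local n_models : word count `tab1'
matrix T = J(`n_models', 4, .)
matrix rownames T = `tab1'
matrix colnames T = mean1 mean2 diff p

local i = 0
foreach x of local tab1 {
    local ++i                          // bump the row counter
    quietly ttest `x', by(`var')
    matrix T[`i', 1] = r(mu_1)
    matrix T[`i', 2] = r(mu_2)
    matrix T[`i', 3] = r(mu_1) - r(mu_2)
    matrix T[`i', 4] = r(p)
}
matrix list T
```

Either way, the matrix is created once, before the loop, and each pass fills exactly one row.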
I'd like to add the number of groups to an esttab output after fitting a multilevel model in Stata (xtmixed). I found this post and understand that the number of groups is stored in the matrix e(N_g). However, I do not understand how to use the value in this matrix and add it to the output. I tried estadd but did not manage to get it to work. I also did not find anything in the estout manual about using a single cell of a matrix in an output.
Any ideas on how to add the number of groups to an esttab output? Or maybe experiences with adding single values from a matrix?
Thx!!!
P.S. This is the output of matrix list e(N_g)
symmetric e(N_g)[1,1]
     c1
r1  117
estadd is the way to go. Here is an example:
clear all
webuse pig
// model
eststo: xtmixed weight week || id:, vce(robust)
// estadd portion - pull the single element of the matrix into a local and estadd it
matrix N_g = e(N_g)
local groups = N_g[1,1]
estadd local groups `groups'
// add groups in the stats option and it will appear at the bottom along
// with the number of observations in this case
esttab ., stats(N groups)
Actually, esttab, stats(N_g, labels("N")) is sufficient. estout can recognize N_g.
I am trying to run quantile regressions across deciles, and so I use the sqreg command to get bootstrap standard errors for every decile. However, after I run the regression (so Stata runs 9 different regressions - one for each decile except the 100th) I want to store the coefficients in locals. Normally, this is what I would do:
reg y x, r
local coeff = _b[x]
And things would work well. However, here my command is:
sqreg y x, q(0.1 0.2 0.3)
So, I will have three different coefficients here that I want to store as three different locals. Something like:
local coeff10 = _b[x] //Where _b[x] is the coefficient on x for the 10th quantile.
How do I do this? I tried:
local coeff10 = _b[[q10]x]
But this gives me an error. Please help!
Thank you!
Simply save the coefficient matrix from the stored estimation results, e(b), and reference its elements by row and column.
The reason you cannot do the same as after OLS is that the sqreg coefficient matrix holds the same coefficient name several times, once per quantile equation:
* OUTPUTS MATRIX OF COEFFICIENTS (1 X 6)
matrix list e(b)
* SAVE COEFF. MATRIX TO REGULAR MATRIX VARIABLE
mat b = e(b)
* EXTRACT BY ROW/COLUMN INTO OTHER VARIABLES
local coeff10 = b[1,1]
local coeff20 = b[1,3]
local coeff30 = b[1,5]
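As an alternative sketch, a coefficient can be addressed by equation name rather than by column position, using Stata's _b[eqname:varname] syntax (equivalently [eqname]_b[varname]). This assumes the equation names sqreg assigns for these quantiles (q10, q20, q30, as shown in the output header); the auto dataset, the seed, and reps(20) are illustrative choices.

```stata
* Sketch only: dataset, seed, and reps() are assumptions
sysuse auto, clear
set seed 12345
quietly sqreg price weight, quantiles(0.1 0.2 0.3) reps(20)

* address each coefficient by its equation name
local coeff10 = _b[q10:weight]
local coeff20 = _b[q20:weight]
local coeff30 = _b[q30:weight]
display "q10: `coeff10'  q20: `coeff20'  q30: `coeff30'"
```

Referring to coefficients by name avoids having to recount column positions when regressors are added to the model.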
Let's have the following data frame in R:
df <- data.frame(sample=rnorm(1,0,1),params=I(list(list(mean=0,sd=1,dist="Normal"))))
df <- rbind(df,data.frame(sample=rgamma(1,5,5),params=I(list(list(shape=5,rate=5,dist="Gamma")))))
df <- rbind(df,data.frame(sample=rbinom(1,7,0.7),params=I(list(list(size=7,prob=0.7,dist="Binomial")))))
df <- rbind(df,data.frame(sample=rnorm(1,2,3),params=I(list(list(mean=2,sd=3,dist="Normal")))))
df <- rbind(df,data.frame(sample=rt(1,3),params=I(list(list(df=3,dist="Student-T")))))
The first column contains a random number of a probability distribution and the second column stores a list with its parameters and name.
The dataframe df looks like:
      sample       params
1 0.85102972 0, 1, Normal
2 0.67313218  5, 5, Gamma
3 3.00000000 7, 0.7, ....
4 0.08488487 2, 3, Normal
5 0.95025523 3, Student-T
Q1: How can I get the list of distribution names for all records? df$params$dist does not work. For a single record it is easy, for example the third one: df$params[[3]]$dist
Q2: Is there any alternative way of storing data like this? Something like a multi-dimensional data frame? I do not want to add columns for each parameter because it will litter the data frame with missing values.
It's probably more natural to store information like this in a pure list structure, than in a data frame:
distList <- list(normal = list(sample=rnorm(1,0,1),params=list(mean=0,sd=1,dist="Normal")),
                 gamma = list(sample=rgamma(1,5,5),params=list(shape=5,rate=5,dist="Gamma")),
                 binom = list(sample=rbinom(1,7,0.7),params=list(size=7,prob=0.7,dist="Binomial")),
                 normal2 = list(sample=rnorm(1,2,3),params=list(mean=2,sd=3,dist="Normal")),
                 tdist = list(sample=rt(1,3),params=list(df=3,dist="Student-T")))
And then if you want to extract just the distribution name from each, we can use sapply to loop over the list and extract just that piece:
sapply(distList,function(x) x[[2]]$dist)
     normal       gamma       binom     normal2       tdist
   "Normal"     "Gamma"  "Binomial"    "Normal" "Student-T"
If you absolutely must store this information in a data frame, one way of doing so springs to mind. You're currently using a params column in your data frame to store the parameters associated with the distributions. Perhaps a better way of doing this would be to (i) identify the maximum number of parameters that you'll need for any distribution, (ii) store the distribution names in a field called df$distribution, and (iii) store the parameters in dedicated parameter columns, the meaning of which will have to be decided upon based on the type of distribution.
For instance, any row with df$distribution = 'Normal' should have df$param1 = and df$param2 = . A row with df$distribution='Student' should have df$param1 = and df$param2 = NA. Something like the following:
dg <- data.frame(sample=rnorm(1, 0, 1), distribution='Normal',
                 param1=0, param2=1)
dg <- rbind(dg, data.frame(sample=rgamma(1, 5, 5),
                           distribution='Gamma', param1=5, param2=5))
dg <- rbind(dg, data.frame(sample=rt(1, 3), distribution='Student',
                           param1=3, param2=NA))
It's ugly, but it will give you what you want. And don't worry about the missing values; missing values are a fact of life when dealing with non-trivial data frames. They can be dealt with easily in R by appropriate use of things like na.rm and complete.cases().
Based on the data frame you have above,
sapply(df$params,"[[","dist")
(or lapply if you prefer) would work.
I would probably put at least the names of the distributions in their own column, even if you want the parameters to be in variable-length lists.
I have a set of data like below:
A B C D
1 2 3 4
2 3 4 5
These are aggregated data in which A, B, C and D make up a 2x2 table. I need to do a Fisher exact test on each row and add a new column holding that row's p-value.
I can use fisher.test and a loop to do it in R, but I can't find a command in Stata for the Fisher exact test.
You are thinking in R terms, and that is often fruitless in Stata (just as it is impossible for a Stata guy to figure out how to do by ... : regress in R; every package has its own paradigm and its own strengths).
There are no objects to add columns to. Maybe you could say a little more about what you eventually need to do with your p-values, so that we can find a solution that your Stata collaborators would sympathize with.
If you really want to add a new column (generate a new variable, speaking Stata), then you might want to look at tabulate and its returned values:
clear
input x y f1 f2
0 0 5 10
0 1 7 12
1 0 3 8
1 1 9 5
end
I assume that your A B C D stand for two binary variables, and the numbers are frequencies in the data. You have to clear the memory, as Stata thinks about one data set at a time.
Then you could tabulate the results and generate new variables containing p-values, although that would be a major waste of memory to create variables that contain a constant value:
tabulate x y [fw=f1], exact
return list
generate p1 = r(p_exact)
tabulate x y [fw=f2], exact
generate p2 = r(p_exact)
Here, [fw=variable] is a way to specify frequency weights; I typed return list to find out what kind of information Stata stores as the result of the procedure. THAT'S the object-like thing Stata works with. R would return the test results in the fisher.test()$p.value component, and Stata creates returned values, r(component) for simple commands and e(component) for estimation commands.
If you want a loop solution (if you have many sets), you can do this:
forvalues k=1/2 {
    tabulate x y [fw=f`k'], exact
    generate p`k' = r(p_exact)
}
That's the scripting capacity in which Stata, IMHO, is way stronger than R (although it can be argued that this is an extremely dirty programming trick). The local macro k takes values from 1 to 2, and `k' is replaced by its current value everywhere in the code inside the curly braces.
Alternatively, you can keep the results in Stata short term memory as scalars:
tabulate x y [fw=f1], exact
scalar p1 = r(p_exact)
tabulate x y [fw=f2], exact
scalar p2 = r(p_exact)
However, the scalars are not associated with the data set, so you cannot save them with the data.
The immediate commands like cci suggested here would also have returned values that you can similarly retrieve.
HTH, Stas
Have a look at the cci command with the exact option:
cci 10 15 30 10, exact
It is one of the so-called "immediate" commands. They allow you to do computations directly from the arguments rather than from data stored in memory. Have a look at help immediate.
Each observation in the poster's original question apparently consisted of the four counts in one traditional 2 x 2 table. Stas's code applied to data of individual observations. Nick pointed out that -cci- can analyze a b c d data. Here's code that applies -cci- to each table and, like Stas's code, adds the p-values to the data set. The forvalues i = 1/`=_N' statement tells Stata to run the loop from the first to the last observation. a[`i'] refers to the value of the variable a in the i-th observation.
clear
input a b c d
10 2 8 4
5 8 2 1
end
gen exactp1 = .
gen exactp2 = .
label var exactp1 "1-sided exact p"
label var exactp2 "2-sided exact p"
forvalues i = 1/`=_N' {
    local a = a[`i']
    local b = b[`i']
    local c = c[`i']
    local d = d[`i']
    qui cci `a' `b' `c' `d', exact
    replace exactp1 = r(p1_exact) in `i'
    replace exactp2 = r(p_exact) in `i'
}
list
Note that there is no problem in giving a local macro the same name as a variable.