I'm trying to write a loop to conduct individual t-tests for a list of variables (tab1) and export the means and p-values to Excel using the putexcel command. Right now my code looks like this:
putexcel set "Ttests.xlsx", sheet("t_test") replace
local n_models: word count `tab1'
forval i=1/`n_models' {
mat T=J(`n_models',4,.)
foreach x of tab1 {
ttest `x', by(var)
mat T[`i',1] = r(mu_1)
mat T[`i',2] = r(mu_2)
mat T[`'i,3] = r(mu_1) - r(mu_2)
mat T[`i',4] = r(p)
}
}
putexcel A1= matrix(T)
Unfortunately right now I'm only getting the means/p-values for the first variable of tab1. What am I doing wrong?
This isn't a minimal, complete or verifiable example. See https://stackoverflow.com/help/mcve for the standard here.
There is no data example.
The local macro tab1 isn't defined.
Even with fixes to those, the code is illegal as of tab1 is illegal in the foreach loop and the instruction to put stuff in the third column of the matrix is mangled as the punctuation is wrong. That comes from paraphrasing code, not showing us a version that does what you say.
So, what's key is to show us code that does what you say.
All that said, your problems are conceptual:
Bigger deal 1 You have two nested loops, over a counter and over the variables you want as response. But you need just one loop. You want to loop over the variables and at the same time bump up a counter. Or, you want to loop over a counter and at the same time pick another variable. It doesn't matter which way you do it, at least to Stata.
Bigger deal 2 What happens with (what I guess is) your code? Trace it through. Suppose your number of variables is 4. So, last time around the outer loop, you reset the entire matrix to missings, and then in the inner loop cycle around your variables and each time try to put results for each variable in the 4th row of your matrix (i.e. the same row!). Contrary to your report, I guess that what you see in the spreadsheet is results for the last variable named, not the first.
I think you want code more like this:
sysuse auto, clear
local tab1 mpg price
local var foreign
putexcel set "Ttests.xlsx", sheet("t_test") replace
local n_models: word count `tab1'
mat T = J(`n_models', 4, .)
tokenize `tab1'
forval i=1/`n_models' {
ttest ``i'', by(`var')
mat T[`i',1] = r(mu_1)
mat T[`i',2] = r(mu_2)
mat T[`i',3] = r(mu_1) - r(mu_2)
mat T[`i',4] = r(p)
}
putexcel A1 = matrix(T)
Related
I have a series of wide panel datasets. In each of these, I want to generate a series of new variables. E.g., in Dataset1, I have variables Car2009 Car2010 Car2011 in a dataset. Using this, I want to create a variable HadCar2009, which is 1 if Car2009 is non-missing, and 0 if missing, similarly HadCar2010, and so on. Of course, this is simple to do but I want to do it for multiple datasets which could have different ranges in terms of time. E.g., Dataset2 has variables Car2005, Car2006, Car2008.
These are all very large datasets (I have about 60 such datasets), so I wouldn't want to convert them to long either.
For now, this is what I tried:
forval j = 1/2{
use Dataset`j', clear
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
save Dataset`j', replace
}
This works, but I am reluctant to use capture, because perhaps some datasets have a variable called car2008 instead of Car2008, and this would be an error I would like the program to stop at.
Also, the ranges of years across my 60-odd datasets are different. Ideally, I would like to somehow get this range in a local (perhaps somehow using describe? I'm not sure) and then just generate these variables using that local with a simple for loop.
But I'm not sure I can do this in Stata.
Your inner loop could be rewritten from
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
to
foreach v of var Car???? {
gen Had`v' = !missing(`v')
}
noting the fact in Stata that true or false expressions evaluate to 1 or 0 directly.
https://www.stata-journal.com/article.html?article=dm0099
https://www.stata-journal.com/article.html?article=dm0087
https://www.stata.com/support/faqs/data-management/true-and-false/
This code is going to ignore variables beginning with car. There are other ways to check for their existence. However, if there are no variables Car???? the loop will trigger an error message. A loop over ?ar???? would catch car???? and Car???? (but just possibly other variables too).
This may be a trivial question, but as an R user coming to Stata I have so far failed to find the correct Google terms to find the answer. I want to do the following steps:
Do a bunch of tests (e.g. lrtest results in a foreach loop)
Extract the p-value from each test and save them in a list of some kind
Have a list I can do further operations on (e.g. perform multiple comparison correction)
So I am wondering how to extract p-values (or similar) from command results and how to save them into a vector-like object that I can work with. Here is some R code that does something similar:
myData <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10)) ## generate some data
pValue <- c()
for (variableName in c("b", "c")) {
myModel <- lm(as.formula(paste("a ~", variableName)), data=myData) ## fit model
pValue <- c(pValue, coef(summary(myModel))[2, "Pr(>|t|)"]) ## extract p-value and save in vector
}
pValue * 2 ## do amazing multiple comparison correction
To me it seems like Stata has much less of a 'programming' mindset to it than R. If you have any general Stata literature recommendations for an R user who can program, that would also be appreciated.
Here is an approach that would save the p-values in a matrix and then you can manipulate the matrix, maybe using Mata or standard matrix manipulation in Stata.
matrix storeMyP = J(2, 1, .) //create empty matrix with 2 (as many variables as we are looping over) rows, 1 column
matrix list storeMyP //look at the matrix
loc n = 0 //count the iterations
foreach variableName of varlist b c {
loc n = `n' + 1 //each iteration, adjust the count
reg a `variableName'
test `variableName' //this does an F-test, but for one variable it's equivalent to a t-test (check: -help test- there is lots this can do
matrix storeMyP[`n', 1] = `r(p)' //save the p-value in the matrix
}
matrix list storeMyP //look at your p-values
matrix storeMyP_2 = 2*storeMyP //replicating your example above
What's going on this that Stata automatically stores certain quantities after estimation and test commands. When the help files say this command stores the following values in r(), you refer to them in single quotes.
It could also be interesting for you to convert the matrix column(s) into variables using svmat storeMyP, or see help svmat for more info.
I am able to extract the mean into a matrix as follows:
svy: mean age, over(villageid)
matrix villagemean = e(b)'
clear
svmat village
However, I also want to merge this mean back to the villageid. My current thinking is to extract the rownames of the matrix villagemean like so:
local names : rownames villagemean
Then try to turn this macro names into variable
foreach v in names {
gen `v' = "``v''"
}
However, the variable names is empty. What did I do wrong? Since a lot of this is copied from Stata mailing list, I particularly don't understand the meaning of local names : rownames villagemean.
It's not completely clear to me what you want, but I think this might be it:
clear
set more off
*----- example data -----
webuse nhanes2f
svyset [pweight=finalwgt]
svy: mean zinc, over(sex)
matrix eb = e(b)
*----- what you want -----
levelsof sex, local(levsex)
local wc: word count `levsex'
gen avgsex = .
forvalues i = 1/`wc' {
replace avgsex = eb[1,`i'] if sex == `:word `i' of `levsex''
}
list sex zinc avgsex in 1/10
I make use of two extended macro functions:
local wc: word count `levsex'
and
`:word `i' of `levsex''
The first one returns the number of words in a string; the second returns the nth token of a string. The help entry for extended macro functions is help extended_fcn. Better yet, read the manuals, starting with: [U] 18.3 Macros. You will see there (18.3.8) that I use an abbreviated form.
Some notes on your original post
Your loop doesn't do what you intend (although again, not crystal clear to me) because you are supplying a list (with one element: the text name). You can see it running and comparing:
local names 1 2 3
foreach v in names {
display "`v'"
}
foreach v in `names' {
display "`v'"
}
foreach v of local names {
display "`v'"
}
You need to read the corresponding help files to set that right.
As for the question in your original post, : rownames is another extended macro function but for matrices. See help matrix, #11.
My impression is that for the kind of things you are trying to achieve, you need to dig deeper into the manuals. Furthermore, If you have not read the initial chapters of the Stata User's Guide, then you must do so.
I have this problem.
My dataset has variables like:
sec20_var1 sec22_var1 sec30_var1
sec20_var2 sec22_var2 sec30_var2 sec31_var2
(~102 sectors, ~60 variables, not all of the cominations are complete or even existent)
My intention is to build an indicator that do an average of variables within sector. So it is an "aggregated sector" that contains sectors belonging to a class in a high-med-low technology fashion. I already have the definitions of what sectors should include in each category. Let's say, in high technology I should put sec20 and sec31.
The problem: the list of sectors belonging to a class and the actual sectors available for each variable doesn't match. So I'm stucked with this problem and started to do it manually. My best approach was:
set more off
foreach v in _var02 {
ds *`v'
di "`r(varlist)'"
local sects`v' `r(varlist)'
foreach s in sec26 sec28 sec37 {
capture confirm local sects`v'
if !_rc {
egen oecd_medhigh_avg_`v'=rowmean(`s'`v' sec28`v' sec37`v' sec40`v' sec59`v' sec92`v' sec54`v' sec55`v' sec48`v' sec50`v' sec53`v' sec4`v' sec5`v' sec6`v')
else {
di "`v' didnt existed"
}
}
}
}
I got it work only with those variables that has all the sectors present in the totalrow (which is simpler since I dont have to store the varlist in a macro). I would like to do an average of the AVAILABLE sectors, even if they are only two per variable.
I also noticed that the macro storage could be helpful but I don't know how to put it into my code. I'm totally stucked in here.
Thanks for your help! :)
Thank you #SOConnell. As I said in my comment, I went to the same direction, but I'm still searching for the solution I expected (that I don't how to program it or even if it's possible).
I used this code, that goes in the same direction that the one made by #SOConnell, but I found this one more clear. The trick is the _rc==111 that catches the missing combinations of sector_X_variable and complete them, with the objective of beeing used in the second part. Everything worked. It's not elegant, but it has some practical use. :) The third part erases the missing variables created.
*COMPLETING THE LIST OF COMBINATIONS
set more off
foreach v in _var02 _var03 _var08 _var13 _... {
foreach s in sec27 sec35 sec42 sec43 sec45 sec46 sec39 sec52 sec67 {
capture confirm variable s'v'
if _rc==111 {
gen s'v'=.
}
}
}
*GENERATING THE INDICATOR WITH ALL POSSIBLE COMBINATIONS
set more off
foreach v in _var02 _var03 _var08 _var13 ... {
egen oecd_high_avg_v'=rowmean(sec27v' sec35v' sec42v' sec43v' sec45v' sec46v' sec39v' sec52v' sec67v')
}
*DROPPING MISSING VARIABLES CREATED TO DO THE INDICATOR.
set more off
foreach v of varlist * {
gen TEMP=.
replace TEMP=1 if !missing(v')
egen TEMPSUM=sum(TEMP)
if TEMPSUM==0 {
di " >>> Dropping empty variable:v'"
drop `v'
}
drop TEMP TEMPSUM
}
Note that I cutted the list of variables.
I will call what you are referring to as variables as "accounts".
The workaround would be to create empty variables in the dataset for all sectorXaccount combinations. From a point where you already have your dataset loaded into memory:
forval sec = 1/102 {
forval account = 1/60 {
cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
}
}
Then apply the rowmean operation to the full definition of each indicator. The missings won't be calculated into your rowmean, so it will effectively be an average of available cells without you having to do the selection manually. You could then probably automate deleting the empty variables you created if you do something like:
g start=.
forval sec = 1/102 {
forval account = 1/60 {
cap gen sec`sec'_var`account'=. /*this will skip over generating the secXaccount combination if it already exists in the dataset */
}
}
g end=.
[indicator calculations go here]
drop start-end
However, it seems like you would be creating averages that might not be comparable (some will have 2 underlying values, some 3, some 4, etc.) so you need to be careful there (but you are probably already aware of that).
Getting unknown function mean for this. Can't use egen because it has to be calculated for each value. A little confused.
edu_mov_avg=.
forvalues current_year = 2/133 {
local current_mean = mean(higra) if longitbirthqtr >= current_year - 2 & longitbirthqtr >= current_year + 2
replace edu_mov_avg = current_mean if longitbirthqtr =
}
Your code is a long way from working. This should be closer.
gen edu_mov_avg = .
qui forvalues current_qtr = 2/133 {
su higra if inrange(longitbirthqtr, `current_qtr' - 2, `current_qtr' + 2), meanonly
replace edu_mov_avg = r(mean) if longitbirthqtr == `current_qtr'
}
You need to use a command generate to produce a new variable.
You need to reference local macro values with quotation marks.
egen has its own mean() function, but it produces a variable, whereas you need a constant here. Using summarize, meanonly is the most efficient method. There is in Stata no mean() function that can be applied anywhere. Once you use summarize, there is no need to use a local macro to hold its results. Here r(mean) can be used directly.
You have >= twice, but presumably don't mean that. Using inrange() is not essential in writing your condition, but gives shorter code.
You can't use if qualifiers to qualify assignment of local macros in the way you did. They make no sense to Stata, as such macros are constants.
longitbirthqtr looks like a quarterly date. Hence I didn't use the name current_year.
With a window this short, there is an alternative using time series operators
tsset current_qtr
gen edu_mov_avg = (L2.higra + L1.higra + higra + F1.higra + F2.higra) / 5
That is not exactly equivalent as missings will be returned for the first two observations and the last two.
Your code may need further work if your data are panel data. But the time series operators approach remains easy so long as you declare the panel identifier, e.g.
tsset panelid current_qtr
after which the generate call is the same as above.
All that said, rolling offers a framework for such calculations.