I have a set of data like below:
A B C D
1 2 3 4
2 3 4 5
They are aggregated data which ABCD constitutes a 2x2 table, and I need to do Fisher exact test on each row, and add a new column for the p-value of the Fisher exact test for that row.
I can use fisher.exact and loop to do it in R, but I can't find a command in Stata for Fisher exact test.
You are thinking in R terms, and that is often fruitless in Stata (just as it is impossible for a Stata guy to figure out how to do by ... : regress in R; every package has its own paradigm and its own strengths).
There are no objects to add columns to. May be you could say a little bit more as to what you need to do, eventually, with your p-values, so as to find an appropriate solution that your Stata collaborators would sympathize with.
If you really want to add a new column (generate a new variable, speaking Stata), then you might want to look at tabulate and its returned values:
clear
input x y f1 f2
0 0 5 10
0 1 7 12
1 0 3 8
1 1 9 5
end
I assume that your A B C D stand for two binary variables, and the numbers are frequencies in the data. You have to clear the memory, as Stata thinks about one data set at a time.
Then you could tabulate the results and generate new variables containing p-values, although that would be a major waste of memory to create variables that contain a constant value:
tabulate x y [fw=f1], exact
return list
generate p1 = r(p_exact)
tabulate x y [fw=f2], exact
generate p2 = r(p_exact)
Here, [fw=variable] is a way to specify frequency weights; I typed return list to find out what kind of information Stata stores as the result of the procedure. THAT'S the object-like thing Stata works with. R would return the test results in the fisher.test()$p.value component, and Stata creates returned values, r(component) for simple commands and e(component) for estimation commands.
If you want a loop solution (if you have many sets), you can do this:
forvalues k=1/2 {
tabulate x y [fw=f`k'], exact
generate p`k' = r(p_exact)
}
That's the scripting capacity in which Stata, IMHO, is way stronger than R (although it can be argued that this is an extremely dirty programming trick). The local macro k takes values from 1 to 2, and this macro is substituted as ``k'` everywhere in the curly bracketed piece of code.
Alternatively, you can keep the results in Stata short term memory as scalars:
tabulate x y [fw=f1], exact
scalar p1 = r(p_exact)
tabulate x y [fw=f2], exact
scalar p2 = r(p_exact)
However, the scalars are not associated with the data set, so you cannot save them with the
data.
The immediate commands like cci suggested here would also have returned values that you can similarly retrieve.
HTH, Stas
Have a look the cci command with the exact option:
cci 10 15 30 10, exact
It is part of the so-called "immediate" commands. They allow you to do computations directly from the arguments rather than from data stored in memory. Have a look at help immediate
Each observation in the poster's original question apparently consisted of the four counts in one traditional 2 x 2 table. Stas's code applied to data of individual observations. Nick pointed out that -cci- can analyze a b c d data. Here's code that applies -cci to each table and, like Stas's code, adds the p-values to the data set. The forvalues i = 1/`=_N' statement tells Stata to run the loop from the first to the last observation. a[`i'] refers to the the value of the variable `a' in the i-th observation.
clear
input a b c d
10 2 8 4
5 8 2 1
end
gen exactp1 = .
gen exactp2 =.
label var exactp1 "1-sided exact p"
label var exactp2 "2-sided exact p"
forvalues i = 1/`=_N'{
local a = a[`i']
local b = b[`i']
local c = c[`i']
local d = d[`i']
qui cci `a' `b' `c' `d', exact
replace exactp1 = r(p1_exact) in `i'
replace exactp2 = r(p_exact) in `i'
}
list
Note that there is no problem in giving a local macro the same name as a variable.
Related
I have a large data set in Stata.
There are several item batteries in this data set.
One item battery consists of 8 items (v1 - v8), each scaled from 1 to 7.
I want to code all items that take the value 1 in all items as missing values.
If v1 to v8 have the value "1", all rows to which this applies are to be replaced with missings.
I know how to code missing values with the if qualifier, but the selection with the complex condition causes me difficulties.
The code for R would probably solve this via rowSums, but I need the solution for Stata.
(I assume in R it would work like this:
df[rowSums(df[,c("v1", ... "v8")]!=1)==0, c("v1", .... "v8")] <- NA
But I need a solution for Stata.
If I understood this correctly, you want
egen rowall = concat(v1-v8)
mvdecode v1-v8 if rowall == 8 * "1", mv(1)
That is, all instances in v1-v8 of 1 are recoded as missing if and only if the values of those variables are all 1 in any observation.
I will present the simplified version of what I want to do. I know how to do it easily in SAS but not in Stata.
Let's say I am trying to create a "poor" binary variable = 1 if an observation is classified as poor and 0 otherwise. I want to have two classifications, one is based on real income, and another based on real consumption (these are variables in the dataset).
The SAS macro would be
%MACRO poverty_bin(type=, measure=)
DATA dataset;
SET dataset;
IF &measure. <= poverty_line THEN poor&type. = 1 ELSE poor&type. = 0;
RUN;
%MEND
%poverty_bin(type=con, measure=real_consumption);
%poverty_bin(type=inc, measure=real_income);
which should create two binary variables poor_con and poor_inc.
I have no idea how to do this in Stata. I tried doing something like this just to see if nested foreach is what I'm looking for:
foreach x of newlist con inc {
foreach y of newlist real_income real_consumption{
display "`x' and `y'"
}
}
But it gives an error message saying "variable real_income already defined"
The error message you cite implies that earlier code you do not show us created a variable real_income.
I do not know SAS but I can tell you that given a numeric variable x
gen y = x <= 42
will create a variable y with value 1 if x <= 42 and 0 otherwise.
For another such variable, use another similar statement. In Stata and perhaps any other language, setting up a nested loop or defining a program instead of making two statements directly seems overkill. For a number of new variables much larger than 2, that might not be true.
foreach v in x y {
gen new`v' = `v' <= 42
}
For completely arbitrary existing names, new names and thresholds it is likely to be easier to write out statements individually.
This is documented. See for example 13.2.2 in [U] or this FAQ.
This may be a trivial question, but as an R user coming to Stata I have so far failed to find the correct Google terms to find the answer. I want to do the following steps:
Do a bunch of tests (e.g. lrtest results in a foreach loop)
Extract the p-value from each test and save them in a list of some kind
Have a list I can do further operations on (e.g. perform multiple comparison correction)
So I am wondering how to extract p-values (or similar) from command results and how to save them into a vector-like object that I can work with. Here is some R code that does something similar:
myData <- data.frame(a=rnorm(10), b=rnorm(10), c=rnorm(10)) ## generate some data
pValue <- c()
for (variableName in c("b", "c")) {
myModel <- lm(as.formula(paste("a ~", variableName)), data=myData) ## fit model
pValue <- c(pValue, coef(summary(myModel))[2, "Pr(>|t|)"]) ## extract p-value and save in vector
}
pValue * 2 ## do amazing multiple comparison correction
To me it seems like Stata has much less of a 'programming' mindset to it than R. If you have any general Stata literature recommendations for an R user who can program, that would also be appreciated.
Here is an approach that would save the p-values in a matrix and then you can manipulate the matrix, maybe using Mata or standard matrix manipulation in Stata.
matrix storeMyP = J(2, 1, .) //create empty matrix with 2 (as many variables as we are looping over) rows, 1 column
matrix list storeMyP //look at the matrix
loc n = 0 //count the iterations
foreach variableName of varlist b c {
loc n = `n' + 1 //each iteration, adjust the count
reg a `variableName'
test `variableName' //this does an F-test, but for one variable it's equivalent to a t-test (check: -help test- there is lots this can do
matrix storeMyP[`n', 1] = `r(p)' //save the p-value in the matrix
}
matrix list storeMyP //look at your p-values
matrix storeMyP_2 = 2*storeMyP //replicating your example above
What's going on this that Stata automatically stores certain quantities after estimation and test commands. When the help files say this command stores the following values in r(), you refer to them in single quotes.
It could also be interesting for you to convert the matrix column(s) into variables using svmat storeMyP, or see help svmat for more info.
From this Stata FAQ, I know the answer to the first part of my question. But here I'd like to go a step further. Suppose I have the following data (already sorted by a variable not shown):
id v1
A 9
B 8
C 7
B 7
A 5
C 4
A 3
A 2
To calculate the minimum in this sequence, I do
generate minsofar = v1 if _n==1
replace minsofar = min(v1[_n-1], minsofar[_n-1]) if missing(minsofar)
To get
id v1 minsofar
A 9 9
B 8 9
C 7 8
B 7 7
A 5 7
C 4 5
A 3 4
A 2 3
Now I'd like to generate a variable, call it id_min that gives me the ID associated with minsofar, so something like
id v1 minsofar id_min
A 9 9 A
B 8 9 A
C 7 8 B
B 7 7 C
A 5 7 C
C 4 5 A
A 3 4 C
A 2 3 A
Note that C is associated with 7, because 7 is first associated with C in the current sorting. And just to be clear, my ID variable here shows as a string variable just for the sake of readability -- it's actually numeric.
Ideas?
EDIT:
I suppose
gen id_min = id if _n<=2
replace id_min = id[_n-1] if v1[_n-1]<minsofar[_n-1] & missing(id_min)
replace id_min = id_min[_n-1] if missing(id_min)
does the job at least for the data in this example. Don't know if it would work for more complex cases.
This works for your example. It uses the user-written command vlookup, which you can install running findit vlookup and following through the link that appears.
clear
set more off
input ///
str1 id v1
A 9
B 8
C 7
B 7
A 5
C 4
A 3
A 2
end
encode id, gen(id2)
order id2
drop id
list
*----- what you want -----
// your code
generate minsofar = v1 if _n==1
replace minsofar = min(v1[_n-1], minsofar[_n-1]) if missing(minsofar)
// save original sort
gen osort = _n
// group values of v1 but respecting original sort so values of
// id2 don't jump around
sort v1 osort
// set obs after first as missing so id2 is unique within v1
gen v2 = v1
by v1: replace v2 = . if _n > 1
// lookup
vlookup minsofar, gen(idmin) key(v2) value(id2)
// list
sort osort
drop osort v2
list, sep(0)
Your code has generate minsofar = v1 if _n==1 which is better coded as generate minsofar = v1 in 1, because it is more efficient.
Your minsofar variable is just a displaced copy of v1, so if this is always the case, there should be simpler ways of handling your problem. I suspect your problem is easier than you have acknowledged until now, and that has come through your post. Perhaps giving more context, expanded example data, etc. could get you better advice.
This is both easier and a little more challenging than implied so far. Given value (a little more evocative than the OP's v1) and a desire to keep track of minimum so far, that's for example
generate min_so_far = value[1]
replace min_so_far = value if value < min_so_far[_n-1] in 2/L
where the second statement exploits the unsurprising fact that Stata replaces in the current order of observations. [_n-1] is the index of the previous observation and in 2/L implies a loop over all observations from the second to the last.
Note that the OP's version is buggy: by always looking at the previous observation, the code never looks at the very last value and will overlook that if it is a new minimum. It may be that the OP really wants "minimum before now" but that is not what I understand by "minimum so far".
If we have missing values in value they will not enter the comparison in any malign way: missing is always regarded as arbitrarily large by Stata, so missings will be recorded if and only if no non-missings are present so far, which is as it should be.
The identifier of that minimum at first sight yields to the same logic
generate min_so_far = value[1]
gen id_min = id[1]
replace min_so_far = value if value < min_so_far[_n-1] in 2/L
replace id_min = id if value < min_so_far[_n-1] in 2/L
There are at least two twists that might bite. The OP mentions a possibility that the identifier might be missing so that we might have a new minimum but not know its identifier. The code just given will use a missing identifier, but if the desire is to keep separate track of the identifier of the minimum value with known identifiers, different code is needed.
A twist not mentioned to date is that observations with different identifier might all have the same minimum so far. The code above replaces the identifier only the first time a particular minimum is seen; if the desire is to record the identifier of the last occurrence the < in the last code line above should be replaced with <=. If the desire is to keep track of the all the identifiers of the minimum so far, then a string variable is needed to concatenate all the identifiers.
With a structure of panel or longitudinal data the whole thing is done under the aegis of by:.
I can't see a need to resort to user-written extensions here.
I have an unbalanced panel data set (countries and years). For simplicity let's say I have one variable, x, that I am measuring. The panel data sorted first by country (a 3-digit numeric country-code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. The variables should be standardized by subtracting the mean from the preceding (exclusive) m time periods, and then dividing by the standard deviation from those same time periods. If this is not possible, return a missing value.
Currently, the code I am using to accomplish this is the following (edited now for clarity)
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 (weocountrycode year) using stats_x
replace sd_x = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n] // This and next line are for deleting values that rolling calculates when I actually want missing values.
replace mean_`x' = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_`x' = (`x' - mean_`x'[_n-1])/sd_`x'[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1,2,3,4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general - if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ..... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem and the output is "weird" in the sense that
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc. which makes no sense to me.
Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!
I'm not sure I understand completely but I'll try to answer based on what I think your problem is, and based on a comment by #NickCox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
if the first non-missing value is
on entry 5, it will do a length 1 rolling average on entry 5, length 2
rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of
observations. If there
are missing data (for example, because of weekends), the actual number of observations used by command may be less than
window(#).
It's not actually doing a "length 1 rolling average", but I get to that later.
Below some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set windowsize(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.