Generate difference between observations as new variable in Stata - stata

I am trying to get the difference between the natural logarithms of two consecutive observations for a set of variables.
My approach is as follows
. gen abandon_qry_ln = ln(abandon_qry) - ln(abandon_qry) [_n-1]
But I get the error weights not allowed.
Any idea what could be the issue?

You could work with
gen difference = ln(abandon_qry) - ln(abandon_qry[_n-1])
or
gen ln_abandon_qry = ln(abandon_qry)
gen difference = ln_abandon_qry - ln_abandon_qry[_n-1]
You were trying to subscript an expression. You may subscript a variable or a matrix in Stata, but not in general an expression.

Related

Is there a way to extract year range from wide data?

I have a series of wide panel datasets. In each of these, I want to generate a series of new variables. E.g., in Dataset1, I have variables Car2009 Car2010 Car2011 in a dataset. Using this, I want to create a variable HadCar2009, which is 1 if Car2009 is non-missing, and 0 if missing, similarly HadCar2010, and so on. Of course, this is simple to do but I want to do it for multiple datasets which could have different ranges in terms of time. E.g., Dataset2 has variables Car2005, Car2006, Car2008.
These are all very large datasets (I have about 60 such datasets), so I wouldn't want to convert them to long either.
For now, this is what I tried:
forval j = 1/2{
use Dataset`j', clear
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
save Dataset`j', replace
}
This works, but I am reluctant to use capture, because perhaps some datasets have a variable called car2008 instead of Car2008, and this would be an error I would like the program to stop at.
Also, the ranges of years across my 60-odd datasets are different. Ideally, I would like to somehow get this range in a local (perhaps somehow using describe? I'm not sure) and then just generate these variables using that local with a simple for loop.
But I'm not sure I can do this in Stata.
Your inner loop could be rewritten from
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
to
foreach v of var Car???? {
gen Had`v' = !missing(`v')
}
noting the fact in Stata that true or false expressions evaluate to 1 or 0 directly.
https://www.stata-journal.com/article.html?article=dm0099
https://www.stata-journal.com/article.html?article=dm0087
https://www.stata.com/support/faqs/data-management/true-and-false/
This code is going to ignore variables beginning with car. There are other ways to check for their existence. However, if there are no variables Car???? the loop will trigger an error message. A loop over ?ar???? would catch car???? and Car???? (but just possibly other variables too).

What is the Stata-equivalent of this SAS macro?

I will present the simplified version of what I want to do. I know how to do it easily in SAS but not in Stata.
Let's say I am trying to create a "poor" binary variable = 1 if an observation is classified as poor and 0 otherwise. I want to have two classifications, one is based on real income, and another based on real consumption (these are variables in the dataset).
The SAS macro would be
%MACRO poverty_bin(type=, measure=)
DATA dataset;
SET dataset;
IF &measure. <= poverty_line THEN poor&type. = 1 ELSE poor&type. = 0;
RUN;
%MEND
%poverty_bin(type=con, measure=real_consumption);
%poverty_bin(type=inc, measure=real_income);
which should create two binary variables poor_con and poor_inc.
I have no idea how to do this in Stata. I tried doing something like this just to see if nested foreach is what I'm looking for:
foreach x of newlist con inc {
foreach y of newlist real_income real_consumption{
display "`x' and `y'"
}
}
But it gives an error message saying "variable real_income already defined"
The error message you cite implies that earlier code you do not show us created a variable real_income.
I do not know SAS but I can tell you that given a numeric variable x
gen y = x <= 42
will create a variable y with value 1 if x <= 42 and 0 otherwise.
For another such variable, use another similar statement. In Stata and perhaps any other language, setting up a nested loop or defining a program instead of making two statements directly seems overkill. For a number of new variables much larger than 2, that might not be true.
foreach v in x y {
gen new`v' = `v' <= 42
}
For completely arbitrary existing names, new names and thresholds it is likely to be easier to write out statements individually.
This is documented. See for example 13.2.2 in [U] or this FAQ.

Why is my forvalues loop in stata not working?

I am trying to find outliers in a variable called nbrs, generating an interquartile (iqr) range called nbrs_iqr. I then want to loop (for practice with this looping concept) the values 1.5, 2, 5, and 10 in to multiply them by the iqr.
I keep getting a syntax error (invalid syntax r(198);) on the loop. I have seen something about not being able to do the forvalues loop when the values are not a range but have seen examples where it is a non-range without being explicit that that is permitted. I figured the spaces worked to separate the non-range values but I've thrown up my hands from there.
sum nbrs, detail
return list
gen nbrs_iqr = r(p75)-r(p25)
tab nbrs_iqr
forvalues i = 1.5 2 5 10 {
gen nbrs_out_`i'=`i'*nbrs_iqr
}
help forvalues is clear on the syntax you can use. Yours is not a valid range. You can work with foreach, but notice that a . in a variable name is not allowed.
One solution is to use strtoname():
clear
set more off
sysuse auto
keep price
sum price, detail
gen nbrs_iqr = r(p75)-r(p25)
foreach i of numlist 1.5 2 5 10 {
local newi = strtoname("`i'")
gen nbrs_out`newi' = `i' * nbrs_iqr
}
describe
My advice: familiarize yourself with help help.

Getting unknown function mean() in a forvalues loop

Getting unknown function mean for this. Can't use egen because it has to be calculated for each value. A little confused.
edu_mov_avg=.
forvalues current_year = 2/133 {
local current_mean = mean(higra) if longitbirthqtr >= current_year - 2 & longitbirthqtr >= current_year + 2
replace edu_mov_avg = current_mean if longitbirthqtr =
}
Your code is a long way from working. This should be closer.
gen edu_mov_avg = .
qui forvalues current_qtr = 2/133 {
su higra if inrange(longitbirthqtr, `current_qtr' - 2, `current_qtr' + 2), meanonly
replace edu_mov_avg = r(mean) if longitbirthqtr == `current_qtr'
}
You need to use a command generate to produce a new variable.
You need to reference local macro values with quotation marks.
egen has its own mean() function, but it produces a variable, whereas you need a constant here. Using summarize, meanonly is the most efficient method. There is in Stata no mean() function that can be applied anywhere. Once you use summarize, there is no need to use a local macro to hold its results. Here r(mean) can be used directly.
You have >= twice, but presumably don't mean that. Using inrange() is not essential in writing your condition, but gives shorter code.
You can't use if qualifiers to qualify assignment of local macros in the way you did. They make no sense to Stata, as such macros are constants.
longitbirthqtr looks like a quarterly date. Hence I didn't use the name current_year.
With a window this short, there is an alternative using time series operators
tsset current_qtr
gen edu_mov_avg = (L2.higra + L1.higra + higra + F1.higra + F2.higra) / 5
That is not exactly equivalent as missings will be returned for the first two observations and the last two.
Your code may need further work if your data are panel data. But the time series operators approach remains easy so long as you declare the panel identifier, e.g.
tsset panelid current_qtr
after which the generate call is the same as above.
All that said, rolling offers a framework for such calculations.

Perform Fisher Exact Test from aggregated using Stata

I have a set of data like below:
A B C D
1 2 3 4
2 3 4 5
They are aggregated data which ABCD constitutes a 2x2 table, and I need to do Fisher exact test on each row, and add a new column for the p-value of the Fisher exact test for that row.
I can use fisher.exact and loop to do it in R, but I can't find a command in Stata for Fisher exact test.
You are thinking in R terms, and that is often fruitless in Stata (just as it is impossible for a Stata guy to figure out how to do by ... : regress in R; every package has its own paradigm and its own strengths).
There are no objects to add columns to. May be you could say a little bit more as to what you need to do, eventually, with your p-values, so as to find an appropriate solution that your Stata collaborators would sympathize with.
If you really want to add a new column (generate a new variable, speaking Stata), then you might want to look at tabulate and its returned values:
clear
input x y f1 f2
0 0 5 10
0 1 7 12
1 0 3 8
1 1 9 5
end
I assume that your A B C D stand for two binary variables, and the numbers are frequencies in the data. You have to clear the memory, as Stata thinks about one data set at a time.
Then you could tabulate the results and generate new variables containing p-values, although that would be a major waste of memory to create variables that contain a constant value:
tabulate x y [fw=f1], exact
return list
generate p1 = r(p_exact)
tabulate x y [fw=f2], exact
generate p2 = r(p_exact)
Here, [fw=variable] is a way to specify frequency weights; I typed return list to find out what kind of information Stata stores as the result of the procedure. THAT'S the object-like thing Stata works with. R would return the test results in the fisher.test()$p.value component, and Stata creates returned values, r(component) for simple commands and e(component) for estimation commands.
If you want a loop solution (if you have many sets), you can do this:
forvalues k=1/2 {
tabulate x y [fw=f`k'], exact
generate p`k' = r(p_exact)
}
That's the scripting capacity in which Stata, IMHO, is way stronger than R (although it can be argued that this is an extremely dirty programming trick). The local macro k takes values from 1 to 2, and this macro is substituted as ``k'` everywhere in the curly bracketed piece of code.
Alternatively, you can keep the results in Stata short term memory as scalars:
tabulate x y [fw=f1], exact
scalar p1 = r(p_exact)
tabulate x y [fw=f2], exact
scalar p2 = r(p_exact)
However, the scalars are not associated with the data set, so you cannot save them with the
data.
The immediate commands like cci suggested here would also have returned values that you can similarly retrieve.
HTH, Stas
Have a look the cci command with the exact option:
cci 10 15 30 10, exact
It is part of the so-called "immediate" commands. They allow you to do computations directly from the arguments rather than from data stored in memory. Have a look at help immediate
Each observation in the poster's original question apparently consisted of the four counts in one traditional 2 x 2 table. Stas's code applied to data of individual observations. Nick pointed out that -cci- can analyze a b c d data. Here's code that applies -cci to each table and, like Stas's code, adds the p-values to the data set. The forvalues i = 1/`=_N' statement tells Stata to run the loop from the first to the last observation. a[`i'] refers to the the value of the variable `a' in the i-th observation.
clear
input a b c d
10 2 8 4
5 8 2 1
end
gen exactp1 = .
gen exactp2 =.
label var exactp1 "1-sided exact p"
label var exactp2 "2-sided exact p"
forvalues i = 1/`=_N'{
local a = a[`i']
local b = b[`i']
local c = c[`i']
local d = d[`i']
qui cci `a' `b' `c' `d', exact
replace exactp1 = r(p1_exact) in `i'
replace exactp2 = r(p_exact) in `i'
}
list
Note that there is no problem in giving a local macro the same name as a variable.