Generate new variable using min/max in Stata - stata

After many years away from Stata I am currently editing code which repeatedly does something like this:
egen min = min(x)
egen max = max(x)
generate xn = (x - min) / (max - min)
drop min max
I want to reduce this code to one line. But neither of the two "natural" ways that come to my mind work.
gen x_index = (x - min(x)) / (max(x)- min(x))
egen x_index = (x - min(x)) / (max(x)- min(x))
What pieces of the Stata logic am I missing?

The Stata functions max() and min() require two or more arguments and operate rowwise (across observations) if given a variable as any one of the arguments. Documented at e.g. help max().
The egen functions max() and min() can only be used within egen calls. They could be applied with single variables, but their use to calculate single maxima or minima is grossly inefficient unless exceptionally it is essential to store the single result in a variable. Documented except for the warnings at help egen.
Neither approach you consider will work without becoming more roundabout. Consider
su x, meanonly
gen x_index = (x - r(min)) / (r(max)- r(min))
In some circumstances it might be more efficient to calculate the range just once:
su x, meanonly
scalar range = r(max) - r(min)
gen x_index = (x - r(min)) / range
In a program it would usually be better to give the scalar a temporary name.
Within egen calls, an egen function can be called only once.

Related

How to take maximum of rolling window without SSC packages

How can I create a variable in Stata that contains the maximum of a dynamic rolling window applied to another variable? The rolling window must be able to change iteratively within a loop.
max(numvar, L1.numvar, L2.numvar) will give me what I want for a single window size, but how can I change the window size iteratively within the loop?
My current code for calculating the rolling sum (credit to #Nick Cox for the algorithm):
generate var1lagged = 0
forval k = -2(1)2 {
if `k' < 0 {
local k1 = -(`k')
replace var1lagged = var1lagged + L`k1'.var1
}
else replace var1lagged = var1lagged + F`k'.var1
}
How would I achieve the same sort of flexibility but with the maximum, minimum, or average of the window?
In the simplest case suppose K at least 1 is given as the number of lags in the window
local arg L1.numvar
forval k = 2/`K' {
local arg `arg', L`k'.numvar
}
gen wanted = max(`arg')
If the window includes the present value, that is just a twist
local arg numvar
forval k = 1/`K' {
local arg `arg', L`k'.numvar
}
gen wanted = max(`arg')
More generally, numvar would not be a specific variable name, but would be a local macro including such a name.
EDIT 1
This returns missing as a result only if all arguments are missing. If you wanted to insist on missing as a result if any argument is missing, then go
gen wanted = cond(missing(`arg'), ., max(`arg'))
EDIT 2
Check out rolling more generally. Otherwise for a rolling mean you calculate directly you need to work out (1) the sum as in the question (2) the number of non-missing values.
The working context of the OP evidently rules out installing community-contributed commands; otherwise I would recommend rangestat and rangerun (SSC). Note that many community-contributed commands have been published via the Stata Journal, GitHub or user sites.

Generate difference between observations as new variable in Stata

I am trying to get the difference between the natural logarithms of two consecutive observations for a set of variables.
My approach is as follows
. gen abandon_qry_ln = ln(abandon_qry) - ln(abandon_qry) [_n-1]
But I get the error weights not allowed.
Any idea what could be the issue?
You could work with
gen difference = ln(abandon_qry) - ln(abandon_qry[_n-1])
or
gen ln_abandon_qry = ln(abandon_qry)
gen difference = ln_abandon_qry - ln_abandon_qry[_n-1]
You were trying to subscript an expression. You may subscript a variable or a matrix in Stata, but not in general an expression.

How do I take data from one observation and apply it to one other observation within a group?

An unmarried couple is living together in a house with other people. To isolate how much that couple makes I need to add the two incomes together. I am using variables that act as pointers that give the partners_id. Using the partners_id, id , and individual_income how do I apply partner's income to his/her partner?
This was my attempt below:
summarize id, meanonly
capture gen partners_income = 0
forvalue ln = 1/`r(max)' {
bys household (id): ///
egen link_`ln' = total(individual_income) if partners_location==`ln')
replace partners_income = link_`ln' if link_`ln' > 0 & id == `ln'
drop link_*
}
There is general advice in this FAQ.
It can take longer to write a smart way to do this than to use a quick-and-dirty approach.
However, there is a smarter way.
Brute solution
Quick here means relatively quick to code; this isn't guaranteed quick for a very large dataset.
gen partners_income = .
gen problem = 0
The proper initialisation of the partner's income variable is to missing, not zero. Not knowing an income and the income being zero are different conditions. For example, if someone doesn't have a partner, the income will certainly be missing. (If at a later stage, you want to treat missings as zeros, that's up to you, but you should keep them distinct at this stage.)
The reason for the problem variable will become apparent.
I can't see a reason for your capture.
Now we can loop:
quietly forval i = 1/`=_N' {
su individual_income if id == partners_id[`i'], meanonly
replace partners_income = r(max) in `i'
if r(N) > 1 replace problem = r(N) in `i'
}
So, the logic is
foreach observation
find the partner's identifier
find that income: summarize, meanonly is fast
that should be one value, so it should be immaterial whether we pick it up from the results of summarize as the maximum, minimum, or mean
but if summarize finds more than one value, something is not as assumed (mistakes over identifiers, or multiple partners); later we edit if problem and look at those observations.
Notes:
We can make comparison safer by restricting computations to the same household by modifying
if id == partners_id[`i']
to
if id == partners_id[`i'] & household == household[`i']
In one place you have the variable partners_location which looks like a typo for partners_id.
Cute solution
Assuming that partners name each other as partner (and this is not the forum to explore exceptions), then couples have a joint identity which we obtain by sorting "John Joanna" and "Joanna John" to "Joanna John" or the equivalent with numeric identifiers:
gen first = cond(id < partner_id, id, partner_id)
gen second = cond(id < partner_id, partner_id, id)
egen joint = concat(first second), p(" ")
first and second just mean in numeric or alphanumeric order; this works for numeric and string identifiers. You may need to slap on an exclusion clause such as
if !missing(partner_id)
Now
bysort household joint : gen partners_income = income[3 - _n] if _N == 2
Get it? Each distinct combination of household and joint should be precisely 2 observations for us to be interested (hence the qualifier if _N == 2). If that's true then 3 - _n gives us the subscript of the other partner as if _n is 1 then 3 - _n is 2 and vice versa. Under by: subscripts are always applied within groups, so that _n runs 1, 2, and so forth in each distinct group.
If this seems cryptic, it is all spelled out in Cox, N.J. 2008. The problem of split identity, or how to group dyads. Stata Journal 8(4): 588-591 which is accessible as a .pdf.

Getting unknown function mean() in a forvalues loop

Getting unknown function mean for this. Can't use egen because it has to be calculated for each value. A little confused.
edu_mov_avg=.
forvalues current_year = 2/133 {
local current_mean = mean(higra) if longitbirthqtr >= current_year - 2 & longitbirthqtr >= current_year + 2
replace edu_mov_avg = current_mean if longitbirthqtr =
}
Your code is a long way from working. This should be closer.
gen edu_mov_avg = .
qui forvalues current_qtr = 2/133 {
su higra if inrange(longitbirthqtr, `current_qtr' - 2, `current_qtr' + 2), meanonly
replace edu_mov_avg = r(mean) if longitbirthqtr == `current_qtr'
}
You need to use a command generate to produce a new variable.
You need to reference local macro values with quotation marks.
egen has its own mean() function, but it produces a variable, whereas you need a constant here. Using summarize, meanonly is the most efficient method. There is in Stata no mean() function that can be applied anywhere. Once you use summarize, there is no need to use a local macro to hold its results. Here r(mean) can be used directly.
You have >= twice, but presumably don't mean that. Using inrange() is not essential in writing your condition, but gives shorter code.
You can't use if qualifiers to qualify assignment of local macros in the way you did. They make no sense to Stata, as such macros are constants.
longitbirthqtr looks like a quarterly date. Hence I didn't use the name current_year.
With a window this short, there is an alternative using time series operators
tsset current_qtr
gen edu_mov_avg = (L2.higra + L1.higra + higra + F1.higra + F2.higra) / 5
That is not exactly equivalent as missings will be returned for the first two observations and the last two.
Your code may need further work if your data are panel data. But the time series operators approach remains easy so long as you declare the panel identifier, e.g.
tsset panelid current_qtr
after which the generate call is the same as above.
All that said, rolling offers a framework for such calculations.

Perform Fisher Exact Test from aggregated using Stata

I have a set of data like below:
A B C D
1 2 3 4
2 3 4 5
They are aggregated data which ABCD constitutes a 2x2 table, and I need to do Fisher exact test on each row, and add a new column for the p-value of the Fisher exact test for that row.
I can use fisher.exact and loop to do it in R, but I can't find a command in Stata for Fisher exact test.
You are thinking in R terms, and that is often fruitless in Stata (just as it is impossible for a Stata guy to figure out how to do by ... : regress in R; every package has its own paradigm and its own strengths).
There are no objects to add columns to. May be you could say a little bit more as to what you need to do, eventually, with your p-values, so as to find an appropriate solution that your Stata collaborators would sympathize with.
If you really want to add a new column (generate a new variable, speaking Stata), then you might want to look at tabulate and its returned values:
clear
input x y f1 f2
0 0 5 10
0 1 7 12
1 0 3 8
1 1 9 5
end
I assume that your A B C D stand for two binary variables, and the numbers are frequencies in the data. You have to clear the memory, as Stata thinks about one data set at a time.
Then you could tabulate the results and generate new variables containing p-values, although that would be a major waste of memory to create variables that contain a constant value:
tabulate x y [fw=f1], exact
return list
generate p1 = r(p_exact)
tabulate x y [fw=f2], exact
generate p2 = r(p_exact)
Here, [fw=variable] is a way to specify frequency weights; I typed return list to find out what kind of information Stata stores as the result of the procedure. THAT'S the object-like thing Stata works with. R would return the test results in the fisher.test()$p.value component, and Stata creates returned values, r(component) for simple commands and e(component) for estimation commands.
If you want a loop solution (if you have many sets), you can do this:
forvalues k=1/2 {
tabulate x y [fw=f`k'], exact
generate p`k' = r(p_exact)
}
That's the scripting capacity in which Stata, IMHO, is way stronger than R (although it can be argued that this is an extremely dirty programming trick). The local macro k takes values from 1 to 2, and this macro is substituted as ``k'` everywhere in the curly bracketed piece of code.
Alternatively, you can keep the results in Stata short term memory as scalars:
tabulate x y [fw=f1], exact
scalar p1 = r(p_exact)
tabulate x y [fw=f2], exact
scalar p2 = r(p_exact)
However, the scalars are not associated with the data set, so you cannot save them with the
data.
The immediate commands like cci suggested here would also have returned values that you can similarly retrieve.
HTH, Stas
Have a look the cci command with the exact option:
cci 10 15 30 10, exact
It is part of the so-called "immediate" commands. They allow you to do computations directly from the arguments rather than from data stored in memory. Have a look at help immediate
Each observation in the poster's original question apparently consisted of the four counts in one traditional 2 x 2 table. Stas's code applied to data of individual observations. Nick pointed out that -cci- can analyze a b c d data. Here's code that applies -cci to each table and, like Stas's code, adds the p-values to the data set. The forvalues i = 1/`=_N' statement tells Stata to run the loop from the first to the last observation. a[`i'] refers to the the value of the variable `a' in the i-th observation.
clear
input a b c d
10 2 8 4
5 8 2 1
end
gen exactp1 = .
gen exactp2 =.
label var exactp1 "1-sided exact p"
label var exactp2 "2-sided exact p"
forvalues i = 1/`=_N'{
local a = a[`i']
local b = b[`i']
local c = c[`i']
local d = d[`i']
qui cci `a' `b' `c' `d', exact
replace exactp1 = r(p1_exact) in `i'
replace exactp2 = r(p_exact) in `i'
}
list
Note that there is no problem in giving a local macro the same name as a variable.