I created a new variable from the mean of another variable using egen:
egen afd_lr2 = mean(afd_lire2w) if ost == 0
Now I would like to replace the values with the mean of another variable if ost == 1:
replace afd_lr2 = mean(afd_lireo) if ost ==1
This is not possible, as the mean function cannot be used with the replace command.
How can I achieve my goal?
The following works for me:
sysuse auto, clear
generate price2 = price + 5345
egen a_price = mean(price) if foreign == 0
egen b_price = mean(price2) if foreign == 1
replace a_price = b_price if foreign == 1
This should work:
egen afd_lr2 = mean(cond(ost == 0, afd_lire2w, cond(ost == 1, afd_lireo, .))), by(ost)
Here is a test:
clear
input float(group y1 y2)
1 42 .
1 42 .
2 . 999
2 . 999
end
egen mean = mean(cond(group == 1, y1, cond(group == 2, y2, .))), by(group)
tabdisp group, c(mean)
----------------------
group | mean
----------+-----------
1 | 42
2 | 999
----------------------
The key is that the mean() function of egen feeds on an expression, which can be more complicated than a single variable name. That said, this is trickier than I would generally advise, as
generate work = afd_lire2w if ost == 0
replace work = afd_lireo if ost == 1
egen mean = mean(work), by(ost)
is easier to understand and should occur to a programmer anyway.
I'm trying to divide the data by a certain datetime.
I've created e_time from what was originally a string, for example "2019-10-15 20:33:04".
To keep all the information in the string, including h:m:s, I used the following command to create a double:
gen double e_time = clock(event_timestamp, "YMDhms")
Now I get the result I want from format e_time %tc (human readable).
I want to generate a new variable that is 1 for anything later than 2019-10-15 and 0 for anything earlier.
I've tried
// 1
gen new_d = 0 if e_time < "1.887e+12"
replace new_d = 1 if e_time >= "1.887e+12"
// 2
gen new_d = 0 if e_time < "2019-10-15"
replace new_d = 1 if e_time > "2019-10-15"
However, I get the error message type mismatch.
I tried converting the string "2019-10-15" to a double with display, to check whether 1.887e+12 really means 2019-10-15, but I'm not sure how the command really works here.
Anyhow I tried
// 3
di clock("2019-10-15", "YMDhms")
but it didn't work.
Can anyone give advice on comparing dates that are in a double format properly?
Your post is a little hard to follow (a reproducible data example would help a lot) but the error type mismatch is because e_time is numeric, and "2019-10-15" is a string.
I suggest the following:
clear
input str20 datetime
"2019-10-14 20:33:04"
"2019-10-16 20:33:04"
end
* Keep first 10 characters
gen date = substr(datetime,1,10)
* Check that all strings are 10 characters
assert length(date) == 10
* Convert from string to numeric date variable
gen m = substr(date,6,2)
gen d = substr(date,9,2)
gen y = substr(date,1,4)
destring m d y, replace
gen newdate = mdy(m,d,y)
format newdate %d
gen wanted = newdate >= mdy(10,15,2019) & !missing(newdate)
drop date m d y
list
+------------------------------------------+
| datetime newdate wanted |
|------------------------------------------|
1. | 2019-10-14 20:33:04 14oct2019 0 |
2. | 2019-10-16 20:33:04 16oct2019 1 |
+------------------------------------------+
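Since the question already has a numeric %tc variable, the comparison can also be made without going back to the string. The following is a minimal sketch, not the answerer's method: it rebuilds the two example rows using the variable names event_timestamp and e_time from the question, then compares against a datetime literal written with tc(), or against a daily date after converting with dofc(). The names new_d and new_d2 are only illustrative. Attempt 3 in the question most likely displayed a missing value because the mask "YMDhms" expects a time component that "2019-10-15" does not have; the mask "YMD" matches that string.

```stata
clear
input str19 event_timestamp
"2019-10-14 20:33:04"
"2019-10-16 20:33:04"
end

* datetime in milliseconds, stored as a double as in the question
gen double e_time = clock(event_timestamp, "YMDhms")
format e_time %tc

* compare numeric with numeric: tc() writes a datetime literal
gen byte new_d = e_time >= tc(15oct2019 00:00:00) if !missing(e_time)

* alternative: convert the datetime to a daily date, then compare dates
gen byte new_d2 = dofc(e_time) >= mdy(10, 15, 2019) if !missing(e_time)

* inspect the cutoff as a raw number and as a formatted datetime
display %15.0f tc(15oct2019 00:00:00)
display %tc clock("2019-10-15", "YMD")

list
```

The if !missing() qualifier keeps missing datetimes missing in the new variable, rather than coding them as 1, which is what a bare >= comparison would do.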
I'm looking at the Current Population Survey in Stata, although this question could apply to any survey with individual weights.
It's straightforward to generate a table showing the mean of a variable -- say wages -- over time given individual weights:
table qtr [aw=pworwgt], contents(mean wage)
What I'd like to do automatically is show the average level of, in this example, wages, but with the proportions of certain categories fixed to a date.
So for example, let's say I have 6 educational categories (Less than HS, HS, Some College, AA, BA/BS, Grad School)... I'd want to see how wages would be different if I fixed the educational proportions of the workforce to their, say, 2005 levels.
Ideally, the solution would not be resource intensive for large-numbered categories. For example, I might want to do something similar with the CPS's detail occupational metric, which has hundreds of levels.
My gut tells me "margins" may be part of the solution but I'm not familiar enough with that command... also, I'd like to be able to generate table output so I can graph in other software.
ETA: Here's the way I tried to do this for fixing weights by age and sex: by cycling through all the data, comparing the contemporaneous proportions to the base quarter proportions, and then adjusting the individual weights accordingly. This takes a really long time to cycle through however.
local start = tq(1994q1)
local end = tq(2014q4)
local base = tq(2006q1)
tempvar pop2006
tempvar cohort2006
tempvar poptemp
gen pworwgt_a = pworwgt
levelsof pesex, local(sex)
sum pworwgt if qtr == `base'
gen `pop2006' = r(N)*r(mean)
gen `cohort2006' = .
gen `poptemp' = .
forvalues age = 16/85 {
foreach s in `sex' {
sum pworwgt if age == `age' & pesex == `s' & qtr == `base'
replace `cohort2006' = r(N)*r(mean)/`pop2006'
forvalues q = `start'/`end' {
sum pworwgt if qtr == `q'
replace `poptemp' = r(N)*r(mean)
sum pworwgt if age == `age' & pesex == `s' & qtr == `q'
replace pworwgt_a = pworwgt_a*`cohort2006'/((r(N)*r(mean))/`poptemp') if age == `age' & pesex == `s' & qtr == `q'
}
}
}
I don't have scope to test this, but here are suggested simplifications to the code segment. I don't address the main question, which I don't understand, partly because there is no precise description of data structure in the question.
To summarize suggestions:
Use summarize, meanonly when a mean or sum is all you need, and use r(sum) rather than r(N)*r(mean).
Use scalars not variables for constants.
Shift repeated calculations to once-and-for-all calculations of variables. I think you can do even more of this, but I will stop here.
local start = tq(1994q1)
local end = tq(2014q4)
local base = tq(2006q1)
tempname pop2006 cohort2006
tempvar qassum qsum
// quarter-age-sex sums in a single variable
bysort qtr age pesex : gen double `qassum' = sum(pworwgt)
by qtr age pesex : replace `qassum' = `qassum'[_N]
// quarterly sums in a single variable
by qtr: gen double `qsum' = sum(pworwgt)
by qtr: replace `qsum' = `qsum'[_N]
gen pworwgt_a = pworwgt
levelsof pesex, local(sex)
sum pworwgt if qtr == `base', meanonly
scalar `pop2006' = r(sum)
forvalues age = 16/85 {
foreach s in `sex' {
sum pworwgt if age == `age' & pesex == `s' & qtr == `base', meanonly
scalar `cohort2006' = r(sum)/`pop2006'
replace pworwgt_a = pworwgt_a*`cohort2006'/(`qassum'/`qsum') if age == `age' & pesex == `s'
}
}
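Taking the "once-and-for-all calculations" idea one step further, the remaining loops can probably be dropped as well. This is an untested sketch under the same assumptions about the data as the code above (one row per person-quarter with qtr, age, pesex and pworwgt). The names cellsum, qsum, share and share_base are made up for illustration. Unlike the original loop, it does not restrict age to 16/85, and any age-sex cell that is empty in the base quarter ends up with a missing adjusted weight.

```stata
local base = tq(2006q1)

* weight totals per quarter-age-sex cell and per quarter
bysort qtr age pesex: egen double cellsum = total(pworwgt)
bysort qtr: egen double qsum = total(pworwgt)

* share of each age-sex cell within its own quarter
gen double share = cellsum / qsum

* base-quarter share of each age-sex cell, copied to every quarter
* (missing values sort last, so the first value in each cell is the base share)
gen double share_base = share if qtr == `base'
bysort age pesex (share_base): replace share_base = share_base[1]

* rescale weights so every quarter has the base-quarter age-sex mix
gen double pworwgt_a = pworwgt * share_base / share
```

The same pattern should carry over to other categorical variables, such as education or occupation, by replacing age pesex in the two bysort prefixes.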
I have three string variables of the length 2 and I need to get (a) all possible permutations of the three variables (keeping the order of strings within each variable fixed), (b) all possible variable pairs. Small number of variables allows me to do it manually, but I was wondering if there is a more elegant and concise way of solving this.
It is currently coded as:
egen perm1 = concat(x1 x5 x9)
egen perm2 = concat(x1 x9 x5)
egen perm3 = concat(x5 x1 x9)
egen perm4 = concat(x5 x9 x1)
egen perm5 = concat(x9 x5 x1)
egen perm6 = concat(x9 x1 x5)
gen tuple1 = substr(perm1,1,4)
gen tuple2 = substr(perm2,3,4)
gen tuple3 = substr(perm3,1,4)
gen tuple4 = substr(perm4,3,4)...
An abstract from a resulting table illustrates the desired outcome:
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
| x1 | x5 | x9 | perm1 | perm2 | perm3 | perm4 | perm5 | perm6 | tuple1 | tuple2 |
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
| 01 | 05 | 09 | 010509 | 010905 | 050109 | 050901 | 090501 | 090105 | 0105 | 0509 |
+----+----+----+--------+--------+--------+--------+--------+--------+--------+--------+
Neat question. I don't know if there's a "built in" way to do permutations, but the following should do it.
You want to loop over all your variables, but make sure that you don't get duplicates. As the dimensions increase this gets tricky. What I do is loop over the same list and, each time, remove the current variable from the list used by the nested loop.
Unfortunately, this still requires you to write each loop structure, but this should be easy enough to cut-paste-find-replace.
clear
set obs 100
generate x1 = "01"
generate x5 = "05"
generate x9 = "09"
local vars x1 x5 x9
local i = 0
foreach a of varlist `vars' {
local bs : list vars - a
foreach b of varlist `bs' {
local cs : list bs - b
foreach c of varlist `cs' {
local ++i
egen perm`i' = concat(`a' `b' `c')
}
}
}
Edit: Re-reading the question, I'm not clear on what you want (since row1_1 isn't one of your concatenated variables). Note that if you really want the "drop one" permutations, then just remove one variable from the concat call. This works because the number of "n permute n" arrangements is the same as the number of "n permute n-1" arrangements. That is, there are 6 3-item permutations of 3 items. There are also 6 2-item permutations of 3 items. So the innermost line becomes
egen perm`i' = concat(`a' `b')
Sorry that title is confusing. Hopefully it's clear below.
I'm using Stata and I'd like to assign the value 1 to a variable that depends on the value within a different variable. I have 20 order variables and also 20 corresponding variables. For example if order1 = 3, I'd like to assign variable3 = 1. Below is a snippet of what the final dataset would look like if I had only 3 of each variable.
Right now I'm doing this with two loops, but I have to wrap another loop around them that goes through this 9 more times, and I'm doing this for a couple hundred data files. I'd like to make it more efficient.
forvalues i = 1/20 {
forvalues j = 1/20 {
replace variable`j' = 1 if order`i'==`j'
}
}
Is it possible to use the value of order`i' to assign the variable[order`i'VALUE] directly? Then I can get rid of the j loop above. Something like this:
forvalues i = 1/20 {
replace variable[`order`i'value] = 1
}
Thanks for your help!
CLARIFICATION ADDED Feb 2nd.
I simplified my problem and the dataset too much: the suggested solutions work for what I presented, but they do not get at what I'm really attempting to do. Thank you three for your solutions though. I was not clear enough in my post.
To clarify, my data doesn't have a one to one correspondence of each order# assigning variable# a 1 if it's not missing. For example, the first observation for order1=3, variable1 isn't supposed to get a 1, variable3 should get a 1. What I didn't include in my original post is that I'm actually checking for other conditions to set it equal to 1.
For more background, I'm counting up births of women by birth order(1st child, 2nd child, etc) that occurred at different ages of mothers. So in the data, each row is a woman, each order# is the number birth (order1=3, it's her third child). The corresponding variable#s are the counts (variable# means the woman has a child of birth order #). I mentioned in the post, that I do this 9 times bc I'm doing it for 5 year age groups (15-19; 20-24; etc). So the first set of variable# would be counts of birth by order when women were ages 15-19; the second set of variable# would be counts of births by order when women were 20-24. etc etc. After this, I sum up the counts in different ways (by woman's education, geography, etc).
So with the additional loop what I do is something more like this
forvalues k = 1/9{
forvalues i = 1/20 {
forvalues j = 1/20 {
replace variable`k'_`j' = 1 if order`i'==`j' & age`i'==`k' & birth_age`i'<36
}
}
}
Not sure if it's possible, but I wanted to simplify so I only need to cycle through each child once, without cycling through the birth orders and directly use the value in order# to assign a 1 to the correct variable. So if order1=3 and the woman had the child at the specific age group, assign variable[agegroup][3]=1; if order1=2, then variable[agegroup][2] should get a 1.
forvalues k=1/9{
forvalues i = 1/20 {
replace variable`k'_[`order`i'value] = 1 if age`i'==`k' & birth_age`i'<36
}
}
I would reshape twice. First reshape to long, then condition variable on !missing(order), then reshape back to wide.
* generate your data
clear
set obs 3
forvalues i = 1/3 {
generate order`i' = .
local k = (3 - `i' + 1)
forvalues j = 1/`k' {
replace order`i' = (`k' - `j' + 1) if (_n == `j')
}
}
list
*. list
*
* +--------------------------+
* | order1 order2 order3 |
* |--------------------------|
* 1. | 3 2 1 |
* 2. | 2 1 . |
* 3. | 1 . . |
* +--------------------------+
* I would reshape to long, then back to wide
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
order order* variable*
drop id
list
*. list
*
* +-----------------------------------------------------------+
* | order1 order2 order3 variab~1 variab~2 variab~3 |
* |-----------------------------------------------------------|
* 1. | 3 2 1 1 1 1 |
* 2. | 2 1 . 1 1 0 |
* 3. | 1 . . 1 0 0 |
* +-----------------------------------------------------------+
Using a simple forvalues loop with generate and missing() is orders of magnitude faster than the other solutions proposed so far. For this problem you need only one loop to traverse the complete list of variables, not two, as in the original post. The code below shows both points.
*----------------- generate some data ----------------------
clear all
set more off
local numobs 60
set obs `numobs'
quietly {
forvalues i = 1/`numobs' {
generate order`i' = .
local k = (`numobs' - `i' + 1)
forvalues j = 1/`k' {
replace order`i' = (`k' - `j' + 1) if (_n == `j')
}
}
}
timer clear
*------------- method 1 (gen + missing()) ------------------
timer on 1
quietly {
forvalues i = 1/`numobs' {
generate variable`i' = !missing(order`i')
}
}
timer off 1
* ----------- method 2 (reshape + missing()) ---------------
drop variable*
timer on 2
quietly {
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
}
timer off 2
*--------------- method 3 (egen, rowmax()) -----------------
drop variable*
timer on 3
quietly {
// loop over the order variables creating dummies
forvalues v=1/`numobs' {
tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables
// (may need to change)
forvalues l=1/`numobs' {
egen variable`l' = rmax(var*_`l')
drop var*_`l'
}
}
timer off 3
*----------------- method 4 (original post) ----------------
drop variable*
timer on 4
quietly {
forvalues i = 1/`numobs' {
gen variable`i' = 0
forvalues j = 1/`numobs' {
replace variable`i' = 1 if order`i'==`j'
}
}
}
timer off 4
*-----------------------------------------------------------
timer list
The timed procedures give
. timer list
1: 0.00 / 1 = 0.0010
2: 0.30 / 1 = 0.3000
3: 0.34 / 1 = 0.3390
4: 0.07 / 1 = 0.0700
where timer 1 is the simple gen, timer 2 the reshape, timer 3 the egen, rowmax(), and timer 4 the original post.
The reason you need only one loop is that Stata executes each command for all observations in the dataset, from top (first observation) to bottom (last observation). For example, variable1 is generated according to whether order1 is missing or not; this is done for all observations of both variables, without an explicit loop.
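As a small illustration of that point, here is a sketch on a made-up three-row dataset (the values are arbitrary; only the names order1 and variable1 come from the post). A single generate fills in every observation, so no loop over rows is needed.

```stata
clear
set obs 3

* order1 is 1 and 3 in rows 1 and 3, and missing in row 2
generate order1 = cond(_n == 2, ., _n)

* one command, evaluated for every observation in turn
generate variable1 = !missing(order1)

list
```

After this runs, variable1 should be 1 in the first and third rows and 0 in the second.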
I wonder if you actually need to do this. For future questions, if you have a further goal in mind, I think a good strategy is to mention it in your post.
Note: I've reused code from other posters' answers.
Here's a simpler way to do it (that still requires 2 loops):
// loop over the order variables creating dummies
forvalues v=1/20 {
tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables (may need to change)
forvalues l=1/3 {
egen variable`l' = rmax(var*_`l')
drop var*_`l'
}
EDIT: Thanks to Joe's advice, I will make my question more specific. Actually I need to code a function in Stata which takes variables A,B,C,D,... as inputs and a variable Y as output which can be evaluated with usual Stata functions/commands like "generate dummy=2*myfun(X) if ..."
The function itself contains numerical calculations. A pseudo Stata code will look like
myfun(X)
gen Y=0.5*X if X==1
replace Y=31-X if X==2
replace Y=X-2 if X==3
.... a long list
return(Y)
Notice that X can be a huge set of different Stata variables and the numerical calculations are rather long inside the function. That's why I would like to use a function. I guess that the native "program" command in Stata is not suitable for this type of problem because it cannot take variables as input/output.
(ANSWER TO ORIGINAL QUESTION)
I have never used SAS, but at a wild guess you want something like
foreach v in A B C D {
gen test`v' = 0.5 * (`v' == 1) + 0.6 * (`v' == 2) + 0.7 * (`v' == 3)
}
or
foreach v in A B C D {
gen test`v' = cond(`v' == 1, 0.5, cond(`v' == 2, 0.6, cond(`v' == 3, 0.7, .)))
}
But hang on; that middle line also looks like
gen test`v' = (4 + `v') / 10
(ANSWER TO COMPLETELY DIFFERENT REVISED QUESTION)
This can be done in various ways. As above you could have a loop
foreach v in A B C D {
gen test`v' = 0.5 * `v' if `v' == 1
replace test`v' = 31 - `v' if `v' == 2
replace test`v' = `v' - 2 if `v' == 3
}
The question says "I guess that the native "program" command in Stata is not suitable for this type of problem because it cannot take variables as input/output." That guess is completely incorrect. You could write a program to do this too. This example is schematic, not definitive. A real program would include more checks and error messages to match any incorrect input. For detailed advice, you really need to read the documentation. One answer on SO can't teach you all you need to know even to write simple Stata programs. In any case, the example is evidently frivolous and/or incomplete, so a complete working example would be pointless or impossible.
program myweirdexample
version 13
syntax varlist(numeric), Generate(namelist)
local nold : word count `varlist'
local nnew : word count `generate'
if `nold' != `nnew' {
di as err "`generate' does not match `varlist'"
exit 198
}
local i = 1
quietly foreach v of local varlist {
local new : word `i' of `generate'
gen `new' = 0.5 * `v' if `v' == 1
replace `new' = 31 - `v' if `v' == 2
replace `new' = `v' - 2 if `v' == 3
local ++i
}
end
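For illustration, and assuming the program above has already been defined in the current session, a call might look like the sketch below. The toy data and the new variable names testA and testB are invented for the example.

```stata
clear
input A B
1 2
2 3
3 1
end

* one new variable name per input variable, in the same order
myweirdexample A B, generate(testA testB)

* by the rules in the program: 1 -> 0.5, 2 -> 29, 3 -> 1
list
```

Because the option is declared as Generate(), it can be abbreviated, so g(testA testB) should work as well. Values of A or B other than 1, 2 or 3 would leave the new variable missing.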
Footnote on terminology: The question uses the term function more broadly than it is used in Stata. In Stata, commands and functions are distinct; "function" is not a synonym for command.
Second footnote: Check out recode. It may be what you need, but it is best for mapping integer codes to other integer codes.
Third footnote: An example of a needed check is that the argument of generate() should be variable names that are legal and new.
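To make the recode suggestion in the second footnote concrete, here is a minimal sketch using the auto dataset shipped with Stata. The grouping of rep78 and the new variable name rep3 are arbitrary choices for the example and have nothing to do with the mapping in the question.

```stata
sysuse auto, clear

* map integer codes to other integer codes, writing to a new variable
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3)

* cross-tabulate old against new codes to check the mapping
tabulate rep78 rep3, missing
```

With generate(), the original variable is left untouched, and missing values of rep78 stay missing in rep3 because no rule covers them.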