I want to calculate the average over all members of the group I am in, excluding myself. Suppose that the group variable is called Group and I want to take the average of val1 by Group, excluding myself. The new column I wish to create is avg. The data look as follows (with the correct values of avg filled in so you can see what I mean).
Obs Group val1 avg
1 A 6 8
2 A 8 6
3 B 10 13
4 C 4 4
5 C 2 5
6 C 6 3
7 B 12 12
8 B 14 11
If I wanted to include myself in the calculation it would be straightforward. I would just do:
bysort Group: egen avg = mean(val1)
But how do I implement this with the wrinkle that I don't include myself?
One way is looping through all observations:
clear
set more off
*----- example data -----
input ///
Obs str1 Group val1 avg
1 A 6 8
2 A 8 6
3 B 10 13
4 C 4 4
5 C 2 5
6 C 6 3
7 B 12 12
8 B 14 11
end
list, sepby(Group)
*----- what you want -----
encode Group, gen(group)
gen avg2 = .
forvalues j = 1/`=_N' {
summarize val1 if group == group[`j'] & _n != `j', meanonly
replace avg2 = r(mean) in `j'
}
list, sepby(group)
Another way is using egen functions:
<snip>
*----- what you want -----
encode Group, gen(group)
bysort group: egen totval = total(val1)
by group: egen cval = count(val1)
generate avg2 = (totval - val1) / (cval - 1)
list, sepby(group)
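The egen approach rests on one identity: the mean of the other members equals (group total minus own value) divided by (group count minus one). As a language-neutral cross-check, here is a minimal Python sketch (not Stata; the function name leave_one_out_means is made up for illustration) that applies the identity to the example data. A group with a single member has no "others", so the sketch returns None there; in Stata the division by zero yields a missing value.

```python
from collections import defaultdict


def leave_one_out_means(groups, values):
    """Mean of the other members of each row's group: (total - own) / (count - 1)."""
    total = defaultdict(float)
    count = defaultdict(int)
    for g, v in zip(groups, values):
        total[g] += v
        count[g] += 1

    result = []
    for g, v in zip(groups, values):
        if count[g] > 1:
            result.append((total[g] - v) / (count[g] - 1))
        else:
            result.append(None)  # no other members in this group
    return result


# Example data from the question (Obs 1 to 8)
groups = ["A", "A", "B", "C", "C", "C", "B", "B"]
val1 = [6, 8, 10, 4, 2, 6, 12, 14]
avg2 = leave_one_out_means(groups, val1)
print(avg2)
```

The printed values should agree with the avg column in the question.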
There is a nice article available on the web that covers this topic:
Cox, N. J. 2014. Speaking Stata: Self and others. The Stata Journal 14(2): 432–444.
I'm trying to set any5 = 'Yes' if there is a 5 in any of the columns Q1 to Q5. However, my code below only reflects the last column.
data survey;
infile datalines firstobs=2;
input ID 3. Q1-Q5;
array score{5} _temporary_ (5,5,5,5,5);
array Ques{5} Q1-Q5;
do i =1 to 5;
if Ques{i} = score{i} then any5='Yes';
else any5='No';
end;
drop i;
datalines;
ID Q1 Q2 Q3 Q4 Q5
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
run;
Keep it simple :-)
data survey;
infile datalines;
input ID 3. Q1-Q5;
array Ques{*} Q1 - Q5;
any5 = ifc(5 in Ques, 'Yes', 'No');
datalines;
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
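Use the COUNTC function to count the number of times the character 5 appears in your Q1-Q5 columns, then use the IFC function to return a character value based on whether the expression is true, false, or missing. Note that this matches the digit 5 anywhere in the concatenated string, so it relies on every answer being a single digit.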
data survey;
infile datalines firstobs=2;
input ID 3. Q1-Q5;
any5 = ifc(countc(cats(of Q:),'5')>0,'Yes','No');
datalines;
ID Q1 Q2 Q3 Q4 Q5
535 1 3 5 4 2
12 5 5 4 4 3
723 2 1 2 1 1
7 3 5 1 4 2
;
run;
Result:
535 1 3 5 4 2 Yes
12 5 5 4 4 3 Yes
723 2 1 2 1 1 No
7 3 5 1 4 2 Yes
Use the WHICHN function to determine the index of the target value in a list of values.
In your case, test whether any index matches (the result is numeric, 1 or 0):
any5 = whichn (5, of ques(*)) > 0;
From the documentation:
WHICHN Function
Searches for a numeric value that is equal to the first argument, and
returns the index of the first matching value.
Syntax
WHICHN(argument, value-1 <, value-2, ...>)
It is a simple mistake in your logic. You are setting ANY5 to YES or NO every time through the loop. Because the loop keeps going even after a match is found, each pass overwrites the result of the previous one, so only the result of the last test survives.
Here is one way. Set the answer to NO before the loop and remove the ELSE clause.
any5='No ';
do i =1 to 5;
if Ques{i} = 5 then any5='Yes';
end;
Or stop when you have your answer.
do i =1 to 5 until(any5='Yes');
if Ques{i} = score{i} then any5='Yes';
else any5='No';
end;
Or skip the looping altogether.
if whichn(5, of Q1-Q5) then any5='Yes';
else any5='No';
Or even easier create any5 as numeric instead of character. SAS will return 1 for TRUE and 0 for FALSE as the result of a boolean expression.
any5 = ( 0 < whichn(5, of Q1-Q5) );
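To see why only the last test survives, the same control flow can be traced outside SAS. The following Python sketch (not SAS code; both function names are hypothetical) contrasts the loop that reassigns the flag on every pass with the version that sets the default first and only ever switches to 'Yes'.

```python
def any5_overwritten(answers):
    """Mimics the original loop: the flag is reassigned on every pass."""
    flag = "No"
    for a in answers:
        if a == 5:
            flag = "Yes"
        else:
            flag = "No"  # wipes out an earlier match
    return flag


def any5_fixed(answers):
    """Default to 'No' before the loop, then only ever switch to 'Yes'."""
    flag = "No"
    for a in answers:
        if a == 5:
            flag = "Yes"
    return flag


# Survey rows from the question, keyed by ID
rows = {
    535: [1, 3, 5, 4, 2],
    12: [5, 5, 4, 4, 3],
    723: [2, 1, 2, 1, 1],
    7: [3, 5, 1, 4, 2],
}
for row_id, answers in rows.items():
    print(row_id, any5_overwritten(answers), any5_fixed(answers))
```

Since Q5 is never 5 in the example rows, the overwriting version reports 'No' for every ID, which is the symptom described in the question.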
My data
I am working on a spell dataset in the following format:
cls
clear all
set more off
input id spellnr str7 bdate_str str7 edate_str employed
1 1 2008m1 2008m9 1
1 2 2008m12 2009m8 0
1 3 2009m11 2010m9 1
1 4 2010m10 2011m9 0
///
2 1 2007m4 2009m12 1
2 2 2010m4 2011m4 1
2 3 2011m6 2011m8 0
end
* translate to Stata monthly dates
gen bdate = monthly(bdate_str,"YM")
gen edate = monthly(edate_str,"YM")
drop *_str
format %tm bdate edate
list, sepby(id)
Corresponding to:
+---------------------------------------------+
| id spellnr employed bdate edate |
|---------------------------------------------|
1. | 1 1 1 2008m1 2008m9 |
2. | 1 2 0 2008m12 2009m8 |
3. | 1 3 1 2009m11 2010m9 |
4. | 1 4 0 2010m10 2011m9 |
|---------------------------------------------|
5. | 2 1 1 2007m4 2009m12 |
6. | 2 2 1 2010m4 2011m4 |
7. | 2 3 0 2011m6 2011m8 |
+---------------------------------------------+
Here a given person (id) can have multiple spells (spellnr) of two types (employed: 1 for employment; 0 for unemployment). The start and end dates of each spell are defined by bdate and edate, respectively.
Imagine the data was already cleaned, and is such that no spells overlap with each other.
There might be "missing" periods in between any two spells though.
This is captured by the dummy dataset above.
My question:
For each unemployment spell, I need to compute the number of months spent in employment in the last 6 months, 24 months, and 48 months.
Note that, importantly, each id can go in and out from employment, and all past employment spells should be taken into account (not just the last one).
In my example, this would lead to the following desired output:
+--------------------------------------------------------------+
| id spellnr employed bdate edate m6 m24 m48 |
|--------------------------------------------------------------|
1. | 1 1 1 2008m1 2008m9 . . . |
2. | 1 2 0 2008m12 2009m8 4 9 9 |
3. | 1 3 1 2009m11 2010m9 . . . |
4. | 1 4 0 2010m10 2011m9 6 11 20 |
|--------------------------------------------------------------|
5. | 2 1 1 2007m4 2009m12 . . . |
6. | 2 2 1 2010m4 2011m4 . . . |
7. | 2 3 0 2011m6 2011m8 5 20 44 |
+--------------------------------------------------------------+
My (working) attempt:
The following code returns the desired result.
* expand each spell to one observation per time unit (here "months"; works also for days)
expand edate-bdate+1
bysort id spellnr: gen spell_date = bdate + _n - 1
format %tm spell_date
list, sepby(id spellnr)
* fill-in empty months (not covered by spells)
xtset id spell_date, monthly
tsfill
* compute cumulative time spent in employment and lagged values
bysort id (spell_date): gen cum_empl = sum(employed) if employed==1
bysort id (spell_date): replace cum_empl = cum_empl[_n-1] if cum_empl==.
bysort id (spell_date): gen lag_7 = L7.cum_empl if employed==0
bysort id (spell_date): gen lag_24 = L25.cum_empl if employed==0
bysort id (spell_date): gen lag_48 = L49.cum_empl if employed==0
qui replace lag_7=0 if lag_7==. & employed==0 // fix computation for first spell of each "id" (if not enough time to go back with "L.")
qui replace lag_24=0 if lag_24==. & employed==0
qui replace lag_48=0 if lag_48==. & employed==0
* compute time spent in employment in the last 6, 24, 48 months, at the beginning of each unemployment spell
bysort id (spell_date): gen m6 = cum_empl - lag_7 if employed==0
bysort id (spell_date): gen m24 = cum_empl - lag_24 if employed==0
bysort id (spell_date): gen m48 = cum_empl - lag_48 if employed==0
qui drop if (spellnr==.)
qui bysort id spellnr (spell_date): keep if _n == 1
drop spell_date cum_empl lag_*
list
This works fine, but becomes quite inefficient when using daily data with several million observations. Can you suggest any alternative approach that does not involve expanding the dataset?
In words what I do above is:
I expand the data to have one row per month;
I fill in the "gaps" between the spells with -tsfill-;
I compute the running time spent in employment, and use lag operators to get the three quantities of interest.
This is in the vein of what was done in a past question that I posted. However, the working example there was unnecessarily complicated and contained some mistakes.
SOLUTIONS PERFORMANCE
I tried different approaches suggested in the accepted answer below (including using joinby as suggested in an earlier version of the answer). In order to create a larger dataset I used:
expand 500000
bysort id spellnr: gen new_id = _n
drop id
rename new_id id
which creates a dataset with 500,000 id's (for a total of 3,500,000 spells).
The first solution largely dominates the ones that use joinby or rangejoin (see also the comments to the accepted answer below).
The code below might save some running time.
bys id (employed): gen tag = _n if !employed
sum tag, meanonly
local maxtag = `r(max)'
foreach i in 6 24 48 {
gen m`i' = .
forval d = 1/`maxtag' {
by id: gen x = 1 + min(bdate[`d'],edate) - max(bdate[`d']-`i',bdate) if employed
egen y = total(x*(x>0)), by(id)
replace m`i' = y if tag == `d'
drop x y
}
}
sort id bdate
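The idea in the code above is to measure, for each employment spell, how much of it overlaps the look-back window of an unemployment spell, so no expansion to one row per month is needed. As a cross-check of that overlap arithmetic, here is a minimal Python sketch (not Stata; ym and months_employed_before are hypothetical helper names). It assumes, as the question states, that spells do not overlap, and it takes the window for a spell starting at month t to be the months t - w through t - 1.

```python
def ym(year, month):
    """Consecutive month index, similar in spirit to a Stata %tm date."""
    return year * 12 + (month - 1)


def months_employed_before(spells, start, window):
    """Months of employment falling in [start - window, start - 1]."""
    lo, hi = start - window, start - 1
    total = 0
    for bdate, edate, employed in spells:
        if employed:
            # Length of the intersection of [bdate, edate] with [lo, hi]
            total += max(0, min(hi, edate) - max(lo, bdate) + 1)
    return total


# Spells from the question: (bdate, edate, employed) per id
spells = {
    1: [(ym(2008, 1), ym(2008, 9), 1), (ym(2008, 12), ym(2009, 8), 0),
        (ym(2009, 11), ym(2010, 9), 1), (ym(2010, 10), ym(2011, 9), 0)],
    2: [(ym(2007, 4), ym(2009, 12), 1), (ym(2010, 4), ym(2011, 4), 1),
        (ym(2011, 6), ym(2011, 8), 0)],
}

results = {}
for pid, person in spells.items():
    for nr, (bdate, edate, employed) in enumerate(person, start=1):
        if not employed:
            results[(pid, nr)] = tuple(
                months_employed_before(person, bdate, w) for w in (6, 24, 48)
            )
print(results)
```

The tuples keyed by (id, spellnr) should match the m6, m24 and m48 columns of the desired output in the question.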
The same logic, combined with -rangejoin- (from SSC), may also be worth a try. Please provide some feedback after testing with your (large) actual data.
preserve
keep if employed
replace employed = 0
tempfile em
save `em'
restore
foreach i in 6 24 48 {
gen _bd = bdate - `i'
rangejoin edate _bd bdate using `em', by(id employed) p(_)
egen m`i' = total(_edate - max(_bd,_bdate)+1) if !employed, by(id bdate)
bys id bdate: keep if _n==1
drop _*
}
In my data, income was asked of only one person in each household.
householdID memberID income
1 1 4
2 2 .
1 2 .
2 3 .
2 1 3
But obviously, I need to fill in the missing values, like this:
householdID memberID income
1 1 4
2 2 3
1 2 4
2 3 3
2 1 3
How can I do this in Stata?
This is an elementary application of by:
bysort householdID (income) : replace income = income[1] if missing(income)
See this FAQ for related material.
A more circumspect approach would check that at most one non-missing value has been supplied for each household:
bysort householdID (income) : gen OK = missing(income) | (income == income[1])
list if !OK
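The two steps above (copy the one reported income to every household member, then check that no household reported two different values) can be sketched in Python as a cross-check. This is an illustration only, not Stata, and the function name fill_household_income is made up. One assumption differs from the Stata line: if a household did report conflicting values, the sketch keeps the first one encountered, whereas sorting on income makes Stata use the smallest.

```python
def fill_household_income(rows):
    """rows: list of (householdID, memberID, income or None).

    Returns the rows with missing incomes filled from the household's
    reported value, plus a sorted list of households with conflicting reports.
    """
    reported = {}
    conflicts = set()
    for hh, _, income in rows:
        if income is None:
            continue
        if hh in reported and reported[hh] != income:
            conflicts.add(hh)
        reported.setdefault(hh, income)  # keep the first reported value

    filled = [
        (hh, member, income if income is not None else reported.get(hh))
        for hh, member, income in rows
    ]
    return filled, sorted(conflicts)


# Example data from the question (None stands for a missing value)
rows = [(1, 1, 4), (2, 2, None), (1, 2, None), (2, 3, None), (2, 1, 3)]
filled, conflicts = fill_household_income(rows)
print(filled)
print(conflicts)
```

A household with no reported income at all stays missing (None), just as it would stay missing in Stata.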
Consider the following example:
input group day month year number treatment NUM
1 1 2 2000 1 1 2
1 1 6 2000 2 0 .
1 1 9 2000 3 0 .
1 1 5 2001 4 0 .
1 1 1 2010 5 1 1
1 1 5 2010 6 0 .
2 1 1 2001 1 1 0
2 1 3 2002 2 1 0
end
gen date = mdy(month,day,year)
format date %td
drop day month year
For each group, I have a varying number of observations. Each observation refers to an event that is specified with a date. Variable number is the numbering within each group.
Now, I want to count the number of observations that occur one year starting from the date of each treatment observation (excluding itself) within this group. This means, I want to create the variable NUM that I have already put into my example above. I do not care about the number of observations with treatment = 0.
EDIT Begin: The following information was found to be missing but necessary to tackle this problem: The treatment variable will have a value of 1 if there is no observation within the same group in the last year. Thus it is also not possible that the variable NUM will have to consider observations with treatment = 1. In principle, it is possible that there are two observations within a group that have identical dates. EDIT End
I have looked into Stata tip 51: Events in intervals. It seems to work; however, my dataset is huge (more than 1 million observations), so it is really inefficient, especially because I do not care about any of the treatment = 0 observations.
I was wondering if there is any alternative. My approach was to look for the observation with the latest date within each group that is still in the range of 1 year (and maybe store it in variable latestDate). Then I would simply take the difference between the value of number for the observation found and the value of number for the treatment observation.
Note: My "inefficient" code looks as follows
gsort -treatment
gen treatment_id = _n
replace treatment_id = . if treatment==0
gen count=.
sum treatment_id, meanonly
qui forval i = 1/`r(max)'{
count if inrange(date-date[`i'],1,365) & group == group[`i']
replace count = r(N) in `i'
}
sort group date
I am assuming that treatment can't occur within 1 year of the previous treatment (in the group). This is true in your example data, but may not be true in general. But, assuming that it is the case, then this should work. I'm using carryforward which is on SSC (ssc install carryforward). Like your latestDate thought, I determine one year after the most recent treatment and count the number of observations in that window.
sort group date
gen yrafter = (date + 365) if treatment == 1
by group: carryforward yrafter, replace
format yrafter %td
gen in_window = date <= yrafter & treatment == 0
egen answer = sum(in_window), by(group yrafter)
replace answer = . if treatment == 0
I can't promise this will be faster than a loop but I suspect that it will be.
The question is not completely clear.
Consider the following data with two different results, num2 and num3:
+-----------------------------------------+
| date2 group treat num2 num3 |
|-----------------------------------------|
| 01feb2000 1 1 3 2 |
| 01jun2000 1 0 . . |
| 01sep2000 1 0 . . |
| 01nov2000 1 1 0 0 |
| 01may2002 1 0 . . |
| 01jan2010 1 1 1 1 |
| 01may2010 1 0 . . |
|-----------------------------------------|
| 01jan2001 2 1 0 0 |
| 01mar2002 2 1 0 0 |
+-----------------------------------------+
The variable num2 is computed assuming you are interested in counting all observations that are within a one-year period after a treated observation (treat == 1), be those observations equal to 0 or 1 for treat. For example, after 01feb2000, there are three observations that comply with the time span condition; two have treat==0 and one has treat == 1, and they are all counted.
The variable num3 is also counting observations that are within a one-year period after a treated observation, but only the cases for which treat == 0.
num2 is computed with code in the spirit of the article you have cited. The use of in makes the run more efficient and there is no gsort (as in your code), which is quite slow. I have assumed that in each group there are no repeated dates:
clear
set more off
input ///
group str15 date count treat num
1 01.02.2000 1 1 2
1 01.06.2000 2 0 .
1 01.09.2000 3 0 .
1 01.11.2000 3 1 .
1 01.05.2002 4 0 .
1 01.01.2010 5 1 1
1 01.05.2010 6 0 .
2 01.01.2001 1 1 0
2 01.03.2002 2 1 0
end
list
gen date2 = date(date,"DMY")
format date2 %td
drop date count num
order date2
list, sepby(group)
*----- what you want -----
gen num2 = .
isid group date2, sort
forvalues j = 1/`=_N' {
count in `j'/L if inrange(date2 - date2[`j'], 1, 365) & group == group[`j']
replace num2 = r(N) in `j'
}
replace num2 = . if !treat
list, sepby(group)
num3 is computed with code similar in spirit (and results) to that posted by @jfeigenbaum:
<snip>
*----- what you want -----
isid group date2, sort
by group: gen indicat = sum(treat)
sort group indicat, stable
by group indicat: egen num3 = total(inrange(date2 - date2[1], 1, 365))
replace num3 = . if !treat
list, sepby(group)
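The two interpretations can be pinned down with a brute-force cross-check. The Python sketch below (not Stata; counts_after_treatment is a hypothetical name) counts, for each treated observation, the other observations in the same group dated 1 to 365 days later, either all of them (num2) or only those with treat == 0 (num3). It compares every pair of observations, so it is meant to clarify the definitions on the small example, not to be efficient on large data.

```python
from datetime import date


def counts_after_treatment(events, days=365, controls_only=False):
    """events: list of (group, date, treat).

    For each treated event, count events in the same group dated 1..days
    days later; untreated events get None.
    """
    out = []
    for g, d, treat in events:
        if not treat:
            out.append(None)
            continue
        n = 0
        for g2, d2, treat2 in events:
            if g2 == g and 1 <= (d2 - d).days <= days:
                if controls_only and treat2:
                    continue  # skip treated events for the num3 variant
                n += 1
        out.append(n)
    return out


# Modified example data from this answer
events = [
    (1, date(2000, 2, 1), 1), (1, date(2000, 6, 1), 0),
    (1, date(2000, 9, 1), 0), (1, date(2000, 11, 1), 1),
    (1, date(2002, 5, 1), 0), (1, date(2010, 1, 1), 1),
    (1, date(2010, 5, 1), 0),
    (2, date(2001, 1, 1), 1), (2, date(2002, 3, 1), 1),
]
num2 = counts_after_treatment(events)
num3 = counts_after_treatment(events, controls_only=True)
print(num2)
print(num3)
```

The two lists differ only for the first observation, where the treated event on 01nov2000 is counted in num2 but not in num3.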
Even more than two interpretations are possible for your problem, but I'll leave it at that.
(Note that I have changed your example data to include cases that probably make the problem more realistic.)
I have data like this
year month X Y weight
2013 1 1 0 1000
2001 12 0 1 2000
I want to create a variable Z based on the X and Y variables, conditional on year. I have two formulas, one for years before 2002 and one for years after. If I use egen with if,
if year > 2002 {
bysort year month :egen Z= total( x*weight)
}
else {
bysort year month : egen Z= total(y*weight*0.5)
}
This code is not going to work because, if year < 2002, Stata would report that Z has already been created. Is there any way to achieve the goal?
I used a very crude, brute-force way to solve this problem: I created two variables, z and z_2002, then replaced z with z_2002 if the year is less than 2002.
If I understand correctly, this should work.
Compute the products in a first step (conditional on the year) and the sums in a second step.
As other answers already note, there's a difference between the if qualifier and the if programming command. There's a short FAQ on this: http://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/.
(I use code provided by @NickCox in a comment to another answer.)
clear all
set more off
*----- example data -----
input year month x y weight
2013 1 1 0 1000
2013 1 1 0 800
2013 2 0 1 1200
2013 2 1 0 1400
2001 12 1 0 1500
2001 12 0 1 2000
2001 11 1 1 4000
end
sort year month
list, sepby(year month)
*----- computations -----
gen Z = cond(year > 2002, x * weight, y * weight * 0.5)
bysort year month: egen totZ = total(Z) // already sorted so -by- should be enough
list, sepby(year month)
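The two-step logic (pick the per-row term according to the year, then sum within year and month) can be checked by hand on the example data. Here is a minimal Python sketch of the same arithmetic; it is not Stata, and the function name term is hypothetical.

```python
from collections import defaultdict


def term(year, x, y, weight):
    """Per-row contribution: x*weight after 2002, y*weight*0.5 otherwise."""
    return x * weight if year > 2002 else y * weight * 0.5


# Example data from this answer: (year, month, x, y, weight)
rows = [
    (2013, 1, 1, 0, 1000),
    (2013, 1, 1, 0, 800),
    (2013, 2, 0, 1, 1200),
    (2013, 2, 1, 0, 1400),
    (2001, 12, 1, 0, 1500),
    (2001, 12, 0, 1, 2000),
    (2001, 11, 1, 1, 4000),
]

# Second step: sum the per-row terms within each (year, month) cell
totZ = defaultdict(float)
for year, month, x, y, weight in rows:
    totZ[(year, month)] += term(year, x, y, weight)

for key in sorted(totZ):
    print(key, totZ[key])
```

Because the choice of formula is made row by row, a single pass handles both periods, which is what the cond() call achieves in the Stata code.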
clear
input year month x y weight
2013 1 1 0 1000
2001 12 0 1 2000
end
preserve
keep if year>2002
bysort year month :egen z= total(x*weight)
tempfile t1
save `t1'
restore
keep if year<=2002
bysort year month : egen z= total(y*weight*0.5)
append using `t1'
list