I am performing an event study; see the reproducible example below. I include only one unit, but this is enough for the question I'm asking.
input unit year treatment
1 2000 0
1 2001 0
1 2002 1
1 2003 0
1 2004 0
1 2005 1
1 2006 0
1 2007 0
end
I generate dif_year, which should hold the difference in years relative to the treatment year:
sort unit year
bysort unit: gen year_nb = _n
bysort unit: gen year_target = year_nb if treatment == 1
by unit: egen target_distance = min(year_target)
drop year_target
gen dif_year = year_nb - target_distance
drop year_nb target_distance
It works well with one treatment per unit, but here I have two. Using the code snippet above, I get the following result:
unit   year   treatment   dif_year
1      2000   0           -2
1      2001   0           -1
1      2002   1            0
1      2003   0            1
1      2004   0            2
1      2005   1            3
1      2006   0            4
1      2007   0            5
You can see that it is anchored to the first treatment (2002) but ignores the second one (2005). How can I adapt dif_year to make it work with multiple treatments (here, in 2005)? The values for 2003 and before are correct, but I would expect to get -1 for 2004, 0 for 2005, 1 for 2006 and 2 for 2007.
This solution uses no loops. Evidently the problem hinges on looking backwards as well as forwards; hence reversing time temporarily is a device that can be used.
clear
input unit year treatment
1 2000 0
1 2001 0
1 2002 1
1 2003 0
1 2004 0
1 2005 1
1 2006 0
1 2007 0
end
bysort unit (year) : gen wanted1 = 0 if treatment
by unit: replace wanted1 = wanted1[_n-1] + 1 if missing(wanted1)
gen negyear = -year
bysort unit (negyear) : gen wanted2 = 0 if treatment
by unit: replace wanted2 = wanted2[_n-1] + 1 if missing(wanted2)
gen wanted = cond(abs(wanted2) < abs(wanted1), - wanted2, wanted1)
sort unit year
list , sep(0)
+---------------------------------------------------------------+
| unit year treatm~t wanted1 negyear wanted2 wanted |
|---------------------------------------------------------------|
1. | 1 2000 0 . -2000 2 -2 |
2. | 1 2001 0 . -2001 1 -1 |
3. | 1 2002 1 0 -2002 0 0 |
4. | 1 2003 0 1 -2003 2 1 |
5. | 1 2004 0 2 -2004 1 -1 |
6. | 1 2005 1 0 -2005 0 0 |
7. | 1 2006 0 1 -2006 . 1 |
8. | 1 2007 0 2 -2007 . 2 |
+---------------------------------------------------------------+
Here is a solution where the largest number of years does not need to be hardcoded.
clear
input unit year treatment
1 2000 0
1 2001 0
1 2002 1
1 2003 0
1 2004 0
1 2005 1
1 2006 0
1 2007 0
1 2008 0
1 2009 0
1 2010 1
end
sort unit year
*Set all treatment years to 0
gen diff_year = 0 if treatment == 1
*Initialize locals used in the loop
local stop "false"
local diff_distance = 0
while "`stop'" == "false" {
**Set diff to one more than the diff on the row above if the unit is the same,
* this row has no diff yet, and the diff on the row above equals the diff
* distance for this iteration of the loop.
replace diff_year = diff_year[_n-1] + 1 if unit == unit[_n-1] & missing(diff_year) & diff_year[_n-1] == `diff_distance'
**Set diff to one less than the diff on the row below if the unit is the same,
* this row has no diff yet, and the diff on the row below equals the negative
* of the diff distance for this iteration of the loop.
replace diff_year = diff_year[_n+1] - 1 if unit == unit[_n+1] & missing(diff_year) & diff_year[_n+1] == `diff_distance' * -1
*Test whether any missing values remain; if none do, set the stop local to true
count if missing(diff_year)
if `r(N)' == 0 local stop "true"
*Increment the diff distance by one for next loop
local diff_distance = `diff_distance' + 1
}
I found a quick fix to my own question.
I generate a variable that takes missing values if there is no treatment. I then loop over rows, copying each treatment year's value to the rows below and above it, until no missing values remain.
Here, three iterations are enough, but I let the loop run to i = 10 just to show that extra iterations don't change the outcome.
sort unit year
bysort unit: gen year_nb = _n
bysort unit: gen year_target = year_nb if treatment == 1
gen closest_treatment = year_target
forvalues i = 1(1)10 {
bysort unit: replace closest_treatment = closest_treatment[_n-`i'] if(year_target[_n-`i'] != . & closest_treatment[_n] == .)
bysort unit: replace closest_treatment = closest_treatment[_n+`i'] if(year_target[_n+`i'] != . & closest_treatment[_n] == .)
}
replace year_target = closest_treatment if year_target == .
drop closest_treatment
gen dif_year = year_nb - year_target
drop year_nb year_target
Edit: in my example, the number of rows between the two treatments is even. The solution also works when that number is odd: the last row reached by the loop then sits exactly halfway between two treatments. For such a row it does not matter whether we measure the distance to the previous or to the next treatment, except for the sign of the number, which you presumably care about in an event study (e.g. if the distance to the previous treatment is +3 years, the distance to the next treatment is -3). This code snippet assigns the midpoint row to the previous treatment (positive sign); if you want the opposite, just swap the two lines inside the loop.
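For anyone who wants to sanity-check the tie-breaking logic outside Stata, here is a minimal pandas sketch of the same nearest-treatment idea (illustrative only, not part of the Stata solution; it assumes the unit/year/treatment columns from the example and reproduces the previous-treatment, positive-sign convention described above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit": [1] * 8,
    "year": list(range(2000, 2008)),
    "treatment": [0, 0, 1, 0, 0, 1, 0, 0],
})

def signed_distance(g):
    treat_years = g.loc[g["treatment"] == 1, "year"].to_numpy()
    # year minus each treatment year; keep the difference with the smallest
    # absolute value. argmin keeps the first minimum, so a year exactly halfway
    # between two treatments is assigned to the previous one (positive sign).
    diffs = g["year"].to_numpy()[:, None] - treat_years[None, :]
    pick = np.abs(diffs).argmin(axis=1)
    return pd.Series(diffs[np.arange(len(g)), pick], index=g.index)

df["dif_year"] = df.groupby("unit", group_keys=False).apply(signed_distance)
print(df)  # dif_year: -2, -1, 0, 1, -1, 0, 1, 2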
I'm trying to generate different 'total count' variables by companyid and year: one total count for subs, and one for loans.
Basically I'm trying to extend this question: Stata: Calculate sum of any x in y?
* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 companyid int year float sub_num double sub_amt float(sub_year_total loan_num) double loan_amt float loan_year_total
"001004" 1999 . 0 425000 . 0 0
"001004" 1999 2 425000 425000 . 0 0
"001004" 2004 . 0 0 . 0 0
"001004" 2005 1 4232000 4232000 . 0 0
"001004" 2006 1 16000000 1.60e+07 . 0 0
"001004" 2007 3 58354 182444 . 0 0
"001078" 2006 . 0 471529 . 0 0
"001078" 2006 . 0 471529 . 0 0
"001078" 2006 . 0 471529 . 0 0
"001078" 2006 6 29872 471529 . 0 0
"001078" 2006 6 59748 471529 . 0 0
"001078" 2006 6 381909 471529 . 0 0
"001078" 2007 . 0 768825 7 270000 2580000
"001078" 2007 . 0 768825 7 360000 2580000
"001078" 2007 . 0 768825 7 1500000 2580000
"001078" 2007 . 0 768825 7 450000 2580000
"001078" 2007 . 0 768825 . 0 2580000
"001078" 2007 7 359454 768825 . 0 2580000
"001078" 2007 7 409371 768825 . 0 2580000
"001078" 2008 . 0 1751832 5 450000 2450000
"001078" 2008 . 0 1751832 5 2000000 2450000
"001078" 2008 5 47957 1751832 . 0 2450000
"001078" 2008 5 485631 1751832 . 0 2450000
"001078" 2008 5 1218244 1751832 . 0 2450000
end
Note: if sub_num = 0 then loan_num != 0, and vice versa.
I've tried bysort cik year: gen sub_num = _N if loan_amt != 0
and bysort cik year loan_amt: gen sub_num = _N, but neither really does it. I've left my failed count variables in the example for reference.
i.e. company #001078 in 2007 would have loan_num = 4 and sub_num = 2
I just noticed this example has one observation with 0 for both; I can simply drop such entries, so no need to comment on that.
How can I make company total annual counts for my 'sub' and 'loan' variables?
This is a little hard to follow.
There is reference to cik in your code but it is not in your data example.
It is hard to know what is original data and what is the result of calculations you have tried.
The example seems more complicated than necessary.
Although the title refers to sums, it is also clear that you are interested in counting loans of certain kinds.
A count is a sum of indicators, so what follows shows some technique rather than necessarily being an answer. Feed a true-or-false expression to egen, total() and the result is the count of observations for which the expression is true (1); observations for which it is false (0) are ignored in the sense that they add nothing to the sum.
bysort companyid year : egen wanted1 = total(loan_amt > 0)
bysort companyid year : egen wanted2 = total(loan_amt > 0 & sub_num < .)
_N is just the number of observations, in the whole dataset or within the current by-group. You can naturally assign that number to a variable, but adding an if qualifier doesn't make the calculation ignore the excluded observations; it only affects which observations receive non-missing values. Consider this experiment:
. clear
. set obs 1000
number of observations (_N) was 0, now 1,000
. gen count = _N if _n == 1
(999 missing values generated)
. l count in 1
+-------+
| count |
|-------|
1. | 1000 |
+-------+
Otherwise put, _N is not as general a counting method as you need here.
I think I found a workaround:
gen lc = 0
replace lc = 1 if loan_sum != 0
bysort cik year lc: gen lcount = _N if lc != 0
Then just do the same for the other variables.
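For anyone doing the same thing in pandas rather than Stata, the indicator-sum idea from the answer above carries over directly. This is only an illustrative sketch: it assumes the companyid/year/sub_amt/loan_amt columns from the data example and that a nonzero amount marks a real sub or loan.
import pandas as pd

# the 2007 rows for company 001078 from the data example
df = pd.DataFrame({
    "companyid": ["001078"] * 7,
    "year": [2007] * 7,
    "sub_amt": [0, 0, 0, 0, 0, 359454, 409371],
    "loan_amt": [270000, 360000, 1500000, 450000, 0, 0, 0],
})

# a count is a sum of indicators
df["is_sub"] = (df["sub_amt"] > 0).astype(int)
df["is_loan"] = (df["loan_amt"] > 0).astype(int)
df["sub_count"] = df.groupby(["companyid", "year"])["is_sub"].transform("sum")
df["loan_count"] = df.groupby(["companyid", "year"])["is_loan"].transform("sum")
print(df[["companyid", "year", "sub_count", "loan_count"]].drop_duplicates())
# sub_count = 2, loan_count = 4, as described in the question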
I have the following data with person ID and whether they have insurance in each year:
ID Year Insured
1 2001 1
2 2001 0
3 2001 0
1 2002 1
2 2002 1
3 2002 0
1 2003 1
2 2003 0
3 2003 0
What I want is to add another column, which equals 1 if a person is ever insured. For example, Person 2 only had insurance in 2002 but it means he has had insurance at some point, so Ever_Ins should equal 1 in all years:
ID Year Insured Ever_Ins
1 2001 1 1
2 2001 0 1
3 2001 0 0
1 2002 1 1
2 2002 1 1
3 2002 0 0
1 2003 1 1
2 2003 0 1
3 2003 0 0
I cannot use egen Ever_Ins = max(Insured), by(ID) because Insured is not a dummy in the real data; it has values such as 9 for unknown.
Technique for "any" and "all" problems is documented in this FAQ. See also this paper for a more detailed discussion. Here is one way to do it.
clear
input ID Year Insured
1 2001 1
2 2001 0
3 2001 0
1 2002 1
2 2002 1
3 2002 0
1 2003 1
2 2003 0
3 2003 0
end
egen Ever_Ins = max(Insured == 1), by(ID)
sort ID Year
list , sepby(ID)
+--------------------------------+
| ID Year Insured Ever_Ins |
|--------------------------------|
1. | 1 2001 1 1 |
2. | 1 2002 1 1 |
3. | 1 2003 1 1 |
|--------------------------------|
4. | 2 2001 0 1 |
5. | 2 2002 1 1 |
6. | 2 2003 0 1 |
|--------------------------------|
7. | 3 2001 0 0 |
8. | 3 2002 0 0 |
9. | 3 2003 0 0 |
+--------------------------------+
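For completeness, the same "maximum of an indicator" trick written in pandas (an illustrative sketch, not part of the Stata answer); as above, testing Insured == 1 explicitly is what keeps codes such as 9 from leaking into the result.
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "Year": [2001, 2001, 2001, 2002, 2002, 2002, 2003, 2003, 2003],
    "Insured": [1, 0, 0, 1, 1, 0, 1, 0, 0],
})

# max of the True/False indicator within each ID: 1 if ever insured, else 0
df["Ever_Ins"] = (df["Insured"] == 1).groupby(df["ID"]).transform("max").astype(int)
print(df.sort_values(["ID", "Year"]))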
In pandas 0.18.1, Python 2.7.6:
Imagine we have the following table:
ID,FROM_YEAR,FROM_MONTH,AREA
1,2015,1,200
1,2015,2,200
1,2015,3,200
1,2015,4,200
1,2015,5,200
1,2015,6,200
1,2015,7,200
1,2015,8,200
1,2015,9,200
1,2015,10,200
1,2015,11,200
1,2015,12,200
1,2016,1,100
1,2016,2,100
1,2016,3,100
1,2016,4,100
1,2016,5,100
1,2016,6,100
1,2016,7,100
1,2016,8,100
1,2016,9,100
1,2016,10,100
1,2016,11,100
1,2016,12,100
We are trying to get a calendar-year average in the following format:
ID,FROM_YEAR,TYPE,AREA
1,2015,A,200
1,2016,A,100
1,2015,B,200
1,2016,B,100
Note: TYPE is a string column for other information. Here we have only two values of TYPE: 'A' and 'B'.
If we try the following, the 'AREA' column name is missing, and ID=1 only shows on the first row:
AREA_CY=df.groupby(['ID','FROM_YEAR'])['AREA'].mean()
it returns:
ID,FROM_YEAR,
1,2015,200
,2016,100
,2015,200
,2016,100
If we tried the following:
AREA_CY=df.groupby(['ID','FROM_YEAR'])['AREA'].mean(axis=1)
it returns:
TypeError: mean() got an unexpected keyword argument 'axis'
Could any guru enlighten? Thanks!
Try this:
In [102]: x = df.groupby(['ID','FROM_YEAR'])['AREA'].mean().reset_index(name='AREA')
In [103]: y = pd.DataFrame({'TYPE':['A','B']})
In [104]: x
Out[104]:
ID FROM_YEAR AREA
0 1 2015 200
1 1 2016 100
In [105]: y
Out[105]:
TYPE
0 A
1 B
In [106]: x.assign(key=0).merge(y.assign(key=0), on='key').drop('key', 1)
Out[106]:
ID FROM_YEAR AREA TYPE
0 1 2015 200 A
1 1 2015 200 B
2 1 2016 100 A
3 1 2016 100 B
Explanation:
Let's make a cartesian product (i.e. a cross join) of the x and y DFs:
In [126]: x.assign(key=0)
Out[126]:
ID FROM_YEAR AREA key
0 1 2015 200 0
1 1 2016 100 0
In [127]: y.assign(key=0)
Out[127]:
TYPE key
0 A 0
1 B 0
In [128]: x.assign(key=0).merge(y.assign(key=0), on='key')
Out[128]:
ID FROM_YEAR AREA key TYPE
0 1 2015 200 0 A
1 1 2015 200 0 B
2 1 2016 100 0 A
3 1 2016 100 0 B
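As a side note, not applicable to pandas 0.18.1 which the question uses: from pandas 1.2 onward the dummy key is no longer needed, because merge supports a cross join directly (and drop(columns='key') is the modern spelling of drop('key', 1)). A minimal sketch:
import pandas as pd

x = pd.DataFrame({'ID': [1, 1], 'FROM_YEAR': [2015, 2016], 'AREA': [200, 100]})
y = pd.DataFrame({'TYPE': ['A', 'B']})

print(x.merge(y, how='cross'))  # same four rows as Out[106] above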
I have a dataset and I would like to create a rolling conditional statement row by row (I'm not sure what the exact term for this is in SAS). I know how to do it in Excel, but not how to do it in SAS. The following is the dataset and what I would like to achieve.
Data set
----A---- | --Date-- | Amount |
11111 Jan 2015 1
11111 Feb 2015 1
11111 Mar 2015 2
11111 Apr 2015 2
11111 May 2015 2
11111 Jun 2015 1
11112 Jan 2015 2
11112 Feb 2015 1
11112 Mar 2015 1
11112 Apr 2015 4
11112 May 2015 3
11112 Jun 2015 1
I would like to add two columns named 'X' and 'Frequency' which, for each 'A' and 'Date', indicate whether the Amount has gone up or down and by how much. See the sample output below.
----A---- | --Date-- | Amount | --X-- | Frequency |
11111 Jan 2015 1 0 0
11111 Feb 2015 1 0 0
11111 Mar 2015 2 Add 1
11111 Apr 2015 2 0 0
11111 May 2015 2 0 0
11111 Jun 2015 1 Drop 1
11112 Jan 2015 2 0 0
11112 Feb 2015 1 Drop 1
11112 Mar 2015 1 0 0
11112 Apr 2015 4 Add 3
11112 May 2015 3 Drop 1
11112 Jun 2015 1 Drop 2
Example using Lag1():
Data A;
input date monyy7. Y;
datalines;
Jan2015 1
Feb2015 1
Mar2015 2
Apr2015 2
May2015 2
Jun2015 1
Jan2015 2
Feb2015 1
Mar2015 1
Apr2015 4
May2015 3
Jun2015 1
;
data B;
set A;
lag_y=lag1(Y);
if lag_y = . then X = 'missing';
if Y = lag_y then X = 'zero';
if Y > lag_y and lag_y ^= . then X = 'add';
if Y < lag_y then X = 'drop';
freq = abs(Y - lag_y);
run;
Output:
Obs date Y lag_y X freq
1 20089 1 missing
2 20120 1 1 zero 0
3 20148 2 1 add 1
4 20179 2 2 zero 0
5 20209 2 2 zero 0
6 20240 1 2 drop 1
7 20089 2 1 add 1
8 20120 1 2 drop 1
9 20148 1 1 zero 0
10 20179 4 1 add 3
11 20209 3 4 drop 1
12 20240 1 3 drop 2
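For comparison only, here is a minimal pandas sketch of the same idea (not SAS). groupby('A')['Amount'].diff() plays the role of LAG1 but restarts at the first row of each A, which reproduces the 0 / 0 rows at the start of each group in the desired output; the column names are taken from the question.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [11111] * 6 + [11112] * 6,
    "Date": ["Jan 2015", "Feb 2015", "Mar 2015", "Apr 2015", "May 2015", "Jun 2015"] * 2,
    "Amount": [1, 1, 2, 2, 2, 1, 2, 1, 1, 4, 3, 1],
})

# change relative to the previous row within each A; NaN on each group's first row
diff = df.groupby("A")["Amount"].diff()

df["X"] = np.select([diff > 0, diff < 0], ["Add", "Drop"], default="0")
df["Frequency"] = diff.abs().fillna(0).astype(int)
print(df)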