Lag a variable with non unique id x time observations - stata

I have a repeated cross section every year. I have a variable, var1, which is the same across all observations in a given year (for instance, the mean of a variable in a given year). I'd like to create a variable, var1_l, that would be the lagged version of var1.
As an example, from the dataset
id1 year var1
3 1990 3.5
4 1990 3.5
5 1991 4
6 1991 4
7 1991 4
I would like to obtain
id1 year var1 var1_l
3 1990 3.5 .
4 1990 3.5 .
5 1991 4 3.5
6 1991 4 3.5
7 1991 4 3.5
A solution would be to use a merge but saving/restoring the dataset takes a lot of time when the dataset is big. For reference, below is my current merge solution:
preserve
keep year var1
replace year = year - 1
bys year: keep if _n == 1
rename var1 var1_l
sort year
tempfile temp
save `temp'
restore
merge m:1 year using `temp', nogen sorted
Another option would be to use the matrix returned by tabstat. I'm wondering if there is a more elegant solution (that returns . when there is no observation in year - 1).

This seems a little unusual, but could be just a twist on a standard problem as explained here.
. input id1 year var1
id1 year var1
1. 3 1990 3.5
2. 4 1990 3.5
3. 5 1991 4
4. 6 1991 4
5. 7 1991 4
6. end
. sort year id1
. generate var1_l = var1[_n-1] if year == year[_n-1] + 1
(4 missing values generated)
. replace var1_l = var1_l[_n-1] if year == year[_n-1] & missing(var1_l)
(2 real changes made)
. list
+----------------------------+
| id1 year var1 var1_l |
|----------------------------|
1. | 3 1990 3.5 . |
2. | 4 1990 3.5 . |
3. | 5 1991 4 3.5 |
4. | 6 1991 4 3.5 |
5. | 7 1991 4 3.5 |
+----------------------------+

This answer crossed with @Nick's, but the results differ slightly: I check only that the years are different, while his code checks that the years are consecutive.
clear
set more off
input ///
id year var1
1 1990 3.5
3 1990 3.5
2 1990 3.5
1 1991 2
2 1991 2
3 1991 2
3 1992 6
2 1992 6
1 1992 6
3 1993 6
2 1993 6
1 1993 6
4 1993 6
1 1994 4.3
2 1994 4.3
3 1994 4.3
end
list, sepby(year)
*----- what you want -----
sort year
generate var2 = var1[_n-1] if year != year[_n-1]
by year : replace var2 = var2[1]
list, sepby(year)
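To make the difference between the two rules concrete, here is a small sketch on made-up data with a gap year (1991 is absent). The variable names lag_consec and lag_prev are mine, chosen only for illustration.

```stata
clear
input id1 year var1
1 1990 3.5
2 1990 3.5
3 1992 6
4 1992 6
end
sort year id1

* consecutive-years rule: take a lag only if the preceding year is exactly year - 1
generate lag_consec = var1[_n-1] if year == year[_n-1] + 1
by year : replace lag_consec = lag_consec[1]

* different-years rule: take the value from whatever year comes just before in the data
generate lag_prev = var1[_n-1] if year != year[_n-1]
by year : replace lag_prev = lag_prev[1]

list, sepby(year)
```

With the gap, lag_consec stays missing for 1992 because there is no 1991, while lag_prev picks up the 1990 value of 3.5. Which behaviour is right depends on whether "lag" should mean the previous calendar year or the previous year present in the data.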

Related

Filter Specific Data in Stata

I'm using Stata 13 and have to clean a data set in a panel format with different ids for a given period from 2000 to 2003. My data looks like:
id year ln_wage
1 2000 2.30
1 2001 2.31
1 2002 2.31
2 2001 1.89
2 2002 1.89
2 2003 2.10
3 2002 1.60
4 2002 2.46
4 2003 2.47
5 2000 2.10
5 2001 2.10
5 2003 2.12
I would like to keep in my dataset for each year only individuals that appear in t-1 year. In this way, the first year of my sample (2000) will be dropped. I'm looking for output like:
2001:
id year ln_wage
1 2001 2.31
5 2001 2.10
2002:
id year ln_wage
1 2002 2.31
2 2002 1.89
2003:
id year ln_wage
2 2003 2.10
4 2003 2.47
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int year float ln_wage
1 2000 2.3
1 2001 2.31
1 2002 2.31
2 2001 1.89
2 2002 1.89
2 2003 2.1
3 2002 1.6
4 2002 2.46
4 2003 2.47
5 2000 2.1
5 2001 2.1
5 2003 2.12
end
xtset id year
drop if missing(L.ln_wage)
sort year id
list, noobs sepby(year)
+---------------------+
| id year ln_wage |
|---------------------|
| 1 2001 2.31 |
| 5 2001 2.1 |
|---------------------|
| 1 2002 2.31 |
| 2 2002 1.89 |
|---------------------|
| 2 2003 2.1 |
| 4 2003 2.47 |
+---------------------+
// Alternatively, assuming no duplicate years within id exist
bysort id (year): gen todrop = year[_n-1] != year - 1
drop if todrop

How to generate indicator if value of variable is observed in two different periods in Stata

I have a dataset containing various drugs and the dates they were supplied. I would like to create an indicator variable DIBP that takes a value of 1 if the same drug was supplied during both period 1 and period 2 of a given year, and zero otherwise. Period 1 is 1 April to 30 June, and period 2 is 1 October to 31 December.
I have written the following code:
. input id month day year str10 drug
id month day year drug
1. 1 5 1 2003 aspirin
2. 1 11 1 2003 aspirin
3. 1 6 1 2004 aspirin
4. 1 5 1 2005 aspirin
5. 1 11 1 2005 aspirin
6. end
.
. gen date = mdy(month,day,year)
. format date %d
.
. gen period = 1 if inlist(month,4,5,6)
(2 missing values generated)
. replace period = 2 if inlist(month,10,11,12)
(2 real changes made)
.
. label define plab 1"1 April to 30 June" 2"1 October to 31 December"
. label value period plab
.
. * Generate indicator
. gen DIBP = 0
. label var DIBP "Drug In Both Periods"
.
. bysort id year: replace DIBP = 1 if drug[period==1] == "aspirin" & drug[period==2] == "aspirin"
(0 real changes made)
.
. list
+---------------------------------------------------------------------------------+
| id month day year drug date period DIBP |
|---------------------------------------------------------------------------------|
1. | 1 5 1 2003 aspirin 01may2003 1 April to 30 June 0 |
2. | 1 11 1 2003 aspirin 01nov2003 1 October to 31 December 0 |
3. | 1 6 1 2004 aspirin 01jun2004 1 April to 30 June 0 |
4. | 1 5 1 2005 aspirin 01may2005 1 April to 30 June 0 |
5. | 1 11 1 2005 aspirin 01nov2005 1 October to 31 December 0 |
+---------------------------------------------------------------------------------+
I would expect DIBP to take a value of 1 for observations 1, 2, 4 and 5 (because aspirin was taken during both periods in 2003 and 2005) and a value of zero for observation 3 (because aspirin was only taken during one period in 2004), but this isn't the case. Where am I going wrong? Thank you.
There is a problem with your use of subscripts. You seem to be assuming that a subscript can select a subset of observations. A subscript can indeed pick out another individual observation, but what you tried, while legal, does not do what you want.
The expressions used as subscripts
period == 1
period == 2
will be evaluated as true (1) or false (0) according to the value of period in the current observation. Then either observation 0 (which is always regarded as having missing values) or observation 1 (the first in each group of observations) will be used. Otherwise put, subscripts evaluate as observation numbers, not as defining subsets of the data.
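A minimal sketch of that behaviour, on made-up data with two different drugs so that the effect is visible; the variable name from_p1 is hypothetical.

```stata
clear
input id year period str10 drug
1 2003 1 aspirin
1 2003 2 ibuprofen
end

* the subscript (period == 1) is evaluated in the current observation:
* it is 1 where period is 1, so drug[1] is used,
* and 0 elsewhere, so drug[0] is used, which is always missing
bysort id year (period) : generate str10 from_p1 = drug[period == 1]

list
```

Here from_p1 is "aspirin" in the first observation and empty in the second: the subscript picked observation 1 or observation 0 of the group, and at no point selected "the observations in period 1".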
There is a further puzzle: even for the same person and year, period 1 or period 2 could in principle cover several observations. In the example given, the drug is constant anyway, but what would you expect the code to do if the drug differed? The crux most evident to me is distinguishing between a flag for any prescriptions of a certain drug and all prescriptions of that drug in a period. More at this FAQ.
Otherwise this code may help. Extension to several drugs is left as an exercise.
clear
input id month day year str10 drug
1 5 1 2003 aspirin
1 11 1 2003 aspirin
1 6 1 2004 aspirin
1 5 1 2005 aspirin
1 11 1 2005 aspirin
end
generate date = mdy(month,day,year)
format date %td
* code needs modification if any month is 1, 2, 3, 7, 8, 9
generate period = 1 if inlist(month,4,5,6)
replace period = 2 if inlist(month,10,11,12)
label define plab 1"1 April to 30 June" 2"1 October to 31 December"
label value period plab
bysort id year period (date): egen all_aspirin = min(drug == "aspirin")
by id year period: egen any_aspirin = max(drug == "aspirin")
by id year : gen both_all_aspirin = period[1] == 1 & period[_N] == 2 & all_aspirin[1] & all_aspirin[_N]
by id year : gen both_any_aspirin = period[1] == 1 & period[_N] == 2 & any_aspirin[1] & any_aspirin[_N]
list id date drug *aspirin
+----------------------------------------------------------------------+
| id date drug all_as~n any_as~n b~ll_a~n b~ny_a~n |
|----------------------------------------------------------------------|
1. | 1 01may2003 aspirin 1 1 1 1 |
2. | 1 01nov2003 aspirin 1 1 1 1 |
3. | 1 01jun2004 aspirin 1 1 0 0 |
4. | 1 01may2005 aspirin 1 1 1 1 |
5. | 1 01nov2005 aspirin 1 1 1 1 |
+----------------------------------------------------------------------+
As a style note, consider this example
generate dummy = 0
replace dummy = 1 if frog == 42
Experienced Stata programmers generally just write
generate dummy = frog == 42
See also this FAQ
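One caveat with the one-line form, sketched here on made-up values: a missing frog is not equal to 42, so the expression evaluates to 0 and not to missing. If missing should stay missing, an if qualifier is one way to get that; the name dummy2 is only for illustration.

```stata
clear
input frog
42
7
.
end

* true/false expression: 1 if frog is 42, 0 otherwise (including when frog is missing)
generate dummy = frog == 42

* same indicator, but left missing wherever frog itself is missing
generate dummy2 = frog == 42 if !missing(frog)

list
```

In the third observation dummy is 0 while dummy2 is missing.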

How to reshape data multiple ways in Stata?

I am working with a data set covering multiple countries, variables, and years. It is currently organized wide like so (actually ~30 years and 5 different variables for each country):
country measure yr1995 yr1996 yr1997
USA A 5 4 1
USA B 1 2 1
USA C 0 4 2
UK A 2 4 9
UK B 2 8 4
UK C 2 4 1
What I would like is for the data to be rearranged long like so:
country year A B C
USA 1995 5 1 0
USA 1996 4 2 4
USA 1997 1 1 2
UK 1995 2 2 2
UK 1996 4 8 4
UK 1997 9 4 1
I tried using reshape long yr, i(country) j(year) but get the following error message:
variable id does not uniquely identify the observations
Your data are currently wide. You are performing a reshape long. You specified i(country) and j(year). In
the current wide form, variable country should uniquely identify the observations.
I think this is because country is not the only long variable? (measure also is?)
Besides fixing that issue and arranging the years long instead of wide, I don't think this command will accomplish the other task of moving the different variables (A, B, C) into the wide format as column headers.
Will I need to use a separate reshape wide command for that? Or is there some way to expand the command to do both at once?
It's a double reshape. At least it can be done that way; and, further, that seems essential because years need to be long, not wide, and the measure(s) need to be wide, not long, so there are flavours of both problems.
Economic development data often arrive like this. Indeed the problem has given rise to at least one dedicated short paper in the Stata Journal, which is visible to all.
Your data example is helpful, and almost immediately useful, but please read the Stata tag and help dataex (if necessary, install dataex first using ssc install dataex).
See also this FAQ, which includes some hints beyond the Stata help and manual entry.
A search reshape in Stata would have pointed to these resources.
clear
input str3 country str1 measure yr1995 yr1996 yr1997
USA A 5 4 1
USA B 1 2 1
USA C 0 4 2
UK A 2 4 9
UK B 2 8 4
UK C 2 4 1
end
reshape long yr, i(country measure) j(year)
reshape wide yr, i(country year) j(measure) string
rename (yr*) *
list, sepby(country)
+----------------------------+
| country year A B C |
|----------------------------|
1. | UK 1995 2 2 2 |
2. | UK 1996 4 8 4 |
3. | UK 1997 9 4 1 |
|----------------------------|
4. | USA 1995 5 1 0 |
5. | USA 1996 4 2 4 |
6. | USA 1997 1 1 2 |
+----------------------------+

Creating a dummy indicating the last row of each group

I have the following panel dataset.
I did
sort FirmID Year
to make the following.
FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
I want to create a new variable exitnextyear which is 1 if the firm does not exist next year, so that the output is
FirmID Year exitnextyear
1 1996 0
1 1997 0
1 1998 1
2 2000 0
2 2001 1
I think I have to use something like
by FirmID: gen exitnextyear (and something)
but I don't know what to do next.
clear
input FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
end
bysort FirmID (Year) : gen byte exitnextyear = _n == _N
list, sepby(FirmID)
For the principles, see help and manual entries on by: and/or a tutorial review accessible here.
Row is spreadsheetspeak; in Stata the term is observation.
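The code above flags the last observation of each firm. If the panel could have gaps, and a missing next year should also count, a variant along these lines may be closer to the wording of the question. This is a sketch on made-up data (firm 1 skips 1998), and absentnextyear is a hypothetical name.

```stata
clear
input FirmID Year
1 1996
1 1997
1 1999
2 2000
2 2001
end

* last observation of each firm, as in the answer above
bysort FirmID (Year) : gen byte exitnextyear = _n == _N

* 1 whenever the following calendar year is not in the data for that firm;
* Year[_n+1] is missing past the end of each group, so the last year is flagged too
by FirmID : gen byte absentnextyear = Year[_n+1] != Year + 1

list, sepby(FirmID)
```

For firm 1 in 1997, exitnextyear is 0 but absentnextyear is 1, because 1998 is not observed.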

Making unbalanced panel balanced with missing observations

I am attempting to make the data balanced for my sample. My data currently looks like:
id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
....
And I would like it to look like:
id year y
1 2000 2
1 2001 .
1 2002 4
1 2003 5
2 2000 .
2 2001 2
2 2002 3
....
I have tried creating a .dta of just the year and merging it to the data; however, I can't get it to work. Essentially I would like to add rows of missing data to the panel. I realize I could just drop ids with unbalanced data, but this is not an option for my methodology.
You need to skim the Data-Management Reference Manual [D] when looking for basic data management functionality. In this case fillin does what you seem to be asking.
clear
input id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
end
fillin id year
list, sepby(id)
+-------------------------+
| id year y _fillin |
|-------------------------|
1. | 1 2000 2 0 |
2. | 1 2001 . 1 |
3. | 1 2002 4 0 |
4. | 1 2003 5 0 |
|-------------------------|
5. | 2 2000 . 1 |
6. | 2 2001 2 0 |
7. | 2 2002 3 0 |
8. | 2 2003 . 1 |
+-------------------------+
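As an alternative sketch, if the data are declared as a panel, tsfill with the full option also balances it. The difference, as far as I understand it, is that fillin builds every combination of the id and year values that occur in the data, whereas tsfill, full fills every period between the overall first and last year, including years that no panel happens to contain.

```stata
clear
input id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
end

* declare the panel structure, then fill in the missing id-year combinations
xtset id year
tsfill, full

list, sepby(id)
```

On this example the result has the same eight observations as the fillin output, without the _fillin marker variable.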