Creating a dummy indicating the last row of each group


I have the following panel dataset.
I did
sort FirmID Year
to make the following.
FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
I want to create a new variable exitnextyear which is 1 if the firm does not exist next year, so that the output is
FirmID Year exitnextyear
1 1996 0
1 1997 0
1 1998 1
2 2000 0
2 2001 1
I think I have to use something like
by FirmID: gen exitnextyear (and something)
but I don't know what to do next.

clear
input FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
end
bysort FirmID (Year) : gen byte exitnextyear = _n == _N
list, sepby(FirmID)
For the principles, see help and manual entries on by: and/or a tutorial review accessible here.
Row is spreadsheetspeak; in Stata the term is observation.
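As a minimal sketch of the principle, using the example data above: under by:, _n is the observation number within the current group and _N is the number of observations in that group, so _n == _N is true only for each firm's last observation. The helper variables obs_in_firm and firm_size are hypothetical names, introduced here only to make the two quantities visible.

```stata
clear
input FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
end
* _n counts observations within each firm, in Year order
bysort FirmID (Year) : gen obs_in_firm = _n
* _N is the number of observations for that firm
by FirmID : gen firm_size = _N
* the dummy is 1 exactly where the two coincide
gen byte exitnextyear = obs_in_firm == firm_size
list, sepby(FirmID)
```

Note that this flags the last observed year of every firm, including any firm still present in the final year covered by the data.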

Related

How to generate indicator if value of variable is observed in two different periods in Stata

I have a dataset containing various drugs and the dates they were supplied. I would like to create an indicator variable DIBP that takes a value of 1 if the same drug was supplied during both period 1 and period 2 of a given year, and zero otherwise. Period 1 is 1 April to 30 June, and period 2 is 1 October to 31 December.
I have written the following code:
. input id month day year str10 drug
id month day year drug
1. 1 5 1 2003 aspirin
2. 1 11 1 2003 aspirin
3. 1 6 1 2004 aspirin
4. 1 5 1 2005 aspirin
5. 1 11 1 2005 aspirin
6. end
.
. gen date = mdy(month,day,year)
. format date %d
.
. gen period = 1 if inlist(month,4,5,6)
(2 missing values generated)
. replace period = 2 if inlist(month,10,11,12)
(2 real changes made)
.
. label define plab 1"1 April to 30 June" 2"1 October to 31 December"
. label value period plab
.
. * Generate indicator
. gen DIBP = 0
. label var DIBP "Drug In Both Periods"
.
. bysort id year: replace DIBP = 1 if drug[period==1] == "aspirin" & drug[period==2] == "aspirin"
(0 real changes made)
.
. list
     +---------------------------------------------------------------------------------+
     | id   month   day   year      drug        date                     period   DIBP |
     |---------------------------------------------------------------------------------|
  1. |  1       5     1   2003   aspirin   01may2003         1 April to 30 June      0 |
  2. |  1      11     1   2003   aspirin   01nov2003   1 October to 31 December      0 |
  3. |  1       6     1   2004   aspirin   01jun2004         1 April to 30 June      0 |
  4. |  1       5     1   2005   aspirin   01may2005         1 April to 30 June      0 |
  5. |  1      11     1   2005   aspirin   01nov2005   1 October to 31 December      0 |
     +---------------------------------------------------------------------------------+
I would expect DIBP to take a value of 1 for observations 1, 2, 4 and 5 (because aspirin was supplied during both periods in 2003 and 2005) and a value of zero for observation 3 (because aspirin was supplied during only one period in 2004), but this isn't the case. Where am I going wrong? Thank you.

The problem lies in your use of subscripts. You seem to assume that a logical expression used as a subscript selects the observations for which it is true. Subscripts can indeed refer to other observations, but only one observation at a time, by observation number. What you tried is legal, but it does not do what you want.
The expressions used as subscripts
period == 1
period == 2
will be evaluated as true (1) or false (0) according to the value of period in the current observation. Then either observation 0 (which is always regarded as having missing values) or observation 1 (the first in each group of observations) will be used. Otherwise put, subscripts evaluate as observation numbers, not as defining subsets of the data.
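A minimal sketch of that behaviour, assuming a made-up two-observation dataset (the variable picked and the drug values are hypothetical, for illustration only):

```stata
clear
input period str10 drug
1 "aspirin"
2 "ibuprofen"
end
* period == 1 is evaluated in each observation in turn:
* it is 1 in observation 1, so drug[1] is used there;
* it is 0 in observation 2, so drug[0] (always missing) is used there
generate picked = drug[period == 1]
list
```

In observation 2 the result is an empty string, not "aspirin", because the subscript evaluated to observation number 0 and did not select the period 1 observation.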
There is a further puzzle: even for the same person and year, period 1 or period 2 could in principle cover several observations. In the example given the drug is constant anyway, but what would you expect the code to do if the drugs differed? The crux, as I see it, is distinguishing between a flag for any prescription of a certain drug in a period and a flag for all prescriptions in that period being of that drug. More at this FAQ.
Otherwise this code may help. Extension to several drugs is left as an exercise.
clear
input id month day year str10 drug
1 5 1 2003 aspirin
1 11 1 2003 aspirin
1 6 1 2004 aspirin
1 5 1 2005 aspirin
1 11 1 2005 aspirin
end
generate date = mdy(month,day,year)
format date %td
* code needs modification if any month is 1, 2, 3, 7, 8, 9
generate period = 1 if inlist(month,4,5,6)
replace period = 2 if inlist(month,10,11,12)
label define plab 1"1 April to 30 June" 2"1 October to 31 December"
label value period plab
bysort id year period (date): egen all_aspirin = min(drug == "aspirin")
by id year period: egen any_aspirin = max(drug == "aspirin")
by id year : gen both_all_aspirin = period[1] == 1 & period[_N] == 2 & all_aspirin[1] & all_aspirin[_N]
by id year : gen both_any_aspirin = period[1] == 1 & period[_N] == 2 & any_aspirin[1] & any_aspirin[_N]
list id date drug *aspirin
     +----------------------------------------------------------------------+
     | id        date      drug   all_as~n   any_as~n   b~ll_a~n   b~ny_a~n |
     |----------------------------------------------------------------------|
  1. |  1   01may2003   aspirin          1          1          1          1 |
  2. |  1   01nov2003   aspirin          1          1          1          1 |
  3. |  1   01jun2004   aspirin          1          1          0          0 |
  4. |  1   01may2005   aspirin          1          1          1          1 |
  5. |  1   01nov2005   aspirin          1          1          1          1 |
     +----------------------------------------------------------------------+
As a style note, consider this example
generate dummy = 0
replace dummy = 1 if frog == 42
Experienced Stata programmers generally just write
generate dummy = frog == 42
See also this FAQ

Adding across years

Quick question. I'm working with code that produces a spreadsheet containing information like the following:
year business sales profit
2001 a 5 3
2002 a 6 4
2003 a 4 2
2001 b 2 1
2002 b 6 3
2003 b 7 5
How can I get Stata to total sales and profits across years?
Thanks

Try
collapse (sum) sales profit, by(year)
or, if you want to retain your original data,
bysort year: egen tot_sales = total(sales)
egen provides extensions to generate and is a very useful command.
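As a sketch using the data from the question, the egen approach can be repeated for profit. The question could also be read as asking for totals per business over all years; the last command shows that reading. The variable names tot_profit and bus_sales are hypothetical.

```stata
clear
input year str1 business sales profit
2001 "a" 5 3
2002 "a" 6 4
2003 "a" 4 2
2001 "b" 2 1
2002 "b" 6 3
2003 "b" 7 5
end
* totals within each year (summing over businesses), keeping all rows
bysort year : egen tot_sales = total(sales)
by year : egen tot_profit = total(profit)
* totals within each business (summing over years), if that is what is wanted
bysort business : egen bus_sales = total(sales)
list, sepby(business)
```

Unlike collapse, egen leaves the original observations in place and repeats each total on every row of its group.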

Lag a variable with non unique id x time observations

I have a repeated cross section every year. I have a variable, var1, which is the same across all observations in a given year (for instance, the mean of a variable in a given year). I'd like to create a variable, var1_l, that would be the lagged version of var1.
As an example, from the dataset
id1 year var1
3 1990 3.5
4 1990 3.5
5 1991 4
6 1991 4
7 1991 4
I would like to obtain
id1 year var1 var1_l
3 1990 3.5 .
4 1990 3.5 .
5 1991 4 3.5
6 1991 4 3.5
7 1991 4 3.5
A solution would be to use a merge but saving/restoring the dataset takes a lot of time when the dataset is big. For reference, below is my current merge solution:
preserve
keep year var1
replace year = year + 1
bys year: keep if _n == 1
rename var1 var1_l
sort year
tempfile temp
save `temp'
restore
merge m:1 year using `temp', nogen sorted
Another option would be to use the matrix returned by tabstat. I'm wondering if there is a more elegant solution (that returns . when there is no observation in year - 1).

This seems a little unusual, but could be just a twist on a standard problem as explained here.
. input id1 year var1
id1 year var1
1. 3 1990 3.5
2. 4 1990 3.5
3. 5 1991 4
4. 6 1991 4
5. 7 1991 4
6. end
. sort year id1
. generate var1_l = var1[_n-1] if year == year[_n-1] + 1
(4 missing values generated)
. replace var1_l = var1_l[_n-1] if year == year[_n-1] & missing(var1_l)
(2 real changes made)
. list
     +----------------------------+
     | id1   year   var1   var1_l |
     |----------------------------|
  1. |   3   1990    3.5        . |
  2. |   4   1990    3.5        . |
  3. |   5   1991      4      3.5 |
  4. |   6   1991      4      3.5 |
  5. |   7   1991      4      3.5 |
     +----------------------------+

This answer crossed with @Nick's, but the results differ slightly. I check only that the years are different, while his code checks that the years are consecutive.
clear
set more off
input ///
id year var1
1 1990 3.5
3 1990 3.5
2 1990 3.5
1 1991 2
2 1991 2
3 1991 2
3 1992 6
2 1992 6
1 1992 6
3 1993 6
2 1993 6
1 1993 6
4 1993 6
1 1994 4.3
2 1994 4.3
3 1994 4.3
end
list, sepby(year)
*----- what you want -----
sort year
generate var2 = var1[_n-1] if year != year[_n-1]
by year : replace var2 = var2[1]
list, sepby(year)
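The difference between the two checks only shows up when a year is missing from the data. A minimal sketch, assuming a made-up dataset with no observations for 1991 (the variable names lag_consec and lag_prev are hypothetical):

```stata
clear
input year var1
1990 3.5
1990 3.5
1992 6
1992 6
end
sort year
* consecutive-year check: 1992 stays missing because 1991 is absent
generate lag_consec = var1[_n-1] if year == year[_n-1] + 1
by year : replace lag_consec = lag_consec[1]
* different-year check: 1992 picks up the value from 1990
generate lag_prev = var1[_n-1] if year != year[_n-1]
by year : replace lag_prev = lag_prev[1]
list, sepby(year)
```

Which behaviour is right depends on whether the value from the most recent observed year should count as a lag when the immediately preceding year has no observations.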

Stata: Generate new variable with all values (e.g. not just max or min) for a group based on other variable in the group

I want to create new variables for the group country (iso_o/iso_d) with characteristics of the variable indepdate.
So far I have been typing:
gen include=1 if heg_o != 1
egen iso_o_indepdate1=min(indepdate * include), by(iso_o)
egen iso_o_indepdate2=max(indepdate * include), by(iso_o)
replace iso_o_indepdate2=. if iso_o_indepdate1==iso_o_indepdate2
drop include
*
gen include=1 if heg_d !=1
egen iso_d_indepdate1=min(indepdate * include), by(iso_d)
egen iso_d_indepdate2=max(indepdate * include), by(iso_d)
replace iso_d_indepdate2=. if iso_d_indepdate1==iso_d_indepdate2
drop include
The problem is that I can combine min() and max() to create two new variables for the values within indepdate, but if there are more than two distinct values I haven't been able to find a solution. Here is a small table.
iso_o group indepdate new1 new2 new3
FRA 1 1960 1960 1980 1999
FRA 1 1980 1960 1980 1999
FRA 1 1999 1960 1980 1999
FRA 1 . 1960 1980 1999
USA 2 1955 1955 . .
USA 2 . 1955 . .
USA 2 . 1955 . .
So for this small example I could try work with intervals, however the dataset is very large and therefore I cannot tell for sure how many values are in one interval.
Any hint on another approach for this?

You can reshape and then merge:
clear all
set more off
*----- example data ---
input ///
str3 iso_o group indepdate new1 new2 new3
FRA 1 1960 1960 1980 1999
FRA 1 1980 1960 1980 1999
FRA 1 1999 1960 1980 1999
FRA 1 . 1960 1980 1999
USA 2 1955 1955 . .
USA 2 . 1955 . .
USA 2 . 1955 . .
end
drop new*
list, sepby(group)
tempfile orig
save "`orig'"
*----- what you want -----
bysort group (indepdate) : gen j = _n
reshape wide indepdate, i(group) j(j)
keep group indepdate*
merge 1:m group using "`orig'", assert(match) nogenerate
// list
sort group indepdate
order iso_o group indepdate indepdate*
list, sepby(group)
See help dropmiss to drop variables that have only missing values.
But the bigger question is why do you want to do this?

Compare each obs with rest of its sub-group

I have the following goal regarding my data structure
group; month; year; next_year
1; February; 2014; 0
1; March; 2006; 0
1; November; 2013; 1
2; January; 2014; 0
3; January; 2004; 0
I do have group, month and year, however the column next_year needs to be generated from the first three: For each observation, I want to check if there is another observation within the same group that has a date-value which falls into the period of next year. If so, I want to set the value of next_year to 1, otherwise to 0 (see example).
I started by converting the date into a format that Stata can interpret, using ym(year, month), so that I can make comparisons. However, I am not sure how to iterate over all observations within the group in order to determine whether that is the case.
I would know how to do it in e.g. Java, but I don't for Stata. I suppose I should not start with loops as there are probably some implemented commands for such problems.

If you want to check if there is a following observation within the next 12 months, you can try:
clear
set more off
*----- example data -----
input group str8 month year
1 March 2006
1 March 2013
1 November 2013
1 January 2013
2 January 2014
3 January 2004
end
*----- what you want -----
gen dat = monthly(month + string(year), "MY")
format dat %tm
bysort group (dat): gen next = dat[_n+1] - dat <= 12
list, sepby(group)
Make sure you understand the difference between Nick's code and mine. They work under different assumptions. You can check the differences running both pieces of code with the data I have provided (originally Nick's but with one observation deleted to get the point across; by chance, if you use Nick's data without the modification, the results will be the same).

You are correct in avoiding an explicit loop. This kind of problem is soluble using by:.
I modified your example to have two observations for one group in one year.
clear
input group str8 month year
1 February 2014
1 March 2006
1 March 2013
1 November 2013
2 January 2014
3 January 2004
end
bysort group (year) : gen next_year = year[_n+1] == year + 1
bysort group year (next_year) : replace next_year = next_year[_N]
list, sepby(group)
     +------------------------------------+
     | group      month   year   next_y~r |
     |------------------------------------|
  1. |     1      March   2006          0 |
  2. |     1   November   2013          1 |
  3. |     1      March   2013          1 |
  4. |     1   February   2014          0 |
     |------------------------------------|
  5. |     2    January   2014          0 |
     |------------------------------------|
  6. |     3    January   2004          0 |
     +------------------------------------+
Getting an explicit sort order is essential. Within group, we look ahead to see if the next year is the current year plus 1, assigning 1 if true and 0 if false. That will at most be true for the last observation for a given group and year. If there is more than one observation for each group and year, any occurrence of 1 must be spread to all such observations.
For a tutorial on by:, see Speaking Stata: How to move step by: step.
The assumption here is that you mean in the next calendar year, not in the next 12 months. Making your dates into Stata monthly dates will be needed for most other problems, but doesn't make this one easier. Here is one way to do that in your situation, assuming that month is string and year is numeric:
gen mdate = monthly(month + string(year), "MY")
format mdate %tm
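To see how the two readings can disagree, here is a minimal sketch on made-up data, applying the next-calendar-year rule from this answer alongside the next-12-months rule from the other answer. The data and the variable names next_cal and next_12m are hypothetical; a space is inserted between month and year before parsing, which is an assumption made here for safety.

```stata
clear
input group str8 month year
1 "March"    2013
1 "February" 2014
2 "January"  2013
2 "December" 2014
end
gen mdate = monthly(month + " " + string(year), "MY")
format mdate %tm
* rule A: some observation in the group falls in the next calendar year
bysort group (year) : gen next_cal = year[_n+1] == year + 1
bysort group year (next_cal) : replace next_cal = next_cal[_N]
* rule B: the following observation is at most 12 months later
bysort group (mdate) : gen next_12m = mdate[_n+1] - mdate <= 12
list, sepby(group)
```

For group 2, January 2013 is followed by December 2014, which is in the next calendar year but 23 months later, so rule A gives 1 and rule B gives 0.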
