I have the following goal regarding my data structure:
group; month; year; next_year
1; February; 2014; 0
1; March; 2006; 0
1; November; 2013; 1
2; January; 2014; 0
3; January; 2004; 0
I have group, month and year; the column next_year, however, needs to be generated from the first three. For each observation, I want to check whether there is another observation within the same group whose date falls into the period of the next year. If so, I want to set the value of next_year to 1, otherwise to 0 (see example).
I started by converting the date into a format that Stata can interpret - using ym(month, year) - such that I can make comparisons. However, I am not sure how to iterate over all observations within the group in order to determine if that is the case or not.
I would know how to do it in e.g. Java, but I don't for Stata. I suppose I should not start with loops as there are probably some implemented commands for such problems.
If you want to check if there is a following observation within the next 12 months, you can try:
clear
set more off
*----- example data -----
input group str8 month year
1 March 2006
1 March 2013
1 November 2013
1 January 2013
2 January 2014
3 January 2004
end
*----- what you want -----
gen dat = monthly(month + string(year), "MY")
format dat %tm
bysort group (dat): gen next = dat[_n+1] - dat <= 12
list, sepby(group)
Make sure you understand the difference between Nick's code and mine. They work under different assumptions. You can check the differences running both pieces of code with the data I have provided (originally Nick's but with one observation deleted to get the point across; by chance, if you use Nick's data without the modification, the results will be the same).
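To make the difference concrete, here is a minimal sketch that computes both rules side by side on the example data above. The variable names next12 and nextcal are made up for this illustration; next12 follows the "next observation within 12 months" reading, and nextcal follows the "some observation in the next calendar year" reading.

```stata
clear
input group str8 month year
1 March 2006
1 March 2013
1 November 2013
1 January 2013
2 January 2014
3 January 2004
end

gen dat = monthly(month + string(year), "MY")
format dat %tm

* Rule A: the next observation in the group is at most 12 months later
bysort group (dat): gen next12 = dat[_n+1] - dat <= 12

* Rule B: the group has an observation in the next calendar year
bysort group (year): gen nextcal = year[_n+1] == year + 1
bysort group year (nextcal): replace nextcal = nextcal[_N]

sort group dat
list, sepby(group)
```

With these data, next12 is 1 for January 2013 and March 2013 in group 1 (each is followed by another 2013 observation), while nextcal is 0 everywhere because no group has observations in two consecutive calendar years.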
You are correct in avoiding an explicit loop. This kind of problem is soluble using by:.
I modified your example to have two observations for one group in one year.
clear
input group str8 month year
1 February 2014
1 March 2006
1 March 2013
1 November 2013
2 January 2014
3 January 2004
end
bysort group (year) : gen next_year = year[_n+1] == year + 1
bysort group year (next_year) : replace next_year = next_year[_N]
list, sepby(group)
     +------------------------------------+
     | group      month   year   next_y~r |
     |------------------------------------|
  1. |     1      March   2006          0 |
  2. |     1   November   2013          1 |
  3. |     1      March   2013          1 |
  4. |     1   February   2014          0 |
     |------------------------------------|
  5. |     2    January   2014          0 |
     |------------------------------------|
  6. |     3    January   2004          0 |
     +------------------------------------+
Getting an explicit sort order is essential. Within group, we look ahead to see if the next year is the current year plus 1, assigning 1 if true and 0 if false. That will at most be true for the last observation for a given group and year. If there is more than one observation for each group and year, any occurrence of 1 must be spread to all such observations.
For a tutorial on by:, see Speaking Stata: How to move step by: step.
The assumption here is that you mean in the next calendar year, not in the next 12 months. Making your dates into Stata monthly dates will be needed for most other problems, but doesn't make this one easier. Here is one way to do that in your situation, assuming that month is string and year is numeric:
gen mdate = monthly(month + string(year), "MY")
format mdate %tm
Related
Data structure: I use panel data in which an observation represents a certain individual in a given year (2015-2021). Only observations of individuals who are between 15 and 25 years old are included. There are 2857 observations of 1373 individuals in total.
Goal: The goal is to investigate the effect of a policy change in 2018. To do so, I set up a quasi-experimental design with two control groups and a treatment group, defined in terms of age:
Controlgroup A: individuals 15-17 years old
Treatmentgroup: individuals 18-22 years old
Controlgroup B: individuals 23-25 years old
Dividing individuals into treatment and control groups based on varying chance:
For methodological reasons, individuals selected into a control group may not become part of the treatment group (due to aging over time) and vice versa. I therefore face the question of how to select the right individuals (given their age and the year) for the treatment and control groups.
To ensure that every year has observations of individuals in all ages, I came up with the following design (see picture).
There are 17 theoretically possible individuals in my data (vertical as in the picture) who age over 7 years (2015-2021). I would like to sample the individuals into the treatment and controlgroups based on the chances mentioned in the table beneath to ensure all ages are represented in all years.
Programming
I constructed a variable (1-17) indicating what number an individual represents (like the vertical numbers in the table above)
gen individualnumber=(age-year)+2007
I constructed three variables indicating the chances of being in controlgroup A, B or treatment in the following way:
gen Chanceofbeingcontrol_1517=0
replace Chanceofbeingcontrol_1517=1 if individualnumber==1 | individualnumber==2 | individualnumber==3
replace Chanceofbeingcontrol_1517=0.75 if individualnumber==4
replace Chanceofbeingcontrol_1517=0.60 if individualnumber==5
replace Chanceofbeingcontrol_1517=0.50 if individualnumber==6
replace Chanceofbeingcontrol_1517=0.43 if individualnumber==7
replace Chanceofbeingcontrol_1517=0.29 if individualnumber==8
replace Chanceofbeingcontrol_1517=0.14 if individualnumber==9
gen Chanceofbeingcontrol_2325=0
replace Chanceofbeingcontrol_2325=1 if individualnumber==15 | individualnumber==16 | individualnumber==17
replace Chanceofbeingcontrol_2325=0.75 if individualnumber==14
replace Chanceofbeingcontrol_2325=0.60 if individualnumber==13
replace Chanceofbeingcontrol_2325=0.50 if individualnumber==12
replace Chanceofbeingcontrol_2325=0.43 if individualnumber==11
replace Chanceofbeingcontrol_2325=0.29 if individualnumber==10
replace Chanceofbeingcontrol_2325=0.14 if individualnumber==9
gen Chanceofbeingtreated=1-(Chanceofbeingcontrol_1517+Chanceofbeingcontrol_2325)
After that I wanted to construct the samples...
splitsample, generate(treatedornot) split(Chanceofbeingcontrol_1517 Chanceofbeingtreated Chanceofbeingcontrol_2325) cluster(individualnumber) rround show
...but I received an error, since only a numlist may be used in the split(numlist) option.
Question: How to construct the samples or overcome this error in an efficient way?
Example: An individual (number 7 in the table) who is 15 years old in 2015 (control group A age) will be 18 years old in 2018 (which is the treatment age). This individual may not be part of both the treatment and the control group and should therefore be a member of only one of the two. Therefore I want to draw three random samples among all number 7 individuals.
Let's state there are 100 individuals like individual 7 in the table.
Sample 1 is controlgroup A and individual 7 will occur 43 times in this sample
Sample 2 is the treatment group so individual 7 occurs 57 times in this sample
While individual 7 will not occur in sample 3 since this person is never older than 22 during 2015-2021.
What all people who were 9 in 2015, 10 in 2016 and 11 in 2017 have in common is that they were born in 2006. And all who were 10 in 2015 were born in 2005. So instead of a variable individualnumber, which can be hard to understand for someone who reads your code, why not create a variable called birthyear? That will make it easier to explain your design to your peers.
Regardless of what you call the variable or what its values represent, I would solve it something like this. You will probably need to tweak this code. Provide a replicable subset of your data (see the command dataex) if you want a replicable answer.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year double age
1 2017 15
1 2017 15
2 2017 15
2 2017 15
3 2017 15
3 2017 15
4 2017 15
4 2017 15
5 2015 12
5 2015 12
end
* Create the variable that holds each individual's birth year
gen birthyear = year-age
preserve
* Collapse year-person level data to person level so
* that each individual only get one treatment status.
* You must have an individual id number for this
* Get standard deviation to test that data is good and the birthyear
* is identical for each individual across the panel data set
collapse (mean) birthyear (sd) bysd=birthyear, by(id)
* Test that birthyear is the same within each individual - this is not needed,
* but good data quality assurance test. Then drop the var as it is not needed.
* (sd is missing for individuals with a single observation, so allow that.)
assert bysd == 0 | missing(bysd)
drop bysd
* Set seed to make replicable. Replace this seed when you have tested this
* script using a new random seed. For example from here:
* https://www.random.org/integers/?num=1&min=100000&max=999999&col=5&base=10&format=html&rnd=new
set seed 123456
*Generate a random number based on the seed
gen random_draw = runiform()
* For each birthyear, get the rank of the random number divided by the number
* of individuals in each birthyear
sort birthyear random_draw
by birthyear : gen percent_rank = _n/_N
*Initiate treatment variable and attach the value label to it
gen tmt_status = .
label define tmt_status 0 "Treated" 1 "ControlA" 2 "ControlB"
label values tmt_status tmt_status
*Assign birthyear 2006-2004 that are all the same
replace tmt_status = 1 if birthyear == 2006
replace tmt_status = 1 if birthyear == 2005
replace tmt_status = 1 if birthyear == 2004
*Assign birthyear 2003
replace tmt_status = 0 if birthyear == 2003 & percent_rank <= .25
replace tmt_status = 1 if birthyear == 2003 & percent_rank > .25
*Assign birthyear 2002
replace tmt_status = 0 if birthyear == 2002 & percent_rank <= .40
replace tmt_status = 1 if birthyear == 2002 & percent_rank > .40
*Fill in birthyear 2001-1999
*Assign year 1998
replace tmt_status = 0 if birthyear == 1998 & percent_rank <= .72
replace tmt_status = 1 if birthyear == 1998 & percent_rank > .72 & percent_rank <= .86
replace tmt_status = 2 if birthyear == 1998 & percent_rank > .86
*Fill in birthyear 1997-1990
* Do some tabulates etc to convince yourself the randomization is as expected
* Save tempfile of data to be merged to later
* (Consider saving this as a master data set https://worldbank.github.io/dime-data-handbook/measurement.html#constructing-master-data-sets)
tempfile assignment_results
save `assignment_results'
restore
merge m:1 id using `assignment_results'
This code can be made more concise using a loop, but random assignment is so important that I personally always go for clarity over conciseness when doing this.
This does not answer the question about splitsample specifically, but it addresses what you are trying to do. You will have to decide what to do with groups whose size cannot be split into the exact ratio you prefer.
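If exact shares within each group are not required, a different technique is to compare one uniform draw per person with that person's own chances. This is only a sketch on simulated person-level data: the names pA, pB and u are hypothetical stand-ins for the chance variables in the question, and the shares it produces match the chances only in expectation, not exactly.

```stata
clear
set seed 123456
set obs 2000

* Simulated person-level data (assumption): one row per individual
gen long id = _n
gen byte individualnumber = cond(_n <= 1000, 7, 9)

* Chances of control group A and B, taken from the question's table
gen double pA = cond(individualnumber == 7, 0.43, 0.14)
gen double pB = cond(individualnumber == 7, 0, 0.14)

* One uniform draw per person decides the group:
* 1 = ControlA, 2 = ControlB, 0 = Treated
gen double u = runiform()
gen byte tmt_status = cond(u < pA, 1, cond(u >= 1 - pB, 2, 0))

tabulate individualnumber tmt_status, row
```

Because each person gets a single draw and a single status, nobody can end up in both a control group and the treatment group. As in the code above, the result would be merged back onto the panel by id.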
I have a dataset for U.S. manufacturing workers in the past 30 decades, and I am particularly interested in the following variables:
Month and year of 1st manufacturing job, recorded separately and named "start_month_job_1" & "start_yr_job_1."
Month and year of leaving the 1st manufacturing job, recorded separately and named "end_month_job_1" & "end_yr_job_1."
The reason for leaving the job (e.g. retirement, firing, factory shutdown, etc.), named "leaving_reason"
Month and year of 2nd manufacturing job, recorded separately and named "start_month_job_2" & "start_yr_job_2."
Month and year of leaving the 2nd manufacturing job, recorded separately and named "end_month_job_2" & "end_yr_job_2."
I am trying to create a variable that measures the duration of economic inactivity/idleness. I am defining "duration of economic inactivity" as the time difference between leaving the 1st job and starting another job. I have created a variable that accomplishes that with years, as below:
gen econ_inactivity_duration_1 = start_yr_job_2 - end_yr_job_1
replace econ_inactivity_duration_1 = 2018 - end_yr_job_1 if missing(start_yr_job_2) // In cases where a worker never starts a second job until 2018, which is the latest year measured in the survey.
However, I want to actually create an economic_inactivity_duration variable that takes into account the difference in month and year, for both starting and leaving a job, respectively. For instance, the duration for the worker in row 1 would be 2 months, between May, 1993 and July, 1993, as opposed to zero, which is what my code above computes.
dataex start_month_job_1 byte start_yr_job_1 byte end_month_job_1 byte end_yr_job_1 byte start_month_job_2 byte start_yr_job_2 byte end_month_job_2 byte end_yr_job_2 byte leaving_reason
3 1990 5 1993 7 1993 4 1994 "Firm shutdown"
1 2003 7 2015 . . . . "job automation"
98 1979 98 2004 . . . . "Firm shutdown"
98 1975 98 2010 98 2010 98 2015 "job automation"
1 1983 12 1985 1 1986 . . "Firm shutdown"
98 1996 98 1998 . . . . "Firm shutdown"
There is probably a better way, but here is a crude method.
* Data example
input end_month_job_1 end_yr_job_1 start_month_job_2 start_yr_job_2
5 1993 7 1993
end
* Calculate months since 1960
gen j1_end = (end_yr_job_1 - 1960) * 12 + end_month_job_1
gen j2_start = (start_yr_job_2 - 1960) * 12 + start_month_job_2
* Calculate difference
gen wanted = j2_start - j1_end
* Check difference is positive
assert wanted > 0
list
     +------------------------------------------------------------------------+
     | end_mo~1   end_yr~1   s~mont~2   s~yr_j~2   j1_end   j2_start   wanted |
     |------------------------------------------------------------------------|
  1. |        5       1993          7       1993      401        403        2 |
     +------------------------------------------------------------------------+
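The same calculation can be sketched with Stata's ym() function, which returns monthly dates directly. Two things here are assumptions rather than facts from the question: that a month coded 98 means "unknown" and should be treated as missing, and that workers without a second job are censored at December 2018.

```stata
clear
input end_month_job_1 end_yr_job_1 start_month_job_2 start_yr_job_2
5 1993 7 1993
7 2015 . .
98 2004 . .
12 1985 1 1986
end

* Assumption: 98 is a code for an unknown month
foreach v in end_month_job_1 start_month_job_2 {
    replace `v' = . if `v' == 98
}

* Monthly dates; ym() returns missing if the month is missing
gen j1_end = ym(end_yr_job_1, end_month_job_1)
gen j2_start = ym(start_yr_job_2, start_month_job_2)
format j1_end j2_start %tm

* Gap in months between leaving job 1 and starting job 2
gen wanted = j2_start - j1_end

* Assumption: no second job means censoring at December 2018
replace wanted = ym(2018, 12) - j1_end if missing(start_yr_job_2)

list
```

The first worker gets 2 months, as in the answer above; the worker with an unknown end month stays missing rather than getting a made-up duration.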
I have a dataset containing various drugs and the dates they were supplied. I would like to create an indicator variable DIBP that takes a value of 1 if the same drug was supplied during both period 1 and period 2 of a given year, and zero otherwise. Period 1 is 1 April to 30 June, and period 2 is 1 October to 31 December.
I have written the following code:
. input id month day year str10 drug
id month day year drug
1. 1 5 1 2003 aspirin
2. 1 11 1 2003 aspirin
3. 1 6 1 2004 aspirin
4. 1 5 1 2005 aspirin
5. 1 11 1 2005 aspirin
6. end
.
. gen date = mdy(month,day,year)
. format date %d
.
. gen period = 1 if inlist(month,4,5,6)
(2 missing values generated)
. replace period = 2 if inlist(month,10,11,12)
(2 real changes made)
.
. label define plab 1"1 April to 30 June" 2"1 October to 31 December"
. label value period plab
.
. * Generate indicator
. gen DIBP = 0
. label var DIBP "Drug In Both Periods"
.
. bysort id year: replace DIBP = 1 if drug[period==1] == "aspirin" & drug[period==2] == "aspirin"
(0 real changes made)
.
. list
     +---------------------------------------------------------------------------------+
     | id   month   day   year      drug        date                     period   DIBP |
     |---------------------------------------------------------------------------------|
  1. |  1       5     1   2003   aspirin   01may2003         1 April to 30 June      0 |
  2. |  1      11     1   2003   aspirin   01nov2003   1 October to 31 December      0 |
  3. |  1       6     1   2004   aspirin   01jun2004         1 April to 30 June      0 |
  4. |  1       5     1   2005   aspirin   01may2005         1 April to 30 June      0 |
  5. |  1      11     1   2005   aspirin   01nov2005   1 October to 31 December      0 |
     +---------------------------------------------------------------------------------+
I would expect DIBP to take a value of 1 for observations 1, 2, 4 and 5 (because aspirin was taken during both periods in 2003 and 2005) and a value of zero for observation 3 (because aspirin was only taken during one period in 2004), but this isn't the case. Where am I going wrong? Thank you.
There is a problem apparent with your use of subscripts. You seem to be assuming that a subscript can be used to select other observations, which can indeed be done individually. But what you tried is legal yet not what you want.
The expressions used as subscripts
period == 1
period == 2
will be evaluated as true (1) or false (0) according to the value of period in the current observation. Then either observation 0 (which is always regarded as having missing values) or observation 1 (the first in each group of observations) will be used. Otherwise put, subscripts evaluate as observation numbers, not as defining subsets of the data.
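A small sketch shows what such a subscript actually picks up. The data and the names picked1 and picked2 are invented for illustration only.

```stata
clear
input id period str10 drug
1 1 aspirin
1 2 ibuprofen
end

* period == 1 is 1 (true) or 0 (false) in the current observation,
* so the subscript means drug[1] or drug[0]
gen str10 picked1 = drug[period == 1]
gen str10 picked2 = drug[period == 2]

list
```

In observation 1, picked1 is drug[1], "aspirin", and picked2 is drug[0], the empty string. In observation 2 it is the other way round, so "ibuprofen" is never picked at all. That is also why the original replace made no changes: in every observation one of the two subscripts is 0, and an empty string is never equal to "aspirin".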
There is a further puzzle because even for the same person and year it seems that in principle period 1 or period 2 could mean several observations. In the example given, the drug is constant anyway, but what would you expect the code to do if the drug were different? The crux most evident to me is distinguishing between a flag for any prescriptions of a certain drug and all prescriptions of that drug in a period. More at this FAQ.
Otherwise this code may help. Extension to several drugs is left as an exercise.
clear
input id month day year str10 drug
1 5 1 2003 aspirin
1 11 1 2003 aspirin
1 6 1 2004 aspirin
1 5 1 2005 aspirin
1 11 1 2005 aspirin
end
generate date = mdy(month,day,year)
format date %td
* code needs modification if any month is 1, 2, 3, 7, 8, 9
generate period = 1 if inlist(month,4,5,6)
replace period = 2 if inlist(month,10,11,12)
label define plab 1"1 April to 30 June" 2"1 October to 31 December"
label value period plab
bysort id year period (date): egen all_aspirin = min(drug == "aspirin")
by id year period: egen any_aspirin = max(drug == "aspirin")
by id year : gen both_all_aspirin = period[1] == 1 & period[_N] == 2 & all_aspirin[1] & all_aspirin[_N]
by id year : gen both_any_aspirin = period[1] == 1 & period[_N] == 2 & any_aspirin[1] & any_aspirin[_N]
list id date drug *aspirin
     +----------------------------------------------------------------------+
     | id        date      drug   all_as~n   any_as~n   b~ll_a~n   b~ny_a~n |
     |----------------------------------------------------------------------|
  1. |  1   01may2003   aspirin          1          1          1          1 |
  2. |  1   01nov2003   aspirin          1          1          1          1 |
  3. |  1   01jun2004   aspirin          1          1          0          0 |
  4. |  1   01may2005   aspirin          1          1          1          1 |
  5. |  1   01nov2005   aspirin          1          1          1          1 |
     +----------------------------------------------------------------------+
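For several drugs, one possible sketch (under the "any" interpretation only) is to add drug to the by-group, so that the flag is computed separately for each person, year and drug. The ibuprofen row and the names in_p1 and in_p2 are invented for illustration.

```stata
clear
input id month day year str10 drug
1 5 1 2003 aspirin
1 11 1 2003 aspirin
1 6 1 2004 aspirin
1 5 1 2003 ibuprofen
end

gen byte period = 1 if inlist(month,4,5,6)
replace period = 2 if inlist(month,10,11,12)

* Within person, year and drug: was the drug supplied in each period?
bysort id year drug: egen byte in_p1 = max(period == 1)
by id year drug: egen byte in_p2 = max(period == 2)

* 1 if this drug was supplied in both periods of this year
gen byte DIBP = in_p1 & in_p2

list id year drug period DIBP, sepby(id year)
```

Here aspirin in 2003 gets DIBP equal to 1, while aspirin in 2004 and ibuprofen in 2003 get 0 because each appears in period 1 only.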
As a style note, consider this example
generate dummy = 0
replace dummy = 1 if frog == 42
Experienced Stata programmers generally just write
generate dummy = frog == 42
See also this FAQ
I have a data set that has data sorted by months and years. I want to destring the month variable so that I can ultimately create one date variable, but as they are all labeled as January, February, etc. how do I destring the variable?
You don't. That's a job for date functions. All are documented, e.g. via help datetime.
destring is for numbers that happen to be read as string variables so that typical entries might be "42" and "666". Import as string usually arises when the variable includes metadata (e.g. header lines), or non-Stata flags for missings (e.g. "NA"), or some other non-numeric characters, often in as few as one observation. Import from MS Excel is a common cause, as spreadsheet users tend to be loose on sprinkling text in numeric data columns.
A variable with values such as "January" doesn't qualify. It's in your mind that month names map on to month numbers, but destring doesn't share that knowledge.
Date functions have this job:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 month float year
"January" 2017
"February" 1942
end
gen mdate = monthly(month + string(year), "MY")
list
     +-------------------------+
     |    month   year   mdate |
     |-------------------------|
  1. |  January   2017     684 |
  2. | February   1942    -215 |
     +-------------------------+
format mdate %tm
list
     +--------------------------+
     |    month   year    mdate |
     |--------------------------|
  1. |  January   2017   2017m1 |
  2. | February   1942   1942m2 |
     +--------------------------+
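If what you were after with destring was the month number (1 to 12), that can be recovered from the monthly date. A minimal sketch, where monthnum is a hypothetical variable name: dofm() converts the monthly date to a daily date and month() extracts the month from it.

```stata
clear
input str8 month float year
"January" 2017
"February" 1942
end

gen mdate = monthly(month + string(year), "MY")
format mdate %tm

* Monthly date -> daily date -> month number
gen byte monthnum = month(dofm(mdate))

list
```

This gives monthnum 1 for January and 2 for February, without any hand-written mapping from names to numbers.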
(Declaration of interest: original author of destring.)
See also this thread.
I have the following panel dataset.
I did
sort FirmID Year
to make the following.
FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
I want to create a new variable exitnextyear which is 1 if the firm does not exist next year, so that the output is
FirmID Year exitnextyear
1 1996 0
1 1997 0
1 1998 1
2 2000 0
2 2001 1
I think I have to use something like
by FirmID: gen exitnextyear (and something)
but I don't know what to do next.
clear
input FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
end
bysort FirmID (Year) : gen byte exitnextyear = _n == _N
list, sepby(FirmID)
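The code above flags the last observed year of each firm, which matches the example because there are no gaps. If a firm can be absent for a year and then reappear, and if you want such a gap to count as an exit too (that is an assumption about your definition, not something stated in the question), a variant is to look at the next observed year explicitly. The data below are modified to contain a gap.

```stata
clear
input FirmID Year
1 1996
1 1998
2 2000
2 2001
end

* 1 if the firm's next observation is not exactly one year later;
* for the last observation Year[_n+1] is missing, so the result is 1
bysort FirmID (Year) : gen byte exitnextyear = Year[_n+1] != Year + 1

list, sepby(FirmID)
```

Firm 1 is flagged in 1996 (absent in 1997) and in 1998 (its last year); firm 2 is flagged only in 2001.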
For the principles, see help and manual entries on by: and/or a tutorial review accessible here.
Row is spreadsheetspeak; in Stata the term is observation.