Splitsample in Stata 16: How to create samples based on varying proportions saved in a variable? - stata

Datastructure: I use panel data in which an observation represents a certain individual in a given year (2015-2021). Only observations are included of individuals who are between the 15 and 25 years old. There are 2857 observations of 1373 individuals in total.
Goal: The goal is to investigate the effect of a policy change in 2018. In doing so, I designed a quasi-experimental design in which there are two controlgroups and a treatmentgroup defined in terms of their age:
Controlgroup A: individuals 15-17 years old
Treatmentgroup: individuals 18-22 years old
Controlgroup B: individuals 23-25 years old
Dividing individuals into treatment and controlgroups based on varying chance:
due to methodological reasons, individuals selected in a controlgroup may not become part of the treatment group (due to aging over time) and vice versa. Therefore I am confronted with the question how to select the right individuals (given their age and the year) for the treatment and controlgroups.
To ensure that every year has observations of individuals in all ages, I came up with the following design (see picture).
There are 17 theoretically possible individuals in my data (vertical as in the picture) who age over 7 years (2015-2021). I would like to sample the individuals into the treatment and controlgroups based on the chances mentioned in the table beneath to ensure all ages are represented in all years.
Programming
I constructed a variable (1-17) indicating what number an individual represents (like the vertical numbers in the table above)
gen individualnumber=(age-year)+2007
I constructed three variables indicating the chances of being in controlgroup A, B or treatment in the following way:
gen Chanceofbeingcontrol_1517=0
replace Chanceofbeingcontrol_1517=1 if individualnumber==1 | individualnumber==2 | individualnumber==3
replace Chanceofbeingcontrol_1517=0.75 if individualnumber==4
replace Chanceofbeingcontrol_1517=0.60 if individualnumber==5
replace Chanceofbeingcontrol_1517=0.50 if individualnumber==6
replace Chanceofbeingcontrol_1517=0.43 if individualnumber==7
replace Chanceofbeingcontrol_1517=0.29 if individualnumber==8
replace Chanceofbeingcontrol_1517=0.14 if individualnumber==9
gen Chanceofbeingcontrol_2325=0
replace Chanceofbeingcontrol_2325=1 if individualnumber==15 | individualnumber==16 | individualnumber==17
replace Chanceofbeingcontrol_2325=0.75 if individualnumber==14
replace Chanceofbeingcontrol_2325=0.60 if individualnumber==13
replace Chanceofbeingcontrol_2325=0.50 if individualnumber==12
replace Chanceofbeingcontrol_2325=0.43 if individualnumber==11
replace Chanceofbeingcontrol_2325=0.29 if individualnumber==10
replace Chanceofbeingcontrol_2325=0.14 if individualnumber==9
gen Chanceofbeingtreated=1-(Chanceofbeingcontrol_1517+Chanceofbeingcontrol_2325)
After that I wanted to construct the samples...
splitsample, generate(treatedornot) split(Chanceofbeingcontrol_1517 Chanceofbeingtreated Chanceofbeingcontrol_2325) cluster(individualnumber) rround show
...but I received an error since only a numlist might be used in the split(numlist) subcommand.
Question: How to construct the samples or overcome this error in an efficient way?
Example: An individuals (number 7 in the table) who is 15 years old in 2015 (controlgroup 1 age), will be 18 years old in 2018 (which is the treatment age). But this individual may not be part of both the treatment and controlgroup and should therefore be a member of one of the two. Therefore I want to draw three random samples among all number 7 individuals.
Let's state there are 100 individuals like individual 7 in the table.
Sample 1 is controlgroup A and individual 7 will occur 43 times in this sample
Sample 2 is the treatment group so individual 7 occurs 57 times in this sample
While individual 7 will not occur in sample 3 since this person is never older than 22 during 2015-2021.

What's common for all people who were 9 in 2015, 10 in 2016, 11 in 2017 is that they were born 2006. And all who were 10 in 2015 was born 2005. So instead of a variable individualnumber that can be hard to understand for someone who reads your code, why don't you create a variable called birthyear. That will make it easier to explain your design to your peers.
Regardless of what you call the variable or what the value it contains represent, I would solve it something like this. You will probably need to tweak this code. Provide a replicable subset of your data (see the command dataex) if you want a replicable answer.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year double age
1 2017 15
1 2017 15
2 2017 15
2 2017 15
3 2017 15
3 2017 15
4 2017 15
4 2017 15
5 2015 12
5 2015 12
end
* Create the var that will display the
gen birthyear = year-age
preserve
* Collapse year-person level data to person level so
* that each individual only get one treatment status.
* You must have an individual id number for this
* Get standard deviation to test that data is good and the birthyear
* is identical for each individual across the panel data set
collapse (mean) birthyear (sd) bysd=birthyear, by(id)
* Test that birthyear is same across all indivudals - this is not needed,
* but good data quality assurance test. Then drop the var as it is not needed
assert bysd == 0
drop bysd
* Set seed to make replicable. Replace this seed when you have tested this
* script using a new random seed. For example from here:
* https://www.random.org/integers/?num=1&min=100000&max=999999&col=5&base=10&format=html&rnd=new
set seed 123456
*Generate a random number based on the seed
gen random_draw = runiform()
* For each birthyear, get the rank of the random number divided by the number
* of individuals in each birthyear
sort birthyear random_draw
by birthyear : gen percent_rank = _n/_N
*Initiate treatmen variable
gen tmt_status = .
label define tmt_status 0 "Treated" 1 "ControlA" 2 "ControlB"
*Assign birthyear 2006-2004 that are all the same
replace tmt_status = 1 if birthyear == 2006
replace tmt_status = 1 if birthyear == 2005
replace tmt_status = 1 if birthyear == 2004
*Assign birthyear 2003
replace tmt_status = 0 if birthyear == 2003 & percent_rank <= .25
replace tmt_status = 1 if birthyear == 2003 & percent_rank > .25
*Assign birthyear 2002
replace tmt_status = 0 if birthyear == 2002 & percent_rank <= .40
replace tmt_status = 1 if birthyear == 2002 & percent_rank > .40
*Fill in birthyear 2001-1999
*Assign year 1998
replace tmt_status = 0 if birthyear == 1998 & percent_rank <= .72
replace tmt_status = 1 if birthyear == 1998 & percent_rank > .72 & percent_rank <= .86
replace tmt_status = 2 if birthyear == 1998 & percent_rank > .86
*Fill in birthyear 1997-1990
* Do some tabulates etc to convince yourself the randomization is as expected
* Save tempfile of data to be merged to later
* (Consider saving this as a master data set https://worldbank.github.io/dime-data-handbook/measurement.html#constructing-master-data-sets)
tempfile assignment_results
save `assignment_results'
restore
merge m:1 id using `assignment_results'
This code can be made more concise using loop, but random assignment is so important as I personally always go for clarity over conciseness when doing this.
This is not answering specifically about splitsample, but it addresses what you are trying to do. You will have to decide how you want to do with groups that does not have a size that can be split into the exact ratio you prefer.

Related

Creating a flag using indexes

I'm looking to build flags for students who have repeated a grade, skipped a grade, or who have an unusual grade progression (e.g. 4th grade in 2008 and 7th grade in 2009). My data is unique at the student id-year-subject level and structured like this (albeit with more variables):
id year subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
This is the code that I've used:
sort id year grade
gen repeat_flag = .
replace repeat_flag = 1 if year!=year[_n+1] & grade==grade[_n+1] ///
& subject!=subject[_n+1] & id==id[_n+1]
replace repeat_flag = 0 if repeat_flag==.
One problem is that there are a lot of students who took a test in say 6 grade, didn't take one in 7th and then took one in 8th grade. This varies across years and school districts, as certain school districts adopted tests in different years for different grade levels. My code doesn't account this.
Regardless though I think there must be more elegant ways to do this and as a side note I wanted to know if the use of indexes is appropriate for a problem like this. Thanks!
Edit
Included a sample of what my data looks like above in response to one of the comments below. If still not clear any feedback is welcomed.
What may seem anomalous are students progressing faster or more slowly in tested grade than the passage of time would imply. That's possibly just one line for the grunt work:
clear
input id year str1 subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
end
bysort id (year) : gen flag = (tested - tested[_n-1]) - (year - year[_n-1])
list if flag != 0 & flag < . , sepby(id)
+---------------------------------------+
| id year subject tested~e flag |
|---------------------------------------|
5. | 2 2012 r 7 2 |
+---------------------------------------+

Adding across years

Quick question. I'm working with code that produces a spreadsheet that contains the information like the following:
year business sales profit
2001 a 5 3
2002 a 6 4
2003 a 4 2
2001 b 2 1
2002 b 6 3
2003 b 7 5
How can I get Stata to total sales and profits across years?
Thanks
Try
collapse (sum) sales profit, by(year)
or, if you want to retain your original data,
bysort year: egen tot_sales = total(sales)
egen stands for extended generate, a very useful command.

Compare each obs with rest of its sub-group

I have the following goal regarding my data structure
group; month; year; next_year
1; February; 2014; 0
1; March; 2006; 0
1; November; 2013; 1
2; January; 2014; 0
3; January; 2004; 0
I do have group, month and year, however the column next_year needs to be generated from the first three: For each observation, I want to check if there is another observation within the same group that has a date-value which falls into the period of next year. If so, I want to set the value of next_year to 1, otherwise to 0 (see example).
I started by converting the date into a format that Stata can interpret - using ym(month, year) - such that I can make comparisons. However, I am not sure how to iterate over all observations within the group in order to determine if that is the case or not.
I would know how to do it in e.g. Java, but I don't for Stata. I suppose I should not start with loops as there are probably some implemented commands for such problems.
If you want to check if there is a following observation within the next 12 months, you can try:
clear
set more off
*----- example data -----
input group str8 month year
1 March 2006
1 March 2013
1 November 2013
1 January 2013
2 January 2014
3 January 2004
end
*----- what you want -----
gen dat = monthly(month + string(year), "MY")
format dat %tm
bysort group (dat): gen next = dat[_n+1] - dat <= 12
list, sepby(group)
Make sure you understand the difference between Nick's code and mine. They work under different assumptions. You can check the differences running both pieces of code with the data I have provided (originally Nick's but with one observation deleted to get the point across; by chance, if you use Nick's data without the modification, the results will be the same).
You are correct in avoiding an explicit loop. This kind of problem is soluble using by:.
I modified your example to have two observations for one group in one year.
clear
input group str8 month year
1 February 2014
1 March 2006
1 March 2013
1 November 2013
2 January 2014
3 January 2004
end
bysort group (year) : gen next_year = year[_n+1] == year + 1
bysort group year (next_year) : replace next_year = next_year[_N]
list, sepby(group)
+------------------------------------+
| group month year next_y~r |
|------------------------------------|
1. | 1 March 2006 0 |
2. | 1 November 2013 1 |
3. | 1 March 2013 1 |
4. | 1 February 2014 0 |
|------------------------------------|
5. | 2 January 2014 0 |
|------------------------------------|
6. | 3 January 2004 0 |
+------------------------------------+
Getting an explicit sort order is essential. Within group, we look ahead to see if the next year is the current year plus 1, assigning 1 if true and 0 if false. That will at most be true for the last observation for a given group and year. If there is more than one observation for each group and year, any occurrence of 1 must be spread to all such observations.
For a tutorial on by:, see Speaking Stata: How to move step by: step.
The assumption here is that you mean in the next calendar year, not in the next 12 months. Making your dates into Stata monthly dates will be needed for most other problems, but doesn't make this one easier. Here is one way to do that in your situation, assuming that month is string and year is numeric:
gen mdate = monthly(month + string(year), "MY")
format mdate %tm

Expanding observations from a range to form a panel

I currently have a data set that appears as follows
mnbr uact_id hiredate termdate
9 3709 19510101 20000915
20 9409 20001001 20080601
33 25646 19990201 20000731
mnbr represents the member number of a given worker in a labor union. uact_id is the shop they were working for and hiredate and termdate (given yyyymmdd) represent the given dates they were at the shop/uact_id. I am currently trying to use the expand command in Stata to create a panel such that there is one observation per year for each member number (mnbr) between the indicators of hiredate and termdate.
i.e. it should ideally look like
mnbr uact_id year
9 3709 1951
9 3709 1952
9 3709 1953
9 3709 1954
etc. for each member number for each year.
Assuming arbitrarily that the dates are strings, we can go
gen year = real(substr(hiredate, 1, 4))
gen duration = real(substr(termdate, 1, 4)) - year + 1
expand duration
bysort mnbr : replace year = year[_n-1] + 1 if _n > 1
If the dates are numeric, specifically integers, then the first two lines could be
gen year = floor(hiredate/10000)
gen duration = floor(termdate/10000) - year + 1
The replace step is discussed within
How can I replace missing values with previous or following nonmissing values or within sequences?

How to write the best code for data aggregation?

I have the following dataset (individual level data):
pid year state income
1 2000 il 100
2 2000 ms 200
3 2000 al 30
4 2000 dc 400
5 2000 ri 205
1 2001 il 120
2 2001 ms 230
3 2001 al 50
4 2001 dc 400
5 2001 ri 235
.........etc.......
I need to estimate average income for each state in each year and create a new dataset that would look like this:
state year average_income
ar 2000 150
ar 2001 200
ar 2002 250
il 2000 150
il 2001 160
il 2002 160
...........etc...............
I already have a code that runs perfectly fine (I have two loops). However, I would like to know is there any better way in Stata like sql style query?
This is shorter code than any suggested so far:
collapse average_income=income, by(state year)
This shouldn't need 2 loops, or any for that matter. There are in fact more efficient ways to do this. When you are repeating an operation on many groups, the bysort command is useful:
bysort year state: egen average_income = mean(income)
You also don't have to create a new dataset, you can just prune this one and save it. Start by only keeping the variables you want (state, year and average_income) and get rid of duplicates:
keep state year average_income
duplicates drop
save "mynewdataset.dta"
You have the SQL tag on the question. This is a basic aggregation query in SQL:
select state, year, avg(income) as average_income
from t
group by state, year;
To put this in a table, depends on your database. One of the following typically works:
create table NewTable as
select state, year, avg(income) as average_income
from t
group by state, year;
Or:
select state, year, avg(income) as average_income
into NewTable
from t
group by state, year;