My data is currently organized in Stata as follows:
input str2 Country gdp_2015 gdp_2016 gdp_2017 imports_2016 imports_2017 exports_2016 exports_2017
"A" 11 12 13 5 6 8 5
"B" 11 . . 5 6 10 5
"C" 12 13 . 5 6 8 5
end
gen net_imports = (imports_2017-exports_2017)
gen net_imports_toGDP = (net_imports/gdp_2017)
The code works, but it only produces a value when a country has 2017 data. I would like to create an imports-to-GDP ratio based on the most recent GDP observation available.
You could simply replace the missing data as follows:
replace gdp_2016 = gdp_2015 if mi(gdp_2016)
replace gdp_2017 = gdp_2016 if mi(gdp_2017)
However, a more general approach would begin by reshaping your data from wide to long:
reshape long gdp_ imports_ exports_, i(Country)
See help reshape for more detail on the command. The gdp_ etc. are the stubs that will become the new variable names, and i(Country) sets the identifier. Because no j() option is given, the year ends up in a variable named _j.
Then you can fill forward within each country using time-series operators:
encode Country, generate(Country_num)
xtset Country_num _j
replace gdp_=l.gdp_ if mi(gdp_) & !mi(l.gdp_)
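One caveat, stated as my understanding rather than a certainty: when replace is combined with a time-series operator such as l., the lagged values appear to be evaluated before any replacement happens, so a run of consecutive missing values (country B here) may be filled only one period deep. A minimal sketch that sidesteps the question by using explicit subscripts, which do cascade, and then keeps the latest year per country. The variable name year and the trimmed example data are assumptions made for illustration:

```stata
clear
input str2 Country gdp_2015 gdp_2016 gdp_2017 imports_2017 exports_2017
"A" 11 12 13 6 5
"B" 11 .  .  6 5
"C" 12 13 .  6 5
end

* Net imports only need the 2017 columns, so compute them while still wide
gen net_imports = imports_2017 - exports_2017

* Reshape GDP only; the other variables are constant within Country
reshape long gdp_, i(Country) j(year)

* Explicit subscripts cascade: a value filled in one row is seen by the next
bysort Country (year): replace gdp_ = gdp_[_n-1] if mi(gdp_)

* Keep the last year per country, which now holds the most recent GDP
by Country: keep if _n == _N
gen net_imports_toGDP = net_imports/gdp_
list Country gdp_ net_imports_toGDP
```

Country B ends up using its 2015 GDP and country C its 2016 GDP.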
Data structure: I use panel data in which an observation represents a given individual in a given year (2015-2021). Only individuals between 15 and 25 years old are included. There are 2857 observations of 1373 individuals in total.
Goal: The goal is to investigate the effect of a policy change in 2018. To do so, I set up a quasi-experimental design with two control groups and a treatment group, defined by age:
Control group A: individuals 15-17 years old
Treatment group: individuals 18-22 years old
Control group B: individuals 23-25 years old
Dividing individuals into treatment and control groups based on varying chances:
For methodological reasons, individuals selected into a control group may not later become part of the treatment group (as they age) and vice versa. I am therefore faced with the question of how to select the right individuals (given their age and the year) for the treatment and control groups.
To ensure that every year has observations of individuals in all ages, I came up with the following design (see picture).
There are 17 theoretically possible individuals in my data (listed vertically in the picture), each ageing over 7 years (2015-2021). I would like to sample individuals into the treatment and control groups using the chances in the table below, so that all ages are represented in all years.
Programming
I constructed a variable (1-17) indicating what number an individual represents (like the vertical numbers in the table above)
gen individualnumber=(age-year)+2007
I constructed three variables indicating the chances of being in controlgroup A, B or treatment in the following way:
gen Chanceofbeingcontrol_1517=0
replace Chanceofbeingcontrol_1517=1 if individualnumber==1 | individualnumber==2 | individualnumber==3
replace Chanceofbeingcontrol_1517=0.75 if individualnumber==4
replace Chanceofbeingcontrol_1517=0.60 if individualnumber==5
replace Chanceofbeingcontrol_1517=0.50 if individualnumber==6
replace Chanceofbeingcontrol_1517=0.43 if individualnumber==7
replace Chanceofbeingcontrol_1517=0.29 if individualnumber==8
replace Chanceofbeingcontrol_1517=0.14 if individualnumber==9
gen Chanceofbeingcontrol_2325=0
replace Chanceofbeingcontrol_2325=1 if individualnumber==15 | individualnumber==16 | individualnumber==17
replace Chanceofbeingcontrol_2325=0.75 if individualnumber==14
replace Chanceofbeingcontrol_2325=0.60 if individualnumber==13
replace Chanceofbeingcontrol_2325=0.50 if individualnumber==12
replace Chanceofbeingcontrol_2325=0.43 if individualnumber==11
replace Chanceofbeingcontrol_2325=0.29 if individualnumber==10
replace Chanceofbeingcontrol_2325=0.14 if individualnumber==9
gen Chanceofbeingtreated=1-(Chanceofbeingcontrol_1517+Chanceofbeingcontrol_2325)
After that I wanted to construct the samples...
splitsample, generate(treatedornot) split(Chanceofbeingcontrol_1517 Chanceofbeingtreated Chanceofbeingcontrol_2325) cluster(individualnumber) rround show
...but I received an error, because split() accepts only a numlist, not variables.
Question: How can I construct the samples, or get around this error, in an efficient way?
Example: An individual (number 7 in the table) who is 15 years old in 2015 (control group A age) will be 18 years old in 2018 (which is the treatment age). This individual may not be part of both the treatment and the control group and should therefore be a member of only one of the two. I therefore want to draw three random samples among all number 7 individuals.
Suppose there are 100 individuals like individual 7 in the table.
Sample 1 is control group A, and individual 7 will occur 43 times in this sample.
Sample 2 is the treatment group, so individual 7 occurs 57 times in this sample.
Individual 7 will not occur in sample 3 (control group B), since this person is never older than 22 during 2015-2021.
What all people who were 9 in 2015, 10 in 2016 and 11 in 2017 have in common is that they were born in 2006. And all who were 10 in 2015 were born in 2005. So instead of a variable individualnumber, which can be hard to understand for someone reading your code, why not create a variable called birthyear? That will make it easier to explain your design to your peers.
Regardless of what you call the variable or what its values represent, I would solve it something like this. You will probably need to tweak this code. Provide a replicable subset of your data (see the command dataex) if you want a replicable answer.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year double age
1 2017 15
1 2017 15
2 2017 15
2 2017 15
3 2017 15
3 2017 15
4 2017 15
4 2017 15
5 2015 12
5 2015 12
end
* Create the var that will hold the birth year
gen birthyear = year-age
preserve
* Collapse year-person level data to person level so
* that each individual only get one treatment status.
* You must have an individual id number for this
* Get standard deviation to test that data is good and the birthyear
* is identical for each individual across the panel data set
collapse (mean) birthyear (sd) bysd=birthyear, by(id)
* Test that birthyear is the same within each individual - this is not needed,
* but is a good data quality assurance test. Then drop the var as it is not needed.
* (sd is missing for individuals observed only once, so allow that too)
assert bysd == 0 | missing(bysd)
drop bysd
* Set seed to make replicable. Replace this seed when you have tested this
* script using a new random seed. For example from here:
* https://www.random.org/integers/?num=1&min=100000&max=999999&col=5&base=10&format=html&rnd=new
set seed 123456
*Generate a random number based on the seed
gen random_draw = runiform()
* For each birthyear, get the rank of the random number divided by the number
* of individuals in each birthyear
sort birthyear random_draw
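* Store as double so that comparisons with cutoffs such as .40 are exact at the boundary
by birthyear : gen double percent_rank = _n/_N
*Initiate treatment variable
gen tmt_status = .
label define tmt_status 0 "Treated" 1 "ControlA" 2 "ControlB"
label values tmt_status tmt_status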
*Assign birthyear 2006-2004 that are all the same
replace tmt_status = 1 if birthyear == 2006
replace tmt_status = 1 if birthyear == 2005
replace tmt_status = 1 if birthyear == 2004
*Assign birthyear 2003
replace tmt_status = 0 if birthyear == 2003 & percent_rank <= .25
replace tmt_status = 1 if birthyear == 2003 & percent_rank > .25
*Assign birthyear 2002
replace tmt_status = 0 if birthyear == 2002 & percent_rank <= .40
replace tmt_status = 1 if birthyear == 2002 & percent_rank > .40
*Fill in birthyear 2001-1999
*Assign year 1998
replace tmt_status = 0 if birthyear == 1998 & percent_rank <= .72
replace tmt_status = 1 if birthyear == 1998 & percent_rank > .72 & percent_rank <= .86
replace tmt_status = 2 if birthyear == 1998 & percent_rank > .86
*Fill in birthyear 1997-1990
* Do some tabulates etc to convince yourself the randomization is as expected
* Save tempfile of data to be merged to later
* (Consider saving this as a master data set https://worldbank.github.io/dime-data-handbook/measurement.html#constructing-master-data-sets)
tempfile assignment_results
save `assignment_results'
restore
merge m:1 id using `assignment_results'
This code can be made more concise using a loop, but random assignment is so important that I personally always go for clarity over conciseness when doing this.
This does not answer the question about splitsample specifically, but it addresses what you are trying to do. You will have to decide what to do with groups whose size cannot be split into the exact ratio you prefer.
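To convince yourself that the rank-based split gives the intended shares, here is a minimal sketch on simulated data. It assumes 100 hypothetical individuals like number 7 in the question (born in 2000), with 57 to be treated and 43 to go to control group A. The data, the id values and the cutoff are made up for illustration; percent_rank is stored as double so that the comparison at the boundary is exact.

```stata
clear
set seed 123456

* 100 hypothetical individuals who all share the same birth year
set obs 100
gen id = _n
gen birthyear = 2000

* Random order within birth year, then rank as a share of the group
gen random_draw = runiform()
sort birthyear random_draw
by birthyear : gen double percent_rank = _n/_N

* 0 = Treated for the first 57 percent, 1 = ControlA for the rest
gen tmt_status = cond(percent_rank <= .57, 0, 1)
tabulate tmt_status
```

Because the assignment uses ranks rather than the raw random numbers, the split is exact whenever the group size allows it, instead of only being right on average.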
I am trying to plot a bar graph for both the Sept and Oct waves. As you can see in the image, id identifies the individuals, who are surveyed across time. On a single graph I need to plot sept in-house, oct in-house, sept out-house and oct out-house, showing only the proportion of people who said yes in each. The other categories do not have to be taken into account.
I also have to show whiskers for the 95% confidence intervals of each of these categories.
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id sept_outhouse sept_inhouse oct_outhouse oct_inhouse)
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 3 3 3
5 4 4 3 3
6 4 4 3 3
7 4 4 4 1
8 1 1 1 1
9 1 1 1 1
10 1 1 1 1
end
label values sept_outhouse codes
label values sept_inhouse codes
label values oct_outhouse codes
label values oct_inhouse codes
label def codes 1 "yes", modify
label def codes 2 "no", modify
label def codes 3 "don't know", modify
label def codes 4 "refused", modify
save tokenexample, replace
rename (*house) (house*)
reshape long house, i(id) j(which) string
replace which = subinstr(proper(which), "_", " ", .)
gen yes = house == 1
label def WHICH 1 "Sept Out" 2 "Sept In" 3 "Oct Out" 4 "Oct In"
encode which, gen(WHICH) label(WHICH)
statsby, by(WHICH) clear: ci proportion yes, jeffreys
set scheme s1color
twoway scatter mean WHICH ///
|| rspike ub lb WHICH, xla(1/4, noticks valuelabel) xsc(r(0.9 4.1)) ///
xtitle("") legend(off) subtitle(Proportion Yes with 95% confidence interval)
This has to be solved backwards.
The means and confidence intervals have to be plotted using twoway, because graph bar is a dead end here: it does not support whiskers.
The confidence limits have to be put in variables before the graphics. Some graph commands, notably graph bar, will calculate means for you, but as said that is a dead end here. So, we need to calculate the means too.
To do that you need an indicator variable for Yes.
The best way I know to get the results then is to reshape to a different structure and then apply ci proportion under statsby.
As a detail, the option jeffreys is explicit as a signal that there are different methods for the confidence interval calculation. You should choose one knowingly.
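To see what choosing a method knowingly involves, a small sketch on made-up data (10 observations, 4 of them yes) runs three of the available methods side by side. It assumes the ci proportions syntax of Stata 14.1 or later; the data are invented for illustration.

```stata
clear
set obs 10

* Hypothetical indicator: 4 yes out of 10
gen byte yes = _n <= 4

* Default is the exact (Clopper-Pearson) interval
ci proportions yes

* Two alternatives; the limits differ noticeably in a sample this small
ci proportions yes, wilson
ci proportions yes, jeffreys
```

With small groups the methods can disagree visibly, which is why it is worth stating which one the whiskers show.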
I am having trouble using the L1. operator in Stata 14 to create lagged variables.
The resulting lag variable is 100% missing values!
gen d = L1.equity
Thanks in advance.
There is hardly enough information given in the question to know for certain, but as @Dimitriy V. Masterov suggested by asking how your data is tsset, you likely have an issue there.
As a quick example, imagine a panel with two countries, country 1 and country 3, with gdp by country measured over five years:
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end
Now, if you improperly tsset this data, you can easily generate the missing values you describe:
tsset year id
gen lag_gdp = L1.gdp
And notice now how you have 10 missing values generated. In this example, it happens because the panel and time variables are out of order and the (incorrectly specified) time variable has gaps (period 1 and period 3, but no period 2).
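For contrast, a sketch of the same example data declared correctly, with the panel variable first and the time variable second. Only the first year of each country should then lack a lag:

```stata
clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end

* Panel variable first, time variable second
tsset id year
gen lag_gdp = L1.gdp

* Expect 2 missing values: year 1 of each country has no predecessor
count if missing(lag_gdp)
```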
Something else I have witnessed is someone trying to tsset by their time variable and their analysis variable, which is also incorrect:
clear
input float(year gdp)
1 5
2 3
3 2
4 4
5 7
end
tsset year gdp
gen d = L1.gdp
I suspect you are having a similar issue.
Without knowing what your data looks like or how it is tsset, there is no way to diagnose this for certain, but it is very likely an issue with how the data is tsset.
I want to calculate growth rates in Stata for observations having the same ID. My data looks like this in a simplified way:
ID year a b c d e f
10 2010 2 4 9 8 4 2
10 2011 3 5 4 6 5 4
220 2010 1 6 11 14 2 5
220 2011 6 2 12 10 5 4
334 2010 4 5 4 6 1 4
334 2011 5 5 4 4 3 2
Now I want to calculate for each ID growth rates from variables a-f from 2010 to 2011:
E.g., for ID 10 and variable a it would be (3-2)/2, for variable b (5-4)/4, etc., with the results stored in new variables (e.g. growth_a, growth_b etc.).
Since I have over 120k observations and around 300 variables, is there an efficient way to do so (loop)?
My code looks like the following (simplified):
local variables "a b c d e f"
foreach x in local variables {
bys ID: g `x'_gr = (`x'[_n]-`x'[_n-1])/`x'[_n-1]
}
FYI: variables a-f are numeric.
But Stata says: 'local not found' and I am not sure whether the code is correct. Do I also have to sort for year first?
The specific error in
local variables "a b c d e f"
foreach x in local variables {
bys ID: g `x'_gr = (`x'[_n]-`x'[_n-1])/`x'[_n-1]
}
is an error in the syntax of foreach, which here expects syntax like foreach x of local variables, given your prior use of a local macro. With the keyword in, foreach takes the word local literally and here looks for a variable with that name: hence the error message. This is basic foreach syntax: see its help.
This code is problematic for further reasons.
Sorting on ID does not guarantee the correct sort order, here time order by year, for each distinct ID. If observations are jumbled within ID, results will be garbage.
The code assumes that all time values are present; otherwise the time gap between observations might be unequal.
A cleaner way to get growth rates is
tsset ID year
foreach x in a b c d e f {
gen `x'_gr = D.`x'/L.`x'
}
Once you have tsset (or xtset) the time series operators can be used without fear: correct sorting is automatic and the operators are smart about gaps in the data (e.g. jumps from 1982 to 1984 in yearly data).
For more variables the loop could be
foreach x of var <whatever> {
gen `x'_gr = D.`x'/L.`x'
}
where <whatever> could be a general (numeric) varlist.
EDIT: The question has changed since first posting and interest is declared in calculating growth rates only from 2010 to 2011, with the implication in the example that only those years are present. The more general code above will naturally still work for calculating those growth rates.
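As a worked check, here is a sketch applying the loop to the first two IDs and the first two variables of the example data. Trimming to a and b is only to keep the listing short:

```stata
clear
input ID year a b
10  2010 2 4
10  2011 3 5
220 2010 1 6
220 2011 6 2
end

* Declare panel and time so that D. and L. respect ID and year
tsset ID year

foreach x of varlist a b {
    gen `x'_gr = D.`x'/L.`x'
}

list, sepby(ID)
```

For ID 10 in 2011 this gives a_gr = (3-2)/2 = .5 and b_gr = (5-4)/4 = .25, matching the hand calculation in the question. The 2010 rows are missing because there is no earlier year to compare with.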
I currently have a data set that appears as follows
mnbr uact_id hiredate termdate
9 3709 19510101 20000915
20 9409 20001001 20080601
33 25646 19990201 20000731
mnbr represents the member number of a given worker in a labor union. uact_id is the shop they were working for, and hiredate and termdate (given as yyyymmdd) are the dates they started and left that shop/uact_id. I am currently trying to use the expand command in Stata to create a panel with one observation per year for each member number (mnbr), covering the years from hiredate to termdate.
i.e. it should ideally look like
mnbr uact_id year
9 3709 1951
9 3709 1952
9 3709 1953
9 3709 1954
etc. for each member number for each year.
Assuming arbitrarily that the dates are strings, we can go
gen year = real(substr(hiredate, 1, 4))
gen duration = real(substr(termdate, 1, 4)) - year + 1
expand duration
bysort mnbr : replace year = year[_n-1] + 1 if _n > 1
If the dates are numeric, specifically integers, then the first two lines could be
gen year = floor(hiredate/10000)
gen duration = floor(termdate/10000) - year + 1
The replace step is discussed within the Stata FAQ "How can I replace missing values with previous or following nonmissing values or within sequences?"
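Put together for the numeric case, a sketch using the three example rows from the question. It assumes one employment spell per mnbr, as in the example; with several spells per member the by-group would need to identify the spell as well.

```stata
clear
input long(mnbr uact_id hiredate termdate)
9  3709  19510101 20000915
20 9409  20001001 20080601
33 25646 19990201 20000731
end

* Year of hire and number of calendar years covered by the spell
gen year = floor(hiredate/10000)
gen duration = floor(termdate/10000) - year + 1

* One copy of the observation per year of the spell
expand duration

* Each copy takes the previous copy's year plus one; replace works
* through observations in order, so the increments cascade
bysort mnbr : replace year = year[_n-1] + 1 if _n > 1

list mnbr uact_id year if mnbr == 33
```

Member 9 gets 50 observations (1951-2000), member 20 gets 9 (2000-2008) and member 33 gets 2 (1999-2000).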