I have a dataset of the top management teams of US banks from 2005 - 2015.
Now I want to generate a change-variable if a TMT composition changed between 2006 and 2009.
So first I used:
drop if Year > 2009
drop if Year < 2006
by id (id), sort: gen changed = (DirectorID[1] != DirectorID[_N])
and afterwards I used
by id (id), sort: gen changed = (DirectorID[1] != DirectorID[_N]) if Year < 2010 & Year > 2005
However there is a difference in output between two variables:
247 cases of "No change" and 853 cases of "Change" in the first and 116 cases of "No change" and the rest as "Changed" in the second variable
Could anyone clarify what the differences between these two commands are in Stata?
There are a couple of reasons you may be seeing different counts. The data are most likely sorted differently for the two calls. The (id) part has no effect here because you are already sorting by id. What you likely want is to sort by year within id: bysort id (Year). That way the dataset will be in the same order for each command you type. In the second command, the if qualifier sets changed to missing for observations outside the year range, but DirectorID[1] and DirectorID[_N] still refer to the first and last observations of the whole id group, so the years outside the range are still included in the comparison. You could create a new variable to flag the years of interest, and then add that new variable to the bysort call.
Lastly, you need to decide whether you only want to look at changes year-over-year (the value of the changed could vary by year within id), or have the value of changed reflect whether there were any changes in DirectorID over the entire time frame of interest (the value of changed would be constant within id).
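A minimal sketch of both options, assuming one observation per id and Year and the variable names from the question (inwindow, changed_firstlast and changed_yoy are names made up for illustration):

```stata
* flag the years of interest
gen byte inwindow = inrange(Year, 2006, 2009)

* option 1: constant within id, compares the first and last year inside the window
bysort id inwindow (Year): gen byte changed_firstlast = DirectorID[1] != DirectorID[_N] if inwindow

* option 2: varies by year, compares each year with the previous one inside the window
bysort id inwindow (Year): gen byte changed_yoy = DirectorID != DirectorID[_n-1] if inwindow & _n > 1
```

Because inwindow is part of the by-group, [1] and [_N] refer only to observations inside the window, so the result should be the same whether or not the other years are dropped first.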
Here's a toy example illustrating the difference. Essentially, when you drop the data, the first and last observations within each id can only come from the years you kept, so they may even be the same observation. When you use if, the other years are still in memory: the if qualifier only restricts which observations receive a value, while [1] and [_N] still point to the first and last observations of the full group:
. clear
. input id year director_id
id year directo~d
1. 1 2016 10
2. 1 2017 20
3. 1 2018 30
4. end
.
. bys id (year): gen changed = (director_id[1] != director_id[_N]) if year < 2018 & year > 2016
(2 missing values generated)
. list, clean noobs
id year direct~d changed
1 2016 10 .
1 2017 20 1
1 2018 30 .
.
. drop if inlist(year, 2016,2018)
(2 observations deleted)
. bys id (year): gen changed2 = (director_id[1] != director_id[_N]) if year < 2018 & year > 2016
. list, clean noobs
id year direct~d changed changed2
1 2017 20 1 0
I added a sort by year since that seems in the spirit of your exercise.
I want to generate a variable month that has the month and year together as 2013M01.
Below is a sample of my data:
clear
input expected_arrival_month year
1 2013
2 2014
3 2015
4 2016
5 2017
6 2018
end
I tried the following command:
generate month = .
replace month = 2013M01 if expected_arrival_month == 1 & year == 2013
However, I received the error:
2013M01 invalid name
r(198)
How can I get the desired output?
For essentially all Stata purposes a numeric monthly date variable is better than anything hand- or homemade (and certainly than dates held as string variables). You can get such variables to appear as you ask. You certainly do not need to calculate individual values directly. Although this code is for a minimal dataset it will apply to all values in numeric variables as you describe. See help datetime for invaluable (and unavoidable) information.
clear
set obs 1
generate year = 2013
generate arrival_month = 1
generate wanted = ym(year, arrival_month)
format wanted %tmCCYY!MNN
list
+---------------------------+
| year arriva~h wanted |
|---------------------------|
1. | 2013 1 2013M01 |
+---------------------------+
(As commented, you should provide example data directly and in a way that makes variable types clear. If one or both variables are really string, apply destring first or use monthly().)
The issue here is in dealing with string rather than numeric variables. Given that the variable you are generating is a string variable, the contents of the variable must be enclosed in quotation marks:
generate month = "2013M01" if expected_arrival_month == 1 & year == 2013
There would also be other more efficient ways to deal with this generation, for example using Stata's egen command (and concat), or datetime functions as indicated in another response.
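As a rough sketch of the string route, assuming year and expected_arrival_month are both numeric (month_str is a made-up name, and string() is used here in place of egen's concat()):

```stata
* build "2013M01"-style strings for every observation at once
* %02.0f pads the month with a leading zero
generate month_str = string(year) + "M" + string(expected_arrival_month, "%02.0f")
list
```

The result is a string variable, so it is fine for labelling but cannot be used directly for date arithmetic or sorting by time across years in the way a numeric monthly date can.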
I have firm level data for three years (2015, 2016 and 2017).
I need to know which firms have a change in a dummy variable ModelJaarrekening from year 2016 to year 2017 - a dummy that determines if the firm is large (value 2) or small (value 1).
In other words, I need to select the firms that have a value for ModelJaarrekening of 2 in years 2015 and 2016 but have the value 1 in year 2017.
The following command does not work:
gen dummy=1 if (ModelJaarrekening ==2 & year<2017 & ModelJaarrekening ==1 & year==2017)
I think this is because it first executes the first command and deletes the other observations.
How can I solve this problem?
Your command is legal.
It doesn't delete any observations!
It just doesn't do what you want. The reason is that nothing in your syntax instructs Stata to look beyond each observation separately. So,
ModelJaarrekening == 2 & ModelJaarrekening == 1
is never going to be true of any observation: a variable can't be 1 and 2 in the same observation. The same kind of problem holds for
year < 2017 & year == 2017
The result is that your indicator will have values that are all missing.
What you want is more like this. I posit a firm identifier id.
local foo ModelJaarrekening
egen OK1 = total(`foo' == 2 & year < 2017), by(id)
egen OK2 = total(`foo' == 1 & year == 2017), by(id)
gen wanted = OK1 & OK2
Then, for each firm, OK1 will be 1 or more if and only if there was any value 2 before 2017, and OK2 will be 1 if and only if there was value 1 in 2017.
wanted will be 1 if and only if both its arguments are non-zero (in this case, negative values are impossible and only positive values count); and 0 otherwise.
It is thus an indicator (you say dummy) with values 1 and 0.
Indicators that are 1 or missing are less useful in Stata than those that are 1 or 0.
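If the requirement is the stricter one stated in the question (value 2 in both 2015 and 2016, then value 1 in 2017), a possible variation is sketched below. It assumes a firm identifier id and exactly one observation per firm and year; n2_before, n1_2017 and wanted_strict are names made up for illustration.

```stata
* count the years 2015-2016 in which the firm was large (value 2)
egen n2_before = total(ModelJaarrekening == 2 & inlist(year, 2015, 2016)), by(id)

* count whether the firm was small (value 1) in 2017
egen n1_2017 = total(ModelJaarrekening == 1 & year == 2017), by(id)

* 1 if large in both earlier years and small in 2017, 0 otherwise
gen byte wanted_strict = n2_before == 2 & n1_2017 == 1
```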
I am working with a very large dataset (1 million obs.).
I have a string date that looks like this
key seq startdate (string)
AD07 1 August 2011
AD07 2 June 2011
AD07 3 February 2004
AD07 4 November 2004
AD07 5 2001
AD07 6 January 1998
AD5c23 1 January 2014
AD5c235 2 February 2014
AD5c235 3 2014
These are self-reported employment dates.
Some did not report the month at which they started.
But I would like to replace for AD07 the date “2001” to “January 2001”. Hence I cannot simply replace it because I would like to keep the original years but add the month in the string variable.
I started with:
levelsof start if start<="2016", local(levels)
which gives me all the years without the month from 1900 to 2016.
Now I would like to add "January" for the years without the month and keep original years.
How should I do that without using replace for every year? foreach loop?
You have a serious data quality problem if people are claiming to have started work in 1900 and every year since then! Even considering early employment starts and delayed retirement, that implies people older than the oldest established age.
Also, imputing "January" will impart bias as almost all job durations will be longer than they would have been. Real January starts will be correct, but no others: "June" or "July" or random months would make more obvious statistical sense.
That said, there is no loop needed here. You're asking for one line, say
replace startdate = "January " + startdate if length(trim(startdate)) == 4
or
replace startdate = "January " + startdate if real(startdate) < .
-- assuming a follow-up in converting to numeric dates. The logic there is that all year-only dates trim down to 4 characters, or (better) that feeding month names to real() will yield missings.
That said in turn, creating a new variable is better practice than over-writing one. Also, consider throwing away the month detail. Is it needed?
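A sketch of that better practice, including the follow-up conversion to a numeric monthly date. The names startdate2 and mdate are made up for illustration, and the imputed month is still "January" only because that is what was asked for.

```stata
* work on a copy so the self-reported strings stay untouched
gen startdate2 = trim(startdate)
replace startdate2 = "January " + startdate2 if real(startdate2) < .

* convert "August 2011" etc. to a numeric monthly date
gen mdate = monthly(startdate2, "MY")
format mdate %tm

* anything that failed to convert deserves a look
count if missing(mdate)
```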
EDIT
You may have another problem if there are people with two or more jobs in the same year without month specifications. You don't want to impute all months in question as "January". You can check for such observations by
gen byte incomplete = real(startdate) < .
gen year = substr(trim(startdate), -4, 4)
bysort key year incomplete : gen byte multiplebad = incomplete & _N > 1
I am exploring an effect that I think will vary by GDP levels, from a data set that has, vertically, country and year (1960 to 2015), so each country label is on 55 rows. I ran
sort year
by year: egen yrank = xtile(rgdp), nquantiles(4)
which tags every year row with what quartile of GDP they were in that year. I want to run this:
xtreg fiveyearg taxratio if yrank == 1 & year==1960
which would regress my variable (tax ratio) against some averaged gdp data from countries that were in the bottom quartile of GDPs in 1960 alone. So even if later on they grew enough to change ranks, the later data would still be in the regression pool. Sadly, I cannot get this code, or any variation, to run.
My current approach is to try to generate some new variable that would give every row with country label X a value of 1 if they were in the bottom quartile in 1960, but I can't get that to work either. I have run out of ideas, so I thought I would ask!
Based on your latest comment, which describes the (un)expected behavior:
clear
set more off
*----- example data -----
input ///
country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
list, sepby(country)
*----- what you want -----
// tag countries whose -rank- in their earliest year is 1
// (I assume the earliest -year- for each country is always 1960)
// sorting by year within country makes rank[1] refer to that earliest year
bysort country (year) : gen toreg = rank[1] == 1
list, sepby(country)
// run regression conditional on -toreg-
xtreg ... if toreg
Check help subscripting if in doubt.
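If you would rather not rely on 1960 being the first observation for every country, one alternative is to test the year explicitly. This is only a sketch using the variable names from the question (yrank, year, fiveyearg, taxratio) plus an assumed identifier country; bottom1960 is a made-up name.

```stata
* 1 on every row of a country that was in the bottom quartile in 1960, 0 otherwise
egen byte bottom1960 = max(yrank == 1 & year == 1960), by(country)

* assumes the panel has already been declared with xtset
xtreg fiveyearg taxratio if bottom1960
```

A country with no 1960 observation, or with a missing yrank in 1960, gets 0 here rather than missing.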
Case 1
Suppose the data are sorted by year then by month (always have 3 observations in data).
Year Month Index
2014 11 1.1
2014 12 1.5
2015 1 1.2
I need to copy the Index of last month to new observation
Year Month Index
2014 11 1.1
2014 12 1.5
2015 1 1.2
2015 2 1.2
Case 2
Year is removed from data. So we only have Month and Index.
Month Index
1 1.2
11 1.1
12 1.5
Data is always collected from consecutive 3 months. So 1 is the last month.
Still, ideal output is
Month Index
1 1.2
2 1.2
11 1.1
12 1.5
I solve it by creating another dataset only contains Month (1,2...12). Then right join the original dataset twice. But I think there's more elegant way to deal with this.
Case 1 can be a straight-forward data step. Add end=eof to the set statement to initialize a variable eof that returns value 1 when the data step is reading the last row of the data set. An output statement in the data step outputs a row during each iteration. If eof=1, a do block runs that increments the month by 1 and outputs another row.
data want;
set have end=eof;
output;
if eof then do;
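month=mod(month,12)+1; /* 12 rolls over to 1, never to 0 */
if month=1 then year=year+1; /* crossing December-January moves to the next year */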
output;
end;
run;
For case 2, I would switch to an sql solution. Join the table to itself on month, incremented by 1 in the second table. Use the coalesce function to keep the values from the first table where they exist and otherwise take the values from the second table. The join condition never matches across December-January, so a case crossing the year end produces 5 rows, with the row that follows December sorted last. Limit the output to four rows using the outobs= option in proc sql to exclude that unwanted second January.
proc sql outobs=4;
create table want as
select
coalesce(t1.month,mod(t2.month,12)+1) as month,
coalesce(t1.index,t2.index) as index
from
have t1
full outer join have t2
on t1.month = t2.month+1
order by
coalesce(t1.month,t2.month+1)
;
quit;