I have the following columns in my data
Firm - revenue - industry - year
I want to calculate the percentage change in total revenue for each industry between 2008 and 2015.
I tried:
by industry: egen tot_2008 = sum(revenue) if year == 2008
by industry: egen tot_2015 = sum(revenue) if year == 2015
gen change = (tot_2015-tot_2008)/tot_2008
But this doesn't work as the ifs restrict which years the egen creates values for as well as which years are included in each sum.
As you realise, the problem with your code is that the 2008 and 2015 totals are non-missing only in observations for those years respectively, so no observation ever has non-missing values of both variables and change is always missing. Here is one way to spread values to all years for each industry:
bysort industry: egen tot_2008 = total(revenue / (year == 2008))
bysort industry: egen tot_2015 = total(revenue / (year == 2015))
gen change = (tot_2015-tot_2008)/tot_2008
That hinges on expressions such as year == 2008 being evaluated as 1 if true and 0 if false. If you divide by 0, the result is a missing value, which egen's total() ignores, which is exactly what you want. Taking totals over all observations in an industry ensures that the same value is recorded in every observation for that industry.
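As a small check of that logic, here is a sketch on made-up data (the firms and revenue figures are hypothetical, purely for illustration). It uses bysort so that it does not depend on the data already being sorted by industry.

```stata
* Toy data with hypothetical values: two industries, two firms each, two years
clear
input str1 firm industry year revenue
"a" 1 2008 100
"b" 1 2008  50
"a" 1 2015 120
"b" 1 2015  90
"c" 2 2008 200
"d" 2 2008 200
"c" 2 2015 300
"d" 2 2015 100
end

* year == 2008 is 1 or 0, so revenue / (year == 2008) is revenue or missing
bysort industry: egen tot_2008 = total(revenue / (year == 2008))
bysort industry: egen tot_2015 = total(revenue / (year == 2015))
gen change = (tot_2015 - tot_2008) / tot_2008

* Every observation in an industry now carries the same totals and change
list industry year revenue tot_2008 tot_2015 change, sepby(industry)
```

In this toy data industry 1 goes from 150 to 210 and industry 2 stays at 400, and those values appear in the rows for both years.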
Here is another way that some find more explicit:
bysort industry: egen tot_2008 = total(cond(year == 2008, revenue, .))
bysort industry: egen tot_2015 = total(cond(year == 2015, revenue, .))
gen change = (tot_2015-tot_2008)/tot_2008
which hinges on the same principle, that missings will be ignored.
Note the use of the egen function total() here. The egen function sum() still works, and is the same function, but that name is undocumented as of Stata 9, in an attempt to avoid confusion with the Stata function sum().
To avoid double (indeed multiple) counting, use
egen tag = tag(industry)
to tag just one observation for each industry, to be used in graphs and tables for which you want that.
For discussion, see here, sections 9 and 10.
I have a dataset in Stata and want to count by group (loc_ID) and year. I used the following two lines of code:
egen count_obsv = tag(loc_ID year)
This adds an indicator to my dataset (count_obsv) which is 1 for the first observation of each new combination of loc_ID and year, and 0 for every other observation with the same combination.
Then I use:
collapse (sum) count_obsv, by(loc_ID year)
According to various Stata forum posts, this should result in, e.g.:
loc_ID year count_obsv
1 2000 342
1 2001 23
2 2008 23
...
But my output is:
loc_ID year count_obsv
1 2000 1
1 2001 1
2 2008 1
...
What am I summarizing wrong?
When you call up the tag() function of the egen command, you assign the value 1 to just one of any number of observations with the same distinct values for the specified variables, and 0 to all the others. Then when you ask for the sum of those values in the same groups of observations, you get the group sums of one 1 and any number of 0s, and each sum is thus necessarily 1.
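A minimal sketch on made-up data (hypothetical values; sum_tag and n_obs are illustrative names) shows the difference between summing the tag and counting observations:

```stata
* Toy data with hypothetical values: three groups of sizes 3, 1 and 2
clear
input loc_ID year
1 2000
1 2000
1 2000
1 2001
2 2008
2 2008
end

egen count_obsv = tag(loc_ID year)

* The group sum of the tag is one 1 plus any number of 0s, so always 1
bysort loc_ID year: egen sum_tag = total(count_obsv)

* _N under by: is the number of observations in each group
bysort loc_ID year: gen n_obs = _N

list, sepby(loc_ID year)
```

Here sum_tag is 1 in every observation, while n_obs holds the group sizes 3, 1 and 2.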
Your question is probably abstracted from some other calculations that worked as you expected, but if all you wanted was a dataset with frequencies, then
contract loc_ID year
would do that for you. If you wanted a dataset with summaries of other variables too, you would need something more like
collapse (count) count=foo (mean) mean=foo (sd) sd=foo, by(loc_ID year)
I doubt that any Statalist posts state otherwise. (I wrote tag() in 1999, and I am not aware of this as a misunderstanding.) There is a related but so to speak distinct problem where tag() comes in useful, which is counting distinct values (often called unique values).
sysuse auto, clear
egen tag = tag(foreign rep78)
egen distinct = total(tag), by(foreign)
tabdisp foreign, c(distinct)
would be a way to get at the number of distinct values of rep78 within categories of foreign.
I want to match treated firms to control firms by industry and year considering firms that are the closest in terms of profitability (roa). I want a 1:1 match. I am using a distance measure (mahalanobis).
I have 530,000 firm-year observations in my sample, namely approximately 267,000 treated observations and 263,000 control observations. Here is my code:
gen neighbor1 = .
gen idobs = .
levelsof industry
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if industry == `i' & year == `j', mahalanobis(roa)
capture noisily replace neighbor1 = _n1 if industry == `i' & year == `j'
capture noisily replace idobs = _id if industry == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
Treat is my treatment variable. It takes the value of 1 for treated observations and 0 for non-treated observations.
The command psmatch2 creates the variable _n1 and _id among others. _n1 is the id number of the matched observation (closest neighbor) and _id is an id number (1 - 530,000) that is unique to each observation.
The code 'works', i.e. I get no error message. My variable neighbor1 has 290,724 non-missing observations.
However, these 290,724 observations vary between 1 and 933 which is odd. The variable neighbor1 should provide me the observation id number of the matched observation, which can vary between 1 and 530,000.
It seems that the code erases or ignores the result of the matching process in different subgroups. What am I doing wrong?
Edit:
I found a public dataset and adapted my previous code so that you can run my code with this dataset and see more clearly what the problem could be.
I am using the Vella and Verbeek (1998) panel data on 545 men who worked every year from 1980 to 1987, from this website: https://www.stata.com/texts/eacsap/
Let's say that I want to match treated observations, i.e. people, to control observations by marriage status (married) and year considering people that worked a similar number of hours (hours), i.e. the shortest distance.
I create a random treatment variable (treat) for the sake of this example.
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta
gen treat = round(runiform())
gen neighbor1 = .
gen idobs = .
levelsof married
local a = r(levels)
levelsof year
local b = r(levels)
foreach i in `a' {
foreach j in `b'{
capture noisily psmatch2 treat if married == `i' & year == `j', mahalanobis(hours)
capture noisily replace neighbor1 = _n1 if married == `i' & year == `j'
capture noisily replace idobs = _id if married == `i' & year == `j'
drop _treated _support _weight _id _n1 _nn
}
}
What this code should do is to look at each subgroup of observations: 444 observations in 1980 that are not married, 101 observations in 1980 that are married, ..., and 335 observations in 1987 that are married. In each of these subgroups, I would like to match a treated observation to a control observation considering the shortest distance in the number of hours worked.
There are two problems that I see after running the code.
First, the variable idobs should take a unique number between 1 and 4360 because there are 4360 observations in this dataset. It is just an ID number. It is not the case. A few observations can have an ID number 1, 2 and so on.
Second, neighbor1 varies between 1 and 204 meaning that the matched observations have only ID numbers varying from 1 to 204.
What is the problem with my code?
Here is a solution using the command iematch, which is part of the package ietoolkit (install it with ssc install ietoolkit). For disclosure, I wrote this command. psmatch2 is great if you want the ATT. But if all you want is to match observations across two groups using nearest neighbor, then iematch is cleaner.
In both commands you need to make each industry-year match in a subset, then combine that information. In both commands the matched group ID will restart from 1 in each subset.
Using your example data, the code below creates one matchID variable for each subset. You will then have to find a way to combine these into a single matchID without conflicts across the data set.
* Use the data set and keep only the variables required, for simplicity
use http://www.stata.com/data/jwooldridge/eacsap/wagepan.dta, clear
keep year married hours
* Set seed for replicability. NEVER use the 123456 seed in production; randomize a new seed
set seed 123456
* Generate mock treatment
gen treat = round(runiform())
* Generate variables to store results
gen matchResult = .
gen matchDiff = .
gen matchCount = .
* Create locals for loops
levelsof married
local married_statuses = r(levels)
levelsof year
local years = r(levels)
* Loop over each subgroup
foreach year of local years {
    foreach married_status of local married_statuses {
        * This command is similar to psmatch2, but it is a simplified version
        * for when you are not looking for the ATT.
        * This command is only about matching.
        iematch if married == `married_status' & year == `year', grp(treat) match(hours) seedok m1 maxmatch(1)
        * These variables hold meta information about the match (see the help file).
        * Copy that information from the current subset to single variables
        * for the full data set, then drop the loop-specific variables.
        replace matchResult = _matchResult if married == `married_status' & year == `year'
        replace matchDiff = _matchDiff if married == `married_status' & year == `year'
        replace matchCount = _matchCount if married == `married_status' & year == `year'
        drop _matchResult _matchDiff _matchCount
        * The match ID restarts at 1 in each subset, so save it in
        * one variable per subset and combine them afterwards.
        rename _matchID matchID_`married_status'_`year'
    }
}
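One possible way to combine the subset IDs, offered only as a sketch: suppose that inside the loop you copy _matchID into a single variable in the same way that matchResult is copied. That variable is called subsetID here, which is a hypothetical name and not something iematch creates. Then egen's group() function can number each distinct combination of subset and within-subset ID. The toy data below stand in for the result of such a loop.

```stata
* Toy stand-in for the loop output (hypothetical values).
* subsetID restarts at 1 in each married-year subset and is
* missing for observations that were not matched.
clear
input married year subsetID
0 1980 1
0 1980 1
0 1980 2
0 1980 2
0 1980 .
1 1980 1
1 1980 1
end

* group() numbers distinct combinations 1, 2, 3, ... and, by default,
* leaves observations with any missing value out of the grouping
egen matchID = group(married year subsetID)

list, sepby(married year)
```

Unmatched observations keep a missing matchID, and matched pairs in different subsets no longer share an ID.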
I have firm level data for three years (2015, 2016 and 2017).
I need to know which firms have a change in a dummy variable ModelJaarrekening from year 2016 to year 2017 - a dummy that determines if the firm is large (value 2) or small (value 1).
In other words, I need to select the firms that have a value for ModelJaarrekening of 2 in years 2015 and 2016 but have the value 1 in year 2017.
The following command does not work:
gen dummy=1 if (ModelJaarrekening ==2 & year<2017 & ModelJaarrekening ==1 & year==2017)
I think this is because it first executes the first command and deletes the other observations.
How can I solve this problem?
Your command is legal.
It doesn't delete any observations!
It just doesn't do what you want. The reason is that nothing in your syntax instructs Stata to look beyond each observation separately. So,
ModelJaarrekening == 2 & ModelJaarrekening == 1
is never going to be true of any observation: a variable can't be 1 and 2 in the same observation. The same kind of problem holds for
year < 2017 & year == 2017
The result is that your indicator will have values that are all missing.
What you want is more like this. I posit a firm identifier id.
local foo ModelJaarrekening
egen OK1 = total(`foo' == 2 & year < 2017), by(id)
egen OK2 = total(`foo' == 1 & year == 2017), by(id)
gen wanted = OK1 & OK2
Then, for each firm, OK1 will be 1 or more if and only if there was any value 2 before 2017, and OK2 will be 1 if and only if there was a value 1 in 2017.
wanted will be 1 if and only if both its arguments are non-zero (in this case, negative values are impossible and only positive values count); and 0 otherwise.
It is thus an indicator (you say dummy) with values 1 and 0.
Indicators that are 1 or missing are less useful in Stata than those that are 1 or 0.
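To see how this behaves, here is a sketch on made-up data (hypothetical firms and values). It also adds a stricter variant, wanted_strict, which is an illustrative extra rather than part of the code above. That variant requires the value 2 in both 2015 and 2016, on the assumption that each firm has exactly one observation per year for 2015 to 2017.

```stata
* Toy data with hypothetical values: four firms, three years each
clear
input id year ModelJaarrekening
1 2015 2
1 2016 2
1 2017 1
2 2015 2
2 2016 2
2 2017 2
3 2015 1
3 2016 1
3 2017 1
4 2015 1
4 2016 2
4 2017 1
end

* Count, within each firm, the years before 2017 with value 2
egen OK1 = total(ModelJaarrekening == 2 & year < 2017), by(id)
* Count, within each firm, the 2017 observations with value 1
egen OK2 = total(ModelJaarrekening == 1 & year == 2017), by(id)

* Any value 2 before 2017, and value 1 in 2017
gen wanted = OK1 & OK2
* Stricter: value 2 in both earlier years (assumes one observation per firm-year)
gen wanted_strict = OK1 == 2 & OK2 == 1

list, sepby(id)
```

Firms 1 and 4 are flagged by wanted, but only firm 1 is flagged by wanted_strict, because firm 4 was already small in 2015.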
I am working with a very large dataset (1 million obs.).
I have a string date that looks like this
key      seq  startdate (string)
AD07     1    August 2011
AD07     2    June 2011
AD07     3    February 2004
AD07     4    November 2004
AD07     5    2001
AD07     6    January 1998
AD5c23   1    January 2014
AD5c235  2    February 2014
AD5c235  3    2014
These are self-reported employment dates.
Some did not report the month at which they started.
But I would like to replace for AD07 the date “2001” to “January 2001”. Hence I cannot simply replace it because I would like to keep the original years but add the month in the string variable.
I started with:
levelsof start if start<="2016", local(levels)
which gives me all the years without the month from 1900 to 2016.
Now I would like to add "January" for the years without the month and keep original years.
How should I do that without using replace for every year? foreach loop?
You have a serious data quality problem if people are claiming to have started work in 1900 and every year since then! Even considering early employment starts and delayed retirement, that implies people older than the oldest established age.
Also, imputing "January" will impart bias as almost all job durations will be longer than they would have been. Real January starts will be correct, but no others: "June" or "July" or random months would make more obvious statistical sense.
That said, there is no loop needed here. You're asking for one line, say
replace startdate = "January " + startdate if length(trim(startdate)) == 4
or
replace startdate = "January " + startdate if real(startdate) < .
This assumes a follow-up step converting to numeric dates. The logic is that all year-only dates trim down to 4 characters, or (better) that feeding month names to real() will yield missings.
That said in turn, creating a new variable is better practice than over-writing one. Also, consider throwing away the month detail. Is it needed?
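As a sketch of the new-variable route, on made-up data: startdate2 and mdate are hypothetical names, and the conversion with monthly() is one possible follow-up, not something the question specified.

```stata
* Toy data with hypothetical values
clear
input str20 startdate
"August 2011"
"2001"
"January 1998"
end

* Leave startdate untouched; put the imputed version in a new variable.
* real() returns missing for anything that is not purely numeric.
gen startdate2 = cond(real(startdate) < ., "January " + startdate, startdate)

* Convert "Month Year" strings to Stata monthly dates
gen mdate = monthly(startdate2, "MY")
format mdate %tm

list
```

The year-only entry becomes January 2001 in the new variables, while the original string is still available for checking.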
EDIT
You may have another problem if there are people with two or more jobs in the same year without month specifications. You don't want to impute all months in question as "January". You can check for such observations by
gen byte incomplete = real(startdate) < .
gen year = substr(trim(startdate), -4, 4)
bysort key year incomplete : gen byte multiplebad = incomplete & _N > 1
I am using a dataset with about 87 countries for the years 1985 until 2004. One of my variables is Real GDP per capita. My intention is to create a new variable based on it, but with only 2 observations per country, showing the average for 2 time periods.
So for 1985 I would want the average GDP for the time period 1985 - 1994, and for 1995 the average GDP for 1995 - 2004.
There is no data example, no specification of variable names and no code attempt here. But schematically
gen period = year < 1995
egen mean = mean(GDPpc), by(country period)
could be a start, or even a finish, depending on exactly what you want. If you want to be able to compare periods directly, then something like
egen mean1 = mean(GDPpc / (year < 1995)), by(country)
egen mean2 = mean(GDPpc / (year > 1994)), by(country)
tabdisp country period, c(mean) format(%2.0f)
tabdisp country, c(mean1 mean2) format(%2.0f)
will put variables side by side. See also the tag() function of egen.
Warning: None of this code was tested.
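For what it is worth, the side-by-side version can be checked on made-up data. This is a sketch with hypothetical countries and values; it uses list with a tag variable rather than tabdisp, simply to show one row per country.

```stata
* Toy data with hypothetical values
clear
input str1 country year GDPpc
"A" 1993 20
"A" 1994 10
"A" 1995 40
"A" 1996 60
"B" 1994  5
"B" 1995  7
end

* Dividing by a false (0) condition gives missing, which mean() ignores
egen mean1 = mean(GDPpc / (year < 1995)), by(country)
egen mean2 = mean(GDPpc / (year > 1994)), by(country)

* Tag one observation per country for a compact listing
egen tag = tag(country)
list country mean1 mean2 if tag
```

In this toy data country A has period means 15 and 50, and country B has 5 and 7.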