I am using a dataset with about 87 countries for the years 1985 to 2004. One of my variables is Real GDP per capita. My intention is to create a new variable based on it, but with only 2 observations per country, showing the average for 2 time periods.
So for 1985 I would want the average GDP for the time period 1985 - 1994, and for 1995 the average GDP for 1995 - 2004.
There is no data example, no specification of variable names and no code attempt here. But schematically
gen period = year < 1995
egen mean = mean(GDPpc), by(country period)
could be a start, or even a finish, depending on exactly what you want. If you want to be able to compare periods directly, then something like
egen mean1 = mean(GDPpc / (year < 1995)), by(country)
egen mean2 = mean(GDPpc / (year > 1994)), by(country)
tabdisp country period, c(mean) format(%2.0f)
tabdisp country, c(mean1 mean2) format(%2.0f)
will put variables side by side. See also the tag() function of egen.
Warning: None of this code was tested.
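A minimal sketch of the same idea on made-up data, in case it helps. The variable names country, year and GDPpc and all the values are assumptions, because the question gives none. It uses collapse rather than egen, which is a swapped-in choice: collapse leaves just one observation per country and period, labelled by the first year of the period as the question describes.

```stata
clear
input str3 country year GDPpc
"AAA" 1985 10
"AAA" 1994 20
"AAA" 1995 30
"AAA" 2004 50
"BBB" 1985 1
"BBB" 1994 3
"BBB" 1995 5
"BBB" 2004 7
end

* label each period by its first year: 1985 for 1985-1994, 1995 for 1995-2004
* (a missing year would fall into the second period, so check for missings first)
gen period = cond(year < 1995, 1985, 1995)

* reduce to one observation per country and period, holding the period mean
collapse (mean) meanGDPpc = GDPpc, by(country period)
list, sepby(country)
```

Note that collapse replaces the dataset in memory, so wrap it in preserve and restore if the yearly data are still needed.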
I have a dataset in Stata and want to count by group (loc_ID) and year. I used the following two lines of code:
egen count_obsv = tag(loc_ID year)
This adds a counter to my dataset (count_obsv) which is 1 for the first observation with each new combination of loc_ID and year, and 0 for every other observation with the same combination.
Then I use:
collapse (sum) count_obsv, by(loc_ID year)
According to various Stata forum posts, this should result in, e.g.:
loc_ID year count_obsv
1 2000 342
1 2001 23
2 2008 23
...
But my output is:
loc_ID year count_obsv
1 2000 1
1 2001 1
2 2008 1
...
What am I summarizing wrong?
When you call up the tag() function of the egen command, you assign the value 1 to just one of any number of observations with the same distinct values for the specified variables, and 0 to all the others. Then when you ask for the sum of those values in the same groups of observations, you get the group sums of one 1 and any number of 0s, and each sum is thus necessarily 1.
Your question is probably abstracted from some other calculations that worked as you expected, but if all you wanted was a dataset with frequencies, then
contract loc_ID year
would do that for you. If you wanted a dataset with summaries of other variables too, you would need something more like
collapse (count) count=foo (mean) mean=foo (sd) sd=foo, by(loc_ID year)
I doubt that any Statalist posts state otherwise. (I wrote tag() in 1999, and I am not aware of this as a misunderstanding.) There is a related but so to speak distinct problem where tag() comes in useful, which is counting distinct values (often called unique values).
sysuse auto, clear
egen tag = tag(foreign rep78)
egen distinct = total(tag), by(foreign)
tabdisp foreign, c(distinct)
would be a way to get at the number of distinct values of rep78 within categories of foreign.
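To see the point about sums of tags concretely, here is a small sketch on the same auto data. The variable names t, sum_t and n_obs are made up for illustration. Summing the tag within the very groups it was created for can only give 1, while the group frequency is just the number of observations in the group.

```stata
sysuse auto, clear

* tag one observation per combination of foreign and rep78
egen t = tag(foreign rep78)

* summing the tag within those same groups gives one 1 plus any number of 0s
* (and 0 where rep78 is missing, as tag() skips missing values by default)
egen sum_t = total(t), by(foreign rep78)

* the frequency is the number of observations in each group
bysort foreign rep78 : gen n_obs = _N

tabdisp foreign rep78, c(sum_t n_obs)
```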
I have the following columns in my data
Firm - revenue - industry - year
I want to calculate the percentage change in total revenue for each industry between 2008 and 2015.
I tried:
by industry: egen tot_2008 = sum(revenue) if year == 2008
by industry: egen tot_2015 = sum(revenue) if year == 2015
gen change = (tot_2015-tot_2008)/tot_2008
But this doesn't work as the ifs restrict which years the egen creates values for as well as which years are included in each sum.
As you realise, the problem with your code is that tot_2008 and tot_2015 are non-missing only in observations for 2008 and 2015 respectively, so no observation ever has non-missing values on both variables and change is always missing. Here is one way to spread values to all years for each industry:
by industry: egen tot_2008 = total(revenue / (year == 2008))
by industry: egen tot_2015 = total(revenue / (year == 2015))
gen change = (tot_2015-tot_2008)/tot_2008
That hinges on expressions such as year == 2008 being evaluated as 1 if true and 0 if false. If you divide by 0, the result is a missing value, which total() ignores, which is exactly what you want. Taking totals over all observations in an industry ensures that the same value is recorded in every observation for that industry.
Here is another way that some find more explicit:
by industry: egen tot_2008 = total(cond(year == 2008, revenue, .))
by industry: egen tot_2015 = total(cond(year == 2015, revenue, .))
gen change = (tot_2015-tot_2008)/tot_2008
which hinges on the same principle, that missings will be ignored.
Note the use of the egen function total() here. The egen function sum() still works, and is the same function, but that name is undocumented as of Stata 9, in an attempt to avoid confusion with the Stata function sum().
To avoid double (indeed multiple) counting, use
egen tag = tag(industry)
to tag just one observation for each industry, to be used in graphs and tables for which you want that.
For discussion, see here, sections 9 and 10.
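A small worked sketch of the division approach, on invented data. The firm and industry names and the revenue figures are assumptions for illustration only. The change is multiplied by 100 here because the question asks for a percentage, and bysort is used so that the data need not already be sorted by industry.

```stata
clear
input str1 firm str5 industry year revenue
"a" "steel" 2008 100
"b" "steel" 2008 300
"a" "steel" 2015 150
"b" "steel" 2015 450
"c" "wine"  2008 50
"c" "wine"  2015 40
end

* revenue / 1 where the year matches, revenue / 0 (missing) everywhere else
bysort industry : egen tot_2008 = total(revenue / (year == 2008))
bysort industry : egen tot_2015 = total(revenue / (year == 2015))
gen change = 100 * (tot_2015 - tot_2008) / tot_2008

* show each industry once
egen tag = tag(industry)
list industry tot_2008 tot_2015 change if tag
```

Here steel goes from 400 to 600 and wine from 50 to 40, so the listed changes are 50 and -20 percent.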
I am exploring an effect that I think will vary by GDP levels, from a data set that has, vertically, country and year (1960 to 2015), so each country label is on 55 rows. I ran
sort year
by year: egen yrank = xtile(rgdp), nquantiles(4)
which tags every year row with what quartile of GDP they were in that year. I want to run this:
xtreg fiveyearg taxratio if yrank == 1 & year==1960
which would regress my variable (tax ratio) against some averaged gdp data from countries that were in the bottom quartile of GDPs in 1960 alone. So even if later on they grew enough to change ranks, the later data would still be in the regression pool. Sadly, I cannot get this code, or any variation, to run.
My current approach is to try to generate some new variable that would give every row with country label X a value of 1 if they were in the bottom quartile in 1960, but I can't get that to work either. I have run out of ideas, so I thought I would ask!
Based on your latest comment, which describes the (un)expected behavior:
clear
set more off
*----- example data -----
input ///
country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
list, sepby(country)
*----- what you want -----
// tag countries whose -rank- in the earliest year is 1
// (I assume the first observation for -year- is always 1960)
bysort country (year) : gen toreg = rank[1] == 1
list, sepby(country)
// run regression conditional on -toreg-
xtreg ... if toreg
Check help subscripting if in doubt.
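If you would rather not rely on 1960 being the first observation within each country, one alternative sketch is to let egen spread the condition over all rows of a country. This is a swapped-in variant, not the code above. It reuses the same example data and the same made-up variable name toreg.

```stata
clear
input country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end

* the expression is 1 only in a 1960 row with rank 1; max() copies that 1
* to every row of the same country, and leaves 0 otherwise
egen toreg = max(rank == 1 & year == 1960), by(country)
list, sepby(country)
```

Only country 2 is flagged, because country 1 reaches rank 1 in 1961 but not in 1960.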
Cross-posting:
german: http://www.stata-forum.de/post1716.html#p1716
english: http://www.talkstats.com/showthread.php/47299-sales-growth-rate-with-multiple-groups-conditions
I want to calculate the annual sales growth rate of different firm-groups in Stata. The firms are grouped by variables country and industry.
I summed sales for each group (called it sales_total: sales of all firms in a group with equal country, industry and year):
bysort country year industry: egen sales_total = sum(sales)
I have a much bigger sample, but I tried to calculate the growth-rate with a smaller sample.
I tried multiple combinations such as:
egen group = group(year country industry)
xtset group year, yearly
bys group: g salesgrowth = log(D.sales_total)
or
bysort group: gen salesgrowth=(sales[_n]-sales[_n-1])/sales[_n-1]
also with tsset.
and tried everything from this answer:
Generate percent change between annual observations in Stata?
but I always get error messages such as
repeated time values within panel
or
repeated time values within sample
due to the repetition of the number in a variable such as group.
Can you help me to find the yearly growth rate from each group (firms from same country & industry)?
update
here again an example of my observations (which normally have 10,000 firms over 10 years). There are also missing values (for sales, industry, year, country)
firms  country  year  industry  sales
a      usa      1     1         300
a      usa      2     1         4000
b      ger      1     1         200
b      ger      2     1         400
c      usa      1     1         100
c      usa      2     1         300
d      usa      1     1         400
d      usa      2     1         200
e      usa      1     1         7000
e      usa      2     1         900
f      ger      1     2         100
f      ger      2     2         700
h      ger      1     2         700
h      ger      2     2         600
... etc.
I tried the programming you mentioned, but I ended up with several variables side by side in the same row, whereas I would probably need the values in the same column. Is it possible to keep the data as they are, without reshaping, for example by grouping the observations:
egen group=group(industry year country)
and then try
xtset group year
bysort group: sales_growth = log(D.sales)
or
bysort group: gen sales_growth = (sales[_n]-sales[_n-1])/sales[_n-1]
Thank you!
The strategy here is trying to work at the wrong level of resolution. You should
collapse (sum) sales, by(country year industry)
and then work with that reduced dataset. Depending on what you want precisely, you will probably need to restructure that data with reshape so that different industries give different variables. Then
xtset country year
and growth rates will then be easier to calculate.
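A rough sketch of that workflow, using a few rows from the example in the question. It is one possible variant, not the only one: instead of reshaping, it treats each country and industry pair as a panel, and the names panel and growth are made up here. Note that year must not be part of the panel identifier. Including it in group() puts each year in its own panel, so no panel has a previous year to difference against.

```stata
clear
input str1 firm str3 country year industry sales
"a" "usa" 1 1 300
"a" "usa" 2 1 4000
"b" "ger" 1 1 200
"b" "ger" 2 1 400
"c" "usa" 1 1 100
"c" "usa" 2 1 300
"f" "ger" 1 2 100
"f" "ger" 2 2 700
end

* one row per country, industry and year, holding total sales
collapse (sum) sales, by(country industry year)

* each country and industry pair is one panel, observed once per year
egen panel = group(country industry)
xtset panel year

* growth relative to the previous year; missing in the first year of each panel
gen growth = D.sales / L.sales
list, sepby(panel)
```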
I'm pretty new to Stata.
I have a set of observations of the form "Country GDP Year". I want to create a new variable GDP1960, which gives the GDP in 1960 of each country for each year:
USA $100m 1960 USA $100m 1960 $100m
USA $200m 1965 --> USA $200m 1965 $100m
Canada $60m 1960 Canada $60m 1960 $60m
What's the right syntax to make this happen? (I assume egen is involved in some mysterious way)
You've found a solution with cond(), but here are a couple of suggestions that might make modeling your data easier and help you avoid the sorting problems that can arise from creating your rank variable (the egen solution you asked about is below):
Paste the code below into your do-file editor and run it:
*---------------------------------BEGIN EXAMPLE
clear
inp str20 country str10 gdp year
"USA" "$100m" 1960
"USA" "$200m" 1965
"Canada" "$60m" 1960
"Canada" "$120m" 1965
"USA" "$250m" 1970
"Mexico" "$90m" 1970
"Canada" "$800m" 1970
"Mexico" "$160m" 1960
"Mexico" "$220m" 1965
"Mexico" "$350m" 1975
end
//1. destring gdp so that we can work with it
destring gdp, ignore("$", "m") replace
//2. Create GDP for 1960 var:
bys country: g x = gdp if year==1960
bys country: egen gdp60 = max(x)
drop x
**you could also create balanced panels to see gaps in your data**
preserve
ssc install panels
panels country year
fillin country year
li //take a look at the results win. to see how filled panel data would look
restore
//3. create a gdp variable for each year (reshape the dataset)
drop gdp60
reshape wide gdp, i(country) j(year)
**much easier to use this format for modeling
su gdp1970
**here's a fake "outcome" or response variable to work with**
g outcome = 500+int((1000-500+1)*runiform())
anova outcome gdp1960-gdp1970 //or whatever makes sense for your situation
*---------------------------------END EXAMPLE
A one-line solution is
egen gdp60 = mean(gdp / (year == 1960)), by(country)
The trick here is the division by the expression year == 1960. This is true for 1960, in which case we divide by 1, which leaves the gdp for that year unchanged. It is false for all other years, in which case we divide by 0. That sounds crazy, but the consequence whenever we divide by zero is just missing values, which will be ignored by egen's mean() function.
You could use other egen functions, as in this case there should be at most one value for 1960 for each country, so e.g. max() and min() should work too. With mean(), max() or min(), a country that has no value for 1960, or a missing value, ends up with missing, which is precisely as it should be. total() also works, but it returns 0 rather than missing when all values are missing, unless you specify its missing option.
Discussion at http://www.stata-journal.com/article.html?article=dm0055
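A short sketch of how the choice of egen function plays out when a country has no 1960 observation. The data are invented, "Peru" is a hypothetical country added only to show that case, and gdp is assumed to be numeric already.

```stata
clear
input str6 country gdp year
"USA"    100 1960
"USA"    200 1965
"Canada"  60 1960
"Canada" 120 1965
"Peru"    80 1965
end

* mean() ignores the missings produced by dividing by 0
egen gdp60 = mean(gdp / (year == 1960)), by(country)

* total() gives 0, not missing, when every value in the group is missing ...
egen gdp60_t = total(gdp / (year == 1960)), by(country)

* ... unless its missing option is specified
egen gdp60_tm = total(gdp / (year == 1960)), by(country) missing

list, sepby(country)
```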
Well, I found a solution in the end. It relies on the fact that generate and replace work on the data in its sorted order, and that you can refer to the current observation with _n.
gen rank = 100
replace rank = 50 if year == 1960
gen gdp60 = .
sort country rank
replace gdp60 = cond(country == country[_n-1], gdp60[_n-1], gdp[_n])
drop rank
sort country year
EDIT: A more direct solution with the same flavour:
gen wanted = year == 1960
bysort country (wanted) : gen gdp60 = gdp[_N]
drop wanted
sort country year
Here wanted will be 1 for 1960 and 0 otherwise.
I can't think of anything shorter than these two lines:
gen temp = gdp if year == 1960
by country : egen gdp60 = max(temp)
If you want a variable for each year (e.g., gdp60, gdp61, gdp62, ...) then you should probably use reshape.