Stata estpost esttab: Generate table with mean of variable split by year and group - stata

I want to create a table in Stata with the estout package to show the mean of a variable split by 2 groups (year and binary indicator) in an efficient way.
I found a solution, which is to split the main variable cash_at into 2 groups by hand through the generation of new variables, e.g. cash_at1 and cash_at2. Then, I can generate summary statistics with tabstat and get output with esttab.
estpost tabstat cash_at1 cash_at2, stat(mean) by(year)
esttab, cells("cash_at1 cash_at2")
Link to current result: http://imgur.com/2QytUz0
However, I'd prefer a horizontal table (e.g. year on the x axis) and a way to do it without splitting the groups by hand - is there a way to do so?

My preference in these cases is for year to be in rows and the statistic (e.g. mean) in the columns, but if you want to do it the other way around, there should be no problem.
For a table like the one you want it suffices to have the binary variable you already mention (which I name flag) and appropriate labeling. You can use the built-in table command:
clear all
set more off
* Create example data
set seed 8642
set obs 40
egen year = seq(), from(1985) to (2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
list, sepby(year)
* Define labels
label define lflag 0 "cash0" 1 "cash1"
label values flag lflag
* Table
table flag year, contents(mean cash)
In general, for tables, apart from the estout module you may want to consider also the user-written command tabout. Run ssc describe tabout for more information.
On the other hand, it's not clear what you mean by "splitting groups by hand". You show no code for this operation, but as long as it's general enough for your purposes (and practical) I think you should allow for it. The code might not be as elegant as you wish but if it's doing what it's supposed to, I think it's alright. For example:
clear all
set more off
set seed 8642
set obs 40
* Create example data
egen year = seq(), from(1985) to (2005) block(4)
gen cash = floor(runiform()*500)
gen flag = round(runiform())
* Data management
gen cash0 = cash if flag == 0
gen cash1 = cash if flag == 1
* Table
estpost tabstat cash*, stat(mean) by(year)
esttab, cells("cash0 cash1")
can be used for a table like the one you give in your original post. It's true you have two extra lines and variables, but they may be harmless. I agree with the idea that in general, efficiency is something you worry about once your program is behaving appropriately; unless of course, the lack of it prevents you from reaching that state.

Related

Clarification on tabstat use after bysort in Stata

I have a rather simple question regarding the output of tabstat command in Stata.
To be more specific, I have a large panel dataset containing several hundred thousands of observations, over a 9 year period.
The context:
bysort year industry: egen total_expenses=total(expenses)
This line should create total expenses by year and industry (or sum of all expenses by all id's in one particular year for one particular industry).
Then I'm using:
tabstat total_expenses, by(country)
As far as I understand, tabstat should show in a table format the means of expenses. Please do note that ids are different from countries.
In this case tabstat calculates the means for all 9 years for all industries for a particular country, or it just the mean of one year and one industry by each country from my panel data?
What would happen if this command is used in the following context:
bysort year industry: egen mean_expenses=mean(expenses)
tabstat mean_expenses, by(country)
Does tabstat creates means of means? This is a little bit confusing.
I don't know what is confusing you about what tabstat does, but you need to be clear about what calculating means implies. Your dataset is far too big to post here, but for your sake as well as ours creating a tiny sandbox dataset would help you see what is going on. You should experiment with examples where the correct answer (what you want) is obvious or at least easy to calculate.
As a detail, your explanation that ids are different from countries is itself confusing. My guess is that your data are on firms and the identifier concerned identifies the firm. Then you have aggregations by industry and by country and separately by year.
bysort year industry: egen total_expenses = total(expenses)
This does calculate totals and assigns them to every observation. Thus if there are 123 observations for industry A and 2013, there will be 123 identical values of the total in the new variable.
tabstat total_expenses, by(country)
The important detail is that tabstat by default calculates and shows a mean. It just works on all the observations available, unless you specify otherwise. Stata has no memory or understanding of how total_expenses was just calculated. The mean will take no account of different numbers in each (industry, year) combination. There is no selection of individual values for (industry, year) combinations.
Your final question really has the same flavour. What your command asks for is a brute force calculation using all available data. In effect your calculations are weighted by the numbers of observations in whatever combinations of industry, country and year are being aggregated.
I suspect that you need to learn about two commands (1) collapse and (2) egen, specifically its tag() function. If you are using Stata 16, frames may be useful to you. That should apply to any future reader of this using a later version.

how to use tsrevar to create lag variable using stata

I got a panel data (time: date name: ticker). I want to create upto 10 lagged variables for x. So I use the following code.
tsrevar L(1/10).x
rename (`r(varlist)') x_#, addnumber
Because my data is in hourly frequency, and only observation during the daytime. Using the code above, the first observation for each trading day is missing.
My alternative solution is:
by ticker: gen lag1 = return[_n-1]
Then, I have to copy and paste this code 10 times, which looks very messy. Could anyone teach me how to solve this problem please
This is my "10-minute guess" given I don't actually work with anything finer than a daily periodicity.
Stata has no hourly display format that I know of. One way to achieve what you want is using the delta() option when you tsset the data.
clear
set more off
*----- example data -----
// an "hour-by-hour" time series which really has millisecond format
set obs 25
gen double t = _n*1000*60
format %tcDDmonCCYY_HH:MM:SS:.sss t
set seed 3129745
gen ret = runiform()
list, sep(0)
*----- what you want? -----
// 1000*60 milliseconds conform 1 hour
tsset t, delta((1000*60))
// one way
tsrevar L(1/2).ret
rename (`r(varlist)') ret_#, addnumber
// two other ways
gen ret1 = L.ret
gen ret11 = ret[_n-1]
// check
assert ret_1 == ret1
assert ret_1 == ret11
list, sep(0)
tsset also has a generic option, and delta() itself has several specifications. Take a look and test, to see if you find a better fit.
(You mention an "hourly frequency" but you don't give example data with the specifics. There really is no way to know for sure what you're dealing with.)

Stata: Newvar for multiple equal dates

I have trouble to generate a new variable which will be created for every month while having multiple entries for every month.
date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
Code would look like this gen newvar = sum(x*b) but I want to create the variable for each month.
What I tried so far was
to create an index for the date1 variable with
sort date1
gen n=_n
and after that create a binary marker for when the date changes
with
gen byte new=date1!=date[[_n-1]
After that I received a value for every other month but I m not sure if this seems to be correct or not and thats why I would like someone have a look at this who could maybe confirm if that should be correct. The thing is as there are a lot of values its hard to control it manually if the numbers are correct. Hope its clear what I want to do.
Two comments on your code
There's a typo: date[[_n-1] should be date1[_n-1]
In your posted code there's no need for gen n = _n.
Maybe something along the lines of:
clear
set more off
*-----example data -----
input ///
str10 date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
end
gen date2 = monthly(date1, "YM")
format %tm date2
*----- what you want -----
gen month = month(dofm(date2))
bysort month: gen newvar = sum(x*b)
list, sepby(month)
will help.
But, notice that the series of the cumulative sum can be different for each run due to the way in which Stata sorts and because month does not uniquely identify observations. That is, the last observation will always be the same, but the way in which you arrive at the sum, observation-by-observation, won't be. If you want the total, then use egen, total() instead of sum().
If you want to group by month/year, then you want: bysort date2: ...
The key here is the by: prefix. See, for example, Speaking Stata: How to move step by: step by Nick Cox, and of course, help by.
A major error is touched on in this thread which deserves its own answer.
As used with generate the function sum() returns cumulative or running sums.
As used with egen the function name sum() is an out-of-date but still legal and functioning name for the egen function total().
The word "function" is over-loaded here even within Stata. egen functions are those documented under egen and cannot be used in any other command or context. In contrast, Stata functions can be used in many places, although the most common uses are within calls to generate or display (and examples can be found even of uses within egen calls).
This use of the same name for different things is undoubtedly the source of confusion. In Stata 9, the egen function name sum() went undocumented in favour of total(), but difficulties are still possible through people guessing wrong or not studying the documentation really carefully.

Quantile Regression with Quantiles based on independent variable

I am attempting to run a quantile regression on monthly observations (of mutual fund characteristics). What I would like to do is distribute my observations in quintiles for each month (my dataset comprises 99 months). I want to base the quintiles on a variable (lagged fund size i.e. Total Net Assets) that will be later employed as an independent variable to explain fund performance.
What I already tried to do is use the qreg command, but that uses quantiles based on the dependent variable not the independent variable that is needed.
Moreover I tried to use the xtile command to create the quintiles; however, the by: command is not supported.
. by Date: xtile QLagTNA= LagTNA, nq(5)
xtile may not be combined with by
r(190);
Is there a (combination of) command(s) which saves me from creating quintiles manually on a month-by-month basis?
Statistical comments first before getting to your question, which has two Stata answers at least.
Quantile regression is defined by prediction of quantiles of the response (what you call the dependent variable). You may or may not want to do that, but using quantile-based groups for predictors does not itself make a regression a quantile regression.
Quantiles (here quintiles) are values that divide a variable into bands of defined frequency. Here you want the 0, 20, 40, 60, 80, 100% points. The bands, intervals or groups themselves are not best called quantiles, although many statistically-minded people would know what you mean.
What you propose seems common in economics and business, but it is still degrading the information in the data.
All that said, you could always write a loop using forval, something like this
egen group = group(Date)
su group, meanonly
gen QLagTNA = .
quietly forval d = 1/`r(max)' {
xtile work = LagTNA if group == `d', nq(5)
replace QLagTNA = work if group == `d'
drop work
}
For more, see this link
But you will probably prefer to download a user-written egen function [correct term here] to do this
ssc inst egenmore
h egenmore
The function you want is xtile().

Trimming data in Stata

I have a data set and want to drop 1% of data at one end. For example, I have 3000 observations and I want to drop the 30 highest ones. Is there a command for this kind of trimming? Btw, I am new to Stata.
You can use _pctile in Stata for that.
sysuse auto, clear
_pctile weight, nq(100)
return list #this is optional
drop if weight>r(r99) #top 1 percent
If you know what the cutoff is for your drop you can use:
drop if var1>300
which drops all rows with var1 over 300.
You can use summarize var1, detail to get the key percentiles: it will give you 1% and 99% percentiles along with other standard percentiles.
To select 30 top observations in stata, use the following command:
keep if (_n<=30 )
To drop top 30 observations in stata, use the following command
keep if (_n>30)