Clarification on tabstat use after bysort in Stata

I have a rather simple question regarding the output of the tabstat command in Stata.
To be more specific, I have a large panel dataset containing several hundred thousand observations over a 9-year period.
The context:
bysort year industry: egen total_expenses=total(expenses)
This line should create total expenses by year and industry (that is, the sum of expenses across all ids in a particular year for a particular industry).
Then I'm using:
tabstat total_expenses, by(country)
As far as I understand, tabstat should show the means of expenses in table format. Please do note that ids are different from countries.
Does tabstat in this case calculate the mean over all 9 years and all industries for a particular country, or just the mean for one year and one industry for each country in my panel data?
What would happen if this command is used in the following context:
bysort year industry: egen mean_expenses=mean(expenses)
tabstat mean_expenses, by(country)
Does tabstat create means of means? This is a little confusing.

I don't know what is confusing you about what tabstat does, but you need to be clear about what calculating means implies. Your dataset is far too big to post here, but for your sake as well as ours creating a tiny sandbox dataset would help you see what is going on. You should experiment with examples where the correct answer (what you want) is obvious or at least easy to calculate.
As a detail, your explanation that ids are different from countries is itself confusing. My guess is that your data are on firms and the identifier concerned identifies the firm. Then you have aggregations by industry and by country and separately by year.
bysort year industry: egen total_expenses = total(expenses)
This does calculate totals and assigns them to every observation. Thus if there are 123 observations for industry A and 2013, there will be 123 identical values of the total in the new variable.
tabstat total_expenses, by(country)
The important detail is that tabstat by default calculates and shows a mean. It just works on all the observations available, unless you specify otherwise. Stata has no memory or understanding of how total_expenses was just calculated. The mean will take no account of different numbers in each (industry, year) combination. There is no selection of individual values for (industry, year) combinations.
Your final question really has the same flavour. What your command asks for is a brute force calculation using all available data. In effect your calculations are weighted by the numbers of observations in whatever combinations of industry, country and year are being aggregated.
I suspect that you need to learn about two commands: (1) collapse and (2) egen, specifically its tag() function. If you are using Stata 16, frames may be useful to you; the same applies to any future reader using a later version.
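To make the advice concrete, here is a minimal sandbox of the kind suggested above. The data and variable names are invented, and it assumes the totals you really want are at the (year, industry, country) level; the tag() route counts each such cell exactly once, while the collapse route aggregates to a new dataset first. Treat it as a sketch to adapt, not as your exact calculation.
* tiny sandbox with invented values
clear
input id year str1 industry str2 country expenses
1 2013 "A" "DE" 10
2 2013 "A" "DE" 30
3 2013 "B" "FR" 20
4 2014 "A" "DE" 40
end
bysort year industry country: egen total_expenses = total(expenses)
* count each (year, industry, country) cell exactly once
egen tag = tag(year industry country)
tabstat total_expenses if tag, by(country)
* equivalent route: aggregate first, then summarize the aggregated data
preserve
collapse (sum) expenses, by(year industry country)
tabstat expenses, by(country)
restore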

Related

Ranking sum of data by two categories

I have data that I would like to rank by two separate categories, State and ServiceType. Essentially, there are multiple years of data for each ServiceType across various states, and I was hoping to get the sum of all years for each ServiceType by State, meaning each State is treated independently and the sums of the various categories are ranked only within that state, not nationally.
I've tried
bys State ServiceCategory (quant_variable): ///
    egen rank_quant_variable = rank(sum(quant_variable)), field
as well as a version of the above where I used a pre-calculated sum variable. Neither really works.
This lacks a reproducible example, as you do not give your data or phrase your problem in terms of a dataset we could access, for example one shipped with Stata or downloadable from within it. There is no need to give the full dataset, just a minimal example with the same structure.
The call to sum() here would be to Stata's sum() function, which yields the cumulative or running sum, which evidently isn't what you want. So that case is easy to dismiss.
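To see the contrast concretely, here is a two-line sketch using your variable names; the sort order matters for the running sum but not for the group total.
* Stata's sum() function gives a running (cumulative) sum within each by-group
bysort State ServiceCategory: gen running_sum = sum(quant_variable)
* egen's total() gives one constant group total, repeated on every observation
bysort State ServiceCategory: egen group_total = total(quant_variable)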
The remaining question is exactly what you did, in code you don't show, with a pre-calculated sum.
At a guess you worked out
bys State ServiceCategory: egen sum = total(quant_variable)
and then pushed that sum through rank(). But that would use each value of sum as many times as it occurred.
Perhaps you want something more like this:
egen tag = tag(State ServiceCategory)
bysort State: egen rank_quant_variable = rank(sum) if tag, field
bysort State ServiceCategory (rank_quant_variable): replace rank_quant_variable = rank_quant_variable[1]
But it's really hard (for me) to visualize this without details on what you did or an example to work on.
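For what it is worth, here is how that sketch plays out on a small invented dataset; the names and numbers are made up purely for illustration.
* invented example data
clear
input str2 State str1 ServiceCategory year quant_variable
"TX" "A" 2019 10
"TX" "A" 2020 15
"TX" "B" 2019  5
"TX" "B" 2020 40
"CA" "A" 2019 20
"CA" "B" 2019  8
end
bysort State ServiceCategory: egen sum = total(quant_variable)
* one tagged observation per (State, ServiceCategory)
egen tag = tag(State ServiceCategory)
* rank the group sums within each State, using tagged observations only
bysort State: egen rank_quant_variable = rank(sum) if tag, field
* spread each rank to all observations in its (State, ServiceCategory) group
bysort State ServiceCategory (rank_quant_variable): ///
    replace rank_quant_variable = rank_quant_variable[1]
list, sepby(State)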

Deleting all observations given one observation for each variable's type

I have a table with firm identifiers, fiscal year, quarter and market_capital. I want to delete all firm observations that had a specific market capital at a specific quarter of a specific year. That is, I want to delete all observations for a firm if its market capital for 2006, quarter 2 was below 50.
My table is in the form:
[screenshot of the table, not reproduced here]
If I understand correctly, you have a Stata dataset containing four variables which I will call firm, year, quarter, and mc (since "Capital Market" shown in the picture of your data is not a valid Stata variable name).
The following code might start you in the right direction, but it is untested since my copy of Stata cannot read the picture of your data, and "I want to retype data from a picture of data" said nobody, ever.
Added in edit: the untested code had an error, so I removed it.
Having a quarterly date variable -- rather than separate year and quarter variables -- will be needed sooner or later.
That could be
gen qdate = yq(year, quarter)
format qdate %tq
Then your code for dropping is
egen todrop = total(mc < 50 & qdate == yq(2006, 2)), by(firm)
drop if todrop
as the variable todrop will be positive if and only if you want to drop a firm, and 0 otherwise.
See this paper for a review of related technique.
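By way of illustration only, here are the same lines run on a toy example with invented numbers (firm 1 should be dropped, as its mc in 2006 q2 is below 50):
* toy data: two firms, quarterly observations
clear
input firm year quarter mc
1 2006 1 60
1 2006 2 45
1 2006 3 70
2 2006 2 55
2 2006 3 52
end
gen qdate = yq(year, quarter)
format qdate %tq
egen todrop = total(mc < 50 & qdate == yq(2006, 2)), by(firm)
drop if todrop
list, sepby(firm)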

How to check for outliers using scalars

I'm checking the distribution of test scores by year, subject, and grade. I want to make sure that there aren't any outliers, which would be anything more than 4 standard deviations from the mean. This is my code:
bys year subject tested_grade: summarize test_score
But when I try to get the scalars I can only get the scalar corresponding to the last year, subject, tested_grade. I've tried creating a loop but it leads to the same problem.
I have found Nick Cox's extremes command but it doesn't tell me how many standard deviations the extreme values are from the mean.
If anyone has some ideas of how to check for outliers as determined by a measure of standard deviations away from the mean it would be really helpful.
Edit
This code gets me (mostly) what I want.
bys year subject tested_grade: summarize test_score
gen std_test_score = (test_score > 4*r(sd)) if test_score < .
list test_score std_test_score if std_test_score==1
The only problem is that the last year, subject, and tested_grade is where the r(sd) comes from. I'd want to create a variable - std_test_score1-20 - for each year, subject, and tested_grade.
Means and SDs may be generated for several groups at once by
bysort year subject tested_grade : egen mean_test_score = mean(test_score)
by year subject tested_grade: egen sd_test_score = sd(test_score)
gen std_test_score = (test_score - mean_test_score) / sd_test_score
Indeed, egen has a function std() to do this in one step, but it's often a good idea to re-create basics from even more basic principles.
Your code omits subtraction of the mean.
However, as underlined in comments, (value - mean) / SD is a poor criterion for outliers, as outliers themselves influence the mean and SD. That is why, for example, box plots are based on the median and quartiles and (commonly) flag points more than some multiple of the interquartile range away from the nearer quartile.
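If you do want a flag along those lines, here is a sketch of that quartile-based criterion using the same variable names; the 1.5 multiplier is the usual box-plot convention and can be changed.
* quartile-based outlier flag: more resistant to the outliers themselves
bysort year subject tested_grade: egen lq = pctile(test_score), p(25)
by year subject tested_grade: egen uq = pctile(test_score), p(75)
gen iqr = uq - lq
gen byte outlier = (test_score < lq - 1.5*iqr | test_score > uq + 1.5*iqr) if !missing(test_score)
list year subject tested_grade test_score if outlier == 1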

Stata: using egen group() to create unique identifiers

I have a dataset where each row is a firm, year pair with a firmid that is a string.
If I do
duplicates drop firmid year, force
it doesn't delete anything since there are no duplicates (I originally created the dataset after running duplicates drop firmid year, force).
So far so good. I want to create a panel which requires a firmid that is numeric. So I run
egen newid = group(firmid)
xtset newid year
But the 'repeated time values in panel' error pops up. Moreover,
duplicates list newid year
lists a whole bunch of duplicates.
It seems as though egen, group() isn't generating unique groups. My question is: why, and how do I create unique groups in a robust way?
This is an old thread, but I have recently experienced the same symptoms, so I wanted to share my solution. Of course, so long as the questioner does not give further details, we will not know whether the causes are the same for me and him.
The problem turned out to be an issue of precision. As explained here in section 4.4, calculations done on integers stored as floats are precise only in the range up to 16,777,216. So, if you have more than 16,777,216 firms in your sample, rounding error will result in the same ID being assigned to multiple firms. This is straightforwardly dealt with by increasing the precision of the ID variable to long:
egen long newid = group(firmid)
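As a quick check that the fix has worked, isid (run after the line above, with the same variable names) exits with an error if the pair does not uniquely identify observations, so a clean run confirms uniqueness before declaring the panel.
* confirm the long identifier uniquely identifies firm-year pairs
isid newid year
xtset newid year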

How do I get a group minimum over level combinations of factors?

I would like to find minimum values within groups. In Stata, I think it is simply "by group, sort : egen minvalue=min(value)"...
I tried to mess around with ave and rowsum, but to no avail.
ave(value, group, FUN=min) did not work.
Sorry this answer is a little late but, in case you are still looking for the answer or for future searchers here goes....
You are onto the right track with the -by- command. Here is what I'd do to find the lowest price of cars in the auto.dta dataset by domestic/foreign grouping.
sysuse auto, clear
bysort foreign : egen minprice = min(price)
What this does is create a new variable 'minprice' that holds the minimum price for domestic cars if a given car (observation) is domestic, and likewise for foreign cars. So this new variable will have just two values in this example, which you can check by doing:
tabulate minprice
Depending on why you wanted to find the minimum values by group this may not be what you had in mind but hopefully someone finds it helpful.
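As an aside, if the aim is only to look at the group minima rather than to store them in a variable, tabstat offers a compact display:
sysuse auto, clear
* one row per group, showing the minimum price for domestic and foreign cars
tabstat price, by(foreign) statistics(min)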