I'm checking the distribution of test scores by year, subject, and grade. I want to make sure that there aren't any outliers, which would be anything more than 4 standard deviations from the mean. This is my code:
bys year subject tested_grade: summarize test_score
But when I try to use the returned scalars, I can only get the scalar corresponding to the last (year, subject, tested_grade) group. I've tried writing a loop, but it leads to the same problem.
I have found Nick Cox's extremes command but it doesn't tell me how many standard deviations the extreme values are from the mean.
If anyone has some ideas of how to check for outliers as determined by a measure of standard deviations away from the mean it would be really helpful.
Edit
This code gets me (mostly) what I want.
bys year subject tested_grade: summarize test_score
gen std_test_score = (test_score > 4*r(sd)) if test_score < .
list test_score std_test_score if std_test_score==1
The only problem is that r(sd) comes from the last (year, subject, tested_grade) group only. I'd want to create a variable - std_test_score1-20 - for each year, subject, and tested_grade.
Means and SDs may be generated for several groups at once by
bysort year subject tested_grade : egen mean_test_score = mean(test_score)
by year subject tested_grade: egen sd_test_score = sd(test_score)
gen std_test_score = (test_score - mean_test_score) / sd_test_score
Indeed, egen has a function std() to do this in one step, but it's often a good idea to re-create basics from even more basic principles.
Your code omits subtraction of the mean.
However, as underlined in comments, (value - mean) / SD is a poor criterion for outliers as outliers themselves influence the mean and SD. That's why, for example, box plots are based on median, quartiles and (commonly) points more than so many interquartile ranges away from the nearer quartile.
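As a hedged sketch of that more robust approach (variable names follow the question; the 1.5 multiplier is the conventional box-plot rule, not something from this thread), a quartile-based flag by group might look like:

```stata
* Flag values more than 1.5 IQRs beyond the nearer quartile,
* computed within each (year, subject, tested_grade) group.
bysort year subject tested_grade: egen p25 = pctile(test_score), p(25)
by year subject tested_grade: egen p75 = pctile(test_score), p(75)
gen iqr = p75 - p25
gen byte outlier = (test_score < p25 - 1.5*iqr | test_score > p75 + 1.5*iqr) if test_score < .
list year subject tested_grade test_score if outlier == 1
```

Unlike the mean/SD rule, extreme values here do not inflate the yardstick used to judge them.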
I have a rather simple question regarding the output of the tabstat command in Stata.
To be more specific, I have a large panel dataset containing several hundred thousand observations over a 9-year period.
The context:
bysort year industry: egen total_expenses=total(expenses)
This line should create total expenses by year and industry (or sum of all expenses by all id's in one particular year for one particular industry).
Then I'm using:
tabstat total_expenses, by(country)
As far as I understand, tabstat should show in a table format the means of expenses. Please do note that ids are different from countries.
In this case, does tabstat calculate the mean across all 9 years and all industries for a particular country, or just the mean for one year and one industry for each country in my panel data?
What would happen if this command is used in the following context:
bysort year industry: egen mean_expenses=mean(expenses)
tabstat mean_expenses, by(country)
Does tabstat create means of means? This is a little confusing.
I don't know what is confusing you about what tabstat does, but you need to be clear about what calculating means implies. Your dataset is far too big to post here, but for your sake as well as ours creating a tiny sandbox dataset would help you see what is going on. You should experiment with examples where the correct answer (what you want) is obvious or at least easy to calculate.
As a detail, your explanation that ids are different from countries is itself confusing. My guess is that your data are on firms and the identifier concerned identifies the firm. Then you have aggregations by industry and by country and separately by year.
bysort year industry: egen total_expenses = total(expenses)
This does calculate totals and assigns them to every observation. Thus if there are 123 observations for industry A and 2013, there will be 123 identical values of the total in the new variable.
tabstat total_expenses, by(country)
The important detail is that tabstat by default calculates and shows a mean. It just works on all the observations available, unless you specify otherwise. Stata has no memory or understanding of how total_expenses was just calculated. The mean will take no account of different numbers in each (industry, year) combination. There is no selection of individual values for (industry, year) combinations.
Your final question really has the same flavour. What your command asks for is a brute force calculation using all available data. In effect your calculations are weighted by the numbers of observations in whatever combinations of industry, country and year are being aggregated.
I suspect that you need to learn about two commands (1) collapse and (2) egen, specifically its tag() function. If you are using Stata 16, frames may be useful to you. That should apply to any future reader of this using a later version.
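As a hedged sketch of those two commands (assuming firm-level data with country, year and industry variables, and using a fresh variable name to avoid clashing with total_expenses above):

```stata
* (1) collapse: reduce to one row per (country, year, industry) cell,
* so a subsequent mean is not weighted by the number of firms per cell.
preserve
collapse (sum) tot_exp = expenses, by(country year industry)
tabstat tot_exp, by(country)
restore

* (2) egen's tag(): mark exactly one observation per cell, then
* restrict the summary to those marked observations.
egen byte tagged = tag(country year industry)
tabstat total_expenses if tagged, by(country)
```

Either way, each (year, industry) cell then contributes one value to the mean, not one value per firm.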
I am struggling with a question in Cameron and Trivedi's "Microeconometrics using Stata". The question concerns a cross-sectional dataset with two key variables, log of annual earnings (lnearns) and annual hours worked (hours).
I am struggling with part 2 of the question, but I'll type the whole thing for context.
A moving average of y after data are sorted by x is a simple case of nonparametric regression of y on x.
Sort the data by hours.
Create a centered 25-period moving average of lnearns, with ith observation yma_i = (1/25) * (sum from j = -12 to j = 12 of y_{i+j}). This is easiest using the command forvalues.
Plot this moving average against hours using the twoway connected graph command.
I'm unsure what command(s) to use for a moving average of cross-sectional data. Nor do I really understand what a moving average over one-period data shows.
Any help would be great and please say if more information is needed.
Thanks!
Edit1:
Should be able to download the dataset from here https://www.dropbox.com/s/5d8qg5i8xdozv3j/mus02psid92m.dta?dl=0. It is a small extract from the 1992 Individual-level data from the Panel Study of Income Dynamics - used in the textbook.
Still getting used to the syntax, but here is my attempt at it:
sort hours
gen yma=0
forvalues i = 1/4290 {
quietly replace yma = yma + (1/25)(lnearns[`i'-12] to lnearns[`i'+12])
}
There are other ways to do this, but here I create a variable for each lag and lead, take the sum of all of these variables plus the original, and then divide by the number of nonmissing terms, in the spirit of the equation you provided:
sort hours
// generate variables for the 12 leads and lags
forvalues i = 1/12 {
gen lnearns_plus`i' = lnearns[_n+`i']
gen lnearns_minus`i' = lnearns[_n-`i']
}
// get the sum of the lnearns variables
egen yma = rowtotal(lnearns_* lnearns)
// get the number of nonmissing lnearns variables
egen count = rownonmiss(lnearns_* lnearns)
// get the average
replace yma = yma/count
// clean up
drop lnearns_* count
This gives you the variable you are looking for (the moving average) and also does not simply divide by 25 because you have many missing observations.
As to your question of what this shows, my interpretation is that it will show the local average of earnings for each value of hours. If you graph lnearns on the y-axis and hours on the x-axis, you get something that looks crazy because there is a lot of variation, but if you plot the moving average the trend is much clearer.
In fact this dataset can be read into a suitable directory by
net from http://www.stata-press.com/data/musr
net install musr
net get musr
u mus02psid92m, clear
This smoothing method is problematic in that sort hours doesn't have a unique result in terms of values of the response being smoothed. But an implementation with similar spirit is possible with rangestat (SSC).
sort hours
gen counter = _n
rangestat (mean) mean=lnearns (count) n=lnearns, interval(counter -12 12)
There are many other ways to smooth. One is
gen binhours = round(hours, 50)
egen binmean = mean(lnearns), by(binhours)
scatter lnearns hours, ms(Oh) mc(gs8) || scatter binmean binhours , ms(+) mc(red)
Even better would be to use lpoly.
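For completeness, a minimal lpoly call might look like this (bandwidth left at the default; the bwidth() option would tune it):

```stata
* Local polynomial smooth of lnearns on hours; the raw scatter is
* overlaid by default.
lpoly lnearns hours
```

This avoids both the non-unique sort order and the arbitrary bin width of the cruder methods above.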
I am having trouble generating a new variable for each month when there are multiple entries per month.
date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
The code would look like this: gen newvar = sum(x*b), but I want to create the variable for each month.
What I tried so far was to create an index for the date1 variable with
sort date1
gen n=_n
and after that create a binary marker for when the date changes with
gen byte new=date1!=date[[_n-1]
After that I received a value for every other month, but I'm not sure whether this is correct, which is why I'd like someone to have a look and confirm it. The thing is, as there are a lot of values, it's hard to check manually whether the numbers are correct. I hope it's clear what I want to do.
Two comments on your code
There's a typo: date[[_n-1] should be date1[_n-1]
In your posted code there's no need for gen n = _n.
Maybe something along the lines of:
clear
set more off
*-----example data -----
input ///
str10 date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
end
gen date2 = monthly(date1, "YM")
format %tm date2
*----- what you want -----
gen month = month(dofm(date2))
bysort month: gen newvar = sum(x*b)
list, sepby(month)
will help.
But, notice that the series of the cumulative sum can be different for each run due to the way in which Stata sorts and because month does not uniquely identify observations. That is, the last observation will always be the same, but the way in which you arrive at the sum, observation-by-observation, won't be. If you want the total, then use egen, total() instead of sum().
If you want to group by month/year, then you want: bysort date2: ...
The key here is the by: prefix. See, for example, Speaking Stata: How to move step by: step by Nick Cox, and of course, help by.
A major error is touched on in this thread which deserves its own answer.
As used with generate the function sum() returns cumulative or running sums.
As used with egen the function name sum() is an out-of-date but still legal and functioning name for the egen function total().
The word "function" is over-loaded here even within Stata. egen functions are those documented under egen and cannot be used in any other command or context. In contrast, Stata functions can be used in many places, although the most common uses are within calls to generate or display (and examples can be found even of uses within egen calls).
This use of the same name for different things is undoubtedly the source of confusion. In Stata 9, the egen function name sum() went undocumented in favour of total(), but difficulties are still possible through people guessing wrong or not studying the documentation really carefully.
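A tiny example (data invented for illustration) makes the difference concrete:

```stata
clear
input x
1
2
3
end
gen running = sum(x)    // cumulative sum: 1, 3, 6
egen grand = total(x)   // grand total: 6 in every observation
list
```

Same name, two quite different results.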
I have Household IDs and their respective sales. As it turns out, there are a few of these HH IDs with extremely high total sales. Can you please suggest a good method for outlier treatment?
It would be great if you could suggest a method in SAS.
The following is a basic, rather crude method. It involves removing values more than 3 standard deviations from the mean:
** Standardise data;
proc standard data=sales_data mean=0 std=1 out=sales_data_std;
var sales;
run;
** Remove values more than 3 std devs from mean;
data sales_data_no_outliers;
set sales_data_std;
where -3 <= sales <= 3;
run;
There's a reference to this approach in Wikipedia.
Still, it's crude; it relies on your variable being normally distributed and will almost always find outliers (if n > 100) even if, in all reasonableness, the values are not really outlying.
The subject of outliers is long and detailed but a cursory overview of the topic might be useful. Unfortunately, I can't really think of any introductory sources off-hand.
I am pretty new to Stata and I am having difficulty doing something which I would guess is not that unusual of a thing to try to do. I am working with a panel data set (countries and times). Each observation consists of a country, a year, and a variable, call it x. The data is sorted by country year (i.e. all observations corresponding to a given country are consecutive and sorted by year).
Each country has 54 years of data corresponding to 1960 to 2013 inclusive. I would like to run a t-test something like in the following way:
by country: ttest x = x[54] if year != 2013
But I get an error ("weights not allowed") which I don't know how to interpret. I could do it by hardcoding it in and using the usual syntax
by country: ttest x = # if year != 2013
but I want to avoid hard-coding since there are >100 countries and I want to be able to flexibly add / remove countries (and this is just poor form in general).
My first thought was to define a macro using something like
levelsof country, local(levels)
foreach c of local levels {
local y x if year == 2013
ttest x = y if year != 2013
// some code to store the value that I haven't figured out yet
}
but you can't use "if" with declaring a local macro. I am pretty lost and would appreciate any help you all can give. Thank you!
Student's t tests here make little sense without adjustment for time and space dependence structure, unless you have grounds for treating your data as equivalent to independent draws from the same distribution. You can do the tests, but standard errors and P-values are dubious if not bogus. That is, your individual tests on time series face one problem; and collectively your tests face another problem. For a good account, see either edition of Box, Hunter, Hunter, Statistics for experimenters. John Wiley.
That large point aside, Stata is choking on the [] which are being misread as an attempt to specify weights. My guess is that
by country: ttest x = `=x[54]' if year != 2013
would be acceptable syntax to Stata, although still dubious statistics. The detail here is the macro-like syntax
`= '
which has the effect that the expression given will be evaluated by Stata before the line is passed to ttest. So the result, a numeric value, will be what the ttest command sees.
This is naturally similar in spirit to what you were imagining, although your code is some way from being legal and correct.
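Building on that, a hedged sketch of the loop the question was reaching for (assuming country is a string variable and that each country has a nonmissing 2013 value of x; the variable name pval is illustrative):

```stata
* For each country, test pre-2013 values of x against that country's
* 2013 value and store the two-sided p-value.
gen pval = .
levelsof country, local(levels)
foreach c of local levels {
    quietly summarize x if country == "`c'" & year == 2013, meanonly
    quietly ttest x = `r(mean)' if country == "`c'" & year != 2013
    quietly replace pval = r(p) if country == "`c'"
}
```

The caveats above about dependence across time and space still apply to every one of these tests.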
UPDATE This calculation may also be helpful:
egen mean = mean(x / (year != 2013)), by(country)
egen sd = sd(x / (year != 2013)), by(country)
gen z = (x - mean) / sd if year == 2013
list country x z if year == 2013