I have Household ID's and their respective sales. As it turn out there are few of these HH ID's who have extremely high Total Sales. Can you guys please suggest a good method for the outlier treatment.
It will be great if you suggest in SAS.
Regards,
Saket
The following is a basic, rather crude method. It involves removing values more than 3 standard deviations from the mean:-
** Standardise data;
proc standard data=sales_data mean=0 std=1 out=sales_data_std;
var sales;
run;
** Remove values more than 3 std devs from mean;
data sales_data_no_outliers;
set sales_data_std;
where sales < -3 or sales > 3;
run;
There's a reference to this approach in Wikipedia.
Still, it's crude; it relies on your variable being normally distributed and will almost always find outliers (if n > 100) even if, in all reasonableness, the values are not really outlying.
The subject of outliers is long and detailed but a cursory overview of the topic might be useful. Unfortunately, I can't really think of any introductory sources off-hand.
Related
I'm attempting to use SAS to do a pretty basic regression problem but I'm having trouble getting the full set of results.
I'm using a data set that includes professors' overall quality (the dependent variable) and has the following independent variables: gender, numYears, pepper, discipline, easiness, and rateInterest.
I'm using the code below to generate the analysis of the data set:
proc glm data=WORK.IMPORT;
class gender pepper discipline;
model quality = gender numYears pepper discipline easiness raterInterest;
run;
I get the following results, which is mostly what I need, EXCEPT that I would like to see exactly which responses from the class variables (gender, pepper, discipline) are significant.
From these results, I can see that easiness, rateInterest, pepper, and discipline are significant; however, I'd like to see which specific values of pepper and discipline are significant. For example, pepper was answered as a 'yes' or 'no' by the student. I'd like to see if quality correlates specifically to pepperyes or pepperno. Can anyone give me some advice about how to alter my code to return a breakdown of the class variables?
Here is also a link to the dataset, in case it's needed for reference:
https://drive.google.com/file/d/1Kc9cb_n-l7qwWRNfzXtZi5OsiY-gsYZC/view?usp=sharingRateprof
I really, truly appreciate any assistance!
Add the solution option to your model statement to break out statistics of each class variable; however, reference parameterization is not available in proc glm, and will cause biased estimates. There are ways around this to continue using proc glm, but the simplest solution is to use proc glmselect instead. proc glmselect allows you to specify reference parameterization. Use the selection=none option to disable variable selection.
proc glmselect data=WORK.IMPORT;
class gender(ref='female') pepper discipline / param=reference;
model quality = gender numYears pepper discipline easiness raterInterest / selection=none;
run;
The interpretation of this would be:
All other variables held constant, females affect the quality rating by
-0.046782 units compared to males. This variable is not statistically significant.
The breakdown of each class level is a comparison to a reference value. By default, the reference value selected is the last level after all class values are internally sorted. You can specify a reference using the ref= option after each class variable. For example, if you wanted to use females as a reference value instead of males:
proc glmselect data=WORK.IMPORT;
class gender(ref='female') pepper discipline;
model quality = gender numYears pepper discipline easiness raterInterest / selection=none;
run;
Note that you can also do this with prox mixed. For this specific purpose, the preference is up to you based on the output style that you like. proc mixed is a more flexible way to run regressions, but would be a bit overkill here.
proc mixed data=import;
class gender pepper discipline;
model quality = gender numYears pepper discipline easiness raterInterest / solution;
run;
I'm checking the distribution of test scores by year, subject, and grade. I want to make sure that there aren't any outliers, which would be anything more than 4 standard deviations from the mean. This is my code:
bys year subject tested_grade: summarize test_score
But when I try to get the scalars I can only get the scalar corresponding to the last year, subject, tested_grade. I've tried creating a loop but it leads to the same problem.
I have found Nick Cox's extremes command but it doesn't tell me how many standard deviations the extreme values are from the mean.
If anyone has some ideas of how to check for outliers as determined by a measure of standard deviations away from the mean it would be really helpful.
Edit
This code gets me (mostly) what I want.
bys year subject tested_grade: summarize test_score
gen std_test_score = (test_score > 4*r(sd)) if test_score < .
list test_score std_test_score if std_test_score==1
The only problem is that the last year, subject, and tested_grade is where the r(sd) comes from. I'd want to create a variable - std_test_score1-20 - for each year, subject, and tested_grade.
Means and SDs may be generated for several groups at once by
bysort year subject tested_grade : egen mean_test_score = mean(test_score)
by year subject tested_grade: egen sd_test_score = sd(test_score)
gen std_test_score = (test_score - mean_test_score) / sd_test_score
Indeed, egen has a function std() to do this in one step, but it's often a good idea to re-create basics from even more basic principles.
Your code omits subtraction of the mean.
However, as underlined in comments, (value - mean) / SD is a poor criterion for outliers as outliers themselves influence the mean and SD. That's why, for example, box plots are based on median, quartiles and (commonly) points more than so many interquartile ranges away from the nearer quartile.
i have 1 million + rows of data and on of the columns is channel_name. The people collecting the data didn't seem to care that they entered one channel in about 10 different variations, lots of which contain the # symbol. Google search isn't giving me any decent documentation, can anyone direct me to something useful?
To some extent the answer has to be, "it depends". Your actual data will determine the best solution to this; and there may not be one true solution - you may have to try a few things, and there may well be more manual work than you'd like.
One option is to build a format based on what you see. That format can either convert various values to one consistent value, or convert to a numeric category (which is then overlaid with a format that shows the consistent value).
For example, you might have 'channel' as retail store:
data have;
infile datalines truncover;
input #1 channel $8.;
datalines;
Best Buy
BestBuy
BB
;;;;
run;
So you can do one of two things:
proc format;
value $channel
"Best Buy","BB","BestBuy" = "Best Buy";
quit;
data want;
set have;
channel_coded = put(channel,$channel.);
run;
Or you can do:
proc format;
invalue channeli
"Best Buy", "BB","BestBuy" = 1
;
value channelf
1 = "Best Buy"
;
quit;
data want;
set have;
channel_coded = input(channel,CHANNELI.);
format channel_coded channelf.;
run;
Which you do is largely up to you - the latter gives you more flexibility in the long run, for example when Sears and K-Mart merged, it would be somewhat to take 2 and 16 and format then as Sears, than to change the stored values for the character format - and even easier to roll back if/when KMart splits off again.
This does require some manual work, though; you have to code things by hand here, or develop some method for figuring out what the coding is. You can use the other option in proc format to easily identify new values and add them to the format (which can be derived from a dataset, instead of hand written code), but at the end of the day the actual values you have determine what solution is best for the actual work of determining what is "Best Buy", and a by-hand solution (each time a new value comes in, it is looked at by a person and coded) may ultimately be the best.
I have a large SAS dataset and I would like to make a series of tables and charts using by value processing. I am outputing these to a PDF.
Is there any way to get SAS to alternate between the table and the chart as it goes through the data? Right now, I have to print all of the tables first and then print the charts. If it were just 4 tables/charts, then I would be ok writing
Here is a simple example:
data sample;
input byval $ item $ amount;
datalines;
A X 15
A Y 16
A Z 12
B X 25
B Y 10
B Z 18
;
run;
symbol1 i=j;
proc print data=sample;
by byval;
var item amount;
run;
proc gplot uniform data=sample;
by byval;
plot amount*item;
run;
This prints 2 tables, followed by 2 charts.
I would like the Chart for "A" to come after the table for "A" so that the reader can flip through the pdf and always see the associated charts and tables together.
I could write separate procs for each one, but then the gplot won't have a uniform axis (and it gets messy if I have 100 different groups instead of 2).
I thought about pumping them into greplay but then you can't use titles with "#BYVAL1".
Is there any easy way to do this?
I've never used it, but it may be worth checking out ODS DOCUMENT. This allows you to store the output of all your procedures and then reference specific items from them using PROC DOCUMENT.
Below is a link to the SAS website with useful information about this, in particular the paper by Cynthia Zender for the SAS Global Forum 2009.
http://support.sas.com/rnd/base/ods/odsdocument/index.html
Cynthia also regularly contributes to the SAS Support Communities website (https://communities.sas.com/community/support-communities), so it may be worth asking on there if you are still stuck.
Good luck
I don't know of any way to do what you ask directly. GREPLAY is probably the closest you'll come; the primary problem is that SAS processes the PROCs linearly, first processing the entire PROC PRINT, then the entire PROC GPLOT. GREPLAY would allow you to redisplay the output, but if that doesn't work for your needs due to the #BYVAL issue, I'm not sure there's a better solution. Perhaps you can modify the title afterwards (not sure if GREPLAY allows this)?
You could try using ODS LAYOUT, but I don't think that would be any better. The one way it could be better is if you can work out having two columns on a 'page', one column being the PROC PRINT outputs, one the PROC GPLOT, and then print the columns one page than the other. I don't think this is possible, but it might be worth exploring.
You might also try setting up a macro to do each BYVAL separately, defining the axis in a uniform manner manually (ie, defining it based on your own calculation of the correct axis parameters, as an argument to the macro). That is probably the easiest solution that might still allow #BYVAL to work properly.
You might also try browsing about Richard DeVenezia's site (http://www.devenezia.com/downloads/sas/samples/ ) which has a lot of examples of SAS/GRAPH solutions. He also posts on SAS-L (sasl#listserv.uga.edu) sometimes, not sure if I've seen him on StackOverflow. He's probably the person most likely to be able to answer the question that I know of.
Is it possible to choose the whisker value with proc sgplot.
Because it seems only 25th and 75th are avalaible for SGPLOT.
Maybe someone know if it is possible or not?
Thanks
It is not possible to add non-standard whiskers, probably because it is discouraged by statisticians.
They discourage it because the boxplot has a specific definition in terms of the quartiles.
While there are occasional variations, (i.e., to get a rough normality plot),
in general people expect to see quartiles in a box plot.
Adding arbitrary percentiles, even ones that make sense like the ones you propose,
is likely to confuse the audience more than it helps.
Try this visualization: A waterfall graph of sales contributions based on the percentile intervals you suggest:
data actualBinned; set sashelp.prdsale;
keep actual;
run;
proc rank data=actualBinned out=actualBinned
groups=100
descending;
var actual;
ranks rank;
run;
data actualBinned; set actualBinned;
if rank < 5 then bin="00-05";
else if rank < 25 then bin="05-25";
else if rank < 50 then bin="25-50";
else if rank < 75 then bin="50-75";
else if rank < 95 then bin="75-95";
else bin="95-100";
run;
proc sort data=actualBinned;
by bin;
run;
proc sgplot data=actualBinned;
waterfall category=bin response=actual;
run;
I am not a huge fan of bins of different width displayed with the same width. I would rather use 20 bins of width 5.
With that caveat, I can see how a manager might find this visualization more useful in a specific context.
BTW, the waterfall graph is experimental in 9.3. For older version of SAS there are several recipes online.