Please help me with the sas code. I need to calculate rolling standard deviation by firm (permno) within 60 months. For this is use the following code:
proc sort data = Crsp;
by permno date;
run;
proc expand data = Crsp out = std_a;
by permno;
convert ret=rw3_std / transformout = (movstd 60 trim 59);
run;
What it does is it provides me with the standard deviations, but after calculating some standard deviations it provides no calculations. For example, it starts to provide stand. dev. from observation 60 (since the trim is 59) and after 280th observation it does not calculate stand. deviation even there are observations. Is there anything wrong with the code?
Related
I have this dataset and need to calculate the days' difference between each dose date per period. How do I label each period study date so I can carry out an intck to calculate the days' difference per subject (ptno)
Just use the DIF() function to calculate the change in value for your date variable. SAS stores dates as number of days so the difference will be the number of days between the two observations. You could then test if the difference is 7 days or not.
data want;
set have;
by ptno period;
interval = dif(ex_stadt);
if first.ptno then interval=0;
seven_days = (interval = 7) ;
run;
The code of Tom works very well. I simulated the data set with a few rows based in
the sample showed above and it's OK.
Only thing absent is PROC SORT. If the data set is huge the log will exhibit an error.
proc sort data=have;
by ptno period;
run;
Can anyone help me understand the Premodel and Postmodel adjustments for Oversampling using the offset method ( preferably in Base SAS in Proc Logistic and Scoring) in Logistic Regression .
I will take an example. Considering the traditional Credit scoring model for a bank, lets say we have 10000 customers with 50000 good and 2000 bad customers. Now for my Logistic Regression I am using all 2000 bad and random sample of 2000 good customers. How can I adjust this oversampling in Proc Logistic using options like Offset and also during scoring. Do you have any references with illustrations on this topic?
Thanks in advance for your help!
Ok here are my 2 cents.
Sometimes, the target variable is a rare event, like fraud. In this case, using logistic regression will have significant sample bias due to insufficient event data. Oversampling is a common method due to its simplicity.
However, model calibration is required when scores are used for decisions (this is your case) – however nothing need to be done if the model is only for rank ordering (bear in mind the probabilities will be inflated but order still the same).
Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (or oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect.
Suppose the true model is: ln(y/(1-y))=b0+b1*x. When using oversampling, the b1′ is consistent with the true model, however, b0′ is not equal to bo.
There are generally two ways to do that:
weighted logistic regression,
simply adding offset.
I am going to explain the offset version only as per your question.
Let’s create some dummy data where the true relationship between your DP (y) and your IV (iv) is ln(y/(1-y)) = -6+2iv
data dummy_data;
do j=1 to 1000;
iv=rannor(10000); *independent variable;
p=1/(1+exp(-(-6+2*iv))); * event probability;
y=ranbin(10000,1,p); * independent variable 1/0;
drop j;
output;
end;
run;
and let’s see your event rate:
proc freq data=dummy_data;
tables y;
run;
Cumulative Cumulative
y Frequency Percent Frequency Percent
------------------------------------------------------
0 979 97.90 979 97.90
1 21 2.10 1000 100.00
Similar to your problem the event rate is p=0.0210, in other words very rare
Let’s use poc logistic to estimate parameters
proc logistic data=dummy_data;
model y(event="1")=iv;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.4337 0.4874 124.3027 <.0001
iv 1 1.8356 0.2776 43.7116 <.0001
Logistic result is quite close to the real model however basic assumption will not hold as you already know.
Now let’s oversample the original dataset by selecting all event cases and non-event cases with p=0.2
data oversampling;
set dummy_data;
if y=1 then output;
if y=0 then do;
if ranuni(10000)<1/20 then output;
end;
run;
proc freq data=oversampling;
tables y;
run;
Cumulative Cumulative
y Frequency Percent Frequency Percent
------------------------------------------------------
0 54 72.00 54 72.00
1 21 28.00 75 100.00
Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.
proc logistic data=oversampling;
model y(event="1")=iv;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -2.9836 0.6982 18.2622 <.0001
iv 1 2.0068 0.5139 15.2519 <.0001
As you can see the iv estimate still close to the real value but your intercept has changed from -5.43 to -2.98 which is very different from our true value of -6.
Here is where the offset plays its part. The offset is the log of the ratio between known population and sample event probabilities and adjust the intercept based on the true distribution of events rather than the sample distribution (the oversampling dataset).
Offset = log(0.28)/(1-0.28)*(0.0210)/(1-0.0210) = 2.897548
So your intercept adjustment will be intercept = -2.9836-2.897548= -5.88115 which is quite close to the real value.
Or using the offset option in proc logistic:
data oversampling_with_offset;
set oversampling;
off= log((0.28/(1-0.28))*((1-0.0210)/0.0210)) ;
run;
proc logistic data=oversampling_with_offset;
model y(event="1")=iv / offset=off;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.8811 0.6982 70.9582 <.0001
iv 1 2.0068 0.5138 15.2518 <.0001
off 1 1.0000 0 . .
From here all your estimates are correctly adjusted and analysis & interpretation should be carried out as normal.
Hope its help.
This is a great explanation.
When you oversample or undersample in the rare event experiment, the intercept is impacted and not slope. Hence in the final output , you just need to adjust the intercept by adding offset statement in proc logistic in SAS. Probabilities are impacted by oversampling but again, ranking in not impacted as explained above.
If your aim is to score your data into deciles, you do not need to adjust the offset and can rank the observations based on their probabilities of the over sampled model and put them into deciles (Using Proc Rank as normal). However, the actual probability scores are impacted so you cannot use the actual probability values. ROC curve is not impacted as well.
My goal is to fit a data to any distribution which has positive support. (weibull(2p), gamma(2p), pareto(2p), lognormal (2p),exponential(1P)). First attempt,i used proc univariate.This is my code
proc univariate data=fit plot outtable=table;
var week1;
histogram / exp gamma lognormal weibull pareto;
inset n mean(5.3) std='Standar Deviasi'(5.3)
/ pos = ne header = 'Summary Statistics';
axis1 label=(a=90 r=0);
run;
The first thing i noticed, there's no kolmogorov statistic shown for weibull distribution.Then i used proc severity instead.
proc severity data=fit print=all plots(histogram kernel)=all;
loss week1;
dist exp pareto gamma logn weibull;
run;
Now, i got the KS statistic for weibull distribution.
Then i compared KS statistic produced by proc severity and proc univariate. They're different. Why? Which one should i use?
I do not have access to SAS/ETS so cannot confirm this with proc severity, but I imagine that the difference you are seeing come down to the way the distribution parameters are fitted.
With your proc univriate code you are not requesting estimation for several of the parameters (some are in some cases set to 1 or 0 by default, see sigma and theta in the user guide). For example:
data have;
do i = 1 to 1000;
x = rand("weibull", 5, 5);
output;
end;
run;
ods graphics on;
proc univariate data = have;
var x;
/* Request maximum liklihood estimate of scale and threshold parameters */
histogram / weibull(theta = EST sigma = EST);
/* Request maximum liklihood estimate of scale parameter and 0 as threshold */
histogram / weibull;
run;
You will note that when an estimate of theta is requested SAS also produces the KS statistic, this is due to the way that SAS estimates the fit statistic requiring know distribution parameters (full explanation here).
My guess is that you are seeing different fit statistics between the two procedures because either they are returning slightly different fits, or they use different calculations for the estimation of fit statistics. If you are interested you can investigate how they perform their parameter estimation in the user guide (proc severity and proc univariate). If you wanted to investigate further you could force the distribution parameters to match in both procedures and then compare the fit statistics to see how far they differ.
I would recommend that if possible you use only one of the procedure, and that you select the one that best fits your needs in terms of output.
My data set is really simple, just one colum with a ratio and another colum with a categorical var, I need to calculate the standard deviation for each class as well as the confidence interval.
Is there a built in function in SAS (proc SQL) to calculate the conficende interval of the standar deviation???
something like the excel function confidence() does?
thanks!
Not Proc SQl but PROC Univariate will give you the confidence intervals of mean, standard deviations and variance. The details are available in SAS support documents:
https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_univariate_sect064.htm
The following statements produce confidence limits for the mean, standard deviation, and variance of the population of heights:
ods select BasicIntervals;
proc univariate data=Heights cibasic;
var Height;
run;
Is it possible to choose the whisker value with proc sgplot.
Because it seems only 25th and 75th are avalaible for SGPLOT.
Maybe someone know if it is possible or not?
Thanks
It is not possible to add non-standard whiskers, probably because it is discouraged by statisticians.
They discourage it because the boxplot has a specific definition in terms of the quartiles.
While there are occasional variations, (i.e., to get a rough normality plot),
in general people expect to see quartiles in a box plot.
Adding arbitrary percentiles, even ones that make sense like the ones you propose,
is likely to confuse the audience more than it helps.
Try this visualization: A waterfall graph of sales contributions based on the percentile intervals you suggest:
data actualBinned; set sashelp.prdsale;
keep actual;
run;
proc rank data=actualBinned out=actualBinned
groups=100
descending;
var actual;
ranks rank;
run;
data actualBinned; set actualBinned;
if rank < 5 then bin="00-05";
else if rank < 25 then bin="05-25";
else if rank < 50 then bin="25-50";
else if rank < 75 then bin="50-75";
else if rank < 95 then bin="75-95";
else bin="95-100";
run;
proc sort data=actualBinned;
by bin;
run;
proc sgplot data=actualBinned;
waterfall category=bin response=actual;
run;
I am not a huge fan of bins of different width displayed with the same width. I would rather use 20 bins of width 5.
With that caveat, I can see how a manager might find this visualization more useful in a specific context.
BTW, the waterfall graph is experimental in 9.3. For older version of SAS there are several recipes online.