Can anyone help me understand the pre-model and post-model adjustments for oversampling using the offset method in logistic regression (preferably in Base SAS, with Proc Logistic and scoring)?
I will take an example. Consider the traditional credit scoring model for a bank, and let's say we have 50000 good and 2000 bad customers. For my logistic regression I am using all 2000 bad customers and a random sample of 2000 good customers. How can I adjust for this oversampling in Proc Logistic using options like Offset, and also during scoring? Do you have any references with illustrations on this topic?
Thanks in advance for your help!
Ok here are my 2 cents.
Sometimes the target variable is a rare event, like fraud. In that case a logistic regression fitted to the raw data suffers from having too few events to learn from. Oversampling the events is a common remedy because it is simple.
However, model calibration is required when the scores are used for decisions (this is your case). Nothing needs to be done if the model is only used for rank ordering: bear in mind the probabilities will be inflated, but the order stays the same.
Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (or oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect.
Suppose the true model is ln(y/(1-y)) = b0 + b1*x. With oversampling, the estimate b1′ is consistent with the true b1, but b0′ is not equal to b0.
There are generally two ways to correct for this:
1. weighted logistic regression,
2. simply adding an offset.
I am going to explain the offset version only, as per your question.
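For completeness, the weights that the first option relies on can be sketched in a few lines. This is only an illustration in Python rather than SAS: the function name is hypothetical, and the two rates are the population and sample event rates that appear in the example that follows (2.1% and 28%). Each class is weighted by its population share divided by its sample share.

```python
def oversampling_weights(pop_rate, sample_rate):
    """Return (event_weight, nonevent_weight) that restore the population mix."""
    w_event = pop_rate / sample_rate
    w_nonevent = (1.0 - pop_rate) / (1.0 - sample_rate)
    return w_event, w_nonevent

# Rates taken from the example in this answer; the function name is an assumption.
w1, w0 = oversampling_weights(pop_rate=0.0210, sample_rate=0.28)
print(round(w1, 4), round(w0, 4))
```

In SAS these values would typically be stored in a variable and passed through a WEIGHT statement, which is the part this answer does not cover.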
Let’s create some dummy data where the true relationship between your DV (y) and your IV (iv) is ln(y/(1-y)) = -6 + 2*iv
data dummy_data;
  do j=1 to 1000;
    iv=rannor(10000);          * independent variable;
    p=1/(1+exp(-(-6+2*iv)));   * event probability;
    y=ranbin(10000,1,p);       * dependent variable 1/0;
    drop j;
    output;
  end;
run;
and let’s see your event rate:
proc freq data=dummy_data;
tables y;
run;
                               Cumulative    Cumulative
y    Frequency     Percent      Frequency       Percent
------------------------------------------------------
0          979       97.90            979         97.90
1           21        2.10           1000        100.00
Similar to your problem, the event rate is p = 0.0210, in other words very rare.
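To see that the rare event rate is a consequence of the coefficients and not of one lucky draw, the same data-generating process can be sketched in Python with a much larger n. This is an illustration only: the function name and the sample size are assumptions, and the random stream differs from SAS, so the counts will not match the table above.

```python
import math
import random

def simulate_event_rate(n, intercept=-6.0, slope=2.0, seed=10000):
    """Draw iv ~ N(0,1), set p = logistic(intercept + slope*iv), draw y ~ Bernoulli(p)."""
    rng = random.Random(seed)
    events = 0
    for _ in range(n):
        iv = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(intercept + slope * iv)))
        if rng.random() < p:
            events += 1
    return events / n

rate = simulate_event_rate(100_000)
print(rate)  # a small fraction, in line with a rare event
```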
Let’s use proc logistic to estimate the parameters:
proc logistic data=dummy_data;
model y(event="1")=iv;
run;
                            Standard          Wald
Parameter    DF  Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1   -5.4337     0.4874      124.3027        <.0001
iv            1    1.8356     0.2776       43.7116        <.0001
The logistic estimates are quite close to the true model; however, as you already know, with so few events they rest on very little event data.
Now let’s oversample the original dataset by keeping all event cases and sampling non-event cases with probability 1/20 (p = 0.05):
data oversampling;
set dummy_data;
if y=1 then output;
if y=0 then do;
if ranuni(10000)<1/20 then output;
end;
run;
proc freq data=oversampling;
tables y;
run;
                               Cumulative    Cumulative
y    Frequency     Percent      Frequency       Percent
------------------------------------------------------
0           54       72.00             54         72.00
1           21       28.00             75        100.00
Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.
proc logistic data=oversampling;
model y(event="1")=iv;
run;
                            Standard          Wald
Parameter    DF  Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1   -2.9836     0.6982       18.2622        <.0001
iv            1    2.0068     0.5139       15.2519        <.0001
As you can see, the iv estimate is still close to the true value, but your intercept has changed from -5.43 to -2.98, which is very different from our true value of -6.
Here is where the offset plays its part. The offset is the log of the ratio between the sample event odds and the known population event odds; it adjusts the intercept to reflect the true distribution of events rather than the sample distribution (the oversampled dataset).
Offset = log((0.28/(1-0.28)) * ((1-0.0210)/0.0210)) = 2.897548
So your adjusted intercept will be -2.9836 - 2.897548 = -5.88115, which is quite close to the true value.
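The arithmetic is easy to check outside SAS. The Python sketch below (function and variable names are hypothetical) computes the offset as the log of sample odds over population odds and subtracts it from the oversampled intercept reported above.

```python
import math

def offset(sample_rate, pop_rate):
    """Log of (sample event odds / population event odds)."""
    sample_odds = sample_rate / (1.0 - sample_rate)
    pop_odds = pop_rate / (1.0 - pop_rate)
    return math.log(sample_odds / pop_odds)

off = offset(sample_rate=0.28, pop_rate=0.0210)
adjusted_intercept = -2.9836 - off   # -2.9836 is the intercept from the oversampled fit
print(round(off, 6), round(adjusted_intercept, 4))
```

The two printed values should agree with the offset and the adjusted intercept quoted in the text, up to rounding.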
Or using the offset option in proc logistic:
data oversampling_with_offset;
set oversampling;
off= log((0.28/(1-0.28))*((1-0.0210)/0.0210)) ;
run;
proc logistic data=oversampling_with_offset;
model y(event="1")=iv / offset=off;
run;
                            Standard          Wald
Parameter    DF  Estimate      Error    Chi-Square    Pr > ChiSq
Intercept     1   -5.8811     0.6982       70.9582        <.0001
iv            1    2.0068     0.5138       15.2518        <.0001
off           1    1.0000          0             .             .
From here all your estimates are correctly adjusted and analysis & interpretation should be carried out as normal.
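For the scoring part of the question, the same correction can be applied to each new observation: compute the linear predictor from the oversampled fit, subtract the offset, and convert back to a probability. A minimal Python sketch follows, assuming the estimates printed above; the function names are hypothetical and this is not the output of any SAS scoring step.

```python
import math

def sample_probability(iv, intercept=-2.9836, slope=2.0068):
    """Probability on the scale of the oversampled data (inflated)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * iv)))

def population_probability(iv, intercept=-2.9836, slope=2.0068, off=2.897548):
    """Probability on the population scale: shift the logit down by the offset."""
    logit_sample = intercept + slope * iv
    return 1.0 / (1.0 + math.exp(-(logit_sample - off)))

p_sample = sample_probability(0.0)
p_pop = population_probability(0.0)
print(p_sample, p_pop)  # the corrected probability is much smaller
```

In a DATA step the same two lines of arithmetic would be applied to the scored logit before the probability is used for decisions.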
Hope this helps.
This is a great explanation.
When you oversample or undersample in a rare-event setting, the intercept is affected but not the slope. Hence, for the final output, you just need to adjust the intercept by adding the OFFSET= option to the MODEL statement in proc logistic in SAS. Probabilities are affected by oversampling but, again, ranking is not affected, as explained above.
If your aim is to score your data into deciles, you do not need to apply the offset: you can rank the observations by their probabilities from the oversampled model and put them into deciles (using Proc Rank as normal). However, the actual probability values are affected, so you cannot use them as they are. The ROC curve is not affected either.
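The reason ranking survives is that the correction subtracts the same constant from every logit, which is a monotone transformation. A small Python sketch makes this concrete; the probabilities are made-up values and the helper names are hypothetical.

```python
import math

def adjust(p, off):
    """Shift a probability's logit down by the offset and convert back."""
    logit = math.log(p / (1.0 - p)) - off
    return 1.0 / (1.0 + math.exp(-logit))

def rank_order(scores):
    """Indices of the observations from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

raw = [0.05, 0.62, 0.31, 0.90, 0.12]          # hypothetical oversampled-model probabilities
adjusted = [adjust(p, 2.897548) for p in raw]  # offset value from the answer above
same_order = rank_order(raw) == rank_order(adjusted)
print(same_order)
```

Every adjusted value is smaller than its raw counterpart, yet the ordering, and therefore any decile assignment, is identical.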
Related
I am fitting a logit model to a dataset of households, using as the dependent variable the classification of the household as poor or not poor (1 if it is poor, 0 if it is not):
proc logistic data=regression;
model poor(event="1") = variable1 variable2 variable3 variable4;
run;
Using proc logistic in SAS, I obtained the table "Association of predicted probabilities and observed responses", which tells me the concordant percentage. However, I need detailed information on how many households are correctly classified as poor.
I will appreciate your help with this issue.
Add the CTABLE option to your MODEL statement.
model poor(event="1") = variable1 variable2 variable3 variable4 / ctable;
CTABLE classifies the input binary response observations according to
whether the predicted event probabilities are above or below some
cutpoint value z in the range (0,1). An observation is predicted as an
event if the predicted event probability exceeds or equals z. You can
supply a list of cutpoints other than the default list by specifying
the PPROB= option. Also, you can compute positive and negative
predictive values as posterior probabilities by using Bayes’ theorem.
You can use the PEVENT= option to specify prior probabilities for
computing these statistics. The CTABLE option is ignored if the data
have more than two response levels. This option is not available with
the STRATA statement.
For more information, see the section Classification Table.
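What the table reports can be illustrated with a short Python sketch, assuming a single cutpoint and using made-up data. The function name and the example values are hypothetical, and PROC LOGISTIC's own counts may differ slightly because, as far as I know, it bases the table on bias-adjusted predicted probabilities.

```python
def classification_table(y_true, p_event, cutpoint=0.5):
    """Count correct and incorrect classifications at one cutpoint."""
    table = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for y, p in zip(y_true, p_event):
        predicted_event = p >= cutpoint   # event if probability exceeds or equals z
        if predicted_event and y == 1:
            table["true_pos"] += 1
        elif predicted_event and y == 0:
            table["false_pos"] += 1
        elif not predicted_event and y == 0:
            table["true_neg"] += 1
        else:
            table["false_neg"] += 1
    return table

# Hypothetical households: 1 = poor, 0 = not poor, with made-up predicted probabilities.
poor = [1, 1, 0, 0, 1]
p_hat = [0.9, 0.4, 0.6, 0.1, 0.5]
table = classification_table(poor, p_hat, cutpoint=0.5)
print(table)
```

Here "true_pos" is the number of poor households correctly classified as poor, which is the figure the question asks for.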
I have a data set, call it training, in which I need to predict column Y using X. The conditional probability distribution of Y given X is stored in a separate lookup table, and it is a Table distribution in SAS terminology. X and Y both take values in {1,2,...,44}, so the probability lookup table is 44 x 44.
Suppose my probability look up table is
x p1 p2 ... p44
1 0.001 0.004 ... 0.0078
2 0.0045 ... 0.000
.. ... ...
44 0.00089 ... 0.00067
The conditional probabilities are highly skewed, and my training data has a large N.
I am looking for an efficient, fast way of matching and retrieving the conditional probabilities to predict y in the training data. I am considering a hash object to speed up the lookup, and maybe some sampling method other than RAND('table', of p1-p44) to possibly make the program run faster.
Below is a mock hash program for this task.
data prediction;
  if _n_ = 1 then do;
    /* Define x and p1-p44 in the PDV without reading any rows */
    if 0 then set work.prob_lookup;
    dcl hash lookup(dataset:'work.prob_lookup');
    lookup.definekey('x');
    /* all:'yes' loads every variable of the data set, so p1 to p44 need not be listed */
    lookup.definedata(all:'yes');
    lookup.definedone();
  end;
  set work.training;
  rc = lookup.find();
  if rc = 0 then do;
    y_predict = rand('table', of p1-p44);
    output;
  end;
run;
The training data table for prediction has a large N, about 2k+, but the program should also cope with more rows when the probability lookup table is applied for prediction. The probability lookup table is 44*44, relatively small, but highly skewed.
Help on building faster code for this task is much appreciated. If there's an easy way to do it in R, that would be appreciated as well.
Not sure what HASH adds to this problem.
data prediction;
merge training prob_lookup;
by x;
y_predict=rand('table',of p1-p44);
run;
I guess it might save having to sort your training set by X? But if your X values are just integers between 1 and 44, then why not just use the POINT= option?
data prediction;
set training ;
pointer=x;
if pointer in (1:44) then do;
set prob_lookup point=pointer ;
y_predict=rand('table',of p1-p44);
end;
else call missing(y_predict,of p1-p44);
run;
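Outside SAS, the same lookup-then-sample idea can be sketched with cumulative sums and a binary search. The Python below is an illustration under assumptions: the lookup is held as a dictionary from x to its row of probabilities, the function names are hypothetical, and a tiny 3 x 3 table stands in for the 44 x 44 one.

```python
import bisect
import itertools
import random

def build_cumulative(prob_rows):
    """Precompute the running sum of each row once, keyed by x."""
    return {x: list(itertools.accumulate(row)) for x, row in prob_rows.items()}

def predict_y(x, cumulative, rng):
    """Draw y in 1..k from the row for x; return None when x has no row."""
    cum = cumulative.get(x)
    if cum is None:
        return None
    u = rng.random() * cum[-1]
    index = bisect.bisect_right(cum, u)
    return min(index, len(cum) - 1) + 1   # guard against rounding at the top end

# Hypothetical lookup table with degenerate rows so the outcome is easy to check.
rows = {1: [0.0, 1.0, 0.0], 2: [1.0, 0.0, 0.0]}
cumulative = build_cumulative(rows)
rng = random.Random(1)
draws = [predict_y(1, cumulative, rng) for _ in range(100)]
```

Precomputing the cumulative rows means each prediction costs one dictionary lookup plus one binary search, whatever the skew of the probabilities.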
I was looking for a way to compute the effect size for Friedman's test using SAS, but could not find any reference. I want to see whether there is any difference between the groups and how large it is.
Here is my code:
proc freq data=mydata;
tables id*b*y / cmh2 scores=rank noprint;
run;
These are the results:
The FREQ Procedure
Summary Statistics for b by y
Controlling for id

Cochran-Mantel-Haenszel Statistics (Based on Rank Scores)
Statistic    Alternative Hypothesis    DF       Value      Prob
    1        Nonzero Correlation        1    230.7145    <.0001
    2        Row Mean Scores Differ     1    230.7145    <.0001
This question is related to the one posted on Cross Validated, which is concerned with the general statistical formula for the effect size of Friedman's test. Here, I would like to find out how to get the effect size in SAS.
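One commonly used effect size for Friedman's test is Kendall's W, obtained by dividing the Friedman chi-square (the "Row Mean Scores Differ" statistic above) by N*(k-1), where N is the number of blocks (here, distinct id values) and k is the number of treatments. The sketch below is in Python and is only an illustration: the function name is hypothetical, and the numbers in the example are made up because the number of ids is not shown in the output.

```python
def kendalls_w(friedman_chi2, n_blocks, k_treatments):
    """Kendall's coefficient of concordance from a Friedman chi-square."""
    return friedman_chi2 / (n_blocks * (k_treatments - 1))

# Hypothetical values, not taken from the output above.
w = kendalls_w(friedman_chi2=10.0, n_blocks=10, k_treatments=3)
print(w)
```

With DF = 1 in the output above, k would be 2, so W would simply be the reported statistic divided by the number of ids.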
My goal is to fit data to any distribution with positive support (Weibull (2p), gamma (2p), Pareto (2p), lognormal (2p), exponential (1p)). As a first attempt, I used proc univariate. This is my code:
proc univariate data=fit plot outtable=table;
var week1;
histogram / exp gamma lognormal weibull pareto;
inset n mean(5.3) std='Standar Deviasi'(5.3)
/ pos = ne header = 'Summary Statistics';
axis1 label=(a=90 r=0);
run;
The first thing I noticed is that no Kolmogorov statistic is shown for the Weibull distribution. So I used proc severity instead.
proc severity data=fit print=all plots(histogram kernel)=all;
loss week1;
dist exp pareto gamma logn weibull;
run;
Now I get the KS statistic for the Weibull distribution.
Then I compared the KS statistics produced by proc severity and proc univariate. They are different. Why? Which one should I use?
I do not have access to SAS/ETS so cannot confirm this with proc severity, but I imagine that the differences you are seeing come down to the way the distribution parameters are fitted.
With your proc univariate code you are not requesting estimation of several of the parameters (some are set to 1 or 0 by default; see sigma and theta in the user guide). For example:
data have;
do i = 1 to 1000;
x = rand("weibull", 5, 5);
output;
end;
run;
ods graphics on;
proc univariate data = have;
var x;
/* Request maximum likelihood estimates of the scale and threshold parameters */
histogram / weibull(theta = EST sigma = EST);
/* Request maximum likelihood estimate of the scale parameter, with 0 as threshold */
histogram / weibull;
run;
You will note that when an estimate of theta is requested, SAS also produces the KS statistic. This is due to the way SAS computes the fit statistic, which requires known distribution parameters (full explanation here).
My guess is that you are seeing different fit statistics between the two procedures because either they are returning slightly different fits, or they use different calculations for the estimation of fit statistics. If you are interested you can investigate how they perform their parameter estimation in the user guide (proc severity and proc univariate). If you wanted to investigate further you could force the distribution parameters to match in both procedures and then compare the fit statistics to see how far they differ.
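The dependence of the KS statistic on the fitted parameters is easy to demonstrate. The Python sketch below computes the statistic for one sample against two Weibull fits; the sample and both parameter sets are made-up values, the function names are hypothetical, and this is the textbook formula, not necessarily the exact computation used by either procedure.

```python
import math

def weibull_cdf(x, shape, scale, threshold=0.0):
    """CDF of a three-parameter Weibull distribution."""
    if x <= threshold:
        return 0.0
    return 1.0 - math.exp(-(((x - threshold) / scale) ** shape))

def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF and the fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

sample = [3.1, 4.2, 4.6, 5.0, 5.3, 5.9]   # hypothetical observations
d_a = ks_statistic(sample, lambda x: weibull_cdf(x, shape=5.0, scale=5.0))
d_b = ks_statistic(sample, lambda x: weibull_cdf(x, shape=4.0, scale=5.5))
print(d_a, d_b)  # same data, different parameters, different KS values
```

Forcing both procedures to use the same parameters, as suggested above, removes this source of difference and leaves only differences in how each one computes the statistic.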
I would recommend that, if possible, you use only one of the procedures, and that you select the one that best fits your needs in terms of output.
I'm trying to create a simulation of drug concentration based on the dose of a drug given. I have some preliminary data and I used a random effects model to analyze the relationship between log(dose), predicting log(drug concentration), modelling subject as a random effect.
The results of that analysis are below. I want to take these results and simulate similar data in SAS, so I can look at the effect of changing doses on the resulting concentration of drug in the body. I know that when I simulate the data, I need to ensure the random slope is correlated with the random intercept, but I'm unsure exactly how to do that. Any example code would be appreciated.
Random effects:
 Formula: ~LDOS | RANDID
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev     Corr
(Intercept) 0.15915378 (Intr)
LDOS        0.01783609 0.735
Residual    0.05790635

Fixed effects: LCMX ~ LDOS
               Value  Std.Error DF  t-value p-value
(Intercept) 3.340712 0.04319325 16 77.34339       0
LDOS        1.000386 0.01034409 11 96.71090       0
 Correlation:
     (Intr)
LDOS -0.047
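One standard way to get a random slope that is correlated with the random intercept is to build both from two independent standard normal draws (a 2 x 2 Cholesky factorization written out by hand). The sketch below is in Python, not SAS, and is only an illustration: the constants are the estimates printed above, while the function names and the example doses are hypothetical.

```python
import math
import random

SD_INT, SD_SLOPE, RHO = 0.15915378, 0.01783609, 0.735   # random-effect SDs and correlation
SD_RESID = 0.05790635                                    # residual SD
BETA0, BETA1 = 3.340712, 1.000386                        # fixed intercept and slope

def draw_random_effects(rng):
    """Return (b0, b1) with the stated SDs and correlation RHO."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    b0 = SD_INT * z1
    b1 = SD_SLOPE * (RHO * z1 + math.sqrt(1.0 - RHO ** 2) * z2)
    return b0, b1

def simulate_subject(log_doses, rng):
    """Simulate log concentrations for one subject at the given log doses."""
    b0, b1 = draw_random_effects(rng)
    return [(BETA0 + b0) + (BETA1 + b1) * ld + rng.gauss(0.0, SD_RESID)
            for ld in log_doses]

def sample_correlation(pairs):
    """Pearson correlation of a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(42)
log_conc = simulate_subject([math.log(d) for d in (10, 20, 40)], rng)  # hypothetical doses
check = sample_correlation([draw_random_effects(rng) for _ in range(20000)])
```

The same construction carries over to a DATA step: draw two independent standard normals per subject with rand('normal'), combine them as in draw_random_effects, and then loop over the doses.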