I'm trying to calculate some odds ratios and significance for something that can be put into a 2x2 table. The problem is the Fisher test in SAS is taking a long time.
I already have the cell counts. I could calculate a chi-square if not for the fact that some of the sample sizes are extremely small. And yet some are extremely large, with cell sizes in the hundreds of thousands.
When I try to compute these in R, I have no problem. However, when I try to compute them in SAS, it either takes way too long, or simply errors out with the message "Fishers exact test cannot be computed with sufficient precision for this sample size."
When I create a toy example (pull one instance from the dataset, and calculate it) it does calculate, but takes a long time.
Data Bob;
Input targ $ status $ wt;
Cards;
A c 4083
A d 111
B c 376494
B d 114231
;
Run;
Proc freq data = Bob;
Weight wt;
Tables targ*status;
Exact Fisher;
Run;
What is going wrong here?
That's funny. SAS calculates the Fisher's exact test p-value the exact way, by enumerating the hypergeometric probability of every single table in which the odds ratio is at least as big or bigger in favor of the alternative hypothesis. There's probably a way for me to calculate how many tables that is, but knowing that it's big enough to slow SAS down is enough.
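R is much faster here. For a 2x2 table, fisher.test works directly with the hypergeometric distribution of a single cell given the margins, which is cheap in small and large samples alike. Monte Carlo simulation is only used for tables larger than 2x2, and only when simulate.p.value = TRUE.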
tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)
pc <- proc.time()
fisher.test(tab)
proc.time()-pc
gives us
> tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)
> pc <- proc.time()
> fisher.test(tab)
Fisher's Exact Test for Count Data
data: tab
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
9.240311 13.606906
sample estimates:
odds ratio
11.16046
> proc.time()-pc
user system elapsed
0.08 0.00 0.08
>
A fraction of a second.
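To see why a 2x2 table need not be expensive, here is a minimal Python sketch of the two-sided Fisher p-value computed from the hypergeometric distribution in log space. It is an illustration under my own assumptions, not what SAS or R run internally: the function names are made up, and the two-sided rule (sum every table no more likely than the observed one) is the usual convention.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_hypergeom_pmf(a, row1, col1, n):
    """Log-probability that the top-left cell equals a, given fixed margins."""
    return (log_choose(col1, a) + log_choose(n - col1, row1 - a)
            - log_choose(n, row1))

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    log_obs = log_hypergeom_pmf(a, row1, col1, n)
    total = 0.0
    # With fixed margins the table is determined by its top-left cell,
    # so there are only hi - lo + 1 tables to visit.
    for k in range(lo, hi + 1):
        lp = log_hypergeom_pmf(k, row1, col1, n)
        if lp <= log_obs + 1e-7:  # small tolerance for floating-point ties
            total += math.exp(lp)
    return min(1.0, total)

p_small = fisher_exact_2x2(3, 1, 1, 3)                # classic small table
p_big = fisher_exact_2x2(4083, 111, 376494, 114231)   # table from the question
print(p_small, p_big)
```

For the table in the question the loop visits only 4,195 candidate tables, because the smaller row total is 4,194.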
That said, the smart statistician would realize that, in tables such as yours, the normal approximation to the log odds ratio is fairly good, and so the Pearson chi-square test should give very similar results.
People claim two very different advantages to the Fisher's exact test: some say it's good in small sample sizes. Others say it's good when cell counts are very small in specific margins of the table. The way that I've come to understand it is that Fisher's exact test is a nice alternative to the Chi Square test when bootstrapped datasets are somewhat likely to generate tables with infinite odds ratios. Visually you can imagine that the normal approximation to the log odds ratio is breaking down.
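As a rough check of the normal approximation on the table from the question, here is a small Python sketch. It assumes the standard Wald interval on the log odds ratio and the Pearson statistic without continuity correction; the function names are my own.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.959964):
    """Sample odds ratio with a Wald interval built on the log scale."""
    log_or = math.log(a * d / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

def pearson_chi2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table (1 df, no correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

or_hat, lower, upper = odds_ratio_wald(4083, 111, 376494, 114231)
chi2 = pearson_chi2(4083, 111, 376494, 114231)
p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
print(or_hat, lower, upper, chi2, p_value)
```

The odds ratio comes out at about 11.16 with an interval of roughly 9.24 to 13.48, close to the exact interval of 9.24 to 13.61 that R reported.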
Related
I've noticed strange behavior with SAS proc mixed: Models with a modestly large number of rows, which take only seconds to converge, nevertheless take upwards of half an hour to finish running if I ask for output of predicted values & residuals. The thing that seems perverse is that when I run the analogous models in R using nlme::lme(), I get the predicted values & residuals as a side effect and the models complete in seconds. That makes me think this is not merely a memory limitation of my machine.
Here's some sample code. I can't provide the real data for which I'm seeing this issue, but the structure is 1-5 rows per subject, ~1500 unique subjects, ~5,000 outcome-covariate sets total.
In SAS:
proc mixed data=testdata noclprint covtest;
class subjid ed gender;
model outcome = c_age ed gender / ddfm=kr solution residual outp=testpred;
random int c_age / type=un sub=subjid;
run;
In R:
library(nlme)
lme.test <- lme(outcome ~ c_age + ed + gender, data=testdata,
random = ~c_age|factor(subjid), na.action=na.omit)
Relevant stats: Win7, SAS 9.4 (64-bit), R 3.3, nlme 3.1-131.
Can anyone help me understand the pre-model and post-model adjustments for oversampling using the offset method (preferably in Base SAS with PROC LOGISTIC and scoring) in logistic regression?
I will take an example. Considering the traditional credit scoring model for a bank, let's say we have 10000 customers with 50000 good and 2000 bad customers. Now for my logistic regression I am using all 2000 bad and a random sample of 2000 good customers. How can I adjust for this oversampling in PROC LOGISTIC using options like OFFSET, and also during scoring? Do you have any references with illustrations on this topic?
Thanks in advance for your help!
Ok here are my 2 cents.
Sometimes, the target variable is a rare event, like fraud. In this case, using logistic regression will have significant sample bias due to insufficient event data. Oversampling is a common method due to its simplicity.
However, model calibration is required when scores are used for decisions (this is your case). Nothing needs to be done if the model is only used for rank ordering (bear in mind the probabilities will be inflated, but the order stays the same).
Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (or oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect.
Suppose the true model is: ln(y/(1-y))=b0+b1*x. When using oversampling, the estimate b1′ is consistent with the true model; however, b0′ is not equal to b0.
There are generally two ways to do that:
weighted logistic regression,
simply adding offset.
I am going to explain the offset version only as per your question.
Let’s create some dummy data where the true relationship between your DV (y) and your IV (iv) is ln(y/(1-y)) = -6 + 2*iv
data dummy_data;
do j=1 to 1000;
iv=rannor(10000); *independent variable;
p=1/(1+exp(-(-6+2*iv))); * event probability;
y=ranbin(10000,1,p); * dependent variable 1/0;
drop j;
output;
end;
run;
and let’s see your event rate:
proc freq data=dummy_data;
tables y;
run;
                       Cumulative  Cumulative
y  Frequency  Percent   Frequency     Percent
---------------------------------------------
0        979    97.90         979       97.90
1         21     2.10        1000      100.00
Similar to your problem, the event rate is p=0.0210, in other words very rare.
Let’s use proc logistic to estimate the parameters:
proc logistic data=dummy_data;
model y(event="1")=iv;
run;
                         Standard        Wald
Parameter  DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept   1   -5.4337    0.4874    124.3027      <.0001
iv          1    1.8356    0.2776     43.7116      <.0001
The logistic result is quite close to the real model, although, as you already know, the usual assumptions are shaky with so few events.
Now let’s oversample the original dataset by selecting all event cases and keeping each non-event case with probability p=0.05 (1/20):
data oversampling;
set dummy_data;
if y=1 then output;
if y=0 then do;
if ranuni(10000)<1/20 then output;
end;
run;
proc freq data=oversampling;
tables y;
run;
                       Cumulative  Cumulative
y  Frequency  Percent   Frequency     Percent
---------------------------------------------
0         54    72.00          54       72.00
1         21    28.00          75      100.00
Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.
proc logistic data=oversampling;
model y(event="1")=iv;
run;
                         Standard        Wald
Parameter  DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept   1   -2.9836    0.6982     18.2622      <.0001
iv          1    2.0068    0.5139     15.2519      <.0001
As you can see, the iv estimate is still close to the real value, but your intercept has changed from -5.43 to -2.98, which is very different from our true value of -6.
Here is where the offset plays its part. The offset is the log of the ratio between the sample and population event odds; it adjusts the intercept to reflect the true distribution of events rather than the sample distribution (the oversampling dataset).
Offset = log( (0.28/(1-0.28)) * ((1-0.0210)/0.0210) ) = 2.897548
So your intercept adjustment will be intercept = -2.9836 - 2.897548 = -5.88115, which is quite close to the real value.
Or using the offset option in proc logistic:
data oversampling_with_offset;
set oversampling;
off= log((0.28/(1-0.28))*((1-0.0210)/0.0210)) ;
run;
proc logistic data=oversampling_with_offset;
model y(event="1")=iv / offset=off;
run;
                         Standard        Wald
Parameter  DF  Estimate     Error  Chi-Square  Pr > ChiSq
Intercept   1   -5.8811    0.6982     70.9582      <.0001
iv          1    2.0068    0.5138     15.2518      <.0001
off         1    1.0000         0           .           .
From here all your estimates are correctly adjusted and analysis & interpretation should be carried out as normal.
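If you want to double-check the offset arithmetic outside SAS, here is a minimal Python sketch. It only reuses the numbers above (sample event rate 0.28, population event rate 0.0210, unadjusted intercept -2.9836); the function names are my own.

```python
import math

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1 - p))

def oversampling_offset(sample_rate, population_rate):
    """Log odds of the event in the sample minus log odds in the population."""
    return logit(sample_rate) - logit(population_rate)

offset = oversampling_offset(0.28, 0.0210)
adjusted_intercept = -2.9836 - offset  # intercept from the oversampled fit
print(round(offset, 6), round(adjusted_intercept, 4))
```

This should reproduce the offset of about 2.8975 and the adjusted intercept of about -5.8811 shown in the PROC LOGISTIC output.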
Hope this helps.
This is a great explanation.
When you oversample or undersample in a rare event experiment, the intercept is impacted and not the slope. Hence, in the final output, you just need to adjust the intercept by adding an offset in proc logistic in SAS. Probabilities are impacted by oversampling but, again, ranking is not impacted, as explained above.
If your aim is to score your data into deciles, you do not need to adjust with the offset and can rank the observations based on their probabilities from the oversampled model and put them into deciles (using Proc Rank as normal). However, the actual probability scores are impacted, so you cannot use the actual probability values. The ROC curve is not impacted either.
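For the scoring side, a probability from the oversampled model can be shifted back to the population scale by subtracting the offset on the log-odds scale. Here is a minimal Python sketch, assuming the same 0.28 and 0.0210 event rates as above; the list of scores is made up for illustration.

```python
import math

def correct_probability(p_sample, offset):
    """Shift a sample-scale probability back to the population scale."""
    log_odds = math.log(p_sample / (1 - p_sample)) - offset
    return 1 / (1 + math.exp(-log_odds))

offset = math.log((0.28 / (1 - 0.28)) * ((1 - 0.0210) / 0.0210))

# Hypothetical scores from the oversampled model, in ascending order.
sample_scores = [0.05, 0.28, 0.60, 0.90]
population_scores = [correct_probability(p, offset) for p in sample_scores]
print(population_scores)
```

The correction is a monotone transformation, so the order of the scores (and therefore deciles and the ROC curve) is unchanged; only the probability values shrink. A sample-scale score of 0.28 maps back to 0.021.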
I want to select my sample in Stata 13 based on three stratum variables with 12 strata in total (size - two strata; sector - three strata; intangible intensity - two strata). The selection should be proportional without replacement.
However, I can only find disproportionate selection commands that select for instance x% of each stratum.
Can anyone help me out with this problem?
Thank you for this discussion. I think I know where my problem was.
The command "gsample" can select strata based on different variables. Therefore, I thought I had to define three different stratum variables. But the solution should be more simple.
There are 12 strata in total (the large firms with high intensity in sector 1, the small firms with high intensity in sector 1, and so on) with each firm in the sample falling in to one of the strata.
All I have to do is create a variable "strataident" with values from 1 to 12 identifying the different strata. I do this for the population dataset, so the number of firms falling into each stratum is representative of the population. The following code will give me a stratified random sample that is representative of the population.
gsample 10, percent strata (strataident) wor
This command works as well and is much easier, see the example in 1:
gsample 10, percent wor strata(size sector intensity)
The problem is that strata may "overlap", so you probably have to rebalance the sample after the initial draw.
Now the question is how this can be implemented. The final sample should represent the proportions of the population as well as possible.
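To make the idea of proportional allocation concrete outside Stata, here is a minimal Python sketch. It is not gsample: the firm records, the field layout, and the function name are all hypothetical. Each combination of size, sector, and intensity is treated as one stratum, and the same fraction is drawn from every stratum without replacement.

```python
import random
from collections import Counter, defaultdict

def proportional_stratified_sample(units, strata_key, fraction, seed=0):
    """Draw the same fraction from every stratum, without replacement."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for unit in units:
        by_stratum[strata_key(unit)].append(unit)
    sample = []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        k = round(fraction * len(members))  # rounding can shift the total slightly
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 12 strata of unequal size.
population = []
firm_id = 0
for size in ("small", "large"):
    for sector in (1, 2, 3):
        for intensity in ("low", "high"):
            for _ in range(50 + 10 * len(population) // 50):
                population.append((firm_id, size, sector, intensity))
                firm_id += 1

def stratum_of(firm):
    return firm[1:]  # (size, sector, intensity)

sample = proportional_stratified_sample(population, stratum_of, 0.10)
pop_counts = Counter(stratum_of(f) for f in population)
sample_counts = Counter(stratum_of(f) for f in sample)
print(len(population), len(sample))
```

Because every stratum contributes the same fraction, the stratum shares in the sample match the population up to rounding, which is what "proportional" means here.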
I have a SAS dataset of 60k customers with the following attributes:
1) customer number
2) X coordinate
3) Y coordinate
4) store visits
I need to calculate the average weighted distance from each customer to all the other customers in the table where each distance is weighted by the comparing customer's number of visits. As an example, the distance between Customer A & Customer B is 10. We would then weight that distance by Customer B's number of visits (2) which equals 5. This process would repeat for all other customers in the table and we would then average all of these weighted distances for each of the 60k customers.
I suppose the brute force way to do this is a Cartesian join (ie. create a 60k x 60k = 3.6 billion record table) but that will likely run out of memory or crash SAS. I have also thought of breaking this up into more manageable Cartesian joins (ie. 10 x 60K = 600k x 6000 iterations but this is likely to be quite time consuming -- maybe my only choice though). I'm guessing you guys/gals know a much better way to do this!
I appreciate all your suggestions.
Thanks for your help!
Bad news, there is no way to speed up this calculation (that I know of).
Good news is SAS won't crash or run out of memory if you do a Cartesian product. Other good news is doing this in a data step is faster than doing it in PROC SQL.
data test;
do cn=1 to 64000;
x = ceil(Ranuni(13)*100);
y = ceil(ranuni(13)*100);
visits = max(1,round(rannor(12)*3 + 8,1));
output;
end;
run;
sasfile test load;
data ave_dist(keep=cn ave_dist);
set test end=last;
dist=0;
td= 0;
total_visits=0;
do i=1 to n;
set test(rename=(cn=cn_2 x=x_2 y=y_2 visits=visits_2)) point=i nobs=n;
if cn ^= cn_2 then do;
xx = (x-x_2);
yy = (y-y_2);
total_visits = total_visits + visits_2; /*weight by the other customer's visits*/
dist = sqrt(xx*xx + yy*yy);
if dist^= 0 then
dist = 1/dist;
else
dist = 100; /*Adjust to something that makes sense to your data*/
td = visits_2*dist + td;
end;
end;
ave_dist = td / total_visits;
output;
run;
sasfile test close;
I inverted the distance calculation. You want small distances to have a higher score. I made this a true visit weighted average.
This takes about 13 minutes to run on my laptop.
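For checking the logic on a handful of rows, here is a minimal Python sketch of the same brute-force idea. It assumes each inverse distance is weighted by the other customer's visits, as the question describes, and it uses a made-up three-customer list; the function name and the zero-distance score of 100 are illustrative choices.

```python
import math

def visit_weighted_scores(customers, zero_distance_score=100.0):
    """customers: list of (x, y, visits). Returns one score per customer."""
    scores = []
    for i, (x, y, _) in enumerate(customers):
        weighted_sum = 0.0
        total_visits = 0
        for j, (x2, y2, visits2) in enumerate(customers):
            if i == j:
                continue  # skip the customer itself
            dist = math.hypot(x - x2, y - y2)
            inv = 1 / dist if dist != 0 else zero_distance_score
            weighted_sum += visits2 * inv   # weight by the other customer's visits
            total_visits += visits2
        scores.append(weighted_sum / total_visits)
    return scores

customers = [(0, 0, 2), (3, 4, 1), (6, 8, 3)]  # hypothetical (x, y, visits)
scores = visit_weighted_scores(customers)
print(scores)
```

It is still an N x N loop, so it only illustrates the calculation and does nothing for the run time; it also needs at least two customers.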
If your customer base is going to be <100k then PROC DISTANCE could be of help. Using the dataset created by @DomPazz you could run the following code and examine the results. In this case I'm only trying it out on the first 10K customers, which runs in 16 secs. Do not let that fool you into a false sense of security. When you double the number of customers, the time taken goes up by about 4 times.
(actual times: 10K - 16 secs, 20K - 47 secs, 40K - 3 mins...)
This procedure produces an NxN square matrix (where N is the number of customers in your input dataset). You could experiment and see at what point SAS runs into RAM issues (be sure to have plenty of hard drive space, at least in the order of 1.10*NxN*8 bytes).
Each cell in the matrix represents ith customer's (in rows) distance with 'j'th customer (in columns). Once you get the distance it is a simple matter of multiplying the respective distances with the customer's visits and taking the average.
proc distance data = test(obs = 10000)
OUT=test_distances(compress = binary)
METHOD= EUCLID shape = SQUARE
UNDEF=1000000
VARDEF=wdf;
var INTERVAL(x y)
;
copy cn visits;
run;
data avg_dist;
set test_distances;
array dist{*} dist:;
prod=0;
do i = 1 to dim(dist);
prod = visits*dist{i}+prod;
end;
avg_dist=prod/dim(dist);
dims=dim(dist);
drop i dist:
;
run;
proc sql;
drop table test_distances;
quit;
The type of problem you are looking to solve is generally known as a k-nearest neighbour problem. There have been decades of research in this area, and most often these are solved using special data structures such as Kd-trees for performance. Most often one is interested in answering questions such as: who are the top 10 (or K) closest customers to the customer I'm interested in? Another procedure which is very good for solving these types of problems efficiently is PROC PMBR, which supports both the kd-tree and SAS's proprietary structure called the Rd-tree. Look it up; you will only find a PDF document from the SAS Eminer 4.3 days.
The moment you are having to calculate distance between N*N items you are asking for trouble.
From reading your project description in the comments, it appears that what you need is not the distance between every customer and every other customer, but something like the distance between every customer and every store.
This will dramatically improve your query performance since the dimensionality of the problem is greatly reduced.
Let's say you have N customers and S stores; then you only need to calculate the distance between N*S points (a simple data step will do the job, as there is no need for a Cartesian product or specialised data structures).
From there you can look at, for each store in S, what proportion of the customers who shopped at that store live within 1 km, 2 km, 3 km ....
Then you can come up with answers such as 80% live within 1 km, 15% live within 2 km, etc...
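Here is a minimal Python sketch of that customer-to-store summary for a single store. The coordinates are made up and assumed to be in kilometres, and the function name is my own.

```python
import math
from collections import Counter

def distance_band_shares(customers, store, band_width=1.0):
    """Share of customers per distance band around one store.

    Band k holds customers farther than (k-1)*band_width and at most
    k*band_width away; a customer at distance 0 goes in band 1.
    """
    bands = Counter()
    for x, y in customers:
        dist = math.hypot(x - store[0], y - store[1])
        band = max(1, math.ceil(dist / band_width))
        bands[band] += 1
    n = len(customers)
    return {band: count / n for band, count in sorted(bands.items())}

# Hypothetical customers of one store located at the origin.
customers = [(0.5, 0.0), (0.0, 0.8), (1.5, 0.0), (0.0, 0.0), (2.5, 0.0)]
shares = distance_band_shares(customers, store=(0.0, 0.0))
print(shares)
```

This is one pass over the customers per store, so the total work is N*S distance calculations instead of N*N.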
I'm trying to create a simulation of drug concentration based on the dose of a drug given. I have some preliminary data and I used a random effects model to analyze the relationship between log(dose), predicting log(drug concentration), modelling subject as a random effect.
The results of that analysis are below. I want to take these results and simulate similar data in SAS, so I can look at the effect of changing doses on the resulting concentration of drug in the body. I know that when I simulate the data, I need to ensure the random slope is correlated with the random intercept, but I'm unsure exactly how to do that. Any example code would be appreciated.
Random effects:
 Formula: ~LDOS | RANDID
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev     Corr
(Intercept) 0.15915378 (Intr)
LDOS        0.01783609 0.735
Residual    0.05790635
Fixed effects:
LCMX ~ LDOS
               Value  Std.Error DF  t-value p-value
(Intercept) 3.340712 0.04319325 16 77.34339       0
LDOS        1.000386 0.01034409 11 96.71090       0
 Correlation:
     (Intr)
LDOS -0.047
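One common way to get a random slope that is correlated with the random intercept is to build both from two independent standard normals (a 2x2 Cholesky factor). Here is a minimal sketch, written in Python rather than SAS, that plugs in the estimates above. The dose levels, the number of subjects, the seed, and the use of natural logs are my assumptions, not something the output states.

```python
import math
import random

FIXED_INTERCEPT, FIXED_SLOPE = 3.340712, 1.000386
SD_INTERCEPT, SD_SLOPE, CORR = 0.15915378, 0.01783609, 0.735
SD_RESIDUAL = 0.05790635

def simulate(n_subjects, doses, seed=1):
    """Return rows of (subject, dose, ldos, lcmx, b0, b1)."""
    rng = random.Random(seed)
    rows = []
    for subject in range(n_subjects):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Cholesky construction: b1 shares z1 with b0, giving corr(b0, b1) = CORR.
        b0 = SD_INTERCEPT * z1
        b1 = SD_SLOPE * (CORR * z1 + math.sqrt(1 - CORR ** 2) * z2)
        for dose in doses:
            ldos = math.log(dose)  # assumes LDOS is the natural log of dose
            lcmx = ((FIXED_INTERCEPT + b0) + (FIXED_SLOPE + b1) * ldos
                    + rng.gauss(0, SD_RESIDUAL))
            rows.append((subject, dose, ldos, lcmx, b0, b1))
    return rows

rows = simulate(20, [10, 20, 40])  # hypothetical doses and sample size
print(rows[0])
```

Changing the list of doses then shows how the simulated log concentrations move. In SAS the same construction would use two independent standard normal draws per subject in a data step.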