Outputting predicted values in SAS proc mixed: Prohibitive performance issues - sas

I've noticed strange behavior with SAS proc mixed: Models with a modestly large number of rows, which take only seconds to converge, nevertheless take upwards of half an hour to finish running if I ask for output of predicted values & residuals. The thing that seems perverse is that when I run the analogous models in R using nlme::lme(), I get the predicted values & residuals as a side effect and the models complete in seconds. That makes me think this is not merely a memory limitation of my machine.
Here's some sample code. I can't provide the real data for which I'm seeing this issue, but the structure is 1-5 rows per subject, ~1500 unique subjects, ~5,000 outcome-covariate sets total.
In SAS:
proc mixed data=testdata noclprint covtest;
class subjid ed gender;
model outcome = c_age ed gender / ddfm=kr solution residual outp=testpred;
random int c_age / type=un sub=subjid;
run;
In R:
lme.test <- lme(outcome ~ c_age + ed + gender, data=testdata,
random = ~c_age|factor(subjid), na.action=na.omit)
Relevant stats: Win7, SAS 9.4 (64-bit), R 3.3, nlme 3.1-131.

Related

SAS Regression Returns 0 Coefficient

I am running a SAS regression using the following model:
ods output ParameterEstimates=stock_params;
proc reg data=REG_DATA;
by SYMBOL DATE;
model RETURN_SEC = market_premium;
run;
ods output close;
Where RETURN_SEC is the return of the stock per second and market_premium is the return of SPY index minus the risk free rate (the risk free rate is quite close to zero because it is at a second level).
However, I got lots of 0s (not all of them, but a significant number of) in the coefficient of market_premium. When I check the log it says:
NOTE: Model is not full rank. Least-squares solutions for the parameters are
not unique. Some statistics will be misleading. A reported DF of 0 or B
means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a
linear combination of other variables as shown.
market_premium = - 19E-12 * Intercept
This is quite weird. I checked the data and it seems fine (although lot of data contains 0 return_sec, which is normal because sometimes the return doesn't change in seconds but in minutes).
What also puzzles me is that why SAS would return 0 coefficient on market_premium when market_premium = - 19E-12 * Intercept. I mean, does SAS treat the Intercept as the only variable when it sees that market_premium is a scalar times of Intercept?

Offsetting Oversampling in SAS for rare events in Logistic Regression

Can anyone help me understand the Premodel and Postmodel adjustments for Oversampling using the offset method ( preferably in Base SAS in Proc Logistic and Scoring) in Logistic Regression .
I will take an example. Considering the traditional Credit scoring model for a bank, lets say we have 10000 customers with 50000 good and 2000 bad customers. Now for my Logistic Regression I am using all 2000 bad and random sample of 2000 good customers. How can I adjust this oversampling in Proc Logistic using options like Offset and also during scoring. Do you have any references with illustrations on this topic?
Thanks in advance for your help!
Ok here are my 2 cents.
Sometimes, the target variable is a rare event, like fraud. In this case, using logistic regression will have significant sample bias due to insufficient event data. Oversampling is a common method due to its simplicity.
However, model calibration is required when scores are used for decisions (this is your case) – however nothing need to be done if the model is only for rank ordering (bear in mind the probabilities will be inflated but order still the same).
Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (or oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect.
Suppose the true model is: ln(y/(1-y))=b0+b1*x. When using oversampling, the b1′ is consistent with the true model, however, b0′ is not equal to bo.
There are generally two ways to do that:
weighted logistic regression,
simply adding offset.
I am going to explain the offset version only as per your question.
Let’s create some dummy data where the true relationship between your DP (y) and your IV (iv) is ln(y/(1-y)) = -6+2iv
data dummy_data;
do j=1 to 1000;
iv=rannor(10000); *independent variable;
p=1/(1+exp(-(-6+2*iv))); * event probability;
y=ranbin(10000,1,p); * independent variable 1/0;
drop j;
output;
end;
run;
and let’s see your event rate:
proc freq data=dummy_data;
tables y;
run;
Cumulative Cumulative
y Frequency Percent Frequency Percent
------------------------------------------------------
0 979 97.90 979 97.90
1 21 2.10 1000 100.00
Similar to your problem the event rate is p=0.0210, in other words very rare
Let’s use poc logistic to estimate parameters
proc logistic data=dummy_data;
model y(event="1")=iv;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.4337 0.4874 124.3027 <.0001
iv 1 1.8356 0.2776 43.7116 <.0001
Logistic result is quite close to the real model however basic assumption will not hold as you already know.
Now let’s oversample the original dataset by selecting all event cases and non-event cases with p=0.2
data oversampling;
set dummy_data;
if y=1 then output;
if y=0 then do;
if ranuni(10000)<1/20 then output;
end;
run;
proc freq data=oversampling;
tables y;
run;
Cumulative Cumulative
y Frequency Percent Frequency Percent
------------------------------------------------------
0 54 72.00 54 72.00
1 21 28.00 75 100.00
Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.
proc logistic data=oversampling;
model y(event="1")=iv;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -2.9836 0.6982 18.2622 <.0001
iv 1 2.0068 0.5139 15.2519 <.0001
As you can see the iv estimate still close to the real value but your intercept has changed from -5.43 to -2.98 which is very different from our true value of -6.
Here is where the offset plays its part. The offset is the log of the ratio between known population and sample event probabilities and adjust the intercept based on the true distribution of events rather than the sample distribution (the oversampling dataset).
Offset = log(0.28)/(1-0.28)*(0.0210)/(1-0.0210) = 2.897548
So your intercept adjustment will be intercept = -2.9836-2.897548= -5.88115 which is quite close to the real value.
Or using the offset option in proc logistic:
data oversampling_with_offset;
set oversampling;
off= log((0.28/(1-0.28))*((1-0.0210)/0.0210)) ;
run;
proc logistic data=oversampling_with_offset;
model y(event="1")=iv / offset=off;
run;
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.8811 0.6982 70.9582 <.0001
iv 1 2.0068 0.5138 15.2518 <.0001
off 1 1.0000 0 . .
From here all your estimates are correctly adjusted and analysis & interpretation should be carried out as normal.
Hope its help.
This is a great explanation.
When you oversample or undersample in the rare event experiment, the intercept is impacted and not slope. Hence in the final output , you just need to adjust the intercept by adding offset statement in proc logistic in SAS. Probabilities are impacted by oversampling but again, ranking in not impacted as explained above.
If your aim is to score your data into deciles, you do not need to adjust the offset and can rank the observations based on their probabilities of the over sampled model and put them into deciles (Using Proc Rank as normal). However, the actual probability scores are impacted so you cannot use the actual probability values. ROC curve is not impacted as well.

How to compute the effect size for Friedman's test using sas?

I was looking for a way to compute the effect size for Friedman's test using sas, but could not find any reference. I wanted to see if there is any difference between the groups and what its size was.
Here is my code:
proc freq data=mydata;
tables id*b*y / cmh2 scores=rank noprint;
run;
These are the results:
The FREQ Procedure
Summary Statistics for b by y
Controlling for id
Cochran-Mantel-Haenszel Statistics (Based on Rank Scores)
Statistic Alternative Hypothesis DF Value Prob
1 Nonzero Correlation 1 230.7145 <.0001
2 Row Mean Scores Differ 1 230.7145 <.0001
This question is correlated with the one posted on Cross Validated, that is concerned with the general statistical formula to compute the effect size for Friedman's test. Here, I would like to find out how to get the effect size in sas.

SAS Fisher test p values for large sample sizes

I'm trying to calculate some odds ratios and significance forsomething that can be out into a 2x2 table. The problem is the Fisher test in Sas is taking a long time.
I already have the cell counts. I could calculate a chi square if not for the fact that done of the sample sizes are extremely small. And yet some are extremely large, with cell sizes in the hundreds of thousands.
When I try to compute these in R, I have no problem. However, when I try to compute them in Sas, it either tasks way too long, out simply errors out with the message "Fishers exact test cannot be computed with sufficient precision for this sample size."
When I create a toy example (pull one instance from the dataset, and calculate it) it does calculate, but takes a long time.
Data Bob;
Input targ $ status $ wt;
Cards;
A c 4083
A d 111
B c 376494
B d 114231
;
Run;
Proc freq data = Bob;
Weight wt;
Tables targ*status;
Exact Fisher;
Run;
What is going wrong here?
That's funny. SAS calculates the Fisher's exact test p-value the exact way, by enumerating the hypergeometric probability of every single table in which the odds ratio is at least as big or bigger in favor of the alternative hypothesis. There's probably a way for me to calculate how many tables that is, but knowing that it's big enough to slow SAS down is enough.
R does not do this. R uses Monte Carlo methods which work just as fine in small sample sizes as large sample sizes.
tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)
pc <- proc.time()
fisher.test(tab)
proc.time()-pc
gives us
> tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)
> pc <- proc.time()
> fisher.test(tab)
Fisher's Exact Test for Count Data
data: tab
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
9.240311 13.606906
sample estimates:
odds ratio
11.16046
> proc.time()-pc
user system elapsed
0.08 0.00 0.08
>
A fraction of a second.
That said, the smart statistician would realize, in tables such as yours, that the normal approximation to the log odds ratio is fairly good, and as such the Pearson Chi-square test should give approximately very similar results.
People claim two very different advantages to the Fisher's exact test: some say it's good in small sample sizes. Others say it's good when cell counts are very small in specific margins of the table. The way that I've come to understand it is that Fisher's exact test is a nice alternative to the Chi Square test when bootstrapped datasets are somewhat likely to generate tables with infinite odds ratios. Visually you can imagine that the normal approximation to the log odds ratio is breaking down.

Simulating random effects / mixed models in SAS

I'm trying to create a simulation of drug concentration based on the dose of a drug given. I have some preliminary data and I used a random effects model to analyze the relationship between log(dose), predicting log(drug concentration), modelling subject as a random effect.
The results of that analysis are below. I want to take these results and simulate similar data in SAS, so I can look at the effect of changing doses on the resulting concentration of drug in the body. I know that when I simulate the data, I need to ensure the random slope is correlated with the random intercept, but I'm unsure exactly how to do that. Any example code would be appreciated.
Random effects:
Formula: ~LDOS | RANDID
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 0.15915378 (Intr)
LDOS 0.01783609 0.735
Residual 0.05790635
Fixed effects:
LCMX ~ LDOS
Value Std.Error DF t-value p-value
(Intercept) 3.340712 0.04319325 16 77.34339 0
LDOS 1.000386 0.01034409 11 96.71090 0
Correlation:
(Intr)
LDOS -0.047