I am very new to SAS. I am performing many tests at once using the BY statement (e.g. analyzing by Sex and by Strain to get separate results for each Sex x Strain combination). The results are generated as many separate tables. How can I produce a single table that collects one estimate (e.g. the p-values of the fixed effects tests) for each combination of levels of the BY variables (e.g. for each Sex x Strain combination)?
Results format wanted:
Sex Strain p-value
-------------------
F Exp 0.115
F Ctrl 0.57
M Exp 0.024
M Ctrl 0.483
My current code is:
proc mixed data=dataset;
    class Strain EXPERIMENT EXPERIMENTAL_STRAIN EXPERIMENTAL_SEX;
    model Response = Strain / ddfm=satterth;
    random EXPERIMENT;
    by EXPERIMENTAL_STRAIN EXPERIMENTAL_SEX;
run;
In JMP this could be accomplished by right-clicking on a particular table in the Results Report (e.g. the table of Fixed Effects Tests) and selecting "Make Combined Data Table". If SAS results could be exported into JMP, this could be a solution.
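One common SAS approach (a minimal sketch, not necessarily the only way) is to capture the per-BY-group fixed effects test tables with ODS OUTPUT; the BY variables are carried into the output dataset, so every combination ends up stacked in one table. The dataset name AllTests and the proc print step are illustrative assumptions.

/* Sketch: capture the "Type 3 Tests of Fixed Effects" table (ODS name Tests3)
   for every BY group into a single dataset. "AllTests" is a hypothetical name. */
ods output Tests3=AllTests;

proc mixed data=dataset;
    class Strain EXPERIMENT EXPERIMENTAL_STRAIN EXPERIMENTAL_SEX;
    model Response = Strain / ddfm=satterth;
    random EXPERIMENT;
    by EXPERIMENTAL_STRAIN EXPERIMENTAL_SEX;
run;

/* One row per BY combination and effect, with the p-value in ProbF */
proc print data=AllTests;
    var EXPERIMENTAL_STRAIN EXPERIMENTAL_SEX Effect ProbF;
run;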
Related
I've noticed strange behavior with SAS proc mixed: Models with a modestly large number of rows, which take only seconds to converge, nevertheless take upwards of half an hour to finish running if I ask for output of predicted values & residuals. The thing that seems perverse is that when I run the analogous models in R using nlme::lme(), I get the predicted values & residuals as a side effect and the models complete in seconds. That makes me think this is not merely a memory limitation of my machine.
Here's some sample code. I can't provide the real data for which I'm seeing this issue, but the structure is 1-5 rows per subject, ~1500 unique subjects, ~5,000 outcome-covariate sets total.
In SAS:
proc mixed data=testdata noclprint covtest;
    class subjid ed gender;
    model outcome = c_age ed gender / ddfm=kr solution residual outp=testpred;
    random int c_age / type=un sub=subjid;
run;
In R:
lme.test <- lme(outcome ~ c_age + ed + gender, data=testdata,
                random = ~c_age|factor(subjid), na.action=na.omit)
Relevant stats: Win7, SAS 9.4 (64-bit), R 3.3, nlme 3.1-131.
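One diagnostic worth trying (an assumption on my part, not a confirmed explanation): keep the outp= dataset but suppress all displayed output, to check whether the time is being spent rendering the large residual output through ODS rather than in the model fit itself. The RESIDUAL panel option is dropped here on purpose.

/* Sketch: write predicted values/residuals to a dataset while suppressing
   all printed and graphical output. */
ods graphics off;
ods exclude all;

proc mixed data=testdata noclprint covtest;
    class subjid ed gender;
    model outcome = c_age ed gender / ddfm=kr solution outp=testpred;
    random int c_age / type=un sub=subjid;
run;

ods exclude none;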
Can anyone help me understand the pre-model and post-model adjustments for oversampling using the offset method (preferably in Base SAS, with proc logistic and its scoring) in logistic regression?
I will take an example. Consider a traditional credit scoring model for a bank, where we have 50,000 good and 2,000 bad customers. For my logistic regression I am using all 2,000 bad customers and a random sample of 2,000 good customers. How can I adjust for this oversampling in proc logistic using options like offset, and also during scoring? Do you have any references with illustrations on this topic?
Thanks in advance for your help!
Ok here are my 2 cents.
Sometimes, the target variable is a rare event, like fraud. In this case, using logistic regression will have significant sample bias due to insufficient event data. Oversampling is a common method due to its simplicity.
However, model calibration is required when the scores are used for decisions (this is your case); nothing needs to be done if the model is only used for rank ordering (bear in mind that the probabilities will be inflated, but the ordering stays the same).
Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (or oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect.
Suppose the true model is ln(y/(1-y)) = b0 + b1*x. Under oversampling, the estimated slope b1' is consistent with the true model, but the estimated intercept b0' is not equal to b0.
There are generally two ways to handle this:
weighted logistic regression,
simply adding an offset.
I am going to explain the offset version only as per your question.
Let's create some dummy data where the true relationship between your dependent variable (y) and your independent variable (iv) is ln(y/(1-y)) = -6 + 2*iv:
data dummy_data;
    do j=1 to 1000;
        iv=rannor(10000);          * independent variable;
        p=1/(1+exp(-(-6+2*iv)));   * true event probability;
        y=ranbin(10000,1,p);       * dependent (outcome) variable, 0/1;
        drop j;
        output;
    end;
run;
and let’s see your event rate:
proc freq data=dummy_data;
tables y;
run;
                             Cumulative    Cumulative
y    Frequency    Percent     Frequency       Percent
------------------------------------------------------
0          979      97.90           979         97.90
1           21       2.10          1000        100.00
Similar to your problem, the event rate is p = 0.0210; in other words, the event is very rare.
Let's use proc logistic to estimate the parameters:
proc logistic data=dummy_data;
model y(event="1")=iv;
run;
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1     -5.4337      0.4874      124.3027        <.0001
iv            1      1.8356      0.2776       43.7116        <.0001
The logistic estimates are quite close to the real model; however, as you already know, the basic assumptions will not hold with so few events.
Now let's oversample the original dataset by selecting all event cases and sampling the non-event cases with probability 1/20:
data oversampling;
    set dummy_data;
    if y=1 then output;            * keep all events;
    if y=0 then do;                * keep non-events with probability 1/20;
        if ranuni(10000)<1/20 then output;
    end;
run;
proc freq data=oversampling;
tables y;
run;
                             Cumulative    Cumulative
y    Frequency    Percent     Frequency       Percent
------------------------------------------------------
0           54      72.00            54         72.00
1           21      28.00            75        100.00
Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.
proc logistic data=oversampling;
model y(event="1")=iv;
run;
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1     -2.9836      0.6982       18.2622        <.0001
iv            1      2.0068      0.5139       15.2519        <.0001
As you can see, the iv estimate is still close to the real value, but the intercept has changed from -5.43 to -2.98, which is very different from our true value of -6.
Here is where the offset plays its part. The offset is the log of the ratio of the sample event odds to the population event odds, and it adjusts the intercept so that predictions reflect the true population event rate rather than the sample (oversampled) rate.
Offset = log[ (0.28/(1-0.28)) * ((1-0.0210)/0.0210) ] = 2.897548
So your adjusted intercept will be -2.9836 - 2.897548 = -5.88115, which is quite close to the real value.
Or using the offset option in proc logistic:
data oversampling_with_offset;
    set oversampling;
    off = log((0.28/(1-0.28))*((1-0.0210)/0.0210));   * log(sample odds / population odds);
run;

proc logistic data=oversampling_with_offset;
    model y(event="1")=iv / offset=off;
run;
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1     -5.8811      0.6982       70.9582        <.0001
iv            1      2.0068      0.5138       15.2518        <.0001
off           1      1.0000      0                  .             .
From here all your estimates are correctly adjusted and analysis & interpretation should be carried out as normal.
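For comparison, here is a minimal sketch of the weighted logistic regression alternative mentioned earlier, using the same population (2.1%) and sample (28%) event rates; the dataset and weight variable names are illustrative assumptions.

/* Sketch: sampling weights that re-weight the oversampled data back to the
   population event rate (events: p_pop/p_sample, non-events: (1-p_pop)/(1-p_sample)). */
data oversampling_weighted;
    set oversampling;
    if y=1 then w = 0.0210/0.28;             * weight for events;
    else        w = (1-0.0210)/(1-0.28);     * weight for non-events;
run;

proc logistic data=oversampling_weighted;
    model y(event="1")=iv;
    weight w;
run;

With sampling weights the intercept is corrected directly, but the reported standard errors should be treated with caution, which is one reason the offset approach above is often preferred.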
Hope it helps.
This is a great explanation.
When you oversample or undersample in a rare-event setting, the intercept is affected, not the slope. Hence, in the final model you just need to adjust the intercept by adding the offset statement in proc logistic in SAS. The probabilities are affected by oversampling, but, as explained above, the ranking is not.
If your aim is to score your data into deciles, you do not need the offset adjustment: you can rank the observations on the probabilities from the oversampled model and put them into deciles (using proc rank as normal). However, the actual probability values are affected, so you cannot use them directly; the ROC curve, on the other hand, is not affected. If you do need population-scale probabilities for new data, see the sketch below.
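A minimal sketch of that correction when scoring (the dataset newdata and its contents are hypothetical; the coefficients are the oversampled-model estimates shown above):

/* Sketch: apply the intercept correction by hand when scoring new cases.
   -2.9836 and 2.0068 are the oversampled-model estimates from above;
   "newdata" is a hypothetical dataset containing iv. */
data scored_new;
    set newdata;
    off   = log((0.28/(1-0.28))*((1-0.0210)/0.0210));   * same offset as above;
    xbeta = (-2.9836 - off) + 2.0068*iv;                 * intercept moved back to population scale;
    p_hat = 1/(1+exp(-xbeta));                           * population-scale probability;
run;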
I was looking for a way to compute the effect size for Friedman's test using SAS, but could not find any reference. I want to see whether there is any difference between the groups and, if so, how large it is.
Here is my code:
proc freq data=mydata;
    tables id*b*y / cmh2 scores=rank noprint;
run;
These are the results:
The FREQ Procedure
Summary Statistics for b by y
Controlling for id
Cochran-Mantel-Haenszel Statistics (Based on Rank Scores)
Statistic    Alternative Hypothesis     DF       Value      Prob
-----------------------------------------------------------------
        1    Nonzero Correlation         1    230.7145    <.0001
        2    Row Mean Scores Differ      1    230.7145    <.0001
This question is related to one posted on Cross Validated that is concerned with the general statistical formula for computing the effect size for Friedman's test. Here, I would like to find out how to get the effect size in SAS.
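One possibility (offered as a sketch, not an established SAS feature): Kendall's W is a commonly used effect size for Friedman's test and can be computed from the CMH "Row Mean Scores Differ" chi-square as W = chi-square / (N*(k-1)), where N is the number of subjects (id) and k the number of conditions (b). The N and k below are placeholders.

/* Sketch: Kendall's W from the Friedman chi-square. 230.7145 is the
   "Row Mean Scores Differ" value above; n and k are hypothetical. */
data kendalls_w;
    chisq = 230.7145;   * Friedman / CMH row mean scores chi-square;
    n     = 500;        * number of subjects (placeholder);
    k     = 2;          * number of conditions (placeholder);
    w     = chisq / (n*(k-1));
    put "Kendall's W = " w;
run;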
My goal is to fit data to a distribution with positive support (Weibull (2p), gamma (2p), Pareto (2p), lognormal (2p), exponential (1p)). As a first attempt, I used proc univariate. This is my code:
proc univariate data=fit plot outtable=table;
    var week1;
    histogram / exp gamma lognormal weibull pareto;
    inset n mean(5.3) std='Standard Deviation'(5.3)
        / pos = ne header = 'Summary Statistics';
    axis1 label=(a=90 r=0);
run;
The first thing I noticed is that no Kolmogorov-Smirnov statistic is shown for the Weibull distribution. Then I used proc severity instead:
proc severity data=fit print=all plots(histogram kernel)=all;
    loss week1;
    dist exp pareto gamma logn weibull;
run;
Now I do get the KS statistic for the Weibull distribution.
Then I compared the KS statistics produced by proc severity and proc univariate. They're different. Why? And which one should I use?
I do not have access to SAS/ETS so cannot confirm this with proc severity, but I imagine that the difference you are seeing comes down to the way the distribution parameters are fitted.
With your proc univariate code you are not requesting estimation of several of the parameters (some are in some cases set to 1 or 0 by default; see sigma and theta in the user guide). For example:
data have;
    do i = 1 to 1000;
        x = rand("weibull", 5, 5);
        output;
    end;
run;

ods graphics on;

proc univariate data = have;
    var x;
    /* Request maximum likelihood estimates of the scale and threshold parameters */
    histogram / weibull(theta = EST sigma = EST);
    /* Request maximum likelihood estimate of the scale parameter and 0 as the threshold */
    histogram / weibull;
run;
You will note that when an estimate of theta is requested, SAS also produces the KS statistic; this is due to the way that SAS computes the fit statistic, which requires known distribution parameters (full explanation here).
My guess is that you are seeing different fit statistics between the two procedures because either they are returning slightly different fits, or they use different calculations when estimating the fit statistics. If you are interested, you can investigate how they perform their parameter estimation in the user guides (proc severity and proc univariate). If you want to investigate further, you could force the distribution parameters to match in both procedures and then compare the fit statistics to see how far they differ, as in the sketch below.
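For example (a sketch only; the parameter values are placeholders standing in for whatever proc severity reports), fixing the Weibull parameters in proc univariate makes it evaluate the goodness of fit of the same fully specified distribution:

/* Sketch: evaluate a fully specified Weibull fit in proc univariate by fixing
   the parameters (placeholder values) instead of estimating them. */
proc univariate data = have;
    var x;
    histogram / weibull(theta = 0 sigma = 4.9 c = 5.1);   * hypothetical severity estimates;
run;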
I would recommend that, if possible, you use only one of the procedures, and that you select the one that best fits your needs in terms of output.
I'm trying to create a simulation of drug concentration based on the dose of a drug given. I have some preliminary data, and I used a random effects model to analyze the relationship between log(dose) and log(drug concentration), modelling subject as a random effect.
The results of that analysis are below. I want to take these results and simulate similar data in SAS, so I can look at the effect of changing doses on the resulting concentration of drug in the body. I know that when I simulate the data, I need to ensure the random slope is correlated with the random intercept, but I'm unsure exactly how to do that. Any example code would be appreciated.
Random effects:
 Formula: ~LDOS | RANDID
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev       Corr
(Intercept) 0.15915378   (Intr)
LDOS        0.01783609   0.735
Residual    0.05790635

Fixed effects: LCMX ~ LDOS
                Value  Std.Error DF  t-value p-value
(Intercept) 3.340712 0.04319325 16 77.34339       0
LDOS        1.000386 0.01034409 11 96.71090       0
 Correlation:
     (Intr)
LDOS -0.047
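A minimal sketch of one way to do this in a SAS data step (not a definitive implementation): draw each subject's random intercept and slope from a bivariate normal distribution using a Cholesky factor of the 2x2 covariance implied by the standard deviations and correlation above. The number of subjects, the dose levels, and the seed are assumptions.

/* Sketch: simulate correlated random intercepts and slopes plus residual error,
   using the estimates from the nlme output above. */
data sim;
    call streaminit(12345);                  * hypothetical seed;
    sd_int = 0.15915378;                     * SD of random intercept;
    sd_slp = 0.01783609;                     * SD of random slope (LDOS);
    rho    = 0.735;                          * intercept-slope correlation;
    sd_res = 0.05790635;                     * residual SD;
    b0     = 3.340712;                       * fixed intercept;
    b1     = 1.000386;                       * fixed slope for LDOS;
    do randid = 1 to 50;                     * hypothetical number of subjects;
        z1 = rand("normal");
        z2 = rand("normal");
        u0 = sd_int*z1;                                  * random intercept;
        u1 = sd_slp*(rho*z1 + sqrt(1-rho**2)*z2);        * random slope, correlated with u0;
        do dose = 10, 20, 40, 80;                        * hypothetical dose levels;
            ldos = log(dose);
            lcmx = (b0 + u0) + (b1 + u1)*ldos + rand("normal", 0, sd_res);
            output;
        end;
    end;
    keep randid dose ldos lcmx;
run;

Fitting proc mixed to the simulated data (with an unstructured G matrix, type=un) should recover standard deviations and a correlation close to the values plugged in above, which is a useful check of the simulation.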