Simulating random effects / mixed models in SAS - sas

I'm trying to create a simulation of drug concentration based on the dose of drug given. I have some preliminary data, and I used a random effects model with log(dose) predicting log(drug concentration), modelling subject as a random effect.
The results of that analysis are below. I want to take these results and simulate similar data in SAS, so I can look at the effect of changing doses on the resulting concentration of drug in the body. I know that when I simulate the data, I need to ensure the random slope is correlated with the random intercept, but I'm unsure exactly how to do that. Any example code would be appreciated.
Random effects:
 Formula: ~LDOS | RANDID
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev     Corr
(Intercept) 0.15915378 (Intr)
LDOS        0.01783609 0.735
Residual    0.05790635

Fixed effects: LCMX ~ LDOS
               Value  Std.Error DF  t-value p-value
(Intercept) 3.340712 0.04319325 16 77.34339       0
LDOS        1.000386 0.01034409 11 96.71090       0
 Correlation:
     (Intr)
LDOS -0.047
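One way to induce the correlation is to draw two independent standard normals per subject and mix them through the Cholesky factor of the 2x2 random-effects covariance matrix. Below is a minimal data-step sketch using the estimates from the output above; the subject count, dose grid, and seed are assumptions for illustration only.

data sim;
    call streaminit(12345);
    /* fixed effects and variance components from the lme fit above */
    b0       = 3.340712;    /* fixed intercept */
    b1       = 1.000386;    /* fixed slope for LDOS */
    sd_int   = 0.15915378;  /* SD of random intercept */
    sd_slope = 0.01783609;  /* SD of random slope */
    rho      = 0.735;       /* intercept-slope correlation */
    sd_res   = 0.05790635;  /* residual SD */
    do randid = 1 to 100;   /* number of simulated subjects (assumed) */
        /* two independent standard normals */
        z1 = rand('normal');
        z2 = rand('normal');
        /* Cholesky mix: u0 and u1 get the target SDs and correlation */
        u0 = sd_int * z1;
        u1 = sd_slope * (rho*z1 + sqrt(1 - rho**2)*z2);
        do dose = 1, 2, 5, 10, 20;  /* hypothetical dose grid */
            ldos = log(dose);
            lcmx = (b0 + u0) + (b1 + u1)*ldos + rand('normal', 0, sd_res);
            output;
        end;
    end;
    keep randid dose ldos lcmx;
run;

As a sanity check, you can refit the model (e.g. proc mixed with random int ldos / type=un sub=randid) and confirm the simulated covariance parameters are recovered.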


Independent variable to find seasonality effect?

I'm not sure if it's right to ask this here but any help greatly appreciated. I'm working on sas forecast studio.
This is my time series dataset (quarterly data):
Date e.g. 1-Jan-80, 1-Apr-80, 1-Jul-80
DateQ e.g. 1980Q1, 1980Q2, 1980Q3
Year e.g. 1980, 1981, 1982
GDP (dependent variable) e.g. 2650.1
T e.g. 1, 2, 3
Which of these variables should I use as the independent variable in a linear regression to evaluate whether there is any seasonal effect, or should I create a new quarterly variable?
Seasonal effects should not be identified using simple linear regression on the time variable when analyzing time-series data. But, to answer your question, use date with the intnx() function to convert it to quarter.
data want;
    format quarter yyq.;
    set have;
    quarter = intnx('quarter', date, 0, 'B');  /* beginning of the quarter */
run;
Seasonal effects can be identified a number of ways:
1. Graphing it
If a time series has a seasonal effect, it will tend to be clear. Simply looking at a graph of the data will let you know whether it is seasonal at your chosen interval.
In sashelp.air, it's very clear that there is a 12-month season.
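For example, a simple series plot of sashelp.air makes the repeating annual pattern obvious (a minimal sketch):

proc sgplot data=sashelp.air;
    series x=date y=air;
run;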
2. Spectral Density Analysis
proc timeseries will give you a spectral density analysis to help identify significant seasons within the data. Peaks indicate possible cycles or seasons. You will need to filter the output to a reasonable range of periods, since the density can grow very large at long periods without representing a true season.
Forecast Studio and Time Series Studio will do this for you and can give you similar output to the below.
proc timeseries data=sashelp.air outspectra=outspectra;
    id date interval=month;
    var air;
    spectra;
run;

proc sgplot data=outspectra;
    where period between 1 and 24;
    scatter x=period y=p;
    series x=period y=p;
run;
We can see a strong indicator for a seasonality of 12. We also see some potential 3-month and 6-month cycles that could be tested within a model for significance.
3. ACF/PACF/IACF plots
Your ACF/PACF/IACF plots in Forecast Studio will also help you identify clear seasons.
The classic decaying suspension-bridge look is indicative of a seasonal effect. Note how the ACF rises approaching lag 12 and then decays again. Additionally, the significant negative spikes at lag 12 in the PACF and IACF plots are further indicators of a significant seasonal effect at 12.
Model Building and Testing
Tools like the seasonal augmented Dickey-Fuller test available in Forecast Studio can help you confirm that you've captured the seasonality and achieved stationarity after differencing.
The selection boxes in the Series view allow you to quickly add simple or seasonal differencing. Selecting (1) for simple differencing will add one simple difference, i.e.:
y = y - lag(y)
Selecting (1) for seasonal differencing will add one seasonal difference. Note that when you create a project in Forecast Studio, the season is automatically diagnosed and assumed; this assumption is best checked against the diagnostics above for the best guess at the true season. In our case, we've assumed our season is 12. This would be equivalent to:
y = y - lag12(y)
We can then use stationarity tests to ensure we've achieved stationarity. In our case, we'll add 1 simple and seasonal difference.
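In data-step terms, that combination is just the DIF functions composed; a quick sketch on sashelp.air (the variable name y_diff is illustrative):

data air_diff;
    set sashelp.air;
    y_diff = dif(dif12(air));  /* one simple + one seasonal(12) difference */
run;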
Notice how our white noise plot has improved and our spikes at 12 have decreased to non-significance. Additionally, our stationarity tests are looking good and significant - that is, there is no unit root present.
Adding Seasonal or Cyclical Effects
Your model choice will dictate how seasonal or cyclical effects are added. Differencing in an ARIMA model will take care of seasonality. Dummy variables can be used for additional cyclical effects in the ARIMA model. For example:
data want;
    set have;
    q1 = (qtr(date) = 1);  /* 1 if the date falls in Q1, else 0 */
    q2 = (qtr(date) = 2);
    q3 = (qtr(date) = 3);
run;
UCMs can take care of all of these by adding both seasonal and cyclical effects. Holt-Winters ESMs take care of trend and seasonality without requiring dummy variables. Your modeling goals and performance considerations for each type of model will dictate which model you choose.
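For instance, a UCM with trend, a 12-month season, and a cycle component might look like the sketch below; the component choices here are assumptions for illustration, not a recommendation for any particular series.

proc ucm data=sashelp.air;
    id date interval=month;
    model air;
    irregular;                   /* white-noise component */
    level;                       /* stochastic level */
    slope;                       /* stochastic slope (trend) */
    season length=12 type=trig;  /* 12-month seasonal component */
    cycle;                       /* stochastic cycle, period estimated */
    estimate;
run;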

Outputting predicted values in SAS proc mixed: Prohibitive performance issues

I've noticed strange behavior with SAS proc mixed: Models with a modestly large number of rows, which take only seconds to converge, nevertheless take upwards of half an hour to finish running if I ask for output of predicted values & residuals. The thing that seems perverse is that when I run the analogous models in R using nlme::lme(), I get the predicted values & residuals as a side effect and the models complete in seconds. That makes me think this is not merely a memory limitation of my machine.
Here's some sample code. I can't provide the real data for which I'm seeing this issue, but the structure is 1-5 rows per subject, ~1500 unique subjects, ~5,000 outcome-covariate sets total.
In SAS:
proc mixed data=testdata noclprint covtest;
    class subjid ed gender;
    model outcome = c_age ed gender / ddfm=kr solution residual outp=testpred;
    random int c_age / type=un sub=subjid;
run;
In R:
lme.test <- lme(outcome ~ c_age + ed + gender, data=testdata,
                random = ~c_age|factor(subjid), na.action=na.omit)
Relevant stats: Win7, SAS 9.4 (64-bit), R 3.3, nlme 3.1-131.

Offsetting Oversampling in SAS for rare events in Logistic Regression

Can anyone help me understand the pre-model and post-model adjustments for oversampling using the offset method (preferably in Base SAS, with proc logistic and scoring) in logistic regression?
I will take an example. Consider a traditional credit scoring model for a bank: say we have 52,000 customers, 50,000 good and 2,000 bad. For my logistic regression I am using all 2,000 bad customers and a random sample of 2,000 good customers. How can I adjust for this oversampling in proc logistic using options like offset, and also during scoring? Do you have any references with illustrations on this topic?
Thanks in advance for your help!
Ok here are my 2 cents.
Sometimes the target variable is a rare event, like fraud. In this case, a logistic regression fit will suffer from small-sample bias due to insufficient event data. Oversampling is a common method due to its simplicity.
However, model calibration is required when the scores are used for decisions (this is your case); nothing needs to be done if the model is only used for rank ordering (bear in mind the probabilities will be inflated, but the ordering stays the same).
Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (or oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect.
Suppose the true model is ln(y/(1-y)) = b0 + b1*x. Under oversampling, the estimated slope b1' is consistent with the true model; however, the estimated intercept b0' is not equal to b0.
There are generally two ways to do that:
weighted logistic regression,
simply adding offset.
I am going to explain the offset version only as per your question.
Let's create some dummy data where the true relationship between your DV (y) and your IV (iv) is ln(y/(1-y)) = -6 + 2*iv.
data dummy_data;
    do j=1 to 1000;
        iv = rannor(10000);              /* independent variable */
        p  = 1/(1 + exp(-(-6 + 2*iv)));  /* event probability */
        y  = ranbin(10000, 1, p);        /* dependent variable, 1/0 */
        drop j;
        output;
    end;
run;
and let’s see your event rate:
proc freq data=dummy_data;
    tables y;
run;

                       Cumulative  Cumulative
y  Frequency  Percent   Frequency     Percent
----------------------------------------------
0       979    97.90         979       97.90
1        21     2.10        1000      100.00
Similar to your problem, the event rate is p = 0.0210, in other words very rare.
Let's use proc logistic to estimate the parameters:
proc logistic data=dummy_data;
    model y(event="1") = iv;
run;

                        Standard       Wald
Parameter  DF  Estimate    Error  Chi-Square  Pr > ChiSq
Intercept   1   -5.4337   0.4874    124.3027      <.0001
iv          1    1.8356   0.2776     43.7116      <.0001
The logistic regression estimates are quite close to the real model, but with so few events the usual large-sample assumptions will not hold, as you already know.
Now let's oversample the original dataset by keeping all event cases and selecting non-event cases with probability 1/20:
data oversampling;
    set dummy_data;
    if y=1 then output;                       /* keep every event */
    if y=0 then do;
        if ranuni(10000) < 1/20 then output;  /* keep ~5% of non-events */
    end;
run;
proc freq data=oversampling;
    tables y;
run;

                       Cumulative  Cumulative
y  Frequency  Percent   Frequency     Percent
----------------------------------------------
0        54    72.00          54       72.00
1        21    28.00          75      100.00
Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.
proc logistic data=oversampling;
    model y(event="1") = iv;
run;

                        Standard       Wald
Parameter  DF  Estimate    Error  Chi-Square  Pr > ChiSq
Intercept   1   -2.9836   0.6982     18.2622      <.0001
iv          1    2.0068   0.5139     15.2519      <.0001
As you can see, the iv estimate is still close to the real value, but the intercept has changed from -5.43 to -2.98, which is very different from our true value of -6.
Here is where the offset plays its part. The offset is the log of the ratio of the sample odds of the event to the population odds; it adjusts the intercept to reflect the true distribution of events rather than the sample distribution (the oversampled dataset).
Offset = log((0.28/(1-0.28)) * ((1-0.0210)/0.0210)) = 2.897548
So your adjusted intercept will be -2.9836 - 2.897548 = -5.881148, which is quite close to the real value.
Or using the offset option in proc logistic:
data oversampling_with_offset;
    set oversampling;
    off = log((0.28/(1-0.28)) * ((1-0.0210)/0.0210));  /* log(sample odds / population odds) */
run;

proc logistic data=oversampling_with_offset;
    model y(event="1") = iv / offset=off;
run;
                        Standard       Wald
Parameter  DF  Estimate    Error  Chi-Square  Pr > ChiSq
Intercept   1   -5.8811   0.6982     70.9582      <.0001
iv          1    2.0068   0.5138     15.2518      <.0001
off         1    1.0000        0           .           .
From here all your estimates are correctly adjusted and analysis & interpretation should be carried out as normal.
Hope it helps.
This is a great explanation.
When you oversample or undersample for a rare event, the intercept is impacted, not the slope. Hence, in the final model you just need to adjust the intercept by adding the offset in proc logistic in SAS. Probabilities are impacted by oversampling but, again, ranking is not, as explained above.
If your aim is to score your data into deciles, you do not need the offset adjustment: you can rank the observations by their probabilities from the oversampled model and put them into deciles (using proc rank as normal). However, the actual probability values are impacted, so you cannot use them directly. The ROC curve is not impacted either.
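For scoring new data on the population scale, one approach (a sketch: new_customers is a hypothetical dataset, and it assumes the offset variable must be present in the dataset passed to the SCORE statement) is to set the offset to 0, since the fitted intercept already carries the adjustment:

data to_score;
    set new_customers;  /* hypothetical dataset containing iv */
    off = 0;            /* zero offset: use the adjusted intercept as-is */
run;

proc logistic data=oversampling_with_offset;
    model y(event="1") = iv / offset=off;
    score data=to_score out=scored;  /* corrected probabilities in P_1 */
run;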

SAS Fisher test p values for large sample sizes

I'm trying to calculate some odds ratios and significance for something that can be put into a 2x2 table. The problem is that the Fisher exact test in SAS is taking a long time.
I already have the cell counts. I could calculate a chi-square if not for the fact that some of the cell sizes are extremely small. And yet some are extremely large, with cell sizes in the hundreds of thousands.
When I try to compute these in R, I have no problem. However, when I try to compute them in SAS, it either takes way too long, or simply errors out with the message "Fisher's exact test cannot be computed with sufficient precision for this sample size."
When I create a toy example (pulling one instance from the dataset and calculating it), it does compute, but takes a long time.
data bob;
    input targ $ status $ wt;
    cards;
A c 4083
A d 111
B c 376494
B d 114231
;
run;

proc freq data=bob;
    weight wt;
    tables targ*status;
    exact fisher;
run;
What is going wrong here?
That's funny. SAS calculates the Fisher's exact test p-value the exact way: by enumerating the hypergeometric probability of every single table whose odds ratio is at least as extreme in favor of the alternative hypothesis. There's probably a way to calculate how many tables that is, but knowing that it's big enough to slow SAS down is enough.
R does not do this. R uses Monte Carlo methods, which work just as well for large sample sizes as for small ones.
tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)
pc <- proc.time()
fisher.test(tab)
proc.time()-pc
gives us
> tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)
> pc <- proc.time()
> fisher.test(tab)
Fisher's Exact Test for Count Data
data: tab
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
9.240311 13.606906
sample estimates:
odds ratio
11.16046
> proc.time()-pc
user system elapsed
0.08 0.00 0.08
>
A fraction of a second.
That said, the smart statistician would realize that in tables like yours the normal approximation to the log odds ratio is fairly good, and as such the Pearson chi-square test should give very similar results.
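For instance, dropping the exact statement and requesting the asymptotic tests and odds ratio should run instantly (a sketch reusing the toy dataset above; chisq and relrisk are standard tables options in proc freq):

proc freq data=bob;
    weight wt;
    tables targ*status / chisq relrisk;  /* Pearson chi-square + odds ratio */
run;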
People claim two very different advantages for Fisher's exact test: some say it's good in small sample sizes, others that it's good when cell counts are very small in specific margins of the table. The way I've come to understand it is that Fisher's exact test is a nice alternative to the chi-square test when bootstrapped datasets are somewhat likely to generate tables with infinite odds ratios; visually, that's when the normal approximation to the log odds ratio breaks down.

SAS sequential regression (in Quandt's log likelihood method)

I am coding in SAS Enterprise Guide 4.2.
I am trying to calculate the Quandt's log likelihood ratio. But it is not important to understand that to understand my question.
The ratio is based on sequential regressions.
Namely, regressions from 1 to t0, where 1 <= t0 <= T and T is the sample size.
Illustration:
First perform regression on the first observation
Then perform regression on the first two observations
Then perform regression on the first 3 observations
...and so on
It is also performing a "forward regression" from t0+1 to T.
Illustration:
First perform regression on the last T-1 observations
Then perform regression on the last T-2 observations
Then perform regression on the last T-3 observations
...and so on
The regression is an Ordinary Least Squares regression.
After the regression is performed, the squared residuals are summed.
So this is what I need.
For each observation t0 I want to:
do an OLS regression from 1 to t0 and sum the squared residuals
do an OLS regression from t0+1 to T and sum the squared residuals
The data consists of one group variable, one dependent variable and one independent variable.
The calculations should be performed grouped by the group variable (but that shouldn't be too difficult).
I have been able to do part of this task myself, but my solution is horribly inefficient, and since the data consists of over 1,000,000,000 observations, efficiency is very important.
I have also noticed that the procedure autoreg calculates the CUSUM statistic, which is also based on sequential regression, so I suspect this functionality could be available in SAS, but I haven't been able to find it.
And the part I am struggling with most right now is the summation.
Simple example of the summation I want to do:
Input:

col1 col2
   1    2
   2    5
   5    4
   7    6

Output:

col3
  2  = 1*2
 15  = 1*5 + 2*5
 32  = 1*4 + 2*4 + 5*4
 90  = 1*6 + 2*6 + 5*6 + 7*6
Has anyone encountered a similar problem, or does anyone have an idea of how to solve it efficiently?
All help is welcome and feel free to ask me to clarify something if it is unclear.
As far as the summation goes, the below should work (though your input dataset must be sorted by group first).
Since the summation you're asking for is basically col2 multiplied by the cumulative sum of col1 within each group, you can use a retain statement to keep track of the sum of col1, and by-group processing to reset the cumulative sum each time the data step encounters a new group.
data output;
    retain cusum;
    set input;
    by group;
    /* reset the running total at the start of each group */
    if first.group then cusum = col1;
    else cusum = cusum + col1;
    col3 = cusum * col2;  /* cumulative sum of col1 times current col2 */
    drop cusum;
run;
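Extending the same one-pass idea to the regressions themselves: the SSE of a simple OLS fit on observations 1..t0 can be computed from running sums alone, so a single sequential data step per direction suffices (run it again on the data sorted in reverse, shifted by one row, for the t0+1..T pass). This is a sketch, not part of the answer above; the variable names x, y, and group are assumptions, and for series this long you may want a numerically stabler updating scheme than raw sums.

data forward_sse;
    set have;
    by group;
    /* reset running sums at the start of each group;
       the sum statement (var + expr) retains across rows */
    if first.group then do;
        n = 0; sx = 0; sy = 0; sxx = 0; syy = 0; sxy = 0;
    end;
    n   + 1;
    sx  + x;
    sy  + y;
    sxx + x*x;
    syy + y*y;
    sxy + x*y;
    /* centered sums of squares and cross-products */
    cxx = sxx - sx*sx/n;
    cyy = syy - sy*sy/n;
    cxy = sxy - sx*sy/n;
    if n < 2 then sse = 0;                        /* a line fits one point exactly */
    else if cxx > 0 then sse = cyy - cxy*cxy/cxx; /* OLS residual sum of squares */
    else sse = cyy;                               /* degenerate x: intercept-only fit */
    keep group x y n sse;
run;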