Advice on faster SAS code - sas

I have a data set, call it training, in which I need to predict column Y from X. The conditional probability distribution of Y given X is stored in a separate lookup table, and it is a Table distribution in SAS terminology. X and Y both take values in {1,2,...,44}, so the probability lookup table is 44 x 44.
Suppose my probability look up table is
x p1 p2 ... p44
1 0.001 0.004 ... 0.0078
2 0.0045 ... 0.000
.. ... ...
44 0.00089 ... 0.00067
The conditional probabilities are highly skewed, and my training data has a large N.
I am looking for an efficient, fast way to match each row to its conditional probabilities and generate a prediction of y in the training data. I am considering a hash object to speed up the lookup, and perhaps a sampling method other than RAND('table', of p1-p44) to possibly make the program run faster.
Below is a mock hash program for this task.
data prediction;
array prob[44] _temporary_;
if _n_ =1 then do;
if 0 then set work.prob_lookup;
dcl hash lookup(dataset:'work.prob_lookup');
lookup.definekey('x');
lookup.definedata('p1',...,'p44'); /* Is there a wildcard method to specify variables p1 to p44? */
lookup.definedone();
end;
set work.training;
rc = lookup.find();
if rc = 0 then do;
y_predict = rand('table', of p1-p44);
output;
end;
run;
The training data table has a large N, about 2k+, but the approach should also scale to more rows when the probability lookup table is applied for prediction. The probability lookup table is 44*44, relatively small, but highly skewed.
Help with building faster code for this task is much appreciated. If there's an easy way to do it in R, that would be appreciated as well.

Not sure what HASH adds to this problem.
data prediction;
merge training prob_lookup;
by x;
y_predict=rand('table',of p1-p44);
run;
I guess it might save having to sort your training set by X? But if your X values are just integers between 1 and 44 then why not just use POINT= option?
data prediction;
set training ;
pointer=x;
if pointer in (1:44) then do;
set prob_lookup point=pointer ;
y_predict=rand('table',of p1-p44);
end;
else call missing(y_predict,of p1-p44);
run;
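On the wildcard question in the original post: a minimal sketch, assuming the lookup table contains only x and p1-p44, is to pass all:'yes' to DEFINEDATA so that every variable in the lookup data set becomes hash data. The two small input data sets below are hypothetical stand-ins (flat probabilities, random x) so that the step runs on its own.

```sas
/* Hypothetical stand-in data so the sketch runs on its own */
data work.prob_lookup;
  array p[44] p1-p44;
  do x = 1 to 44;
    do j = 1 to 44;
      p[j] = 1/44;                 /* flat probabilities, for illustration only */
    end;
    output;
  end;
  drop j;
run;

data work.training;
  call streaminit(1);
  do id = 1 to 2000;
    x = ceil(44 * rand('uniform'));  /* integer between 1 and 44 */
    output;
  end;
run;

data prediction;
  if 0 then set work.prob_lookup;    /* defines x and p1-p44 in the PDV */
  if _n_ = 1 then do;
    declare hash lookup(dataset: 'work.prob_lookup');
    lookup.definekey('x');
    lookup.definedata(all: 'yes');   /* every variable of prob_lookup becomes data */
    lookup.definedone();
  end;
  set work.training;
  if lookup.find() = 0 then y_predict = rand('table', of p1-p44);
  else call missing(y_predict, of p1-p44);
  drop p1-p44;
run;
```

With a 44-row lookup table, the hash, MERGE and POINT= versions will probably all be fast at this size, so it may be worth timing them on the real data before committing to one.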

Related

Offsetting Oversampling in SAS for rare events in Logistic Regression

Can anyone help me understand the pre-model and post-model adjustments for oversampling using the offset method (preferably in Base SAS, in PROC LOGISTIC and scoring) in logistic regression?
I will take an example. Considering the traditional credit scoring model for a bank, let's say we have 10000 customers with 50000 good and 2000 bad customers. Now for my logistic regression I am using all 2000 bad and a random sample of 2000 good customers. How can I adjust for this oversampling in PROC LOGISTIC using options like OFFSET, and also during scoring? Do you have any references with illustrations on this topic?
Thanks in advance for your help!
OK, here are my 2 cents.
Sometimes the target variable is a rare event, like fraud. In this case, logistic regression can suffer from significant sample bias because there is too little event data. Oversampling is a common remedy due to its simplicity.
However, model calibration is required when the scores are used for decisions (this is your case). Nothing needs to be done if the model is only used for rank ordering (bear in mind the probabilities will be inflated, but the order stays the same).
Parameter and odds ratio estimates of the covariates (and their confidence limits) are unaffected by this type of sampling (or oversampling), so no weighting is needed. However, the intercept estimate is affected by the sampling, so any computation that is based on the full set of parameter estimates is incorrect.
Suppose the true model is ln(y/(1-y)) = b0 + b1*x. With oversampling, the estimate b1′ is consistent with the true b1, but b0′ is not equal to b0.
There are generally two ways to correct for this:
weighted logistic regression,
simply adding an offset.
I am going to explain the offset version only, as per your question.
Let’s create some dummy data where the true relationship between your DV (y) and your IV (iv) is ln(y/(1-y)) = -6 + 2*iv:
data dummy_data;
do j=1 to 1000;
iv=rannor(10000); *independent variable;
p=1/(1+exp(-(-6+2*iv))); * event probability;
y=ranbin(10000,1,p); * dependent variable 1/0;
drop j;
output;
end;
run;
and let’s see your event rate:
proc freq data=dummy_data;
tables y;
run;
                              Cumulative    Cumulative
y    Frequency     Percent     Frequency       Percent
------------------------------------------------------
0          979       97.90           979         97.90
1           21        2.10          1000        100.00
Similar to your problem, the event rate is p=0.0210, in other words very rare.
Let’s use proc logistic to estimate the parameters:
proc logistic data=dummy_data;
model y(event="1")=iv;
run;
                            Standard        Wald
Parameter   DF   Estimate      Error  Chi-Square   Pr > ChiSq
Intercept    1    -5.4337     0.4874    124.3027       <.0001
iv           1     1.8356     0.2776     43.7116       <.0001
The logistic result is quite close to the real model; however, as you already know, the basic assumption will not hold.
Now let’s oversample the original dataset by selecting all event cases and a random sample of non-event cases, each kept with probability p=0.05 (1/20):
data oversampling;
set dummy_data;
if y=1 then output;
if y=0 then do;
if ranuni(10000)<1/20 then output;
end;
run;
proc freq data=oversampling;
tables y;
run;
                              Cumulative    Cumulative
y    Frequency     Percent     Frequency       Percent
------------------------------------------------------
0           54       72.00            54         72.00
1           21       28.00            75        100.00
Your event rate has jumped (magically) from 2.1% to 28%. Let’s run proc logistic again.
proc logistic data=oversampling;
model y(event="1")=iv;
run;
                            Standard        Wald
Parameter   DF   Estimate      Error  Chi-Square   Pr > ChiSq
Intercept    1    -2.9836     0.6982     18.2622       <.0001
iv           1     2.0068     0.5139     15.2519       <.0001
As you can see, the iv estimate is still close to the real value, but your intercept has changed from -5.43 to -2.98, which is very different from our true value of -6.
Here is where the offset plays its part. The offset is the log of the ratio between the sample event odds and the known population event odds; it adjusts the intercept to the true distribution of events rather than the sample distribution (the oversampling dataset).
Offset = log((0.28/(1-0.28)) * ((1-0.0210)/0.0210)) = 2.897548
So your adjusted intercept will be -2.9836 - 2.897548 = -5.88115, which is quite close to the real value.
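Written out as a worked equation, the intercept correction looks as follows. The symbols are labels chosen here for illustration, not notation from the original post: \rho_1 is the event rate in the oversampled data (0.28), \pi_1 is the event rate in the population (0.0210), and b_0' is the intercept estimated on the oversampled data.

```latex
\begin{aligned}
\text{offset} &= \ln\!\left(\frac{\rho_1}{1-\rho_1}\cdot\frac{1-\pi_1}{\pi_1}\right)
               = \ln\!\left(\frac{0.28}{0.72}\cdot\frac{0.979}{0.021}\right)
               \approx 2.8975, \\
b_0 &= b_0' - \text{offset} = -2.9836 - 2.8975 \approx -5.8811 .
\end{aligned}
```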
Or using the offset option in proc logistic:
data oversampling_with_offset;
set oversampling;
off= log((0.28/(1-0.28))*((1-0.0210)/0.0210)) ;
run;
proc logistic data=oversampling_with_offset;
model y(event="1")=iv / offset=off;
run;
                            Standard        Wald
Parameter   DF   Estimate      Error  Chi-Square   Pr > ChiSq
Intercept    1    -5.8811     0.6982     70.9582       <.0001
iv           1     2.0068     0.5138     15.2518       <.0001
off          1     1.0000          0       .            .
From here all your estimates are correctly adjusted, and analysis and interpretation can be carried out as normal.
Hope it helps.
This is a great explanation.
When you oversample or undersample in a rare-event setting, the intercept is affected but not the slope. Hence, in the final output, you just need to adjust the intercept by adding the offset option to the MODEL statement in PROC LOGISTIC. Probabilities are affected by oversampling but, again, ranking is not affected, as explained above.
If your aim is to score your data into deciles, you do not need the offset adjustment: you can rank the observations by their probabilities from the oversampled model and put them into deciles (using PROC RANK as normal). However, the actual probability scores are affected, so you cannot use the actual probability values. The ROC curve is not affected either.
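For the scoring side of the question, one possible sketch is to apply the same correction to probabilities that were scored with the unadjusted oversampled model: move to the logit scale, subtract the offset, and convert back. The macro variable names and the p_sample values below are hypothetical; the two event rates are the ones from the example above.

```sas
%let rho1 = 0.28;     /* event rate in the oversampled data */
%let pi1  = 0.0210;   /* event rate in the population       */

data scored_adjusted;
  /* same offset as in the worked example, about 2.8975 */
  offset = log((&rho1/(1-&rho1)) * ((1-&pi1)/&pi1));
  /* hypothetical probabilities scored from the oversampled model */
  do p_sample = 0.05, 0.28, 0.50, 0.90;
    logit_adj  = log(p_sample/(1-p_sample)) - offset;  /* shift the intercept */
    p_adjusted = 1/(1+exp(-logit_adj));                /* back to a probability */
    output;
  end;
run;

proc print data=scored_adjusted;
  var p_sample p_adjusted;
run;
```

A quick sanity check on the design: a sample-scale probability equal to the oversampled event rate (0.28) maps back to the population event rate (0.021), and the ordering of the scores is unchanged.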

Is sorting more favorable (efficient) in if-else statement?

Assume two functions fun1, fun2 have been defined to carry out some calculation given input x.
The structure of the dataset have is:
Day Group x
01Jul14 A 1.5
02Jul14 B 2.7
I want to do something like this:
data want;
set have;
if Group = 'A' then y = fun1(x);
if Group = 'B' then y = fun2(x);
run;
Is it better to do proc sort data=have;by Group;run; first then move on to the data step? Or it doesn't matter because each time it just picks one observation and determines which if statement it falls into?
So long as you are not doing anything to alter the normal input of observations - such as using random access (point=), building a hash table, using a by statement, etc. - sorting will have no impact: you read each row regardless of the if statement, check both lines, execute one of them. Nothing different occurs sorted or unsorted.
This is easy to test. Write something like this:
%put Before Unsorted Time: %sysfunc(time(),time8.);
***your datastep here***;
%put After Unsorted Time: %sysfunc(time(),time8.);
proc sort data=your_dataset;
by Group;
run;
%put Before Sorted Time: %sysfunc(time(),time8.);
***your datastep here***;
%put After Sorted Time: %sysfunc(time(),time8.);
Or just run your datasteps and look at the execution time!
You may be confusing this with sorting your if statements (ie, changing the order of them in the code). That could have an impact, if your data is skewed and you use else. That's because SAS won't have to evaluate further downstream conditionals. It's not very common for this to have any sort of impact - it only matters when you have extremely skewed data, large numbers of observations, and certain other conditions based on your code - so I wouldn't program for it.
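To illustrate the point about else, here is a small sketch using the question's layout. Since fun1 and fun2 are not defined anywhere in the post, x*2 and x**2 are used as hypothetical stand-ins. With else, a row that matches the first condition never evaluates the second one.

```sas
data have;
  input Day :date7. Group $ x;
  format Day date7.;
  datalines;
01Jul14 A 1.5
02Jul14 B 2.7
;
run;

data want;
  set have;
  /* x*2 and x**2 are stand-ins for the undefined fun1(x) and fun2(x) */
  if Group = 'A' then y = x*2;
  else if Group = 'B' then y = x**2;   /* skipped whenever Group = 'A' */
run;
```

If one group is far more common than the others, putting its test first means most rows stop after a single comparison, which is the only situation where the ordering is likely to be measurable.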

Holding Sampled Macro Variable Constant

Hopefully a simple answer. I'm doing a simulation study, where I need to sample a random number of individuals, N, from a uniform distribution, U(25,200), at each of a thousand or so replications. Code for one replication is shown below:
%LET U = RAND("UNIFORM");
%LET N = ROUND(25 + (200 - 25)*&U.);
I created both of these macro variables outside of a DATA step because I need to call the N variable repeatedly in subsequent DATA steps and DO loops in both SAS and IML.
The problem is that every time I call N within a replication, it re-samples U, which necessarily modifies N. Thus, N is not held constant within a replication. This issue is shown in the code below, where I first create N as a variable (that is constant across individuals) and sample predictor values for X for each individual using a DO loop. Note that the value in N is not the same as the total number of individuals, which is also a problem.
DATA ID;
N = &N.;
DO PersonID = 1 TO &N.;
X = RAND("NORMAL",0,1); OUTPUT;
END;
RUN;
I'm guessing that what I need to do is to somehow hold U constant throughout the entirety of one replication, and then allow it to be re-sampled for replication 2, and so on. By holding U constant, N will necessarily be held constant.
Is there a way to do this using macro variables?
&N does not store a value. &N stores the code "ROUND(...(RAND..." etc. You're misusing macro variables, here: while you could store a number in &N you aren't doing so; you have to use %sysfunc, and either way it's not really the right answer here.
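For reference, a minimal sketch of the %sysfunc route mentioned above: it evaluates the functions once, at the time of the %LET, so the macro variables hold numbers rather than code. Note that the distribution name is written without quotes inside %sysfunc.

```sas
/* U is drawn once here and stored as text such as 0.3141... */
%let U = %sysfunc(rand(uniform));
/* %sysevalf does the floating-point arithmetic, ROUND gives an integer */
%let N = %sysfunc(round(%sysevalf(25 + (200 - 25)*&U)));

%put U=&U N=&N;
%put N is still &N;   /* same value: nothing is re-sampled on reference */
```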
First, if you're repeatedly sampling replicates, look at the paper "Don't be Loopy", which has some applications here. Also consider Rick Wicklin's paper, Sampling with Replacement; the book he references in it ("Simulating Data in SAS") is quite good as well. Running your process on a one-sample-one-execution model is the slow and difficult way to work. Do all the replicates at once and process them all at once; IML and SAS are both happy to do that for you. Your uniform random sample size is a bit more difficult to work with, but it's not insurmountable.
If you must do it the way you're doing it, I would have the data step create the macro variable, if there's a reason to do that. At the end of the sample, you can use call symputx to put out the value of N, i.e.:
%let iter=7; *we happen to be on the seventh iteration of your master macro;
DATA ID;
CALL STREAMINIT(&iter.);
U = RAND("UNIFORM");
N = ROUND(25 + (200 - 25)*U);
DO PersonID = 1 TO N;
X = RAND("NORMAL",0,1);
OUTPUT;
END;
CALL SYMPUTX('N',N);
CALL SYMPUTX('U',U);
RUN;
But again, a one-data-step model is probably your most efficient model.
I'm not sure how to do it in the macro world, but this is how you could convert your code to a data step to accomplish the same thing.
The key is setting the random number stream initialization value, using CALL STREAMINIT.
Data _null_;
call streaminit(35);
u=rand('uniform');
call symputx('U', u);
call symputx('N', ROUND(25 + (200 - 25)*U));
run;
%put &n;
%put &u;
As Joe points out, the efficient way to perform this simulation is to generate all 1000 samples in a single data step, as follows:
data AllSamples;
call streaminit(123);
do SampleID = 1 to 1000;
N = ROUND(25 + (200 - 25)*RAND("UNIFORM"));
/* simulate sample of size N HERE */
do PersonID = 1 to N;
X = RAND("NORMAL",0,1);
OUTPUT;
end;
end;
run;
This ensures independence of the random number streams, and it takes a fraction of a second to produce the 1000 samples. You can then use a BY statement to analyze the sampling distributions of the statistics on each sample. For example, the following call to PROC MEANS outputs the sample size, sample mean, and sample standard deviation for each of the 1000 samples:
proc means data=AllSamples noprint;
by SampleID;
var X;
output out=OutStats n=SampleN mean=SampleMean std=SampleStd;
run;
proc print data=OutStats(obs=5);
var SampleID SampleN SampleMean SampleStd;
run;
For more details about why the BY-group approach is more efficient (total time= less than 1 second!) see the article "Simulation in SAS: The slow way or the BY way."

Unknown Errors with Proc Transpose

I'm trying to use proc transpose to reshape a dataset of the form:
ID_Variable Target_Variable String_Variable_1 ... String_Variable_100
1 0 The End
2 0 Don't Stop
to the form:
ID_Variable Target_Variable String_Variable
1 0 The
. . .
. . .
1 0 End
2 0 Don't
. . .
. . .
2 0 Stop
However, when I run the code:
proc transpose data=input_data out=output_data;
by ID_Variable Target_Variable;
var String_Variable_1-String_Variable_100;
run;
The change in file size from input to output balloons from 33.6GB to over 14TB, and instead of the output described above we have that output with many additional completely null string variables (41 of them). There are no other columns on the input dataset so I'm unsure why the resulting output occurs. I already have a work around using macros to create my own proxy transposing procedure, but any information around why the situation above is being encountered would be extremely appreciated.
In addition to the suggestion of compression (which is nearly always a good one when dealing with even medium sized datasets!), I'll make a suggestion for a simple solution without PROC TRANSPOSE, and hazard a few guesses as to what's going on.
First off, wide-to-narrow transpose is usually just as easy in a data step, and sometimes can be faster (not always). You don't need a macro to do it, unless you really like typing ampersands and percent signs, in which case feel free.
data want;
set have;
array transvars string_Variable_1-string_Variable_100;
do _t = 1 to dim(transvars);
string_variable = transvars[_t];
if not missing(String_variable) then output; *unless you want the missing ones;
end;
keep id_variable target_variable string_Variable;
run;
Nice short code, and if you want you can throw in a call to vname to get the name of the transposed variable (or not). PROC TRANSPOSE is shorter, but this is short enough that I often just use it instead.
Second, my guess. 41 extra string variables tells me that you very likely have some duplicates by your BY group. If PROC TRANSPOSE sees duplicates, it will create that many columns. For EVERY ROW, since that's how columns work. It will look like they're empty, and who knows, maybe they are empty - but SAS still transposes empty things if it sees them.
To verify this, run a PROC SORT NODUPKEY before the transpose. If that doesn't delete at least 40 rows (maybe blank rows - if this data originated from excel or something I wouldn't be shocked to learn you had 41 blank rows at the end) I'll be surprised. If it doesn't fix it, and you don't like the datastep solution, then you'll need to provide a reproducible example (ie, provide some data that has a similar expansion of variables).
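A small sketch of that check, on hypothetical data with one duplicated BY group: the DUPOUT= option writes the rows that NODUPKEY would delete to a separate data set, so you can inspect them, and the transpose of the unsorted-out data shows the extra column that the duplicate creates.

```sas
/* Hypothetical data: ID 2 appears twice with the same BY values */
data input_data;
  input ID_Variable Target_Variable String_Variable_1 $ String_Variable_2 $;
  datalines;
1 0 The End
2 0 Don't Stop
2 0 Keep Going
;
run;

/* Rows removed by NODUPKEY are written to dups for inspection */
proc sort data=input_data out=deduped nodupkey dupout=dups;
  by ID_Variable Target_Variable;
run;

/* With the duplicate left in, the output gains COL2 on every row */
proc transpose data=input_data out=output_data;
  by ID_Variable Target_Variable;
  var String_Variable_1 String_Variable_2;
run;
```

If dups comes back empty on the real data, duplicate BY groups are probably not the cause and the other explanations are worth checking.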
Without seeing a working example, it's hard to say exactly what's going on here with regards to the extra variables generated by proc transpose.
However, I can see three things that might be contributing towards the increased file size after transposing:
If you have option compress = no; set, proc transpose creates an uncompressed dataset by default. Also, if some of your character variables are different lengths, they will all be transposed into one variable with the longest length of any of them, further increasing the file size if compression is disabled in the output dataset.
I suspect that some of the increase in file size may be coming from the automatic _NAME_ column generated by proc transpose, which contains an extra ~100 * max_var_name_length bytes for every ID-target combination in the input dataset.
If you are using option compress = BINARY; (i.e. compressing all output datasets that way by default), the SAS compression algorithm may be less effective after transposing. This is because SAS only compresses one record at a time, and this type of compression is much less effective with shorter records. There isn't much you can do about this, unfortunately.
Here's an example of how you can avoid the first two of these potential issues.
/*Start with a compressed dataset*/
data have(compress = binary);
length String_variable_1 $ 10 String_variable_2 $20; /*These are transposed into 1 var with length 20*/
input ID_Variable Target_Variable String_Variable_1 $ String_Variable_2 $;
cards;
1 0 The End
2 0 Don't Stop
;
run;
/*By default, proc transpose creates an uncompressed output dataset*/
proc transpose data = have out = want_default prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;
/*Transposing with compression enabled and without the _NAME_ column*/
proc transpose data = have out = want(drop = _NAME_ compress = binary) prefix = string_variable;
by ID_variable Target_variable;
var String_Variable_1 String_Variable_2;
run;

Compute growth rate, improvements over PROC EXPAND

I have a SAS dataset, sorted, which has two columns: PERIOD and MYMETRIC
For each row, I want to compute the growth rate of the 4 periods preceding, by using a linear regression. So the formula is basically
GROWTH RATE = Cov([MYMETRIC_lag_4,MYMETRIC_lag_3, MYMETRIC_lag_2, MYMETRIC_lag_1],[1,2,3,4])/Var([1,2,3,4])
I can do this in SAS through a proc expand to compute the lags, then a data step to compute the growth rate. I was wondering if there was a shorter way to do this? Especially if suddenly, I choose to include 8 points and not 4, I want to minimize the rework.
You can use a data step entirely. This assumes you're asking for the four previous rows. I'm not sure what [1,2,3,4] means, though, so you'll have to fill in exactly what that means in the growth rate.
%let numlags=4;
data want;
set have;
array lags[&numlags] _temporary_; *temporary arrays are retained!;
growth_rate = cov(of lags[*])/var(of lags[*]); *placeholder only: COV is not a Base SAS data step function, so substitute your own growth rate formula here;
*move things about;
do _t = 1 to dim(lags)-1;
lags[_t]=lags[_t+1];
end;
lags[dim(lags)] = MYMETRIC;
run;
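If [1,2,3,4] in the question is read as a time index, then Cov(lags, t)/Var(t) is the least-squares slope of the lagged values on t, which can be computed directly as sum((t - tbar)*lag[t]) / sum((t - tbar)**2). The sketch below makes that assumption, keeps the same rolling temporary array, and uses a hypothetical have data set in which MYMETRIC grows by exactly 2 per period. Changing numlags to 8 is the only edit needed for an 8-point window.

```sas
/* Hypothetical sample data: MYMETRIC rises by 2 each period */
data have;
  do PERIOD = 1 to 8;
    MYMETRIC = 10 + 2*PERIOD;
    output;
  end;
run;

%let numlags = 4;

data want;
  set have;
  array lags[&numlags] _temporary_;      /* temporary arrays are retained */
  /* compute the slope only once all previous values are available */
  if n(of lags[*]) = &numlags then do;
    _tbar = (&numlags + 1) / 2;          /* mean of 1..numlags */
    _sxy = 0;
    _sxx = 0;
    do _t = 1 to &numlags;
      _sxy = _sxy + (_t - _tbar) * lags[_t];
      _sxx = _sxx + (_t - _tbar)**2;
    end;
    growth_rate = _sxy / _sxx;
  end;
  /* shift the window and append the current value for later rows */
  do _t = 1 to &numlags - 1;
    lags[_t] = lags[_t + 1];
  end;
  lags[&numlags] = MYMETRIC;
  drop _:;
run;
```

Because the sample series is exactly linear, growth_rate is missing for the first four rows and equals 2 from the fifth row onward.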