SAS - creating counter variable for dataset sampling - sas

I'm new to SAS and wondering how to randomly sample a dataset.
I create a dataset work.seg, then sample from that table. I want to continue sampling until the sum of the prem column in the resampled table is greater than some amount.
In my current version of the code, I think it resets sumprem to 0 each time, so it never exceeds the threshold, and the code just keeps running.
data work.seg;
input segment $3. prem loss;
datalines;
AAA 5000 0
AAA 3000 12584
AAA 250 245
AAA 500 678
;
data work.test;
sumprem = 0;
row_i=int(ranuni(777)*n)+1;
set work.seg point=row_i nobs=n;
sumprem=sumprem+prem_i;
if sumprem>15000 then stop;
run;

Since you are using POINT= option there is no need to let the normal iteration of the data step happen. Just add a loop and an output statement. You might want to also put an upper bound on maximum number of samples.
data work.test;
do _n_=1 to 100000 until (sumprem>15000) ;
row_i=int(ranuni(777)*n)+1;
set work.seg point=row_i nobs=n;
sumprem + prem_i;
output;
end;
stop;
run;

you just need to replace the sumprem=0 to retain statement and also prem_i is unidentified, use prem variable instead
sumprem=0; /* Change this to next statement*/
retain sumprem 0;

Related

Sum a number of specific rows before and after

I want to do a sum of 250 previous rows for each row, starting from the row 250th.
X= lag1(VWRETD)+ lag2(VWRETD)+ ... +lag250(VWRETD)
X = sum ( lag1(VWRETD), lag2(VWRETD), ... ,lag250(VWRETD) )
I try to use lag function, but it does not work for too many lags.
I also want to calculate sum of 250 next rows after each row.
What you're looking for is a moving sum both forwards and backwards where the sum is missing until that 250th observation. The easiest way to do this is with PROC EXPAND.
Sample data:
data have;
do MKDate = '01JAN1993'd to '31DEC2000'd;
VWRET = rand('uniform');
output;
end;
format MKDate mmddyy10.;
run;
Code:
proc expand data=have out=want;
id MKDate;
convert VWRET = x_backwards_250 / transform=(movsum 250 trimleft 250);
convert VWRET = x_forwards_250 / transform=(reverse movsum 250 trimleft 250 reverse);
run;
Here's what the transformation operations are doing:
Creating a backwards moving sum of 250 observations, then setting the initial 250 to missing.
Reversing VWRET, creating a moving sum of 250 observations, setting the initial 250 to missing, then reversing it again. This effectively creates a forward moving sum.
The key is how to read observations from previous and post rows. As for your sum(n1, n2,...,nx) function, you can replace it with iterative summation.
This example uses multiple set skill to achieve summing a variable from 25 previous and post rows:
data test;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
set sashelp.air(keep=air rename=air=pre_air) point=i;
sum_pre=sum(sum_pre,pre_air);
end;
do j=_n_+1 to _n_+25;
set sashelp.air(keep=air rename=air=post_air) point=j;
sum_post=sum(sum_post,post_air);
end;
end;
drop pre_air post_air;
run;
Only 26th to nobs-25th rows will be calculated, where nobs stands for number of observations of the setting data sashelp.air.
Multiple set may take long time when meeting big dataset, if you want to be more effective, you can use array and DOW-loop to instead multiple set skill:
data test;
array _val_[1024]_temporary_;
if _n_=1 then do i=1 by 1 until(eof);
set sashelp.air end=eof;
_val_[i]=air;
end;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
sum_pre=sum(sum_pre,_val_[i]);
end;
do j=_n_+1 to _n_+25;
sum_post=sum(sum_post,_val_[j]);
end;
end;
drop i j;
run;
The weakness is you have to give a dimension number to array, it should be equal or great than nobs.
These skills are from a concept called "Table Look-Up", For SAS context, read "Table Look-Up by Direct Addressing: Key-Indexing -- Bitmapping -- Hashing", Paul Dorfman, SUGI 26.
You don't want use normal arithmetic with missing values becasue then the result is always a missing value. Use the SUM() function instead.
You don't need to spell out all of the lags. Just keep a normal running sum but add the wrinkle of removing the last one in by subtraction. So your equation only needs to reference the one lagged value.
Here is a simple example using running sum of 5 using SASHELP.CLASS data as an example:
%let n=5 ;
data step1;
set sashelp.class(keep=name age);
retain running_sum ;
running_sum=sum(running_sum,age,-(sum(0,lag&n.(age))));
if _n_ >= &n then want=running_sum;
run;
So the sum of the first 5 observations is 68. But for the next observation the sum goes down to 66 since the age on the 6th observation is 2 less than the age on the first observation.
To calculate the other variable sort the dataset in descending order and use the same logic to make another variable.

updating last not blank record SAS

I have the following dataset.
ID var1 var2 var3
1 100 200
1 150 300
2 120
2 100 150 200
3 200 150
3 250 300
I would like to have a new dataset with only the last not blank record for each group of variables.
id var1 var2 var3
1 150 200 300
2 100 150 200
3 250 300 150
last. select the last reord, but i need to selet the last not null record
Looks like you want the last non missing value for each non-key variable. So you can let the UPDATE statement do the work for you. Normally for update operation you apply transactions to a master dataset. But for your application you can use OBS=0 dataset option to make your current dataset work as both the master and the transactions.
data want ;
update have(obs=0) have ;
by id;
run;
Riccardo:
There are many ways to select, within each group, the last non-missing value of each column. Here are three ways. I would not say one is the best way, each approach has it's merits depending on the specific data set, coder comfort and long term maintainability.
Way 1 - UPDATE statement
Perhaps the simplest of the coding approaches goes like this:
make a data set that has one row per id and the same columns as the original data.
use the UPDATE statement to replace each like named variable with a non-missing value.
Example:
data want_base_table(label="One row per id, original columns");
set have;
by id;
if first.id;
run;
* use have as a transaction data set in the update statement;
data want_by_update;
update want_base_table have;
by id;
run;
Way 2 - DOW loop
Others will involve arrays and first. and last. flag variables of the BY group. This example shows a DOW loop that tracks the non-missing value and then uses them for the output of each ID:
data want_dow;
do until (last.id);
set have;
by id;
array myvar var1-var3 ;
array myhas has1-has3 ;
do _i = 1 to dim(myvar);
if not missing (myvar(_i)) then
myhas(_i) = myvar(_i);
end;
end;
do _i = 1 to dim(myhas);
myvar(_i) = myhas(_i);
end;
output;
drop _i has1-has3;
run;
A loop is most often called a DOW loop when there is a SET statement inside the DO; END; block and the loop termination is triggered by the last. flag variable. A similar non DOW approach (not shown) would use the implicit loop and first. to initialize the tracking array and last. for copying the values (tracked within group) into the columns for output.
Way 3 - Merging column slices
data want_by_column_slices;
merge
have (keep=id var1 where=(var1 ne .))
have (keep=id var2 where=(var2 ne .))
have (keep=id var3 where=(var3 ne .))
;
by id;
if last.id;
run;

SAS: in the DATA step, how to calculate the grand mean of a subset of observations, skipping over missing values

I'm trying to calculate the grand mean of a subset of observations (e.g., observation 20 to observation 50) in the data step. In this calculation, I also want to skip over (ignore) any missing values.
I've tried to play around with the mean function using various if … then statements, but I can't seem to fit all of it together.
Any help would be much appreciated.
For reference, here's the basic outline of my data steps:
data sas1;
infile '[file path]';
input v1 $ 1-9 v2 $ 11 v3 13-17 [redacted] RPQ 50-53 [redacted] v23 101-106;
v1=translate(v1,"0"," ");
format [redacted];
label [redacted];
run;
data gmean;
set sas1;
id=_N_;
if id = 10-40 then do;
avg = mean(RPQ);
end;
/*Here, I am trying to calculate the grand mean of the RPQ variable*/
/*but only observations 10 to 40, and skipping over missing values*/
run;
Use the automatic variable /_N_/ to id the rows. Use a sum value that is retained row to row and then divide by the number of observations at the end. Use the missing() function to determine the number of observations present and whether or not to add to the running total.
data stocks;
set sashelp.stocks;
retain sum_total n_count 0;
if 10<=_n_<=40 and not missing(open) then do;
n_count=n_count+1;
sum_total=sum_total+open;
end;
if _n_ = 40 then average=sum_total/n_count;
run;
proc print data=stocks(obs=40 firstobs=40);
var average;
run;
*check with proc means that the value is correct;
proc means data=sashelp.stocks (firstobs=10 obs=40) mean;
var open;
run;

Create a loop in SAS which filters 2 variables simultaneously

The question might be quite vague but I could not come up with a decent concise title.
I have data where there are id ,date, amountA and AmtB as my variables. The task is to pick the dates that are within 10 days of each other and then see if their amountA are within 20% and if they are then pick the one with highest amountB. I have used to this code to achieve this
id date amountA amountB
1 1/15/2014 1000 79
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/05/2014 710 80
1 2/25/2014 720 50
This is what I need
id date amountA amountB
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/25/2014 720 50
I wrote this code but the problem with this code is its not automatic and has to be done on a case to case basis.I need a way to loop it so that it automatically outputs the results.I am no pro at looping and hence am stuck.Any help is greatly appreciated
data test2;
set test1;
diff_days=abs(intck('days',first_dt,date));
if diff_days<=10 then flag=1;
else if diff_days>10 then flag=0;
run;
data test3 rem_test3;
set test2;
if flag=1 then output test3;
else output rem_test3;
run;
proc sort data=test3;
by id amountA;
run;
data all_within;
set test3;
by id amountA;
amtA_lag=lag1(amountA);
if first.id then
do;
counter=1;
flag1=1;
end;
if first.id=0 then
do;
counter+1;
diff=abs(amountA-amtA_lag);
if diff<(10/100*amountA) then flag1+1;
else flag1=0;
end;
if last.stay and flag1=counter then output all_within;
run;
If I understand the problem correctly, you want to group all records together that have (no skip of 10+ days) and (amt A w/in 20%)?
Looping isn't your problem - no explicitly coded loop is needed to do this (or at least, the way I think of it). SAS does the data step loop for you.
What you want to do is:
Identify groups. A group is the consecutive records that you want to, among them, collapse to one row. It's not perfectly clear to me how amountA has to behave here - does the whole group need to have less than a maximum difference of 10%, or a record to next record difference of < 10%, or a (current highest amtB of group) < 10% - but you can easily identify all of these rules. Use a RETAINed variable to keep track of the previous amountA, previous date, highest amountB, date associated with the highest amountB, amountA associated with highest amountB.
When you find a record that doesn't fit in the current group, output a record with the values of the previous group.
You shouldn't need two steps for this, although you can if you want to see it more easily - this may be helpful for debugging your rules. Set it so that you have a GroupNum variable, which you RETAIN, and you increment that any time you see a record that causes a new group to start.
I had trouble figuring out the rules...but here is some code that checks each record against the previous for the criteria I think you want.
Data HAVE;
input id date :mmddyy10. amountA amountB ;
format date mmddyy10.;
datalines;
1 1/15/2014 1000 79
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/05/2014 710 80
1 2/25/2014 720 50
;
Proc Sort data=HAVE;
by id date;
Run;
Data WANT(drop=Prev_:);
Set HAVE;
Prev_Date=lag(date);
Prev_amounta=lag(amounta);
Prev_amountb=lag(amountb);
If not missing(prev_date);
If date-prev_date<=10 then do;
If (amounta-prev_amounta)/amounta<=.1 then;
If amountb<prev_amountb then do;
Date=prev_date;
AmountA=prev_amounta;
AmountB=prev_amountb;
end;
end;
Else delete;
Run;
Here is a method that I think should work. The basic approach is:
Find all the pairs of sufficiently close observations
Join the pairs with themselves to get all connected ids
Reduce the groups
Join to the original data and get the desired values
data have;
input
id
date :mmddyy10.
amountA
amountB;
format date mmddyy10.;
datalines;
1 1/15/2014 1000 79
2 1/16/2014 1100 81
3 1/30/2014 700 50
4 2/05/2014 710 80
5 2/25/2014 720 50
;
run;
/* Count the observations */
%let dsid = %sysfunc(open(have));
%let nobs = %sysfunc(attrn(&dsid., nobs));
%let rc = %sysfunc(close(&dsid.));
/* Output any connected pairs */
data map;
array vals[3, &nobs.] _temporary_;
set have;
/* Put all the values in an array for comparison */
vals[1, _N_] = id;
vals[2, _N_] = date;
vals[3, _N_] = amountA;
/* Output all pairs of ids which form an acceptable pair */
do i = 1 to _N_;
if
abs(vals[2, i] - date) < 10 and
abs((vals[3, i] - amountA) / amountA) < 0.2
then do;
id2 = vals[1, i];
output;
end;
end;
keep id id2;
run;
proc sql;
/* Reduce the connections into groups */
create table groups as
select
a.id,
min(min(a.id, a.id2, b.id)) as group
from map as a
left join map as b
on a.id = b.id2
group by a.id;
/* Get the final output */
create table lookup (where = (amountB = maxB)) as
select
have.*,
groups.group,
max(have.amountB) as maxB
from have
left join groups
on have.id = groups.id
group by groups.group;
quit;
The code works for the example data. However, the group reduction is insufficient for more complicated data. Fortunately, approaches for finding all the subgraphs given a set of edges can be found here, here, here or here (using SAS/OR).

How to perform oversampling in SAS?

I have a data set with 1100 samples, target class isReturn, there are
800 isReturn='True'
300 isReturn='False'
How can I use PROC SURVEYSELECT to oversample the 300 isReturn='False' so that I will have 800 isReturn='False' to make the data set balance?
Thanks in advance.
I may not understand what you want, but if you just want to have 800 of the false folks, you could use proc surveyselect or the data step.
The data step would give you granular control. This gives you your 300 twice, plus another 200 picked randomly (possibly 1 or 0 times) from the 300 a third time.
data have;
length isReturn $5;
do _n_=1 to 800;
isReturn='True';
output;
if _n_ le 300 then do;
isReturn='False';
output;
end;
end;
run;
data want;
set have;
retain k 200 n 300;
if isReturn='True' then output;
else do;
output;
output;
if ranuni(7) le k/n then do;
output;
k+-1;
end;
n+-1;
end;
run;
You could tweak that pretty easily to get any distribution you want (you could take 500 out of '600' (double 300) for example by setting k and n to 500 and 600 and doing the if bit twice, each time decrementing n once).
You could also use proc surveyselect to do this.
proc surveyselect data=have(where=(isReturn='False')) out=want_add method=urs n=500 outhits;
run;
That would give you an extra 500 records, chosen at random with replacement; just add those back to the original dataset. You don't have as granular control but it is very easy to code.
Alternately, you could do this in one step. However, this does not guarantee you for either false or true a single record will always be represented - so this likely doesn't do exactly what you ask for; presented for completeness.
data sizes;
input isReturn :$5. _NSIZE_;
datalines;
False 800
True 800
;;;;
run;
proc sort data=have;
by isReturn;
run;
proc surveyselect data=have out=want method=urs n=sizes outhits;
strata isReturn;
run;
All of this assumes you're trying to get 100% of the original dataset plus some. If you're trying to oversample in the sense of pick False records with equal probability to True records, but you are ultimately picking a smaller sample than the total (and only picking each once, ie without replacement) then the strata statement is what you should be looking at.