Unable to execute my hash table correctly (SAS)

A data step that runs before the code below creates a dataset called "simulation_tracking3", which looks something like:
CDFx Allowed_Claims
.06 120
.12 13
.15 1400
I want my hash table to average the Allowed_Claims based on a randomly generated value (from 0 to 1). For example, let's call this Process A: if Px = rand('Uniform',0,1) yields .09, I want it to average the Allowed_Claims values of the two rows that bracket it, CDFx = .06 and CDFx = .12, which is (120+13)/2.
The role of the array is that it dictates how many iterations of Process A I want. The array is
Members {24} _temporary_ (5 6 8 10 12 15 20 25 30 40 50 60 70 80
90 100 125 150 175 200 250 300 400 500);
So when the loop starts, it will perform 5 iterations of Process A, thereby producing 5 averaged "allowed_claims" values. I want the sum of these five claims.
Then, the loop will continue and perform 6 iterations of Process A and produce 6 averaged "allowed_claims" values. Again, I want the sum of these 6 claims.
I want the output table to look like:
Member[i] Average_Expected_Claims
5 (sum of 5 'averaged 'claims)
6 (sum of 6 'averaged' claims)
8 (sum of 8 'averaged' claims)
The code that I currently have is below. My errors occur here:
do rc = hi_iter.first() by 0 until (hi_iter.next()_ ne 0 or CDFx gt rand_value);
rc = hi_iter.prev();
The error says, respectively:
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &,
*, **, +, -, /, <, <=, <>, =, >, ><, >=, AND, EQ, GE, GT, IN,
LE, LT, MAX, MIN, NE, NG, NL, NOTIN, OR, ^=, |, ||, ~=.
ERROR: DATA STEP Component Object failure. Aborted during the
COMPILATION phase.
data simulation_members; *simulates allowed claims for each member in member array;
call streaminit(454);
array members [24] _temporary_ (5 6 8 10 12 15 20 25 30 40 50
60 70 80 90 100 125 150 175 200 250 300 400 500); *any number of members here is fine;
if _n_ eq 1 then do; * initiliaze the hash tables;
if 0 then set simulation_tracking3; * defines the variables used;
declare hash _iter(dataset:'simulation_tracking3', ordered: 'a'); *ordered = ascending - do not need a sort first;
_iter.defineKey('CDFx'); * key is artificial, but has to exist;
_iter.defineData('CDFx','Allowed_Claims'); * data variables to retrieve;
_iter.defineDone();
declare hiter hi_iter('_iter'); * the iterator object;
end;
do _i_member = 1 to dim(members); * iterate over members array;
call missing(claims_simulated);
do _i_simul = 1 to members[_i_member]-1;
rand_value = rand('Uniform',0,1);
do rc = hi_iter.first() by 0 until (hi_iter.next()_ ne 0 or CDFx gt rand_value);
end;
ac_max = allowed_claims;
rc = hi_iter.prev();
ac_min = allowed_claims;
claims_simulated + mean(ac_max,ac_min);
put rand_value= claims_simulated=; *just for logging;
end;
putlog;
output; *drop unnecessary columns;
end;
stop;
run;
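A note on the first error: it most likely comes from the stray underscore in `hi_iter.next()_ ne 0`, which is not valid syntax, and the second error follows because the step then fails to compile. Below is a minimal sketch of how the lookup could be written once that is removed. It rests on several assumptions: the three-row `simulation_tracking3` is a stand-in built from the sample rows above, the `members` array is shortened to three entries, the output variable `n_members` is a hypothetical name, and the inner loop runs the full member count (the original runs to `members[_i_member]-1`, one fewer than the description asks for).

```sas
/* Stand-in for the real simulation_tracking3 (assumption: sample rows only) */
data simulation_tracking3;
  input CDFx Allowed_Claims;
  datalines;
.06 120
.12 13
.15 1400
;
run;

data simulation_members;
  call streaminit(454);
  array members [3] _temporary_ (5 6 8);   /* shortened for the sketch */
  if _n_ = 1 then do;
    if 0 then set simulation_tracking3;    /* defines CDFx and Allowed_Claims */
    declare hash _iter(dataset:'simulation_tracking3', ordered:'a');
    _iter.defineKey('CDFx');
    _iter.defineData('CDFx','Allowed_Claims');
    _iter.defineDone();
    declare hiter hi_iter('_iter');
  end;
  do _i_member = 1 to dim(members);
    n_members = members[_i_member];
    claims_simulated = 0;
    do _i_simul = 1 to n_members;
      rand_value = rand('Uniform');
      /* Walk up the ordered table; ac_min keeps the last row at or below rand_value */
      ac_min = .;
      rc = hi_iter.first();
      do while (rc = 0 and CDFx le rand_value);
        ac_min = Allowed_Claims;
        rc = hi_iter.next();
      end;
      /* If the walk stopped on a row, that row is the upper neighbour */
      if rc = 0 then ac_max = Allowed_Claims;
      else ac_max = .;
      /* mean() ignores a missing neighbour at either end of the table */
      claims_simulated + mean(ac_min, ac_max);
    end;
    output;   /* one row per entry of the members array */
  end;
  stop;
  keep n_members claims_simulated;
run;
```

Tracking the lower neighbour while walking forward avoids the call to `prev()`. When `rand_value` falls below the first CDFx or above the last one, only one neighbour exists and the sketch uses that single value; whether that is the desired behaviour is a modelling choice.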

Related

Average over number of variables where number of variables is dictated by separate column

I would like to create a new column whose values equal the average of values in other columns. But the number of columns I am taking the average of is dictated by a variable. My data look like this, with 'length' dictating the number of columns x1-x5 that I want to average:
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
run;
I would like to end up with the below where 'avg' is the average of the specified columns.
data want;
input ID $ length avg;
datalines;
A 5 87
B 4 156.5
C 3 558.3
D 5 39.6
;
run;
Any suggestions? Thanks! Sorry about the awful title, I did my best.
You have to do a little more work since mean(of x[1]-x[length]) is not valid syntax. Instead, save the values to a temporary array and take the mean of it, then reset it at each row. For example:
tmp1 tmp2 tmp3 tmp4 tmp5
8 234 79 36 78
8 26 589 3 .
19 892 764 . .
72 48 65 4 9
data want;
set have;
array x[*] x:;
array tmp[5] _temporary_;
/* Reset the temp array */
call missing(of tmp[*]);
/* Save each value of x to the temp array */
do i = 1 to length;
tmp[i] = x[i];
end;
/* Get the average of the non-missing values in the temp array */
avg = mean(of tmp[*]);
drop i;
run;
Use an array to average it by summing up the array for the length and then dividing by the length.
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
data want;
set have;
array x(5) x1-x5;
sum=0;
do i=1 to length;
sum + x(i);
end;
avg = sum/length;
keep id length avg;
format avg 8.2;
run;
@Reeza's solution is good, but when x contains missing values it does not always produce the desired result. It is better to use the SUM function, which ignores missing values, and to divide by the number of non-missing values. The code is also slightly simplified:
data want (drop=i s nm);
set have;
array a{*} x:;
s=0; nm=0;
do i=1 to length;
if missing(a{i}) then nm+1;
s=sum(s,a{i});
end;
avg=s/(length-nm);
run;
Rather than writing your own code to calculate means you could just calculate all of the possible means and then just use an index into an array to select the one you need.
data have;
input ID $ length x1 x2 x3 x4 x5;
datalines;
A 5 8 234 79 36 78
B 4 8 26 589 3 54
C 3 19 892 764 89 43
D 5 72 48 65 4 9
;
data want;
set have;
array means[5] ;
means[1]=x1;
means[2]=mean(x1,x2);
means[3]=mean(of x1-x3);
means[4]=mean(of x1-x4);
means[5]=mean(of x1-x5);
want = means[length];
run;

How to select a percentage of values from a column in SAS?

I have 70 databases of different sizes (same number of columns, different numbers of lines).
I need to get the highest 25% and the lowest 25% of the values in a given column, VAR1.
I have:
id VAR1
1 10
2 -5
3 -12
4 7
5 12
6 7
7 -9
8 -24
9 0
10 6
11 -18
12 22
Sorting by VAR1, I need to select the rows (all columns) containing the 3 smallest and the 3 largest (25% from each extreme), i.e.,
id VAR1
8 -24
11 -18
3 -12
7 -9
2 -5
9 0
10 6
4 7
6 7
1 10
5 12
12 22
I need to keep in the database the rows (all columns) that contain the VAR1 equal to -24, -18, -12, 10, 12 and 22.
id VAR1
8 -24
11 -18
3 -12
1 10
5 12
12 22
What I’ve been thinking:
Order column VAR1 in ascending order;
Create a numbered column from 1 to N (n=_N_) - in this case, N=12;
I do a=N*0.25 (to have the value that represents 25%);
I do b=N-a (to have the value that represents the "last" 25%).
So, I can use keep:
if N<a.... I will have the first 25% (the smallest).
if N>b.... I will have the last 25% (the largest).
I can calculate a and b.
But I’m not able to get the maximum value of N (in this case, 12).
I will repeat this for the 70 database, I would not like to have to enter this maximum value every time (it varies from one database to another).
I need help to "fix" the maximum value (N) without having to type it (even if it is repeated in all the lines of another "auxiliary column").
Or if there’s some better way to get those 25% from each end.
My code:
proc sort data=have; by VAR1; run;
data want; set have;
seq=_N_;
N=max(seq); *N=max. value of lines. (I stopped here and don’t know if below is right);
a=N*0.25;
b=N-b;
if N<a;
if N>b;
run;
Thank you very much!
Proc RANK computes percentiles that you can use to select the desired rows.
Example:
data have1 have2 have3 have4 have5;
do id = 1 to 100;
X = ceil(rand('normal', 0, 10));
if id < 60 then output have1;
if id < 70 then output have2;
if id < 80 then output have3;
if id < 90 then output have4;
if id < 100 then output have5;
end;
run;
proc rank data=have1 percent out=want1(where=(pct not between 25 and 75)) ;
var x;
ranks pct;
run;
proc rank data=have2 percent out=want2(where=(pct not between 25 and 75)) ;
var x;
ranks pct;
run;
proc rank data=have3 percent out=want3(where=(pct not between 25 and 75)) ;
var x;
ranks pct;
run;
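The row-counting idea from the question can also be made to work, because the `NOBS=` option on `SET` supplies the total number of rows without typing it in. The sketch below uses the 12 sample rows from the question; the names `sorted`, `n_total` and `cutoff` are placeholders, and rounding the 25% down with `floor` is an assumption about what should happen when the row count is not a multiple of 4.

```sas
data have;
  input id VAR1;
  datalines;
1 10
2 -5
3 -12
4 7
5 12
6 7
7 -9
8 -24
9 0
10 6
11 -18
12 22
;
run;

proc sort data=have out=sorted;
  by VAR1;
run;

data want;
  /* n_total is filled in from the dataset header and is not written to WANT */
  set sorted nobs=n_total;
  cutoff = floor(n_total * 0.25);          /* rows to keep at each end */
  /* keep the first CUTOFF rows and the last CUTOFF rows */
  if _n_ <= cutoff or _n_ > n_total - cutoff;
  drop cutoff;
run;
```

With the sample data this keeps ids 8, 11 and 3 at the low end and 1, 5 and 12 at the high end. Unlike a percentile filter, it always keeps the same number of rows at each end, even when values are tied at the boundary.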

SAS_Conditional Cumulative Sum

My question is about the conditional cumulative sum in SAS. I think it can be explained better by using sample. I have following dataset:
Date Value
01/01/2001 10
02/01/2001 20
03/01/2001 30
04/01/2001 15
05/01/2001 25
06/01/2001 35
07/01/2001 20
08/01/2001 45
09/01/2001 35
I want to find the cumulative sum of Value, with one condition: if the cumulative sum exceeds 70, it should be capped at 70, and the next cumulative sum should start from the excess over 70, and so on. More precisely, my new data should be:
Date Value Cumulative
01/01/2001 10 10
02/01/2001 20 30
03/01/2001 30 60
04/01/2001 15 70
05/01/2001 25 30 ( 75-70=5+25=30)
06/01/2001 35 65
07/01/2001 20 70
08/01/2001 45 60 ( 85-70=15+45=60)
09/01/2001 35 95 ( because its last value)
Many thanks in advance
Here is a solution, although there is bound to be one more elegant. It's split into two parts with if eof to satisfy the last observation condition.
data want;
set test end = eof;
if eof ^= 1 then do;
if cumulative = 70 then cumulative = extra;
Cumulative + value;
extra = cumulative - 70;
if extra > 0 then do;
cumulative = 70;
end;
end;
retain extra;
retain cumulative;
if eof = 1 then cumulative + value;
run;

SAS SymputX and Symget Function

I am trying to construct Table 2 with the SAS code below, but what I get is Table 1. I cannot figure out what I missed. Any help is much appreciated. Thank you.
&counter = 4
data new;set set1;
total = 0;
a = 1;
do i = 1 to &counter;
call symputX('a',a);
total = total + Tem_&a.;
a = symget('a')+1;
call symputX('a',a);
end;
run;
Table 1
ID Amt Tem_1 Tem_2 Tem_3 Tem_4 total
4 500 1 4 5 900 3600
5 200 50 100 200 0 0
9 50 40 0 0 0 0
10 500 70 100 250 0 0
Table 2
ID Amt Tem_1 Tem_2 Tem_3 Tem_4 total
4 500 1 4 5 900 910
5 200 50 100 200 0 350
9 50 40 0 0 0 40
10 500 70 100 250 0 420
You cannot use SYMPUT and SYMGET that way, unfortunately. While you can use them to store and retrieve macro variable values, you cannot change the code sent to the compiler once execution has started.
Basically, SAS has to figure out the machine code for what it's supposed to do on every iteration of the data step loop before it looks at any data (this is called compiling). So the problem is, you can't write tem_&a. and expect to be allowed to change what &a. is during execution, because that would change what the machine code needs to do, and SAS couldn't prepare for that sufficiently.
So, in what you wrote, &a. is resolved when the program compiles, and whatever value &a. had before your data step is what tem_&a. turns into. Presumably the first time you ran this it errored (&a. does not resolve, followed by an error about & being illegal in variable names), and then eventually the call symput did its job and &a got a 4 in it at the end of the loop, and forever more your tem_&a. resolved to tem_4.
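To make the timing concrete, here is a small sketch (the variable names and values are made up for illustration). The reference `tem_&a.` is replaced by text before the step runs, so changing the macro variable during execution has no effect on it.

```sas
%let a = 2;

data _null_;
  tem_1 = 10;
  tem_2 = 20;
  call symputx('a', 1);   /* runs at execution time, too late for this step */
  x = tem_&a.;            /* was already compiled as x = tem_2 */
  put x=;                 /* writes x=20 to the log */
run;

%put &=a;                 /* the new value is only visible after the step: A=1 */
```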
The solution? Don't use macros for this. Instead, use arrays.
data new;
set set1;
total = 0;
array tem[&counter.] tem_1-tem_&counter.;
do i = 1 to &counter; *or do i = 1 to dim(tem);
total = total + Tem[i];
end;
run;
Or, of course, just directly sum them.
data new;
set set1;
total = sum(of tem_1-tem_4);
run;
If you REALLY like macro variables, you could of course do this in a macro do loop, though this is not recommended for this purpose as it's really better to stick with data step techniques. But this should work, anyway, if you run this inside a macro (this won't be valid in open code).
data new;
set set1;
total = 0;
%do i = 1 %to &counter;
total = total + Tem_&i.;
%end;
run;
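Since `%do` is not valid in open code, the step above needs a macro around it. A minimal sketch, where the macro name `sum_tem` and its parameter are placeholders:

```sas
%macro sum_tem(counter);
  %local i;
  data new;
    set set1;
    total = 0;
    /* the macro loop generates one assignment per variable before compilation */
    %do i = 1 %to &counter;
      total = total + Tem_&i.;
    %end;
  run;
%mend sum_tem;

%sum_tem(4)
```

With `counter` set to 4, the generated data step contains four plain statements, `total = total + Tem_1;` through `total = total + Tem_4;`.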

SAS: Filling the missing values by block of data

Say that I have the following database:
Min Rank Qty
2 1 100
2 2 90
2 3 80
2 4 70
5 1 110
5 2 100
5 3 90
5 4 80
5 5 70
7 1 120
7 2 110
7 3 100
7 4 90
I need to have the database with the continuous values for minutes like this:
Min Rank Qty
2 1 100
2 2 90
2 3 80
2 4 70
3 1 100
3 2 90
3 3 80
3 4 70
4 1 100
4 2 90
4 3 80
4 4 70
5 1 110
5 2 100
5 3 90
5 4 80
5 5 70
6 1 110
6 2 100
6 3 90
6 4 80
6 5 70
7 1 120
7 2 110
7 3 100
7 4 90
How can I do this in SAS? I just need to replicate the previous minute. The number of observations per minute varies...it can be 4 or 5 or more.
It is not that hard to imagine code that would do this, the problem is that it quickly starts to look messy.
If your dataset is not too large, you could consider the following approach:
/* We find all gaps. The output dataset is a mapping: which minute of data (reference_minute) we need in order to create each minute of data */
data MINUTE_MAPPING (keep=current_minute reference_minute);
set YOUR_DATA;
by min;
retain last_minute; *minute of the previous block of data;
if first.min then do;
/* Find gaps, map them to the last minute of data we have*/
if _N_ NE 1 and last_minute+1 < min then do;
do current_minute=last_minute+1 to min-1;
reference_minute=last_minute;
output;
end;
end;
/* For the available data, we map the minute to itself (this includes the very first minute)*/
reference_minute=min;
current_minute=min;
output;
*update;
last_minute=min;
end;
run;
/* Now we apply our mapping to the data */
*you must use proc sql because it is a many-to-many join, data step merge would give a different outcome;
proc sql;
create table RESULT as
select MM.current_minute as min, YD.rank, YD.qty
from MINUTE_MAPPING as MM
inner join YOUR_DATA as YD
on (MM.reference_minute=YD.min)
order by MM.current_minute, YD.rank
;
quit;
The more performant approach would involve trickery with arrays.
But I find the approach above a bit more appealing (disclaimer: at first thought); it is quicker to grasp (disclaimer again: IMHO) for someone else afterwards.
For good measure, the array approach:
data RESULT (keep=min rank qty);
set YOUR_DATA;
by min;
retain last_minute; *minute of the previous block of data;
array last_data{5} _TEMPORARY_; *assumes at most 5 ranks per minute, enlarge if needed;
if first.min then do;
if _N_ NE 1 and last_minute+1 < min then do; *gap found;
*store data of current record;
curr_min=min;
curr_rank=rank;
curr_qty=qty;
do current_minute=last_minute+1 to curr_min-1;
*produce records from array with last available data;
do iter=1 to dim(last_data);
min = current_minute;
rank = iter;
qty = last_data{iter};
if qty NE . then output; *to prevent output of 5th element where there are only 4;
end;
end;
*put back values of actual current record before proceeding;
min=curr_min;
rank=curr_rank;
qty=curr_qty;
end;
*update;
last_minute=min;
*a new minute starts, so forget the data of the previous one;
call missing(of last_data{*});
end;
*insert data for use on later missing minutes;
last_data{rank}=qty;
output; *output actual current data point;
run;
Hope it helps.
Note: I currently have no access to a SAS client, so this is untested code and might contain a couple of typos.
Unless you have an absurd number of observations, I think transposing would make this easy.
I don't have access to sas at the moment so bear with me (I can test it out tomorrow if you can't get it working).
proc transpose data=data out=data_wide prefix=obs_;
by minute;
id rank;
var qty;
run;
*sort backwards so you can use lag() to fill in the next minute;
proc sort data=data_wide;
by descending minute;
run;
data data_wide; set data_wide;
nextminute = lag(minute);
run;
proc sort data=data_wide;
by minute;
run;
*output until you get to the next minute;
data data_wide; set data_wide;
*ensure that the last observation is output;
if nextminute = . then output;
do until (minute ge nextminute);
output;
minute+1;
end;
run;
*then you probably want to reverse the transpose;
proc transpose data=data_wide(drop=nextminute)
out=data_narrow(rename=(col1=qty));
by minute;
var _numeric_;
run;
*clean up the observation number;
data data_narrow(drop=_NAME_); set data_narrow;
rank = substr(_NAME_,5)*1;
run;
Again, I can't test this now, but it should work.
Someone else may have a clever solution that makes it so you don't have to reverse-sort/lag/forward-sort. I feel like I have dealt with this before but the obvious solution for me right now is to have it sorted backwards at whatever prior sort you do (you can do the transpose with a descending sort no problem) to save you an extra sort.