How can we iterate over the observations in a SAS dataset?
For example, I have picked out the first observation of each group using the first. flag of a variable.
Now I want to find the occurrences of a particular condition and set a value when it is satisfied.
The SAS data step has a built-in loop over observations. You don't have to do anything unless you want to. For instance, the following generates a random number for each observation:
data one;
set sashelp.class;
rannum = ranuni(0);
run;
If you want to loop over variables, then there are arrays. For example, the following initializes variables, var1 to var10, with random numbers:
data one;
array vars[1:10] var1-var10;
do i = 1 to 10;
vars[i] = ranuni(0);
end;
run;
The first. and last. flags are generated automatically when you read a (sorted) dataset with a set statement followed by a by statement. An example:
proc sort data=sashelp.class out=class;
by age;
run;
data one;
set class;
by age;
first = first.age;
last = last.age;
run;
/* check */
proc print data=one;
run;
/* on lst
Obs Name Age first last
1 Joyce 11 1 0
2 Thomas 11 0 1
3 James 12 1 0
4 Jane 12 0 0
5 John 12 0 0
6 Louise 12 0 0
7 Robert 12 0 1
8 Alice 13 1 0
...
18 William 15 0 1
19 Philip 16 1 1
*/
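To connect this back to the question (finding a condition within each group and setting a value), here is a minimal sketch. The condition `height > 60` and the variable name `found` are made up for illustration; the dataset `class` is the sorted copy created above.

```sas
/* Sketch: flag each age group in which at least one student meets a condition. */
proc sort data=sashelp.class out=class;
    by age;
run;

data flagged;
    set class;
    by age;
    retain found;                    /* keep the flag across observations  */
    if first.age then found = 0;     /* reset at the start of each group   */
    if height > 60 then found = 1;   /* hypothetical condition             */
    if last.age then output;         /* write one row per age group        */
    keep age found;
run;
```

The retain statement is what lets the flag survive from one observation to the next; without it, `found` would be reset to missing on every iteration of the data step.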
In this data, I need to take a subset in which each group contributes a certain percentage.
For example,
Obs Group Score
1 A 1
2 A 2
3 B 1
4 B 1
5 C 3
6 C 1
7 C 1
8 A 1
9 A 3
10 A 1
11 A 2
12 B 3
13 C 2
I would need to subset 10 obs.
The sample must consist of all groups, and score of 1 takes higher priority.
Each group is given certain percent.
Let's say 50% for A, 20% for B and 30% for C.
I tried using proc surveyselect but it failed: the number of alloc values is not the same as the number of strata.
proc surveyselect data=example out=test sampsize=10;
strata group score/alloc=(0.5 0.2 0.3);
run;
I don't know proc surveyselect very well, so here is a data step version.
data have;
input Obs Group$ Score;
cards;
1 A 1
2 A 2
3 B 1
4 B 1
5 C 3
6 C 1
7 C 1
8 A 1
9 A 3
10 A 1
11 A 2
12 B 3
13 C 2
;
run;
proc sort;
by Group Score;
run;
data want;
array _Dist_[3]$ _temporary_('A','B','C');
array _Upper_[3] _temporary_(5,2,3);
array _Count_[3] _temporary_;
do i = 1 to rec;
set have nobs=rec point=i;
do j = 1 to dim(_Dist_);
_Count_[j] + (Group=_Dist_[j]);
if _Count_[j] <= _Upper_[j] and Group = _Dist_[j] then output;
end;
end;
stop;
drop j;
run;
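The counting line in the step above relies on two SAS features: a comparison evaluates to 1 (true) or 0 (false), and the sum statement `variable + expression` accumulates across iterations. A tiny standalone sketch of that idiom, with made-up values:

```sas
/* Sketch: count how many values equal 'A' using a boolean in a sum statement. */
data _null_;
    n = 0;
    do g = 'A', 'B', 'A';
        n + (g = 'A');   /* adds 1 when the comparison is true, 0 otherwise */
    end;
    put n=;              /* writes the count to the log */
run;
```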
My dataset is like this
bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103
0 0 0 0 0 0 0 0
1 1 0 0 0 1 0 0
2 3 0 3 0 1 6 3
3 0 0 0 0 0 0 0
4 0 4 0 0 0 0 0
5 4 0 4 0 4 8 1
6 8 0 8 0 8 10 8
7 0 0 0 0 0 0 0
8 7 0 7 0 0 7 3
what I want is this
bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103
0 23 4 22 0 14 31 15
1 23 4 22 0 14 31 15
2 22 4 22 0 13 31 15
3 19 4 19 0 12 25 12
4 19 4 19 0 12 25 12
5 19 0 19 0 12 25 12
6 15 0 15 0 8 17 11
7 7 0 7 0 0 7 3
8 7 0 7 0 0 7 3
Here the column total is the value for the bucket 0 and bucket 1 rows. For bucket 2 of column D_201009, the value is the total minus the original value of the previous row (1). For bucket 3, it is the previous result (the lagged value) minus the original value of the previous row (3), and so on. Each result column should keep its original column name. I wrote code that does this for one column.
data test;
input bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103;
datalines;
0 0 0 0 0 0 0 0
1 1 0 0 0 1 0 0
2 3 0 3 0 1 6 3
3 0 0 0 0 0 0 0
4 0 4 0 0 0 0 0
5 4 0 4 0 4 8 1
6 8 0 8 0 8 10 8
7 0 0 0 0 0 0 0
8 7 0 7 0 0 7 3
;
run;
Saving these column names in a macro variable
proc contents data = test
out = vars(keep = varnum name)
noprint;
run;
proc sql noprint;
select distinct name
into :orderedvars2 separated by ' '
from vars
where varnum >=2
order by varnum;
quit;
Finding sum of one column only
proc sql;
select sum(D_201009) into :total from test;
quit;
Using lag to perform
data result(drop= D_201009 lag_D_201009 rename=(sum=D_201009));
set test;
retain sum;
if bucket < 2 then sum = &total;
sum = sum(sum, -lag(D_201009));
run;
How do I change the code to work for all columns, where the column names are stored in the macro variable &orderedvars2?
The way I'd approach it would be to transpose the data structure to a more useful data structure; then you don't have to use macro variables. You can use BY processing instead, and no lags.
The way I create the final output is to transpose the initial dataset so you have one row per bucket/D_var, then sort by the D_vars (_NAME_ holds that). Then use a Double DoW loop in order to first calculate the sum, and then to subtract the value. Note I don't have to use Retain or Lag here, I can just directly operate on the value since I'm in a DoW loop. I output before subtracting since that's what you seem to want. Then I retranspose back.
This might not be the fastest option if you have very large data, since it goes through several steps; if you do, you should be using a more efficient algorithm anyway. But it's likely the least fiddly if you don't always have the same columns.
proc transpose data=test out=test_t;
by bucket;
run;
proc sort data=test_t;
by _name_ bucket;
run;
data want_t;
do _n_ = 1 by 1 until (last._name_);
set test_t;
by _name_ bucket;
sum_var = sum(sum_var,col1);
end;
do _n_ = 1 by 1 until (last._name_);
set test_t;
by _name_ bucket;
output;
sum_var = sum_var - col1;
end;
run;
proc sort data=want_t;
by bucket _name_;
run;
proc transpose data=want_t out=want;
by bucket;
id _name_;
var sum_var;
run;
Use proc summary to get sum of each variable, then define multiple arrays.
proc summary data=test;
var D:;
output out=sum(drop=_:) sum=/autoname;
run;
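data want;
set test;
if _n_=1 then set sum;
array var1 D_201009--D_201103;
array var2 D_201009_sum--D_201103_sum;
array var3 _D_201009 _D_201010 _D_201011 _D_201012 _D_201101 _D_201102 _D_201103;
array temp (7) _temporary_; /* result of the previous row (temporary arrays are retained automatically) */
array prev (7) _temporary_; /* original value of the previous row */
do i=1 to dim(var1);
/* lag() is avoided here: a single lag() call inside a loop shares one queue across all columns */
if bucket<2 then var3(i)=var2(i);
else var3(i)=sum(temp(i),-prev(i));
temp(i)=var3(i);
prev(i)=var1(i);
end;
drop D: i;
run;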
If I understand this right you want to sum the column and then subtract the value of each observation from the total?
Getting totals is easy, just use proc summary.
Then combine it with the original data. Here is a way that will work without having to worry about the actual variable names. In this program it will sum all variables that start with d_ but you could use any variable list you want. If you have more than 100 variables then change the dimension of the temporary array.
%let varlist=d_:;
* Get sums into variables with same names ;
proc summary data=have ;
var &varlist ;
output out=total sum= ;
run;
data want ;
set have(obs=0) /* Set variable order */
total(keep=&varlist) /* Get totals */
have(keep=&varlist) /* Get lagged variables */
;
array vars &varlist ;
array total (100) _temporary_;
set have (drop=&varlist); /* Get non-lagged variables */
do i=1 to dim(vars);
if _n_>1 then vars(i)=total(i)-vars(i);
total(i)=vars(i);
end;
drop i;
run;
If you have missing values you might want to add this line of code at beginning of the DO loop:
vars(i)=coalesce(vars(i),0);
I currently have a health injury data set of scores 0-6, where 0 is no injury and 6 is fatal injury. This is across 6 categorical body region variables. I'm attempting to construct an Abbreviated Injury Scale, where the three highest scores in an observation would be considered for the calculations. How do I filter the three highest in each row in SAS? Below is an example:
ID A B C D E F
1 0 0 0 3 4 0
2 1 2 1 4 0 0
3 0 0 5 0 0 0
4 1 2 1 5 4 0
So in OBS 1, scores 3, 4, and 0 would be used; OBS 2 - 4, 2, and 1; OBS 3 - 5, 0, and 0; OBS 4 - 5, 4, 2.
I've provided code below to do what you asked, and detailed out the steps enough that you should be able to modify it for many options/uses.
Basically, it takes your data, transposes it as Quentin suggested and then uses proc means to output the top 3 observations for each ID.
DATA NEW;
INPUT ID A B C D E F;
CARDS;
1 0 0 0 3 4 0
2 1 2 1 4 0 0
3 0 0 5 0 0 0
4 1 2 1 5 4 0
;
RUN;
PROC TRANSPOSE DATA=NEW OUT=T_OUT(RENAME=(_NAME_ = VARIABLE COL1=VALUES));
BY ID;
VAR A B C D E F;
PROC PRINT DATA=T_OUT;
RUN;
PROC MEANS DATA=T_OUT NOPRINT;
CLASS ID;
TYPES ID;
VAR VALUES;
OUTPUT OUT=TOP3LIST(RENAME=(_FREQ_=RANK VALUES_MEAN=INDEX_CRITERIA))SUM= MEAN=
IDGROUP(MAX(VALUES) OUT[3] (VALUES VARIABLE)=)/AUTOLABEL AUTONAME;
PROC PRINT DATA=TOP3LIST;
RUN;
***THEN YOU CAN MERGE THIS DATA SET TO YOUR ORIGINAL ONE BY ID TO GET YOUR INDEX CRITERIA ADDED TO IT***;
***THE INDEX_CRITERIA IS A MEAN FROM PROC MEANS BEFORE THE KEEPING OF JUST THE TOP3 VALUES***;
DATA FINAL (DROP=_TYPE_ RANK VALUES_Sum VALUES_1 VALUES_2 VALUES_3 VARIABLE_1 VARIABLE_2 VARIABLE_3);
MERGE NEW TOP3LIST;
INDEX_CRITERIA2=SUM(VALUES_1, VALUES_2, VALUES_3)/3; *THIS CRITERIA IS AVERAGE OF THE KEPT 3 VALUES;
BY ID;
PROC PRINT DATA=FINAL;
RUN;
Best regards,
john
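If only the three highest scores per row are needed, a shorter route is the LARGEST function applied to an array. This is a sketch rather than a replacement for the approach above; it assumes the dataset NEW created earlier, and the names `top1`-`top3` and `score` are made up for illustration.

```sas
/* Sketch: pick the three highest scores in each row with LARGEST(k, values). */
data top3;
    set new;
    array score[*] A B C D E F;
    top1 = largest(1, of score[*]);   /* highest score in the row */
    top2 = largest(2, of score[*]);   /* second highest           */
    top3 = largest(3, of score[*]);   /* third highest            */
run;
```

For ID 1 in the example data this gives 4, 3 and 0. Ties are kept, so a row with two scores of 4 yields 4 for both `top1` and `top2`.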
I have a data set that has a person's name and how many times they scored a 1-10. For example, Bob scored 7 1s, 8 2s, and 7 4s, but did not receive any other scores.
Name 1 2 3 4 5 6 7 8 9 10
Bob 7 8 7 0 0 0 0 0 0 0
Hal 9 3 1 0 0 0 0 0 0 0
I want a data set that has a row for Bob that looks like this
Bob 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4
Hal 1 1 1 1 1 1 1 1 1 2 2 2 3
I'm doing this in SAS by the way.
I know I can write a macro to create variables named score1, score2, ..., scoreN.
I am having trouble populating the cells. Any help would be appreciated. Thanks.
Such things, i.e. changing the structure of the dataset, are sometimes easier to do with PROC TRANSPOSE:
data have;
input Name $ v1 v2 v3 v4 v5 v6 v7 v8 v9 v10;
datalines;
Bob 7 8 7 0 0 0 0 0 0 0
;
run;
/*convert original wide dataset into long one*/
proc transpose data=have out=have_long;
var v:;
by Name;
run;
data want;
set have_long;
do i=1 to COL1;
new_var=input(substr(_NAME_,2),8.); *strip the leading 'v' from the variable name and convert it to a number;
output;
end;
drop _NAME_ COL1 i;
run;
/*convert back to wide dataset*/
proc transpose data=want out=want(drop=_NAME_);
var new_var;
by Name;
run;
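For comparison, the same reshaping can be sketched in a single data step with two arrays. This assumes the dataset `have` from above and an upper bound of 30 scores per person; that bound and the names `cnt`, `score` and `want_arrays` are assumptions for illustration, and the step fails with a subscript error if a person has more scores than the bound.

```sas
/* Sketch: expand the counts in v1-v10 into score1, score2, ... on one row. */
data want_arrays;
    set have;
    array cnt[10] v1-v10;
    array score[30];              /* assumed maximum number of scores per person */
    k = 0;
    do s = 1 to dim(cnt);         /* s is the score value (1 to 10)      */
        do i = 1 to cnt[s];       /* repeat it as often as it was scored */
            k = k + 1;
            score[k] = s;
        end;
    end;
    keep Name score:;
run;
```

Unused positions stay missing, so people with fewer scores simply have trailing missing values.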
Say that I have the following database:
Min Rank Qty
2 1 100
2 2 90
2 3 80
2 4 70
5 1 110
5 2 100
5 3 90
5 4 80
5 5 70
7 1 120
7 2 110
7 3 100
7 4 90
I need to have the database with the continuous values for minutes like this:
Min Rank Qty
2 1 100
2 2 90
2 3 80
2 4 70
3 1 100
3 2 90
3 3 80
3 4 70
4 1 100
4 2 90
4 3 80
4 4 70
5 1 110
5 2 100
5 3 90
5 4 80
5 5 70
6 1 110
6 2 100
6 3 90
6 4 80
6 5 70
7 1 120
7 2 110
7 3 100
7 4 90
How can I do this in SAS? I just need to replicate the previous minute. The number of observations per minute varies...it can be 4 or 5 or more.
It is not hard to imagine code that would do this; the problem is that it quickly starts to look messy.
If your dataset is not too large, you could consider the following approach:
/* We find all gaps. The output dataset is a mapping: which minute of data (reference_minute) we need in order to create each minute (current_minute) */
data MINUTE_MAPPING (keep=current_minute reference_minute);
set YOUR_DATA;
by min;
retain last_minute;
if _N_ = 1 then last_minute = min; *the first record holds the first minute;
if first.min then do;
/* Find gaps, map them to the last minute of data we have */
if last_minute+1 < min then do;
do current_minute=last_minute+1 to min-1;
reference_minute=last_minute;
output;
end;
end;
/* For the available data, we map the minute to itself (this includes the very first minute) */
reference_minute=min;
current_minute=min;
output;
*update;
last_minute=min;
end;
run;
/* Now we apply our mapping to the data */
*you must use proc sql because it is a many-to-many join, data step merge would give a different outcome;
proc sql;
create table RESULT as
select MM.current_minute as min, YD.rank, YD.qty
from MINUTE_MAPPING as MM
inner join YOUR_DATA as YD
on (MM.reference_minute=YD.min)
order by MM.current_minute, YD.rank
;
quit;
A more performant approach would involve trickery with arrays.
But I find the approach above a bit more appealing (disclaimer: at first thought); it is quicker to grasp (disclaimer again: IMHO) for someone else afterwards.
For good measure, the array approach:
data RESULT (keep=min rank qty);
set YOUR_DATA;
by min;
retain last_minute;
array last_data{5} _TEMPORARY_; /* data of the previous minute, assuming at most 5 ranks per minute (enlarge if needed) */
if _N_ = 1 then last_minute = min; *assume that first record really is first minute;
if first.min then do;
if last_minute+1 < min then do; *gap found;
*store data of current record;
curr_min=min;
curr_rank=rank;
curr_qty=qty;
do current_minute=last_minute+1 to curr_min-1;
*produce records from array with last available data;
do iter=1 to dim(last_data);
min = current_minute;
rank = iter;
qty = last_data{iter};
if qty NE . then output; *to prevent output of 5th element where there are only 4;
end;
end;
*put back values of actual current record before proceeding;
min=curr_min;
rank=curr_rank;
qty=curr_qty;
end;
*a new minute starts, so forget the data of the previous one;
call missing(of last_data{*});
*update;
last_minute=min;
end;
*insert data for use on later missing minutes;
last_data{rank}=qty;
output; *output actual current data point;
run;
Hope it helps.
Note: I currently have no access to a SAS client where I am, so this is untested code and might contain a couple of typos.
Unless you have an absurd number of observations, I think transposing would make this easy.
I don't have access to sas at the moment so bear with me (I can test it out tomorrow if you can't get it working).
proc transpose data=data out=data_wide prefix=obs_;
by minute;
id rank;
var qty;
run;
*sort backwards so you can use lag() to fill in the next minute;
proc sort data=data_wide;
by descending minute;
run;
data data_wide; set data_wide;
nextminute = lag(minute);
run;
proc sort data=data_wide;
by minute;
run;
*output until you get to the next minute;
data data_wide; set data_wide;
*ensure that the last observation is output exactly once;
if nextminute = . then output;
else do until (minute ge nextminute);
output;
minute+1;
end;
run;
*then you probably want to reverse the transpose;
proc transpose data=data_wide(drop=nextminute)
out=data_narrow(rename=(col1=qty));
by minute;
var obs_:;
run;
*clean up the observation number;
data data_narrow(drop=_NAME_); set data_narrow;
if qty ne .; *drop ranks that do not exist for this minute;
rank = input(substr(_NAME_,5), 8.);
run;
Again, I can't test this now, but it should work.
Someone else may have a clever solution that makes it so you don't have to reverse-sort/lag/forward-sort. I feel like I have dealt with this before but the obvious solution for me right now is to have it sorted backwards at whatever prior sort you do (you can do the transpose with a descending sort no problem) to save you an extra sort.