Counting certain values in rows in SAS table - sas

I'm trying to count certain values in rows and require little help!
I have table that looks like this:
data test;
input a b c d;
cards;
1 0 9 1
1 1 0 0
0 9 1 1
0 0 9 1
1 0 9 9
0 1 1 0
1 9 9 1
1 9 0 0
0 0 9 1
9 1 0 0;
run;
Variables a,b,c and d can have values 1, 0 or 9. Now I need to to make a new variable that has value of 1 when there is two or more values of 9 in a row. How do I do this?

Your question needs clarifying... do mean two 9's anywhere in a single row, or two 9's in a row (i.e. consecutively)?
A simple way is to concatenate (using cats()) all the values into a string, and use the index() function to check for the '99', or count() to count the 9's...
data want ;
set have ;
array all{*} a b c d ;
vallist = cats(of all{*}) ;
has99 = (index(vallist,'99') > 0) ; /* flag any two consecutive 9's */
two9s = (count(vallist,'9') >= 2) ; /* two or more 9's */
drop vallist ;
run ;

Here's one way you could do it. Sum your rows and store it in a new variable, e, then if that sum is 18 or larger then you know there has to be at least 2 9's.
data test;
set test;
e = a+b+c+d;
IF e >= 18 THEN f = 1;
ELSE f = 0;
DROP e;
run;

Try this:
data want;
set test;
flag=sum(of _all_)>=18;
run;

Related

How to count the length of a nonzero sequence in SAS

I'd like to count the length of the non zero sequence in a data as below:
ID Value
1 0
1 0
1 2.5
1 3
1 0
1 4
1 2
1 5
1 0
So here the length of the first non zero sequence is 2 and the length of the second non zero sequence is 3. The new data will look like this:
ID Value Length
1 0 0
1 0 0
1 2.5 2
1 3 2
1 0 0
1 4 3
1 2 3
1 5 3
1 0 0
How can I write SAS code to accomplish this task with a large data like this. Thanks!
Here is one possible solution. It assumes there are no missing values in the Value variable and that your ID variable does not have any significance for this problem.
*creates new length variable that starts at 1 and increments by 1 from start to end of every non-zero sequence;
data step_one (drop=prev_val);
set orig_data;
retain prev_val length 0;
indx = _n_;
if value ne 0 and prev_val ne 0 then length = length + 1;
else if value ne 0 then length = 1;
else if value = 0 then length = 0;
prev_val = value;
run;
*sorts dataset in reverse order;
proc sort data=step_one;
by descending indx;
run;
*creates modified length variable that carries maximum length value for each sequence down to all observations included in that sequence;
data step_two (drop=length prev_length rename=(length_new=length));
set step_one;
retain length_new prev_length 0;
if length = 0 then length_new = 0;
else if length ne 0 and prev_length = 0 then
length_new = length;
prev_length = length;
run;
*re-sorts dataset back to its original order and outputs final dataset with just the needed variables;
proc sort data=step_two out=final_result (keep=ID value length);
by indx;
run;

sas expanding the code to multiple columns using lag function and performing an operation

My dataset is like this
bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103
0 0 0 0 0 0 0 0
1 1 0 0 0 1 0 0
2 3 0 3 0 1 6 3
3 0 0 0 0 0 0 0
4 0 4 0 0 0 0 0
5 4 0 4 0 4 8 1
6 8 0 8 0 8 10 8
7 0 0 0 0 0 0 0
8 7 0 7 0 0 7 3
what I want is this
bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103
0 23 4 22 0 14 31 15
1 23 4 22 0 14 31 15
2 22 4 22 0 13 31 15
3 19 4 19 0 12 25 12
4 19 4 19 0 12 25 12
5 19 0 19 0 12 25 12
6 15 0 15 0 8 17 11
7 7 0 7 0 0 7 3
8 7 0 7 0 0 7 3
where the sum is the value for bucket 0 and 1 row the corresponding bucket 2 for column D_201009 =sum-original value(1) and later for bucket 3 for column D_201009 previous value(lag value) -3(value original) and label this column as original column name. I wrote the code to perform one column.
data test;
input bucket D_201009 D_201010 D_201011 D_201012 D_201101 D_201102 D_201103;
datalines;
0 0 0 0 0 0 0 0
1 1 0 0 0 1 0 0
2 3 0 3 0 1 6 3
3 0 0 0 0 0 0 0
4 0 4 0 0 0 0 0
5 4 0 4 0 4 8 1
6 8 0 8 0 8 10 8
7 0 0 0 0 0 0 0
8 7 0 7 0 0 7 3
;
run;
Saving these column names in a macro
proc contents data = test
out = vars(keep = varnum name)
noprint;
run;
proc sql noprint;
select distinct name
into :orderedvars2 separated by ' '
from vars
where varnum >=2
order by varnum;
quit;
Finding sum of one column only
proc sql;
select sum(D_201009) into :total from test;
quit;
Using lag to perform
data result(drop= D_201009 lag_D_201009 rename=(sum=D_201009));
set test;
retain sum;
if bucket < 2 then sum = &total;
sum = sum(sum, -lag(D_201009));
run;
how do I change the code to work for all columns where the column names are stored as macro &orderedvars2. ?
The way I'd approach it would be to transpose the data structure to a more useful data structure; then you don't have to use macro variables. You can use BY processing instead, and no lags.
The way I create the final output is to transpose the initial dataset so you have one row per bucket/D_var, then sort by the D_vars (_NAME_ holds that). Then use a Double DoW loop in order to first calculate the sum, and then to subtract the value. Note I don't have to use Retain or Lag here, I can just directly operate on the value since I'm in a DoW loop. I output before subtracting since that's what you seem to want. Then I retranspose back.
This might not be the fastest option if you have very large data, since it goes through several steps; if you do, you should be using a more efficient algorithm anyway. But it's likely the least fiddly if you don't always have the same columns.
proc transpose data=test out=test_t;
by bucket;
run;
proc sort data=test_t;
by _name_ bucket;
run;
data want_t;
do _n_ = 1 by 1 until (last._name_);
set test_t;
by _name_ bucket;
sum_var = sum(sum_var,col1);
end;
do _n_ = 1 by 1 until (last._name_);
set test_t;
by _name_ bucket;
output;
sum_var = sum_var - col1;
end;
run;
proc sort data=want_t;
by bucket _name_;
run;
proc transpose data=want_t out=want;
by bucket;
id _name_;
var sum_var;
run;
Use proc summary to get sum of each variable, then define multiple arrays.
proc summary data=test;
var D:;
output out=sum(drop=_:) sum=/autoname;
run;
data want;
set test;
if _n_=1 then set sum;
array var1 D_201009--D_201103;
array var2 D_201009_sum--D_201103_sum;
array var3 _D_201009 _D_201010 _D_201011 _D_201012 _D_201101 _D_201102 _D_201103;
array temp (7) _temporary_;
retain temp;
do i=1 to dim(var1);
lag=lag(var1(i));
if bucket<2 then var3(i)=var2(i);
else var3(i)=sum(temp(i),-lag);
temp(i)=var3(i);
end;
drop D: lag i;
run;
If I understand this right you want to sum the column and then subtract the value of each observation from the total?
Getting totals is easy, just use proc summary.
Then combine it with the original data. Here is a way that will work without having to worry about the actual variable names. In this program it will sum all variables that start with d_ but you could use any variable list you want. If you have more than 100 variables then change the dimension of the temporary array.
%let varlist=d_:;
* Get sums into variables with same names ;
proc summary data=have ;
var &varlist ;
output out=total sum= ;
run;
data want ;
set have(obs=0) /* Set variable order */
total(keep=&varlist) /* Get totals */
have(keep=&varlist) /* Get lagged variables */
;
array vars &varlist ;
array total (100) _temporary_;
set have (drop=&varlist); /* Get non-lagged variables */
do i=1 to dim(vars);
if _n_>1 then vars(i)=total(i)-vars(i);
total(i)=vars(i);
end;
drop i;
run;
If you have missing values you might want to add this line of code at beginning of the DO loop:
vars(i)=coalesce(vars(i),0);

SAS: Getting variable name of last non-zero observation

I'm trying to figure this out. I have a table as follows and I'm trying to populate the final column with the variable name of the last non-zero value (as shown in final column):
ID MTH_1 MTH_2 MTH_3 MTH_4 MTH_5 MONTH_LAST_BALANCE
--------------------------------------------------------------
1 10 0 10 20 10 MTH_5
2 5 10 15 5 0 MTH_4
3 5 10 5 0 0 MTH_3
4 1 2 3 1 0 MTH_4
5 1 0 0 0 0 MTH_1
I'm guessing I need to use some sort of array to make this work but I don't know. As per row 1, I need the last non-zero value only, not the left-most one that some other code seems to retrieve.
Any help would be much appreicated.
Cheers
data want ;
set have ;
/* Load MTH_1 to MTH_5 into array */
array m{*} MTH_1-MTH_5 ;
length MONTH_LAST_BALANCE $5. ;
/* Iterate over array */
do i = 1 to dim(m) ;
/* Use vname function to get variable name from array element */
if m{i} > 0 then MONTH_LAST_BALANCE = vname(m{i}) ;
end ;
run ;

SAS, Filtering for Highest Values

I currently have a health injury data set of scores 0-6, where 0 is no injury and 6 is fatal injury. This is across 6 categorical body region variables. I'm attempting to construct an Abbreviated Injury Scale, where the three highest scores in an observation would be considered for the calculations. How do I filter the three highest in each row in SAS? Below is an example:
ID A B C D E F
1 0 0 0 3 4 0
2 1 2 1 4 0 0
3 0 0 5 0 0 0
4 1 2 1 5 4 0
So in OBS 1, scores 3, 4, and 0 would be used; OBS 2 - 4, 2, and 1; OBS 3 - 5, 0, and 0; OBS 4 - 5, 4, 2.
I've provided code below to do what you asked, and detailed out the steps enough that you should be able to modify it for many options/uses.
Basically, it takes your data, transposes it as Quentin suggested and then uses proc means to output the top 3 observations for each ID.
DATA NEW;
INPUT ID A B C D E F;
CARDS;
1 0 0 0 3 4 0
2 1 2 1 4 0 0
3 0 0 5 0 0 0
4 1 2 1 5 4 0
RUN;
PROC TRANSPOSE DATA=NEW OUT=T_OUT(RENAME=(_NAME_ = VARIABLE COL1=VALUES));
BY ID;
VAR A B C D E F;
PROC PRINT DATA=T_OUT;
RUN;
PROC MEANS DATA=T_OUT NOPRINT;
CLASS ID;
TYPES ID;
VAR VALUES;
OUTPUT OUT=TOP3LIST(RENAME=(_FREQ_=RANK VALUES_MEAN=INDEX_CRITERIA))SUM= MEAN=
IDGROUP(MAX(VALUES) OUT[3] (VALUES VARIABLE)=)/AUTOLABEL AUTONAME;
PROC PRINT DATA=TOP3LIST;
RUN;
***THEN YOU CAN MERGE THIS DATA SET TO YOUR ORIGINAL ONE BY ID TO GET YOUR INDEX CRITERIA ADDED TO IT***;
***THE INDEX_CRITERIA IS A MEAN FROM PROC MEANS BEFORE THE KEEPING OF JUST THE TOP3 VALUES***;
DATA FINAL (DROP=_TYPE_ RANK VALUES_Sum VALUES_1 VALUES_2 VALUES_3 VARIABLE_1 VARIABLE_2 VARIABLE_3);
MERGE NEW TOP3LIST;
INDEX_CRITERIA2=SUM(VALUES_1, VALUES_2, VALUES_3)/3; *THIS CRITERIA IS AVERAGE OF THE KEPT 3 VALUES;
BY ID;
PROC PRINT DATA=FINAL;
RUN;
Best regards,
john

replace missing value with non-zero values by column

Data:
A B C D E
2 3 4 . .
2 3 0 0 .
0 3 4 1 1
0 . 4 0 1
2 0 0 0 1
Ideal output:
A B C D E
2 3 4 1 1
2 3 0 0 1
0 3 4 1 1
0 3 4 0 1
2 0 0 0 1
For each column, there are only 3 possible values: an arbitrary integer, zero, and missing value.
I want to replace the missing values with the non-zero value in the corresponding column.
If the arbitrary integer is zero, then missing value should be replaced by zero.
For actual problem, the number of row and number of columns are not small.
Make two arrays--one with your column names and another with variables to hold the arbitrary integers. Loop through the data set once to get the integers (looping over the columns in the array), then again to output the values, replacing where necessary (again, looping through the columns in the array).
data want(drop=i int1-int5);
do until (eof);
set have end=eof;
array _col a--e;
array _int int1-int5;
do i = 1 to dim(_col);
if _col(i) not in (.,0) then _int(i)=_col(i);
end;
end;
do until (_eof);
set have end=_eof;
do i = 1 to dim(_col);
if missing(_col(i)) then _col(i)=_int(i);
end;
output;
end;
run;