SAS if-else if not working - sas

SAS is checking the first if statement and ignoring the others for some reason. For a couple of observations class = 1, and for all the others it is blank. What do I do? I do need the parentheses to group those "or" conditions, right? Attached is the combd dataset. I also just noticed that the observations where SAS sets class=1 are wrong!
data clustered;
set combd;
if ((393821 ge avpm le 450041) or (337601 ge avpm le 393821) or (225161
ge avpm le 281381)) and (.8768 ge fsp le 1) then class='1';
else if ((112720 ge avpm le 168940) or (56500 ge avpm le 112720) or
(280.06 ge avpm le 56500)) and (.8768 ge fsp le 1) then class='2';
else if (280.06 ge avpm le 56500) and ((.507 ge fsp le .6303) or
(.3838 ge fsp le .507) or (.2606 ge fsp le .3838)) then class='3';
else if (280.06 ge avpm le 56500) and ((.1373 ge fsp le .2606) or
(.0141 ge fsp le .1373)) then class='4';
else if (280.06 ge avpm le 56500) and (.8768 ge fsp le 1) then
class='5';
else if (280.06 ge avpm le 56500) and ((.8768 ge fsp le 1) or
(.7535 ge fsp le .8768) or (.6303 ge fsp le .7535)) then
class='6';
run;

This is wrong:
(393821 ge avpm le 450041)
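SAS reads this as (393821 ge avpm) and (avpm le 450041), and any number less than or equal to 393821 is also less than 450041, so the second bound does nothing.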
You want this:
(393821 le avpm le 450041)
That is, 393821 is less than or equal to AVPM, which is less than or equal to 450041, so AVPM is between 393821 and 450041. Make the same change to every range in your code.
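A quick way to see the difference is to evaluate both forms for a few values. This is a minimal sketch; the three test values and the variable names wrong and right are made up for illustration.

```sas
data _null_;
  do avpm = 300000, 400000, 500000;
    /* original form: true whenever avpm <= 393821 */
    wrong = (393821 ge avpm le 450041);
    /* corrected form: true only when 393821 <= avpm <= 450041 */
    right = (393821 le avpm le 450041);
    put avpm= wrong= right=;
  end;
run;
```

Only avpm=400000 lies inside the intended range, yet the original form flags 300000 instead.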
Second: don't write your code this way. It's hard to read and difficult to debug. Use a data-driven approach instead.
You really have a relationship here, right? [some AVPM values] and [some FSP values] -> [some CLASS value]
So let's make a table:
data class;
input class avpm_min avpm_max fsp_min fsp_max;
datalines;
1 393821 450041 .8768 1
1 337601 393821 .8768 1
1 225161 281381 .8768 1
2 112720 168940 .8768 1
2 056500 112720 .8768 1
2 280.06 056500 .8768 1
3 280.06 056500 .5070 .6303
3 280.06 056500 .3838 .5070
3 280.06 056500 .2606 .3838
6 280.06 056500 .6303 .7535
6 280.06 056500 .7535 .8768
;;;; /*more datalines of course */
run;
And then let's use PROC SQL to join this to the main table.
data your_data;
input avpm fsp;
obs=_n_;
datalines;
13026.14 .81888
1810.57 .84959
3859.84 .85593
3290.61 .57513
10704.72 .71414
;;;;
run;
proc sql;
select d.obs, d.avpm, d.fsp, c.class from
your_data d
left join class c
on c.avpm_min le d.avpm le c.avpm_max
and c.fsp_min le d.fsp le c.fsp_max
order by d.obs
;
quit;
There you go.
I'd also note that your IF/ELSE combinations don't really make sense for 4/5/6. Class 5 can never be assigned (its condition is entirely subsumed by class 2's), and the same is true of part of class 6.

Related

Extract variable number of columns from beginning and end from a SAS dataset

I have a SAS dataset where I want to keep, let's say, the first 2 columns and the last 4 columns. In other words, only columns from the beginning and from the end.
data test;
input a b c d e f g h i j;
cards;
1 2 3 4 5 6 7 8 9 10
;
Initial output:
What I want is the following -
Output desired:
I checked on the net, people are trying something with varnum, as shown here, but I can't figure out. I don't want to use keep/drop, rather I want an automated way to solve this issue.
%DOSUBL can run code in a separate stream and be part of a code generation scheme at code submit (pre-run) time.
Suppose the requirement is to slice columns out of a data set based on metadata column position as indicated by varnum (i.e. places), and the syntax for places is:
p:q to select the range of columns whose varnum position is between p and q
multiple ranges can be specified, separated by spaces ()
a single column position, p, can be specified
negative values select the position downward from the highest position.
also, the process should honor all incoming data set options specified, i.e. keep= drop=
All the complex logic for implementing the requirements could be done in pure macro code using only %sysfunc and data functions such as open, varnum, varname, etc... That code would be pretty unwieldy.
The selection of names from meta data can be cleaner using SAS features such as Proc CONTENTS and Proc SQL executed within DOSUBL.
Example:
Macro logic is used to construct (or map) the filtering criteria statement based on varnum. Metadata retrieval and processing done with Procs.
%macro columns_slice (data=, places=);
%local varlist temp index p token part1 part2 filter joiner;
%let temp = __&sysmacroname._%sysfunc(monotonic());
%do index = 1 %to %sysfunc(countw(&places,%str( )));
%let token = %scan(&places,&index,%str( ));
%if NOT %sysfunc(prxmatch(/^(-?\d+:)?-?\d+$/,&token)) %then %do;
%put ERROR: &sysmacroname, invalid places=&places;
%return;
%end;
%let part1 = %scan (%superq(token),1,:);
%let part2 = %scan (%superq(token),2,:);
%if %qsubstr(&part1,1,1) = %str(-) %then
%let part1 = max(varnum) + 1 &part1;
%if %length(&part2) %then %do;
%if %qsubstr(&part2,1,1) = %str(-) %then
%let part2 = max(varnum) + 1 &part2;
%end;
%else
%let part2 = &part1;
%let filter=&filter &joiner (varnum between &part1. and &part2.) ;
%let joiner = OR;
%end;
%put NOTE: &=filter;
%if 0 eq %sysfunc(dosubl(%nrstr(
options nonotes;
proc contents noprint data=&data out=&temp(keep=name varnum);
proc sql noprint;
select name
into :varlist separated by ' '
from &temp
having &filter
order by varnum
;
drop table &temp;
quit;
)))
%then %do;&varlist.%end;
%else
%put ERROR: &sysmacroname;
%mend;
Using the slicer
* create sample table for demonstration;
data lotsa_columns(label='A silly 1:1 merge');
if _n_ > 10 then stop;
merge
sashelp.class
sashelp.cars
;
run;
%put %columns_slice (data=lotsa_columns, places=1:3);
%put %columns_slice (data=lotsa_columns, places=-1:-5);
%put %columns_slice (data=lotsa_columns, places=2:4 -2:-4 6 7 8);
1848 %put %columns_slice (data=lotsa_columns, places=1:3);
NOTE: FILTER=(varnum between 1 and 3)
Name Sex Age
1849 %put %columns_slice (data=lotsa_columns, places=-1:-5);
NOTE: FILTER=(varnum between max(varnum) + 1 -1 and max(varnum) + 1 -5)
Horsepower MPG_City MPG_Highway Wheelbase Length
1850 %put %columns_slice (data=lotsa_columns, places=2:4 -2:-4 6 7 8);
NOTE: FILTER=(varnum between 2 and 4) OR (varnum between max(varnum) + 1 -2 and max(varnum) + 1
-4) OR (varnum between 6 and 6) OR (varnum between 7 and 7) OR (varnum between 8 and 8)
Sex Age Height Make Model Type MPG_City MPG_Highway Wheelbase
Honoring options
data have;
array x(100);
array y(100);
array z(100);
run;
%put %columns_slice (data=have(keep=x:), places=2:4 8:10 -2:-4 -25:-27 -42);
1858 %put %columns_slice (data=have(keep=x:), places=2:4 8:10 -2:-4 -25:-27 -42);
NOTE: FILTER=(varnum between 2 and 4) OR (varnum between 8 and 10) OR (varnum between max(varnum)
+ 1 -2 and max(varnum) + 1 -4) OR (varnum between max(varnum) + 1 -25 and max(varnum) + 1 -27) OR
(varnum between max(varnum) + 1 -42 and max(varnum) + 1 -42)
x2 x3 x4 x8 x9 x10 x59 x74 x75 x76 x97 x98 x99
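Because the macro resolves to a plain list of names, one way to use it is inside a KEEP= option. The following is a minimal sketch, assuming the %columns_slice macro and the lotsa_columns table defined above; the output name first_and_last is illustrative. It should keep the first two and last four columns, which is what the question asked for.

```sas
/* keep the first 2 and the last 4 columns, selected by position */
data first_and_last;
  set lotsa_columns(keep=%columns_slice(data=lotsa_columns, places=1:2 -4:-1));
run;
```

The macro still writes its NOTE: FILTER= line to the log while the SET statement is being compiled.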
If you don't know the number of variables, you can use this macro (you specify the number of leading variables and the number of trailing variables to keep, plus the libname and the name of the dataset):
%macro drop_vars(num_first_vars,num_end_vars,lib,dataset); %macro d;%mend d;
proc sql noprint;;
select sum(num_character,num_numeric) into:ncolumns
from dictionary.tables
where libname=upcase("&lib") and memname=upcase("&dataset");
select name into: vars_to_drop separated by ','
from dictionary.columns
where libname=upcase("&lib") and
memname=upcase("&dataset") and
varnum between %eval(&num_first_vars.+1) and %eval(&ncolumns-&num_end_vars);
alter table &lib..&dataset
drop &vars_to_drop;
quit;
%mend drop_vars;
%drop_vars(2,3,work,test);
Dataset before macro execution:
+---+---+---+---+---+---+---+---+---+----+
| a | b | c | d | e | f | g | h | i | j |
+---+---+---+---+---+---+---+---+---+----+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
+---+---+---+---+---+---+---+---+---+----+
Dataset after macro execution:
+---+---+---+---+----+
| a | b | h | i | j |
+---+---+---+---+----+
| 1 | 2 | 8 | 9 | 10 |
+---+---+---+---+----+
If the names follow a pattern just generate the list using the pattern. So if the names look like month names you just need to know one month to generate the other.
%let last_month = '01JAN2019'd ;
%let first_var = %sysfunc(intnx(month,&last_month,-12),monyy7.);
%let last_var = %sysfunc(intnx(month,&last_month,-0),monyy7.);
data want;
set have(keep= id1 id2 &first_var -- &last_var);
run;
If you cannot find a SAS function or format that generates the names in the style your variables use then write your own logic.
data _null_;
array month_abbr [12] $3 _temporary_ ('JAN' 'FEB' 'MAR' 'APR' 'MAY' 'JUN' 'JUL' 'AUG' 'SEP' 'OKT' 'NOV' 'DEK' );
last_month=today();
first_month=intnx('month',last_month,-12);
call symputx('first_var',catx('_',month_abbr[month(first_month)],year(first_month)));
call symputx('last_var',catx('_',month_abbr[month(last_month)],year(last_month)));
run;

Isolate Patients with 2 diagnoses but diagnosis data is on different lines

I have a dataset of patient data with each diagnosis on a different line.
This is an example of what it looks like:
patientID diabetes cancer age gender
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
I need to isolate the patients who have a diagnosis of both diabetes and cancer; their unique patient identifier is patientID. Sometimes they are both on the same line, sometimes they aren't. I am not sure how to do this because the information is on multiple lines.
How would I go about doing this?
This is what I have so far:
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(DOB) as DOB
from diab_dx
group by patientID;
quit;
data final; set want;
if diabetes GE 1 AND cancer GE 1 THEN both = 1;
else both =0;
run;
proc freq data=final;
tables both;
run;
Is this correct?
If you want to learn about data steps, look up how this works.
data pat;
input patientID diabetes cancer age gender:$1.;
cards;
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
;;;;
run;
data both;
do until(last.patientid);
set pat; by patientid;
_diabetes = max(diabetes,_diabetes);
_cancer = max(cancer,_cancer);
end;
both = _diabetes and _cancer;
run;
proc print;
run;
Adding a HAVING clause at the end of the SQL query should do it.
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(age) as age
from PAT
group by patientID
having calculated diabetes ge 1 and calculated cancer ge 1;
quit;
You might find some coders, especially those coming from statistical backgrounds, are more likely to use Proc MEANS instead of SQL or DATA step to compute the diagnostic flag maximums.
proc means noprint data=have;
by patientID;
output out=want
max(diabetes) = diabetes
max(cancer) = cancer
min(age) = age
;
run;
or for the case of all the same aggregation function
proc means noprint data=have;
by patientID;
var diabetes cancer;
output out=want max= ;
run;
or
proc means noprint data=have;
by patientID;
var diabetes cancer age;
output out=want max= / autoname;
run;

SAS Chi Square Test

I have the following dataset on which I intend to perform a chi square test (all variables being categorical).
Indicator Area Range1 Range2
0 A 17-25 25-50
0 A 17-25 25-50
0 A 17-25 25-50
0 A 17-25 25-50
0 A 0-17 25-50
1 B 17-25 25-50
1 B 0-17 17-25
1 B 17-25 25-50
The test needs to be performed for every variable, namely Range1, Range2 and Area. One way to do it is to create a macro. But I have around 300 variables, and calling the macro 300 times is not efficient. The code that I use for 3 variables is as follows:
options mprint mlogic symbolgen;
%macro chi_test(vars_test);
proc freq data =testdata.AllData;
tables &vars_test*Indicator/ norow nocol nopercent chisq ;
output out=stats_&vars_test &vars_test PCHI;
run;
data all_chi;
set stats_:;
run;
%mend chi_test;
%chi_test(Range1);
%chi_test(Range2);
%chi_test(Area);
Can any one help out?
Why not just transpose the data and use BY group processing.
First add a unique row identifier so that PROC TRANSPOSE can convert your variables to a single column.
data have_extra;
row+1;
set have;
run;
proc transpose data=have_extra out=tall ;
by row indicator ;
var area range1 range2 ;
run;
Then order the records by the original variable name.
proc sort; by _name_ ; run;
Then you can run your CHI-SQ for each of your original variables.
proc freq data =tall ;
by _name_;
tables col1*Indicator/ norow nocol nopercent chisq ;
output out=all_chi PCHI;
run;
If all your variables are categorical then you can use _all_ in the tables statement, along with ods output for the dataset. This creates a single dataset with all the combinations of variables * Indicator.
If you wanted, you can apply dataset options (where=, keep=, drop= etc) against the output dataset.
data have;
input Indicator Area $ Range1 $ Range2 $;
datalines;
0 A 17-25 25-50
0 A 17-25 25-50
0 A 17-25 25-50
0 A 17-25 25-50
0 A 0-17 25-50
1 B 17-25 25-50
1 B 0-17 17-25
1 B 17-25 25-50
;
run;
ods select chisq;
ods output chisq=want;
proc freq data=have;
tables _all_*Indicator/ norow nocol nopercent chisq;
run;
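As an illustration of the dataset-option remark above, the ODS OUTPUT dataset can be trimmed as it is written. This is a sketch only: it assumes the ChiSq table carries its usual columns (Table, Statistic, DF, Value, Prob) and that the Pearson row is labelled 'Chi-Square', so check the unfiltered want dataset first.

```sas
ods select chisq;
/* keep only the Pearson chi-square row for each variable*Indicator table */
ods output chisq=want(where=(statistic='Chi-Square')
                      keep=table statistic df value prob);
proc freq data=have;
  tables _all_*Indicator / norow nocol nopercent chisq;
run;
```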

Replace the missing values in SAS

I want to replace the missing values with the next variables by pushing the values towards H1, Please see the example below. I have placed the desired output below.
SN OP_NAME H1 H2 H3 H4 H5
115060 NORS . 2331
115060 WIDE .
115061 .
115061 AIR . 7680
115061 ALLI .
115061 SKYW 1594
115062 NORS . .
115062 WIDE 3130 .
115063 NORS . 5414
115063 WIDE .
115064 ATLA 5231 . 11259 .
115066 ATLA 9637 . 5191 .
115067 LUXA .
115069 ATLA . 5963 .
115070 AMER 7457
115070 ATLA 10181
115070 WEST .
115072 JETS 10517
115073 SKYW . . 5515 . .
115074 MIDW .
115075 SKYW . . 4291 3499 11549
115076 DLTN 3918
Output looks like:
SN OP_NAME H1 H2 H3
115060 NORS 2331
115060 WIDE .
115061 .
115061 AIR 7680
115061 ALLI .
115061 SKYW 1594
115062 NORS . .
115062 WIDE 3130 .
115063 NORS 5414
115063 WIDE .
115064 ATLA 5231 11259
115066 ATLA 9637 5191
115067 LUXA .
115069 ATLA 5963 .
115070 AMER 7457
115070 ATLA 10181
115070 WEST .
115072 JETS 10517
115073 SKYW 5515 .
115074 MIDW .
115075 SKYW 4291 3499 11549
115076 DLTN 3918
A cheeky double proc transpose ought to do the trick (the first datastep is some test code):
data test_code;
serial=1; h1=3; h2=.; h3=55; output;
serial=2; h1=.; h2=.; h3=32; h4=.; output;
serial=3; h1=45; h2=23; h3=.; h4=99; output;
serial=4; h1=.; h2=.; h3=5; output;
proc sort;
by serial;
run;
proc transpose data=test_code out=test_code_tran(drop=_:);
by serial;
var h:;
proc transpose data=test_code_tran prefix=h out=final_output(drop=_:);
by serial;
var col1;
where col1;
run;
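As programmed above, though, it only works with numeric values in the h* variables. Note also that the where col1; statement treats 0 as false, so zeros are dropped along with missing values.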
The simplest way is probably a double-counter loop.
data want;
set have;
array hs h:;
_counter=2;
do _t = 1 to dim(hs)-1 while (_counter le dim(hs));
if missing(hs[_t]) then do;
/* only search to the right of the slot being filled */
_counter = max(_counter, _t+1);
do while (missing(hs[_counter]));
_counter+1;
if _counter > dim(hs) then leave;
end;
if _counter le dim(hs) then do;
/* move the next non-missing value left and blank its old slot */
hs[_t] = hs[_counter];
call missing(hs[_counter]);
_counter+1;
end;
end;
end;
drop _t _counter;
run;
The PROC TRANSPOSE option is less code and more flexible; this may be faster if you have a ton of rows.
In a similar method to Joe's, I'd use arrays for this kind of processing...
%LET NVARS = 5 ;
data want ;
set have ;
array _t{&NVARS} _TEMPORARY_ ;
array _n H1-H&NVARS ;
t = 0 ;
/* Load non-missing values into temporary array */
do i = 1 to dim(_n) ;
if not missing(_n{i}) then do ;
t + 1 ;
_t{t} = _n{i} ;
end ;
end ;
/* Load temporary array back into source array */
call missing(of _n{*}) ;
do i = 1 to t ;
_n{i} = _t{i} ;
end ;
drop i t ;
run ;

SAS code to strip apostrophe from end of data values

I have a variable Deck1-Deck100 (Deck1 through Deck100 representing 100 trials of a task) with four possible values A' B' C' D'. I need to recode this variable so that A' and B' = 0 and C' and D' = 1. Can you help? I know nothing about how to compress variables.
or the informat approach:
/* generate a fake data set */
data trials;
array trial(100) $2.;
array t(4) $2. _temporary_ ("A'","B'","C'","D'");
do i = 1 to 100;
j=ceil(ranuni(112233)*4);
trial[i] = t[j];
end;
drop i j ;
run;
proc print noobs;
var trial1-trial10;
run;
/* create a 'recoding' format */
proc format;
invalue trials (upcase)
"A'","B'"=0
"C'","D'"=1
other=.;
run;
/* convert the values */
data newtrials;
set trials;
array trial(*) $2. trial1-trial100;
array rtrial(100);
do i = 1 to 100;
rtrial[i]=input(trial[i], trials.);
end;
drop i trial:;
run;
proc print noobs;
var rtrial1-rtrial10;
run;
which produces this output:
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
trial1 trial2 trial3 trial4 trial5 trial6 trial7 trial8 trial9 trial10
D' D' B' A' A' D' A' B' C' D'
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
rtrial1 rtrial2 rtrial3 rtrial4 rtrial5 rtrial6 rtrial7 rtrial8 rtrial9 rtrial10
1 1 0 0 0 1 0 0 1 1
Assuming that you have all 100 variables in one observation and just want to recode the values, the following might work. Just replace the number 4 with the number 100.
data sample;
deck1='A''';
deck2='B''';
deck3='C''';
deck4='D''';
run;
proc print data=sample;
run;
data result;
set sample;
array alldecks(4) deck1-deck4;
do i=1 to 4;
if alldecks(i) eq 'A''' or alldecks(i) eq 'B''' then alldecks(i)='0';
if alldecks(i) eq 'C''' or alldecks(i) eq 'D''' then alldecks(i)='1';
end;
drop i;
run;
proc print data=result;
run;
This gives you the following output:
Obs deck1 deck2 deck3 deck4
1 A' B' C' D'
Obs deck1 deck2 deck3 deck4
1 0 0 1 1
Stripping the single quote is easy, just compress() it out:
data _null_;
length old new $8.;
old = "A'";
new = compress(old, "'");
put (_all_) (=/);
run;
/* on log
old=A'
new=A
*/
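The same compress() call can feed the 0/1 recode directly. This is a minimal sketch, assuming the four-variable sample data set from the earlier answer; the names recoded, rdeck and letter are illustrative, and for the real data the 4 would become 100.

```sas
data recoded;
  set sample;
  array deck(*) deck1-deck4;
  array rdeck(4);
  do i = 1 to dim(deck);
    /* strip the apostrophe, then map A/B to 0 and C/D to 1 */
    letter = upcase(compress(deck[i], "'"));
    if letter in ('A','B') then rdeck[i] = 0;
    else if letter in ('C','D') then rdeck[i] = 1;
    else rdeck[i] = .;  /* anything unexpected stays missing */
  end;
  drop i letter;
run;
```

Writing the result to new numeric variables keeps the original character values available for checking.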