Conditionally Counting a Response Across Multiple Variables in SAS

I am fairly comfortable programming in R but am working on a scholarly statistical analysis that my PI would much prefer be done in SAS. I am using SAS University Edition and thus cannot use the new submit / R to do the things I am uncomfortable doing in SAS. In any case, I am trying to conditionally count the frequency of a given character result across multiple columns, using the toy data set below:
DATA example;
INPUT X01_d3 $ X02_d3 $ X03_d3 $ X04_d3 $;
CARDS;
H H F D
H H H H
H D D D
F F F D
F F D D
H . . .
H F . D
;
RUN;
I want to count the number of times that "H" appears for a given observation and put it into a new variable called Num_H. How I would typically code this in R would be:
example$Num_H<-rowSums(example[,1:4] == "H")
giving me the following output:
> example
X01_d3 X02_d3 X03_d3 X04_d3 Num_H
1 H H F D 2
2 H H H H 4
3 H D D D 1
4 F F F D 0
5 F F D D 0
6 H . . . 1
7 H F . D 1
I could easily write this in a data step using if/then statements, but based on the size of the data set I would prefer not to. Is there an easier way to do this in SAS in a DATA step, PROC SQL, or otherwise? Thank you in advance for the help.

First off: in using SAS vs R, you're going to find things that are easier to do in one versus the other all the time. Since R is a matrix language, and Base SAS is not, things like 'scan every element in this list ...' will be one of the things R does more efficiently than SAS.
That said, there's an easy way to do this:
data want;
set example;
num_h = lengthn(trimn(compress(cats(of _character_),'H','k')));
run;
COMPRESS with the 'k' (keep) modifier eliminates every character that is not 'H'. CATS takes all of the character variables and makes them a single string, and TRIMN/LENGTHN make sure an empty ' ' result is counted as 0 rather than 1.
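As a hedged variation on the same idea, COUNTC can count the 'H' characters directly instead of compressing and then measuring the length. This is only a sketch: it assumes each variable holds a single character, and the output data set name want_countc is made up for illustration.

```sas
data want_countc;
  set example;
  /* CATS joins the four values into one string;                   */
  /* COUNTC then counts how many characters of that string are 'H' */
  num_h = countc(cats(of x01_d3--x04_d3), 'H');
run;
```

Like the COMPRESS version, this only works when the values are single characters, because it counts characters rather than whole values.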
If your data were more complicated, where you couldn't use this trick (such as multiple character strings), you could certainly loop over the variables to get your result.
data want;
set example;
array xvars x01_d3 -- x04_d3;
do _i = 1 to dim(xvars);
num_h = sum(num_h, xvars[_i]='H');
end;
drop _i;
run;
A little longer of course to write, but gets the job done pretty easily.

As an alternate option, if you are using SAS University Edition, you have access to SAS/IML, which is SAS's matrix language (i.e., similar to R). IML isn't identical to R, and you'll still have some issues adjusting to it undoubtedly, but it is a matrix language, so you'll probably find this a bit easier.
Here's the IML program that would produce the vector you're asking for.
proc iml;
use work.example;
read all var _CHAR_ into char_mat;
for_num_h = countc('H',char_mat)[,+];
print for_num_h;
quit;
Here, I apply the SAS function countc to generate a matrix of 1/0 (it's done at the cell level); then use the subscript reduction operator for addition to sum them into a vector.

I would do it this way:
Data want;
set example;
Num_H = sum((X01_d3="H"), (X02_d3="H"),(X03_d3="H"),(X04_d3="H"));
run;
In fact, (X01_d3="H") evaluates to a 0/1 dummy value, so all you have to do is sum these values!
Hope it helps!
MK

Related

SAS - How to make dataset wide to long when some values are missing?

I have a dataset that looks basically like this:
LOCID  Name  Addtl Loc 1  Addtl Loc 2  Addtl Loc 3
1      A     2            3            5
1      B     2
1      C     2            4
And I would like to make it look like this:
LOCID  Name  Gender
1      A     F
2      A     F
3      A     F
5      A     F
1      B     M
2      B     M
1      C     F
2      C     F
4      C     F
So, I'd like to keep the attributes for each person but have a row for each of their locations. I also don't currently have a unique ID or any variable to identify each of the people but I could make one. I'm working in SAS. Does anyone have suggestions on how to do this?
I have been looking up wide to long methods but am having trouble understanding them.
It looks to me like you could just use a DO LOOP to transpose the data.
So assuming your input data set has LOCID and ADD_LOCID1 to ADD_LOCID3 plus any other variables, such as NAME and GENDER, you could just do the following to add an extra observation for every non-missing value found in the extra locid variables.
data want;
set have;
array list add_locid1 - add_locid3;
output;
do index=1 to dim(list);
locid = list[index];
if not missing(locid) then output;
end;
drop index add_locid1-add_locid3 ;
run;
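To try the step above, a small input data set shaped like the question's table could be built as follows. This is a sketch under assumptions: the variable names follow the answer (add_locid1 to add_locid3), the gender values are taken from the desired output, and the MISSOVER option is my own choice for handling the short rows.

```sas
data have;
  /* MISSOVER leaves trailing variables missing when a row is short, */
  /* instead of reading them from the next line                      */
  infile datalines missover;
  input locid name $ gender $ add_locid1 add_locid2 add_locid3;
  datalines;
1 A F 2 3 5
1 B M 2
1 C F 2 4
;
run;
```

Running the DATA step from the answer against this input should write one row per non-missing location, which matches the nine rows of the desired output.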

SAS - if and then condition statements

My data set has more than 70,000 observations and more than 50 variables (Var1 to Var50). In each variable, there are about 30 groups (I'll use a to z). I am trying to get a selection of data using IF statements. I'd like to select every observation with the same group, e.g. observations with a in Var1 to Var30, observations with b in Var1 to Var30.
I seem to be writing
If (Var1="a" and Var2="a" and Var3="a" and Var4="a" and all the way to var50=
"a") or (Var1="b" and Var2="b" and Var3="b" and Var4="b" and all the way to var50=
"b")...
How do I consolidate? I tried using an array but it didn't work, and I was not sure if arrays work in IF/THEN statements.
IF (VAR2="A" or VAR2="B" or VAR2="C" or VAR2="D"
or VAR3="A" or VAR3="B" or VAR3="C" or VAR3="D"
or VAR4="A" or VAR4="B" or VAR4="C" or VAR4="D"
or VAR5="A" or VAR5="B" or VAR5="C" or VAR5="D"
or VAR6="A" or VAR6="B" or VAR6="C" or VAR6="D"
or VAR7="A" or VAR7="B" or VAR7="C" or VAR7="D"
or VAR8="A" or VAR8="B" or VAR8="C" or VAR8="D"
or VAR9="A" or VAR9="B" or VAR9="C" or VAR9="D"
or VAR10="A" or VAR10="B" or VAR10="C" or VAR10="D"
or VAR12="A" or VAR12="B" or VAR12="C" or VAR12="D"
or VAR13="A" or VAR13="B" or VAR13="C" or VAR13="D"
or VAR14="A" or VAR14="B" or VAR14="C" or VAR14="D"
or VAR15="A" or VAR15="B" or VAR15="C" or VAR15="D"
or VAR16="A" or VAR16="B" or VAR16="C" or VAR16="D"
or VAR17="A" or VAR17="B" or VAR17="C" or VAR17="D"
or VAR18="A" or VAR18="B" or VAR18="C" or VAR18="D"
or VAR19="A" or VAR19="B" or VAR19="C" or VAR19="D"
or VAR20="A" or VAR20="B" or VAR20="C" or VAR20="D"
or VAR21="D" or VAR22="A" or VAR22="B" or VAR22="C" or VAR22="D"
or VAR23="A" or VAR23="B" or VAR23="C" or VAR23="D"
or VAR24="A" or VAR24="B" or VAR24="C" or VAR24="D"
or VAR25="A" or VAR25="B" or VAR25="C" or VAR25="D"
or VAR26="A" or VAR26="B" or VAR26="C" or VAR26="D"
or VAR27="A" or VAR27="B" or VAR27="C" or VAR27="D"
or VAR28="A" or VAR28="B" or VAR28="C" or VAR28="D"
or VAR29="A" or VAR29="B" or VAR29="C" or VAR29="D"
or VAR30="A" or VAR30="B" or VAR30="C" or VAR30="D")
then Group=1; else Group=0;
You probably don't need a macro, however a macro might be faster.
%let value=a;
data want;
set have;
array var[50];
keepit=1;
do i=1 to 50;
keepit = keepit and (var[i]="&value");
if ^keepit then
leave;
end;
if keepit;
drop i keepit;
run;
I create a signal variable and update its value; it will be false if any value in the var[] array is not the &value. I leave the loop early as soon as we find one non-matching value, to make it more efficient.
It's not exactly clear what you want. If you want to avoid checking all variables one by one, you can use WHICHC to find whether any variable in a list equals 'a'.
X = whichc('a', of var1-var30);
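A minimal sketch of how the WHICHC result might be turned into a 0/1 flag. The data set names have_demo and want_demo, the variables first_match and any_a, and the two sample rows are all made up for illustration.

```sas
data have_demo;
  input (var1-var5) ($);
  datalines;
b c a d e
b c d e f
;
run;

data want_demo;
  set have_demo;
  /* WHICHC returns the position of the first listed variable equal to 'a', */
  /* or 0 when none of them match                                           */
  first_match = whichc('a', of var1-var5);
  /* a comparison evaluates to 1 (true) or 0 (false) */
  any_a = (first_match > 0);
run;
```

Keeping first_match as well as the flag shows which variable matched first, if that matters for the selection.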
If you want to see what different groups you have across all the variables, I think a big proc freq is what you want:
proc freq data=have noprint;
table var1*var2*var3*var4....*var30*gender*age / list out=table_counts;
run;
And then check the table_counts data set to see if that has what you want.
If neither of these are what you want, you need to add more details to your question. A sample of data and expected output would be perfect.
When I need to search several variables for a particular value what I will do is - combine all variables into one string and then search that string. Like this:
*** CREATE TEST DATA ***;
data have;
infile cards;
input VAR1 $ VAR2 $ VAR3 $ VAR4 $ VAR5 $;
cards;
J J K A M
S U I O P
D D D D D
l m n o a
Q U J C S
;
run;
data want;
set have;
*** USE CATS FUNCTION TO CONCATENATE ALL VAR# INTO ONE VARIABLE ***;
allvar = cats(var1, var2, var3, var4, var5);
*** IF NEEDED, APPLY UPCASE TO CONCATENATED VARIABLE ***;
*allvar = upcase(allvar);
*** USE INDEXC FUNCTION TO SEARCH NEW VARIABLE ***;
if indexc(allvar, "ABCD") > 0 then group = 1;
else group = 0;
run;
I'm not sure if this is exactly what you need, but hopefully this is something you can modify for your particular task.
The code as posted is testing if ANY of a list of variables match ANY of a list of values.
Let's make a simple test dataset.
data have ;
input id (var1-var5) ($);
cards;
1 E F G H I
2 A B C D E
;;;;
Make one array of the values you want to find and one array of the variables you want to check. Loop over the list of variables until you either find one that contains one of the values or you run out of variables to test.
data want ;
set have;
array values (4) $8 _temporary_ ('A' 'B' 'C' 'D');
array vars var1-var5 ;
group=0;
do i=1 to dim(vars) until (group=1);
if vars(i) in values then group=1;
end;
drop i;
run;
You could avoid the array for the list of values if you want.
if vars(i) in ('A' 'B' 'C' 'D') then group=1;
But using the array will allow you to make the loop run over the list of values instead of the list of variables.
do i=1 to dim(values) until (group=1);
if values(i) in vars then group=1;
end;
Which might be important if you wanted to keep the variable i to indicate which value (or variable) was first matched.

SAS, combine strings

I have a dataset test1, and I want to generate a key which is the combination of any of the specified variables, for example the key in ideal_1 or the key in ideal_2. I need to write a macro for this, but the challenge for me is that the number of variables is not fixed: as you can see, in ideal_1 it is the combination of 2 variables, and in ideal_2 it is the combination of 3.
data test1;
input var1 $ var2 $ var3 $ var4 $ var5 $;
datalines;
1 a a b e
2 a f b e
3 a a a a
1 b a a a
2 a f b e
;
run;
data ideal_1;
set test1;
key=strip(var1)||strip(var2);
run;
data ideal_2;
set test1;
key=strip(var1)||strip(var2)||strip(var5);
run;
Just use a variable list. You could store the list into a macro variable to make it easier to edit.
%let keylist=var1 var2 var5 ;
Then you can use the macro variable wherever you need it.
data ideal_2;
set test1;
key=cats(of &keylist);
run;
If the variables have a naming convention as in your example you can use something like the following, which uses the colon operator to concatenate all of the variables that start with the prefix VAR.
key = catt(of var:);
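One design point worth weighing: plain concatenation can produce the same key from different inputs (for example '1' and '23' versus '12' and '3'). If that matters, a delimiter could be added with CATX. The sketch below is only an illustration; the data set names test1_small and ideal_2_delim, the '|' delimiter, and the key length of 50 are assumptions.

```sas
data test1_small;
  input var1 $ var2 $ var3 $ var4 $ var5 $;
  datalines;
1 a a b e
2 a f b e
;
run;

%let keylist = var1 var2 var5;

data ideal_2_delim;
  set test1_small;
  length key $ 50;
  /* CATX strips each value and joins the non-blank ones with the delimiter */
  key = catx('|', of &keylist);
run;
```

Note that CATX skips blank values entirely, so rows with missing key variables may still need separate handling.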

how to select any 2 or more variables in SAS

I have a question about creating a new variable. I have several variables named A, B, C, D, E, F, G. All of them are 0/1 binary variables. I want to create a new variable which shows whether any 3 or more of those variables are equal to 1.
For example,
new_variable =0;
if ANY 3 or more variables(A,B,C,D,E,F,G) =1 then new_variable =1;
There's no way to write it with syntax like that, but since you have 0/1 binaries, there's a very easy way, if you think about it for a second, to see whether 3 or more are 1.
if sum(of a b c d e f g) >= 3 then new_Variable=1;
Actually a bit simpler:
new_Variable = (sum(of a b c d e f g) GE 3);
as true=1 false=0 when you evaluate a boolean expression.
If your data are in an array or with a common prefix, there is a way to do that more easily:
new_variable = (sum(of arrayname[*]) GE 3);
or
new_variable = (sum(of varprefix:) GE 3);
where arrayname is your array or varprefix is the common prefix your variables (and only your variables) share.
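A small self-contained sketch of the array form, assuming the seven flags are numeric 0/1 values. The data set name flags_demo, the array name flags, and the two sample rows are made up for illustration.

```sas
data flags_demo;
  input a b c d e f g;
  array flags[*] a b c d e f g;
  /* SUM adds the seven 0/1 flags; the comparison yields 1 or 0 */
  new_variable = (sum(of flags[*]) >= 3);
  datalines;
1 0 1 0 1 0 0
0 0 1 0 0 0 0
;
run;
```

The first row has three flags set, so new_variable is 1 there; the second row has only one, so it is 0.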
Edit: There is, sort of, a way to do this in a similar kind of syntax. Using countc:
data have;
call streaminit(7);
array vars[7] a b c d e f g;
do _n_ = 1 to 20;
do _i = 1 to dim(vars);
vars[_i] = rand('Binomial',.2,1);
end;
output;
end;
run;
data want;
set have;
if countc(cats(of a--g),'1') ge 3;
run;
If you had something other than 1/0, you could use catx to delimit them with a space or something, and then countw to look for the complete value; here, 11 will look like two 1s not eleven, if that were possible in the data.
There are a lot of other solutions, by the way; maybe some others will come and mention them. CALL SORTN and then look for the first instance of 1, for example.

How do I assign numeric values to the alphabet in SAS

I'm trying to convert a character string to a numeric variable and then sum the values of each character to use as a unique identifier for that field.
So for example, I would like A=1, B=2, C=3.....X=24 Y=25 Z=26.
Say my string is "CAB". After running the code I would like the result to be an intermediary column of numbers, where the value for CAB is 3 1 2, and a result column derived by summing the values of the intermediate column, 3+1+2 = 6, so the final value would be 6.
Here is the sas code I used to convert the characters to numbers, but I need help with the result column.
DATA CHAR_VALUE;
SET WORK.XYZ;
CHAR_2_NUM=TRANSLATE(MY_VAR_CHAR, '1 2 3 ...24 25 26', 'A B C ...X Y Z');
NUM_CHAR=INPUT(CHAR_2_NUM,32.);
RUN;
Thanks in advance...I appreciate any help or suggestions.
-rachel
RANK will give the ASCII numeric value underlying a character; so A=65, B=66, Z=90, a=97, z=122.
So this should work (if you want only the uppercase values - not a different value for a than A):
data test;
charval='CAB';
do _t=1 to length(Charval);
numval=sum(numval,rank(char(upcase(charval),_t))-64);
end;
put _all_;
run;
Another option (based on the comments below) is to build an informat with the relationships between letter and value. My loop iterates over each character A to Z; you can then put whatever value you want for each letter as the label (I just put 1,2,3,4... but label= will change that).
data fmts;
retain fmtname 'CHARNUM' type 'i';
do _t=65 to 90;
start=byte(_t); *the character, so byte(65)='A';
label=_t-64; *the resulting number;
output;
end;
run;
proc format cntlin=fmts;
quit;
data test;
charval='CAB';
do _t=1 to length(Charval);
numval=sum(numval,input(char(upcase(charval),_t),CHARNUM.));
end;
put _all_;
run;
Finally, if you want to be able to construct this in the same datastep, you could construct the relationships in a hash table and look up the result. I can explain that if desired, though I'd like to see a more detailed example of what you want to do in terms of defining the relationship between a letter and its code.
If you need to see the intermediate values, you can do that by inserting a CAT function in the loop- I recommend CATX:
data test;
charval='CAB';
format intermed $100.;
do _t=1 to length(Charval);
numval=sum(numval,input(char(upcase(charval),_t),CHARNUM.));
intermed=catx('|',intermed,input(char(upcase(charval),_t),CHARNUM.)); *or the RANK portion from earlier;
end;
put _all_;
run;
That would give you 3|1|2, which you could then do math on via SCAN:
do _t = 1 to countc(intermed,'|')+1;
numval2 = sum(numval2,input(scan(intermed,_t,'|'),8.)); *INPUT converts the scanned text to a number explicitly;
end;
Your method to try and translate is a good attempt, but it will not really work. Here is a simple solution:
DATA CHAR_VALUE;
retain all_chars 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
set XYZ;
length CHAR_2_NUM $200;
CHAR_2_NUM = ' ';
NUM_CHAR = 0;
do i=1 to length(MY_VAR_CHAR);
if i=1 then CHAR_2_NUM = substr(MY_VAR_CHAR,i,1);
else CHAR_2_NUM = trim(CHAR_2_NUM) || ' ' || substr(MY_VAR_CHAR,i,1);
NUM_CHAR + index(all_chars,substr(MY_VAR_CHAR,i,1));
end;
drop i all_chars;
RUN;
This takes advantage of the fact that the indexed position of each character of your source variable in the all_chars variable corresponds to the mapping you desired.
UPDATED to also create your CHAR_2_NUM variable, which I overlooked in the original question.
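The position-lookup idea can be checked on its own with a couple of literal strings. This is only a sketch: the variable names word and total and the test strings are made up for illustration, and it assumes uppercase input.

```sas
data _null_;
  all_chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
  do word = 'CAB', 'XYZ';
    total = 0;
    do i = 1 to length(word);
      /* INDEX returns the 1-based position of the letter in all_chars, */
      /* or 0 if the character is not found                             */
      total + index(all_chars, substr(word, i, 1));
    end;
    put word= total=;
  end;
run;
```

Because INDEX returns 0 for anything outside A to Z, characters such as digits or lowercase letters silently contribute nothing to the sum; applying UPCASE first may be appropriate depending on the data.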
Another simple solution is based on the collate function:
To convert a variable called MyNumbers (in the range of 1 to 26) to English upper-case characters, one can use:
collate(64 + MyNumbers, 64 + MyNumbers)
To obtain lower-case characters, one can use:
collate(96 + MyNumbers, 96 + MyNumbers)
Here's a quick example:
data _null_;
do MyNumbers = 1 to 26;
MyLettersUpper = collate(64 + MyNumbers, 64 + MyNumbers);
MyLettersLower = collate(96 + MyNumbers, 96 + MyNumbers);
put MyNumbers MyLettersUpper MyLettersLower;
end;
run;
1 A a
2 B b
3 C c
4 D d
5 E e
6 F f
7 G g
8 H h
9 I i
10 J j
11 K k
12 L l
13 M m
14 N n
15 O o
16 P p
17 Q q
18 R r
19 S s
20 T t
21 U u
22 V v
23 W w
24 X x
25 Y y
26 Z z
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds