I have a question about creating a new variable. I have several variables named A, B, C, D, E, F, G. All of them are 0/1 binary variables. I want to create a new variable that indicates whether any 3 or more of those variables are equal to 1.
For example,
new_variable =0;
if ANY 3 or more variables(A,B,C,D,E,F,G) =1 then new_variable =1;
There's no way to write the syntax quite like you have it, but since you have 0/1 binaries, there's a very easy way to see whether 3 or more are 1 if you think about it for a second.
if sum(of a b c d e f g) >= 3 then new_Variable=1;
Actually a bit simpler:
new_Variable = (sum(of a b c d e f g) GE 3);
as true=1 false=0 when you evaluate a boolean expression.
If your data are in an array or with a common prefix, there is a way to do that more easily:
new_variable = (sum(of arrayname[*]) GE 3);
or
new_variable = (sum(of varprefix:) GE 3);
where arrayname is your array or varprefix is the common prefix your variables (and only your variables) share.
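For what it's worth, here is a minimal self-contained sketch of the sum approach. The dataset name demo and the two sample rows are made up for illustration.

```sas
/* Hypothetical sample data: two rows of 0/1 flags */
data demo;
  input a b c d e f g;
  /* the comparison itself evaluates to 1 (true) or 0 (false) */
  new_variable = (sum(of a b c d e f g) ge 3);
  datalines;
1 1 1 0 0 0 0
0 1 0 0 0 0 1
;
run;
```

The first row has three 1s, so new_variable is 1. The second has only two, so it is 0. SUM ignores missing values, so a missing flag simply does not count toward the total.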
Edit: There is, sort of, a way to do this in a similar kind of syntax. Using countc:
data have;
call streaminit(7);
array vars[7] a b c d e f g;
do _n_ = 1 to 20;
do _i = 1 to dim(vars);
vars[_i] = rand('Binomial',.2,1);
end;
output;
end;
run;
data want;
set have;
if countc(cats(of a--g),'1') ge 3;
run;
If you had something other than 1/0, you could use catx to delimit them with a space or something, and then countw to look for the complete value; here, 11 will look like two 1s not eleven, if that were possible in the data.
There are a lot of other solutions, by the way; maybe some others will come and mention them. CALL SORTN and then look for the first instance of 1, for example.
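A rough sketch of the CALL SORTN idea, assuming the have dataset created above and strictly 0/1 values. The temporary array s and the index _j are names introduced here. The values are copied first because CALL SORTN rearranges its arguments in place, which would otherwise shuffle the data across a-g.

```sas
data want;
  set have;
  array v[7] a b c d e f g;
  array s[7] _temporary_;       /* scratch copy, not written to the output */
  do _j = 1 to dim(v);
    s[_j] = v[_j];
  end;
  call sortn(of s[*]);          /* ascending: any 1s end up at the high end */
  /* with 7 sorted 0/1 values, s[5] = 1 means s[5], s[6], s[7] are all 1 */
  new_variable = (s[5] = 1);
  drop _j;
run;
```

This is more work than the sum, so it is mostly of interest when the sorted values are needed for something else as well.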
I have two groups, A and B, and two numeric variables, X and Y. I want to create two new variables, new1 and new2, based on the values of X and Y (respectively) for group B (i.e., IF group = B THEN new1 = X, new2 = Y). I want to take those newly created variables, append them to group A, and then delete group B. In the end, there should be one row for group A containing X, Y, new1, and new2. I'm uncertain how to accomplish this.
I've looked into using PROC TRANSPOSE, but I'm unsure if that's the right starting point. My internet searches are lacking because I'm not even sure what to call what I'm attempting to do, though I'm betting this is a common procedure requiring a common solution.
EXAMPLE
Not sure how to generalize the problem, but for the given problem this will work:
/* Just reversing the records */
proc sort data = have;
by descending group;
run;
data want;
set have;
retain new1 new2;
if _N_ = 1 then do;
new1 = x;
new2 = y;
end;
else output;
run;
This sounds like a case of 1 to 1 merging (merge without BY).
data have; input
group $1. x y; datalines;
A 3 4
B 2 6
run;
data want;
merge
have(where=( group='A'))
have(where=(Bgroup='B') rename=(x=Bx y=By group=Bgroup))
;
drop Bgroup;
run;
My data has more than 70,000 rows and more than 50 variables (Var1 to Var50). Each variable takes one of about 30 groups (I'll use a to z). I am trying to select data using IF statements: I'd like to select every row whose variables all have the same group, e.g. rows where var 1 to 30 are all a, rows where var 1 to 30 are all b.
I seem to be writing
If (Var1="a" and Var2="a" and Var3="a" and Var4="a" and all the way to Var50="a")
or (Var1="b" and Var2="b" and Var3="b" and Var4="b" and all the way to Var50="b")...
How do I consolidate? I tried using an array but it didn't work, and I was not sure whether arrays work in an IF-THEN statement.
IF (VAR2="A" or VAR2="B" or VAR2="C" or VAR2="D"
or VAR3="A" or VAR3="B" or VAR3="C" or VAR3="D"
or VAR4="A" or VAR4="B" or VAR4="C" or VAR4="D"
or VAR5="A" or VAR5="B" or VAR5="C" or VAR5="D"
or VAR6="A" or VAR6="B" or VAR6="C" or VAR6="D"
or VAR7="A" or VAR7="B" or VAR7="C" or VAR7="D"
or VAR8="A" or VAR8="B" or VAR8="C" or VAR8="D"
or VAR9="A" or VAR9="B" or VAR9="C" or VAR9="D"
or VAR10="A" or VAR10="B" or VAR10="C" or VAR10="D"
or VAR12="A" or VAR12="B" or VAR12="C" or VAR12="D"
or VAR13="A" or VAR13="B" or VAR13="C" or VAR13="D"
or VAR14="A" or VAR14="B" or VAR14="C" or VAR14="D"
or VAR15="A" or VAR15="B" or VAR15="C" or VAR15="D"
or VAR16="A" or VAR16="B" or VAR16="C" or VAR16="D"
or VAR17="A" or VAR17="B" or VAR17="C" or VAR17="D"
or VAR18="A" or VAR18="B" or VAR18="C" or VAR18="D"
or VAR19="A" or VAR19="B" or VAR19="C" or VAR19="D"
or VAR20="A" or VAR20="B" or VAR20="C" or VAR20="D"
or VAR21="D" or VAR22="A" or VAR22="B" or VAR22="C" or VAR22="D"
or VAR23="A" or VAR23="B" or VAR23="C" or VAR23="D"
or VAR24="A" or VAR24="B" or VAR24="C" or VAR24="D"
or VAR25="A" or VAR25="B" or VAR25="C" or VAR25="D"
or VAR26="A" or VAR26="B" or VAR26="C" or VAR26="D"
or VAR27="A" or VAR27="B" or VAR27="C" or VAR27="D"
or VAR28="A" or VAR28="B" or VAR28="C" or VAR28="D"
or VAR29="A" or VAR29="B" or VAR29="C" or VAR29="D"
or VAR30="A" or VAR30="B" or VAR30="C" or VAR30="D")
then Group=1; else Group=0;
You probably don't need a macro, however a macro might be faster.
%let value=a;
data want;
set have;
array var[50];
keepit=1;
do i=1 to 50;
keepit = keepit and (var[i]="&value");
if ^keepit then
leave;
end;
if keepit;
drop i keepit;
run;
I create a flag variable and update its value: it becomes false if any value in the var[] array is not &value. I leave the loop early as soon as we find one non-matching value, to make it more efficient.
It's not exactly clear what you want. If you want to avoid checking all variables you can use WHICHC to find if any in a list are A.
X = whichc('a', of var1-var30);
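As a small illustration of what WHICHC gives back: it returns the position of the first argument in the list that equals the search value, or 0 when none does. The dataset demo, its three variables, and the names first_a and any_a are made up for this sketch.

```sas
/* Hypothetical sample data with three character variables */
data demo;
  input (var1-var3) ($);
  first_a = whichc('a', of var1-var3);  /* position of first match, 0 if none */
  any_a   = (first_a > 0);              /* 1 if any variable equals 'a' */
  datalines;
b a a
c c c
;
run;
```

For the first row first_a is 2 and any_a is 1. For the second row both are 0. The comparison is case sensitive, so 'A' would not match 'a'.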
If you want to see what different groups you have across all the variables, I think a big proc freq is what you want:
proc freq data=have noprint;
table var1*var2*var3*var4....*var30*gender*age / list out=table_counts;
run;
And then check the table_counts data set to see if that has what you want.
If neither of these is what you want, you need to add more details to your question. A sample of data and expected output would be perfect.
When I need to search several variables for a particular value what I will do is - combine all variables into one string and then search that string. Like this:
*** CREATE TEST DATA ***;
data have;
infile cards;
input VAR1 $ VAR2 $ VAR3 $ VAR4 $ VAR5 $;
cards;
J J K A M
S U I O P
D D D D D
l m n o a
Q U J C S
;
run;
data want;
set have;
*** USE CATS FUNCTION TO CONCATENATE ALL VAR# INTO ONE VARIABLE ***;
allvar = cats(var1, var2, var3, var4, var5);
*** IF NEEDED, APPLY UPCASE TO CONCATENATED VARIABLE ***;
*allvar = upcase(allvar);
*** USE INDEXC FUNCTION TO SEARCH NEW VARIABLE ***;
if indexc(allvar, "ABCD") > 0 then group = 1;
else group = 0;
run;
I'm not sure if this is exactly what you need, but hopefully this is something you can modify for your particular task.
The code as posted is testing if ANY of a list of variables match ANY of a list of values.
Let's make a simple test dataset.
data have ;
input id (var1-var5) ($);
cards;
1 E F G H I
2 A B C D E
;;;;
Make one array of the values you want to find and one array of the variables you want to check. Loop over the list of variables until you either find one that contains one of the values or you run out of variables to test.
data want ;
set have;
array values (4) $8 _temporary_ ('A' 'B' 'C' 'D');
array vars var1-var5 ;
group=0;
do i=1 to dim(vars) until (group=1);
if vars(i) in values then group=1;
end;
drop i;
run;
You could avoid the array for the list of values if you want.
if vars(i) in ('A' 'B' 'C' 'D') then group=1;
But using the array will allow you to make the loop run over the list of values instead of the list of variables.
do i=1 to dim(values) until (group=1);
if values(i) in vars then group=1;
end;
Which might be important if you wanted to keep the variable i to indicate which value (or variable) was first matched.
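A sketch of that last point, assuming the have dataset from above. The variable first_match is a name introduced here. It records the position of the first variable that matched and stays 0 when nothing matched.

```sas
data want;
  set have;
  array vars var1-var5;
  first_match = 0;
  /* stop at the first variable holding one of the target values */
  do i = 1 to dim(vars) until (first_match > 0);
    if vars(i) in ('A' 'B' 'C' 'D') then first_match = i;
  end;
  group = (first_match > 0);
  drop i;
run;
```

With the two test rows, id 1 gets first_match=0 and group=0, while id 2 gets first_match=1 and group=1. Copying the index into its own variable avoids relying on the value of i after the loop ends.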
I have a dataset test1, and I want to generate a key that is the combination of any of the specified variables, for example the key in ideal_1 or the key in ideal_2. I need to write a macro for this, but the challenge for me is that the number of variables is not fixed: as you can see, in ideal_1 it is the combination of 2 variables, and in ideal_2 it is the combination of 3.
data test1;
input var1$ var2$ var3$ var4$ var5$;
datalines;
1 a a b e
2 a f b e
3 a a a a
1 b a a a
2 a f b e
;
run;
data ideal_1;
set test1;
key=strip(var1)||strip(var2);
run;
data ideal_2;
set test1;
key=strip(var1)||strip(var2)||strip(var5);
run;
Just use a variable list. You could store the list into a macro variable to make it easier to edit.
%let keylist=var1 var2 var5 ;
Then you can use the macro variable wherever you need it.
data ideal_2;
set test1;
key=cats(of &keylist);
run;
If the variables have a naming convention as in your example you can use something like the following, which uses the colon operator to concatenate all of the variables that start with the prefix VAR.
key = catt(of var:);
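One thing to watch, offered as a hedged aside: with no delimiter, different combinations of values can produce the same key (for example '1' followed by 'ab' and '1a' followed by 'b' both give '1ab'). If that matters for your data, CATX with a separator is one option. The '|' separator and the length of 50 are arbitrary choices for this sketch, assuming '|' never occurs in the values.

```sas
%let keylist = var1 var2 var5;

data ideal_2;
  set test1;
  length key $ 50;               /* assumed wide enough for the combined values */
  key = catx('|', of &keylist);  /* strips blanks and inserts '|' between values */
run;
```

CATX skips missing values, so a row with an empty var2 would get a key built from var1 and var5 only.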
I am fairly comfortable programming in R but am working on a scholarly statistical analysis that my PI would much prefer be done in SAS. I am using SAS University Edition and thus cannot use the new submit / R to do the things I am uncomfortable doing in SAS. In any case, I am trying to conditionally count the frequency of a given character result across multiple columns, using the toy data set below:
DATA example;
INPUT X01_d3 $ X02_d3 $ X03_d3 $ X04_d3 $;
CARDS;
H H F D
H H H H
H D D D
F F F D
F F D D
H . . .
H F . D
;
RUN;
I want to count the number of times that "H" appears for a given observation and put it into a new variable called Num_H. How I would typically code this in R would be:
example$Num_H<-rowSums(example[,1:4] == "H")
giving me the following output:
> example
  X01_d3 X02_d3 X03_d3 X04_d3 Num_H
1      H      H      F      D     2
2      H      H      H      H     4
3      H      D      D      D     1
4      F      F      F      D     0
5      F      F      D      D     0
6      H      .      .      .     1
7      H      F      .      D     1
I could easily write this in a data step using if/then statements, but based on the size of the data set I would prefer not to. Is there an easier way to do this in SAS in a DATA step, PROC SQL, or otherwise? Thank you in advance for the help.
First off: in using SAS vs R, you're going to find things that are easier to do in one versus the other all the time. Since R is a matrix language, and Base SAS is not, things like 'scan every element in this list ...' will be one of the things R does more efficiently than SAS.
That said, there's an easy way to do this:
data want;
set example;
num_h = lengthn(trimn(compress(cats(of _character_),'H','k')));
run;
COMPRESS eliminates characters not 'H' and then the other things make it so it works properly (trimn/lengthn make it so it doesn't count empty ' ' as one, cats takes all of the char variables and makes them a single string).
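To see the pieces one at a time, here is a small sketch using literal values in place of the dataset variables. The variable names s, kept, n and n2 are made up for the illustration.

```sas
data _null_;
  s    = cats('H', 'H', 'F', 'D');   /* concatenated string: HHFD */
  kept = compress(s, 'H', 'k');      /* 'k' keeps only the listed characters: HH */
  n    = lengthn(trimn(kept));       /* length without trailing blanks: 2 */
  n2   = countc(s, 'H');             /* same count taken directly: 2 */
  put s= kept= n= n2=;
run;
```

The COUNTC line is an alternative worth considering: it counts the occurrences of 'H' in the concatenated string directly, so countc(cats(of _character_), 'H') should give the same result for single-character values.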
If your data were more complicated, where you couldn't use this trick (such as multiple character strings), you could certainly loop over the variables to get your result.
data want;
set example;
array xvars x01_d3 -- x04_d3;
do _i = 1 to dim(xvars);
num_h = sum(num_h, xvars[_i]='H');
end;
drop _i;
run;
A little longer of course to write, but gets the job done pretty easily.
As an alternate option, if you are using SAS University Edition, you have access to SAS/IML, which is SAS's matrix language (i.e., similar to R). IML isn't identical to R, and you'll still have some issues adjusting to it undoubtedly, but it is a matrix language, so you'll probably find this a bit easier.
Here's the IML program that would produce the vector you're asking for.
proc iml;
use work.example;
read all var _CHAR_ into char_mat;
for_num_h = countc('H',char_mat)[,+];
print for_num_h;
quit;
Here, I apply the SAS function countc to generate a matrix of 1/0 (it's done at the cell level); then use the subscript reduction operator for addition to sum them into a vector.
I would do it this way:
Data want;
set example;
Num_H = sum((X01_d3="H"), (X02_d3="H"),(X03_d3="H"),(X04_d3="H"));
run;
In fact, (X01_d3="H") creates a 0/1 dummy variable, so all you have to do is sum these values!
Hope it helps!
MK
I have a SAS dataset as follows:
Key A B C D E
001 1 . 1 . 1
002 . 1 . 1 .
Besides keeping the existing variables, I want to replace each variable's value with the variable's name: if variable A has the value 1, then the new variable should have the value A, otherwise blank.
Currently I am hardcoding the values. Does anyone have a better solution?
The following should do the trick (the first data step sets up the example):-
data test_data;
length key A B C D E 3;
format key z3.; ** Force leading zeroes for KEY;
key=001; A=1; B=.; C=1; D=.; E=1; output;
key=002; A=.; B=1; C=.; D=1; E=.; output;
proc sort;
by key;
run;
data results(drop = _: i);
set test_data(rename=(A=_A B=_B C=_C D=_D E=_E));
array from_vars[*] _:;
array to_vars[*] $1 A B C D E;
do i=1 to dim(from_vars);
to_vars[i] = ifc( from_vars[i], substr(vname(from_vars[i]),2), '');
end;
run;
It all looks a little awkward as we have to rename the original (assumed numeric) variables to then create same-named character variables that can hold values 'A', 'B', etc.
If your 'real' data has many more variables, the renaming can be laborious so you might find a double proc transpose more useful:-
proc transpose data = test_data out = test_data_tran;
by key;
proc transpose data = test_data_tran out = results2(drop = _:);
by key;
var _name_;
id _name_;
where col1;
run;
However, your variables will be in the wrong order on the output dataset and will be of length $8 rather than $1, which can be a waste of space. If either point is important (they seldom are), both can be remedied by following up with a length statement in a subsequent data step:-
option varlenchk = nowarn;
data results2;
length A B C D E $1;
set results2;
run;
option varlenchk = warn;
This organises the variables in the right order and minimises their length. Still, you're now hard-coding your variable names which means you might as well have just stuck with the original array approach.