I have a dataset similar to this (Se is variable Serious):
Se D L C Di H O
N
N
N
Y N Y N N N N
N
N
N
Y N Y N Y N N
The variables Death (D), Life_Threat (L), Congenital_anom (C), Disability (Di), Hospitalization (H) and Other (O) have to map to a value from 1 (for Death) to 6 (for Other), and I have to record these values in a new variable. When a row has more than one 'Y', as in the last line above with two, the new variable should list all of the matching values, like this: 2,4. How can I do it? (I have lines where all 6 variables are Y.)
Thanks in advance
I don’t know how to set the new variable so that it considers all the 6 variables for which I have Y
You can use the function catx to combine values to one string:
Generating your sample data:
data have;
input Se $1. D $1. L $1. C $1. Di $1. H $1. O $1.;
datalines;
N
N
N
YNYNNNN
N
N
N
YNYNYNN
YYYYYYY
;
run;
Here, an array is defined that consists of the six input variables and the function catx is used. In "synthesis" you will find the lists of Y-values in the form "2" or "2,4", respectively, or "1,2,3,4,5,6" in the last example.
data want (drop=i);
set have;
length synthesis $12;
array yn {6} D L C Di H O;
do i=1 to 6;
if yn{i}="Y" then synthesis=catx(',',synthesis,put(i,1.)); /* i is numeric, so use the numeric format 1. */
end;
run;
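To see why catx needs no special handling for the first value, here is a minimal sketch. The variable name s is only for illustration. catx strips blanks and skips empty arguments, so appending to a still-blank string does not produce a leading comma.

```sas
data _null_;
  length s $12;
  s = catx(',', s, '2');  /* s is blank here, so the result is just "2" */
  s = catx(',', s, '4');  /* now the delimiter is inserted: "2,4" */
  put s=;                 /* writes s=2,4 to the log */
run;
```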
I have the following SAS dataset.
correlation policynum risknum
A X Y
A X Y
A X Y
B X Y
B X Y
B X Y
B X L
B X L
B X L
C Z M
C Z M
C Z M
D Z M
D Z M
D Z M
In SAS, I want to filter the above dataset so I get my final output as:
correlation policynum risknum
B X Y
B X Y
B X Y
B X L
B X L
B X L
D Z M
D Z M
D Z M
i.e. for each group of policynum and risknum, if multiple values exist for correlation, I want to keep the second value and get rid of the first value.
If only a single value of correlation exists for a group of policynum and risknum, I want to retain that group in my final output too.
What would be the best way to do this? It might be something simple as I am relatively new to SAS.
Thanks in advance!
If the correlation value you want to keep is always the highest one in sort order within its group, you can use SQL. Otherwise SQL, being based on set theory with no implicit row numbers, cannot be used, and a DATA step with a DOW loop is needed.
Example:
FYI, one common situation in which SAS coders use the phrase 'DOW loop' is when SET & BY statements occur inside a DO loop.
data have;
input correlation $ policynum $ risknum $;
datalines;
A X Y
A X Y
A X Y
B X Y
B X Y
B X Y
B X L
B X L
B X L
C Z M
C Z M
C Z M
D Z M
D Z M
D Z M
;
/* keep last group of a nested group */
* SQL can be used only if correlation wanted is ALWAYS highest valued correlation;
proc sql;
create table want as
select * from have
group by policynum, risknum
having correlation = max(correlation)
;
quit;
* DATA Step DOW loops can be used when correlation wanted is last occurring correlation within by group;
data want;
do _n_ = 1 by 1 until (last.risknum); /* one pass per policynum/risknum group */
set have;
by policynum risknum notsorted; /* presume at least contiguous */
end;
_want_correlation = correlation;
do _n_ = 1 to _n_;
set have;
if _want_correlation = correlation then OUTPUT;
end;
run;
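If the double SET inside one step feels opaque, the same idea can be sketched in two separate steps. This is only an illustration, not the approach from the answer: the names last_corr, keep_corr and want_alt are made up here, and it assumes each policynum/risknum combination forms a single contiguous block of rows. Note that PROC SQL does not promise to preserve the original row order.

```sas
/* Step 1: remember the last correlation seen in each policynum/risknum block */
data last_corr(keep=policynum risknum keep_corr);
  set have;
  by policynum risknum notsorted;
  if last.risknum;            /* keep only the final row of each block */
  keep_corr = correlation;
run;

/* Step 2: keep the rows of HAVE whose correlation matches that value */
proc sql;
  create table want_alt as
  select h.*
  from have as h
       inner join last_corr as k
         on  h.policynum   = k.policynum
         and h.risknum     = k.risknum
         and h.correlation = k.keep_corr;
quit;
```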
I would like to add a new column to a dataset but I am not sure how to do so. My dataset has a variable called KEYVAR (character variable) with three different values. A participant can appear multiple times in my dataset, with each row containing a similar or different value for KEYVAR. What I want to do is create a new variable called NEWVAR that counts how many times a participant has a specific value for KEYVAR; when a participant does not have an observation for that specific value, I want NEWVAR to have a result of zero.
Here's an example of the dataset I would like (in this example, I want to count every instance of "Y" per participant as NEWVAR):
have
PARTICIPANT KEYVAR
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
want
PARTICIPANT KEYVAR NEWVAR
A Y 1
A N 1
B Y 3
B Y 3
B Y 3
C W 0
C N 0
C W 0
D Y 2
D N 2
D N 2
D Y 2
D W 2
You can use Proc SQL to compute an aggregate result over a group meeting a criterion, and have that aggregate value automatically merged into the result set.
-OR-
Use a MEANS, TRANSPOSE, MERGE approach
Sample Code (SQL)
data have;
input ID $ value $; datalines;
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
E X
;
proc sql;
create table want as
select ID, value
, sum(value='Y') as Y_COUNT /* relies on logic eval 'math' 0 false, 1 true */
, sum(value='N') as N_COUNT
, sum(value='W') as W_COUNT
from have
group by ID
;
quit;
Sample Code (PROC and MERGE)
* format for PRELOADFMT and COMPLETETYPES;
proc format;
value $eachvalue
'Y' = 'Y'
'N' = 'N'
'W' = 'W'
other = '-'
;
run;
* Count how many per combination ID/VALUE;
proc means noprint data=have nway completetypes;
class ID ;
class value / preloadfmt;
format value $eachvalue.;
output out=freqs(keep=id value _freq_);
run;
* TRANSPOSE reshapes to wide (across) data layout, one row per ID;
proc transpose data=freqs suffix=_count out=counts_across(drop=_name_);
by id;
id value;
var _freq_;
where put(value,$eachvalue.) ne '-';
run;
* MERGE;
data want_way_2;
merge have counts_across;
by id;
run;
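A third possibility, sketched here as an assumption rather than as part of the answer above, is a double DOW loop in a single DATA step. It counts the 'Y' rows of each ID in a first pass and writes the rows back out with the count attached in a second pass, keeping the original row order. The names want_dow and newvar are chosen for illustration, and have is assumed to be grouped by ID as in the sample data.

```sas
data want_dow;
  newvar = 0;                        /* reset the counter for each ID */
  do _n_ = 1 by 1 until (last.ID);   /* pass 1: count the 'Y' rows of this ID */
    set have;
    by ID;
    newvar = newvar + (value = 'Y'); /* the comparison evaluates to 1 or 0 */
  end;
  do _n_ = 1 to _n_;                 /* pass 2: re-read the same rows */
    set have;
    output;                          /* each row carries the group count */
  end;
run;
```

With the sample data this should give newvar values of 1 for A, 3 for B, 0 for C, 2 for D and 0 for E.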
I am fairly comfortable programming in R but am working on a scholarly statistical analysis that my PI would much prefer would be done in SAS. I am using SAS University Edition and thus cannot use the new submit / R to do the things I am uncomfortable doing in SAS. In any case, I am trying to conditionally count the frequency of a given character result across multiple columns. using the below toy data set:
DATA example;
INPUT X01_d3 $ X02_d3 $ X03_d3 $ X04_d3 $;
CARDS;
H H F D
H H H H
H D D D
F F F D
F F D D
H . . .
H F . D
;
RUN;
I am wanting to count the number of times that "H" appears for a given observation and put it into a new variable called Num_H. How I would typically code this in R would be:
example$Num_H<-rowSums(example[,1:4] == "H")
giving me the following output:
> example
X01_d3 X02_d3 X03_d3 X04_d3 Num_H
1 H H F D 2
2 H H H H 4
3 H D D D 1
4 F F F D 0
5 F F D D 0
6 H . . . 1
7 H F . D 1
I could easily write this in a data step using if/then statements but based on the size of the data set I would prefer not to. Is there an easier way to do this in SAS in a DATA step, PROC SQL, or otherwise? Thank you in advance for the help.
First off: in using SAS vs R, you're going to find things that are easier to do in one versus the other all the time. Since R is a matrix language, and Base SAS is not, things like 'scan every element in this list ...' will be one of the things R does more efficiently than SAS.
That said, there's an easy way to do this:
data want;
set example;
num_h = lengthn(trimn(compress(cats(of _character_),'H','k')));
run;
COMPRESS eliminates characters not 'H' and then the other things make it so it works properly (trimn/lengthn make it so it doesn't count empty ' ' as one, cats takes all of the char variables and makes them a single string).
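The pieces of that one-liner can be unpacked on a single literal row. This is a small sketch with illustrative variable names (joined, kept, n) that are not part of the original answer.

```sas
data _null_;
  joined = cats('H', 'H', 'F', 'D');    /* concatenate the values: "HHFD" */
  kept   = compress(joined, 'H', 'k');  /* 'k' keeps only the listed characters: "HH" */
  n      = lengthn(kept);               /* trailing blanks are ignored: 2 */
  put joined= kept= n=;                 /* writes joined=HHFD kept=HH n=2 */
run;
```

lengthn is used instead of length because it returns 0 for an all-blank string, which is what a row with no H needs.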
If your data were more complicated, where you couldn't use this trick (such as multiple character strings), you could certainly loop over the variables to get your result.
data want;
set example;
array xvars x01_d3 -- x04_d3;
do _i = 1 to dim(xvars);
num_h = sum(num_h, xvars[_i]='H');
end;
drop _i;
run;
A little longer of course to write, but gets the job done pretty easily.
As an alternate option, if you are using SAS University Edition, you have access to SAS/IML, which is SAS's matrix language (i.e., similar to R). IML isn't identical to R, and you'll still have some issues adjusting to it undoubtedly, but it is a matrix language, so you'll probably find this a bit easier.
Here's the IML program that would produce the vector you're asking for.
proc iml;
use work.example;
read all var _CHAR_ into char_mat;
for_num_h = countc(char_mat,'H')[,+]; /* countc(string, characters): the data matrix goes first */
print for_num_h;
quit;
Here, I apply the SAS function countc to generate a matrix of 1/0 (it's done at the cell level); then use the subscript reduction operator for addition to sum them into a vector.
I would do it this way:
Data want;
set example;
Num_H = sum((X01_d3="H"), (X02_d3="H"),(X03_d3="H"),(X04_d3="H"));
run;
In fact (X01_d3="H") creates a 0/1 dummy value, so all you have to do is sum these values!
Hope it helps!
MK
I have a question about creating a new variable. I have several variables named A, B, C, D, E, F, G. All of them are 0/1 binary variables. I want to create a new variable that indicates whether any 3 or more of those variables equal 1.
For example,
new_variable =0;
if ANY 3 or more variables(A,B,C,D,E,F,G) =1 then new_variable =1;
There isn't really a way to write it with syntax like that, but since you have 0/1 binaries, there's a very easy way to check whether 3 or more are 1, if you think about it for a second.
if sum(of a b c d e f g) >= 3 then new_Variable=1;
Actually a bit simpler:
new_Variable = (sum(of a b c d e f g) GE 3);
as true=1 false=0 when you evaluate a boolean expression.
If your data are in an array or with a common prefix, there is a way to do that more easily:
new_variable = (sum(of arrayname[*]) GE 3);
or
new_variable = (sum(of varprefix:) GE 3);
where arrayname is your array or varprefix is the common prefix your variables (and only your variables) share.
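As a quick sketch of both forms side by side, assuming seven made-up variables flag1-flag7 and an array called bits (neither name comes from the question):

```sas
data demo;
  input flag1-flag7;
  array bits[*] flag1-flag7;
  by_prefix = (sum(of flag:)   ge 3);  /* every variable whose name starts with "flag" */
  by_array  = (sum(of bits[*]) ge 3);  /* every element of the array */
  datalines;
1 0 1 0 1 0 0
0 0 1 0 0 0 1
;
run;
```

The first row has three 1s, so both new variables are 1; the second row has two, so both are 0. The result variables deliberately do not start with "flag", otherwise the prefix list could pick them up too.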
Edit: There is, sort of, a way to do this in a similar kind of syntax. Using countc:
data have;
call streaminit(7);
array vars[7] a b c d e f g;
do _n_ = 1 to 20;
do _i = 1 to dim(vars);
vars[_i] = rand('Binomial',.2,1);
end;
output;
end;
run;
data want;
set have;
if countc(cats(of a--g),'1') ge 3;
run;
If you had something other than 1/0, you could use catx to delimit them with a space or something, and then countw to look for the complete value; here, 11 will look like two 1s not eleven, if that were possible in the data.
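A minimal sketch of that delimited variant, with made-up values where one variable holds 11. Here countw only gives the number of words, and scan does the whole-value comparison; joined, n_ones and _i are illustrative names.

```sas
data _null_;
  a = 1; b = 11; c = 1; d = 0;
  n_ones = 0;
  joined = catx(' ', a, b, c, d);            /* "1 11 1 0" */
  do _i = 1 to countw(joined, ' ');          /* one pass per delimited word */
    n_ones + (scan(joined, _i, ' ') = '1');  /* compare whole words, so 11 is not a match */
  end;
  put joined= n_ones=;                       /* writes joined=1 11 1 0 n_ones=2 */
run;
```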
There are a lot of other solutions, by the way; maybe some others will come and mention them. CALL SORTN and then look for the first instance of 1, for example.
I'm trying to convert a character string to a numeric variable and then sum the values of each character to use as a unique identifier for that field.
So for example, I would like A=1, B=2, C=3.....X=24 Y=25 Z=26.
Say my string is "CAB". After running the code I would like an intermediary column of numbers, where the value for CAB is 3 1 2, and a result column derived by summing those numbers, 3+1+2 = 6, so the final value would be 6.
Here is the sas code I used to convert the characters to numbers, but I need help with the result column.
DATA CHAR_VALUE;
SET WORK.XYZ;
CHAR_2_NUM=TRANSLATE(MY_VAR_CHAR, '1 2 3 ...24 25 26', 'A B C ...X Y Z');
NUM_CHAR=INPUT(CHAR_2_NUM,32.);
RUN;
Thanks in advance...I appreciate any help or suggestions.
-rachel
RANK will give the ASCII numeric value underlying a character; so A=65, B=66, Z=90, a=97, z=122.
So this should work (if you want only the uppercase values - not a different value for a than A):
data test;
charval='CAB';
do _t=1 to length(Charval);
numval=sum(numval,rank(char(upcase(charval),_t))-64);
end;
put _all_;
run;
Another option (Based on the comments below), is to build an informat with the relationships between letter and value. My loop iterates over each character A to Z, you can then put whatever value you want for each letter as label (I just put 1,2,3,4... but label= will change that).
data fmts;
retain fmtname 'CHARNUM' type 'i';
do _t=65 to 90;
start=byte(_t); *the character, so byte(65)='A';
label=_t-64; *the resulting number;
output;
end;
run;
proc format cntlin=fmts;
run;
data test;
charval='CAB';
do _t=1 to length(Charval);
numval=sum(numval,input(char(upcase(charval),_t),CHARNUM.));
end;
put _all_;
run;
Finally, if you want to be able to construct this in the same datastep, you could construct the relationships in a hash table and look up the result. I can explain that if desired, though I'd like to see a more detailed example of what you want to do in terms of defining the relationship between a letter and its code.
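A rough sketch of what the hash version might look like, assuming the plain A=1 ... Z=26 mapping on an ASCII system. The names letter, code and test_hash are illustrative; a different letter-to-code relationship would only change how the table is loaded.

```sas
data test_hash;
  length letter $1 code 8;
  if _n_ = 1 then do;
    declare hash h();              /* lookup table: letter -> code */
    h.defineKey('letter');
    h.defineData('code');
    h.defineDone();
    do code = 1 to 26;
      letter = byte(64 + code);    /* byte(65)='A' on ASCII systems */
      h.add();
    end;
  end;
  charval = 'CAB';
  numval = 0;
  do _t = 1 to length(charval);
    letter = char(upcase(charval), _t);
    if h.find() = 0 then numval + code;  /* find() returns 0 when the key exists */
  end;
  put charval= numval=;            /* writes charval=CAB numval=6 */
  drop letter code _t;
run;
```

Characters that are not in the table are simply skipped, because find() returns a nonzero code for them.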
If you need to see the intermediate values, you can do that by inserting a CAT function in the loop- I recommend CATX:
data test;
charval='CAB';
format intermed $100.;
do _t=1 to length(Charval);
numval=sum(numval,input(char(upcase(charval),_t),CHARNUM.));
intermed=catx('|',intermed,input(char(upcase(charval),_t),CHARNUM.)); *or the RANK portion from earlier;
end;
put _all_;
run;
That would give you 3|1|2, which you could then do math on via SCAN:
do _t = 1 to countc(intermed,'|')+1;
numval2 = sum(numval2,input(scan(intermed,_t,'|'),8.)); /* scan returns text, so convert it explicitly */
end;
Your method to try and translate is a good attempt, but it will not really work. Here is a simple solution:
DATA CHAR_VALUE;
retain all_chars 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
set XYZ;
length CHAR_2_NUM $200;
CHAR_2_NUM = ' ';
NUM_CHAR = 0;
do i=1 to length(MY_VAR_CHAR);
CHAR_2_NUM = catx(' ', CHAR_2_NUM, index(all_chars,substr(MY_VAR_CHAR,i,1))); /* builds e.g. "3 1 2" */
NUM_CHAR + index(all_chars,substr(MY_VAR_CHAR,i,1));
end;
drop i all_chars;
RUN;
This takes advantage of the fact that the indexed position of each character of your source variable in the all_chars variable corresponds to the mapping you desired.
UPDATED to also create your CHAR_2_NUM variable, which I overlooked in the original question.
Another simple solution is based on the collate function:
To convert a variable called MyNumbers (in the range of 1 to 26) to English upper-case characters, one can use:
collate(64 + MyNumbers, 64 + MyNumbers)
To obtain lower-case characters, one can use:
collate(96 + MyNumbers, 96 + MyNumbers)
Here's a quick example:
data _null_;
do MyNumbers = 1 to 26;
MyLettersUpper = collate(64 + MyNumbers, 64 + MyNumbers);
MyLettersLower = collate(96 + MyNumbers, 96 + MyNumbers);
put MyNumbers MyLettersUpper MyLettersLower;
end;
run;
1 A a
2 B b
3 C c
4 D d
5 E e
6 F f
7 G g
8 H h
9 I i
10 J j
11 K k
12 L l
13 M m
14 N n
15 O o
16 P p
17 Q q
18 R r
19 S s
20 T t
21 U u
22 V v
23 W w
24 X x
25 Y y
26 Z z
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds