Merge 3 variables into 1 - SAS

I have a table as below. I want to combine these 3 variables into one. If ex1 has a value, the rest are null.
id ex1 ex2 ex3
2 12
3 13
4 13
5 14
I need this table:
id final
2 12
3 13
4 13
5 14

The coalesce function returns the first non-missing argument from a list of arguments. So for example:
data want;
  set have;
  final = coalesce(of ex1-ex3);
run;
returns the first non-missing value from ex1, ex2, ex3.
coalescec is the character version of the function (it returns a character value).
Another option would be to sum the values, so
data want;
  set have;
  final = sum(of ex1-ex3);
run;
or in character, cats (or catx with delimiter) will concatenate them. These will behave differently than coalesce/coalescec if more than one value is present, and sum will behave differently if 0 values are present, but will behave identically if exactly one value is always present.
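To see where these options diverge, here is a minimal sketch on made-up data. The dataset name demo, the output variable names, and the third row (where two variables are filled) are assumptions for illustration and are not part of the question's data.

```sas
/* Hypothetical data: row 3 deliberately has two non-missing values */
data demo;
  input id ex1 ex2 ex3;
  by_coalesce = coalesce(of ex1-ex3); /* first non-missing value */
  by_sum      = sum(of ex1-ex3);      /* total of the non-missing values */
  datalines;
1 12 . .
2 . 13 .
3 5 7 .
;
run;

proc print data=demo noobs;
run;
```

For ids 1 and 2 the two columns agree (12 and 13). For id 3, coalesce returns 5 while sum returns 12, so the choice only matters when more than one value can be present.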

Related

SAS - How to make dataset wide to long when some values are missing?

I have a dataset that looks basically like this:
LOCID  Name  Addtl Loc 1  Addtl Loc 2  Addtl Loc 3
1      A     2            3            5
1      B     2
1      C     2            4
And I would like to make it look like this:
LOCID  Name  Gender
1      A     F
2      A     F
3      A     F
5      A     F
1      B     M
2      B     M
1      C     F
2      C     F
4      C     F
So, I'd like to keep the attributes for each person but have a row for each of their locations. I also don't currently have a unique ID or any variable to identify each of the people but I could make one. I'm working in SAS. Does anyone have suggestions on how to do this?
I have been looking up wide to long methods but am having trouble understanding them.
It looks to me like you could just use a DO LOOP to transpose the data.
So assuming your input data set has LOCID and ADD_LOCID1 to ADD_LOCID3 plus any other variables, such as NAME and GENDER, you could just do the following to add an extra observation for every non-missing value found in the extra locid variables.
data want;
  set have;
  array list add_locid1 - add_locid3;
  output;
  do index=1 to dim(list);
    locid = list[index];
    if not missing(locid) then output;
  end;
  drop index add_locid1-add_locid3;
run;

Calculating median across multiple rows and columns in SAS 9.4

I tried searching multiple places but have not been able to find a solution yet. I was wondering if someone here would be able to please help me?
I am trying to calculate a median value (with Q1 and Q3) across multiple rows and columns in SAS 9.4. The dataset I am working with looks like the following:
Obs tumor_size_1 tumor_size_2 tumor_size_3 tumor_size_4
1 4 1.5 1 1
2 2.5 2 . .
3 3 . . .
4 4 . . .
5 3.5 1 . .
The context is this is for a medical condition where a person may have 1 (or more) tumors. Each row represents 1 person. Each person may have up to 4 tumors. I would like to determine the median size of all tumors for the entire cohort (not just the median size for each person). Is there a way to calculate this? Thank you in advance.
A transpose of the data will yield a data structure (form) that is amenable to median and quartile computations, at a variety of aggregate combinations, made with PROC SUMMARY and a CLASS statement.
Example:
data have;
input
patient tumor_size_1 tumor_size_2 tumor_size_3 tumor_size_4; datalines;
1 4 1.5 1 1
2 2.5 2 . .
3 3 . . .
4 4 . . .
5 3.5 1 . .
;
proc transpose data=have out=new_have;
by patient;
var tumor:;
run;
proc summary data=new_have;
class patient;
var col1;
output out=want Q1=Q1 Q3=Q3 MEDIAN=MEDIAN N=N;
run;
Results
patient _TYPE_ _FREQ_ Q1 Q3 MEDIAN N
. 0 20 1 3.50 2.25 10
1 1 4 1 2.75 1.25 4
2 1 4 2 2.50 2.25 2
3 1 4 3 3.00 3.00 1
4 1 4 4 4.00 4.00 1
5 1 4 1 3.50 2.25 2
The _TYPE_ column describes how the CLASS variables are combined to produce the requested statistics. The _TYPE_ = 0 row covers all values; in this problem, _FREQ_ = 20 indicates that 20 input rows were considered, and N = 10 indicates that 10 of those were non-missing and were used in the actual computation. The role of _TYPE_ becomes more obvious when there is more than one CLASS variable.
From the Output Data Set documentation:
the variable _TYPE_ that contains information about the class variables. By default _TYPE_ is a numeric variable. If you specify CHARTYPE in the PROC statement, then _TYPE_ is a character variable. When you use more than 32 class variables, _TYPE_ is automatically a character variable.
and
The value of _TYPE_ indicates which combination of the class variables PROC MEANS uses to compute the statistics. The character value of _TYPE_ is a series of zeros and ones, where each value of one indicates an active class variable in the type. For example, with three class variables, PROC MEANS represents type 1 as 001, type 5 as 101, and so on.
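As a rough illustration of _TYPE_ with two class variables, the sketch below uses the SASHELP.CLASS sample table that ships with SAS. That table, the output dataset name stats, and the statistic name median_height are assumptions for the example and are unrelated to the tumor data.

```sas
/* Two CLASS variables: SEX (listed first) and AGE (listed second) */
proc summary data=sashelp.class;
  class sex age;
  var height;
  output out=stats median=median_height n=n;
run;

/* _TYPE_ 0 = overall, 1 = by AGE only, 2 = by SEX only, 3 = by SEX and AGE */
proc print data=stats noobs;
  var _type_ sex age _freq_ n median_height;
run;
```

The rightmost CLASS variable corresponds to the lowest bit, which is why _TYPE_ = 1 groups by AGE alone and _TYPE_ = 2 by SEX alone.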
A far less elegant way to compute the median of all values is to store them in an oversized array and use the MEDIAN function on the array after the last row is read in:
data median_all;
  set have end=lastrow;
  array values [1000000] _temporary_;
  array sizes tumor_size_1-tumor_size_4;
  do sIndex = 1 to dim(sizes);
    /* if not missing (sizes[sIndex]) then do; */ %* uncomment for dense fill;
    vIndex + 1;
    values[vIndex] = sizes[sIndex];
    /* end; */ %* uncomment for dense fill;
  end;
  if lastrow then do;
    median_all_tumor_sizes = median (of values(*));
    output;
    put (median:) (=);
  end;
  keep median:;
run;
-------- LOG -------
median_all_tumor_sizes=2.25

Replacing missing values with closest non-missing value

I have a data set with some missing values and I would like to replace those missing values with the following non-missing value OR if the value occurs in the last variable, then with the previous value.
Eg of data that I have:
x var1 var2 var3 var4
e1 1 2 3 4
e2 . . 5 7
e3 5 8 . .
e4 2 3 1 9
Eg of data that I want:
x var1 var2 var3 var4
e1 1 2 3 4
e2 **5** **5** 5 7
e3 5 8 **8** **8**
e4 2 3 1 9
I have tried the following code:
data want;
  set have;
  array t(*) var1--var4;
  do _n_=1 to dim(t);
    if t(_n_)=. then t(_n_)=coalesce(of t(*));
  end;
run;
However, this only works when the replacement is the following value; i.e., if the missing value occurs in var4, it takes the value from var1 of that row (e3) instead of var2.
If I understand correctly, the value from the next row is brought into the current row only when var1 is missing; otherwise, missing values in the current row are propagations of the value to the left (even when that value itself comes from a prior left-to-right propagation).
The next row retrieval, also known as lead, can be accomplished using a 1:1 reflexive merge with one self advanced by one row using option firstobs=.
data have;
  input x $ var1 var2 var3 var4;
  datalines;
e1 1 2 3 4
e2 . . 5 7
e3 5 8 . .
e4 2 3 1 9
;
run;
data want;
* reflexive 1:1 merge;
merge
have
have(firstobs=2 keep=var1 rename=var1=lead1)
;
if missing(var1) then var1=lead1;
array v var1-var4;
do _i_ = 2 to dim(v);
if missing(v(_i_)) then v(_i_)=v(_i_-1);
end;
drop lead:;
run;
An intuitive approach is simply to loop through the array until you encounter a missing value. Then loop through the remaining part of the array looking for the next non-missing value. If the missing value occurs at the end (more precisely: with no non-missing values in the remaining part of the array), we will still have missing values at the end of these loops.
We can then do the same procedure in reverse, starting at the end of the array and working our way to the start.
I'd avoid using _n_ as a variable name, as it is an automatic variable in SAS.
data want;
set have;
array t(*) var1--var4;
/* Following value*/
do n=1 to dim(t)-1;
inner=n;
do while (t(n)=. and inner lt dim(t));
t(n)=t(inner+1);
inner+1;
end;
end;
/* If there was no following value, the value is still missing, so look for the previous value instead */
do n=dim(t) to 2 by -1;
inner=n;
/* stop at inner=1 so that t(inner-1) never refers to element 0 */
do while (t(n)=. and inner gt 1);
t(n)=t(inner-1);
inner+ (-1);
end;
end;
drop n inner;
run;

Column combine two datasets of different size

I have two datasets of the following structure
ID1 Cat1
1 a
2 a
3 b
4 b
5 b
6 c
7 d
and
ID2 Cat2
11 z
12 z
13 z
14 y
15 x
I want to column-combine them and then have the unmatched rows just be missing. So ultimately I want:
ID1 Cat1 ID2 Cat2
1 a 11 z
2 a 12 z
3 b 13 z
4 b 14 y
5 b 15 x
6 c
7 d
The purpose of this is that I have two sorted datasets (by ID) and want to do a matching of the first category (Cat1) with the second (Cat2). The second category has a predefined number of "slots" and those slots should be matched on the order of the IDs. The only relationship between ID1 and ID2 is that they are ordered the same way. So the two lowest should be a match and so on.
You want a one-to-one merge.
The documentation is here
To do a one-to-one merge, you just need to merge without a BY statement.
This type of merge simply matches the observations based on their row numbers, so be careful: it may give you unintended results if you are missing a row you thought you had or something else wasn't as you expected.
For example:
proc sort data=have1; by ID1; run;
proc sort data=have2; by ID2; run;
data want;
  merge have1 have2;
run;
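A self-contained sketch of the same merge, for trying it out. The input steps are an assumption about how have1 and have2 are built, and the ID1 values 1 to 7 follow the desired output shown in the question.

```sas
/* Rebuild the two example tables (assumed layout: numeric ID, character category) */
data have1;
  input ID1 Cat1 $;
  datalines;
1 a
2 a
3 b
4 b
5 b
6 c
7 d
;
run;

data have2;
  input ID2 Cat2 $;
  datalines;
11 z
12 z
13 z
14 y
15 x
;
run;

/* MERGE without a BY statement pairs observations by position */
data want;
  merge have1 have2;
run;

proc print data=want noobs;
run;
```

Rows 6 and 7 of want come out with ID2 missing and Cat2 blank, because have2 has run out of observations by then.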

What does this if mean in a data step?

In this data step, I do not understand what if last.y does.
Could you tell me?
data stop2;
set stop2;
by x y z t;
if last.y; /*WHAT DOES THIS DO ??*/
if t ne 999999 then
t=t+1;
else do;
t=0;
z=z+1;
end;
run;
LAST.Y refers to the row immediately before a change in the value of Y. So, in the following dataset:
data have;
input x y z;
datalines;
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
1 3 1
1 3 2
1 3 3
2 3 1
2 3 2
2 3 3
;;;;
run;
LAST.Y would occur on the third, sixth, ninth, and twelfth rows in that dataset (on each row where Z=3). The first two times are when Y is about to change from 1 to 2, and when it is about to change from 2 to 3. The third time is when X is about to change - LAST.Y triggers when Y is about to change or when any variable before it in the BY list changes. Finally, the last row in the dataset is always LAST.(whatever).
In the specific dataset above, the subsetting if means you only take the last row for each group of Ys. In this code:
data want;
set have;
by x y z;
if last.y;
run;
You would end up with the following dataset at the end:
data want;
input x y z;
datalines;
1 1 3
1 2 3
1 3 3
2 3 3
;;;;
run;
One thing you can do if you want to see how FIRST and LAST operate is to use PUT _ALL_;. For example:
data want;
set have;
by x y z;
put _all_;
if last.y;
run;
It will show you all of the variables, including FIRST.(whatever) and LAST.(whatever) on the dataset. (FIRST.Y and LAST.Y are actually variables.)
In SAS, first. and last. are variables created implicitly within a data step.
Each BY variable has a first. and a last. variable, set for each record in the DATA step. These values will be either 0 or 1. Writing if last.y; is the same as writing if last.y = 1;.
Please refer here for further info.
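Because first. and last. variables are not written to the output dataset, one way to inspect them is to copy them into ordinary variables. The sketch below assumes a small made-up table; the names flags, show_flags, first_y and last_y are hypothetical.

```sas
/* Hypothetical data, already sorted by x and y */
data flags;
  input x y z;
  datalines;
1 1 1
1 1 2
1 2 1
2 2 1
;
run;

data show_flags;
  set flags;
  by x y;
  first_y = first.y; /* 1 on the first row of each x-y group, else 0 */
  last_y  = last.y;  /* 1 on the last row of each x-y group, else 0 */
run;

proc print data=show_flags noobs;
run;
```

In the third row last_y is 1 even though y is still 2 on the next row, because x changes there and x comes before y in the BY list.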
That is an example of a subsetting IF statement, which is different from an IF/THEN statement. It basically means that if the condition is not true, then this iteration of the data step stops right now.
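So
if last.y;
is equivalent to
if not last.y then delete;
The following forms also skip the remaining statements when last.y is false, but unlike the subsetting IF they still write the observation at the end of the iteration unless the step contains an explicit OUTPUT statement:
if not last.y then return;
or
if last.y then do;
... rest of the data step before the run ...
end;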