creating a new column based on observations in other columns - sas

I am having trouble creating a new variable using conditions, I've tried data steps but to no avail.
My data set looks like this right now:
A B C D E
1 . 1 1 .
. 1 . . .
1 . 1 . 1
I need to look like this
A B C D E R
. . . . 1
. 1 . . . .
. . . . . 1
So the idea that i've used is if the sum of a -- d is greater than 1 then set R equal to 1 else . and then drop the observations if 1 is present in a & b & c & d & e but its not doing it for me perhaps its due to missing values.
code i've used so far:
data campZ;
set campY;
select;
when (sum(Macroscopic -- Symbolic > 1)) Random = 1;
otherwise; end;
run;
I've tried Proc SQL as well but I have been mainly focusing on the data step but any help will be great.
Thank you!
Will

It looks like you want to both SET R and clear the other variables. You need to add the OF keyword when using a variable list as an argument to a function.
data campZ;
set campY;
if sum(of Macroscopic -- Symbolic) > 1 then do;
Random = 1;
call missing(of Macroscopic -- Symbolic);
end;
run;

SELECT A, B, C, D, E,
CASE WHEN A+B+C+D > 1 THEN 1 END AS R
FROM Table;
(Apologies if I've made any syntax slips, my SAS SQL is a bit rusty.)

You can execute a query to do this . . . although I think a data step is quite reasonable. Here is one way to do the above in proc sql.
proc sql
select (case when cnt <= 1 then a end) as a,
(case when cnt <= 1 then b end) as b,
(case when cnt <= 1 then c end) as c,
(case when cnt <= 1 then d end) as d,
(case when cnt <= 1 then e end) as e,
(case when cnt > 1 then 1 end) as r
from (select z.*,
((case when a is null then 0 else 1 end) +
(case when b is null then 0 else 1 end) +
(case when c is null then 0 else 1 end) +
(case when d is null then 0 else 1 end) +
(case when e is null then 0 else 1 end)
) as cnt
from campz z
) z ;
This just returns the values. If you want them in a new data set, then use create table as.

data a;
input A B C D E;
cards;
1 . 1 1 .
. 1 . . .
1 . 1 . 1
;
proc sql noprint;
create table a1 as
select *, case
when sum(a,b,c,d,e)>1 then 1
when sum(a,b,c,d,e)<=1 then .
end as R from a;
update a1 set A=., B=., C=., D=., E=.
where R=1;
quit;
OutPut
Obs A B C D E R
1 . . . . . 1
2 . 1 . . . .
3 . . . . . 1

Related

Collapsing multiple rows into a single row based on a common identifier

Working in Stata, suppose I have a data table like this...
Household Identifier
Person Identifier
Var1
Var2
1
1
a
b
1
1
c
d
1
2
e
f
2
1
g
h
2
1
i
j
2
1
k
l
2
2
m
n
2
2
o
p
3
1
q
r
I want to be able to combine these so there is just one observation per household, i.e. like this
Household Identifier
Person1_Var1_1
Person1_Var2_1
Person1_Var1_2
Person1_Var2_2
Person1_Var3_1
Person1_Var3_2
Person2_Var1_1
Person2_Var2_1
Person2_Var1_2
Person2_Var2_2
Person2_Var3_1
Person2_Var3_2
1
a
b
c
d
.
.
e
f
.
.
.
.
2
g
h
i
j
k
l
m
n
o
p
.
.
3
q
r
.
.
.
.
.
.
.
.
.
.
Is there a straightforward way of doing this?
You can use reshape wide twice. Note that when I create rowid, I add an underscore to it; I also add underscore to the var1 and var2 columns. In the first reshape call, I use string to identify rowid as a string variable
bysort householdidentifier personidentifier: gen rowid = strofreal(_n) + "_"
rename var* =_
reshape wide var1 var2, i(householdidentifier personidentifier) j(rowid) string
reshape wide var*, i(householdidentifier) j(personidentifier)
Output:
househ~r var1_1_1 var2_1_1 var1_2_1 var2_2_1 var1_3_1 var2_3_1 var1_1_2 var2_1_2 var1_2_2 var2_2_2 var1_3_2 var2_3_2
1. 1 a b c d e f
2. 2 g h i j k l m n o p
3. 3 q r

How to collapse two columns into one in SAS

What I have
A dataset with 8 row x4 col
"Condition" "A_1" "B_1"
A 1 .
A 3 .
A 2 .
A 4 .
B . 4
B . 3
B . 5
B . 6
[
What I want is either:
What I want 1
(1)
"Condition" "A_1" "B_1"
A 1 .
A 3 .
A 2 .
A 4 .
B 4 .
B 3 .
B 5 .
B 6 .
OR, (2):
What I want 2
"Condition" "A_1" "B_1" "AB_1"
A 1 . 1
A 3 . 3
A 2 . 2
A 4 . 4
B . 4 4
B . 3 3
B . 5 5
B . 6 6
It was easy with STATA, R, and Excel (of course), but for the life of me I can't figure out this simple thing in SAS.
I tried,
data want;
if condition = "B" then A_1 = B_1;
set have;
run;
I also tried
data want;
if condition = "A" then AB_1 = A_1;
else AB_1 = B_1;
set have;
run;
The second code almost does the job except that the resulting AB_1 lags by 1 row.
What the hack...
Use coalesce. You also need your set statement before doing any of your logic. SAS reads a row when it encounters the set statement.
data want;
set have;
AB_1 = coalesce(A_1, B_1);
run;
for (2) you can try to use cats() function
data want;
set have;
AB_1 = cats(A_1,B_1);
run;
But in order to do the concatenate columns of different types you should also use the explicit PUT() function.

Needing to retain Lab category tests based on individual positive test result

Hello so this is a sample of my data (There is an additional column of LBCAT =URINALYSIS for those panel of tests)
I've been asked to only include the panel of tests where LBNRIND is populated for any of those tests and the rest to be removed. Some subjects have multiple test results at different visit timepoints and others only have 1.I can't utilise a simple where LBNRIND ne '' in the data step because I need the entire panel of Urinalysis tests and not just that particular test result. What would be the best approach here? I think transposing the data would be too messy but maybe putting the variables in an array/macro and utilising a do loop for those panel of tests?.
Update:I've tried this code but it doesn't keep the corresponding tests for where lb_nrind >0. If I apply the sum(lb_nrind > '' ) the same when applying lb_nrind > '' to the having clause
*proc sql;
*create table want as
select * from labUA
group by ptno and day and lb_cat
having sum(lb_nrind > '') > 0 ;
data want2;
do _n_ = 1 by 1 until (last.ptno);
set labUA;
by ptno period day hour ;
if not flag_group then flag_group = (lb_nrind > '');
end;
do _n_ = 1 to _n_;
set want;
if flag_group then output;
end;
drop flag_group; run;*
You can use a SQL HAVING clause to retain rows of a group meeting some aggregate condition. In your case that group might be a patientid, panelid and condition at least one LBNRIND not NULL
Example:
Consider this example where a group of rows is to be kept only if at least one of the rows in the group meets the criteria result7=77
Both code blocks use the SAS feature that a logical evaluation is 1 for true and 0 for false.
SQL
data have;
infile datalines missover;
input id test $ parm $ result1-result10;
datalines;
1 A P 1 2 . 9 8 7 . . . .
1 B Q 1 2 3
1 C R 4 5 6
1 D S 8 9 . . . 6 77
1 E T 1 1 1
1 F U 1 1 1
1 G V 2
2 A Z 3
2 B K 1 2 3 4 5 6 78
2 C L 4
2 D M 9
3 G N 8
4 B Q 7
4 D S 6
4 C 1 1 1 . . 5 0 77
;
proc sql;
create table want as
select * from have
group by id
having sum(result7=77) > 0
;
DOW Loop
data want;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if not flag_group then flag_group = (result7=77);
end;
do _n_ = 1 to _n_;
set have;
if flag_group then output;
end;
drop flag_group;
run;

Assign columname (month) as value to another column SAS

Please help friends
data have;
input v_202002 $1. v_202003 $1. v_202001 $1.;
datalines;
a . b
. . b
a b b
. b a
b b a
;
What I am looking for - First time the value became 'b'
want dataset:
v_202002 v_202003 v_202001 output
a . b 202001
. . b 202001
a b b 202001
. b a 202003
b b a 202002
You can use the WHICHC() function to find the index into an array where the value appears. Then use the VNAME() function to get the name.
data want;
set have;
array vlist v: ;
index=whichc('b',of vlist[*]);
if index then output = substr(vname(vlist[index]),3);
run;
Results
Obs v_202002 v_202003 v_202001 index output
1 a b 3 202001
2 b 3 202001
3 a b b 2 202003
4 b a 2 202003
5 b b a 1 202002

SAS Comparing values across multiple columns of the same observation?

I have observations with column ID, a, b, c, and d. I want to count the number of unique values in columns a, b, c, and d. So:
I want:
I can't figure out how to count distinct within each row, I can do it among multiple rows but within the row by the columns, I don't know.
Any help would be appreciated. Thank you
********************************************UPDATE*******************************************************
Thank you to everyone that has replied!!
I used a different method (that is less efficient) that I felt I understood more. I am still going to look into the ways listed below however to learn the correct method. Here is what I did in case anyone was wondering:
I created four tables where in each table I created a variable named for example ‘abcd’ and placed a variable under that name.
So it was something like this:
PROC SQL;
CREATE TABLE table1_a AS
SELECT
*
a as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table2_b AS
SELECT
*
b as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table3_c AS
SELECT
*
c as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table4_d AS
SELECT
*
d as abcd
FROM table_I_have_with_all_columns
;
QUIT;
Then I stacked them (this means I have duplicate rows but that ok because I just want all of the variables in 1 column and I can do distinct count.
data ALL_STACK;
set
table1_a
table1_b
table1_c
table1_d
;
run;
Then I counted all unique values in ‘abcd’ grouped by ID
PROC SQL ;
CREATE TABLE count_unique AS
SELECT
My_id,
COUNT(DISTINCT abcd) as Count_customers
FROM ALL_STACK
GROUP BY my_id
;
RUN;
Obviously, it’s not efficient to replicate a table 4 times just to put a variables under the same name and then stack them. But my tables were somewhat small enough that I could do it and then immediately delete them after the stack. If you have a very large dataset this method would most certainly be troublesome. I used this method over the others because I was trying to use Procs more than loops, etc.
A linear search for duplicates in an array is O(n2) and perfectly fine for small n. The n for a b c d is four.
The search evaluates every pair in the array and has a flow very similar to a bubble sort.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
run;
The linear search for duplicates will occur on every row, and the count_distinct will be initialized automatically in each row to a missing (.) value. The sum function is used to increment the count when a non-missing value is not found in any prior array indices.
* linear search O(N**2);
data want;
set have;
array x a b c d;
do i = 1 to dim(x) while (missing(x(i)));
end;
if i <= dim(x) then count_distinct = 1;
do j = i+1 to dim(x);
if missing(x(j)) then continue;
do k = i to j-1 ;
if x(k) = x(j) then leave;
end;
if k = j then count_distinct = sum(count_distinct,1);
end;
drop i j k;
run;
Try to transpose dataset, each ID becomes one column, frequency each ID column by option nlevels, which count frequency of value, then merge back with original dataset.
Proc transpose data=have prefix=ID out=temp;
id ID;
run;
Proc freq data=temp nlevels;
table ID:;
ods output nlevels=count(keep=TableVar NNonMisslevels);
run;
data count;
set count;
ID=compress(TableVar,,'kd');
drop TableVar;
run;
data want;
merge have count;
by id;
run;
one more way using sortn and using conditions.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
77 . 3 . 4
88 . 9 5 .
99 . . 2 2
76 . . . 2
58 1 1 . .
50 2 . 2 .
66 2 . 7 .
89 1 1 1 .
75 1 2 3 .
76 . 5 6 7
88 . 1 1 1
43 1 . . 1
31 1 . . 2
;
data want;
set have;
_a=a; _b=b; _c=c; _d=d;
array hello(*) _a _b _c _d;
call sortn(of hello(*));
if a=. and b = . and c= . and d =. then count=0;
else count=1;
do i = 1 to dim(hello)-1;
if hello(i) = . then count+ 0;
else if hello(i)-hello(i+1) = . then count+0;
else if hello(i)-hello(i+1) = 0 then count+ 0;
else if hello(i)-hello(i+1) ne 0 then count+ 1;
end;
drop i _:;
run;
You could just put the unique values into a temporary array. Let's convert your photograph into data.
data have;
input id a b c d;
datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
;
So make an array of the input variables and another temporary array to hold the unique values. Then loop over the input variables and save the unique values. Finally count how many unique values there are.
data want ;
set have ;
array unique (4) _temporary_;
array values a b c d ;
call missing(of unique(*));
do _n_=1 to dim(values);
if not missing(values(_n_)) then
if not whichn(values(_n_),of unique(*)) then
unique(_n_)=values(_n_)
;
end;
count=n(of unique(*));
run;
Output:
Obs id a b c d count
1 11 2 3 4 4 3
2 22 1 8 1 1 2
3 33 6 . 1 2 3
4 44 . 1 1 . 1