SAS: merge/ join datasets by dynamic columns in the look up table - sas

I want to join sas dataset with the look up table but the column/key for joining is a value in the look up table
Dataset: table4
ID lev1 lev2 lev3 lev4 lev5
1 12548 14589 85652 45896 45889
2 12548 14589 85652 45896 45890
3 12548 14547 85685 45845 45825
4 66588 24647 55255 30895 15764
Look up table:
context table_name column operator value
extract table1 col1 equals xyd
asset table2 var1 equals 11111
asset table2 var2 equals 25858
prod table3 x1 equals 87999
unprod table4 lev2 equals 14589
unprod table4 lev2 equals 14589
unprod table4 lev3 equals 55255
Now I want to join table4 with lookup table but it is only possible with fields lev2 and lev3(it is dynamic so could be changed in the future, so don't want to hardcode in it).
I have tried below code but doesn't want to hard code as the fields are dynamic( someone might add lev4 as well in future).
proc sql ;
create table want as
select ID
from table4 as a
inner join lookup as b
on a.lev2 = input(value,12.) or a.lev3=input(value,12.)
where Context="unprod";
quit;
Thanks heaps in advance.

That does not look like a lookup table. It appears to be a set of rules. You could use it to generate code. Let's simplify the process by making the table contain actual code instead of three columns. But you could easily write the code to convert from your current format into code strings.
data rules ;
infile cards truncover ;
input context $ table_name $ rule $100. ;
cards;
extract table1 col1 = xyd
asset table2 var1 = 11111
asset table2 var2 = 25858
prod table3 x1 = 87999
unprod table4 lev2 = 14589
unprod table4 lev2 = 14589
unprod table4 lev3 = 55255
;
So now it looks like you want to take the rules that have a specific value of CONTEXT and use that to generate a new dataset from the dataset named in TABLE_NAME. Not sure what name you want to use for the generated table or what you want to do when more than one table is mentioned in the same "context".
%let context=unprod ;
filename code temp;
data _null_;
set rules ;
where context=symget('context');
by table_name ;
file code ;
if first.table_name then table_no+1;
if first.table_name then put
'data want' table_no ';'
/ ' set ' table_name ';'
/ ' where 1=0'
;
put ' or (' rule ')' ;
if last.table_name then put
';'
/ 'run;'
;
run;
%include code / source2 ;
Which results in code like this:
130 +data want1 ;
131 + set table4 ;
132 + where 1=0
133 + or (lev2 = 14589 )
134 + or (lev2 = 14589 )
135 + or (lev3 = 55255 )
136 +;
137 +run;
NOTE: There were 3 observations read from the data set WORK.TABLE4.
WHERE (lev2=14589) or (lev3=55255);
NOTE: The data set WORK.WANT1 has 3 observations and 6 variables.

Here is a sample code that would get what I understood you are trying to do. This code is based on the comment by #Reeza. If this is not what you are trying to do, please send a sample output file.
data table4;
input ID $ lev1 $ lev2 $ lev3 $ lev4 $ lev5 $;
datalines;
1 12548 14589 85652 45896 45889
2 12548 14589 85652 45896 45890
3 12548 14547 85685 45845 45825
4 66588 24647 55255 30895 15764
;
run;
data look_up;
input context $ table_name $ column $ operator $ value $;
datalines;
extract table1 col1 equals xyd
asset table2 var1 equals 11111
asset table2 var2 equals 25858
prod table3 x1 equals 87999
unprod table4 lev2 equals 14589
unprod table4 lev2 equals 14589
unprod table4 lev3 equals 55255
;
run;
PROC transpose DATA=work.table4 out=temp1 prefix=value;
by ID;
VAR lev1-lev5;
run;
proc sql;
create table want as
select a.*, b.ID
from look_up as a
inner join temp1 as b
on a.value=b.value1 and a.column=_Name_;
quit;

Related

Proc tabulate,in my column i have more than 50 variable how can i reduce it to only 5?

proc tabulate data=D.Arena out=work.Arena ;
class Row1 Column1/ order=freq ;
table Row1,Column1 ;
run;
after running this i received these results and now i want to restrict the columns to only 5 variables
Use a 'where' statement to restrict the col1 values being tabulated.
You can restrict based on a value property such as starts with the letter A
where col1 =: 'A';
You can restrict based on a value list:
where col1 in ('Apples', 'Lentils', 'Oranges', 'Sardines', 'Cucumber');
Sample data:
data have;
call streaminit(123);
array col1s[50] $20 _temporary_ (
"Apples" "Avocados" "Bananas" "Blueberries" "Oranges" "Strawberries" "Eggs" "Lean beef" "Chicken breasts" "Lamb" "Almonds" "Chia seeds" "Coconuts" "Macadamia nuts" "Walnuts" "Asparagus" "Bell peppers" "Broccoli" "Carrots" "Cauliflower" "Cucumber" "Garlic" "Kale" "Onions" "Tomatoes" "Salmon" "Sardines" "Shellfish" "Shrimp" "Trout" "Tuna" "Brown rice" "Oats" "Quinoa" "Ezekiel bread" "Green beans" "Kidney beans" "Lentils" "Peanuts" "Cheese" "Whole milk" "Yogurt" "Butter" "Coconut oil" "Olive oil" "Potatoes" "Sweet potatoes" "Vinegar" "Dark chocolate"
);
do row1 = 1 to 20;
do _n_ = 1 to 1000;
col1 = col1s[ceil(rand('uniform',50))];
x = ceil(rand('uniform',250));
output;
end;
end;
run;
Frequency tabulation, also showing ALL counts
* col1 values shown in order by value;
proc tabulate data=have;
class row1 col1;
table ALL row1,col1;
run;
* col1 values shown in order by ALL frequency;
proc tabulate data=have;
class row1;
class col1 / order=freq;
table ALL row1,col1;
run;
* Letter T col1 values shown in order by ALL frequency;
proc tabulate data=have;
where col1 =: 'T';
class row1;
class col1 / order=freq;
table ALL row1,col1;
run;
A top 5 only list of Col1s would require a step that determines which col1s meet that criteria. A list of those col1s can be used as part of a where in clause.
* determine the 5 col1s with highest frequency count;
proc sql noprint outobs=5;
select
quote(col1) into :top5_col1_list separated by ' '
from
( select col1, count(*) as N from have
group by col1
)
order by N descending;
quit;
proc tabulate data=have;
where col1 in (&top5_col1_list);
class row1;
class col1 / order=freq;
table ALL row1,col1;
run;
Col1s in order of value
Col1s in order of frequency
T Col1s
Top 5 Col1s

SAS : Map a column from 1 table to any of the multiple columns in another table

I have a table1 that contains 4 different kind of ids
Data table1;
Input id1 $ id2 $ id3 $ final_id $;
Datalines;
1 a a1 p
2 b b2 q
- c c2 r
3 d - s
4 - d4 t
A table2 contains any of the ids from id1, id2 or id3 of table1:
Data table1;
Input id $ col1 $ col2 $;
Datalines;
1 gsh ywu
b hsjs kall
c2 jsjs ywe
3 sja weei
d4 ase uwh
I want to left join table1 on table2 such that I get a new column in table2 giving me final_id from table1.
How do i go about this problem?
Please help.
Thank you.
You can do it using SQL:
proc SQL noprint;
create table merged as
select b.final_id, a.*
from table2 as a left join table1 as b
on (a.id eq b.id1 or a.id eq b.id2 or a.id eq b.id3)
;
quit;

Counting categorical variables on row in SAS

Sample Data
I was wondering if it is possible to use data instead of proc to count the number of categorical variables on a row as shown in 'count' example above. This will allow me to further use the data e.g COUNT=1 or COUNT > 1 to check morbidity.
Also will it be possible to then count the number of each diagnosis in the entire data set per patient while accounting for duplicates if there is any? For example there are 3 CB's and 2 AA's in this data set but CB should be 2 because patient 2 had it recorded twice.
Thank you for your time and have a lovely new year.
Your question is not clear but your could manage your diag using union all and count distinct
selec patient count(distinct diag )
from (
select patient, diag1 as diag
from my_table
uniona all
select patient, diag2
from my_table
uniona all
select patient, diag3
from my_table
uniona all
select patient, diag4
from my_table
) t
group by patient
or simply union and count
selec patient count(diag )
from (
select patient, diag1 as diag
from my_table
uniona
select patient, diag2
from my_table
uniona
select patient, diag3
from my_table
uniona
select patient, diag4
from my_table
) t
group by patient
The image indicates that for each row you want a count of the number of columns with non-missing values. Additionally, you apparently have some way to do this using a PROC step, but would like to know how using a DATA step.
In DATA step you can count the number of non-missing values indirectly using CMISS, or directly using COUNTC against a constructed value:
data have;
attrib pid length=8 diag1-diag4 length=$5;
input pid & diag1-diag4;
datalines;
1 AA J9 HH6 .
2 CB . . CB
3 J10 AA CB J10
4 B B . F90 .
5 J10 . . .
6 . . . .
run;
data have_with_count;
set have;
count = 4 - cmiss (of diag1-diag4);
count_way2 = countc(catx('~', of diag1-diag4, 'SENTINEL'), '~');
run;
In order to work again MySQL data source you will also need a libref that connects you to that remote data server.
Added
Counting distinct values across a row can be accomplished using a hash or sortc. Consider this example that sorts a copy of the row data (as an array) and counts the unique values within:
data want;
set have;
array diag diag1-diag4;
array v(4) $5 _temporary_;
do _n_ = 1 to dim(diag);
v(_n_) = diag(_n_);
end;
call sortc(of v(*));
uniq = 0;
do _n_ = 1 to dim(v);
if missing(v(_n_)) then continue;
if uniq = 0 then
uniq + 1;
else
uniq + ( v(_n_) ne v(_n_-1) );
end;
run;
With Richard's dummy data to count number of diagnosis and unique number of diagnosis:
data want;
set have;
array var diag:;
length temp $30.;
call missing(diag_num);
do over var;
if not missing(var) then do;
diag_num+1;
temp=ifc(whichc(var, temp),temp,catx(' ',temp,var));
end;
end;
unique_diag=countw(temp);
drop temp;
run;

Extracting info by matching two datasets in SAS

I have two datasets. Both have a common column- ID. I would like to check if ID from df1 lies in df2 and extract all such rows from df1. I'm doing this in SAS.
It is easily done in one sql query.
proc sql;
create table extract_from_df1 as
select
*
from
df1
where
id in (select id from df2)
;
quit;
There are lots of ways to do this. For example:
proc sql;
create table compare as select distinct
a.id as id1, b.id as id2
from table1 as a
left join table2 as b
on a.id = b.id;
quit;
and then keep matches. Or you can try:
proc sql;
delete from table2 where id2 in select distinct id1 from table1;
quit;
data df1;
input id name $;
cards;
1 abc
2 cde
3 fgh
4 ijk
;
run;
data df2;
input id address $;
cards;
1 abc
2 cde
5 ggh
6 ihh
7 jjj
;
run;
data c;
merge df1(in=x) df2(in=y);
if x and y;
keep id name;
run;
proc print data=c;
run;

How do I use PROC EXPAND to fill in time series observations within a panel (longitudinal) data set?

I'm using this SAS code:
data test1;
input cust_id $
month
category $
status $;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 ASD X
B 199912 ASD C
;
quit;
proc sql;
create view test2 as
select cust_id, input(put(month, 6.), yymmn6.) as month format date9.,
category, status from test1 order by cust_id, month asc;
quit;
proc expand data=test2 out=test3 to=month method=none;
by cust_id;
id month;
quit;
proc print data=test3;
title "after expand";
quit;
and I want to create a dataset that looks like this:
Obs cust_id month category status
1 A 01MAR2000 ABC C
2 A 01APR2000 DEF C
3 A 01MAY2000 . .
4 A 01JUN2000 XYZ 3
5 B 01OCT1999 ASD X
6 B 01NOV1999 . .
7 B 01DEC1999 ASD C
but the output from proc expand just says "Nothing to do. The data set WORK.TEST3 has 0 observations and 0 variables." I don't want/need to change the frequency of the data, just interpolate it with missing values.
What am I doing wrong here? I think proc expand is the correct procedure to use, based on this example and the documentation, but for whatever reason it doesn't create the data.
You need to add a VAR statement. Unfortunately, the variables need to be numeric. So just expand the month by cust_id. Then join back the original values.
proc expand data=test2 out=test3 to=month ;
by cust_id;
id month;
var _numeric_;
quit;
proc sql noprint;
create table test4 as
select a.*,
b.category,
b.status
from test3 as a
left join
test2 as b
on a.cust_id = b.cust_id
and a.month = b.month;
quit;
proc print data=test4;
title "after expand";
quit;