Extracting info by matching two datasets in SAS - sas

I have two datasets. Both have a common column- ID. I would like to check if ID from df1 lies in df2 and extract all such rows from df1. I'm doing this in SAS.

It is easily done in one sql query.
proc sql;
create table extract_from_df1 as
select
*
from
df1
where
id in (select id from df2)
;
quit;

There are lots of ways to do this. For example:
proc sql;
create table compare as select distinct
a.id as id1, b.id as id2
from table1 as a
left join table2 as b
on a.id = b.id;
quit;
and then keep matches. Or you can try:
proc sql;
delete from table2 where id2 in select distinct id1 from table1;
quit;

data df1;
input id name $;
cards;
1 abc
2 cde
3 fgh
4 ijk
;
run;
data df2;
input id address $;
cards;
1 abc
2 cde
5 ggh
6 ihh
7 jjj
;
run;
data c;
merge df1(in=x) df2(in=y);
if x and y;
keep id name;
run;
proc print data=c;
run;

Related

Conditional Merging in SAS

Hi all i have the below two datasets in which i need to map Date1 with Date2 in a range of (+/- 7)days within an ID.
Data set1;
input ID Date1 ddmmyy8.;
Format Date1 Date11.;
Datalines;
001 02-08-15
001 04-08-15
001 06-08-15
002 11-09-15
002 14-09-15
002 17-09-15
;
run;
Data set2;
input ID TYPE $ Date2 ddmmyy8.;
Format Date2 Date11.;
Datalines;
001 TYPE1 02-08-15
001 TYPE2 11-08-15
001 TYPE3 06-08-15
002 TYPE1 07-09-15
002 TYPE2 04-09-15
002 TYPE3 09-08-15
;
run;
Proc sql;
create table out as select a.ID, a.Date1, b.Date2,
intck('days', Date1, Date2) as Diff
from set1 as a full join set2 as b
on a.ID = b.ID and (Date1 + 7 >= Date2 >= Date1 - 7)
group by a.ID, Date1 having diff = min(diff);
quit;
I get the below output
i need output
Expected Output
***The output i get is highlighted in yellow when i map using min of Diff.
but the output i need is highlighted in green it is because i have to maintain the values in Date2 as distinct and does not repeat.
(i.e) Because 02-Aug-2015 is already mapped with 02-Aug-2015 as well as 09-Aug-2015 mapped with 09-Aug-2015 of Date1 i need the 04-Aug-2015 of Date1 to be mapped with the remaining 11-Aug-2015***
So judging on the comments above the rules are quite complex:
* join based on ID
* Absolute difference between date 1 and date 2 should be less then 7
* Preference for matching dates should be given to dates that are equal
* In case dates aren't equal, the solution should be so that as much as combinations are
returned as possible
* Date1 & Date2 need to be unique
I'm not sure whether the solution can be done in one proc sql step.
You can't just work with a simple rule as it is not always the result with the lowest difference between dates that needs to be returned. Next to that, you need to sort of 'remember' which dates per ID you have selected for an output since you can't select them anymore further on.
First I made an inner join on ID where the difference in dates was <= 7. This gives all possible valid combinations. Then I've put all the dates where the difference in dates was 0 in a macro variable so I can make a table of all possible combinations except dates being in the macro variable. In the last step then you want to have the solution returning the most possible combinations of date1 - date2 where both are distinct.
Data set1;
input ID Date1 ddmmyy8.;
Format Date1 Date11.;
Datalines;
001 02-08-15
001 04-08-15
001 06-08-15
002 11-09-15
002 14-09-15
002 17-09-15
;
run;
Data set2;
input ID TYPE $ Date2 ddmmyy8.;
Format Date2 Date11.;
Datalines;
001 TYPE1 02-08-15
001 TYPE2 11-08-15
001 TYPE3 06-08-15
002 TYPE1 07-09-15
002 TYPE2 04-09-15
002 TYPE3 09-08-15
;
run;
Proc sql noprint;
/*Create a macro variable with all the dates for which the difference between date1
and date2 = 0 */
select distinct put(Date1,yymmddn8.) into: dates seperated by ','
from set1 as a inner join set2 as b
on a.ID = b.ID and abs(intck('days', Date1, Date2)) = 0;
/*Create table with all lines where difference <= 7 but date is not in the ones with
difference = 0 */
create table out as select a.ID, a.Date1, b.Date2,
intck('days', Date1, Date2) as Diff
from set1 as a inner join set2 as b
on a.ID = b.ID where abs(intck('days', Date1, Date2)) <= 7 and
find("&dates.",put(Date1,yymmddn8.)) = 0 and find("&dates.",put(Date2,yymmddn8.)) = 0;
/* Check the number of possible combinations */
create table out as
select a.*,
b.cnt1 + c.cnt2 as combos
from out a left join (select distinct id,
date1,
count(*) as cnt1
from out
group by id, date1) b on a.date1 = b.date1 and a.id = b.id
left join (select distinct id,
date2,
count(*) as cnt2
from out
group by id, date2) c on a.date2 = c.date2 and a.id = c.id
order by id, combos;
quit;
/* Select unique dates per date1, date2 */
data out(keep = id date1 date2);
retain mem1 mem2;
length mem1 mem2 $100.;
set out;
by id combos;
if first.id then do;
mem1 = "0";
mem2 = "0";
end;
date10 = put(date1,yymmddn8.);
date20 = put(date2,yymmddn8.);
if find(mem1,date10) = 0 and find(mem2,date20) = 0 then do;
mem1 = catx(',',mem1,date10);
mem2 = catx(',',mem2,date20);
output;
end;
run;
/* Create a union between lines with no difference in date and lines with difference
in date*/
proc sql;
create table final as
select * from out
union
select a.ID, a.Date1, b.Date2
from set1 as a inner join set2 as b
on a.ID = b.ID and abs(intck('days', Date1, Date2)) = 0;
quit;
So this gives a table like:
Final Table

SAS : Map a column from 1 table to any of the multiple columns in another table

I have a table1 that contains 4 different kind of ids
Data table1;
Input id1 $ id2 $ id3 $ final_id $;
Datalines;
1 a a1 p
2 b b2 q
- c c2 r
3 d - s
4 - d4 t
A table2 contains any of the ids from id1, id2 or id3 of table1:
Data table1;
Input id $ col1 $ col2 $;
Datalines;
1 gsh ywu
b hsjs kall
c2 jsjs ywe
3 sja weei
d4 ase uwh
I want to left join table1 on table2 such that I get a new column in table2 giving me final_id from table1.
How do i go about this problem?
Please help.
Thank you.
You can do it using SQL:
proc SQL noprint;
create table merged as
select b.final_id, a.*
from table2 as a left join table1 as b
on (a.id eq b.id1 or a.id eq b.id2 or a.id eq b.id3)
;
quit;

How to merge 2 datasets with different lengths?

I would like to merge 2 datasets with 2 different dimensions.
TABLE1: people
gender name
M raa
F chico
M july
F sergio
TABLE2: serial_numbers
gender serial
M 4
F 5
I want the result to be
result
gender name serial
M raa 4
F chico 5
M july 4
F sergio 5
I'm creating here the datasets to illustrate how to merge both datasets:
data people;
infile cards;
length gender $1
name $10;
input gender name;
cards;
M raa
F chico
M july
F sergio
;
run;
data serial_numbers;
length gender $1
serial 8;
infile cards;
input gender serial;
cards;
M 4
F 5
;
run;
Solution 1: use a proc sql to perform the join.
proc sql;
create table result as
select a.gender, a.name, b.serial
from people a LEFT JOIN serial_numbers b
on a.gender=b.gender;
quit;
proc print data=result;
run;
Solution 2: use a data step to merge both datasets. This requires the datasets to be sorted:
proc sort data=people;
by gender;
run;
proc sort data=serial_numbers;
by gender;
run;
data result;
merge people serial_numbers;
by gender;
run;
proc print data=result;
run;

How do I use PROC EXPAND to fill in time series observations within a panel (longitudinal) data set?

I'm using this SAS code:
data test1;
input cust_id $
month
category $
status $;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 ASD X
B 199912 ASD C
;
quit;
proc sql;
create view test2 as
select cust_id, input(put(month, 6.), yymmn6.) as month format date9.,
category, status from test1 order by cust_id, month asc;
quit;
proc expand data=test2 out=test3 to=month method=none;
by cust_id;
id month;
quit;
proc print data=test3;
title "after expand";
quit;
and I want to create a dataset that looks like this:
Obs cust_id month category status
1 A 01MAR2000 ABC C
2 A 01APR2000 DEF C
3 A 01MAY2000 . .
4 A 01JUN2000 XYZ 3
5 B 01OCT1999 ASD X
6 B 01NOV1999 . .
7 B 01DEC1999 ASD C
but the output from proc expand just says "Nothing to do. The data set WORK.TEST3 has 0 observations and 0 variables." I don't want/need to change the frequency of the data, just interpolate it with missing values.
What am I doing wrong here? I think proc expand is the correct procedure to use, based on this example and the documentation, but for whatever reason it doesn't create the data.
You need to add a VAR statement. Unfortunately, the variables need to be numeric. So just expand the month by cust_id. Then join back the original values.
proc expand data=test2 out=test3 to=month ;
by cust_id;
id month;
var _numeric_;
quit;
proc sql noprint;
create table test4 as
select a.*,
b.category,
b.status
from test3 as a
left join
test2 as b
on a.cust_id = b.cust_id
and a.month = b.month;
quit;
proc print data=test4;
title "after expand";
quit;

SAS Proc SQL Count Issue

I have one column of data and the column is named (Daily_Mileage). I have 15 different types of daily mileages and 250 rows. I want a separate count for each of the 15 daily mileages. I am using PROC SQL in SAS and it does not like the Cross join command. I am not really sure what I should do but this is what I started:
PROC SQL;
select A, B
From (select count(Daily_Mileage) as A from Work.full where Daily_Mileage = 'Farm Utility Vehicle (Class 7)') a
cross join (select count(Daily_Mileage) as B from Work.full where Daily_Mileage = 'Farm Truck Light (Class 35)') b);
QUIT;
Use case statements to define your counts as below.
proc sql;
create table submit as
select sum(case when Daily_Mileage = 'Farm Utility Vehicle (Class 7)'
then 1 else 0 end) as A,
sum(case when Daily_Mileage = 'Farm Truck Light (Class 35)'
then 1 else 0 end) as B
from Work.full
;
quit ;
Can't you just use a proc freq?
data example ;
input #1 Daily_Mileages $5. ;
datalines ;
TYPE1
TYPE1
TYPE2
TYPE3
TYPE3
TYPE3
TYPE3
;
run ;
proc freq data = example ;
table Daily_Mileages ;
run ;
/* Create an output dataset */
proc freq data = example ;
table Daily_Mileages /out=f_example ;
run ;
You can first create another column of ones, then SUM that column and GROUP BY Daily_Mileage. Let me know if I'm misunderstanding your questions.
PROC SQL;
CREATE TABLE tab1 AS
SELECT Daily_Mileage, 1 AS Count, SUM(Count) AS Sum
FROM <Whatever table your data is in>
GROUP BY Daily_Mileage;
QUIT;