Just a general question lets say I have two datasets called dataset1 and dataset2 and If I want to compare the rows of dataset1 with the complete dataset2 so essentially compare each row of dataset1 with dataset2. Below is just an example of the two datasets
Dataset1
EmployeeID Name Employeer
12345 John Microsoft
1234567 Alice SAS
1234565 Jim IBM
Dataset1
EmployeeID2 Name DateAbsent
12345 John 25/06/2009
12345 John 26/06/2009
1234567 Alice 27/06/2010
1234567 Alice 30/06/2011
1234567 Alice 2/8/2012
12345 John 28/06/2009
12345 John 25/07/2009
12345 John 25/08/2009
1234565 Jim 26/08/2009
1234565 Jim 27/08/2010
1234565 Jim 28/08/2011
1234565 Jim 29/08/2012
I have written some programming logic its not sas code, this is just my logic
for item in dataset1:
for item2 in dataset2:
if item.EmployeeID=item2.EmployeeID2 and item.Name=item2.Name then output newSet
This is an inner join.
proc sql noprint;
create table output as
select a.EmployeeId,
a.Name,
a.Employeer,
b.DateAbsent
from dataset1 as a
inner join
dataset2 as b
on a.EmployeeID = b.EmployeeID2
and a.Name = b.name;
quit;
I recommend reading the SAS documentation on PROC SQL if you are unfamiliar with the syntax
To do this in a Data step, the data sets need to be sorted by the variables to join on (or indexed). Also the variable names need to be the same, so I will assume both variables are EmployeeID.
/*sort*/
proc sort data=dataset1;
by EmployeeID Name;
run;
proc sort data=dataset2;
by EmployeeID Name;
run;
data output;
merge dataset1 (in=ds1) dataset2 (inds2);
by EmployeeID Name;
if ds1 and ds2;
run;
The data step does the loop for you. It needs sorted sets because it only takes 1 pass over the data sets. The if clause checks to make sure you are getting a value from both data sets.
Is your goal to compare the two dataset and see where there are differences? Proc Compare will do this for you. You can compare specific columns or the entire dataset.
Related
I have two datasets described below
data1:
$restaurant $reviewers
A Tom
B Jack.Mary.Joan
C Tom.Joan
D Rose
data2 (sorted by the friends numbers):
$user $friends
Tom Joan.Mary.Jack
Jack Tom.Rose
Mary Tom
Joan Tom
The question is to calculate the overlap in the reviews of these users with the reviews of their friends.
Take an example of Tom, the restaurants Toms friends reviewed are B and C, from which C was also reviewed by Tom. So here the percentage is C/B+C = 1/2, so the overlap is 50%.
I think I need a loop to work across two datasets, but with very basic knowledge of SAS, I don't know how. Has anybody an idea?
Thank you very much.
You should try something like this.
data reviews;
infile datalines dsd dlm=",";
input restaurant $ reviewer $;
datalines;
A,Tom
B,Jack
B,Mary
B,Joan
C,Tom
C,Joan
D,Rose
;
run;
data users;
infile datalines dsd dlm=",";
input user $ friend $;
datalines;
Tom,Joan
Tom,Mary
Tom,Jack
Jack,Tom
Jack,Rose
Mary,Tom
Joan,Tom
;
run;
proc sql;
create table want as
select t1.user
,sum(case when t3.restaurant=t2.restaurant then 1 else 0 end)/count(*) as percentage
from users t1
inner join reviews t2
on t1.user=t2.reviewer
inner join reviews t3
on t1.friend=t3.reviewer
group by t1.user
;
quit;
I did'nt get your 0,5 value for Tom, but maybe you have a mistake.
So you can adapt the code as needed.
I followed the logic from here :
How to check percentage overlap in SAS
The following inherited simplified code is meant to replace missing values of a column with the values of not missing entries in a group:
DATA WORK.TOYDATA;
INPUT Category $ PRICE;
DATALINES;
Cat1 2
Cat1 .
Cat1 .
Cat2 .
Cat2 3
Cat2 .
;
DATA WORK.OUTTOYDATA;
SET WORK.TOYDATA;
BY Category ;
RETAIN _PRICE;
IF FIRST.Category THEN _PRICE=PRICE;
IF NOT MISSING(PRICE) THEN _PRICE=PRICE;
ELSE PRICE=_PRICE;
DROP _PRICE;
RUN;
Unfortunately, this will not work if the first entry in a group is missing. How could this be fixed?
As SAS works row by row through the dataset there is no value to replace if the first value is missing.
You could sort the data by Category and Price DESCENDING to circumvent this.
proc sort data= WORK.TOYDATA; by Category DESCENDING PRICE; run;
Or if there is only one NON-missing value by category you could use a sql join e.g.
proc sql;
create table WORK.OUTTOYDATA as
select a.Category, coalesce(a.PRICE, b.PRICE) as PRICE
from WORK.TOYDATA a
left join (select distinct Category, PRICE
from WORK.TOYDATA
where PRICE ne .
) b
on a.Category eq b.Category
;
quit;
As #Jetzler pointed out, the easiest way is just to sort the data. However, if you have multiple columns with missing values then you'd need to do multiple sorts, which isn't efficient.
Another option from doing a join is proc stdize which can be used to replace missing values with a simple measure (mean, median, sum etc). The default method will suffice in your example, you just need to add the reponly option which only replaces missing values and does not standardize the data.
DATA WORK.TOYDATA;
INPUT Category $ PRICE;
DATALINES;
Cat1 2
Cat1 .
Cat1 .
Cat2 .
Cat2 3
Cat2 .
;
run;
proc stdize data=TOYDATA out=want reponly;
by category;
var price;
run;
I am using the below code but in the final output I am not able to get the name in the first entry where income is 234234. How do I get name entry here.
data names;
input name $ age;
datalines;
John 10
Mary 12
Sally 12
Fred 1
Paul 2
;
run;
data check;
input name $ income;
datalines;
Mary 121212
Fred 334343
Ben 234234
;
Proc sql;
title 'Inner Join';
create table common_names as
select * from names as n right join check as c on
n.name = c.name;
run;
Proc print data = common_names;
run;
Output
Inner Join
Obs name age income
1 . 234234
2 Fred 1 334343
3 Mary 12 121212
You cannot create two variables with the same name, in this case the variable NAME. So either create two variables
select n.name as name1, c.name as name2, ....
or use the COALESCE() function to create a single variable.
select coalesce(n.name,c.name) as name, ....
You might also what to look at SAS's NATURAL join. That will link tables on variables with the same name and automatically coalesce the key variable values.
create table common_names as
select *
from names as n
natural right join check as c
;
NAME DATE
---- ----------
BOB 24/05/2013
BOB 12/06/2012
BOB 19/10/2011
BOB 05/02/2010
BOB 05/01/2009
CARL 15/05/2011
LOUI 15/01/2014
LOUI 15/05/2013
LOUI 15/05/2012
DATA newdata;
SET mydata;
count + 1;
IF FIRST.name THEN count=1;
BY name DESCENDING date;
run;
here i got count group wise 1,2,3 so on..I want the output of name(all obs of bob) if count> 3. please help me..
The simplest way to do that is to output the last row for each ID if it is > 3, then merge that dataset back to your master dataset, keeping only matches. You could also use PROC FREQ to generate the dataset of counts and merge to that.
You can do it in a single datastep using a DoW loop, but that's more complicated, so I wouldn't recommend a new user do that.
I think this shows the power of SQL - though some would say since this generates a NOTE in the log it isn't good practice. Use the GROUP & HAVING clause in SQL to create a count of the names that you then limit to 3.
proc sql;
create table want as
select *
from have
group by name
having count(name)>3;
quit;
Here are a couple different ways to do this using SUBQUERIES in PROC SQL
Data HAVE;
Length NAME $50;
Input Name $ Date: ddmmyy10.;
Format date ddmmyy10.;
datalines;
BOB 24/05/2013
BOB 12/06/2012
BOB 19/10/2011
BOB 05/02/2010
BOB 05/01/2009
CARL 15/05/2011
LOUI 15/01/2014
LOUI 15/05/2013
LOUI 15/05/2012
;
Run;
Using a multiple-value subquery in the Where statement
Proc sql;
Create table WANT1 as
Select *
From Have
Where Name in (Select name from have b group by b.name having count(b.name)>3);
Quit;
Using a subquery in the From clause
Proc sql;
Create table WANT2 as
Select a.name, a.date
From Have a Inner Join (select name, count(name) as Count from have b group by b.name having Count>3)
On a.name=b.name
;
Quit;
I'm just learning SAS. This is a pretty simple question -- I'm probably overthinking it.
I have a data set called people_info and one of the variables is SocialSecurityNum. I have another table called invalid_ssn with a single variable: unique and invalid SocialSecurityNum observations.
I would like to have a DATA step (or PROC SQL step) that outputs to invalid_people_info if the SocialSecurityNum of the person (observation) matches one of the values in the invalid_ssn table. Otherwise, it will output back to people_info.
What's the best way to do this?
Edit: More info, to clarify...
people_info looks like this:
name SocialSecurityNum
joe 123
john 456
mary 876
bob 657
invalid_ssn looks like this:
SocialSecurityNum
456
876
What I want is for people_info to change (in place) and look like this:
name SocialSecurityNum
joe 123
bob 657
and a new table, called invalid_people_info to look like this:
name SocialSecurityNum
john 456
mary 876
The data step shown by Hong Ooi is great, but youou could also do this with proc sql without the need to sort first and also without actually doing a full merge.
proc sql noprint;
create table invalid_people_info as
select *
from people_info
where socialsecuritynum in (select distinct socialsecuritynum from invalid_ssn)
;
create table people_info as
select *
from people_info
where socialsecuritynum not in (select distinct socialsecuritynum from invalid_ssn)
;
quit;
This simply selects all rows where ssn is (not) in the distinct list of invalid ssn's.
Your requirement isn't clear. Do you want to remove all the invalid SSNs from people_info and put them into a new dataset? If so, this should work. You'll have to sort your datasets by SocialSecurityNum first.
data people_info invalid_people_info;
merge people_info (in=a) invalid_ssn (in=b);
by SocialSecurityNum;
if b then output invalid_people_info;
else output people_info;
run;