Comparison of two data sets in SAS - sas

I have the following data set:
data data_one;
length X 3
Y $ 20;
input x y ;
datalines;
1 test
2 test
3 test1
4 test1
5 test
6 test
7 test1
run;
data data_two;
length Z 3
A $ 20;
input Z A;
datalines;
1 test
2 test1
3 test2
run;
What I would like to have is a data set which tells me how often column Y in data_one contains the same string of column A in data_two. The result should look like this one:
Obs test test1 test2
1 4 3 0
Thanks in advance!

First we need the counts for those values of Y present in data_one.
Then we create a sorted (for the next merge) list of the values present in data_two.
The data_one Y counts from 1. are merged with the list from 2.
The Y values present in data_two but not in data_one (b and not a) are assigned count=0, the Y values not present in data_two are discarded (if b).
The last passage transposes the vertical list of counts in an horizontal set of variables.
proc freq data=data_one noprint;
table y / out=count_one (keep=y count);
run;
proc sort data=data_two out=list_two (keep=a rename=(a=y)) nodupkey;
by a;
run;
data count_all;
merge count_one (in=a) list_two (in=b);
by y;
if (b and not a) then count=0;
if b;
run;
proc transpose data=count_all out=final (drop=_name_ _label_);
id y;
run;
The first 3 steps can be replaced with one proc SQL:
proc sql;
create table count_all as
select distinct
coalesce(t1.y,t2.a) as y,
case
when missing(t1.y) then 0
else count(t1.y)
end as N
from data_one as t1
right join data_two as t2
on t1.y=t2.a
group by 1
order by 1;
quit;
proc transpose data=count_all out=final (drop=_name_);
id y;
run;

Related

Add new empty rows to a SAS table with names from another table

Assume I have table foo which contains a (dynamic) list of new rows which I want to add to another table have, so that it yields a table want looking e.g. like this:
x y p_14 p_15
1 2 2 99
2 4 7 24
Example data for foo:
id row_name
14 p_14
15 p_15
Example data for have:
x y p Z
1 2 14 2
1 2 15 99
1 2 16 59
2 4 14 7
2 4 15 24
2 4 16 58
What I have so far is the following which is not yet in macro shape:
proc sql;
create table want as
select old.*, t1.p_14, t2.p_15 /* choosing non-duplicate rows */
from (select x, y from have) old
left join (select x, y, z as p_14 from have where p=14) t1
on old.x=t1.x and old.y=t1.y
left join (select x, y, z as p_15 from have where p=15) t2
on old.x=t2.x and old.y=t2.y
;
quit;
Ideally, I am aiming for a macro where which takes foo as input and automatically creates all the joins from above. Also, the solution should not spit out any warnings in the console. My challenge is how to dynamically choose the correct (non-duplicate) rows.
PS: This is a follow-up question of Populate SAS macro-variable using a SQL statement within another SQL statement? The important bit is that it is not a full transpose, I guess.
You can go from HAVE to WANT with PROC TRANSPOSE.
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
by x y ;
id p ;
var z;
run;
To limit it to the values of P that occur in FOO you could use a macro variable (as long as the number of observations in FOO is small enough).
proc sql noprint ;
select id into :idlist separated by ' ' from foo ;
quit;
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
where p in (&idlist) ;
by x y ;
id p ;
var z;
run;
If the issue is you want variable P_17 to be in the result even if 17 does not appear in HAVE then add a little more complexity. For example add another data step that will force the creation of the empty variables. You can generate the list of variable names from the list of id's in FOO.
proc sql noprint ;
select id , cats('p_',id)
into :idlist separated by ' '
, :varlist separated by ' '
from foo
;
quit;
proc transpose data=have out=want(drop=_name_) prefix=p_ ;
where p in (&idlist) ;
by x y ;
id p ;
var z;
run;
data want ;
set want (keep=x y);
array all &varlist ;
set want ;
run;
Results:
Obs x y p_14 p_15 p_17
1 1 2 2 99 .
2 2 4 7 24 .
If the number of values is too large to store in a single macro variable (limit 64K bytes) you could generate the WHERE statement with a data step to a file and use %INCLUDE to add the WHERE statement into the code.
filename where temp;
data _null_;
set foo end=eof;
file where ;
if _n_=1 then put 'where p in (' #;
put id # ;
if eof then put ');' ;
run;
proc transpose ... ;
%include where / source2;
...
Use macro program:
data have;
input x y p Z;
cards;
1 2 14 2
1 2 15 99
1 2 16 59
2 4 14 7
2 4 15 24
2 4 16 58
;
data foo;
input id row_name $;
cards;
14 p_14
15 p_15
;
%macro test(dsn);
proc sql;
select count(*) into:n trimmed from &dsn;
select id into: value separated by ' ' from &dsn;
create table want as
select distinct a.x,a.y,
%do i=1 %to &n;
%let cur=%scan(&value,&i);
t&i..p_&cur
%if &i<&n %then ,;
%else ;
%end;
from have a
%do i=1 %to &n;
%let cur=%scan(&value,&i);
left join have (where=(p=&cur) rename=(z=p_&cur.)) t&i.
on a.x=t&i..x and a.y=t&i..y
%end;
;
quit;
%mend;
%test(foo);

SAS for following scenario (most frequent observation)

Assume I have a data-set D1 as follows:
ID ATR1 ATR2 ATR3
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
I want to create a data-set D2 from this as follows
ID ATR1 ATR2 ATR3
1 A R W
2 C T X
3 D U I
In other words, Data-set D2 consists of unique IDs from D1. For each ID in D2, the values of ATR1-ATR3 are selected as the most frequent (of the respective variable) among the records in D1 with the same ID. For example ID = 1 in D2 has ATR1 = A (most frequent).
I have one solution which is very clumsy. I simply sort copies of the data set `D1' three times (by ID and ATR1 e.g) and remove duplicates. I later merge the three data-sets to get what I want. However, I think there might be an elegant way to do this. I have about 20 such variables in the original data-set.
Thanks
/*
read and restructure so we end up with:
id attr_id value
1 1 A
1 2 R
1 3 W
etc.
*/
data a(keep=id attr_id value);
length value $1;
array attrs_{*} $ 1 attr_1 - attr_3;
infile cards;
input id attr_1 - attr_3;
do attr_id=1 to dim(attrs_);
value = attrs_{attr_id};
output;
end;
cards;
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
;
run;
/* calculate frequencies of values per id and attr_id */
proc freq data=a noprint;
tables id*attr_id*value / out=freqs(keep=id attr_id value count);
run;
/* sort so the most frequent value per id and attr_id ends up at the bottom of the group.
if there are ties then it's a matter of luck which value we get */
proc sort data = freqs;
by id attr_id count;
run;
/* read and recreate the original structure. */
data b(keep=id attr_1 - attr_3);
retain attr_1 - attr_3;
array attrs_{*} $ 1 attr_1 - attr_3;
set freqs;
by id attr_id;
if first.id then do;
do i=1 to dim(attrs_);
attrs_{i} = ' ';
end;
end;
if last.attr_id then do;
attrs_{attr_id} = value;
end;
if last.id then do;
output;
end;
run;

SAS data merge for existence

I am new to sas, i have two datasets as following,
data datasetA;
input a $1;
datalines;
1
2
3
4
5
6
7
;
run;
data datasetB;
input a $1;
datalines;
1
3
5
7
;
run;
If a appears B, then my desired output should be
1 Y
2 N
3 Y
4 N
5 Y
6 N
7 Y
This can be achived by in at least two ways:
merge via the data step or
left join with the proc sql.
This pdf compares pros and cons of merge versus sql in sas.
Since rbet showed you how to do this with the merge step, I'll show you how to do it with the proc sql.
proc sql;
create table work.result as
select t1.a, case when t2.a is not missing then 'Y' else 'N' end as exists
from work.datasetA t1
left join work.datasetB t2 on t1.a = t2.a order by t1.a;
data _a;
format a 3.;
do i = 1 to 7;
a = i;
output;
end;
drop i;
data _b;
format b 3.;
do i = 1 to 7 by 2;
b = i;
output;
end;
drop i;
run;
data _c;
merge _a(in=_a) _b(in=_b rename=(b=a));
format St $2.;
if _a and _b then St = 'Y';
else St = 'N';
by a;
run;
I would suggest you to google SAS merge or SAS proc sql join for introduction to basic concepts.

Select observations using variables from a second data set

I have two datasets in SAS. They both contain the same variable x. In the first data set, I want to remove those observations whose x value is also in the x values in the second data set.
Example,
data set1;
input x y z;
datalines;
1 1.5 2.2
1 2.1 9.0
2 4.2 4.4
3 4.5 2.4
;
run;
data set2;
input x y;
datalines;
1 15
2 44
;
run;
In set 1, I want to remove those observations if x=1 or x=2 where 1 and 2 come from the x values from second data set. I only want to keep the last row in set 1.
So your final answer should only include the 3? There are a few ways, but I find this the clearest method for understanding.
proc sql;
create table want as
select *
from set1
where x not in (select x from set2);
quit;
Data step version:
data want;
merge set1(in = _1)
set2(in = _2 keep = x);
by x;
if _1 and not(_2);
run;
This assumes that set1 and set2 have both either been sorted by x or have an index on x.

How to add new observation to already created dataset in SAS?

How to add new observation to already created dataset in SAS ? For example, if I have dataset 'dataX' with variable 'x' and 'y' and I want to add new observation which is multiplication by two of the
of the observation number n, how can I do it ?
dataX :
x y
1 1
1 21
2 3
I want to create :
dataX :
x y
1 1
1 21
2 3
10 210
where observation number four is multiplication by ten of observation number two.
data X;
input x y;
datalines;
1 1
1 21
2 3
;
run;
data X ;
set X end=eof;
if eof then do;
output;
x=10 ;y=210;
end;
output;
run;
Here is one way to do this:
data dataX;
input x y;
datalines;
1 1
1 21
2 3
run;
/* Create a new observation into temp data set */
data _addRec;
set dataX(firstobs=2); /* Get observation 2 */
x = x * 10; /* Multiply each by 10 */
y = y * 10;
output; /* Output new observation */
stop;
run;
/* Add new obs to original data set */
proc append base=dataX data=_addRec;
run;
/* Delete the temp data set (to be safe) */
proc delete data=_addRec;
run;
data a ;
do kk=1 to 5 ;
output ;
end ;
run;
data a2 ;
kk=999 ;
output ;
run;
data a; set a a2 ;run ;
proc print data=a ;run ;
Result:
The SAS System 1
OBS kk
1 1
2 2
3 3
4 4
5 5
6 999
You can use macro to obtain your desired result :
Write a macro which will read first DataSet and when _n_=2 it will multiply x and y with 10.
After that create another DataSet which will hold only your muliplied value let say x'=10x and y'=10y.
Pass both DataSet in another macro which will set the original datset and newly created dataset.
Logic is you have to create another dataset with value 10x and 10y and after that set wih previous dataset.
I hope this will help !