I have below two datasets and need the third dataset as an output.
ONE            TWO
----------     ----------
ID   FLAG      NUMB
1    N         2
2    Y         3
3    Y         9
4    N         2
5    N         3
9    Y         9
10   Y

OUTPUT
-------
ID   FLAG   NEW
1    N      N
2    Y      Y
3    Y      Y
4    N      N
5    N      N
9    Y      Y
10   Y      N
If ONE.ID is found in TWO.NUMB and ONE.FLAG = Y, then the new variable NEW = Y;
otherwise NEW = N.
I was able to do this using PROC SQL as below.
proc sql;
create table output as
(
select distinct id, flag, case when numb is null then 'N' else 'Y' end as NEW
from one
left join
two
on id = numb
and flag = 'Y'
);
quit;
Could this be done in DATA step/MERGE?
Since you already have a SQL attempt, here is an improvement on it.
This SQL step does not require a join. It also checks FLAG, as the stated rule requires:
proc sql noprint;
create table output as
select distinct *, case
when flag = "Y" and id in (select distinct numb from two) then "Y"
else "N"
end as new
from one
;
quit;
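To answer the DATA step/MERGE part of the question directly, here is a minimal sketch. It assumes the data sets are named ONE and TWO as in the question; the names one_sorted, two_ids and output_merge are invented for illustration. TWO is de-duplicated and NUMB is renamed to ID so that both data sets share a BY variable.

```sas
/* Unique lookup list of TWO.NUMB, renamed so it can be merged by ID */
proc sort data=two(keep=numb rename=(numb=id)) out=two_ids nodupkey;
  by id;
run;

/* MERGE needs both inputs sorted by the BY variable */
proc sort data=one out=one_sorted;
  by id;
run;

data output_merge;
  merge one_sorted(in=in_one) two_ids(in=in_two);
  by id;
  if in_one;                 /* keep only rows that exist in ONE */
  length new $1;
  if in_two and flag = 'Y' then new = 'Y';
  else new = 'N';
run;
```

Unlike the SELECT DISTINCT versions, this keeps any duplicate rows that ONE may contain.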
I am trying to create unique customer groups which are determined by customer interactivity across transactions.
Here is an example of the data:
Transaction #   Primary Customer   Cosigner   WANT: Customer Group
1               1                  2          A
2               1                  3          A
3               1                  4          A
4               1                  2          A
5               2                  5          A
6               3                  6          A
7               2                  1          A
8               3                  1          A
9               7                  8          B
10              9                             C
In this example, customer 1 is connected to customers 2-6 either directly or indirectly, so all transactions associated with customers 1-6 would be part of an "A" group. Customers 7 and 8 are directly connected and would be labeled as a "B" group. Customer 9 has no connections and is the single member of the "C" group.
Any suggestions are appreciated!
Your data can be considered the edges of a graph. So your request is to find the connected subgraphs of that graph. That question has an answer on Stackoverflow and SAS Communities. But this question is more on topic than that older SO question. So let's post the subnet SAS macro from the SAS Communities answer here on SO where it will be easier to find.
This simple macro uses repeated PROC SQL queries to build the list of connected subgraphs until all of the original records have been assigned to a subgraph.
The macro is set up to let you pass in the name of the source dataset and the names of the two variables that hold the ids of the nodes.
So first let's convert your printout into an actual SAS dataset.
data have;
input id primary cosign want $;
cards;
1 1 2 A
2 1 3 A
3 1 4 A
4 1 2 A
5 2 5 A
6 3 6 A
7 2 1 A
8 3 1 A
9 7 8 B
10 9 . C
;
Now we can call the macro and tell it that PRIMARY and COSIGN are the variables with the node ids and that SUBNET is the name for the new variable to hold the ids of the connected subgraphs. NOTE: This version treats the graph as directed by default.
%subnet(in=have,out=want,from=primary,to=cosign,subnet=subnet);
Results:
Obs id primary cosign want subnet
1 1 1 2 A 1
2 2 1 3 A 1
3 3 1 4 A 1
4 4 1 2 A 1
5 5 2 5 A 1
6 6 3 6 A 1
7 7 2 1 A 1
8 8 3 1 A 1
9 9 7 8 B 2
10 10 9 . C 3
Here is the code of the %SUBNET() macro.
%macro subnet(in=,out=,from=from,to=to,subnet=subnet,directed=1);
/*----------------------------------------------------------------------
SUBNET - Build connected subnets from pairs of nodes.
Input Table :FROM TO pairs of rows
Output Table:input data with &subnet added
Work Tables:
NODES - List of all nodes in input.
NEW - List of new nodes to assign to current subnet.
Algorithm:
Pick next unassigned node and grow the subnet by adding all connected
nodes. Repeat until all unassigned nodes are put into a subnet.
To treat the graph as undirected set the DIRECTED parameter to 0.
----------------------------------------------------------------------*/
%local subnetid next getnext ;
%*----------------------------------------------------------------------
Put code to get next unassigned node into a macro variable. This query
is used in two places in the program.
-----------------------------------------------------------------------;
%let getnext= select node into :next from nodes where subnet=.;
%*----------------------------------------------------------------------
Initialize subnet id counter.
-----------------------------------------------------------------------;
%let subnetid=0;
proc sql noprint;
*----------------------------------------------------------------------;
* Get list of all nodes ;
*----------------------------------------------------------------------;
create table nodes as
select . as subnet, &from as node from &in where &from is not null
union
select . as subnet, &to as node from &in where &to is not null
;
*----------------------------------------------------------------------;
* Get next unassigned node ;
*----------------------------------------------------------------------;
&getnext;
%do %while (&sqlobs) ;
*----------------------------------------------------------------------;
* Set subnet to next id ;
*----------------------------------------------------------------------;
%let subnetid=%eval(&subnetid+1);
update nodes set subnet=&subnetid where node=&next;
%do %while (&sqlobs) ;
*----------------------------------------------------------------------;
* Get list of connected nodes for this subnet ;
*----------------------------------------------------------------------;
create table new as
select distinct a.&to as node
from &in a, nodes b, nodes c
where a.&from= b.node
and a.&to= c.node
and b.subnet = &subnetid
and c.subnet = .
;
%if "&directed" ne "1" %then %do;
insert into new
select distinct a.&from as node
from &in a, nodes b, nodes c
where a.&to= b.node
and a.&from= c.node
and b.subnet = &subnetid
and c.subnet = .
;
%end;
*----------------------------------------------------------------------;
* Update subnet for these nodes ;
*----------------------------------------------------------------------;
update nodes set subnet=&subnetid
where node in (select node from new )
;
%end;
*----------------------------------------------------------------------;
* Get next unassigned node ;
*----------------------------------------------------------------------;
&getnext;
%end;
*----------------------------------------------------------------------;
* Create output dataset by adding subnet number. ;
*----------------------------------------------------------------------;
create table &out as
select distinct a.*,b.subnet as &subnet
from &in a , nodes b
where a.&from = b.node
;
quit;
%mend subnet ;
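The effect of the DIRECTED parameter is easiest to see on a small case. The sketch below assumes the %SUBNET macro above has already been compiled; the EDGES data set, its variables node1 and node2, and the output names are invented for illustration. Both edges point into node 1, so a directed traversal has no path between nodes 2 and 3, while an undirected one does.

```sas
/* Two edges that both point INTO node 1 */
data edges;
  input node1 node2;
cards;
2 1
3 1
;

/* Default (directed=1): links are followed only from FROM to TO */
%subnet(in=edges, out=groups_directed, from=node1, to=node2);

/* directed=0: links are followed in both directions */
%subnet(in=edges, out=groups_undirected, from=node1, to=node2, directed=0);

proc print data=groups_directed;   run;
proc print data=groups_undirected; run;
```

With the default setting the two rows should end up in different subnets; with directed=0 they should share one. For customer/cosigner data where the direction of the relationship does not matter, directed=0 is probably the safer choice.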
You can use Hashes to compute your group identities and their members:
Example:
Proc DS2 is used because its hash declarations are succinct and the logic can be coded clearly as methods. The final pair Q H bridges two groups that were independent up to that linkage point, so the two groups have to be merged.
data customer;
length id1-id2 $8;
input id1-id2 @@; output;
datalines;
A B A C B A B D C A C D D C D .
E F E . F E F .
H J H K K L K M
P Q Q R R S S T
Q H
;
run;
%if %sysfunc(exist(vs)) %then %do;
proc delete data=vs;
proc delete data=gs;
%end;
options nosource;
proc ds2 ;
data _null_ ;
declare char(8) v1 v2 v;
declare double g gnew;
declare package hash vs([v], [v g], 0, '', 'ascending');
declare package hash gs([g], [g v], 0, '', 'ascending', '', '', 'multidata');
method add11(char(8) x1, char(8) x2); /* neither vertex has been seen before */
g + 1;
v = x1; vs.add(); gs.add();
v = x2; vs.add(); gs.add();
* put 'add11' x1 $char1. x2 $char1. ' ' g;
end;
method add10(char(8) x1, char(8) x2); /* x1 is not in a group, x2 is */
v = x2; vs.find(); * get group;
v = x1; vs.add(); * apply group to x1;
gs.add();
* put 'add10' x1 $char1. x2 $char1. ' ' g;
end;
method add01(char(8) x1, char(8) x2); /* x1 is in a group, x2 is not */
v = x1; vs.find(); * get group;
v = x2; vs.add(); * apply group to x2;
gs.add();
* put 'add01' x1 $char1. x2 $char1. ' ' g;
end;
method add00(char(8) x1, char(8) x2); /* both x1 and x2 are in a group */
declare double g1 g2;
v = x1; vs.find(); g1 = g; * get group of x1;
v = x2; vs.find(); g2 = g; * get group of x2;
if g1 ^= g2 then do;
* merge groups, v of higher group moved to lower group;
gnew = min(g1,g2);
g = max(g1,g2);
gs.find();
vs.replace([v], [v gnew]);
do while (gs.has_next() = 0);
gs.find_next();
vs.replace([v], [v gnew]);
end;
gs.removeall();
end;
* put 'add00' x1 $char1. x2 $char1. ' ' g g1 g2;
end;
method run();
declare int e1 e2;
declare char(2) f;
set customer;
if not missing(id1) and not missing(id2);
e1 = vs.check([id1]);
e2 = vs.check([id2]);
select (cats(e1^=0,e2^=0));
when ('11') add11(id1,id2);
when ('10') add10(id1,id2);
when ('01') add01(id1,id2);
when ('00') add00(id1,id2);
otherwise stop;
end;
end;
method term();
vs.output('vs');
gs.output('gs');
end;
run;
quit;
I would like to add a new column to a dataset but I am not sure how to do so. My dataset has a variable called KEYVAR (character variable) with three different values. A participant can appear multiple times in my dataset, with each row containing the same or a different value for KEYVAR. What I want to do is create a new variable called NEWVAR that counts how many times a participant has a specific value for KEYVAR; when a participant does not have an observation for that specific value, I want NEWVAR to be zero.
Here's an example of the dataset I would like (in this example, I want to count every instance of "Y" per participant as NEWVAR):
have
PARTICIPANT KEYVAR
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
want
PARTICIPANT KEYVAR NEWVAR
A Y 1
A N 1
B Y 3
B Y 3
B Y 3
C W 0
C N 0
C W 0
D Y 2
D N 2
D N 2
D Y 2
D W 2
You can use Proc SQL to compute an aggregate result over a group meeting a criterion, and have that aggregate value automatically remerged into the result set.
-OR-
Use a MEANS, TRANSPOSE, MERGE approach
Sample Code (SQL)
data have;
input ID $ value $; datalines;
A Y
A N
B Y
B Y
B Y
C W
C N
C W
D Y
D N
D N
D Y
D W
E X
;
proc sql;
create table want as
select ID, value
, sum(value='Y') as Y_COUNT /* relies on logic eval 'math' 0 false, 1 true */
, sum(value='N') as N_COUNT
, sum(value='W') as W_COUNT
from have
group by ID
;
quit;
Sample Code (PROC and MERGE)
* format for PRELOADFMT and COMPLETETYPES;
proc format;
value $eachvalue
'Y' = 'Y'
'N' = 'N'
'W' = 'W'
other = '-';
;
run;
* Count how many per combination ID/VALUE;
proc means noprint data=have nway completetypes;
class ID ;
class value / preloadfmt;
format value $eachvalue.;
output out=freqs(keep=id value _freq_);
run;
* TRANSPOSE reshapes to wide (across) data layout, one row per ID;
proc transpose data=freqs suffix=_count out=counts_across(drop=_name_);
by id;
id value;
var _freq_;
where put(value,$eachvalue.) ne '-';
run;
* MERGE;
data want_way_2;
merge have counts_across;
by id;
run;
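A third option, not part of the answer above, is a single DATA step with a double DOW loop. This is only a sketch: it assumes HAVE is already grouped in ID order, as the sample data is, and the names want_dow and y_count are invented for illustration. The first loop counts the 'Y' rows of one ID group; the second loop re-reads the same rows and writes each one out with the finished total.

```sas
data want_dow;
  y_count = 0;                     /* reset the counter for each ID group */

  /* Pass 1: read the whole group and count rows where value = 'Y' */
  do until (last.id);
    set have;
    by id;
    y_count + (value = 'Y');       /* comparison evaluates to 1 or 0 */
  end;

  /* Pass 2: read the same group again and output every row */
  do until (last.id);
    set have;
    by id;
    output;
  end;
run;
```

Groups with no 'Y' rows keep y_count = 0, which matches the zero requested for NEWVAR.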
How can I do the code below in PROC SQL?
It consists of two PROC SORT steps and one MERGE, given below.
proc sort data=new out=new1 nodupkey;
by id;
where roll=100;
run;
proc sort data=new2 out=new4 nodupkey;
by id;
where roll=100;
run;
data score;
merge new4 (in=a) new1;
by id;
if a;
run;
The merge you show is equivalent to SQL left-join. You want all the rows from "new2" and ignore all the rows from "new" that don't have a common id. The uniqueness of the id (per the pre-sorts) further supports a left-join equivalence.
Proc SQL;
select new.*, new2.*
from new2
left join new
on new.id = new2.id and new.roll = 100
where new2.roll = 100
order by new2.id;
quit;
For the atypical scenario where there are many:many ids in the merge, the left-join is not equivalent.
I did leave out the NODUPKEY equivalent. Presuming option EQUALS is in effect, the selection of a group's first row would be equivalent. The undocumented MONOTONIC() function can be used to apply a default row order to a sub-query, which can then be used in a BY-group HAVING expression.
data LEFT;
input id x1 x2 x3;
datalines;
1 1 1 1
1 2 2 2
1 3 3 3
2 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
;
run;
data RIGHT;
input id y1 y2 y3 x1;
datalines;
1 1 1 1 11
2 1 1 1 22
3 1 2 3 4
3 2 3 4 5
3 3 4 5 6
4 1 1 1 44
6 6 6 6 6
;
run;
proc sql;
select
LEFT.id
, coalesce(RIGHT.x1,LEFT.x1) as x1
, LEFT.x2
, LEFT.x3
, RIGHT.y1
, RIGHT.y2
, RIGHT.y3
from
(
select * from (select monotonic() as _seq_, * from LEFT) group by id having _seq_ = min(_seq_)
)
as LEFT
left join
(
select * from (select monotonic() as _seq_, * from RIGHT) group by id having _seq_ = min(_seq_)
)
as RIGHT
on
LEFT.id = RIGHT.id
;
quit;
I feel the need to reiterate that a SQL left join is not always the same as a merge, and SQL does not have the common variable 'overlaying' that is implicit in the DATA Step. When LEFT and RIGHT collide on non-key variables, you need to select a coalescence of the common variables into a new like-named variable in the output.
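For comparison, here is a sketch of the DATA step side of that statement, using the LEFT and RIGHT sample data above. The names left_first, right_first and merged are invented for illustration. NODUPKEY keeps the first row per id (assuming the EQUALS option is in effect), and the common non-key variable x1 is overlaid by the value from the data set listed last in the MERGE statement.

```sas
/* First row per id from each side, like the NODUPKEY sorts in the question */
proc sort data=LEFT  out=left_first  nodupkey; by id; run;
proc sort data=RIGHT out=right_first nodupkey; by id; run;

data merged;
  merge left_first(in=in_left) right_first;
  by id;
  if in_left;   /* keep only ids present in LEFT, like a left join */
  /* x1 exists in both inputs: for a matched id the value read from */
  /* right_first replaces the one read from left_first              */
run;
```

One difference from the coalesce() in the SQL version: if a matching RIGHT row had a missing x1, the merge would overlay that missing value, whereas coalesce() would fall back to LEFT.x1.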
I have the following data set:
data data_one;
length X 3
Y $ 20;
input x y ;
datalines;
1 test
2 test
3 test1
4 test1
5 test
6 test
7 test1
run;
data data_two;
length Z 3
A $ 20;
input Z A;
datalines;
1 test
2 test1
3 test2
run;
What I would like to have is a data set which tells me how often column Y in data_one contains the same string of column A in data_two. The result should look like this one:
Obs test test1 test2
1 4 3 0
Thanks in advance!
1. First we need the counts for those values of Y present in data_one.
2. Then we create a sorted (for the next merge) list of the values present in data_two.
3. The data_one Y counts from 1. are merged with the list from 2.
4. The Y values present in data_two but not in data_one (b and not a) are assigned count=0; the Y values not present in data_two are discarded (if b).
5. The last step transposes the vertical list of counts into a horizontal set of variables.
proc freq data=data_one noprint;
table y / out=count_one (keep=y count);
run;
proc sort data=data_two out=list_two (keep=a rename=(a=y)) nodupkey;
by a;
run;
data count_all;
merge count_one (in=a) list_two (in=b);
by y;
if (b and not a) then count=0;
if b;
run;
proc transpose data=count_all out=final (drop=_name_ _label_);
id y;
run;
The first 3 steps can be replaced with one proc SQL:
proc sql;
create table count_all as
select distinct
coalesce(t1.y,t2.a) as y,
case
when missing(t1.y) then 0
else count(t1.y)
end as N
from data_one as t1
right join data_two as t2
on t1.y=t2.a
group by 1
order by 1;
quit;
proc transpose data=count_all out=final (drop=_name_);
id y;
run;
How do I add a new observation to an already created dataset in SAS? For example, if I have dataset 'dataX' with variables 'x' and 'y' and I want to add a new observation which is a multiple of observation number n, how can I do it?
dataX :
x y
1 1
1 21
2 3
I want to create :
dataX :
x y
1 1
1 21
2 3
10 210
where observation number four is observation number two multiplied by ten.
data X;
input x y;
datalines;
1 1
1 21
2 3
;
run;
data X ;
set X end=eof;
if eof then do;
output;
x=10 ;y=210;
end;
output;
run;
Here is one way to do this:
data dataX;
input x y;
datalines;
1 1
1 21
2 3
run;
/* Create a new observation into temp data set */
data _addRec;
set dataX(firstobs=2); /* Get observation 2 */
x = x * 10; /* Multiply each by 10 */
y = y * 10;
output; /* Output new observation */
stop;
run;
/* Add new obs to original data set */
proc append base=dataX data=_addRec;
run;
/* Delete the temp data set (to be safe) */
proc delete data=_addRec;
run;
data a ;
do kk=1 to 5 ;
output ;
end ;
run;
data a2 ;
kk=999 ;
output ;
run;
data a; set a a2 ;run ;
proc print data=a ;run ;
Result:
The SAS System 1
OBS kk
1 1
2 2
3 3
4 4
5 5
6 999
You can use a macro to obtain your desired result:
Write a macro that reads the first data set and, when _n_=2, multiplies x and y by 10.
Then create another data set that holds only the multiplied values, say x'=10x and y'=10y.
Pass both data sets to another macro that SETs the original data set and the newly created data set together.
The logic is that you create another data set with the values 10x and 10y and then SET it together with the previous data set.
I hope this helps!
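The steps described above can be sketched as a single macro. This is an illustration rather than a definitive implementation: the macro name add_scaled_obs, its parameters and the work data set _scaled are all invented here, and the variable names x and y are hard-coded to match dataX.

```sas
data dataX;
  input x y;
  datalines;
1 1
1 21
2 3
;
run;

%macro add_scaled_obs(data=, obs=, factor=);
  /* Step 1: read only observation &obs and scale x and y */
  data _scaled;
    set &data(firstobs=&obs obs=&obs);
    x = x * &factor;
    y = y * &factor;
  run;

  /* Step 2: stack the original rows and the new row */
  data &data;
    set &data _scaled;
  run;

  /* Clean up the temporary data set */
  proc delete data=_scaled;
  run;
%mend add_scaled_obs;

%add_scaled_obs(data=dataX, obs=2, factor=10);

proc print data=dataX;
run;
```

Using the FIRSTOBS= and OBS= data set options avoids testing _n_ on every row, and one macro is enough for both steps.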