Identifying groups/networks of customers - sas

I am trying to create unique customer groups which are determined by customer interactivity across transactions.
Here is an example of the data:
Transaction #   Primary Customer   Cosigner   WANT: Customer Group
1               1                  2          A
2               1                  3          A
3               1                  4          A
4               1                  2          A
5               2                  5          A
6               3                  6          A
7               2                  1          A
8               3                  1          A
9               7                  8          B
10              9                  .          C
In this example, customer 1 is connected to customers 2-6 either directly or indirectly, so all transactions associated with customers 1-6 would be part of an "A" group. Customers 7 and 8 are directly connected and would be labeled as a "B" group. Customer 9 has no connections and is the single member of the "C" group.
Any suggestions are appreciated!

Your data can be considered the edges of a graph, so your request is to find the connected subgraphs of that graph. That question has answers on Stack Overflow and SAS Communities, but this question is more on topic than the older Stack Overflow one. So let's post the subnet SAS macro from the SAS Communities answer here, where it will be easier to find.
This simple macro uses repeated PROC SQL queries to build the list of connected subgraphs until all of the original records have been assigned to a subgraph.
The macro is set up to let you pass in the name of the source dataset and the names of the two variables that hold the ids of the nodes.
So first let's convert your printout into an actual SAS dataset.
data have;
input id primary cosign want $;
cards;
1 1 2 A
2 1 3 A
3 1 4 A
4 1 2 A
5 2 5 A
6 3 6 A
7 2 1 A
8 3 1 A
9 7 8 B
10 9 . C
;
Now we can call the macro and tell it that PRIMARY and COSIGN are the variables with the node ids and that SUBNET is the name for the new variable to hold the ids of the connected subgraphs. NOTE: This version treats the graph as directed by default.
%subnet(in=have,out=want,from=primary,to=cosign,subnet=subnet);
Results:
Obs id primary cosign want subnet
1 1 1 2 A 1
2 2 1 3 A 1
3 3 1 4 A 1
4 4 1 2 A 1
5 5 2 5 A 1
6 6 3 6 A 1
7 7 2 1 A 1
8 8 3 1 A 1
9 9 7 8 B 2
10 10 9 . C 3
Here is the code of the %SUBNET() macro.
%macro subnet(in=,out=,from=from,to=to,subnet=subnet,directed=1);
/*----------------------------------------------------------------------
SUBNET - Build connected subnets from pairs of nodes.
Input Table :FROM TO pairs of rows
Output Table:input data with &subnet added
Work Tables:
NODES - List of all nodes in input.
NEW - List of new nodes to assign to current subnet.
Algorithm:
Pick next unassigned node and grow the subnet by adding all connected
nodes. Repeat until all unassigned nodes are put into a subnet.
To treat the graph as undirected set the DIRECTED parameter to 0.
----------------------------------------------------------------------*/
%local subnetid next getnext ;
%*----------------------------------------------------------------------
Put code to get next unassigned node into a macro variable. This query
is used in two places in the program.
-----------------------------------------------------------------------;
%let getnext= select node into :next from nodes where subnet=.;
%*----------------------------------------------------------------------
Initialize subnet id counter.
-----------------------------------------------------------------------;
%let subnetid=0;
proc sql noprint;
*----------------------------------------------------------------------;
* Get list of all nodes ;
*----------------------------------------------------------------------;
create table nodes as
select . as subnet, &from as node from &in where &from is not null
union
select . as subnet, &to as node from &in where &to is not null
;
*----------------------------------------------------------------------;
* Get next unassigned node ;
*----------------------------------------------------------------------;
&getnext;
%do %while (&sqlobs) ;
*----------------------------------------------------------------------;
* Set subnet to next id ;
*----------------------------------------------------------------------;
%let subnetid=%eval(&subnetid+1);
update nodes set subnet=&subnetid where node=&next;
%do %while (&sqlobs) ;
*----------------------------------------------------------------------;
* Get list of connected nodes for this subnet ;
*----------------------------------------------------------------------;
create table new as
select distinct a.&to as node
from &in a, nodes b, nodes c
where a.&from= b.node
and a.&to= c.node
and b.subnet = &subnetid
and c.subnet = .
;
%if "&directed" ne "1" %then %do;
insert into new
select distinct a.&from as node
from &in a, nodes b, nodes c
where a.&to= b.node
and a.&from= c.node
and b.subnet = &subnetid
and c.subnet = .
;
%end;
*----------------------------------------------------------------------;
* Update subnet for these nodes ;
*----------------------------------------------------------------------;
update nodes set subnet=&subnetid
where node in (select node from new )
;
%end;
*----------------------------------------------------------------------;
* Get next unassigned node ;
*----------------------------------------------------------------------;
&getnext;
%end;
*----------------------------------------------------------------------;
* Create output dataset by adding subnet number. ;
*----------------------------------------------------------------------;
create table &out as
select distinct a.*,b.subnet as &subnet
from &in a , nodes b
where a.&from = b.node
;
quit;
%mend subnet ;
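As a cross-check outside SAS, the "pick an unassigned node and grow the subnet until nothing new is found" idea can be sketched in Python. This is only an illustration of the algorithm, not a translation of the macro: the function name label_subnets and its directed flag are hypothetical, and unlike the macro the sketch defaults to an undirected graph, which is probably what customer groups call for.

```python
from collections import defaultdict

def label_subnets(edges, directed=False):
    """Give every node a subnet id by growing from each unassigned node in turn."""
    neighbours = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        # A missing endpoint (None) still registers the other node on its own.
        if src is not None:
            nodes.add(src)
        if dst is not None:
            nodes.add(dst)
        if src is None or dst is None:
            continue
        neighbours[src].add(dst)
        if not directed:
            neighbours[dst].add(src)

    subnet = {}
    next_id = 0
    for start in sorted(nodes):
        if start in subnet:
            continue
        next_id += 1
        subnet[start] = next_id
        frontier = [start]
        while frontier:  # keep growing until no new node is reachable
            node = frontier.pop()
            for other in neighbours[node]:
                if other not in subnet:
                    subnet[other] = next_id
                    frontier.append(other)
    return subnet

# (primary, cosign) pairs from the question; None stands for the missing cosigner.
pairs = [(1, 2), (1, 3), (1, 4), (1, 2), (2, 5), (3, 6),
         (2, 1), (3, 1), (7, 8), (9, None)]
groups = label_subnets(pairs)
print(groups)

# With directed=True a single pair (2, 1) is split, because the walk starts
# at node 1, which has no outgoing edge.
split = label_subnets([(2, 1)], directed=True)
print(split)
```

On the question's data the sketch puts customers 1-6 in subnet 1, customers 7 and 8 in subnet 2, and customer 9 in subnet 3. The last two lines show why direction matters in the sketch; if your real data can list the same relationship in only one direction, the undirected setting is likely the safer choice.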

You can use Hashes to compute your group identities and their members:
Example:
Proc DS2 is used for its succinct hash declarations and the clarity of the resulting code. The final pair Q H bridges two groups that were independent up to that linkage point, which requires the two groups to be merged.
data customer;
length id1-id2 $8;
input id1-id2 ##; output;
datalines;
A B A C B A B D C A C D D C D .
E F E . F E F .
H J H K K L K M
P Q Q R R S S T
Q H
;
run;
%if %sysfunc(exist(vs)) %then %do;
proc delete data=vs;
proc delete data=gs;
%end;
options nosource;
proc ds2 ;
data _null_ ;
declare char(8) v1 v2 v;
declare double g gnew;
declare package hash vs([v], [v g], 0, '', 'ascending');
declare package hash gs([g], [g v], 0, '', 'ascending', '', '', 'multidata');
method add11(char(8) x1, char(8) x2); /* neither vertex has been seen before */
g + 1;
v = x1; vs.add(); gs.add();
v = x2; vs.add(); gs.add();
* put 'add11' x1 $char1. x2 $char1. ' ' g;
end;
method add10(char(8) x1, char(8) x2); /* x1 is not in a group, x2 is */
v = x2; vs.find(); * get group of x2;
v = x1; vs.add(); * apply group to x1;
gs.add();
* put 'add10' x1 $char1. x2 $char1. ' ' g;
end;
method add01(char(8) x1, char(8) x2); /* x1 is in a group, x2 is not */
v = x1; vs.find(); * get group of x1;
v = x2; vs.add(); * apply group to x2;
gs.add();
* put 'add01' x1 $char1. x2 $char1. ' ' g;
end;
method add00(char(8) x1, char(8) x2); /* both x1 and x2 are in a group */
declare double g1 g2;
v = x1; vs.find(); g1 = g; * get group of x1;
v = x2; vs.find(); g2 = g; * get group of x2;
if g1 ^= g2 then do;
* merge groups, v of higher group moved to lower group;
gnew = min(g1,g2);
g = max(g1,g2);
gs.find();
vs.replace([v], [v gnew]);
do while (gs.has_next() = 0);
gs.find_next();
vs.replace([v], [v gnew]);
end;
gs.removeall();
end;
* put 'add00' x1 $char1. x2 $char1. ' ' g g1 g2;
end;
method run();
declare int e1 e2;
declare char(2) f;
set customer;
if not missing(id1) and not missing(id2);
e1 = vs.check([id1]);
e2 = vs.check([id2]);
select (cats(e1^=0,e2^=0));
when ('11') add11(id1,id2);
when ('10') add10(id1,id2);
when ('01') add01(id1,id2);
when ('00') add00(id1,id2);
otherwise stop;
end;
end;
method term();
vs.output('vs');
gs.output('gs');
end;
run;
quit;
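The four cases handled by the methods above (neither vertex seen, only one seen, both seen) can be sketched with two Python dictionaries. This is a hedged illustration of the idea rather than a translation of the DS2 program: the names group_pairs, group_of and members are hypothetical, and keeping the member sets in sync when two groups merge is a choice made in this sketch.

```python
def group_pairs(pairs):
    """Assign group ids to vertices, merging groups when a pair bridges them."""
    group_of = {}  # vertex -> group id
    members = {}   # group id -> set of vertices
    next_group = 0
    for a, b in pairs:
        if a is None or b is None:
            continue  # incomplete pairs are skipped
        seen_a, seen_b = a in group_of, b in group_of
        if not seen_a and not seen_b:
            # neither vertex seen before: open a new group
            next_group += 1
            group_of[a] = group_of[b] = next_group
            members[next_group] = {a, b}
        elif seen_a and not seen_b:
            group_of[b] = group_of[a]
            members[group_of[a]].add(b)
        elif seen_b and not seen_a:
            group_of[a] = group_of[b]
            members[group_of[b]].add(a)
        else:
            ga, gb = group_of[a], group_of[b]
            if ga != gb:
                # merge: vertices of the higher group move to the lower group
                keep, drop = min(ga, gb), max(ga, gb)
                for vertex in members.pop(drop):
                    group_of[vertex] = keep
                    members[keep].add(vertex)
    return group_of, members

# The pairs from the datalines above; None stands for a missing id2.
raw = ("A B A C B A B D C A C D D C D . "
       "E F E . F E F . "
       "H J H K K L K M "
       "P Q Q R R S S T "
       "Q H").split()
pairs = [(x, None if y == "." else y) for x, y in zip(raw[0::2], raw[1::2])]
group_of, members = group_pairs(pairs)
for gid in sorted(members):
    print(gid, sorted(members[gid]))
```

In this sketch the P-Q-R-S-T chain first forms group 4 and is then folded into group 3 when the pair Q H arrives, leaving three groups.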

Related

Aggregate multiple vars on different groupings in one Proc SQL query

I need to aggregate about ten different variables over different groupings using Proc SQL.
Is there a way to achieve SUM() OVER ([partition_by_clause] order_by_clause) in one SQL query with different PARTITION BY clauses?
I've made an example here
data have;
infile cards;
input a b c d e;
cards;
1 2 3 4 5
2 2 4 5 6
1 4 3 4 7
3 4 4 5 8
;
run;
proc sql;
create table want as
select *,
sum(a) over (partition by b, c) as a1,
sum(b) over (partition by c, d) as b1,
sum(c) over (partition by d, e) as c1,
sum(d) over (partition by a, c) as d1
from have
;
quit;
I don't want to write multiple SQL queries, each grouping on different variables and calculating one variable per step.
Hope that makes sense.
Proc SQL does not implement windowing functions, so the PARTITION BY syntax found in other SQL implementations is not available. You can only use PARTITION BY through pass-through SQL to a connection whose database supports that syntax.
You could perform such a computation in DATA step using hashes.
data have;
infile cards;
input a b c d e ;
cards;
1 2 3 4 5
2 2 4 5 6
1 4 3 4 7
3 4 4 5 8
;
run;
data want;
if 0 then set have;
length a1 b1 c1 d1 8;
declare hash a1s();
a1s.defineKey('b', 'c');
a1s.defineData('a1');
a1s.defineDone();
declare hash b1s();
b1s.defineKey('c', 'd');
b1s.defineData('b1');
b1s.defineDone();
declare hash c1s();
c1s.defineKey('d', 'e');
c1s.defineData('c1');
c1s.defineDone();
declare hash d1s();
d1s.defineKey('a', 'c');
d1s.defineData('d1');
d1s.defineDone();
do while (not end);
set have end=end;
if a1s.find() = 0 then a1+a; else a1=a; a1s.replace();
if b1s.find() = 0 then b1+b; else b1=b; b1s.replace();
if c1s.find() = 0 then c1+c; else c1=c; c1s.replace();
if d1s.find() = 0 then d1+d; else d1=d; d1s.replace();
end;
do while (not last);
set have end=last;
a1s.find();
b1s.find();
c1s.find();
d1s.find();
output;
end;
format _numeric_ 4.;
stop;
run;
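The two-pass idea in the DATA step above (accumulate one total per key combination, then read the data again and look the totals up) can be sketched in Python with one dictionary per output column. The names specs and add_partition_sums are hypothetical and only serve this illustration.

```python
from collections import defaultdict

rows = [
    {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5},
    {"a": 2, "b": 2, "c": 4, "d": 5, "e": 6},
    {"a": 1, "b": 4, "c": 3, "d": 4, "e": 7},
    {"a": 3, "b": 4, "c": 4, "d": 5, "e": 8},
]

# output column -> (column to sum, partition columns), mirroring the four hashes
specs = {
    "a1": ("a", ("b", "c")),
    "b1": ("b", ("c", "d")),
    "c1": ("c", ("d", "e")),
    "d1": ("d", ("a", "c")),
}

def add_partition_sums(rows, specs):
    """Return copies of rows with one partition total added per spec."""
    totals = {out: defaultdict(int) for out in specs}
    # Pass 1: accumulate a total for every key combination.
    for row in rows:
        for out, (col, keys) in specs.items():
            totals[out][tuple(row[k] for k in keys)] += row[col]
    # Pass 2: attach the totals to each row.
    result = []
    for row in rows:
        new_row = dict(row)
        for out, (col, keys) in specs.items():
            new_row[out] = totals[out][tuple(row[k] for k in keys)]
        result.append(new_row)
    return result

want = add_partition_sums(rows, specs)
for row in want:
    print(row)
```

For the sample rows, b1 is 6 everywhere because each (c, d) combination occurs twice with b values 2 and 4, while a1 and c1 equal a and c because their key combinations are all distinct.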

Dynamic n in function LAG<n> (variable) SAS_part2

Do you know how to make the n in LAGn(variable) refer to a macro variable computed elsewhere in the program (in my case, the maximum of V2 within each V1)?
data example1;
input V1 value V2;
datalines;
a 1.0 2.0
a 1.0 1.0
a 1.0 1.0
b 1.0 1.0
b 1.0 1.0
;
proc sql;
select max(V2) format = 1. into :n
from example1;
quit;
data example1;
set example1;
by V1;
lagval=lag&n(V2);
run;
The code above is from user667489 and works when a single n applies to the whole column. Now n changes with V1.
I expect:
V1 value V2 MAX LAG
a 1.0 2.0 2 .
a 1.0 1.0 2 .
a 1.0 1.0 2 2
b 1.0 1.0 1 .
b 1.0 1.0 1 1
Forget about LAG(). Just add a counter variable and join on that.
Let's fix your example data step so it works.
data example1;
input V1 $ value V2;
datalines;
a 1 2
a 1 1
a 1 1
b 1 1
b 1 1
;
Now add a unique row id within each BY group.
data step1;
set example1;
by v1;
if first.v1 then row=0;
row+1;
run;
Now just join this dataset with itself.
proc sql ;
create table want as
select a.*,b.v2 as lag_v2
from (select *,max(v2) as max_v2 from step1 group by v1) a
left join step1 b
on a.v1= b.v1 and a.row = b.row + a.max_v2
;
quit;
Results:
Obs V1 value V2 row max_v2 lag_v2
1 a 1 2 1 2 .
2 a 1 1 2 2 .
3 a 1 1 3 2 2
4 b 1 1 1 1 .
5 b 1 1 2 1 1
Hopefully your real use case makes more sense than this example.
The LAG<n> function is an in-place stack of fixed depth that is specific to its location in the code and thus to the step state at invocation. The stack depth is fixed at n and cannot be altered dynamically at runtime.
A dynamic lag can be implemented in a SAS DATA step using a hash object. The double DOW technique allows a group to be measured first and its items to be operated upon afterwards.
Sample code
This example defines a hash object that maintains a stack of values within a group. A first DOW loop computes the maximum of a field, which becomes the dynamic stack height. The second DOW loop iterates over the group and retrieves the lag value while also building up the stack for future item lags.
* some faux data;
data have (keep=group value duration);
do group = 1 to 10;
limit = ceil(4 * ranuni(6));
put group= limit=;
do _n_ = 1 to 8 + 10*ranuni(123);
value = group*10 + _n_;
duration = 1 + floor(limit*ranuni(123));
output;
end;
end;
run;
* dynamic lag provided via hash;
data want;
if _n_ = 1 then do;
retain index lag_value .;
declare hash lag_stack();
lag_stack.defineKey('index');
lag_stack.defineData('lag_value');
lag_stack.defineDone();
end;
do _n_ = 1 by 1 until (last.group);
set have;
by group;
max_duration = max(max_duration, duration);
end;
* max_duration within group is the lag_stack height;
* pre-fill missings ;
do index = 1-max_duration to 0;
lag_stack.replace(key: index, data: .);
end;
do _n_ = 1 to _n_;
set have;
lag_stack.replace(key: _n_, data: value);
lag_stack.find(key: _n_ - max_duration);
output;
end;
drop index;
run;
Another technique would involve a fixed length ring-array instead of a hash-stack, but you would need to compute the maximum lag over all groups prior to coding the DATA step using the array.
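The per-group lag from the question can also be sketched in Python as a check on the expected result: measure each group first, then look back by that group's maximum. The function name dynamic_lag is hypothetical, the rows are assumed to be sorted by V1 already, and None plays the role of a SAS missing value.

```python
from itertools import groupby

def dynamic_lag(rows):
    """rows: (v1, value, v2) tuples sorted by v1.

    Returns the rows extended with (max_v2, lag_v2), where the lag depth
    for each group is the maximum of v2 within that group.
    """
    out = []
    for _, grp in groupby(rows, key=lambda r: r[0]):
        grp = list(grp)
        depth = int(max(r[2] for r in grp))  # first pass: measure the group
        for i, row in enumerate(grp):        # second pass: look back by depth
            lagged = grp[i - depth][2] if i - depth >= 0 else None
            out.append(row + (depth, lagged))
    return out

example = [("a", 1, 2), ("a", 1, 1), ("a", 1, 1), ("b", 1, 1), ("b", 1, 1)]
result = dynamic_lag(example)
for row in result:
    print(row)
```

The lookback never crosses a group boundary because each group is sliced on its own, which matches the expected table in the question.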

Do it in proc sql

How can I do the following in Proc SQL?
The code consists of two Proc SORT steps and one merge, shown below.
proc sort data=new out=new1 nodupkey;
by id;
where roll=100;
run;
proc sort data=new2 out=new4 nodupkey;
by id;
where roll=100;
run;
data score;
merge new4 (in=a) new1;
by id;
if a;
run;
The merge you show is equivalent to SQL left-join. You want all the rows from "new2" and ignore all the rows from "new" that don't have a common id. The uniqueness of the id (per the pre-sorts) further supports a left-join equivalence.
Proc SQL;
select new.*, new2.*
from (select * from new2 where roll=100) as new2
left join (select * from new where roll=100) as new
on new.id = new2.id
order by new2.id;
quit;
For the atypical scenario where the merge has many-to-many ids, the left join is not equivalent.
I did leave out the NODUPKEY equivalent. Presuming option EQUALS is in effect, selecting each group's first row would be equivalent. The undocumented MONOTONIC() function can be used to apply a default row order to a sub-query, which can then be used in a GROUP BY ... HAVING expression.
data LEFT;
input id x1 x2 x3;
datalines;
1 1 1 1
1 2 2 2
1 3 3 3
2 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
;
run;
data RIGHT;
input id y1 y2 y3 x1;
datalines;
1 1 1 1 11
2 1 1 1 22
3 1 2 3 4
3 2 3 4 5
3 3 4 5 6
4 1 1 1 44
6 6 6 6 6
;
run;
proc sql;
select
LEFT.id
, coalesce(RIGHT.x1,LEFT.x1) as x1
, LEFT.x2
, LEFT.x3
, RIGHT.y1
, RIGHT.y2
, RIGHT.y3
from
(
select * from (select monotonic() as _seq_, * from LEFT) group by id having _seq_ = min(_seq_)
)
as LEFT
left join
(
select * from (select monotonic() as _seq_, * from RIGHT) group by id having _seq_ = min(_seq_)
)
as RIGHT
on
LEFT.id = RIGHT.id
;
quit;
I feel the need to reiterate that an SQL left join is not always the same as a merge, and SQL does not have the 'overlaying' of common variables that is implicit in the DATA step. When LEFT and RIGHT collide on non-key variables, you need to select a coalescence of the common variables into a new like-named variable in the output.
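To make the "first row per id, then left join with coalesce" logic concrete, here is a small Python sketch on the LEFT and RIGHT sample data. It is an assumption-laden illustration, not SAS behaviour: the helper names first_per_id and left_join_first are hypothetical, None stands for a missing value, and every column shared by both tables is resolved as coalesce(RIGHT, LEFT).

```python
def first_per_id(rows):
    """Keep the first row seen for each id (the NODUPKEY-like step)."""
    firsts = {}
    for row in rows:
        firsts.setdefault(row["id"], row)
    return firsts

def left_join_first(left_rows, right_rows):
    """Left join the first row per id, preferring right-hand values when present."""
    left, right = first_per_id(left_rows), first_per_id(right_rows)
    right_cols = [c for c in right_rows[0] if c != "id"] if right_rows else []
    result = []
    for key, lrow in left.items():
        rrow = right.get(key, {})
        merged = dict(lrow)
        for col in right_cols:
            value = rrow.get(col)
            # coalesce(RIGHT.col, LEFT.col): fall back to the left value, if any
            merged[col] = value if value is not None else lrow.get(col)
        result.append(merged)
    return result

left_rows = [dict(zip(("id", "x1", "x2", "x3"), t)) for t in [
    (1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3), (2, 1, 1, 1),
    (2, 2, 2, 2), (3, 3, 3, 3), (4, 4, 4, 4), (5, 5, 5, 5)]]
right_rows = [dict(zip(("id", "y1", "y2", "y3", "x1"), t)) for t in [
    (1, 1, 1, 1, 11), (2, 1, 1, 1, 22), (3, 1, 2, 3, 4), (3, 2, 3, 4, 5),
    (3, 3, 4, 5, 6), (4, 1, 1, 1, 44), (6, 6, 6, 6, 6)]]

joined = left_join_first(left_rows, right_rows)
for row in joined:
    print(row)
```

Id 5 has no partner on the right, so its y columns stay missing and x1 keeps the left value; id 6 exists only on the right and is dropped, as a left join would do.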

SAS for following scenario (most frequent observation)

Assume I have a data-set D1 as follows:
ID ATR1 ATR2 ATR3
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
I want to create a data-set D2 from this as follows
ID ATR1 ATR2 ATR3
1 A R W
2 C T X
3 D U I
In other words, Data-set D2 consists of unique IDs from D1. For each ID in D2, the values of ATR1-ATR3 are selected as the most frequent (of the respective variable) among the records in D1 with the same ID. For example ID = 1 in D2 has ATR1 = A (most frequent).
I have one solution which is very clumsy. I simply sort copies of the data set 'D1' three times (by ID and ATR1, for example) and remove duplicates. I later merge the three data sets to get what I want. However, I think there might be a more elegant way to do this. I have about 20 such variables in the original data set.
Thanks
/*
read and restructure so we end up with:
id attr_id value
1 1 A
1 2 R
1 3 W
etc.
*/
data a(keep=id attr_id value);
length value $1;
array attrs_{*} $ 1 attr_1 - attr_3;
infile cards;
input id attr_1 - attr_3;
do attr_id=1 to dim(attrs_);
value = attrs_{attr_id};
output;
end;
cards;
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
;
run;
/* calculate frequencies of values per id and attr_id */
proc freq data=a noprint;
tables id*attr_id*value / out=freqs(keep=id attr_id value count);
run;
/* sort so the most frequent value per id and attr_id ends up at the bottom of the group.
if there are ties then it's a matter of luck which value we get */
proc sort data = freqs;
by id attr_id count;
run;
/* read and recreate the original structure. */
data b(keep=id attr_1 - attr_3);
retain attr_1 - attr_3;
array attrs_{*} $ 1 attr_1 - attr_3;
set freqs;
by id attr_id;
if first.id then do;
do i=1 to dim(attrs_);
attrs_{i} = ' ';
end;
end;
if last.attr_id then do;
attrs_{attr_id} = value;
end;
if last.id then do;
output;
end;
run;
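The restructure, count and pick-the-top-value steps above can be cross-checked with a short Python sketch using collections.Counter. The function name most_frequent_by_id is hypothetical. As in the SAS version, ties are not resolved in any meaningful way: Counter.most_common simply returns the value that was encountered first among equally frequent ones.

```python
from collections import Counter, defaultdict

def most_frequent_by_id(records, n_attrs):
    """records: (id, attr_1, ..., attr_n) tuples.

    Returns {id: (most frequent attr_1, ..., most frequent attr_n)}.
    """
    counts = defaultdict(lambda: [Counter() for _ in range(n_attrs)])
    for rec in records:
        rec_id, attrs = rec[0], rec[1:]
        for pos, value in enumerate(attrs):
            counts[rec_id][pos][value] += 1  # frequency per id and attribute
    return {
        rec_id: tuple(counter.most_common(1)[0][0] for counter in counters)
        for rec_id, counters in sorted(counts.items())
    }

d1 = [
    (1, "A", "R", "W"),
    (2, "B", "T", "X"),
    (1, "A", "S", "Y"),
    (2, "C", "T", "E"),
    (3, "D", "U", "I"),
    (1, "T", "R", "W"),
    (2, "C", "X", "X"),
]
d2 = most_frequent_by_id(d1, n_attrs=3)
for rec_id, attrs in d2.items():
    print(rec_id, *attrs)
```

On the sample data this reproduces the D2 table from the question.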

How to add new observation to already created dataset in SAS?

How can I add a new observation to an existing dataset in SAS? For example, I have a dataset 'dataX' with variables 'x' and 'y', and I want to add a new observation whose values are a multiple of the values in observation number n. How can I do it?
dataX :
x y
1 1
1 21
2 3
I want to create :
dataX :
x y
1 1
1 21
2 3
10 210
where observation number four is observation number two multiplied by ten.
data X;
input x y;
datalines;
1 1
1 21
2 3
;
run;
data X ;
set X end=eof;
if eof then do;
output;
x=10 ;y=210;
end;
output;
run;
Here is one way to do this:
data dataX;
input x y;
datalines;
1 1
1 21
2 3
;
run;
/* Create a new observation into temp data set */
data _addRec;
set dataX(firstobs=2); /* Get observation 2 */
x = x * 10; /* Multiply each by 10 */
y = y * 10;
output; /* Output new observation */
stop;
run;
/* Add new obs to original data set */
proc append base=dataX data=_addRec;
run;
/* Delete the temp data set (to be safe) */
proc delete data=_addRec;
run;
data a ;
do kk=1 to 5 ;
output ;
end ;
run;
data a2 ;
kk=999 ;
output ;
run;
data a; set a a2 ;run ;
proc print data=a ;run ;
Result:
The SAS System 1
OBS kk
1 1
2 2
3 3
4 4
5 5
6 999
You can use a macro to obtain your desired result:
Write a macro that reads the first dataset and, when _n_=2, multiplies x and y by 10.
Then create another dataset that holds only the multiplied values, say x'=10x and y'=10y.
Pass both datasets to another macro that SETs the original dataset and the newly created dataset together.
The logic is to create another dataset with the values 10x and 10y and then SET it with the previous dataset.
I hope this helps!