Identifying connected graphs given edges - SAS

How do I group people who are related, even indirectly? Concretely, using the first two columns of a data set like the one below, how do I programmatically derive the third column in SAS (maybe using a DATA step or PROC SQL)? Is there a non-iterative algorithm?
Background: Each person has multiple addresses, and through each address a person is connected to zero or more other people. If two people are connected, they get the same group ID; if person A is directly connected to B and B is connected to C, then A, B, and C all share a group.
data people;
input person_id address_id $ household_id;
datalines;
1 A 1
2 B 2
3 B 2
4 C 3
5 C 3
5 D 3
6 D 3
;

The general methods for finding all connected components of a graph are breadth-first search (BFS) and depth-first search (DFS). SAS is not the best tool for implementing such algorithms, since they require data structures such as queues. Still, it can be done with hash objects. Here's the code for a BFS.
data people;
input person_id address_id $ household_id;
datalines;
1 A 1
2 B 2
3 B 2
4 C 3
5 C 3
5 D 3
6 D 3
;
run;
Create the adjacency list: all pairs of people who share an address (the join also pairs each person with themselves, which guarantees every person appears as a vertex), plus an empty variable cluster that will later be populated with group IDs:
proc sql;
create table connections as
select distinct a.person_id as person_id_a, b.person_id as person_id_b, . as cluster
from people a
inner join people b
on a.address_id=b.address_id
;
quit;
Here is the BFS itself:
data _null_;
Declare a hash object and its iterator for all unique people (the vertices of the graph):
if 0 then set Connections;
dcl hash V(dataset:'Connections', ordered:'y');
V.defineKey('person_id_a');
V.defineData('person_id_a','cluster');
dcl hiter Vi('V');
V.defineDone();
Declare a hash object for all connections (the edges of the graph):
dcl hash E(dataset:'Connections', multidata:'y');
E.defineKey('person_id_a');
E.defineData('person_id_a','person_id_b');
E.defineDone();
Declare a hash object and its iterator for the queue:
dcl hash Q(ordered:'y');
Q.defineKey('qnum','person_id_a');
Q.defineData('qnum','person_id_a');
dcl hiter Qi('Q');
Q.defineDone();
The outermost loop: whenever the queue is empty, take a new person without an assigned cluster as the root of the next cluster:
rc1=Vi.first();
do while(rc1=0);
if missing(cluster) then do;
qnum=1; Q.add(); *qnum = the person's position in the queue, so new people are appended at the end;
n+1; cluster=n;
V.replace(); *assign the cluster number to this person;
In the following two nested loops, we dequeue the first person in the queue and look up everyone connected to that person in the adjacency list. Each connection found is added to the end of the queue. When done with the first person, we delete him/her and dequeue the next one (who is now first). All of them belong to the same cluster, and so on until the queue is empty. Then we take a new root person for a new cluster.
rc2=Qi.first();
do while(rc2=0);
qnum=qnum+Q.num_items-1; *advance qnum to the last position currently in the queue;
rc3=E.find(); *find the first edge leaving the dequeued person;
do while(rc3=0);
person_id_a=person_id_b; *switch to the neighbor to look him/her up among the vertices;
rc4=V.find();
if rc4=0 and missing(cluster) then do;
qnum+1; Q.add();
cluster=n;
V.replace();
end;
rc3=E.find_next();
end;
Qi.first();
Qi.delete(); *an iterator must be deleted before its current item can be removed;
Q.remove(); *remove the person just processed from the queue;
Qi=_new_ hiter('Q'); *re-create the iterator for the updated queue;
rc2=Qi.first();
end;
end;
rc1=Vi.next();
end;
Output the list of people with assigned clusters. With the example data, this yields three clusters: {1}, {2, 3}, and {4, 5, 6}, matching the household_id column of the input.
V.output(dataset:'clusters');
run;
proc sort data=clusters; by cluster; run;

This is a common problem with solutions of varying complexity. How complex a solution you need depends primarily on the complexity of your data: how often are linkages more than single linkages? I.e., in your example above, C and D are linked by person 5; can you also have an E that is linked to D by person 6? If so, this requires either a different approach or a resolution step.
I show one simple method here. It is a very simplistic solution, but it is sometimes easier to understand and implement. Record linkage is a well-covered subject with many papers available to explore; much better solutions exist that handle multiple linkage more fully than the one below (which handles two-level linkage but not further, and has some weaknesses in handling crosslinked data).
data people;
input person_id address_id $ household_id;
datalines;
1 A 1
2 B 2
3 B 2
4 C 3
5 C 3
5 D 3
6 D 3
6 E 3
7 E 3
8 B 2
;
run;
data links(keep=link:);
set people;
by person_id address_id;
retain link_start;
if first.person_id and not last.person_id then do;
link_start = address_id;
end;
if first.address_id and not first.person_id then do;
link_end = address_id;
output;
end;
run;
data for_fmt;
set links;
start=link_end;
label=link_Start;
retain fmtname '$linkf';
output;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
quit;
data people_linked;
set people;
new_addressid = put(address_id,$linkf.);
new_addressid = put(new_addressid, $linkf.);
run;
proc sort data=people_linked;
by new_addressid;
run;
data people_final;
set people_linked;
by new_addressid;
if first.new_addressID then
new_householdID+1;
run;
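As noted, the double application of the format handles two-level chains only: with this example data the format maps D to C and E to D, so E needs two passes to land on C. For deeper chains, one hedged extension (a sketch; people_linked2 is a made-up name) keeps re-applying the format until the value stops changing, with a safety cap:
data people_linked2;
set people;
new_addressid = put(address_id, $linkf.);
/* keep following the chain until the mapped value stabilizes (cap at 10 hops) */
do _i = 1 to 10 while (new_addressid ne put(new_addressid, $linkf.));
new_addressid = put(new_addressid, $linkf.);
end;
drop _i;
run;
This only lengthens the chains the format can follow; it still inherits the crosslinkage weaknesses mentioned above.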

I have been working on a problem that required a similar thing. I was able to solve it with SAS/OR, using PROC OPTNET (the CONCOMP statement). The documentation even includes an example that illustrates the concept very well.
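For reference, a minimal sketch of that approach, reusing the connections table built in the PROC SQL step above (statement names follow the SAS/OR documentation; exact syntax and the output column name may vary by version):
/* Requires SAS/OR. CONCOMP writes one row per node with its
connected-component number into the OUT_NODES dataset. */
proc optnet data_links=connections out_nodes=optnet_clusters;
data_links_var from=person_id_a to=person_id_b;
concomp;
run;
Each person then appears in optnet_clusters with a component identifier playing the role of the group ID.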
Thanks,
Murilo

Related

SAS Updating records sequentially

I have hundreds of thousands of IDs in a large dataset.
Some records have the same ID but different data points, and some of these IDs need to be merged into a single ID: people who registered for the system more than once should appear as just one person in the database.
I also have a separate file that tells me which IDs need to be merged, but it's not always a one-to-one relationship. For example, in many cases I have x->y and then y->z because they registered three times. I had a macro that essentially was the following set of if-then statements:
if ID='1111111' then do; ID='2222222'; end;
if ID='2222222' then do; ID='3333333'; end;
I believe SAS runs this one record at a time. My list of merged IDs is almost 15k long, so it takes forever to run and the list just gets longer. Is there a faster method of updating these IDs?
Thanks
EDIT: Here is an example of the situation, except the macro is over 15k lines long due to all the merges.
data one;
input ID $5. v1 $ v2 $;
cards;
11111 a b
11111 c d
22222 e f
33333 g h
44444 i j
55555 k l
66666 m n
66666 o p
;
run;
%macro ID_Change;
if ID='11111' then do; ID='77777'; end; *77777 is a brand new ID;
if ID='22222' then do; ID='88888'; end; *88888 is a new ID but is merged below;
if ID='88888' then do; ID='99999'; end; *99999 becomes the newer ID;
%mend;
data two; set one; %ID_Change; run;
A hash table will greatly speed up the process. Hash tables are one of the little-used, but highly effective, tools in SAS. They're a bit bizarre since the syntax is very different from standard SAS programming. For now, think of it as a way to merge data together in-memory (a big reason as to why it's so fast).
First, create a dataset that has the conversions that you need. We want to match up by ID, then convert it to New_ID. Consider ID as your key column, and New_ID as your data column.
dataset: translate
ID New_ID
111111 222222
222222 333333
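For a runnable example, the lookup table could be built from the question's own macro (a sketch; the 22222 -> 88888 -> 99999 chain is resolved by the DO WHILE loop in the step further below):
data translate;
input ID :$5. New_ID :$5.;
datalines;
11111 77777
22222 88888
88888 99999
;
run;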
In a hash table, you need to consider two things:
The Key column(s)
The Data column(s)
The Data column is what will be replacing observations matched by the Key column. In other words, New_ID will be populated every time there's a match for ID.
Next, you'll want to do your hash merge. This is performed in the data step.
data want;
set have;
/* Only declare the hash object on the first iteration.
Otherwise it will do this every record. */
if(_N_ = 1) then do;
if 0 then set translate; *compile-time trick: defines ID and New_ID in the PDV with the correct types;
declare hash id_h(dataset: 'translate'); *Declare a hash object called 'id_h';
id_h.defineKey('ID'); *Define key for matching;
id_h.defineData('New_ID'); *The new ID after matching;
id_h.defineDone(); *Done declaring this hash object;
call missing(New_ID); *Prevents an uninitialized-variable note in the log;
end;
/* If a customer has changed multiple times, keep iterating until
there is no longer a match between tables */
do while(id_h.Find() = 0);
_loop_count+1; *Tells us how long we've been in the loop;
/* Just in case the while loop gets to 500 iterations, then
there's likely a problem and you don't want the data step to get stuck */
if(_loop_count > 500) then do;
put 'WARNING: ' ID ' iterated 500 times. The loop will stop. Check observation ' _N_;
leave;
end;
/* If the ID of the hash table matches the ID of the dataset, then
we'll set ID to be New_ID from the hash object */
ID = New_ID;
end;
_loop_count = 0;
drop _loop_count;
run;
This should run very quickly and provide the desired output, assuming that your lookup table is coded in the way that you need it to be.
Use PROC SQL or a MERGE step against your separate file (after you have created a separate dataset from it, using infile or proc import) to append this unique id to all records. If your separate file contains only the duplicates, you will need to create a dummy unique id for the non-duplicates.
Do PROC SORT with BY unique id and timestamp of signup.
Use a DATA step with the same BY variables. Depending on whether you want to keep the first or last signup, do if first.unique_id then output; (or last.unique_id, etc.).
Or you could do it all in one PROC SQL using a left join to the separate file, a coalesce step to return a dummy unique id if it is not contained in the separate file, a group by unique id, and a having max(timestamp) (or min). You can also coalesce any other variables you might want to try to preserve across signups -- for example, if the first signup contained a phone number and successive signups were missing that data point.
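For illustration, a hedged sketch of that single query; signups, id_map, unique_id, and signup_ts are all assumed names:
proc sql;
create table want as
/* resolve each ID through the lookup, falling back to the ID itself */
select coalesce(m.unique_id, s.id) as final_id, s.*
from signups s
left join id_map m
on s.id = m.id
group by calculated final_id
having s.signup_ts = max(s.signup_ts); /* keep the latest signup per person */
quit;
Ties on the timestamp would keep multiple rows, and SAS will note the remerge of summary statistics in the log.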
Without a reproducible example it's hard to be more specific.

Merging but keeping all observations?

I have three data sets of inpatient, outpatient, and professional claims. I want to find the number of unique people who have a claim related to tobacco use (1=yes tobacco, 0=no tobacco) in ANY of these three data sets.
The data sets all look pretty much like this:
data inpatient;
input Patient_ID Tobacco;
datalines;
1 0
2 1
3 1
4 1
5 0
;
run;
I am trying to merge the inpatient, outpatient, and professional so that I am left with those patient ids that have a tobacco claim in any of the three data sets using:
data tobaccoall;
merge inpatient outpatient professional;
by patient_id;
run;
However, it is overwriting some of the 1's with 0's in the new data set. How do I better merge the data sets to find if the patient has a claim in ANY of the datasets?
When you merge data sets in SAS that share variable names, the values from the data set listed on the right in the merge statement overwrite the values from the data set to its left. In order to keep each value, you'd want to rename the variables before merging. You can do this in the merge statement by adding a rename= option after each data set.
If you want a single variable that represents whether a tobacco claim exists in any of the three data sets, you can create a new variable using the MAX function to combine the three values.
data tobaccoall;
merge inpatient (rename=(tobacco=tobacco_in))
outpatient (rename=(tobacco=tobacco_out))
professional (rename=(tobacco=tobacco_pro));
by patient_id;
tobacco_any = max(tobacco_in,tobacco_out,tobacco_pro,0);
run;
If your data were coded 1=has, .=doesn't have (missing), then you could use the UPDATE statement, which works mostly like MERGE except that it doesn't overwrite nonmissing data with missing values.
For example:
data inpatient;
input Patient_ID Tobacco;
datalines;
1 .
2 1
3 1
4 1
5 .
;
run;
data outpatient;
input Patient_ID Tobacco;
datalines;
1 1
2 1
3 .
4 .
5 .
;
run;
data want;
update inpatient outpatient;
by patient_id;
run;

Sorting an almost sorted dataset in SAS

I have a large dataset in SAS which I know is almost sorted; I know the first and second levels are sorted, but the third level is not. Furthermore, the first and second levels contain a large number of distinct values and so it is even less desirable to sort the first two columns again when I know it is already in the correct order. An example of the data is shown below:
ID Label Frequency
1 Jon 20
1 John 5
2 Mathieu 2
2 Mathhew 7
2 Matt 5
3 Nat 1
3 Natalie 4
Using the "presorted" option on a proc sort seems to only check if the data is sorted on every key, otherwise it does a full sort of the data. Is there any way to tell SAS that the first two columns are already sorted?
If you've previously sorted the dataset by the first 2 variables, then regardless of the sortedby information on the dataset, SAS will take less CPU time to sort it *. This is a natural property of most decent sorting algorithms - it's much less work to sort something that's already nearly sorted.
* As long as you don't use the force option in the proc sort statement, which forces it to do redundant sorting.
Here's a little test I ran:
option fullstimer;
/*Make sure we have plenty of rows with the same 1 + 2 values, so that sorting by 1 + 2 doesn't imply that the dataset is already sorted by 1 + 2 + 3*/
data test;
do _n_ = 1 to 10000000;
var1 = round(rand('uniform'),0.0001);
var2 = round(rand('uniform'),0.0001);
var3 = round(rand('uniform'),0.0001);
output;
end;
run;
/*Sort by all 3 vars at once*/
proc sort data = test out = sort_all;
by var1 var2 var3;
run;
/*Create a baseline dataset already sorted by 2/3 vars*/
/*N.B. proc sort adds sortedby information to the output dataset*/
proc sort data = test out = baseline;
by var1 var2;
run;
/*Sort baseline by all 3 vars*/
proc sort data = baseline out = sort_3a;
by var1 var2 var3;
run;
/*Remove sort information from baseline dataset (leaving the order of observations unchanged)*/
proc datasets lib = work nolist nodetails;
modify baseline (sortedby = _NULL_);
run;
quit;
/*Sort baseline dataset again*/
proc sort data = baseline out = sort_3b;
by var1 var2 var3;
run;
The relevant results I got were as follows:
SAS took 8 seconds to sort the original completely unsorted dataset by all 3 variables.
SAS took 4 seconds to sort by 3/3 starting from the baseline dataset already sorted by 2/3 variables.
SAS took 4 seconds to sort by 3/3 starting from the same baseline dataset after removing the sort information from it.
The relevant metric from the log output is the amount of user CPU time.
Of course, if the almost-sorted dataset is very large and contains lots of other variables, you may wish to avoid the sort due to the write overhead when replacing it. Another approach you could take would be to create a composite index; this would allow you to do things involving BY-group processing, for example (see the sketch after the code below).
/*Alternative option - index the 2/3 sorted dataset on all 3 vars rather than sorting it*/
proc datasets lib = work nolist nodetails;
/*Replace the sort information*/
modify baseline(sortedby = var1 var2);
run;
/*Create composite index*/
modify baseline;
index create index1 = (var1 var2 var3);
run;
quit;
Creating an index requires a full read of the dataset, as does a sort, but only a fraction of the work involved in writing it out again, so it might be faster than a 2/3-to-3/3 sort in some situations.
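For example, once the composite index exists, a BY statement can be satisfied through it, so BY-group processing works without a physical sort (a sketch; first_per_group is a made-up name):
data first_per_group;
set baseline;
by var1 var2 var3; /* SAS uses index1 here instead of requiring sorted data */
if first.var3; /* keep the first row of each (var1, var2, var3) group */
run;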

Remove all instances of duplicates in SAS

I am merging two SAS datasets by ID number and would like to remove all instances of duplicate IDs, i.e. if an ID number occurs twice in the merged dataset then both observations with that ID will be deleted.
Web searches have suggested some SQL methods and NODUPKEY, but these don't work here because they are designed for typical duplicate cleansing, where one instance is kept and the extra copies are deleted.
Assuming you are using a DATA step with a BY id; statement, then adding:
if NOT (first.id and last.id) then delete;
should do it. If that doesn't work, please show your code.
I'm actually a fan of writing dropped records to a separate dataset so you can track how many records were dropped at different points. So I would code this something like:
data want
drop_dups
;
merge a b ;
by id ;
if first.id and last.id then output want ;
else output drop_dups ;
run ;
Here is an SQL way to do it. You can use a left, right, or inner join, whichever best suits your needs. Note that this works on a single dataset just as well (see the sketch after the example below).
proc sql;
create table singles as
select * from dataset1 a inner join dataset2 b
on a.ID = b.ID
group by a.ID
having count(*) = 1;
quit;
For example, from
ID x
5 2
5 4
1 6
2 7
3 6
You will select
ID x
1 6
2 7
3 6
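And the single-dataset version mentioned above, as a sketch (SAS will remerge the summary statistic, keeping only rows whose ID occurs exactly once):
proc sql;
create table singles as
select *
from dataset1
group by ID
having count(*) = 1; /* keep only the IDs that occur exactly once */
quit;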

In SAS, is there a faster way to create an empty variable if it doesn't exist?

Currently I'm using a method similar to that used in a previous question,
Test if a variable exists
but with a small modification to make it able to handle larger numbers of variables more easily. The following code ensures that n6 has the same variables as the data set referenced by dsid2.
data n6;
set n5;
dsid=open('n5');
dsid2=open(/*empty template dataset*/);
varsn=attrn(dsid2,'nvars');
do i = 1 to varsn;
if varnum(dsid,varname(dsid2,i))=0 then do;
/* pseudocode: create the template's variable, empty, and attach the template's format */
varname(dsid2,i)="";
format varname(dsid2,i) varfmt(dsid2,i);
end;
end;
run;
If I understand correctly, SAS will run through the entire do loop for each observation. I'm beginning to experience slow run times as I begin to use larger data sets, and I was wondering if anyone has a better technique?
If possible, the simplest approach is to apply your regular logic to your new dataset. Worry about matching the variables later. When you are done with processing you can create an empty version of the template dataset like this:
data empty;
set template(obs=0);
run;
and then merge empty and your new dataset:
data template;
input var1 var2 var3;
datalines;
7 2 2
5 5 3
7 2 7
;
data empty;
set template(obs=0);
run;
data todo;
input var1 var2;
datalines;
1 2
;
data merged;
merge todo empty;
run;
In this example the merged dataset will have var3 with a missing value: the zero-observation empty dataset contributes its variables (along with attributes such as formats and lengths) but no rows.