I have two variables, ID1 and ID2. They are both the same kind of identifier. When two IDs appear in the same row of data, it means they are in the same group. I want to make a group identifier for each ID. For example, I have
ID1 ID2
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
Then I would want
ID Group
1 1
2 1
3 2
4 1
5 1
6 1
7 2
Because 1, 2, 4, 5, and 6 are paired by some combination in the original data, they share a group. 3 and 7 are only paired with each other, so they form a new group. I want to do this for ~20,000 rows. Every ID that is in ID1 is also in ID2 (more specifically, if ID1=1 and ID2=2 for an observation, then there is another observation with ID1=2 and ID2=1).
I've tried merging them back and forth, but that doesn't work. I also tried call symput, making a macro variable for each ID's group and updating it as I moved through the rows, but I couldn't get that to work either.
I have used Haikuo Bian's answer as a starting point to develop a slightly more complex algorithm that seems to work for all the test cases I have tried so far. It could probably be optimised further, but it copes with 20000 rows in under a second on my PC while using only a few MB of memory. The input dataset does not need to be sorted in any particular order, but as written it assumes that every row is present at least once with id1 < id2.
Test cases:
/* Original test case */
data have;
input id1 id2;
cards;
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
;
run;
/* Revised test case - all in one group with connecting row right at the end */
data have;
input ID1 ID2;
/*Make sure each row has id1 < id2*/
if id1 > id2 then do;
t_id2 = id2;
id2 = id1;
id1 = t_id2;
end;
drop t_id2;
cards;
2 5
4 8
2 4
2 6
3 7
4 1
9 1
3 2
6 2
7 3
;
run;
/*Full scale test case*/
data have;
do _N_ = 1 to 20000;
call streaminit(1);
id1 = int(rand('uniform')*100000);
id2 = int(rand('uniform')*100000);
if id1 < id2 then output;
t_id2 = id2;
id2 = id1;
id1 = t_id2;
if id1 < id2 then output;
end;
drop t_id2;
run;
Code:
option fullstimer;
data _null_;
length id group 8;
declare hash h();
rc = h.definekey('id');
rc = h.definedata('id');
rc = h.definedata('group');
rc = h.definedone();
array ids(2) id1 id2;
array groups(2) group1 group2;
/*Initial group guesses (greedy algorithm)*/
do until (eof);
set have(where = (id1 < id2)) end = eof;
match = 0;
call missing(min_group);
do i = 1 to 2;
rc = h.find(key:ids[i]);
match + (rc=0);
if rc = 0 then min_group = min(group,min_group);
end;
/*If neither id was in a previously matched group, create a new one*/
if not(match) then do;
max_group + 1;
group = max_group;
end;
/*Otherwise, assign both to the matched group with the lowest number*/
else group = min_group;
do i = 1 to 2;
id = ids[i];
rc = h.replace();
end;
end;
/*We now need to work through the whole dataset multiple times
to deal with ids that were wrongly assigned to a separate group
at the end of the initial pass, so load the table into a
hash object + iterator*/
declare hash h2(dataset:'have(where = (id1 < id2))');
rc = h2.definekey('id1','id2');
rc = h2.definedata('id1','id2');
rc = h2.definedone();
declare hiter hi2('h2');
change_count = 1;
do while(change_count > 0);
change_count = 0;
rc = hi2.first();
do while(rc = 0);
/*Get the current group of each id from
the hash we made earlier*/
do i = 1 to 2;
rc = h.find(key:ids[i]);
groups[i] = group;
end;
/*If we find a row where the two ids have different groups,
move the id in the higher group to the lower group*/
if groups[1] < groups[2] then do;
id = ids[2];
group = groups[1];
rc = h.replace();
change_count + 1;
end;
else if groups[2] < groups[1] then do;
id = ids[1];
group = groups[2];
rc = h.replace();
change_count + 1;
end;
rc = hi2.next();
end;
pass + 1;
put pass= change_count=; /*For information only :)*/
end;
rc = h.output(dataset:'want');
run;
/*Renumber the groups sequentially*/
proc sort data = want;
by group id;
run;
data want;
set want;
by group;
if first.group then new_group + 1;
drop group;
rename new_group = group;
run;
/*Summarise by # of ids per group*/
proc sql;
select a.group, count(id) as FREQ
from want a
group by a.group
order by freq desc;
quit;
Interestingly, the suggested optimisation of not checking the group of id2 during the initial pass if id1 has already matched actually slows things down a little in this extended algorithm, because it means more work has to be done in the subsequent passes when id2 is in a lower-numbered group (a sketch of that variant follows the timing output below). E.g. output from a trial run I did earlier:
With 'optimisation':
pass=0 change_count=4696
pass=1 change_count=204
pass=2 change_count=23
pass=3 change_count=9
pass=4 change_count=2
pass=5 change_count=1
pass=6 change_count=0
NOTE: DATA statement used (Total process time):
real time 0.19 seconds
user cpu time 0.17 seconds
system cpu time 0.04 seconds
memory 9088.76k
OS Memory 35192.00k
Without:
pass=0 change_count=4637
pass=1 change_count=182
pass=2 change_count=23
pass=3 change_count=9
pass=4 change_count=2
pass=5 change_count=1
pass=6 change_count=0
NOTE: DATA statement used (Total process time):
real time 0.18 seconds
user cpu time 0.16 seconds
system cpu time 0.04 seconds
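For reference, the 'optimisation' in question just means leaving the lookup loop of the initial pass as soon as one id matches. A sketch of that variant of the loop, assuming the same h hash, ids array and min_group variable as in the main code (this is a fragment, not a complete step):
do i = 1 to 2;
rc = h.find(key:ids[i]);
match + (rc=0);
if rc = 0 then do;
min_group = min(group,min_group);
/* Stop after the first match. This saves a lookup here, but can leave
id2 in a higher-numbered group, creating work for the later passes */
leave;
end;
end;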
Please try the code below.
data have;
input ID1 ID2;
datalines;
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
;
run;
* Find IDs that repeat in ID1;
proc sort data=have;by id1;run;
data want_1;
set have;
by id1;
attrib flagrepeat length=8;
if not (first.id1 and last.id1) then flagrepeat=1;
else flagrepeat=0;
run;
* Find IDs that repeat in ID2;
proc sort data=want_1;by id2;run;
data want_2;
set want_1;
by id2;
if not (first.id2 and last.id2) then flagrepeat=1;
run;
proc sort data=want_2 nodupkey;by id1 ;run;
data want(drop= ID2 flagrepeat rename=(ID1=ID));
set want_2;
attrib Group length=8;
if(flagrepeat eq 1) then Group=1;
else Group=2;
run;
Hope this answer helps.
Like one commenter mentioned, a hash does seem to be a viable approach. In the following code, 'id' and 'group' are maintained in the hash table, and a new 'group' is added only when no 'id' match is found for the entire row. Please note that 'do over' is an undocumented feature; it can easily be replaced with a little more coding (a documented rewrite is sketched after the code).
data have;
input ID1 ID2;
cards;
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
;
data _null_;
if _n_=1 then
do;
declare hash h(ordered: 'a');
h.definekey('id');
h.definedata('id','group');
h.definedone();
call missing(id,group);
end;
set have end=last;
array ids id1 id2;
do over ids;
rc=sum(rc,h.find(key:ids)=0);
/*you can choose to 'leave' the loop here when first h.find(key:ids)=0 is met, for the sake of better efficiency*/
end;
if not rc > 0 then
group+1;
do over ids;
id=ids;
h.replace();
end;
if last then rc=h.output(dataset:'want');
run;
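If you would rather avoid the undocumented do over, here is a sketch of the same step using an explicitly subscripted array; the logic is unchanged, only the loop syntax differs:
data _null_;
if _n_=1 then
do;
declare hash h(ordered: 'a');
h.definekey('id');
h.definedata('id','group');
h.definedone();
call missing(id,group);
end;
set have end=last;
array ids(2) id1 id2;
/* Count how many ids in this row already exist in the hash */
do i = 1 to dim(ids);
rc = sum(rc, h.find(key:ids(i)) = 0);
end;
if not rc > 0 then
group+1;
/* Write both ids back with the (possibly new) group */
do i = 1 to dim(ids);
id = ids(i);
h.replace();
end;
if last then rc=h.output(dataset:'want');
run;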
I have a dataset that looks like:
Hour Flag
1 1
2 1
3 .
4 1
5 1
6 .
7 1
8 1
9 1
10 .
11 1
12 1
13 1
14 1
I want to have an output dataset like:
Total_Hours Count
2 2
3 1
4 1
As you can see, I want to count the number of hours included in each period with consecutive "1s". A missing value ends the consecutive sequence.
How should I go about doing this? Thanks!
You'll need to do this in two steps. The first step is making sure the data is sorted properly and determining the number of hours in each consecutive period:
PROC SORT DATA = <your dataset>;
BY hour;
RUN;
DATA work.consecutive_hours;
SET <your dataset> END = lastrec;
RETAIN
total_hours 0
;
IF flag = 1 THEN total_hours = total_hours + 1;
ELSE
DO;
IF total_hours > 0 THEN output;
total_hours = 0;
END;
/* Need to output last record */
IF lastrec AND total_hours > 0 THEN output;
KEEP
total_hours
;
RUN;
Now a simple SQL statement:
PROC SQL;
CREATE TABLE work.hour_summary AS
SELECT
total_hours
,COUNT(*) AS count
FROM
work.consecutive_hours
GROUP BY
total_hours
;
QUIT;
You will have to do two things:
compute the run lengths
compute the frequency of the run lengths
For the case of using the implicit loop, each run length occurrence can be computed and maintained in a retained tracking variable, testing for a missing value or end of data to trigger output, and for a non-missing value to increment or reset the run length. Proc FREQ can then count the frequency of each run length (a sketch of this variant follows the hash example below).
An alternative is to use an explicit loop and a hash for frequency counts.
Example:
data have; input
Hour Flag; datalines;
1 1
2 1
3 .
4 1
5 1
6 .
7 1
8 1
9 1
10 .
11 1
12 1
13 1
14 1
;
data _null_;
declare hash counts(ordered:'a');
counts.defineKey('length');
counts.defineData('length', 'count');
counts.defineDone();
do until (end);
set have end=end;
if not missing(flag) then
length + 1;
if missing(flag) or end then do;
if length > 0 then do;
if counts.find() eq 0
then count+1;
else count=1;
counts.replace();
length = 0;
end;
end;
end;
counts.output(dataset:'want');
run;
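For completeness, the implicit-loop variant described above (a retained run-length variable, then PROC FREQ for the counts) might look something like this sketch, assuming the same have dataset:
/* Run lengths kept in a retained variable, output at each break */
data runs(keep=Total_Hours);
set have end=end;
if not missing(flag) then Total_Hours + 1;
if missing(flag) or end then do;
if Total_Hours > 0 then output;
Total_Hours = 0;
end;
run;
/* One row per completed run, so FREQ gives the count per run length */
proc freq data=runs;
tables Total_Hours / nocum nopercent;
run;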
An alternative
data _null_;
if _N_ = 1 then do;
dcl hash h(ordered : "a");
h.definekey("Total_Hours");
h.definedata("Total_Hours", "Count");
h.definedone();
end;
do Total_Hours = 1 by 1 until (last.Flag);
set have end=lr;
by Flag notsorted;
end;
Count = 1;
if Flag then do;
if h.find() = 0 then Count+1;
h.replace();
end;
if lr then h.output(dataset : "want");
run;
Several weeks ago, @Richard taught me how to use the DOW-loop and a direct-addressing array. Today, I pass it on to you.
data want(keep=Total_Hours Count);
array bin[99] _temporary_;
do until(eof1);
set have end=eof1;
if Flag then count + 1;
if ^Flag or eof1 then do;
/* count is missing if two missing flags occur in a row, so guard the array reference */
if count then bin[count] + 1;
count = .;
end;
end;
do i = 1 to dim(bin);
Total_Hours = i;
Count = bin[i];
if Count then output;
end;
run;
And thanks to Richard again; he also suggested this article to me.
I am matching files based on ID numbers. I need to format a data set with the IDs to be matched so that the same ID number is not repeated in column a (because column b's ID is the surviving ID after the match is completed). My list of IDs has over 1 million observations, and the same ID may be repeated multiple times in either or both columns.
Here is an example of what I've got/need:
Sample Data
ID1 ID2
1 2
3 4
2 5
6 1
1 7
5 8
The surviving IDs would be:
2
4
5
error - 1 no longer exists
error - 1 no longer exists
8
WHAT I NEED
ID1 ID2
1 2
3 4
2 5
6 5
5 7
7 8
I am, probably very obviously, a SAS novice, but here is what I have tried, re-running it over and over because I have some IDs that are repeated upward of 50 times or more.
Proc sort data=Have;
by ID1;
run;
This sort makes the repeated ID1 values consecutive, so that I could use LAG to replace the destroyed ID1s with the surviving ID2 from the line above.
Data Want;
set Have;
by ID1;
lagID1=LAG(ID1);
lagID2=LAG(ID2);
If NOT first.ID1 THEN DO;
If ID1=lagID1 THEN ID1=lagID2;
KEEP ID1 ID2;
IF ID1=ID2 then delete;
end;
run;
That sort of works, but I still end up with duplicates that won't resolve no matter how many times I re-run it (I would have looped it, but I don't know how), because they just switch back and forth between IDs that have other duplicates (I can get down to about 2,000 of these).
I have figured out that instead of using LAG, I need to replace all values after the current line with ID2 for each ID1 value, but I cannot figure out how to do that.
I want to read observation 1, find all later instances of the value of ID1, in both the ID1 and ID2 columns, and replace that value with the current observation's ID2 value. Then I want to repeat that process with line 2, and so on.
For the example, I would want to look for any instances after line one of the value 1 and replace it with 2, since that is the surviving ID of that pair - 1 may appear further down multiple times in either of the columns, and I need all of them to be replaced. Line two would look for later values of 3 and replace them with 4, and so on. The end result should be that an ID number only ever appears once in the ID1 column (though it may appear multiple times in the ID2 column).
ID1 ID2
1 2
3 4
2 5
6 1
1 7
5 8
After first line has been read, data set would look as follows:
ID1 ID2
1 2
3 4
2 5
6 2
2 7
5 8
Reading observation two would make no changes since 3 does not appear again; after observation 3, the set would be:
ID1 ID2
1 2
3 4
2 5
6 5
5 7
5 8
Again, there would be no changes from observation four, but observation 5 would cause the final change:
ID1 ID2
1 2
3 4
2 5
6 5
5 7
7 8
I have tried using the following statement, but I can't even tell if I am on the completely wrong track or if I just can't get the syntax figured out.
Data want;
Set have;
Do i=_n_;
ID=ID2;
Replace next var{EUID} where (EUID1=EUID1 AND EUID2=EUID1);
End;
Run;
Thanks for your help!
There is no need to work back and forth through the data file. You just need to retain the replacement information so that you can process the file in a single pass.
One way to do that is to make a temporary array using the values of the ID variables as the index. That is easy to do for your simple example with small ID values.
So, for example, if all of the ID values are integers between 1 and 1000, then this step will do the job.
data want ;
set have ;
array xx (1000) _temporary_;
do while (not missing(xx(id1))); id1=xx(id1); end;
do while (not missing(xx(id2))); id2=xx(id2); end;
output;
xx(id1)=id2;
run;
You probably need to add a test to prevent cycles (1 -> 2 -> 1).
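For example, one possible guard (just a sketch; the hop limit of 1000 is an arbitrary assumption) is to bound the number of hops when following a chain and to avoid mapping an id to itself:
data want ;
set have ;
array xx (1000) _temporary_;
/* Follow each replacement chain, but give up after 1000 hops
so a cycle such as 1 -> 2 -> 1 cannot loop forever */
do hops = 1 to 1000 while (not missing(xx(id1)));
id1 = xx(id1);
end;
do hops = 1 to 1000 while (not missing(xx(id2)));
id2 = xx(id2);
end;
output;
if id1 ne id2 then xx(id1) = id2; /* never map an id to itself */
drop hops;
run;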
For a more general solution you should replace the array with a hash object instead. So something like this:
data want ;
if _n_=1 then do;
declare hash h();
h.definekey('old');
h.definedata('new');
h.definedone();
call missing(new,old);
end;
set have ;
do while (not h.find(key:id1)); id1=new; end;
do while (not h.find(key:id2)); id2=new; end;
output;
h.add(key: id1,data: id2);
drop old new;
run;
Here's an implementation of the algorithm you've suggested, using a modify statement to load and rewrite each row one at a time. It works with your trivial example, but with messier data you might get duplicate values in ID1.
data have;
input ID1 ID2 ;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
title "Before making replacements";
proc print data = have;
run;
/*Optional - should improve performance at cost of increased memory usage*/
sasfile have load;
data have;
do i = 1 to nobs;
do j = i to nobs;
modify have point = j nobs = nobs;
/* Make copies of target and replacement value for this pass */
if j = i then do;
id1_ = id1;
id2_ = id2;
end;
else do;
flag = 0; /* Keep track of whether we made a change */
if id1 = id1_ then do;
id1 = id2_;
flag = 1;
end;
if id2 = id1_ then do;
id2 = id2_;
flag = 1;
end;
if flag then replace; /* Only rewrite the row if we made a change */
end;
end;
end;
stop;
run;
sasfile have close;
title "After making replacements";
proc print data = have;
run;
Please bear in mind that as this modifies the dataset in place, interrupting the data step while it is running could result in data loss. Make sure you have a backup first in case you need to roll your changes back.
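Taking that backup can be as simple as copying the dataset before the modify step:
/* Simple backup copy - restore from this if the modify step is interrupted */
data have_backup;
set have;
run;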
Seems like this should do the trick and is fairly straightforward. Let me know if it is what you are looking for:
data have;
input id1 id2;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
%macro test();
proc sql noprint;
select count(*) into: cnt
from have;
quit;
%do i = 1 %to &cnt;
proc sql noprint;
select id1,id2 into: id1, :id2
from have
where monotonic() = &i;
quit;
data have;
set have;
if (_n_ > input("&i",8.))then do;
if (id1 = input("&id1",8.))then id1 = input("&id2",8.);
if (id2 = input("&id1",8.))then id2 = input("&id2",8.);
end;
run;
%end;
%mend test;
%test();
This might be a little faster:
data have2;
input id1 id2;
datalines;
1 2
3 4
2 5
6 1
1 7
5 8
;
run;
%macro test2();
proc sql noprint;
select count(*) into: cnt
from have2;
quit;
%do i = 1 %to &cnt;
proc sql noprint;
select id1,id2 into: id1, :id2
from have2
where monotonic() = &i;
update have2 set id1 = &id2
where monotonic() > &i
and id1 = &id1;
quit;
proc sql noprint;
update have2 set id2 = &id2
where monotonic() > &i
and id2 = &id1;
quit;
%end;
%mend test2;
%test2();
I want to delete any whole group in which none of the observations has NUM=14.
So something like this:
Original DATA
ID NUM
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
Since none of the ID=2 observations contains NUM=14, I delete group 2.
And it should look like this:
ID NUM
1 14
1 12
1 10
3 14
3 10
This is what I have so far, but it doesn't seem to work.
data originaldat;
set newdat;
by ID;
If first.ID then do;
IF NUM EQ 14 then Score = 100;
Else Score = 10;
end;
else SCORE+1;
run;
data newdat;
set newdat;
If score LT 50 then delete;
run;
An approach using proc sql would be:
proc sql;
create table newdat as
select *
from originaldat
where ID in (
select ID
from originaldat
where NUM = 14
);
quit;
The subquery selects the IDs for groups that contain an observation where NUM = 14. The where clause then limits the selected data to only these groups.
The equivalent data step approach would be:
/* Get all the groups that contain an observation where N = 14 */
data keepGroups;
set originaldat;
if NUM = 14;
keep ID;
run;
/* Sort both data sets to ensure the data step merge works as expected */
proc sort data = originaldat;
by ID;
run;
/* Make sure there are no duplicate values in the groups to be kept */
proc sort data = keepGroups nodupkey;
by ID;
run;
/*
Merge the original data with the groups to keep and only keep records
where an observation exists in the groups to keep dataset
*/
data newdat;
merge
originaldat
keepGroups (in = k);
by ID;
if k;
run;
In both data steps the subsetting if statement is used to output observations only when the condition is met. In the second case, k is a temporary variable with value 1 (true) when a value is read from keepGroups and 0 (false) otherwise.
You're sort of getting at a DoW loop here, but not quite doing it right. The problem (assuming the DATA/SET names are mistyped and not actually wrong in your program) is that the first data step doesn't append that 100 to every row, only to the 14 row. What you need is one 'line' per ID value with a keep/no keep decision.
You can do this with your first data step if you RETAIN score and only output one row per ID. In fact, your code would work if you just fixed the data/set typo, but only when 14 is the first row of the group.
data originaldat;
input ID NUM ;
datalines;
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
;;;;
run;
data has_fourteen;
set originaldat;
by ID;
retain keep;
If first.ID then keep=0;
if num=14 then keep=1;
if last.id then output;
run;
data newdata;
merge originaldat has_fourteen;
by id;
if keep=1;
run;
That works by merging the value from a 1-per-ID to the whole dataset.
A double DoW also works.
data newdata;
keep=0;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if num=14 then keep=1;
end;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if keep=1 then output;
end;
run;
This works because it iterates over the dataset twice: for each ID, it iterates once through all the records looking for a 14, and if it finds one, sets keep to 1. Then it reads all the records for that ID again, keeping them if keep=1. Then it goes on to the next set of records by ID.
data in;
input id num;
cards;
1 14
1 12
1 10
2 16
2 13
3 14
3 67
;
/* To find the list of groups containing num=14, use the SQL below */
proc sql;
select distinct id into :lst separated by ','
from in
where num = 14;
quit;
/* If you want to create a new data set with only the groups containing num=14, use the following data step */
data out;
set in;
where id in (&lst.);
run;
I have the following data where people in households are sorted by age (oldest to youngest):
data houses;
input HouseID PersonID Age;
datalines;
1 1 25
1 2 20
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;
I would like to calculate for each household the maximum age difference between consecutively aged people. So this example would give values of 5 (=25-20), 16 (=32-16) and 32 (=42-10) for households 1, 2 and 3 respectively.
I could do this using lots of merges (i.e. extract person 1, merge with an extract of person 2, and so on), but as there can be up to 20+ people in a household, I'm looking for a much more direct method.
Here's a two-pass solution. As in the other solutions, the first step is to sort by age. In the second step, keep track of max_diff per row and output the result at the last record of each HouseID. This results in only two passes through the data.
proc sort data=houses; by houseid age;run;
data want;
set houses;
by houseID;
retain max_diff 0;
/* With ages sorted ascending within houseid, dif1(age) is already the positive gap */
diff = dif1(age);
if first.HouseID then do;
diff = .; max_diff=.;
end;
if diff>max_diff then max_diff=diff;
if last.houseID then output;
keep houseID max_diff;
run;
proc sort data=houses; by houseid personid age;run;
data _t1;
set houses;
diff = dif1(age) * (-1);
if personid = 1 then diff = .;
run;
proc sql;
create table want as
select houseid, max(diff) as Max_Diff
from _t1
group by houseid;
quit;
proc sort data = houses;
by houseid descending age;
run;
data house_diffs;
set houses;
by houseid;
lag_age = lag1(age);
/* lag1(age) at the first record of a houseid comes from the previous
household, so the difference is set to missing there */
if first.houseid then age_diff = .;
else age_diff = lag_age - age;
run;
proc sql;
select houseid, max(age_diff) as max_age_diff
from house_diffs
group by houseid;
quit;
Working:
First, sort the data set by houseid and descending Age.
The second data step calculates the difference between the current age value in the PDV and the previous age value. Then, using the SQL procedure, we can get the max age difference for each houseid.
Just throwing one more into the mix. This one is a condensed version of Reeza's response.
/* No need to sort by PersonID as age is the only concern */
proc sort data = houses;
by HouseID Age;
run;
data want;
set houses;
by HouseID;
/* Keep the diff when a new row is loaded */
retain diff;
/* Only replace the diff if it is larger than previous */
diff = max(diff, abs(dif(Age)));
/* Reset diff for each new house */
if first.HouseID then diff = 0;
/* Only output the final diff for each house */
if last.HouseID;
keep HouseID diff;
run;
Here is an example using FIRST. and LAST. with one pass (after sort) through the data.
data houses;
input HouseID PersonID Age;
datalines;
1 1 25
1 2 20
2 1 32
2 2 16
2 3 14
2 4 12
3 1 44
3 2 42
3 3 10
3 4 5
;
run;
Proc sort data=HOUSES;
by houseid descending age ;
run;
Data WANT(keep=houseid max_diff);
format houseid max_diff;
retain max_diff age1 age2;
Set HOUSES;
by houseid descending age ;
if first.houseid and last.houseid then do;
max_diff=0;
output;
end;
else if first.houseid then do;
call missing(max_diff,age1,age2);
age1=age;
end;
else if not(first.houseid or last.houseid) then do;
age2=age;
temp=age1-age2;
if temp>max_diff then max_diff=temp;
age1=age;
end;
else if last.houseid then do;
age2=age;
temp=age1-age2;
if temp>max_diff then max_diff=temp;
output;
end;
Run;
I checked out this previous post (LINK) for a potential solution, but it's still not working. I want to sum across rows using the ID as the common identifier. The num variable is constant within each id. id and comp are the two variables I want to use to create a pct variable, which = sum of [comp = 1] / num.
Have:
id Comp Num
1 1 2
2 0 3
3 1 1
2 1 3
1 1 2
2 1 3
Want:
id tot pct
1 2 100
2 3 0.666666667
3 1 100
Currently have:
proc sort data=have;
by id;
run;
data want;
retain tot 0;
set have;
by id;
if first.id then do;
tot = 0;
end;
if comp in (1) then tot + 1;
else tot + 0;
if last.id;
pct = tot / num;
keep id tot pct;
output;
run;
I use SQL for things like this. You can do it in a Data Step, but the SQL is more compact.
data have;
input id Comp Num;
datalines;
1 1 2
2 0 3
3 1 1
2 1 3
1 1 2
2 1 3
;
run;
proc sql noprint;
create table want as
select id,
sum(comp) as tot,
sum(comp)/count(id) as pct
from have
group by id;
quit;
Hi, there is a much more elegant solution to your problem :)
proc sort data = have;
by id;
run;
data want;
do _n_ = 1 by 1 until (last.id);
set have ;
by id ;
tot = sum (tot, comp) ;
end ;
pct = tot / num ;
run;
I hope it is clear. I use SQL too, because I am new and the DOW loop is rather complicated, but in your case it's pretty straightforward.