Finding all possible paths in a dataset using SAS

I want to compare two columns in the dataset shown below
Pid cid
1 2
2 3
2 5
3 6
4 8
8 9
9 4
Then I want to produce a result like the one below
1 2 3 6
1 2 5
2 3 6
2 5
3 6
4 8 9 4
8 9 4
9 4
First we print the first two values, 1 and 2. Then we search for 2 in the first column; if it is present, we print its corresponding value from column 2, which is 3. We then search for 3 in column 1; if present, we print the corresponding value from column 2, which is 6.
How can this be done using SAS?

The links comprise a directed graph, and traversing the paths requires recursion.
In the DATA step, the multiple children of a parent can be stored in a hash-of-hashes structure, but recursion in the DATA step is quite awkward (you would have to manually maintain your own stack and local variables in yet another hash).
In Proc DS2, recursion is far more traditional and obvious, and Package Hash is available. However, the Package Hash differs from the DATA step hash: its data values may only be scalars, so a hash of hashes is out :(.
The lack of a hash of hashes can be remedied by setting up the hash with multidata. The data items (children) of a key (parent) are retrieved with the pattern: find, then loop on has_next with find_next.
Another issue with hashes in DS2 is that they must be global to the data program, as must any host variables used for keys and data. This makes variable management during recursion tricky: the code at scope depth N cannot rely on global variables that may be changed at scope depth N+1.
Fortunately, an anonymous hash can be created in any scope and its reference is maintained locally... but the key and data variables must still be global, so more careful attention is needed.
The anonymous hash is used to store the multidata retrieved for a key; this is necessary because recursion would otherwise corrupt the has_next / find_next traversal.
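As an illustration only, here is a stripped-down DS2 sketch of just that multidata retrieval pattern; it assumes the have dataset created by the sample code below and prints every child of parent 2:
proc ds2 libs=work;
data _null_;
   declare double rownum pid cid;
   declare package hash links();
   method init();
      links.keys([pid]);
      links.data([cid rownum]);
      links.multidata('yes');
      links.defineDone();
   end;
   method run();
      set have;
      links.add();
   end;
   method term();
      pid = 2;
      if links.find() = 0 then do;          /* first child for this parent */
         put cid=;
         do while (links.has_next() = 0);   /* remaining children */
            links.find_next();
            put cid=;
         end;
      end;
   end;
enddata;
run;
quit;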
Sample code. Requires a rownum variable to prevent cycling that would occur when a child is allowed to act as a parent in a prior row.
data have; rownum + 1;input
Pid cid;datalines;
1 2
2 3
2 5
3 6
4 8
5 12
6 2
8 9
9 4
12 1
12 2
12 14
13 15
14 20
14 21
14 21
15 1
run;
proc delete data=paths;
proc delete data=rows;
%let trace=;
proc ds2 libs=work;
data _null_ ;
declare double rownum pid cid id step pathid;
declare int hIndex;
declare package hash rows();
declare package hash links();
declare package hash path();
declare package hash paths();
method leaf(int _rootRow, int _step);
declare double _idLast _idLeaf;
&trace. put ' ';
&trace. put 'LEAF';
&trace. put ' ';
* no children, at a leaf -- output path;
rownum = _rootRow;
if _step < 2 then return;
* check if same as last one;
do step = 0 to _step;
paths.find(); _idLast = id;
path.find(); _idLeaf = id;
if _idLast ne _idLeaf then leave;
end;
if _idLast = _idLeaf then return;
pathid + 1;
do step = 0 to _step;
path.find();
paths.add();
end;
end;
method saveStep(int _step, int _id);
&trace. put 'PATH UPDATE' _step ',' _id ' <-------';
step = _step;
id = _id;
path.replace();
end;
method descend(int _rootRow, int _fromRow, int _id, int _step);
declare package hash h;
declare double _hIndex;
declare varchar(20) p;
if _step > 10 then return;
p = repeat (' ', _step-1);
&trace. put p 'DESCEND:' _rootRow= _fromRow= _id= _step=;
* given _id as parent, track in path and descend by child(ren);
* find links to children;
pid = _id;
&trace. put p 'PARENT KEY:' pid=;
if links.find() ne 0 then do;
&trace. put p 'NO KEY';
saveStep(_step, _id);
leaf(_rootRow, _step);
return;
end;
* convert multidata to hash, emulating hash of hash;
* if not, has_next / find_next multidata traversal would be
* corrupted by a find in the recursive use of descent;
* new hash reference in local variable;
h = _new_ hash ([hindex], [cid rownum], 0,'','ascending');
hIndex = 1;
&trace. put p 'CHILD' hIndex= cid= rownum=;
if rownum > _fromRow then h.add();
do while (links.has_next() = 0);
hIndex + 1;
links.find_next();
&trace. put p 'CHILD' hIndex= cid= rownum=;
if rownum > _fromRow then h.add();
end;
if h.num_items = 0 then do;
* no eligible (forward-rowed) children links;
&trace. put p 'NO FORWARD CHILDREN';
leaf(_rootRow, _step-1);
return;
end;
* update data for path step;
saveStep (_step, _id);
* traverse hash that was from multidata;
* locally instantiated hash is protected from meddling outside current scope;
* hIndex is local variable;
do _hIndex = 1 to hIndex;
hIndex = _hIndex;
h.find();
&trace. put p 'TRAVERSE:' hIndex= cid= rownum= ;
descend(_rootRow, rownum, cid, _step+1);
end;
&trace. put p 'TRAVERSE DONE:' _step=;
end;
method init();
declare int index;
* data keyed by rownum;
rows.keys([rownum]);
rows.data([rownum pid cid]);
rows.ordered('A');
rows.defineDone();
* multidata keyed by pid;
links.keys([pid]);
links.data([cid rownum]);
links.multidata('yes');
links.defineDone();
* recursively discovered ids of path;
path.keys([step]);
path.data([step id]);
path.ordered('A');
path.defineDone();
* paths discovered;
paths.keys([pathid step]);
paths.data([pathid step id]);
paths.ordered('A');
paths.defineDone();
end;
method run();
set have;
rows.add();
links.add();
end;
method term();
declare package hiter rowsiter('rows');
declare int n;
do while (rowsiter.next() = 0);
step = 0;
saveStep (step, pid);
descend (rownum, rownum, cid, step+1);
end;
paths.output('paths');
rows.output('rows');
end;
enddata;
run;
quit;
proc transpose data=paths prefix=ID_ out=paths_across(drop=_name_);
by pathid;
id step;
var id;
format id_: 4.;
run;

As the comments say, at least the infinite-cycle and path-search requirements are not clear. So let's start with the simplest case: always search from top to bottom and never look back.
Just start by creating the data set:
data test;
input Pid Cid;
cards;
1 2
2 3
2 5
3 6
4 8
8 9
9 4
;
run;
With this assumption, my thought is:
1. Generate a row indicator, e.g. Ord + 1;
2. Use a left join with the connection condition a.Pid = b.Cid and a.Ord > b.Ord, where both a and b refer to test;
3. Compare the new data set with the old data set;
4. Loop steps 2 and 3 while the new data set differs from the old one (a rough sketch of one iteration is shown below).
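A rough sketch of one iteration of this join-and-compare loop, assuming the test dataset above (the Ord row indicator and the dataset names test_ord and step1 are mine; in practice the join and compare would be wrapped in a macro %do loop that stops when nothing changes):
data test_ord;
   set test;
   Ord + 1;   /* row indicator */
run;
proc sql;
   create table step1 as
   select a.*, b.Pid as prev_pid
   from test_ord as a
   left join test_ord as b
      on a.Pid = b.Cid and a.Ord > b.Ord;   /* only link back to earlier rows */
quit;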
Well, sometimes we may care about the result more than the path, so here is another answer:
data _null_;
set test nobs = nobs;
do i = 1 to nobs;
set test(rename=(Pid=PidTmp Cid=CidTmp)) point = i;
if Cid = PidTmp then Cid = CidTmp;
end;
put (Pid Cid)(=);
run;
Result:
Pid=1 Cid=6
Pid=2 Cid=6
Pid=2 Cid=5
Pid=3 Cid=6
Pid=4 Cid=4
Pid=8 Cid=4
Pid=9 Cid=4

I have tried the code below, though the result is not perfect:
data want;
obs1 = 1;
do i=1 to 6;
set ar ;
obs2 = obs1 + 1;
set
ar(
rename=(
pid = pid2
cid = cid2
)
) point=obs2
;
if cid = pid2
then k = catx(" ", pid, cid, cid2);
else k = catx(" ", pid, cid);
output;
obs1 + 1;
end;
run;
Result:
pid cid k
1 2 1 2 3
2 3 2 3
2 5 2 5
3 6 3 6
4 8 4 8 9
8 9 8 9 4
9 4 9 4

Not enough reputation, so this is another answer, hahaha.
First of all, I can see #Richard's answer is very good, though I cannot use DS2 and hashes that skillfully yet. It's a nice example for learning recursion.
So now I know your purpose is definitely the path, not the end point, so each result must be stored while recursing over the observations. Your own answer reflects this, but with a broken do-loop: obs1 = 1, obs2 = obs1 + 1 and obs1 + 1 together always yield obs2 = _N_ + 1, which results in only one effective iteration.
I supplemented and improved the original code this time:
data test;
set test nobs = nobs;
array Rst[*] Cid Cid1-Cid10;
do i = _N_ to nobs;
set test(rename=(Pid=PidTmp Cid=CidTmp)) point = i;
do j = 1 to dim(Rst)-1;   /* -1 avoids indexing past the end of Rst */
if Rst[j] = PidTmp then Rst[j+1] = CidTmp;
end;
end;
run;
I use an oversized array to store the path, and I changed do i = 1 to nobs; to do i = _N_ to nobs;, since do i = 1 to nobs; causes the loop to look back.

proc ds2;
data _null_;
declare int t1[7];
declare int t2[7];
declare varchar(100) lst;
method f2(int i, int Y);
do while (y ^= t1[i] and i < dim(t1));
i+1;
end;
if y = t1[i] then do;
lst = cat(lst,t2[i]);
f2(i, t2[i]);
end;
end;
method f1(int n, int x, int y);
dcl int i;
dcl int match;
match=0;
do i = n to dim(t1);
lst = cat(x,y);
if (y = t1[i]) then do;
f2(i,y);
put lst=;
match = 1;
end;
end;
if ^match then put lst=;
end;
method init();
dcl int i;
t1 := (1 2 2 3 4 8 9);
t2 := (2 3 5 6 8 9 4);
do i = 1 to dim(t1);
f1(i, t1[i], t2[i]);
end;
end;
enddata;
run;
quit;

How do I select the first 5 observations with regard to duplicates? [closed]

I have a large dataset containing over 80,000,000 rows sorted by "name" and "income" (with duplicates for both name and income). For the first name I would like to have the 5 lowest incomes. For the second name I would like to have the 5 lowest incomes (but incomes already drawn for the first name are disqualified from selection). And so on, until the last name (if there are any incomes left at that point).
You first want to rank income within names. So:
proc rank data=yourdata out=temp ties=low;
by name;
var income;
ranks incomerank;
run;
Then you want to filter the 5 lowest incomes by name, so:
proc sql;
create table want as
select distinct *
from temp
where incomerank < 6;
quit;
You will need to sort and track incomes:
Use an array to sort and track the lowest five incomes of a name.
Use a hash to record each income that has already been output and is thus ineligible for output under later names.
Example:
An insertion sort of eligible low incomes is used; it will be fast because only 5 items are kept.
data have;
call streaminit(1234);
do name = 1 to 1e6;
do seq = 1 to rand('integer', 20);
income = rand('integer', 20000, 1000000);
output;
end;
end;
run;
data
want (label='Lowest 5 incomes (first occurring over all names) of each name')
want_barren(keep=name label='Names whose all incomes were previously output for earlier names')
;
array X(5) _temporary_;
if _n_ = 1 then do;
if 0 then set have;
declare hash incomes();
incomes.defineKey('income');
incomes.defineDone();
end;
_maxmin5 = 1e15;
x(1) = 1e15;
x(2) = 1e15;
x(3) = 1e15;
x(4) = 1e15;
x(5) = 1e15;
do _n_ = 1 by 1 until (last.name);
set have;
by name;
if incomes.check() = 0 then continue;
* insert sort - lowest five not observed previously;
if income > _maxmin5 then continue;
do _i_ = 1 to 5;
if income < x(_i_) then do;
do _j_ = 5 to _i_+1 by -1;
x(_j_) = x(_j_-1);
end;
x(_i_) = income;
_maxmin5 = x(5);
incomes.add();
leave;
end;
end;
end;
_outflag = 0;
do _n_ = 1 to _n_;
set have;
if income in x then do;
_outflag = 1;
OUTPUT want;
end;
end;
if not _outflag then
OUTPUT want_barren;
drop _:;
run;
data have;
do n = 1 to 8e5;
do _N_ = 1 to 100;
income = ceil(rand('uniform') * 1e4);
address = cats('Address_', _N_);
output;
end;
end;
run;
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(dataset : 'have(obs=0)', ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedata(all : 'y');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 5 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
Here is my interpretation of your problem and a solution.
Suppose a simplified version of your data looks like this and you want the 2 lowest incomes for each name. For simplicity, I use a numeric variable n as the name, but a character variable will work as well.
data have;
input n income;
datalines;
1 100
1 200
1 300
2 400
2 100
2 500
3 600
3 200
3 500
;
From this data, my guess is that your logic goes like this:
Start with n = 1.
Output the 2 observations with the lowest income (100 and 200)
Go to the next name (n=2).
Output the 2 observations with the lowest incomes that have not already been output (300 and 400). 200 has already been output in the n=1 group.
...And so on...
This gives the desired result below:
data want;
input n income;
datalines;
1 100
1 200
2 300
2 400
3 500
;
Try out the solution below and verify that you get the result as posted above.
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 2 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;
Finally, let us create representative example data with 80 million obs. I change the if c = 2 then leave; statement to if c = 5 then leave; to go back to your actual problem.
The code below runs in about 45 sec on my system and processes the data in a single pass. Let me know if it works for you :-)
data have;
do n = 1 to 8e5;
do _N_ = 1 to 100;
income = ceil(rand('uniform') * 1e4);
output;
end;
end;
run;
data want(drop=c);
if _N_ = 1 then do;
dcl hash h(ordered : 'a', multidata : 'y');
h.definekey('income');
h.definedone();
dcl hiter i('h');
dcl hash inc();
inc.definekey('income');
inc.definedone();
end;
do until (last.n);
set have;
by n;
h.add();
end;
do c = 0 by 0 while (i.next() = 0);
if inc.add() = 0 then do;
c + 1;
output;
end;
if c = 5 then leave;
end;
_N_ = i.first();
_N_ = i.prev();
h.clear();
run;

In the DATA step of SAS, how can I get the value of a column whose name is represented as a string?

In the DATA step of SAS, you get the value of a column by using its name directly, for example like this:
name = col1;
But for some reason, I want to get the value of a column whose name is represented as a string. For example, like this:
name = get_value_of_column(cats("col", i))
Is this possible? And if so, how?
The DATA Step functions VVALUE and VVALUEX will return the formatted value of a variable.
VVALUE(<variable-name>) static, a step compilation time interaction
VVALUEX(<expression>) dynamic, a runtime expression resolving to a variable name
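For the question's cats("col", i) example, a minimal VVALUEX sketch might look like this; note it returns the formatted value as a character string:
data _null_;
   length value $ 32;
   col1 = 10; col2 = 20; col3 = 30;
   i = 2;
   value = vvaluex(cats('col', i));   /* resolves to the formatted value of col2 */
   put value=;
run;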
The actual (unformatted) value of the variable can be obtained dynamically via a per-type array scan.
Array Scan
data have;
input name $ x y z (s t u) ($) date: yymmdd10.;
format s t u $upcase. date yymmdd10.;
datalines;
x 1 2 3 a b c 2020-10-01
y 2 3 4 b c d 2020-10-02
z 3 4 5 c d e 2020-10-03
s 4 5 6 hi ho silver 2020-10-04
t 5 6 7 aa bb cc 2020-10-05
u 6 7 8 -- ** !! 2020-10-06
date 7 8 9 ppp qqq rrr 2020-10-07
;
data want;
set have;
length u_vvalue name_vvaluex $20.;
u_vvalue = vvalue(u);
name_vvaluex = vvaluex(name);
array nums _numeric_;
array chars _character_;
/* NOTE:
* variable based arrays cause automatic variable _i_ to be in the PDV
* and _i_ will be automatically dropped from output data sets
*/
do _i_ = 1 to dim(nums);
if upcase(name) = upcase(vname(nums(_i_))) then do;
name_numeric_raw = nums(_i_);
leave;
end;
end;
do _i_ = 1 to dim(chars);
if upcase(name) = upcase(vname(chars(_i_))) then do;
name_character_raw = chars(_i_);
leave;
end;
end;
run;
If you perform an 'excessive' amount of dynamic value lookup in your DATA step, a transposition could possibly lead to simpler processing.

Sum consecutive observations in a dataset SAS

I have a dataset that looks like:
Hour Flag
1 1
2 1
3 .
4 1
5 1
6 .
7 1
8 1
9 1
10 .
11 1
12 1
13 1
14 1
I want to have an output dataset like:
Total_Hours Count
2 2
3 1
4 1
As you can see, I want to count the number of hours included in each period with consecutive "1s". A missing value ends the consecutive sequence.
How should I go about doing this? Thanks!
You'll need to do this in two steps. First step is making sure the data is sorted properly and determining the number of hours in a consecutive period:
PROC SORT DATA = <your dataset>;
BY hour;
RUN;
DATA work.consecutive_hours;
SET <your dataset> END = lastrec;
RETAIN
total_hours 0
;
IF flag = 1 THEN total_hours = total_hours + 1;
ELSE
DO;
IF total_hours > 0 THEN output;
total_hours = 0;
END;
/* Need to output last record */
IF lastrec AND total_hours > 0 THEN output;
KEEP
total_hours
;
RUN;
Now a simple SQL statement:
PROC SQL;
CREATE TABLE work.hour_summary AS
SELECT
total_hours
,COUNT(*) AS count
FROM
work.consecutive_hours
GROUP BY
total_hours
;
QUIT;
You will have to do two things:
compute the run lengths
compute the frequency of the run lengths
For the case of using the implicit loop:
Each run-length occurrence can be computed and maintained in a retained tracking variable, testing for a missing value or end of data to output the run length, and incrementing (or resetting) it on a non-missing value. The run-length frequencies can then be counted with Proc FREQ; a hedged sketch of this variant follows.
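A minimal sketch of the implicit-loop variant, assuming the have dataset from the question (the intermediate dataset name runs and the output name want_freq are mine):
data runs(keep=total_hours);
   set have end=eof;
   retain total_hours 0;
   if not missing(flag) then total_hours + 1;   /* extend the current run */
   if missing(flag) or eof then do;             /* a run ends at a missing flag or end of data */
      if total_hours > 0 then output;
      total_hours = 0;
   end;
run;
proc freq data=runs noprint;
   tables total_hours / out=want_freq(drop=percent);   /* COUNT per total_hours */
run;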
An alternative is to use an explicit loop and a hash for frequency counts.
Example:
data have; input
Hour Flag; datalines;
1 1
2 1
3 .
4 1
5 1
6 .
7 1
8 1
9 1
10 .
11 1
12 1
13 1
14 1
;
data _null_;
declare hash counts(ordered:'a');
counts.defineKey('length');
counts.defineData('length', 'count');
counts.defineDone();
do until (end);
set have end=end;
if not missing(flag) then
length + 1;
if missing(flag) or end then do;
if length > 0 then do;
if counts.find() eq 0
then count+1;
else count=1;
counts.replace();
length = 0;
end;
end;
end;
counts.output(dataset:'want');
run;
An alternative
data _null_;
if _N_ = 1 then do;
dcl hash h(ordered : "a");
h.definekey("Total_Hours");
h.definedata("Total_Hours", "Count");
h.definedone();
end;
do Total_Hours = 1 by 1 until (last.Flag);
set have end=lr;
by Flag notsorted;
end;
Count = 1;
if Flag then do;
if h.find() = 0 then Count+1;
h.replace();
end;
if lr then h.output(dataset : "want");
run;
Several weeks ago, #Richard taught me how to use a DOW-loop and a direct-addressing array. Today, I pass it on to you.
data want(keep=Total_Hours Count);
array bin[99]_temporary_;
do until(eof1);
set have end=eof1;
if Flag then count + 1;
if ^Flag or eof1 then do;
bin[count] + 1;
count = .;
end;
end;
do i = 1 to dim(bin);
Total_Hours = i;
Count = bin[i];
if Count then output;
end;
run;
And thanks again, Richard, who also suggested this article to me.

Calculate rolling sum for one column in time interval in SAS

I have a problem, and I think there is not much to correct to make it work right.
I have table (with desired output column 'sum_usage'):
id opt t_purchase t_spent bonus usage sum_usage
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9 15
a 1 10NOV2017:12:02:00 10NOV2017:15:07:33 100 0 15
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6 6
b 1 10NOV2017:13:54:00 10NOV2017:14:02:58 100 3 10
a 1 10NOV2017:12:02:00 10NOV2017:20:22:07 100 12 27
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7 7
So, I need to sum all usage values from t_purchase (for one id, opt combination (group by id, opt) there is just one unique t_purchase) until t_spent.
Also, I have about a million rows, so a hash table would be the best solution. I've tried:
data want;
if _n_=1 then do;
if 0 then set have(rename=(usage=_usage));
declare hash h(dataset:'have(rename=(usage=_usage))',hashexp:20);
h.definekey('id','opt', 't_purchase', 't_spent');
h.definedata('_usage');
h.definedone();
end;
set have;
sum_usage=0;
do i=intck('second', t_purchase, t_spent) to t_spent ;
if h.find(key:user,key:id_option,key:i)=0 then sum_usage+_usage;
end;
drop _usage i;
run;
The fifth line from the bottom (do i=intck('second', t_purchase, t_spent) to t_spent) is surely not correct, but I have no idea how to approach this. So, the main problem is how to set up the time interval for this calculation. I already have one function on this hash table with the same keys, but without a time interval, so it would be good to write this one too, but it's not strictly necessary.
Personally, I would ditch the hash and use SQL.
Example Data:
data have;
input id $ opt
t_purchase datetime20.
t_spent datetime20.
bonus usage sum_usage;
format
t_purchase datetime20.
t_spent datetime20.;
datalines;
a 1 10NOV2017:12:02:00 10NOV2017:14:05:00 100 9 15
a 1 10NOV2017:12:02:00 10NOV2017:15:07:33 100 0 15
a 1 10NOV2017:12:02:00 10NOV2017:13:24:50 100 6 6
b 1 10NOV2017:13:54:00 10NOV2017:14:02:58 100 3 10
a 1 10NOV2017:12:02:00 10NOV2017:20:22:07 100 12 27
b 1 10NOV2017:13:54:00 10NOV2017:13:57:12 100 7 7
;
I'm leaving your sum_usage column here for comparison.
Now, create a table of sums. New value is sum_usage2.
proc sql noprint;
create table sums as
select a.id,
a.opt,
a.t_purchase,
a.t_spent,
sum(b.usage) as sum_usage2
from have as a,
have as b
where a.id = b.id
and a.opt = b.opt
and b.t_spent <= a.t_spent
and b.t_spent >= a.t_purchase
group by a.id,
a.opt,
a.t_purchase,
a.t_spent;
quit;
Now that you have the sums, join them back to the original table:
proc sql noprint;
create table want as
select a.*,
b.sum_usage2
from have as a
left join
sums as b
on a.id = b.id
and a.opt = b.opt
and a.t_spent = b.t_spent
and a.t_purchase = b.t_purchase;
quit;
This produces the table you want. Alternatively, you can use a hash to look up the values and add the sum in a Data Step (which can be faster given the size).
data want2;
set have;
format sum_usage2 best.;
if _n_=1 then do;
%create_hash(lk,id opt t_purchase t_spent, sum_usage2,"sums");
end;
rc = lk.find();
drop rc;
run;
%create_hash() macro available here https://github.com/FinancialRiskGroup/SASPerformanceAnalytics
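For reference, a hedged guess at the explicit hash setup that the %create_hash call above stands for (see the linked repository for the real macro; the if 0 then set line only serves to add sum_usage2 to the PDV):
data want2;
   set have;
   if _n_ = 1 then do;
      if 0 then set sums(keep=sum_usage2);   /* define sum_usage2 in the PDV */
      declare hash lk(dataset:'sums');
      lk.defineKey('id', 'opt', 't_purchase', 't_spent');
      lk.defineData('sum_usage2');
      lk.defineDone();
   end;
   if lk.find() ne 0 then call missing(sum_usage2);   /* clear when no match */
run;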
I believe this question is a morph of one of your earlier ones, where you computed a rolling sum by doing a hash lookup for every second of a 3-hour window for each record in your data set. Hopefully you realized that the simplicity of that approach carries a large cost: 3*3600 hash lookups per record, plus having to load the entire data vector into a hash.
The time-log data presented has new records inserted at the top, and I presume the data to be monotonically descending in time.
A DATA step can, in a single pass of monotonic data, compute the rolling sum over a time window. The technique uses 'ring' arrays, wherein index advancement is adjusted by modulus. One array is for the time and the other is for the metric (usage). The required array size is the maximum number of items that could occur within the time window.
Consider some generated sample data with time steps of 1, 2, and one jump of 200 seconds:
data have;
time = '12oct2017:11:22:32'dt;
usage = 0;
do _n_ = 1 to &have_count;
time + 2; *ceil(25*ranuni(123));
if _n_ > 30 then time + -1;
if _n_ = 145 then time + 200;
usage = floor(180*ranuni(123));
delta = time-lag(time);
output;
end;
run;
Start with the case of computing a rolling sum from prior items when sorted time ascending. (The descending case will follow):
The example parameters are RING_SIZE 16 and TIME_WINDOW of 12 seconds.
%let RING_SIZE = 16;
%let TIME_WINDOW = '00:00:12't;
data want;
array ring_usage [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
array ring_time [0:%eval(&RING_SIZE-1)] _temporary_ (&RING_SIZE*0);
retain ring_tail 0 ring_head -1 span 0 span_usage 0;
set have;
by time ; * cause error if data not sorted per algorithm requirement;
* unload from accumulated usage the tail items that fell out the window;
do while (span and time - ring_time(ring_tail) > &TIME_WINDOW);
span + -1;
span_usage + -ring_usage(ring_tail);
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
end;
ring_head = mod ( ring_head + 1, &RING_SIZE );
span + 1;
if span > 1 and (ring_head = ring_tail) then do;
_n_ = dim(ring_time);
put 'ERROR: Ring array too small, size=' _n_;
abort cancel;
end;
* update the ring array;
ring_time(ring_head) = time;
ring_usage(ring_head) = usage;
span_usage + usage;
drop ring_tail ring_head span;
run;
For the case of data sorted descending, you could jiggle things; sort ascending, compute rolling and resort descending.
What to do if such a jiggle can't be done, or you just want a single pass?
The items to be part of the rolling calculation have to be from 'lead' rows, or rows not yet read via SET. How is this possible ? A second SET statement can be used to open a separate channel to the data set, and thus obtain lead values.
There is a little more bookkeeping for processing lead data -- premature overwrite and diminished window at the end of data need to be handled.
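Before the full program, a minimal illustration of the second-SET 'lead' idea in isolation, assuming the generated have data above (firstobs=2 keeps the second channel one row ahead; lead_time and lead_usage are my names):
data lead_demo;
   set have end=last;
   if not last then
      set have(firstobs=2 keep=time usage
               rename=(time=lead_time usage=lead_usage));
   else call missing(lead_time, lead_usage);   /* no lead row after the last observation */
run;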
data want2;
array ring_usage [-1:%eval(&RING_SIZE-1)] _temporary_;
array ring_time [-1:%eval(&RING_SIZE-1)] _temporary_;
retain lead_index 0 ring_tail -1 ring_head -1 span 1 span_usage . guard_index .;
set have;
&debug put / _N_ ':' time= ring_head=;
* unload ring_head slotted item from sum;
span + -1;
span_usage + -ring_usage(ring_head);
* advance ring_head slot by 1, the vacated slot will be overwritten by lead;
ring_head = mod ( ring_head + 1, &RING_SIZE );
&debug put +2 ring_time(ring_head)= span= 'head';
* load ring with lead values via a second SET of the same data;
if not end2 then do;
do until (_n_ > 1 or lead_index = 0 or end2);
set have(keep=time usage rename=(time=t usage=u)) end=end2; * <--- the second SET ;
if end2 then guard_index = lead_index;
&debug if end2 then put guard_index=;
ring_time(lead_index) = t;
ring_usage(lead_index) = u;
&debug put +2 ring_time(lead_index)= 'lead';
lead_index = mod ( lead_index + 1, &RING_SIZE);
end;
end;
* advance ring_tail to cover the time window;
if ring_tail ne guard_index then do;
ring_tail_was = ring_tail;
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
do while (time - ring_time(ring_tail) <= &TIME_WINDOW);
span + 1;
span_usage + ring_usage(ring_tail);
&debug put +2 ring_time(ring_tail)= span= 'seek';
ring_tail_was = ring_tail;
ring_tail = mod ( ring_tail + 1, &RING_SIZE ) ;
if ring_tail_was = guard_index then leave;
if span > 1 and (ring_head = ring_tail) then do;
_n_ = dim(ring_time);
put 'ERROR: Ring array too small, size=' _n_;
abort cancel;
end;
end;
* seek went beyond window, back tail off to prior index;
ring_tail = ring_tail_was;
end;
&debug put +2 ring_time(ring_tail)= span= 'mark';
drop lead_index t u ring_: guard_index span;
format ring: span: usage 6.;
run;
options source;
Confirm both methods have the same computation:
proc sort data=want2; by time;
run;
proc compare noprint data=want compare=want2 out=diff outnoequal;
id time;
var span_usage;
run;
---------- LOG ----------
NOTE: There were 150 observations read from the data set WORK.WANT.
NOTE: There were 150 observations read from the data set WORK.WANT2.
NOTE: The data set WORK.DIFF has 0 observations and 4 variables.
I have not benchmarked ring-array versus SQL versus Proc EXPAND versus Hash.
Caution: Dead reckoning rolling values using +in and -out operations can experience round-off errors when dealing with non-integer values.

recursive search in SAS [duplicate]

I have two variables ID1 and ID2. They are both the same kinds of identifiers. When they appear in the same row of data it means they are in the same group. I want to make a group identifier for each ID. For example, I have
ID1 ID2
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
Then I would want
ID Group
1 1
2 1
3 2
4 1
5 1
6 1
7 2
Because 1,2,4,5,6 are paired by some combination in the original data they share a group. 3 and 7 are only paired with each other so they are a new group. I want to do this for ~20,000 rows. Every ID that is in ID1 is also in ID2 (more specifically if ID1=1 and ID2=2 for an observation, then there is another observation that is ID1=2 and ID2=1).
I've tried merging them back and forth but that doesn't work. I also tried call symput and trying to make a macro variable for each ID's group and then updating it as I move through rows, but I couldn't get that to work either.
I have used Haikuo Bian's answer as a starting point to develop a slightly more complex algorithm that seems to work for all the test cases I have tried so far. It could probably be optimised further, but it copes with 20000 rows in under a second on my PC while using only a few MB of memory. The input dataset does not need to be sorted in any particular order, but as written it assumes that every row is present at least once with id1 < id2.
Test cases:
/* Original test case */
data have;
input id1 id2;
cards;
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
;
run;
/* Revised test case - all in one group with connecting row right at the end */
data have;
input ID1 ID2;
/*Make sure each row has id1 < id2*/
if id1 > id2 then do;
t_id2 = id2;
id2 = id1;
id1 = t_id2;
end;
drop t_id2;
cards;
2 5
4 8
2 4
2 6
3 7
4 1
9 1
3 2
6 2
7 3
;
run;
/*Full scale test case*/
data have;
do _N_ = 1 to 20000;
call streaminit(1);
id1 = int(rand('uniform')*100000);
id2 = int(rand('uniform')*100000);
if id1 < id2 then output;
t_id2 = id2;
id2 = id1;
id1 = t_id2;
if id1 < id2 then output;
end;
drop t_id2;
run;
Code:
option fullstimer;
data _null_;
length id group 8;
declare hash h();
rc = h.definekey('id');
rc = h.definedata('id');
rc = h.definedata('group');
rc = h.definedone();
array ids(2) id1 id2;
array groups(2) group1 group2;
/*Initial group guesses (greedy algorithm)*/
do until (eof);
set have(where = (id1 < id2)) end = eof;
match = 0;
call missing(min_group);
do i = 1 to 2;
rc = h.find(key:ids[i]);
match + (rc=0);
if rc = 0 then min_group = min(group,min_group);
end;
/*If neither id was in a previously matched group, create a new one*/
if not(match) then do;
max_group + 1;
group = max_group;
end;
/*Otherwise, assign both to the matched group with the lowest number*/
else group = min_group;
do i = 1 to 2;
id = ids[i];
rc = h.replace();
end;
end;
/*We now need to work through the whole dataset multiple times
to deal with ids that were wrongly assigned to a separate group
at the end of the initial pass, so load the table into a
hash object + iterator*/
declare hash h2(dataset:'have(where = (id1 < id2))');
rc = h2.definekey('id1','id2');
rc = h2.definedata('id1','id2');
rc = h2.definedone();
declare hiter hi2('h2');
change_count = 1;
do while(change_count > 0);
change_count = 0;
rc = hi2.first();
do while(rc = 0);
/*Get the current group of each id from
the hash we made earlier*/
do i = 1 to 2;
rc = h.find(key:ids[i]);
groups[i] = group;
end;
/*If we find a row where the two ids have different groups,
move the id in the higher group to the lower group*/
if groups[1] < groups[2] then do;
id = ids[2];
group = groups[1];
rc = h.replace();
change_count + 1;
end;
else if groups[2] < groups[1] then do;
id = ids[1];
group = groups[2];
rc = h.replace();
change_count + 1;
end;
rc = hi2.next();
end;
pass + 1;
put pass= change_count=; /*For information only :)*/
end;
rc = h.output(dataset:'want');
run;
/*Renumber the groups sequentially*/
proc sort data = want;
by group id;
run;
data want;
set want;
by group;
if first.group then new_group + 1;
drop group;
rename new_group = group;
run;
/*Summarise by # of ids per group*/
proc sql;
select a.group, count(id) as FREQ
from want a
group by a.group
order by freq desc;
quit;
Interestingly, the suggested optimisation of not checking the group of id2 during the initial pass if id1 is already matched actually slows things down a little in this extended algorithm, because it means that more work has to be done in the subsequent passes if id2 is in a lower numbered group. E.g. output from a trial run I did earlier:
With 'optimisation':
pass=0 change_count=4696
pass=1 change_count=204
pass=2 change_count=23
pass=3 change_count=9
pass=4 change_count=2
pass=5 change_count=1
pass=6 change_count=0
NOTE: DATA statement used (Total process time):
real time 0.19 seconds
user cpu time 0.17 seconds
system cpu time 0.04 seconds
memory 9088.76k
OS Memory 35192.00k
Without:
pass=0 change_count=4637
pass=1 change_count=182
pass=2 change_count=23
pass=3 change_count=9
pass=4 change_count=2
pass=5 change_count=1
pass=6 change_count=0
NOTE: DATA statement used (Total process time):
real time 0.18 seconds
user cpu time 0.16 seconds
system cpu time 0.04 seconds
Please try the below code.
data have;
input ID1 ID2;
datalines;
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
;
run;
* Finding repeating in ID1;
proc sort data=have;by id1;run;
data want_1;
set have;
by id1;
attrib flagrepeat length=8.;
if not (first.id1 and last.id1) then flagrepeat=1;
else flagrepeat=0;
run;
* Finding repeating in ID2;
proc sort data=want_1;by id2;run;
data want_2;
set want_1;
by id2;
if not (first.id2 and last.id2) then flagrepeat=1;
run;
proc sort data=want_2 nodupkey;by id1 ;run;
data want(drop= ID2 flagrepeat rename=(ID1=ID));
set want_2;
attrib Group length=8.;
if(flagrepeat eq 1) then Group=1;
else Group=2;
run;
Hope this answer helps.
Like one commenter mentioned, a hash does seem to be a viable approach. In the following code, 'id' and 'group' are maintained in the hash table, and a new 'group' is added only when no 'id' match is found for the entire row. Please note that 'do over' is an undocumented feature; it can easily be replaced with a little more coding (a small sketch of the documented equivalent appears after the code).
data have;
input ID1 ID2;
cards;
1 4
1 5
2 5
2 6
3 7
4 1
5 1
5 2
6 2
7 3
;
data _null_;
if _n_=1 then
do;
declare hash h(ordered: 'a');
h.definekey('id');
h.definedata('id','group');
h.definedone();
call missing(id,group);
end;
set have end=last;
array ids id1 id2;
do over ids;
rc=sum(rc,h.find(key:ids)=0);
/*you can choose to 'leave' the loop here when first h.find(key:ids)=0 is met, for the sake of better efficiency*/
end;
if not rc > 0 then
group+1;
do over ids;
id=ids;
h.replace();
end;
if last then rc=h.output(dataset:'want');
run;
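For reference, a tiny sketch of the documented equivalent of 'do over ids': an explicit index over the array (the index variable _i and the demo values are mine):
data _null_;
   id1 = 1; id2 = 4;
   array ids id1 id2;
   do _i = 1 to dim(ids);
      value = ids[_i];   /* same element-by-element access as 'do over ids' */
      put _i= value=;
   end;
run;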