SAS: Split a dataset based on values in a (character) column, then further split each of these datasets based on a time column

I have stock trading data for a day - about 60 million rows. Basically, I want to create a dataset that lists the average duration for each 5-minute interval for each of the stocks.
Dataset Original
Obs  time             symbol  tradePrice  tradeId    datatime      duration
1    093000154451968  A       152.24      7.1675E13  1943170200.2  .
2    093000845296640  A       151.99      5.2984E13  1943170200.8  0.69084
3    093000845296640  A       151.99      5.2984E13  1943170200.8  0.00000
4    093000846918400  A       151.99      5.2984E13  1943170200.8  0.00162
5    093000847665152  A       151.94      6.2879E13  1943170200.8  0.00075
6    093000847675136  A       151.94      6.2879E13  1943170200.8  0.00001
7    093000857328128  A       151.94      5.2984E13  1943170200.9  0.00965
8    093000889283840  A       151.24      7.1675E13  1943170200.9  0.03196
9    093001249114624  A       151.74      7.1675E13  1943170201.2  0.35983
10   093001824934912  A       151.99      7.1675E13  1943170201.8  0.57582
11   093001834587904  A       151.71      5.2989E13  1943170201.8  0.00965
12   093002261742336  A       151.99      7.1675E13  1943170202.3  0.42715
Here the "time" variable is laid out as hhmmssnnnnnnnnn (n denotes nanoseconds, i.e. the seconds carry 9 digits after the decimal point),
and the "datetime" variable is converted to nanoseconds using both the date and the time.
For this code I only work with one day of data, so I use the "time" variable only.
Final Result
Stock  TimeInterval  Average duration
A      0930-0935     23456
A      0935-0940     56789
A      ........      ......
A      1555-1600     57689
B      0930-0935     23456
B      0935-0940     56789
B      ........      ......
B      1555-1600     57689
..     ...           ...
Z      0930-0935     23456
Z      0935-0940     56789
Z      ........      ......
Z      1555-1600     57689
Step 1:
I want to split the dataset such that I have a separate dataset for each of the stock symbols. I did this already.
Step 2:
To sum up the values in a column for every 5-minute interval from 0930 to 1600. I am struggling here.
Current Code:
/* Read Dataset */
DATA working_dataset;
set "C:\EQY_US_ALL_TRADE_202107\test_sample_sorted";
run;
/* List of Unique Symbols and feed them into new variables */
proc sql noprint;
select distinct symbol into :symbol1 - (NOTRIM)
from working_dataset;
%put &symbol1;
%put &symbol2;
/* Count of Unique Symbols and store the value in variable "n" */
proc sql noprint;
select count(distinct symbol) into: n
from working_dataset;
%put &n;
/* Keeping the variables needed for the analysis */
DATA working_dataset_2;
SET working_dataset (keep = symbol time duration tradePrice datetime tradeId);
run;
/* Extracting stock symbol names from the dataset;*/
proc sort data=working_dataset_2 out=symblist (keep = symbol)
nodupkey;
by symbol;
run;
/* Creating multiple datasets from the parent dataset;*/
data _null_;
set symblist;
call execute('data ' !! compress(symbol) !! '; set working_dataset_2; where symbol = "' !! symbol !! '"; run;');
run;
For Step 2:
I don't know how to do this, but I am planning to loop over the 78 five-minute intervals between 0930 and 1600, using an if statement controlled by the loop value. The following is just wishful thinking, not code. I don't know where to begin.
data dataset_final;
set "A"; /* To be changed as per variable for stock symbol */
array symb(&n); /* this array should have all the stock symbols */
do over; /* do over for all the array items in the array symb(&n) */
do i = 1 to 78;
if (time GE (093000000000000 + &i.- 1)) & (time LT (093000000000000 + &i.))
then send obs to symb_j_0930+&i.-1
end;
Any help is appreciated. I am not sure how to attach the datafile.
Step 1 works. I am able to create the separate datasets using CALL EXECUTE.
Log for Step 1:
439
440 DATA working_dataset;
441 set "C:\EQY_US_ALL_TRADE_202107\test_sample_sorted";
442 run;
NOTE: There were 50000 observations read from the data set
C:\EQY_US_ALL_TRADE_202107\test_sample_sorted.
NOTE: The data set WORK.WORKING_DATASET has 50000 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 0.12 seconds
cpu time 0.09 seconds
443
444 proc sql noprint;
445 select distinct symbol into :symbol1 - (NOTRIM)
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: ',', :, FROM, NOTRIM.
ERROR 76-322: Syntax error, statement will be ignored.
446 from working_dataset;
447 %put &symbol1;
A
448 %put &symbol2;
AA
449
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
450 proc sql noprint;
451 select count(distinct symbol) into: n
452 from working_dataset;
453 %put &n;
2
454
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.04 seconds
455 DATA working_dataset_2;
456 SET working_dataset (keep = symbol time duration tradePrice datetime tradeId);
457
458 /* Extracting stock symbol names from the dataset;*/
NOTE: There were 50000 observations read from the data set WORK.WORKING_DATASET.
NOTE: The data set WORK.WORKING_DATASET_2 has 50000 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
459 proc sort data=working_dataset_2 out=symblist (keep = symbol)
460 nodupkey;
461 by symbol;
462 run;
NOTE: There were 50000 observations read from the data set WORK.WORKING_DATASET_2.
NOTE: 49998 observations with duplicate key values were deleted.
NOTE: The data set WORK.SYMBLIST has 2 observations and 1 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
463 /* Creating multiple datasets from the parent dataset;*/
464 data _null_;
465 set symblist;
466 call execute('data ' !! compress(symbol) !! '; set working_dataset_2; where symbol = "' !! symbol
466! !! '"; run;');
467 run;
NOTE: There were 2 observations read from the data set WORK.SYMBLIST.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
NOTE: CALL EXECUTE generated line.
1 + data A; set working_dataset_2; where symbol = "A "; run;
NOTE: There were 24304 observations read from the data set WORK.WORKING_DATASET_2.
WHERE symbol='A ';
NOTE: The data set WORK.A has 24304 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
2 + data AA; set working_dataset_2; where symbol = "AA "; run;
NOTE: There were 25696 observations read from the data set WORK.WORKING_DATASET_2.
WHERE symbol='AA ';
NOTE: The data set WORK.AA has 25696 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
Step 2 is where I am horribly struggling. I am not sure how to do the code.

Assuming you have an actual time value (you can create one from your 15-digit time string), you can just convert that time to the start of its 5-minute interval and use that to group the data. No need for looping (or splitting).
Let's modify your example data so it actually has more than one stock symbol and more than one time interval. You can convert the first 6 characters of your TIME string into an actual TIME value, which we can then convert to the beginning of the 5-minute interval.
data have ;
input time :$16. symbol :$4. tradePrice tradeId datatime duration;
tod = input(time,hhmmss6.);
interval='00:05:00't*int(tod/'00:05:00't);
format tod interval tod8.;
nanosec = input(substr(time,7),32.);
cards;
093000154451968 A 152.24 7.1675E13 1943170200.2 .
093000845296640 A 151.99 5.2984E13 1943170200.8 0.69084
093500845296640 A 151.99 5.2984E13 1943170200.8 0.00000
093500846918400 A 151.99 5.2984E13 1943170200.8 0.00162
093800847665152 A 151.94 6.2879E13 1943170200.8 0.00075
093000847675136 B 151.94 6.2879E13 1943170200.8 0.00001
093100857328128 B 151.94 5.2984E13 1943170200.9 0.00965
093900889283840 B 151.24 7.1675E13 1943170200.9 0.03196
093001249114624 C 151.74 7.1675E13 1943170201.2 0.35983
093301824934912 C 151.99 7.1675E13 1943170201.8 0.57582
093801834587904 C 151.71 5.2989E13 1943170201.8 0.00965
094102261742336 C 151.99 7.1675E13 1943170202.3 0.42715
;
So once you have a dataset (or even a view) that has the three variables needed, SYMBOL, INTERVAL and DURATION, you can then just use PROC SUMMARY to produce the mean of the durations.
proc summary nway ;
class symbol interval;
var duration;
output out=want mean=mean_duration ;
run;
Results:
mean_
Obs symbol interval _TYPE_ _FREQ_ duration
1 A 09:30:00 3 2 0.69084
2 A 09:35:00 3 3 0.00079
3 B 09:30:00 3 2 0.00483
4 B 09:35:00 3 1 0.03196
5 C 09:30:00 3 2 0.46783
6 C 09:35:00 3 1 0.00965
7 C 09:40:00 3 1 0.42715
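To get the 0930-0935 style labels shown in the question's desired output, the numeric INTERVAL value can be turned into text afterwards. This is only a minimal sketch, assuming the WANT dataset produced by the PROC SUMMARY step above; the names want_labeled and TimeInterval are made up for illustration.

```sas
data want_labeled;
  set want;
  length TimeInterval $9;
  /* tod5. writes hh:mm with a leading zero; compress() drops the colon */
  TimeInterval = cats(compress(put(interval, tod5.), ':'),
                      '-',
                      compress(put(interval + '00:05:00't, tod5.), ':'));
  keep symbol TimeInterval mean_duration;
run;
```

The interval end is simply the start plus five minutes, so no lookup table of the 78 intervals is needed.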

You say you're struggling with Step 2: "To sum up the values in a column for every 5-minute interval from 0930 to 1600. I am struggling here."
I'm just going to address that part of your question based on looking at the final result that you want. I'm providing code so you don't need to split the data into multiple datasets of each stock.
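data final;
set <dataset>;
/* minute boundaries crossed since 09:30; tradetime must be a numeric SAS time value */
time_interval = intck("minute", '09:30:00't, tradetime);
/* round down to the start of the 5-minute interval */
time_interval = time_interval - mod(time_interval, 5);
run;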
proc sql;
select stock, time_interval, avg(duration) as avg_duration
from final
group by stock, time_interval;
quit;
But, if you want to keep multiple datasets by stock, then just remove the "stock" variable from the code and apply this to every stock dataset you have.

Related

SAS - If then do condition

I have a numeric column,
and I have the logic shown below:
if col_1 = "2" then do;
col2 = col3+col4;
end;
Since it is a numeric column, I was expecting the SAS code to throw an error, or at least not to perform the actions under the DO statement.
However, the statements under DO get executed.
It produces the same result as the code below:
if col_1 = 2 then do;
col2 = col3+col4;
end;
Can you explain how this is possible?
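Did you not notice the note in the log? SAS converts the character value to numeric automatically and reports it with a NOTE. There is a DATA statement option, NOTE2ERR, that turns such notes into errors, so the automatic conversion stops the step instead.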
44 data _null_;
45 x = 2;
46 if x eq '2' then put 'NOTE: C2N ' _all_;
47 run;
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
46:12
NOTE: C2N x=2 _ERROR_=0 _N_=1
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
2 The SAS System 19:00 Friday, February 26, 2021
48
49 data _null_ / note2err;
50 x = 2;
51 if x eq '2' then put 'NOTE: C2N ' _all_;
ERROR: Character value found where numeric value needed at line 51 column 12.
52 run;
NOTE: The SAS System stopped processing this step because of errors.
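If the comparison value really does arrive as text, the conversion can be written out explicitly so that nothing is converted behind the scenes. A small sketch; the variable c and the 8. informat are assumptions chosen for illustration.

```sas
data _null_;
  x = 2;
  c = '2';
  /* numeric literal: both sides are numeric, nothing to convert */
  if x eq 2 then put 'numeric literal matched';
  /* INPUT() converts the character value explicitly */
  if x eq input(c, 8.) then put 'explicit conversion matched';
run;
```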

Calculating median across multiple rows and columns in SAS 9.4

I tried searching multiple places but have not been able to find a solution yet. I was wondering if someone here would be able to please help me?
I am trying to calculate a median value (with Q1 and Q3) across multiple rows and columns in SAS 9.4 The dataset I am working with looks like the following:
Obs tumor_size_1 tumor_size_2 tumor_size_3 tumor_size_4
1 4 1.5 1 1
2 2.5 2 . .
3 3 . . .
4 4 . . .
5 3.5 1 . .
The context is this is for a medical condition where a person may have 1 (or more) tumors. Each row represents 1 person. Each person may have up to 4 tumors. I would like to determine the median size of all tumors for the entire cohort (not just the median size for each person). Is there a way to calculate this? Thank you in advance.
A transpose of the data will yield a data structure (form) that is amenable to median and quartile computations, at a variety of aggregate combinations, made with PROC SUMMARY and a CLASS statement.
Example:
data have;
input
patient tumor_size_1 tumor_size_2 tumor_size_3 tumor_size_4; datalines;
1 4 1.5 1 1
2 2.5 2 . .
3 3 . . .
4 4 . . .
5 3.5 1 . .
;
proc transpose data=have out=new_have;
by patient;
var tumor:;
run;
proc summary data=new_have;
class patient;
var col1;
output out=want Q1=Q1 Q3=Q3 MEDIAN=MEDIAN N=N;
run;
Results
patient _TYPE_ _FREQ_ Q1 Q3 MEDIAN N
. 0 20 1 3.50 2.25 10
1 1 4 1 2.75 1.25 4
2 1 4 2 2.50 2.25 2
3 1 4 3 3.00 3.00 1
4 1 4 4 4.00 4.00 1
5 1 4 1 3.50 2.25 2
The _TYPE_ column describes the ways in which the CLASS variables are combined in order to achieve the results for the requested statistics. The _TYPE_ = 0 case is for all values, and, in this problem, the _FREQ_ = 20 indicates 20 inputs went into the computation consideration, and that N = 10 of those were non-missing and were involved in the actual computation. The role of _TYPE_ becomes more obvious when there is more than one CLASS variable.
From the Output Data Set documentation:
the variable _TYPE_ that contains information about the class variables. By default _TYPE_ is a numeric variable. If you specify CHARTYPE in the PROC statement, then _TYPE_ is a character variable. When you use more than 32 class variables, _TYPE_ is automatically a character variable.
and
The value of _TYPE_ indicates which combination of the class variables PROC MEANS uses to compute the statistics. The character value of _TYPE_ is a series of zeros and ones, where each value of one indicates an active class variable in the type. For example, with three class variables, PROC MEANS represents type 1 as 001, type 5 as 101, and so on.
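Since the question asks only for the cohort-wide figure, one option is to drop the CLASS statement so that the output holds just the overall row. A sketch, reusing the transposed dataset new_have from the example above (the output name want_all is arbitrary):

```sas
proc summary data=new_have;
  var col1;   /* all tumor sizes, one per row after the transpose */
  output out=want_all Q1=Q1 Q3=Q3 MEDIAN=MEDIAN N=N;
run;
```

This should give the same statistics as the _TYPE_ = 0 row in the results above.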
A far less elegant way to compute the median of all is to store all the values in an oversized array and use the MEDIAN function on the array after the last row is read in:
data median_all;
set have end=lastrow;
array values [1000000] _temporary_;
array sizes tumor_size_1-tumor_size_4;
do sIndex = 1 to dim(sizes);
/* if not missing (sizes[sIndex]) then do; */ %* uncomment for dense fill;
vIndex + 1;
values[vIndex] = sizes[sIndex];
/* end; */ %* uncomment for dense fill;
end;
if lastrow then do;
median_all_tumor_sizes = median (of values(*));
output;
put (median:) (=);
end;
keep median:;
run;
-------- LOG -------
median_all_tumor_sizes=2.25

matching two datasets with one month lag

I am trying to match the maximum daily value within each month to monthly data.
data daily;
input permno $ date ret;
datalines;
1000 19860101 88
1000 19860102 90
1000 19860201 70
1000 19860202 55
1001 19860201 97
1001 19860202 74
1001 19860203 79
1002 19860301 55
1002 19860302 100
1002 19860301 10
;
run;
data monthly;
input permno $ date ret;
datalines;
1000 19860131 1
1000 19860228 2
1000 19860331 5
1001 19860331 3
1002 19860430 4
;
run;
The result I want is the following (I want to match each month's daily maximum to the monthly data one month later):
1000 19860102 90 1000 19860228 2
1000 19860201 70 1000 19860331 5
1001 19860201 97 1001 19860331 3
1002 19860302 100 1002 19860430 4
Below is what I have tried so far.
I want the maximum ret value within each month, so I created yrmon to give all daily rows in the same month the same yyyymm value.
data a1; set daily;
yrmon=year(date)*100 + month(date);
run;
To choose the maximum value (here, ret) within the same yrmon group for the same permno, I used the code below:
proc means data=a1 noprint;
class permno yrmon ;
var ret;
output out= a2 max=maxret;
run;
However, this only gave me permno, yrmon and the maximum ret, dropping the original date.
data a3;
set a2;
new=intnx('month',yrmon,1);
format date new yymmn6.;
run;
But it won't work since yrmon is no longer date format.
Thank you in advance.
Hello,
I am trying to match two different datasets by permno (same company) but with a one-month lag (e.g. daily9 dataset yrmon=198601 and monthly2 dataset yrmon=198602).
This is difficult for me to handle because if I just add 1 to yrmon, 198612 + 1 will not be 198701, and I am confused about how to handle such cases.
Can anyone help?
1) informat date1/date2 yymmn6. is used to read the date in yyyymm format
2) format date1/date2 yymmn6. is used to view the date in yyyymm format
3) intnx("months",b.date2,-1) is used to join the dates with lag of 1 month
data data1;
input date1 value1;
informat date1 yymmn6.;
format date1 yymmn6.;
cards;
200101 200
200212 300
200211 400
;
run;
data data2;
input date2 value2;
informat date2 yymmn6.;
format date2 yymmn6.;
cards;
200101 3000000
200102 4000000
200301 2000000
200212 2000000
;
run;
proc sql;
create table result as
select a.*,b.date2,b.value2 from
data1 a
left join
data2 b
on a.date1 = intnx("months",b.date2,-1);
quit;
My Output:
date1 |value1 |date2 |value2
200101 |200 |200102 |4000000
200211 |400 |200212 |2000000
200212 |300 |200301 |2000000
Let me know in case of any queries.
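Applied to the daily and monthly data from the question, the same INTNX idea might look like the sketch below. It assumes the date columns hold real SAS dates (for example, read with `input permno $ date :yymmdd8. ret;`) rather than plain numbers like 19860101; the names daily_max, month_start, mdate and mret are made up for illustration.

```sas
proc sql;
  /* highest daily ret per permno and calendar month */
  create table daily_max as
  select permno, date, ret
  from (select *, intnx('month', date, 0, 'b') as month_start
        from daily)
  group by permno, month_start
  having ret = max(ret);

  /* pair each daily maximum with the monthly row one month later */
  create table want as
  select d.permno, d.date format=yymmddn8., d.ret,
         m.date as mdate format=yymmddn8., m.ret as mret
  from daily_max as d
  inner join monthly as m
    on d.permno = m.permno
   and intnx('month', d.date, 1, 'b') = intnx('month', m.date, 0, 'b');
quit;
```

Because both sides are aligned to the first day of their month with INTNX, the December to January rollover needs no special handling. If two days in a month tie for the maximum, both rows are kept.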

Reshaping data from long to wide

Below is an example that I found for reshaping data from long to wide. But I am not able to understand the code, especially the way they are replacing blanks and why. Can someone help me understand the code?
Example 1: Reshaping one variable
We will begin with a small data set with only one variable to be reshaped. We will use the variables year and faminc (for family income) to create three new variables: faminc96, faminc97 and faminc98. First, let's look at the data set and use proc print to display it.
DATA long ;
INPUT famid year faminc ;
CARDS ;
1 96 40000
1 97 40500
1 98 41000
2 96 45000
2 97 45400
2 98 45800
3 96 75000
3 97 76000
3 98 77000
;
RUN ;
PROC PRINT DATA=long ;
RUN ;
Obs famid year faminc
1 1 96 40000
2 1 97 40500
3 1 98 41000
4 2 96 45000
5 2 97 45400
6 2 98 45800
7 3 96 75000
8 3 97 76000
9 3 98 77000
Now let's look at the program. The first step in the reshaping process is sorting the data (using proc sort) on an identification variable (famid) and saving the sorted data set (longsort). Next we write a data step to do the actual reshaping. We will explain each of the statements in the data step in order.
PROC SORT DATA=long OUT=longsort ;
BY famid ;
RUN ;
DATA wide1 ;
SET longsort ;
BY famid ;
KEEP famid faminc96 -faminc98 ;
RETAIN faminc96 - faminc98 ;
ARRAY afaminc(96:98) faminc96 - faminc98 ;
IF first.famid THEN
DO;
DO i = 96 to 98 ;
afaminc( i ) = . ;
END;
END;
afaminc( year ) = faminc ;
IF last.famid THEN OUTPUT ;
RUN;
This is a good example to compare and contrast with DO UNTIL(LAST.FAMID). It does away with the RETAIN, the initialization to missing on FIRST.FAMID, and the LAST.FAMID test for when to OUTPUT. Those operations are still done, just using the built-in features of the data step loop.
DATA long;
INPUT famid year faminc;
CARDS;
1 96 40000
1 97 40500
1 98 41000
2 96 45000
2 97 45400
2 98 45800
3 96 75000
3 97 76000
3 98 77000
;;;;
RUN;
proc print;
run;
data wide;
do until(last.famid);
set long;
by famid;
ARRAY afaminc[96:98] faminc96-faminc98;
afaminc[year]=faminc;
end;
drop year faminc;
run;
proc print;
run;
The main element here is the SAS RETAIN statement.
The data step is executed once for every observation in the dataset. At the start of every iteration the variables created in the data step are set to missing, and then the next observation is loaded from the dataset.
If a variable is RETAINed it will not be reset, but will keep its value from the previous iteration.
BY famid ;
Your dataset is sorted and the data step uses a BY statement. This creates the first.famid and last.famid variables. These are just binary flags that are set to 1 for the first/last observation of each id-group.
RETAIN faminc96 - faminc98 ;
As already explained faminc96 - faminc98 will keep their value from one datastep iteration to the next.
ARRAY afaminc(96:98) faminc96 - faminc98 ;
Just an array, so you can call the variables by number instead of name.
IF first.famid THEN
DO;
DO i = 96 to 98 ;
afaminc( i ) = . ;
END;
END;
For the first observation in each id-group the retained variables are reset. Otherwise you would carry values over from one id-group to the next. The same could be done with IF first.famid then call missing(of afaminc(*));
afaminc( year ) = faminc ;
Writing the information to your transposed variables, according to the year.
IF last.famid THEN OUTPUT ;
After you have written all the values to your new variables, you only OUTPUT one observation (the last) in every id-group to the new dataset. As the variables were retained, they are all filled at this point.
This data step is fast and purpose-built. But generally you could just use proc transpose.
I highly recommend proc transpose. It'll make your life easier.
http://support.sas.com/resources/papers/proceedings09/060-2009.pdf
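For comparison, a PROC TRANSPOSE sketch of the same reshape, assuming the longsort dataset from the question (sorted by famid); wide_t is an arbitrary output name.

```sas
proc transpose data=longsort out=wide_t(drop=_name_) prefix=faminc;
  by famid;      /* one output row per family */
  id year;       /* year values 96, 97, 98 become the name suffixes */
  var faminc;    /* the values to spread across the new columns */
run;
```

With PREFIX=faminc and ID year, the new columns are named faminc96, faminc97 and faminc98, matching the data step version.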

Ranking values based on another data set in SAS

Say I have two data sets A and B that have identical variables and want to rank values in B based on values in A, not B itself (as "PROC RANK data=B" does.)
Here's a simplified example of data sets A, B and want (the desired output):
A:
obs_A VAR1 VAR2 VAR3
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
B:
obs_B VAR1 VAR2 VAR3
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
want:
obs VAR1 VAR2 VAR3
1 2 2 3
2 2 4 2
3 4 3 1
4 5 4 3
5 6 6 6
I came up with a macro loop that involves PROC RANK and PROC APPEND, like below:
%macro MyRank(A,B);
data AB; set &A &B; run;
%do i=1 %to 5;
proc rank data=AB(where=(obs_A ne . OR obs_B=&i)) out=tmp;
var VAR1-VAR3;
run;
proc append base=want data=tmp(where=(obs_B=&i) rename=(obs_B=obs)); run;
%end;
%mend;
This is OK when the number of observations in B is small. But when that number is very large, it takes too long and so isn't a good solution.
Thanks in advance for suggestions.
I would create formats to do this. What you're really doing is defining ranges via A that you want to apply to B. Formats are very fast - here assuming "A" is relatively small, "B" can be as big as you like and it's always going to take just as long as it takes to read and write out the B dataset once, plus a couple read/writes of A.
First, reading in the A dataset:
data ranking_vals;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;;;;
run;
Then transposing it to vertical, as this will be the easiest way to rank them (just plain old sorting, no need for proc rank).
data for_ranking;
set ranking_vals;
array var[3];
do _i = 1 to dim(var);
var_name = vname(var[_i]);
var_value = var[_i];
output;
end;
run;
proc sort data=for_ranking;
by var_name var_value;
run;
Then we create a format input dataset, and use the rank as the label. The range is (previous value -> current value), and label is the rank. I leave it to you how you want to handle ties.
data for_fmt;
set for_ranking;
by var_name var_value;
retain prev_value;
if first.var_name then do; *initialize things for a new varname;
rank=0;
prev_value=.;
hlo='l'; *first record has 'minimum' as starting point;
end;
rank+1;
fmtname=cats(var_name,'F');
start=prev_value;
end=var_value;
label=rank;
output;
if last.var_name then do; *For last record, some special stuff;
start=var_value;
end=.;
hlo='h';
label=rank+1;
output; * Output that 'high' record;
start=.;
end=.;
label=.;
hlo='o';
output; * And a "invalid" record, though this should never happen;
end;
prev_value=var_value; * Store the value for next row.;
run;
proc format cntlin=for_fmt;
quit;
And then we test it out.
data test_b;
input obs_B VAR1 VAR2 VAR3;
var1r=put(var1,var1f.);
var2r=put(var2,var2f.);
var3r=put(var3,var3f.);
datalines;
1 15 150 2234
2 14 352 1555
3 36 251 1000
4 41 350 2011
5 60 553 5012
;;;;
run;
One way that you can rank by a variable from a separate dataset is by using proc sql's correlated subqueries. Essentially, you count the number of lower values in the lookup dataset for each value in the data to be ranked.
proc sql;
create table want as
select
B.obs_B,
(
select count(distinct A.Var1) + 1
from A
where A.var1 <= B.var1
) as var1
from B;
quit;
This can be wrapped in a macro. Below, a macro loop is used to write each of the subqueries. It loops through the list of variables and parametrises the subquery as required.
%macro rankBy(
inScore /*Dataset containing data to be ranked*/,
inLookup /*Dataset containing data against which to rank*/,
varID /*Variable by which to identify an observation*/,
varsRank /*Space separated list of variable names to be ranked*/,
outData /*Output dataset name*/);
/* Rank variables in one dataset by identically named variables in another */
proc sql;
create table &outData. as
select
scr.&varID.
/* Loop through each variable to be ranked */
%do i = 1 %to %sysfunc(countw(&varsRank., %str( )));
/* Store the variable name in a macro variable */
%let var = %scan(&varsRank., &i., %str( ));
/* Rank: count all the rows with lower value in lookup */
, (
select count(distinct lkp&i..&var.) + 1
from &inLookup. as lkp&i.
where lkp&i..&var. <= scr.&var.
) as &var.
%end;
from &inScore. as scr;
quit;
%mend rankBy;
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
Regarding speed, this will be slow if your A is large, but should be okay for large B and small A.
In rough testing on a slow PC I saw:
A: 1e1 B: 1e6 time: ~1s
A: 1e2 B: 1e6 time: ~2s
A: 1e3 B: 1e6 time: ~5s
A: 1e1 B: 1e7 time: ~10s
A: 1e2 B: 1e7 time: ~12s
A: 1e4 B: 1e6 time: ~30s
Edit:
As Joe points out below, the length of time the query takes depends not just on the number of observations in the dataset, but also on how many unique values exist within the data. Apparently SAS performs optimisations to reduce the comparisons to only the distinct values in B, thereby reducing the number of times the elements in A need to be counted. This means that if dataset B contains a large number of unique values (in the ranking variables), the process will take significantly longer than the times shown. This is more likely to happen if your data is not integers, as Joe demonstrates.
Edit:
Runtime test rig:
data A;
input obs_A VAR1 VAR2 VAR3;
datalines;
1 10 100 2000
2 20 300 1000
3 30 200 4000
4 40 500 3000
5 50 400 5000
;
run;
data B;
do obs_B = 1 to 1e7;
VAR1 = ceil(rand("uniform")* 60);
VAR2 = ceil(rand("uniform")* 500);
VAR3 = ceil(rand("uniform")* 6000);
output;
end;
run;
%let start = %sysfunc(time());
%rankBy(
inScore = B,
inLookup = A,
varID = obs_B,
varsRank = VAR1 VAR2 VAR3,
outData = want);
%let time = %sysfunc(putn(%sysevalf(%sysfunc(time()) - &start.), time12.2));
%put &time.;
Output:
0:00:12.41