Sorting an almost sorted dataset in SAS

I have a large dataset in SAS which I know is almost sorted; I know the first and second levels are sorted, but the third level is not. Furthermore, the first and second levels contain a large number of distinct values and so it is even less desirable to sort the first two columns again when I know it is already in the correct order. An example of the data is shown below:
ID Label Frequency
1 Jon 20
1 John 5
2 Mathieu 2
2 Mathhew 7
2 Matt 5
3 Nat 1
3 Natalie 4
Using the "presorted" option on a proc sort seems to only check if the data is sorted on every key, otherwise it does a full sort of the data. Is there any way to tell SAS that the first two columns are already sorted?

If you've previously sorted the dataset by the first 2 variables, then regardless of the sortedby information on the dataset, SAS will take less CPU time to sort it *. This is a natural property of most decent sorting algorithms - it's much less work to sort something that's already nearly sorted.
* As long as you don't use the force option in the proc sort statement, which forces it to do redundant sorting.
Here's a little test I ran:
option fullstimer;
/*Make sure we have plenty of rows with the same 1 + 2 values, so that sorting by 1 + 2 doesn't imply that the dataset is already sorted by 1 + 2 + 3*/
data test;
do _n_ = 1 to 10000000;
var1 = round(rand('uniform'),0.0001);
var2 = round(rand('uniform'),0.0001);
var3 = round(rand('uniform'),0.0001);
output;
end;
run;
/*Sort by all 3 vars at once*/
proc sort data = test out = sort_all;
by var1 var2 var3;
run;
/*Create a baseline dataset already sorted by 2/3 vars*/
/*N.B. proc sort adds sortedby information to the output dataset*/
proc sort data = test out = baseline;
by var1 var2;
run;
/*Sort baseline by all 3 vars*/
proc sort data = baseline out = sort_3a;
by var1 var2 var3;
run;
/*Remove sort information from baseline dataset (leaving the order of observations unchanged)*/
proc datasets lib = work nolist nodetails;
modify baseline (sortedby = _NULL_);
run;
quit;
/*Sort baseline dataset again*/
proc sort data = baseline out = sort_3b;
by var1 var2 var3;
run;
The relevant results I got were as follows:
SAS took 8 seconds to sort the original completely unsorted dataset by all 3 variables.
SAS took 4 seconds to sort by 3/3 starting from the baseline dataset already sorted by 2/3 variables.
SAS took 4 seconds to sort by 3/3 starting from the same baseline dataset after removing the sort information from it.
The relevant metric from the log output is the amount of user CPU time.
Of course, if the almost-sorted dataset is very large and contains lots of other variables, you may wish to avoid the sort due to the write overhead when replacing it. Another approach you could take would be to create a composite index - this would allow you to do things involving by group processing, for example.
/*Alternative option - index the 2/3 sorted dataset on all 3 vars rather than sorting it*/
proc datasets lib = work nolist nodetails;
/*Replace the sort information*/
modify baseline(sortedby = var1 var2);
run;
/*Create composite index*/
modify baseline;
index create index1 = (var1 var2 var3);
run;
quit;
Creating an index requires a read of the whole dataset, as does the sort, but only a fraction of the work involved in writing it out again, and might be faster than a 2/3 to 3/3 sort in some situations.
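As a rough sketch of the BY-group processing mentioned above, and assuming the composite index index1 has already been created on baseline, a DATA step can read the rows in var1/var2/var3 order without a physical sort. The output dataset name lowest_var3 is made up for illustration. Because the stored sort information only covers var1 and var2, SAS should fall back to the index to satisfy the BY statement; reading through an index is usually slower per row than a sequential read, so this is worth timing on real data.

```sas
/* Sketch only: assumes baseline carries the composite index index1 */
/* on (var1 var2 var3) created in the PROC DATASETS step above.     */
data lowest_var3;                /* hypothetical output name */
  set baseline;
  by var1 var2 var3;             /* not fully sorted, so the index is expected to be used */
  if first.var2;                 /* keep the first row of each var1/var2 group, */
                                 /* i.e. the row with the smallest var3         */
run;
```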

Related

Sum a number of specific rows before and after

I want to compute, for each row, the sum of the 250 previous rows, starting from the 250th row.
X= lag1(VWRETD)+ lag2(VWRETD)+ ... +lag250(VWRETD)
X = sum ( lag1(VWRETD), lag2(VWRETD), ... ,lag250(VWRETD) )
I tried to use the LAG function, but it does not work well with so many lags.
I also want to calculate the sum of the next 250 rows after each row.
What you're looking for is a moving sum both forwards and backwards where the sum is missing until that 250th observation. The easiest way to do this is with PROC EXPAND.
Sample data:
data have;
do MKDate = '01JAN1993'd to '31DEC2000'd;
VWRET = rand('uniform');
output;
end;
format MKDate mmddyy10.;
run;
Code:
proc expand data=have out=want;
id MKDate;
convert VWRET = x_backwards_250 / transform=(movsum 250 trimleft 250);
convert VWRET = x_forwards_250 / transform=(reverse movsum 250 trimleft 250 reverse);
run;
Here's what the transformation operations are doing:
Creating a backwards moving sum of 250 observations, then setting the initial 250 to missing.
Reversing VWRET, creating a moving sum of 250 observations, setting the initial 250 to missing, then reversing it again. This effectively creates a forward moving sum.
The key is how to read observations from the preceding and following rows. As for your sum(n1, n2,...,nx) call, you can replace it with iterative summation.
This example uses multiple SET statements to sum a variable over the 25 preceding and 25 following rows:
data test;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
set sashelp.air(keep=air rename=air=pre_air) point=i;
sum_pre=sum(sum_pre,pre_air);
end;
do j=_n_+1 to _n_+25;
set sashelp.air(keep=air rename=air=post_air) point=j;
sum_post=sum(sum_post,post_air);
end;
end;
drop pre_air post_air;
run;
Only the 26th through (nobs-25)th rows are calculated, where nobs is the number of observations in the input dataset sashelp.air.
Multiple SET statements with POINT= can be slow on a large dataset. For better performance, you can load the values into a temporary array with a DOW-loop instead:
data test;
array _val_[1024] _temporary_;
if _n_=1 then do i=1 by 1 until(eof);
set sashelp.air end=eof;
_val_[i]=air;
end;
set sashelp.air nobs=nobs;
if 25<_n_<nobs-25+1 then do;
do i=_n_-25 to _n_-1;
sum_pre=sum(sum_pre,_val_[i]);
end;
do j=_n_+1 to _n_+25;
sum_post=sum(sum_post,_val_[j]);
end;
end;
drop i j;
run;
The weakness is that you have to hard-code the array dimension, which must be greater than or equal to nobs.
These techniques come from a concept called "Table Look-Up". For the SAS context, read "Table Look-Up by Direct Addressing: Key-Indexing -- Bitmapping -- Hashing", Paul Dorfman, SUGI 26.
You don't want to use normal arithmetic with missing values, because then the result is always a missing value. Use the SUM() function instead.
You don't need to spell out all of the lags. Just keep a normal running sum but add the wrinkle of removing the last one in by subtraction. So your equation only needs to reference the one lagged value.
Here is a simple example using running sum of 5 using SASHELP.CLASS data as an example:
%let n=5 ;
data step1;
set sashelp.class(keep=name age);
retain running_sum ;
running_sum=sum(running_sum,age,-(sum(0,lag&n.(age))));
if _n_ >= &n then want=running_sum;
run;
So the sum of the first 5 observations is 68. But for the next observation the sum goes down to 66 since the age on the 6th observation is 2 less than the age on the first observation.
To calculate the other variable sort the dataset in descending order and use the same logic to make another variable.
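The descending-order idea above might be sketched as follows. This is only an illustration under some assumptions: the helper variable row and the dataset names numbered, reversed and forward are invented here, and the window includes the current row plus the next n-1 rows, mirroring the backward example.

```sas
%let n=5;

/* Record the original row order so it can be restored afterwards */
data numbered;
  set sashelp.class(keep=name age);
  row = _n_;                      /* hypothetical helper variable */
run;

/* Reverse the row order */
proc sort data=numbered out=reversed;
  by descending row;
run;

/* Same running-sum logic as before, now applied to the reversed data */
data forward;
  set reversed;
  retain running_sum_fwd;
  running_sum_fwd = sum(running_sum_fwd, age, -(sum(0, lag&n.(age))));
  if _n_ >= &n then want_fwd = running_sum_fwd;
  drop running_sum_fwd;
run;

/* Put the rows back in their original order */
proc sort data=forward;
  by row;
run;
```

After the final sort, want_fwd is missing for the last n-1 rows, since they do not have a full window ahead of them.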

Outliers for all numerical values to mean SAS

I am working in SAS with a dataset with a lot of numeric values which I have standardised as follows:
proc standard data=df mean=0 std=1
out=df;
run;
Is there an easy way to deal with outliers (beyond +/- 3 standard deviations) for all numeric variables? Ideally I would change all of those to + or - 3 standard deviations, or in the worst case remove them.
You have to run through the data twice. There are many ways you can adjust your output. Here's a simple way using a data step:
Assuming your dataset has a standardized variable called 'test':
Data adjusted;
set df;
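if test > 3 then test=3;
/*Missing values sort below every number in SAS, so exclude them explicitly*/
else if not missing(test) and test < -3 then test=-3;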
run;
Just remember that your new dataset will no longer have a mean of 0 and a standard deviation of 1.
No sample data was provided, so I generated 5 random variables with an N(0,2) distribution to demonstrate handling outliers relative to N(0,1).
If you have multiple columns to cap, you could create a macro or just loop through an array.
DATA have;
INPUT var1 var2 var3 var4 var5;
DATALINES;
-0.8458048655231136 -2.1737985573160485 -2.122482432573275 1.8746296707673902 -2.799009287469253
-1.9927731684115295 1.8230096873238637 0.5964656531490122 -1.6465532407305106 3.9430012045284184
0.0294083016125659 1.3877418982525658 -1.3398372120124733 -0.8195179339297752 4.742490300459201
-0.5215716306745832 -3.35412129416837 1.1558155344985737 -1.0073681302151822 2.425914724408619
-2.817574234024364 3.5161858163738424 -2.1822454739704744 0.060674570200235534 0.25898913069677443
-3.941905381717187 4.969013776451821 2.021891632999466 -1.1526212617289868 1.2864391876960568
;
run;
* variable of all columns to remove outliers from ;
%LET column_names=var1 var2 var3 var4 var5;
DATA want;
SET have;
ARRAY columns {*} &column_names.;
DO i=1 to dim(columns);
if columns[i]>3 then columns[i]=3;
/*Missing values compare as less than any number, so skip them*/
else if not missing(columns[i]) and columns[i]<-3 then columns[i]=-3;
END;
DROP i;
RUN;
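If the goal is to cap every numeric variable without listing names, one possible variation is to build the array from the _NUMERIC_ variable list. The sketch below uses sashelp.class purely as stand-in data; the dataset names std_class and capped and the index variable _i are assumptions for illustration. Note that _NUMERIC_ would also pick up numeric ID or key columns, which you may want to drop or exclude first.

```sas
/* Standardise all numeric variables (no VAR statement = all numerics) */
proc standard data=sashelp.class mean=0 std=1 out=std_class;
run;

data capped;
  set std_class;
  /* _NUMERIC_ covers the numeric variables defined so far,      */
  /* so the loop index _i created below is not part of the array */
  array nums {*} _numeric_;
  do _i = 1 to dim(nums);
    if nums[_i] > 3 then nums[_i] = 3;
    /* guard against missing values, which compare below -3 */
    else if not missing(nums[_i]) and nums[_i] < -3 then nums[_i] = -3;
  end;
  drop _i;
run;
```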

How can I create pivot table in SAS?

I have three columns in a dataset: spend, age_bucket, and multiplier. The data looks something like...
spend age_bucket multiplier
10 18-24 2x
120 18-24 2x
1 35-54 3x
I'd like a dataset with the columns as the age buckets, the rows as the multipliers, and the entries as the sum (or other aggregate function) of the spend column. Is there a proc to do this? Can I accomplish it easily using proc SQL?
There are a few ways to do this.
data have;
input spend age_bucket $ multiplier $;
datalines;
10 18-24 2x
120 18-24 2x
1 35-54 3x
10 35-54 2x
;
proc summary data=have;
var spend;
class age_bucket multiplier;
output out=temp sum=;
run;
First you can use PROC SUMMARY to calculate the aggregation, sum in this case, for the variable in question. The CLASS statement gives you things to sum by. This will calculate the N-Way sums and the output data set will contain them all. Run the code and look at data set temp.
Next you can use PROC TRANSPOSE to pivot the table. We need to use a BY statement so a PROC SORT is necessary. I also filter to the aggregations you care about.
proc sort data=temp(where=(_type_=3));
by multiplier;
run;
proc transpose data=temp out=want(drop=_name_);
by multiplier;
var spend;
id age_bucket;
idlabel age_bucket;
run;
In traditional mode 35-54 is not a valid SAS variable name. SAS will convert your columns to proper names. The label on the variable will retain the original value. Just be aware if you need to reference the variable later, the name has changed to be valid.
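On the PROC SQL part of the question: a conditional aggregate per bucket can produce the same layout, at the cost of hard-coding the bucket values. The following is a sketch under that assumption; the column names age_18_24 and age_35_54 and the table name want_sql are invented here.

```sas
data have;
  input spend age_bucket $ multiplier $;
  datalines;
10 18-24 2x
120 18-24 2x
1 35-54 3x
10 35-54 2x
;

proc sql;
  create table want_sql as
  select multiplier,
         /* one conditional sum per known bucket value */
         sum(case when age_bucket = '18-24' then spend else . end) as age_18_24,
         sum(case when age_bucket = '35-54' then spend else . end) as age_35_54
  from have
  group by multiplier;
quit;
```

With this sample data the 2x row should hold 130 and 10, and the 3x row a missing value and 1. The PROC TRANSPOSE route is more convenient when the set of buckets is not known in advance.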

SAS Find Top Combinations in Dataset

Hello everyone --
I have some sales data which looks like this:
data have;
input order_id item $;
cards;
1 A
1 B
2 A
2 C
3 B
4 A
4 B
;
run;
What I'm trying to find out is what are the most popular combinations of items ordered. For example in the above case, there were 2 orders that contained items A&B, 1 order of A&C, and 1 order of B. What would be the best way to output the different combinations along with the numbers of orders placed?
Assuming the order of items within a combination doesn't matter (no permutation issue), you could try this:
proc sort data=have;
by order_id item;
run;
data temp;
set have;
by order_id;
retain comb;
length comb $4;
comb=cats(comb,item);
if last.order_id then do;
output;
call missing(comb);
end;
run;
proc freq data=temp;
table comb/norow nopercent nocol nocum;
run;
There are many possible approaches to this problem, and I would not presume to say which is the best. Here's a fairly simple method you could use:
Transpose your data so that you only have 1 row for each order, with an indicator variable for each product.
Feed the transposed dataset into proc corr to produce a correlation matrix for the indicator variables, and look for the strongest correlations.
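The two steps above could look roughly like this. It is a sketch, not a tested solution for real data: the names flagged, wide and flag and the item_ prefix are assumptions, and it presumes each item appears at most once per order (PROC TRANSPOSE rejects duplicate ID values within a BY group).

```sas
data have;
  input order_id item $;
  cards;
1 A
1 B
2 A
2 C
3 B
4 A
4 B
;
run;

/* Add a constant flag to transpose into indicator columns */
data flagged;
  set have;
  flag = 1;
run;

proc sort data=flagged;
  by order_id;
run;

/* One row per order, one column per item (item_A, item_B, ...) */
proc transpose data=flagged out=wide(drop=_name_) prefix=item_;
  by order_id;
  id item;
  var flag;
run;

/* Items not in an order come out missing; turn them into 0 */
data wide;
  set wide;
  array f {*} item_:;
  do _i = 1 to dim(f);
    if missing(f[_i]) then f[_i] = 0;
  end;
  drop _i;
run;

/* Correlation matrix of the indicator variables */
proc corr data=wide;
  var item_:;
run;
```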

replicating a sql function in sas datastep

Hi, another quick question.
In PROC SQL, the ON clause lets us join on a condition. Is there something similar in a SAS data step?
For example:
proc sql;
....
data1 left join data2
on first<value<last
quit;
Can we replicate this in a SAS data step? Something like:
data work.combined;
set data1(in=a) data2(in=b);
if a then output;
run;
You can also reproduce an SQL join in a single DATA step using hash objects. It can be really fast, but it depends on how much RAM your machine has, since this method loads one table into memory: the more RAM, the larger the dataset you can load into the hash. This method is particularly effective for look-ups in a relatively small reference table.
data have1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data have2;
input value;
datalines;
2
5
6
7
;
run;
data want;
if _N_=1 then do;
if 0 then set have2;
declare hash h(dataset:'have2');
h.defineKey('value');
h.defineData('value');
h.defineDone();
declare hiter hi('h');
end;
set have1;
rc=hi.first();
do while(rc=0);
if first<value<last then output;
rc=hi.next();
end;
drop rc;
run;
The result:
value first last
2 1 3
5 4 7
6 4 7
7 6 9
Yes, there is a simple (but subtle) way in just 7 lines of code.
What you intend to achieve is intrinsically a conditional Cartesian join, which can be done with a SET statement inside a DO loop. The following code uses the test dataset from Dmitry and a modified version of the code in the appendix of SUGI Paper 249-30.
data data1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data data2;
input value;
datalines;
2
5
6
7
;
run;
/***** by data step looped SET *****/
DATA CART_data;
SET data1;
DO i=1 TO NN; /*NN can be referenced before set*/
SET data2 point=i nobs=NN; /*point=i - random access*/
if first<value<last then OUTPUT; /*conditional output*/
END;
RUN;
/***** by SQL *****/
proc sql;
create table cart_SQL as
select * from data1
left join data2
on first<value<last;
quit;
One can easily see that the results coincide.
Also note that from SAS 9.2 documentation: "At compilation time, SAS reads the descriptor portion of each data set and assigns the value of the NOBS= variable automatically. Thus, you CAN refer to the NOBS= variable BEFORE the SET statement. The variable is available in the DATA step but is not added to any output data set."
There isn't a direct way to do this with a MERGE. This is one example where the SQL method is clearly superior to any SAS data step methods, as anything you do will take much more code and possibly more time.
However, depending on the data, it's possible a few approaches may make sense. In particular, the format merge.
If data1 is fairly small (even, say, millions of records), you can make a format out of it. Like so:
data fmt_set;
set data1;
format label $8.;
start=first; *set up the names correctly;
end=last;
label='MATCH';
fmtname='DATA1F';
output;
if _n_=1 then do; *put out a hlo='o' line which is for unmatched lines;
start=.; *both unnecessary but nice for clarity;
end=.;
label='NOMATCH';
hlo='o';
output;
end;
run;
proc format cntlin=fmt_set; *import the dataset;
quit;
data want;
set data2;
if put(value,DATA1F.)="MATCH";
run;
This is very fast to run, unless data1 is extremely large (hundreds of millions of rows, on my system) - faster than a data step merge, if you include sort time, since this doesn't require a sort. One major limitation is that this will only give you one row per data2 row; if that is what is desired, then this will work. If you want repeats of data2 then you can't do it this way.
If data1 may have overlapping rows (ie, two rows where start/end overlap each other), you also will need to address this, since start/end aren't allowed to overlap normally. You can set hlo="m" for every row, and "om" for the non-match row, or you can resolve the overlaps.
I'd still do the sql join, however, since it's much shorter to code and much easier to read, unless you have performance issues, or it doesn't work the way you want it to.
Here's another solution, using a temporary array to hold the lookup dataset. Performance is probably similar to Dmitry's hash-based solution, but this should also work for people still using versions of SAS prior to 9.1 (i.e. when hash objects were first introduced).
I've reused Dmitry's sample datasets:
data have1;
input first last;
datalines;
1 3
4 7
6 9
;
run;
data have2;
input value;
datalines;
2
5
6
7
;
run;
/*We need a macro var with the number of obs in the lookup dataset*/
/*This is so we can specify the dimension for the array to hold it*/
data _null_;
if 0 then set have2 nobs = nobs;
call symput('have2_nobs',put(nobs,8.));
stop;
run;
data want_temparray;
array v{&have2_nobs} _temporary_;
do _n_ = 1 to &have2_nobs;
set have2 (rename=(value=value_array));
v{_n_}=value_array;
end;
do _n_ = 1 by 1 until (eof_have1);
set have1 end = eof_have1;
value=.;
do i=1 to &have2_nobs;
if first < v{i} < last then do;
value=v{i};
output;
end;
end;
if missing(value) then output;
end;
drop i value_array;
run;
Output:
value first last
2 1 3
5 4 7
6 4 7
7 6 9
This matches the output from the equivalent SQL:
proc sql;
create table want_sql as
select * from
have1 left join have2
on first<value<last
;
quit;