Splitting a SAS dataset into mutually exclusive chunks by row in parallel? - sas

There is an SO question on splitting a large dataset into smaller ones, but with the advent of proc ds2 there must be a way to do this using threads?
I have written the data step below to split a dataset into &chunks. chunks. I tried to write the same in proc ds2 but it just fails. I am quite new to proc ds2, so a simple explanation aimed at someone with a good understanding of the data step would be ideal.
Data step code
%macro output_chunks(in, out, by, chunks);
data %do i = 1 %to &chunks.;
&out.&i.(compress=char drop = i)
%end;
;
set &in.;
by &by.;
retain i 0;
if first.&by. then do;
i = i + 1;
if i = &chunks.+1 then i = 1;
end;
%do i = 1 %to &chunks.;
if i = &i. then do;
output &out.&i.;
end;
%end;
run;
%mend;
proc ds2 code
proc ds2;
thread split/overwrite=yes;
method run();
set in_data;
thisThread=_threadid_;
/* can make below into macro but I can't seem to get it to work */
if thisThread = 1 then do;
output ds1;
end;
else if thisThread = 2 then do;
output ds2;
end;
end;
method term();
put '**Thread' _threadid_ 'processed' count 'rows:';
end;
endthread;
run;
quit;

So, you're right in one sense that DS/2 may be helpful here. However, I suspect it's a bit more complicated.
DS/2 will happily thread data steps, but what is going to be more challenging is writing to several different datasets. That's because there's not a great way to structure the output dataset name without using the macro language, which won't play well with the threading as far as I can tell (though I'm no expert here).
Here's an example of it using threading:
PROC DS2;
thread in_thread/overwrite=yes;
dcl bigint count;
drop count;
method init();
count=0;
end;
method run();
set in_data;
count+1;
output;
end;
method term();
put 'Thread' _threadid_ ' processed' count 'observations.';
end;
endthread;
run;
data out_data/overwrite=yes;
dcl thread in_thread t_in; /* instance of the thread */
method run();
set from t_in threads=4;
output;
end;
enddata;
run;
quit;
But this just writes one dataset out, and if you change threads=4 to 1, it doesn't actually take any longer. Both are okay speed-wise, though actually slower than the regular data step (about 1.8x the runtime for me). DS/2 uses a much, much slower method under the hood than SAS's base data step when accessing SAS datasets; DS/2's speed gains really come into play when you're working in RDBMSs via SQL or similar.
However, there's no good way to drive the output in parallel. Here's the version of the above turned into making 4 datasets. Notice that the actual selection of where to output is in the main, non-threaded data step...
PROC DS2;
thread in_thread/overwrite=yes;
dcl bigint count;
dcl bigint thisThread;
drop count;
method init();
count=0;
end;
method run();
set in_data;
count+1;
thisThread = _threadid_;
output;
end;
method term();
put 'Thread' _threadid_ ' processed' count 'observations.';
end;
endthread;
run;
data a b c d/overwrite=yes;
dcl thread in_thread t_in; /* instance of the thread */
method run();
set from t_in threads=4;
select(thisThread);
when (1) output a;
when (2) output b;
when (3) output c;
when (4) output d;
otherwise;
end;
end;
enddata;
run;
quit;
So it's actually quite a lot slower than in the non-threaded version. Oops!
Really, your issue here is that disk i/o is the main problem, not CPU. Your CPU does virtually no work here. DS/2 might be able to help in some edge cases where you have a really fast SAN that allows tons of simultaneous writes, but ultimately it takes X amount of time to read those million records and same X amount of time to write a million records, based on your i/o constraint, and odds are parallelizing that won't help.
Hash tables will add a lot more, I suspect, and could certainly be used here with DS/2; see my new answer on the other question linked in the OP for the data step version. DS/2 probably won't make that solution any faster, more likely slower; but you could implement roughly the same thing in DS/2 if you wanted, and then the sub-thread would be able to output on its own without involving the master thread.
Where DS/2 would be helpful would be if you're doing this in Teradata or something, where you can use SAS's in-database accelerator to execute this code database-side. That would make things a lot more efficient. Then you could use something similar to my code above, or better yet a hash solution.

Here is an example of a user-defined DS2 package that splits a data set using a hash-of-hashes (HoH) method. The big downside is the inability to name the datasets by the key values without a LOT of fudgery, due to the very limited utility of variable lists in DS2; as a result I opt for a simpler naming convention:
data cars;
set sashelp.cars;
run;
proc ds2;
package hashSplit / overwrite=yes;
declare package hash h ();
declare package hash hs ();
declare package hiter hi;
/**
* create a child multidata hash object
*/
private method mHashSub(varlist k, varlist d) returns package hash;
hs = _new_ [this] hash();
hs.keys(k);
hs.data(d);
hs.multidata('Y');
hs.defineDone();
return hs;
end;
/**
* constructor, create the parent and child hash objects
*/
method hashSplit(varlist k);
h = _new_ [this] hash();
h.keys(k);
h.definedata('hs');
h.defineDone();
end;
/**
* adds key values to parent hash, if necessary
* adds key values and data values to child hash
* consolidates the FIND, ADD and nested ADD methods
*/
method add(varlist k, varlist d);
declare double rc;
rc = h.find();
if rc ^= 0 then do;
hs = mHashSub(k, d);
h.add();
end;
hs.add();
end;
/**
* outputs the child hashes to data sets with a fixed naming convention
*
* SAS needs to add more support for using variable lists with functions/methods besides hash
*/
method output();
declare double rc;
declare int i;
hi = _new_ hiter('h');
rc = hi.first();
do i = 1 to h.num_items by 1 while (rc = 0);
hs.output(catx('_', 'hashSplit', i));
rc = hi.next();
end;
end;
endpackage;
run;
quit;
/**
* example of using the hashSplit package
*/
proc ds2;
data _null_;
varlist k [origin];
varlist d [_all_];
declare package hashSplit split(k);
method run();
set cars;
split.add(k, d);
end;
method term();
split.output();
end;
enddata;
run;
quit;

Related

Matching SAS character variables to a list

So I have a vector of search terms, and my main data set. My goal is to create an indicator for each observation in my main data set where variable1 includes at least one of the search terms. Both the search terms and variable1 are character variables.
Currently, I am trying to use a macro to iterate through the search terms, and for each search term, indicate if it is in the variable1. I do not care which search term triggered the match, I just care that there was a match (hence I only need 1 indicator variable at the end).
I am a novice when it comes to using SAS macros and loops, but I have tried searching and piecing together code from some online sites. Unfortunately, when I run it, it does nothing, not even give me an error.
I have put the code I am trying to run below.
*for example, I am just testing on one of the SASHELP data sets;
*I take the first five team names to create a search list;
data terms; set sashelp.baseball (obs=5);
search_term = substr(team,1,3);
keep search_term;
run;
*I will be searching through the baseball data set;
data test; set sashelp.baseball;
run;
%macro search;
%local i name_list next_name;
proc SQL;
select distinct search_term into : name_list separated by ' ' from work.terms;
quit;
%let i=1;
%do %while (%scan(&name_list, &i) ne );
%let next_name = %scan(&name_list, &i);
*I think one of my issues is here. I try to loop through the list, and use the find command to find the next_name and if it is in the variable, then I should get a non-zero value returned;
data test; set test;
indicator = index(team,&next_name);
run;
%let i = %eval(&i + 1);
%end;
%mend;
Thanks
Here's the temporary array solution which is fully data driven.
Store the number of terms in a macro variable to assign the length of arrays
Load terms to search into a temporary array
Loop through for each word and search the terms
Exit loop if you find the term to help speed up the process
/*1*/
proc sql noprint;
select count(*) into :num_search_terms from terms;
quit;
%put &num_search_terms.;
data flagged;
*declare array;
array _search(&num_search_terms.) $ _temporary_;
/*2*/
*load array into memory;
if _n_ = 1 then do j=1 to &num_search_terms.;
set terms;
_search(j) = search_term;
end;
set test;
*set flag to 0 for initial start;
flag = 0;
/*3*/
*loop through and create flag;
do i=1 to &num_search_terms. while(flag=0); /*4*/
if find(team, _search(i), 'it')>0 then flag=1;
end;
drop i j search_term ;
run;
Not sure I totally understand what you are trying to do but if you want to add a new binary variable that indicates if any of the substrings are found just use code like:
data want;
set have;
indicator = index(term,'string1') or index(term,'string2')
... or index(term,'string27') ;
run;
Not sure what a "vector" would be, but if you had the list of terms in a dataset you could easily generate that code from the data, and then use %include to add it to your program.
filename code temp;
data _null_;
set term_list end=eof;
file code ;
if _n_ =1 then put 'indicator=' @;
else put ' or ' @;
put 'index(term,' string :$quote. ')' @;
if eof then put ';' ;
run;
data want;
set have;
%include code / source2;
run;
If you did want to think about creating a macro to generate code like that then the parameters to the macro might be the two input dataset names, the two input variable names and the output variable name.

Creating a dataset with the unique values of indexed variable

I have a dataset (LRG_DS) with about 74,000,000 observations. The dataset has been indexed by a variable (I_VAR1) that has about 7500 unique values. I've discovered this by running a proc contents on the dataset.
I'd like to create a dataset (TEMP) that contains just the 7500 unique values of the index variable.
I've tried the following:
data TEMP;
set LRG_DS (keep = I_VAR1);
by I_VAR1;
if first.I_VAR1;
run;
and
proc sort data = LRG_DS nodupkey out = TEMP (keep = I_VAR1);
by I_VAR1;
run;
The first approach takes about 46 seconds and the second takes about 55 seconds.
I've read that the sas7bndx file is not intended to be examined in isolation, but rather is a file used to speed up some of the procedures performed using the index variable.
Any help is much appreciated!
YMMV, but populating an empty hash table with the unique key values may perform better than a sort.
Create some example data:
data x;
do cnt=1 to 10*100000;
var=round(rand('uniform'),0.001);
do cnt2=1 to 10;
output;
end;
drop cnt2;
end;
run;
Test speed with a proc sort:
proc sort data=x(keep=var) out=sorted nodupkey;
by var;
run;
Compare with the hash table version:
data _null_;
set x(keep=var) end=eof;
if _n_ eq 1 then do;
declare hash ht ();
rc = ht.DefineKey ('var');
rc = ht.DefineDone ();
end;
if ht.check() ne 0 then do;
rc = ht.add();
end;
if eof then do;
ht.output(dataset:"ids");
end;
run;
From my very brief tests, I found that the hash table version starts to perform worse as the number of unique values grows. It may be possible to offset this by dimensioning the hash appropriately beforehand but I didn't test.
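To illustrate the dimensioning idea mentioned above: the hash object's constructor takes a hashexp argument that pre-sizes the table to 2**hashexp buckets (the default is 8, i.e. 256 buckets). A minimal, untested sketch of the same step with a larger table (the value 20 here is an arbitrary example, not a recommendation):
data _null_;
set x(keep=var) end=eof;
if _n_ eq 1 then do;
/* hashexp:20 allocates 2**20 buckets up front; pick a value
based on the expected number of unique keys */
declare hash ht (hashexp:20);
rc = ht.DefineKey ('var');
rc = ht.DefineDone ();
end;
if ht.check() ne 0 then rc = ht.add();
if eof then ht.output(dataset:"ids");
run;
Whether this actually offsets the degradation with many unique values would need to be benchmarked, as noted above.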

PROC DS2 performance issues

I was trying to use proc ds2 to try and get some performance increases over the normal data step by using the multithreaded capability.
fred.testdata is a SPDE dataset containing 5 million observations. My code is below:
proc ds2;
thread home_claims_thread / overwrite = yes;
/*declare char(10) producttype;
declare char(12) wrknat_clmtype;
declare char(7) claimtypedet;
declare char(1) event_flag;*/
/*declare date week_ending having format date9.;*/
method run();
/*declare char(7) _week_ending;*/
set fred.testdata;
if claim = 'X' then claimtypedet= 'ABC';
else if claim = 'Y' then claimtypedet= 'DEF';
/*_week_ending = COMPRESS(exposmth,'M');
week_ending = to_date(substr(_week_ending,1,4) || '-' || substr(_week_ending,5,2) || '-01');*/
end;
endthread;
data home_claims / overwrite = yes;
declare thread home_claims_thread t;
method run();
set from t threads=8;
end;
enddata;
run;
quit;
I didn't include all the IF statements, only a few, otherwise it would have taken up a few pages (you should get the idea, hopefully). As the code currently stands it works quite a fair bit faster than the normal data step; however, significant performance issues arise when any of the following happens:
I uncomment any of the declare statements
I include any numeric variables in fred.testdata (even without performing any calculations on the numeric variables)
My questions are:
Is there any way to introduce numeric variables into fred.testdata without significant slowdowns that make DS2 far slower than the normal data step? (For this small table of 5 million rows, including numeric column/s, the real time is about 1 min 30 s for DS2 versus 20 seconds for the normal data step; the actual full table is closer to 600 million rows.) For example, I would like to be able to do that week_ending conversion without it introducing a 5x performance penalty in run times. Run time for DS2 WITHOUT the declare statements and numeric variables is about 7 seconds.
Is there any way to compress the table in ds2 without having to do an additional data step to compress it?
Thank you
Two methods to try: using proc hpds2 to have SAS handle parallel execution, or a more manual approach. Note that it is impossible to always preserve order with either of these methods.
Method 1: PROC HPDS2
HPDS2 is a way of performing massive parallel processing of data. In single-machine mode, it will make parallel runs per core, then put the data all back together. You only need to make a few slight modifications to your code in order to run it.
HPDS2 has a setup where you declare your input and output data in the data= and out= options of the proc statement. Your data and set statements will always use the following syntax:
data DS2GTF.out;
method run();
set DS2GTF.in;
<code>;
end;
enddata;
Knowing that, we can modify your code to run on HPDS2:
proc hpds2 data=fred.test_data
out=home_claims;
data DS2GTF.out;
/*declare char(10) producttype;
declare char(12) wrknat_clmtype;
declare char(7) claimtypedet;
declare char(1) event_flag;*/
/*declare date week_ending having format date9.;*/
method run();
/*declare char(7) _week_ending;*/
set DS2GTF.in;
if claim = 'X' then claimtypedet= 'ABC';
else if claim = 'Y' then claimtypedet= 'DEF';
/*_week_ending = COMPRESS(exposmth,'M');
week_ending = to_date(substr(_week_ending,1,4) || '-' || substr(_week_ending,5,2) || '-01');*/
end;
enddata;
run;
quit;
Method 2: Split the data using rsubmit and append
The below code makes use of rsubmit and direct observation access to read the data in chunks, then append them all together at the end. This one can work especially well if you have your data set up for block I/O.
options sascmd='!sascmd'
autosignon=yes
noconnectwait
noconnectpersist
;
%let cpucount = %sysfunc(getoption(cpucount));
%macro parallel_execute(data=, out=, threads=&cpucount);
/* Get total obs from data */
%let dsid = %sysfunc(open(&data.));
%let n = %sysfunc(attrn(&dsid., nlobs));
%let rc = %sysfunc(close(&dsid.));
/* Run &threads rsubmit sessions */
%do i = 1 %to &threads;
/* Determine the records that each worker will read */
%let firstobs = %sysevalf(&n.-(&n./&threads.)*(&threads.-&i+1)+1, floor);
%let lastobs = %sysevalf(&n.-(&n./&threads.)*(&threads.-&i.), floor);
/* Get this session's work directory */
%let workdir = %sysfunc(getoption(work));
/* Send all macro variables to the remote session, and simultaneously start the remote session */
%syslput _USER_ / remote=worker&i.;
/* Check for an input libname */
%if(%scan(&data., 2, .) NE) %then %do;
%let inlib = %scan(&data., 1, .);
%let indsn = %scan(&data., 2, .);
%end;
%else %do;
%let inlib = workdir;
%let indsn = &data.;
%end;
/* Check for an output libname */
%if(%scan(&out., 2, .) NE) %then %do;
%let outlib = %scan(&out., 1, .);
%let outdsn = %scan(&out., 2, .);
%end;
%else %do;
%let outlib = workdir;
%let outdsn = &out.;
%end;
/* Run code on remote session &i */
rsubmit remote=worker&i. inheritlib=(&inlib.);
libname workdir "&workdir.";
data workdir._&outdsn._&i.;
set &inlib..&indsn.(firstobs=&firstobs. obs=&lastobs.);
/* <PUT CODE HERE>;*/
run;
endrsubmit;
%end;
/* Wait for everything to complete */
waitfor _ALL_;
/* Append all of the chunks together */
proc datasets nolist;
delete &out.;
%do i = 1 %to &threads.;
append base=&out.
data=_&outdsn._&i.
force
;
%end;
/* Optional: remove all temporary data */
/* delete _&outdsn._:;*/
quit;
libname workdir clear;
%mend;
You can test its functionality with the below code:
data pricedata;
set sashelp.pricedata;
run;
%parallel_execute(data=pricedata, out=test, threads=3);
If you look at the temporary files in your WORK directory, you'll see that it evenly split up the dataset among the 3 parallel processes, and that it adds up to the original total.
_test_1 = 340
_test_2 = 340
_test_3 = 340
TOTAL = 1020
pricedata = 1020

SAS - repeating a data step to solve for a value

Is it possible to repeat a data step a number of times (like you might in a %do-%while loop) where the number of repetitions depends on the result of the data step?
I have a data set with a numeric variable A. I calculate a new variable result = min(1, A). I would like the average value of result to equal a target, and I can get there by scaling variable A by a constant k. That is, solve for k where target = average(min(1, A*k)), where k and target are constants and A is a list.
Here is what I have so far:
filename f0 'C:\Data\numbers.csv';
filename f1 'C:\Data\target.csv';
data myDataSet;
infile f0 dsd dlm=',' missover firstobs=2;
input A;
init_A = A; /* store the initial value of A */
run;
/* read in the target value (1 observation) */
data targets;
infile f1 dsd dlm=',' missover firstobs=2;
input target;
K = 1; * initialise the constant K;
run;
%macro iteration; /* I need to repeat this macro a number of times */
data myDataSet;
retain key 1;
set myDataSet;
set targets point=key;
A = INIT_A * K; /* update the value of A */
result = min(1, A);
run;
/* calculate average result */
proc sql;
create table estimate as
select avg(result) as estimate0
from myDataSet;
quit;
/* compare estimate0 to target and update K */
data targets;
set targets;
set estimate;
K = K * (target / estimate0);
run;
%mend iteration;
I can get the desired answer by running %iteration a few times, but ideally I would like to run the iteration until (target - estimate0 < 0.01). Is such a thing possible?
Thanks!
I had a similar problem just the other day. The below approach is what I used; you will need to change the loop structure from a for loop to a do-while loop (or whatever suits your purposes):
First perform an initial scan of the table to figure out your loop termination conditions and get the number of rows in the table:
data read_once;
set sashelp.class end=eof;
if eof then do;
call symput('number_of_obs', cats(_n_) );
call symput('number_of_times_to_loop', cats(3) );
end;
run;
Make sure results are as expected:
%put &=number_of_obs;
%put &=number_of_times_to_loop;
Loop over the source table again multiple times:
data final;
do loop=1 to &number_of_times_to_loop;
do row=1 to &number_of_obs;
set sashelp.class point=row;
output;
end;
end;
stop; * REQUIRED BECAUSE WE ARE USING POINT=;
run;
Two part answer.
First, it's certainly possible to do what you describe. There are examples of code that works like this available online, if you want a working, useful example of iterative macros; for example, David Izrael's seminal %RAKINGE macro, which performs a rim-weighting procedure by iterating over a relatively simple process (PROC FREQs, basically). This is pretty similar to what you're doing. In the data step it checks the various termination criteria and outputs a macro variable that is the total number of criteria met (as each stratification variable separately needs to meet the termination criterion). It then checks with %if whether that criterion is satisfied, and terminates if so.
The core of this is two things. First, you should have a fixed maximum number of iterations, unless you like infinite loops. That number should be larger than the largest reasonable number you should ever need, often by around a factor of two. Second, you need convergence criteria such that you can terminate the loop if they're met.
For example:
data have;
x=5;
run;
%macro reduce(data=, var=, amount=, target=, iter=20);
data want;
set have;
run;
%let calc=.;
%let _i=0;
%do %until (&calc.=&target. or &_i.=&iter.);
%let _i = %eval(&_i.+1);
data want;
set want;
&var. = &var. - &amount.;
call symputx('calc',&var.);
run;
%end;
%if &calc.=&target. %then %do;
%put &var. reduced to &target. in &_i. iterations.;
%end;
%else %do;
%put &var. not reduced to &target. in &iter. iterations. Try a larger number.;
%end;
%mend reduce;
%reduce(data=have,var=x,amount=1,target=0);
That is a very simple example, but it has all of the same elements. I prefer to use do-until and increment on my own but you can do the opposite also (as %rakinge does). Sadly the macro language doesn't allow for do-by-until like the data step language does. Oh well.
Secondly, you can often do things like this inside a single data step. Even in older versions (9.2 etc.), you can do all of what you ask above in a single data step, though it might look a little clunky. In 9.3+, and particularly 9.4, there are ways to run that proc sql inside the data step and get the result back without waiting for another data step, using RUN_MACRO or DOSUBL and/or the FCMP language. Even something simple, like this:
data have;
initial_a=0.3;
a=0.3;
target=0.5;
output;
initial_a=0.6;
a=0.6;
output;
initial_a=0.8;
a=0.8;
output;
run;
data want;
k=1;
do iter=1 to 20 until (abs(target-estimate0) < 0.001);
do _n_ = 1 to nobs;
if _n_=1 then result_tot=0;
set have nobs=nobs point=_n_;
a=initial_a*k;
result=min(1,a);
result_tot+result;
end;
estimate0 = result_tot/nobs;
k = k * (target/estimate0);
end;
output;
stop;
run;
That does it all in one data step. I'm cheating a bit because I'm writing my own data step iterator, but that's fairly common in this sort of thing, and it is very fast. Macros iterating multiple data steps and proc sql steps will be much slower typically as there is some overhead from each one.

Macro returning a value

I created the following macro. Proc power returns table pw_out containing the column Power. The data _null_ step assigns the value in column Power of pw_out to macro variable tpw. I want the macro to return the value of tpw, so that in the main program I can call it in a DATA step like:
data test;
set tmp;
pw_tmp=ttest_power(meanA=a, stdA=s1, nA=n1, meanB=a2, stdB=s2, nB=n2);
run;
Here is the code of the macro:
%macro ttest_power(meanA=, stdA=, nA=, meanB=, stdB=, nB=);
proc power;
twosamplemeans test=diff_satt
groupmeans = &meanA | &meanB
groupstddevs = &stdA | &stdB
groupns = (&nA &nB)
power = .;
ods output Output=pw_out;
run;
data _null_;
set pw_out;
call symput('tpw'=&power);
run;
&tpw
%mend ttest_power;
@itzy is correct in pointing out why your approach won't work. But there is a solution maintaining the spirit of your approach: you need to create a power-calculation function using PROC FCMP. In fact, AFAIK, to call a procedure from within a function in PROC FCMP, you need to wrap the call in a macro, so you are almost there.
Here is your macro - slightly modified (mostly to fix the symput statement):
%macro ttest_power;
proc power;
twosamplemeans test=diff_satt
groupmeans = &meanA | &meanB
groupstddevs = &stdA | &stdB
groupns = (&nA &nB)
power = .;
ods output Output=pw_out;
run;
data _null_;
set pw_out;
call symput('tpw', power);
run;
%mend ttest_power;
Now we create a function that will call it:
proc fcmp outlib=work.funcs.test;
function ttest_power_fun(meanA, stdA, nA, meanB, stdB, nB);
rc = run_macro('ttest_power', meanA, stdA, nA, meanB, stdB, nB, tpw);
if rc = 0 then return(tpw);
else return(.);
endsub;
run;
And finally, we can try using this function in a data step:
options cmplib=work.funcs;
data test;
input a s1 n1 a2 s2 n2;
pw_tmp=ttest_power_fun(a, s1, n1, a2, s2, n2);
cards;
0 1 10 0 1 10
0 1 10 1 1 10
;
run;
proc print data=test;
run;
You can't do what you're trying to do this way. Macros in SAS are a little different than in a typical programming language: they aren't subroutines that you can call, but rather just code that generates other SAS code that gets executed. Since you can't run proc power inside of a data step, you can't run this macro from a data step either. (Just imagine copying all the code inside the macro into the data step -- it wouldn't work. That's what a macro in SAS does.)
One way to do what you want would be to read each observation from tmp one at a time, and then run proc power. I would do something like this:
/* First count the observations */
data _null_;
call symputx('nobs',obs);
stop;
set tmp nobs=obs;
run;
/* Now read them one at a time in a macro and call proc power */
%macro power;
%do j=1 %to &nobs;
data _null_;
nrec = &j;
set tmp point=nrec;
call symputx('meanA',meanA);
call symputx('stdA',stdA);
call symputx('nA',nA);
call symputx('meanB',meanB);
call symputx('stdB',stdB);
call symputx('nB',nB);
stop;
run;
proc power;
twosamplemeans test=diff_satt
groupmeans = &meanA | &meanB
groupstddevs = &stdA | &stdB
groupns = (&nA &nB)
power = .;
ods output Output=pw_out;
run;
proc append base=pw_out_all data=pw_out; run;
%end;
%mend;
%power;
By using proc append you can store the results of each round of output.
I haven't checked this code so it might have a bug, but this approach will work.
You can invoke a macro which calls procedures, etc. (like the example) from within a data step using call execute(), but it can get a bit messy and difficult to debug.
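For completeness, a minimal sketch of that call execute() approach, assuming the keyword-parameter %ttest_power macro from the question and a tmp dataset with one scenario per row. Note the generated macro calls only run after the data step finishes, which is why results have to land in a dataset rather than a data step variable, and why debugging gets messy:
data _null_;
set tmp;
/* %nrstr defers macro execution until the generated
text is pushed onto the input stack after this step */
call execute(cats('%nrstr(%ttest_power)(meanA=', meanA,
',stdA=', stdA, ',nA=', nA,
',meanB=', meanB, ',stdB=', stdB,
',nB=', nB, ')'));
run;
The macro would still need to proc append its pw_out to a cumulative table, as in the answer above.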