How to iterate simoultaneusly throught two sets in SAS? - sas

I've tried something like this :
data wynik;
set dane;
if x>3 than x3=3*x;
else set dane2; x3=x2;set dane;
run;
dane and dane2 have the same number of rows
result is interesting - condition x>3 is still holding after setting dane2, but SAS always takes first observation - that is, it doesn't pass the current state of hidden loop counter. Make question is - does SAS have/use hidden loop with counter while iterating through dataset which could be accessed by user ?
editon :
mayby I should add in title - without expicit loops, but this would also be welcomed

Making some assumptions:
data dane;
do x = 1 to 5;
output;
end;
run;
data dane2;
do x2 = 5 to 1 by -1;
output;
end;
run;
data wynik;
merge dane dane2;
if x > 3 then x3=3*x;
else x3=x2;
put x3=;
run;
That uses the side-by-side merge (merge with no by statement) to get you both values at once.
To answer your followup question:
does SAS have/use hidden loop with counter while iterating through dataset which could be accessed by user ?
Yes, it does; _n_ defines the current loop iteration (as long as it isn't modified externally, which it can be - it is just a regular variable that's not written out to the dataset). So you could similarly do the following:
data wynik;
set dane;
if x > 3 then x3=x*3;
else do;
set dane2 point=_n_;
x3=x2;
end;
put x3=;
run;
The side-by-side merge is preferred because it will be faster, unless you very infrequently need to look at DANE2. It's also easier to code.

Related

Creating a dataset with the unique values of indexed variable

I have a dataset (LRG_DS) with about 74,000,000 observations. The dataset has been indexed by a variable (I_VAR1) that has about 7500 unique values. I've discovered this by running a proc contents on the dataset.
I'd like to create a dataset (TEMP)contains just the 7000 unique values of the index variable.
I've tried the following:
data TEMP;
set LRG_DS (keep = I_VAR1);
by I_VAR1;
if first.I_VAR1;
run;
and
proc sort data = LRG_DS nodupkey out = TEMP (keep = I_VAR1);
by I_VAR1;
run;
The first approach takes about 46 seconds and the second takes about 55 seconds.
I've read that the sas7bndx is file is not intended to be examined in isolation, but rather as a file to speed up the some of the procedures performed using the index variable.
Any help is much appreciated!
YMMV but using populating an empty hash table with the unique key values may perform better than a sort.
Create some example data:
data x;
do cnt=1 to 10*100000;
var=round(rand('uniform'),0.001);
do cnt2=1 to 10;
output;
end;
drop cnt2;
end;
run;
Test speed with a proc sort:
proc sort data=x(keep=var) out=sorted nodupkey;
by var;
run;
Compare with the hash table version:
data _null_;
set x(keep=var) end=eof;
if _n_ eq 1 then do;
declare hash ht ();
rc = ht.DefineKey ('var');
rc = ht.DefineDone ();
end;
if ht.check() ne 0 then do;
rc = ht.add();
end;
if eof then do;
ht.output(dataset:"ids");
end;
run;
From my very brief tests, I found that the hash table version starts to perform worse as the number of unique values grows. It may be possible to offset this by dimensioning the hash appropriately beforehand but I didn't test.

putting data into character variable sas

I'm facing the problem that I want to put data into a character variable.
So I have a long tranposed dataset where I have three variables: date( by which i transposed before hand) var (has three different outputs of my previous variables) and col1 (which includes the values of my previous variables).
Now i want to create a forth variable which has as well three different outputs. My problem is that I can create the variable put with my code it does always create missing value.
data pair2;
set data1;
if var="BNNESR" or var="BNNESR_r" or var="BNNESR_t" then output;
length all $ 20;
all=" ";
if var="BNNESR" then all="pdev";
if var="BNNESR_t" then all="trigger";
if var="BNNESR_r" then all="rdev";
drop var;
run;
Afterwards I want to tranpose it back by the "all" variable. I know i could just rename the old vars before I transpose it and then just keep them.
But the complete calculation will go on and actually will be turned into a macro where it is not that easy if would do it like that way.
Your program will just subset the input data and add a new variable that is empty because you are writing the data out before you assign any value to the new variable.
Use a subsetting IF (or WHERE) statement instead of using an explicit OUTPUT statement. Once your data step has an explicit OUTPUT statement then SAS no longer automatically writes the observation at the end of the data step iteration.
data pair2;
set data1;
if var="BNNESR" or var="BNNESR_r" or var="BNNESR_t" ;
length all $20;
if var="BNNESR" then all="pdev";
else if var="BNNESR_t" then all="trigger";
else if var="BNNESR_r" then all="rdev";
drop var;
run;
Since the list in the IF statement matches the values in the recode step then perhaps you want to just use a DELETE statement instead?
data pair2;
set data1;
length all $20;
if var="BNNESR" then all="pdev";
else if var="BNNESR_t" then all="trigger";
else if var="BNNESR_r" then all="rdev";
else delete;
drop var;
run;

SAS - Built a new table by calling a macro with a variable of another table passed as parameter

I have written this code to do this :
read records in the table "not_identified" one by one
for one record pass the "name_firstname" variable to a macro named "mCalcul_lev_D33",
then, the macro calculates the Levenstein between the variable passed as parameter and all the values of the variable "name_firstname_in_D33" in "data_all" table,
if the Levenstein returns a value less or equal to "3", then the record of "data_all" is copied to "lev_D33" table.
rsubmit;
%macro mCalcul_lev_D33(theName);
data result.lev_D33;
set result.data_all;
name_LEV=complev(&theName, name_firstname_in_D33);
if name_LEV<=3 then output;
run;
%mend mCalcul_lev_D33;
endrsubmit;
rsubmit;
data _null_;
set result.not_identified;
call execute ('%mCalcul_lev_D33('||name_firstname||')');
;
run;
endrsubmit;
There is 53700000 records in "data_all". The code is running since yesterday. Because I cannot see the result, I am asking :
Is the code doing what I want?
How coding if I want to write "name_firstname" (the variable passed like parameter) in the beginning of each record of "lev_D33"?
Thank you!
D.O.:
I posit your macros are making the task more difficult than need be. There appears to be an coding problem in that each row in not_identified record will cause the result.lev_D33 to be rebuilt. If your long running program ever does finish, the lev_D33 output data set will correspond to only the last not_identified.
You are doing full outer join comparing ALL_COUNT * NOT_IDENT_COUNT rows in the process.
How many rows are in not_identified ?Hopefully far less than data_all.
Is the result libname pointing to a network drive or remote server ?Networking i/o can make things run a very long time and even win you a phone call from the network team.
A full outer join in DATA Step can be done with nested loops and a point= on the inner loop SET. In DATA Step the outer loop is the implicit loop.
Consider this sample code:
data all_data;
do row = 1 to 100;
length name_firstname $20;
name_firstname
= repeat (byte(65 + mod(row,26)), 4*ranuni(123))
|| repeat(byte(65 + 26*ranuni(123)), 4*ranuni(123))
;
output;
end;
run;
data not_identified;
do row = 1 to 10;
length name_firstname $20;
name_firstname = repeat (byte(65 + mod(row,26)), 10*ranuni(123));
output;
end;
run;
data lev33;
set all_data;
do check_row = 1 to check_count;
set not_identified (keep=name_firstname rename=name_firstname=check_name)
nobs=check_count
point=check_row
;
name_lev = complev (check_name, name_firstname);
if name_lev <= 3 then output;
end;
run;
This approach tests each not_identified before moving to the next row. This is a useful method when the all_data is very large and you might want to process chunks of it at a time. Chunk processing is an appropriate place to start macro coding:
%macro do_chunk (FROM_OBS=, TO_OBS=);
data lev33_&FROM_OBS._&TO_OBS;
set all_data (firstobs=&FROM_OBS obs=&TO_OBS);
do check_row = 1 to check_count;
set not_identified (keep=name_firstname rename=name_firstname=check_name)
nobs=check_count
point=check_row
;
name_lev = complev (check_name, name_firstname);
if name_lev <= 3 then output;
end;
run;
%mend;
%macro do_chunks;
%local index;
%do index = 1 %to 100 %by 10;
%do_chunk ( FROM_OBS=&index, TO_OBS=%eval(&index+9) )
%end;
%mend;
%do_chunks
You might shepherd the whole the process, bypassing do_chunks and manually invoking do_chunk for various ranges of your choosing.
Thanks to #Richard. I have used your second example to write this code :
rsubmit;
data result.lev_D33;
set result.not_identified (firstobs=1 obs=10);
do check_row = 1 to 1000000;
set &lib..data_all (firstobs=1 obs=1000000) point=check_row;
name_lev = complev (name_firstname, name_firstname_D3);
if name_lev <= 3 then output;
end;
run;
endrsubmit ;
And it worked like I wanted.
In this example, I compare name_firstname in not_identified table to all name_firstname_D3 in data_all. If the COMPLEV is less or equal to 3, then the merge of the 2 records are in the result table "lev_D33" (one record from not_identified is merged to one record from data_all).
To do a test, I taked 10 records from not_identified and tried to find a concordance of the names and the firstnames in 1000000 data_all only.

Sas : partition a data given a curved variable

I have a database with serveral variables, including one, RIF, that hase an x^2 shape relative to another variable, Y.
I want to obtain two seperate databases, separated based on whether the observation is on the decreasing or the increasing part of the curve.
I thought I had something by using the lag function, but my code does not work.
proc sort data=have; by y; run;
data want;
set have;
do while (rif<=lag(rif));
Part=1;
end;
if Part ne 1 then Part=2
run;
And the separating given Part, but it seems to create infintite loop.
Is there a mistake in my code / is there a better way of doing this
data have;
do x = -10 to 10 by 1;
y = x**2;
output;
end;
run;
data want;
set have;
lag_y = lag(y);
if _n_ = 1 then Part=.;
else if y <= lag_y then Part=1;
else Part=2;
drop lag_y;
run;

SAS - repeating a data step to solve for a value

Is it possible to repeat a data step a number of times (like you might in a %do-%while loop) where the number of repetitions depends on the result of the data step?
I have a data set with numeric variables A. I calculate a new variable result = min(1, A). I would like the average value of result to equal a target and I can get there by scaling variable A by a constant k. That is solve for k where target = average(min(1,A*k)) - where k and target are constants and A is a list.
Here is what I have so far:
filename f0 'C:\Data\numbers.csv';
filename f1 'C:\Data\target.csv';
data myDataSet;
infile f0 dsd dlm=',' missover firstobs=2;
input A;
init_A = A; /* store the initial value of A */
run;
/* read in the target value (1 observation) */
data targets;
infile f1 dsd dlm=',' missover firstobs=2;
input target;
K = 1; * initialise the constant K;
run;
%macro iteration; /* I need to repeat this macro a number of times */
data myDataSet;
retain key 1;
set myDataSet;
set targets point=key;
A = INIT_A * K; /* update the value of A /*
result = min(1, A);
run;
/* calculate average result */
proc sql;
create table estimate as
select avg(result) as estimate0
from myDataSet;
quit;
/* compare estimate0 to target and update K */
data targets;
set targets;
set estimate;
K = K * (target / estimate0);
run;
%mend iteration;
I can get the desired answer by running %iteration a few times, but Ideally I would like to run the iteration until (target - estimate0 < 0.01). Is such a thing possible?
Thanks!
I had a similar problem to this just the other day. The below approach is what I used, you will need to change the loop structure from a for loop to a do while loop (or whatever suits your purposes):
First perform an initial scan of the table to figure out your loop termination conditions and get the number of rows in the table:
data read_once;
set sashelp.class end=eof;
if eof then do;
call symput('number_of_obs', cats(_n_) );
call symput('number_of_times_to_loop', cats(3) );
end;
run;
Make sure results are as expected:
%put &=number_of_obs;
%put &=number_of_times_to_loop;
Loop over the source table again multiple times:
data final;
do loop=1 to &number_of_times_to_loop;
do row=1 to &number_of_obs;
set sashelp.class point=row;
output;
end;
end;
stop; * REQUIRED BECAUSE WE ARE USING POINT=;
run;
Two part answer.
First, it's certainly possible to do what you say. There are some examples of code that works like this available online, if you want a working, useful-code example of iterative macros; for example, David Izrael's seminal Rakinge macro, which performs a rimweighting procedure by iterating over a relatively simple process (proc freqs, basically). This is pretty similar to what you're doing. In the process it looks in the datastep at the various termination criteria, and outputs a macro variable that is the total number of criteria met (as each stratification variable separately needs to meet the termination criterion). It then checks %if that criterion is met, and terminates if so.
The core of this is two things. First, you should have a fixed maximum number of iterations, unless you like infinite loops. That number should be larger than the largest reasonable number you should ever need, often by around a factor of two. Second, you need convergence criteria such that you can terminate the loop if they're met.
For example:
data have;
x=5;
run;
%macro reduce(data=, var=, amount=, target=, iter=20);
data want;
set have;
run;
%let calc=.;
%let _i=0;
%do %until (&calc.=&target. or &_i.=&iter.);
%let _i = %eval(&_i.+1);
data want;
set want;
&var. = &var. - &amount.;
call symputx('calc',&var.);
run;
%end;
%if &calc.=&target. %then %do;
%put &var. reduced to &target. in &_i. iterations.;
%end;
%else %do;
%put &var. not reduced to &target. in &iter. iterations. Try a larger number.;
%end;
%mend reduce;
%reduce(data=have,var=x,amount=1,target=0);
That is a very simple example, but it has all of the same elements. I prefer to use do-until and increment on my own but you can do the opposite also (as %rakinge does). Sadly the macro language doesn't allow for do-by-until like the data step language does. Oh well.
Secondly, you can often do things like this inside a single data step. Even in older versions (9.2 etc.), you can do all of what you ask above in a single data step, though it might look a little clunky. In 9.3+, and particularly 9.4, there are ways to run that proc sql inside the data step and get the result back without waiting for another data step, using RUN_MACRO or DOSUBL and/or the FCMP language. Even something simple, like this:
data have;
initial_a=0.3;
a=0.3;
target=0.5;
output;
initial_a=0.6;
a=0.6;
output;
initial_a=0.8;
a=0.8;
output;
run;
data want;
k=1;
do iter=1 to 20 until (abs(target-estimate0) < 0.001);
do _n_ = 1 to nobs;
if _n_=1 then result_tot=0;
set have nobs=nobs point=_n_;
a=initial_a*k;
result=min(1,a);
result_tot+result;
end;
estimate0 = result_tot/nobs;
k = k * (target/estimate0);
end;
output;
stop;
run;
That does it all in one data step. I'm cheating a bit because I'm writing my own data step iterator, but that's fairly common in this sort of thing, and it is very fast. Macros iterating multiple data steps and proc sql steps will be much slower typically as there is some overhead from each one.