I'm trying to understand how the RETAIN statement is supposed to work with existing variables, but I seem to be missing something, as I do not get the desired result.
In the following example, my code aims to create a sort of running counter for the value variable:
data new (sortedby=id);
input id $ value count;
datalines ;
d 55 0
d 66 0
d 33 0
run;
data cc;
set new;
by id;
retain count;
count+value;
run;
I'm expecting the count variable to hold the cumulative sum of the value column. However, that is not what happens: the column keeps its original 0 values.
I would be interested in understanding why the implicit retain of the "+" sum statement is not working in this case.
Is it an issue related to the fact that count is an already existing variable?
Best regards
All the RETAIN statement does is prevent a variable from being set to missing at the top of the DATA step. In your code, your SET statement reads a value for COUNT (0), so even though the value is retained, it is reset to 0 when the SET statement executes.
I would play with code like below, with lots of PUT statements in it:
data cc;
put "Top of loop" (_n_ value count count2 count3)(=) ;
set new;
put "After set statement " (_n_ value count count2 count3)(=) ;
by id;
retain count;
count+value;
count2+value ;
count3=sum(count3,value) ;
put "After sum statement" (_n_ value count count2 count3)(=) ;
run;
At the top of the loop, Count and Count2 are retained. Count is retained because of the explicit RETAIN statement, and also because it was read by a SET statement. Count2 is retained because the sum statement has an implicit retain. Count3 is not retained.
Results are like:
Top of loop _N_=1 value=. count=. count2=0 count3=.
After set statement _N_=1 value=55 count=0 count2=0 count3=.
After sum statement _N_=1 value=55 count=55 count2=55 count3=55
Top of loop _N_=2 value=55 count=55 count2=55 count3=.
After set statement _N_=2 value=66 count=0 count2=55 count3=.
After sum statement _N_=2 value=66 count=66 count2=121 count3=66
Top of loop _N_=3 value=66 count=66 count2=121 count3=.
After set statement _N_=3 value=33 count=0 count2=121 count3=.
After sum statement _N_=3 value=33 count=33 count2=154 count3=33
Top of loop _N_=4 value=33 count=33 count2=154 count3=.
Yes, the fact that the variable is already on the input dataset will impact your program. When the SET statement executes, the retained value of COUNT is overwritten by the value of COUNT read from the input dataset.
Note that actually all variables that come from input dataset are already retained across data step iterations by SAS. This explains how the MERGE statement is able to implement a one to many merge. It also explains the way SAS keeps the values from the last observation in the shorter group when you do an N to M merge.
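One way to sidestep the overwrite is to keep the incoming COUNT out of the step entirely, so that the sum statement owns the variable. This is only a sketch: it assumes the COUNT values stored in NEW (all 0 here) are not needed, and that the running total should restart for each ID.

```sas
data cc;
  set new (drop=count);        /* assumption: the stored COUNT values are not needed */
  by id;
  if first.id then count = 0;  /* restart the running total for each ID */
  count + value;               /* sum statement: COUNT is implicitly retained */
run;
```

With the three sample rows, COUNT should come out as 55, 121 and 154.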
From what I learned about retain, I would suggest this solution:
data cc;
if count = . then previous_count = 0; else previous_count = count;
set new;
by id;
drop previous_count;
count = previous_count + value;
run;
A few comments: since count already exists on the input dataset, SAS retains the variable anyway, so you can pick up the old value before the SET statement executes.
On the first iteration count is missing (.), and adding a number to a missing value with + yields missing. I simply handled that case with an IF.
As Quentin suggested, adding some PUT statements really helps to better understand what's going on!
I need to create a new variable that is derived from several other variables in my dataset:
HAD1 (1=yes 2=no 9=unknown),
HAF10 (1=yes 2=no 9=unknown),
HAC1C (1=yes 2=no 9=unknown),
and HAC1D (1=yes 2=no 9=unknown).
It should add up the number of health conditions an individual has. I also want to set every 9 to missing (".").
My new variable will be named CC4:
CC4 =
(0=no conditions,
1=one condition,
2=two conditions,
3=three conditions,
4=four conditions,
.=any condition appears missing)
How do I code it in the correct way and add it to my dataset?
I only wrote this:
data dataset;
set dataset;
*use arrays to create clean, re-coded versions of variables;
array code1 [4] HAD1 HAF10 HAC1C HAC1D;
array code2 [4] diabetes hattack hfailure stroke;
do new= 1 to 4;
if code1 [new] = 1 then code2 [new] = 1;
*keep all 1's as 1's;
else if code1 [new] = 2 then code2 [new] = 2;
*keep all 2's as 2's;
else if code1 [new] = 9 then code2 [new] = .;
*make all 9's into .'s;
end;
drop new;
*create summation variable;
cc4=HAD1+HAF10+HAC1C+HAC1D;
You don't need to recode to count.
You can take advantage of SAS evaluating boolean expressions to 1 for TRUE and 0 for FALSE.
data want;
set have;
cc4 = (HAD1=1)+(HAF10=1)+(HAC1C=1)+(HAC1D=1);
run;
PS Do not overwrite your input data by using the same dataset name in the DATA and SET statements. It will make it hard to correct coding mistakes.
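The boolean sum above counts the "yes" answers but returns a number even when one of the codes is 9. If CC4 should be missing whenever any condition is unknown, as the question asks, a sketch along these lines could be used. It assumes that anything other than 1 or 2 (that is, 9 or an actual missing value) counts as unknown, and it reuses the HAVE and WANT names from the example above.

```sas
data want;
  set have;
  /* count the "yes" answers only when all four codes are a definite 1 or 2 */
  if HAD1 in (1,2) and HAF10 in (1,2) and HAC1C in (1,2) and HAC1D in (1,2)
    then cc4 = (HAD1=1) + (HAF10=1) + (HAC1C=1) + (HAC1D=1);
  else cc4 = .;  /* at least one code is 9 or missing */
run;
```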
If you want new names for the original code variables, use RENAME.
If you want new names as well as the original code variables, you probably shouldn't.
If you only want the cc4 result, you can exploit the fact that each code is a single digit to compute the condition count, returning a number only when every condition asserts a yes/no state.
Example:
data have;
do code1 = 1,2,9;
do code2 = 1,2,9;
do code3 = 1,2,9;
do code4 = 1,2,9;
output;
end;end;end;end;
run;
data want;
set have;
codes = cats(code1,code2,code3,code4);
drop codes;
cc4 = ifn(index(codes,'9'),.,count(codes,'1'));
run;
In your case replace code1,code2,code3,code4 with HAD1,HAF10,HAC1C,HAC1D
I have written this code to do the following:
read the records in the table "not_identified" one by one;
for each record, pass the "name_firstname" variable to a macro named "mCalcul_lev_D33";
the macro then calculates the Levenshtein distance between the value passed as parameter and every value of the variable "name_firstname_in_D33" in the "data_all" table;
if the Levenshtein distance is less than or equal to 3, the record from "data_all" is copied to the "lev_D33" table.
rsubmit;
%macro mCalcul_lev_D33(theName);
data result.lev_D33;
set result.data_all;
name_LEV=complev(&theName, name_firstname_in_D33);
if name_LEV<=3 then output;
run;
%mend mCalcul_lev_D33;
endrsubmit;
rsubmit;
data _null_;
set result.not_identified;
call execute ('%mCalcul_lev_D33('||name_firstname||')');
;
run;
endrsubmit;
There are 53700000 records in "data_all". The code has been running since yesterday. Because I cannot see the result, I am asking:
Is the code doing what I want?
How should I code it if I want to write "name_firstname" (the variable passed as the parameter) at the beginning of each record of "lev_D33"?
Thank you!
D.O.:
I posit your macros are making the task more difficult than it needs to be. There appears to be a coding problem: each row in not_identified causes result.lev_D33 to be rebuilt. If your long-running program ever does finish, the lev_D33 output data set will correspond to only the last not_identified row.
You are effectively doing a cross join, comparing ALL_COUNT * NOT_IDENT_COUNT pairs of rows in the process.
How many rows are in not_identified? Hopefully far fewer than in data_all.
Is the result libname pointing to a network drive or remote server? Network I/O can make things run for a very long time and even win you a phone call from the network team.
A cross join (every row against every row) can be done in a DATA step with nested loops and a point= on the inner loop SET. In a DATA step the outer loop is the implicit loop.
Consider this sample code:
data all_data;
do row = 1 to 100;
length name_firstname $20;
name_firstname
= repeat (byte(65 + mod(row,26)), 4*ranuni(123))
|| repeat(byte(65 + 26*ranuni(123)), 4*ranuni(123))
;
output;
end;
run;
data not_identified;
do row = 1 to 10;
length name_firstname $20;
name_firstname = repeat (byte(65 + mod(row,26)), 10*ranuni(123));
output;
end;
run;
data lev33;
set all_data;
do check_row = 1 to check_count;
set not_identified (keep=name_firstname rename=name_firstname=check_name)
nobs=check_count
point=check_row
;
name_lev = complev (check_name, name_firstname);
if name_lev <= 3 then output;
end;
run;
This approach tests each all_data row against every not_identified row before moving on to the next all_data row. This is a useful method when all_data is very large and you might want to process it in chunks. Chunk processing is an appropriate place to start macro coding:
%macro do_chunk (FROM_OBS=, TO_OBS=);
data lev33_&FROM_OBS._&TO_OBS;
set all_data (firstobs=&FROM_OBS obs=&TO_OBS);
do check_row = 1 to check_count;
set not_identified (keep=name_firstname rename=name_firstname=check_name)
nobs=check_count
point=check_row
;
name_lev = complev (check_name, name_firstname);
if name_lev <= 3 then output;
end;
run;
%mend;
%macro do_chunks;
%local index;
%do index = 1 %to 100 %by 10;
%do_chunk ( FROM_OBS=&index, TO_OBS=%eval(&index+9) )
%end;
%mend;
%do_chunks
You might shepherd the whole process yourself, bypassing do_chunks and manually invoking do_chunk for ranges of your choosing.
Thanks to @Richard. I have used your second example to write this code:
rsubmit;
data result.lev_D33;
set result.not_identified (firstobs=1 obs=10);
do check_row = 1 to 1000000;
set &lib..data_all (firstobs=1 obs=1000000) point=check_row;
name_lev = complev (name_firstname, name_firstname_D3);
if name_lev <= 3 then output;
end;
run;
endrsubmit ;
And it worked like I wanted.
In this example, I compare name_firstname in the not_identified table to every name_firstname_D3 in data_all. If COMPLEV is less than or equal to 3, the combined record goes into the result table "lev_D33" (one record from not_identified merged with one record from data_all).
As a test, I took 10 records from not_identified and looked for matching names and first names in only the first 1000000 records of data_all.
This question was discussed on the SAS forum, and the participants finally agreed to disagree.
The issue is simple: SAS assigns a missing value to all variables at compile time UNLESS a variable shows up in a sum statement (in which case SAS assigns a value of 0 at compile time). Here is my simple proof:
data test;
put _all_;
var1+1;
var2=5;
put _all_;
run;
Log screen
var1=0 var2=. _ERROR_=0 _N_=1
var1=1 var2=5 _ERROR_=0 _N_=1
NOTE: The data set WORK.TEST has 1 observations and 2 variables.
var2 was assigned a missing value, BUT var1 was assigned 0 because it is part of a sum statement (or so I believe).
My question is WHY? I was pretty sure that SAS assigns missing values to all variables at compilation. Why does it make an exception for a variable that will show up in a sum statement? Are there any other exceptions?
I wouldn't call it sum statement.
The statement
var1+1;
is equivalent to
retain var1 0;
var1 = sum(var1, 1);
Neither the 'long' form
var1 = var1 + 1;
nor
var1 = sum(var1, 1);
would by itself retain the variable or initialize it to zero.
So to answer the question:
initialization to zero is part of RETAIN behavior implicitly requested by
a + b;
syntax for variable a.
I can't think of other exceptions.
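The difference between the three forms can be seen side by side in a small sketch. The dataset name THREE and the variables a, b and c are made up for the illustration.

```sas
/* a three-row input so that the second step iterates three times */
data three;
  do i = 1 to 3;
    output;
  end;
run;

data _null_;
  set three;
  a + 1;          /* sum statement: retained, starts at 0 */
  b = sum(b, 1);  /* not retained: B is missing at the top of every iteration */
  c = c + 1;      /* not retained: missing + 1 stays missing */
  put i= a= b= c=;
run;
```

The log should show a counting 1, 2, 3, while b stays at 1 and c stays missing (with a note that missing values were generated).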
I've tried something like this :
data wynik;
set dane;
if x>3 then x3=3*x;
else set dane2; x3=x2;set dane;
run;
dane and dane2 have the same number of rows.
The result is interesting: the condition x>3 still holds after setting dane2, but SAS always takes the first observation; that is, it doesn't pass along the current state of the hidden loop counter. My question is: does SAS have/use a hidden loop with a counter while iterating through a dataset, and can the user access it?
Edit:
Maybe I should add "without explicit loops" to the title, but solutions with explicit loops would also be welcome.
Making some assumptions:
data dane;
do x = 1 to 5;
output;
end;
run;
data dane2;
do x2 = 5 to 1 by -1;
output;
end;
run;
data wynik;
merge dane dane2;
if x > 3 then x3=3*x;
else x3=x2;
put x3=;
run;
That uses the side-by-side merge (merge with no by statement) to get you both values at once.
To answer your followup question:
does SAS have/use hidden loop with counter while iterating through dataset which could be accessed by user ?
Yes, it does; _n_ defines the current loop iteration (as long as it isn't modified externally, which it can be - it is just a regular variable that's not written out to the dataset). So you could similarly do the following:
data wynik;
set dane;
if x > 3 then x3=x*3;
else do;
set dane2 point=_n_;
x3=x2;
end;
put x3=;
run;
The side-by-side merge is preferred because it will be faster, unless you very infrequently need to look at DANE2. It's also easier to code.
Given a data step like this:
data tmp;
do i=1 to 10;
if 3<i<7 then do;
some stuff;
end;
end;
run;
I want to write to the log how many times the if statement is true. For example, in this example, I want to have a line in the log that says:
If statement true 3 times
because the condition is true when i is 4, 5, or 6. How can I do this?
Using retain to keep a counter variable, it's pretty easy to increment a count of how many times an if condition was met.
data tmp;
retain Counter 0;
do i=1 to 10;
if 3<i<7 then do;
Counter+1;
*some stuff;
end;
end;
put 'If statement true ' Counter 'time(s).';
run;
Note that this writes to the log once, because the PUT is the last thing that executes before the data step terminates (the data step in the example iterates only once). If you wanted to do this in a data step that iterates more than once (e.g. when a SET statement reads data in from another dataset), you'd want to tell SAS to report only at the end of the step. You'd do it like this:
* create an example input data set;
data exampleData;
do i=1 to 10;
output;
end;
run;
* use a variable 'eof' to indicate the end of the input dataset;
data new;
set exampleData end=eof;
retain Counter 0;
if 3<i<7 then do;
Counter+1;
*some stuff;
end;
if eof then put 'If statement true ' Counter 'time(s).';
run;