This question was discussed on SAS forum, and participants finally agreed to disagree .
The issue is simple : SAS assign a missing value to all variables at compile time UNLESS a variable shows up in a sum statement (in this case SAS assigns a value of 0 at compile time ) . Here is my simple proof
data test;
put _all_;
var1+1;
var2=5;
put _all_;
run;
Log screen
var1=0 var2=. _ERROR_=0 _N_=1
var1=1 var2=5 _ERROR_=0 _N_=1
NOTE: The data set WORK.TEST has 1 observations and 2 variables.
var2 was assigned a missing value BUT var1 was assigned 0 because it is part of a sum statement (I believe so )
My question is WHY ? I was pretty sure that SAS assignes missing values to all variables at compilation . Why does it make an exception to a variable that will show up in a sum statement ? Are there any other exceptions ?
I wouldn't call it sum statement.
The statement
var1+1;
is equivalent of
retain var1 0;
var1 = var1 + 1;
Nor the 'long' sum statement
var1 = var1 + 1;
nor
var1 = sum(var1, 1);
itself would do the RETAIN behavior nor initialization to zero.
So to answer the question:
initialization to zero is part of RETAIN behavior implicitly requested by
a + b;
syntax for variable a.
I can't think of other exceptions.
Related
I am processing a dataset, the contents of which I do not know in advance. My target SAS instance is 9.3, and I cannot use SQL as that has certain 'reserved' names (such as "user") that cannot be used as column names.
The puzzle looks like this:
data _null_;
set some.dataset; file somefile;
/* no problem can even apply formats */
put name age;
/* how to do this without making new vars? */
put somefunc(name) max(age);
run;
I can't put var1=somefunc(name); put var1; as that may clash with a source variable named var1.
I'm guessing the answer is to make some macro function that will read the dataset header and return me a "safe" (non-clashing) variable, or an fcmp function in a format, but I thought I'd check with the community to see - is there some "old school" way to outPUT directly from a function, in a data step?
Temporary array?
34 data _null_;
35 set sashelp.class;
36 array _n[*] _numeric_;
37 array _f[3] _temporary_;
38 put _n_ #;
39 do _n_ = 1 to dim(_f);
40 _f[_n_] = log(_n[_n_]);
41 put _f[_n_]= #;
42 end;
43 put ;
44 run;
1 _f[1]=2.6390573296 _f[2]=4.2341065046 _f[3]=4.7229532216
2 _f[1]=2.5649493575 _f[2]=4.0342406382 _f[3]=4.4308167988
3 _f[1]=2.5649493575 _f[2]=4.1789920363 _f[3]=4.5849674787
4 _f[1]=2.6390573296 _f[2]=4.1399550735 _f[3]=4.6298627986
5 _f[1]=2.6390573296 _f[2]=4.1510399059 _f[3]=4.6298627986
6 _f[1]=2.4849066498 _f[2]=4.0483006237 _f[3]=4.4188406078
7 _f[1]=2.4849066498 _f[2]=4.091005661 _f[3]=4.4367515344
8 _f[1]=2.7080502011 _f[2]=4.1351665567 _f[3]=4.7229532216
9 _f[1]=2.5649493575 _f[2]=4.1351665567 _f[3]=4.4308167988
The PUT statement does not accept a function invocation as a valid item for output.
A DATA step does not do columnar functions as you indicated with max(age) (so it would be even less likely to use such a function in PUT ;-)
Avoid name collisions
My recommendation is to use a variable name that is highly unlikely to collide.
_temp_001 = somefunc(<var>);
_temp_002 = somefunc2(<var2>);
put _temp_001 _temp_002;
drop _temp_:;
or
%let tempvar = _%sysfunc(rand(uniform, 1e15),z15.);
&tempvar = somefunc(<var>);
put &tempvar;
drop &tempvar;
%symdel tempvar;
Repurpose
You can re-purpose any automatic variable that is not important to the running step. Some omni-present candidates include:
numeric variables:
_n_
_iorc_
_threadid_
_nthreads_
first.<any-name> (only tweak after first. logic associated with BY statement)
last.<any-name>
character variables:
_infile_ (requires an empty datalines;)
_hostname_
avoid
_file_
_error_
I think you would be pretty safe choosing some unlikely to collide names. An easy way to generate these and still make the code somewhat readable would be to just hash a string to create a valid SAS varname and use a macro reference to make the code readable. Something like this:
%macro get_low_collision_varname(iSeed=);
%local try cnt result;
%let cnt = 0;
%let result = ;
%do %while ("&result" eq "");
%let try = %sysfunc(md5(&iSeed&cnt),hex32.);
%if %sysfunc(anyalpha(%substr(&try,1,1))) gt 0 %then %do;
%let result = &try;
%end;
%let cnt = %eval(&cnt + 1);
%end;
&result
%mend;
The above code takes a seed string and just adds a number to the end of it. It iterates the number until it gets a valid SAS varname as output from the md5() function. You could even then test the target dataset name to make sure the variable doesn't already exist. If it does build that logic into the above function.
Test it:
%let my_var = %get_low_collision_varname(iSeed=this shouldnt collide);
%put &my_var;
data _null_;
set sashelp.class;
&my_var = 1;
put _all_;
run;
Results:
Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 C34FD80ED9E856160E59FCEBF37F00D2=1 _ERROR_=0 _N_=1
Name=Alice Sex=F Age=13 Height=56.5 Weight=84 C34FD80ED9E856160E59FCEBF37F00D2=1 _ERROR_=0 _N_=2
This doesn't specifically answer the question of how to achieve it without creating new varnames, but it does give a practical workaround.
My data is more than 70,000. I have more than 50 variables. (Var1 to Var50). In each variable, there are about about 30 groups (I'll use a to z). I am trying to get a selection of data using if statements. I'd like to select every data with the same group. Eg data in var 1 to 30 with a, data with var 1 to 30 in b.
I seem to be writing
If (Var1="a" and Var2="a" and Var3="a" and Var4="a" and all the way to var50=
"a") or (Var1="b" and Var2="a" and Var3="b" and Var4="b" and all the way to var50=
"b")...
How do I consolidate? I tried using an array but it didnt work and i was not sure if arrays work in the IF and then statement.
IF (VAR2="A" or VAR2="B" or VAR2="C" or VAR2="D"
or VAR3="A" or VAR3="B" or VAR3="C" or VAR3="D"
or VAR4="A" or VAR4="B" or VAR4="C" or VAR4="D"
or VAR5="A" or VAR5="B" or VAR5="C" or VAR5="D"
or VAR6="A" or VAR6="B" or VAR6="C" or VAR6="D"
or VAR7="A" or VAR7="B" or VAR7="C" or VAR7="D"
or VAR8="A" or VAR8="B" or VAR8="C" or VAR8="C"
or VAR9="A" or VAR9="B" or VAR9="C" or VAR9="D"
or VAR10="A" or VAR10="B" or I10_D10="C" or VAR10="D"
or VAR12="A" or VAR12="B" or VAR12="C" or VAR12="D"
or VAR13="A" or VAR13="B" or VAR13="C" or VAR13="D"
or VAR14="A" or VAR14="B" or VAR14="C" or VAR14="D"
or VAR15="A" or VAR15="B" or VAR15="C" or VAR15="D"
or VAR6="A" or VAR16="B" or VAR16="C" or VAR16="D"
or VAR17="A" or VAR17="B" or VAR17="C" or VAR17="D"
or VAR18="A" or VAR18="B" or VAR18="C" or VAR18="C"
or VAR19="A" or VAR19="B" or VAR19="C" or I10_D19="D"
or VAR20="A" or VAR20="B" or I10_D20="C" or VAR20="D"
or VAR21="D" or VAR22="A" or VAR22="B" or VAR22="C" or VAR22="D"
or VAR23="A" or VAR23="B" or VAR23="C" or VAR23="D"
or VAR24="A" or VAR24="B" or VAR24="C" or VAR24="D"
or VAR25="A" or VAR25="B" or VAR25="C" or VAR25="D"
or VAR26="A" or VAR26="B" or VAR26="C" or VAR26="D"
or VAR27="A" or VAR27="B" or VAR27="C" or VAR27="D"
or VAR28="A" or VAR28="B" or VAR28="C" or VAR28="C"
or VAR29="A" or VAR29="B" or VAR29="C" or VAR29="D"
or VAR30="A" or VAR30="B" or I10_D30="C" or VAR30="D")
then Group=1; else Group=0;
You probably don't need a macro, however a macro might be faster.
%let value=a;
data want;
set have;
array var[50];
keepit=1;
do i=1 to 50;
keepit = keepit and (var[i]="&value");
if ^keepit then
leave;
end;
if keepit;
drop i keepit;
run;
I create a signal variable and update it's value, it will be false if any value in the var[] array is not the &value. I leave the loop early if we find 1 non-matching value, to make it more efficient.
It's not exactly clear what you want. If you want to avoid checking all variables you can use WHICHC to find if any in a list are A.
X = whichc('a', of var1-var30);
If you want to see what different groups you have across all the variables, I think a big proc freq is what you want:
proc freq data=have noprint;
table var1*var2*var3*var4....*var30*gender*age / list out=table_counts;
run;
And then check the table_counts data set to see if that has what you want.
If neither of these are what you want, you need to add more details to your question. A sample of data and expected output would be perfect.
When I need to search several variables for a particular value what I will do is - combine all variables into one string and then search that string. Like this:
*** CREATE TEST DATA ***;
data have;
infile cards;
input VAR1 $ VAR2 $ VAR3 $ VAR4 $ VAR5 $;
cards;
J J K A M
S U I O P
D D D D D
l m n o a
Q U J C S
;
run;
data want;
set have;
*** USE CATS FUNCTION TO CONCATENATE ALL VAR# INTO ONE VARIABLE ***;
allvar = cats(var1, var2, var3, var4, var5);
*** IF NEEDED, APPLY UPCASE TO CONCATENATED VARIABLE ***;
*allvar = upcase(allvar);
*** USE INDEXC FUNCTION TO SEARCH NEW VARIABLE ***;
if indexc(allvar, "ABCD") > 0 then group = 1;
else group = 0;
run;
I'm not sure if this is exactly what you need, but hopefully this is something you can modify for your particular task.
The code as posted is testing if ANY of a list of variables match ANY of a list of values.
Let's make a simple test dataset.
data have ;
input id (var1-var5) ($);
cards;
1 E F G H I
2 A B C D E
;;;;
Make one array of the values you want to find and one array of the variables you want to check. Loop over the list of variables until you either find one that contains one of the values or you run out of variables to test.
data want ;
set have;
array values (4) $8 _temporary_ ('A' 'B' 'C' 'D');
array vars var1-var5 ;
group=0;
do i=1 to dim(vars) until (group=1);
if vars(i) in values then group=1;
end;
drop i;
run;
You could avoid the array for the list of values if you want.
if vars(i) in ('A' 'B' 'C' 'D') then group=1;
But using the array will allow you to make the loop run over the list of values instead of the list of variables.
do i=1 to dim(values) until (group=1);
if values(i) in vars then group=1;
end;
Which might be important if you wanted to keep the variable i to indicate which value (or variable) was first matched.
I'm trying to understand how the retain statement is supposed to work with existing variables, but still it seems I missing something as I do not get the desired result
In the following example my code aim to create a sort of counter for the value variable
data new (sortedby=id);
input id $ value count;
datalines ;
d 55 0
d 66 0
d 33 0
run;
data cc;
set new;
by id;
retain count;
count+value;
run;
And I 'm expecting that the count variable will be the result of the cumulation of the value column. However, the result is not achived and the column keep its original 0 values.
I would be interested in understanding why the implict retain statement in the "+" sign is not working in this case.
It is an issue related to the fact that count is an already existing variables?
Bests
All the RETAIN statement does is prevent variable from being set to missing at the top of the DATA step. In your code, your SET statement reads a value for COUNT (0), so even though the value is retained, it is reset to 0 when the SET statement executes.
I would play with code like below, with lots of PUT statements in it:
data cc;
put "Top of loop" (_n_ value count count2 count3)(=) ;
set new;
put "After set statement " (_n_ value count count2 count3)(=) ;
by id;
retain count;
count+value;
count2+value ;
count3=sum(count3,value) ;
put "After sum statement" (_n_ value count count2 count3)(=) ;
run;
At the top of the loop, Count and Count2 are retained. Count is retained because of the explicit retain statement, and because it was read on a SET set statement. Count2 is retained because the sum statement has an implicit retain. Count3 is not retained.
Results are like:
Top of loop _N_=1 value=. count=. count2=0 count3=.
After set statement _N_=1 value=55 count=0 count2=0 count3=.
After sum statement _N_=1 value=55 count=55 count2=55 count3=55
Top of loop _N_=2 value=55 count=55 count2=55 count3=.
After set statement _N_=2 value=66 count=0 count2=55 count3=.
After sum statement _N_=2 value=66 count=66 count2=121 count3=66
Top of loop _N_=3 value=66 count=66 count2=121 count3=.
After set statement _N_=3 value=33 count=0 count2=121 count3=.
After sum statement _N_=3 value=33 count=33 count2=154 count3=33
Top of loop _N_=4 value=33 count=33 count2=154 count3=.
Yes the fact that the variable is already on the input dataset will impact your program. When the SET statement executes the retained value of COUNT is overwritten by the value of COUNT read from the input dataset.
Note that actually all variables that come from input dataset are already retained across data step iterations by SAS. This explains how the MERGE statement is able to implement a one to many merge. It also explains the way SAS keeps the values from the last observation in the shorter group when you do an N to M merge.
From what I learned about retain, I would suggest this solution:
data cc;
if count = . then previous_count = 0; else previous_count = count;
set new;
by id;
drop previous_count;
count = previous_count + value;
run;
A few comments: since count already exists, SAS retains the variable anyway, and you can get the old value before the set is executed.
For the first iteration, count is . and SAS can not add a number to .. I simply fixed that with an if.
As Quentin suggested, adding some put statements really helps to better understand whats going on!
I am very new but keen to learn SAS coding.I have 2 data sets a and b namely dt1 and dt2 which consist of columns a for dt1 and b and c for dt2:
a b c
2014 2008 2
2009 3
2014 4
2015 5
I am trying to get the nth row of the c column when the element which is at nth row of b column is equal to a(1)
Here it is c=4;
I wrote a code below.
DATA dt1;
set dt1;
data dt2;
set dt2;
i=1;
do while (b ne a);
i=i+1;
end;
call symput('ROW_NUMBER',i);
run;
proc print data = dt2(keep = c obs = &ROW_NUMBER firstobs = &ROW_NUMBER);
run;
but this code enters in an infinite loop and I could not find any solution for this. I appreciate if you help solve this issue.
Thanks
I think you should learn the basic syntax of the data step before trying to use macro variables. A lot of what you're doing makes little sense. Here is an explanation of how the data step works. You will do yourself a huge favor if you study that.
Here's how to do an inner join in proc sql, which seems to be more in line with your goal here. This simply selects the values of c where dt1.a is equal to dt2.b:
proc sql;
select c
from dt1 inner join dt2 on dt1.a = dt2.b;
quit;
If you were to use a data step, you'd do something like this the following.
data out(keep=c);
set dt1;
do until (a=b or eof);
set dt2 end=eof;
if a=b then output;
end;
run;
proc print data=out noobs;
run;
Use the end= option to create temporary variable eof which allows you to end the loop after the last row of dt2 is read.
This is a simple MERGE. You just need to rename the variables to match. This assumes they're both sorted by the value (a/b). You can then set the macro variable in that data step or do whatever you want.
data want;
merge dt1(in=_a rename=a=b) dt2(in=_b);
by b;
if _a and _b;
call symput("ROW_NUMBER",c);
run;
If you want to define macro variables:
data _null_;
set dt2;
if _n_=1 then set dt1;
if a=b then do;
call symput('c_val',c);
call symput('row_num',_n_);
end;
run;
%put &row_num &c_val;
I want to have a mean which is based in non zero values for given variables using proc means only.
I know we do can calculate using proc sql, but I want to get it done through proc means or proc summary.
In my study I have 8 variables, so how can I calculate mean based on non zero values where in I am using all of those in the var statement as below:
proc means = xyz;
var var1 var2 var3 var4 var5 var6 var7 var8;
run;
If we take one variable at a time in the var statement and use a where condition for non zero variables , it works but can we have something which would work for all the variables of interest mentioned in the var statement?
Your suggestions would be highly appreciated.
Thank you !
One method is to change all of your zero values to missing, and then use PROC MEANS.
data zeromiss /view=zeromiss ;
set xyz ;
array n{*} var1-var8 ;
do i = 1 to dim(n) ;
if n{i} = 0 then call missing(n{i}) ;
end ;
drop i ;
run ;
proc means data=zeromiss ;
var var1-var8 ;
run ;
Create a view of your input dataset. In the view, define a weight variable for each variable you want to summarise. Set the weight to 0 if the corresponding variable is 0 and 1 otherwise. Then do a weighted summary via proc means / proc summary. E.g.
data xyz_v /view = xyz_v;
set xyz;
array weights {*} weight_var1-weight_var8;
array vars {*} var1-var8;
do i = 1 to dim(vars);
weights[i] = (vars[i] ne 0);
end;
run;
%macro weighted_var(n);
%do i = 1 to &n;
var var&i /weight = weight_var&i;
%end;
%mend weighted_var;
proc means data = xyz_v;
%weighted_var(8);
run;
This is less elegant than Chris J's solution for this specific problem, but it generalises slightly better to other situations where you want to apply different weightings to different variables in the same summary.
Can't you use a data statement?
data lala;
set xyz;
drop qty;
mean = 0;
qty = 0;
if(not missing(var1) and var1 ^= 0) then do;
mean + var1;
qty + 1;
end;
if(not missing(var2) and var2 ^= 0) then do;
mean + var2;
qty + 1;
end;
/* ... repeat to all variables ... */
if(not missing(var8) and var8 ^= 0) then do;
mean + var8;
qty + 1;
end;
mean = mean/qty;
run;
If you want to keep the mean in the same xyz dataset, just replace lala with xyz.