I am very new to SAS and still learning the RETAIN statement. I have a dataset with ID and total spending, and I want to get the cumulative spending of each customer using the RETAIN statement. My code was as follows:
data ex03.try1; set ex03.sorted;
by ID;
if first.ID then do;
retain total 0;
total = total+amount; end;
else do; total=total+amount; end;
run;
However, my code does not reset total to 0 for each new ID. Could anyone please help me understand what I did wrong?
Much appreciated.
Thanks again.
The RETAIN statement is evaluated during compilation of the data step. Where you place it doesn't really matter; it will have the same effect. In particular, placing it inside a conditional does nothing. The RETAIN statement tells SAS not to set the value to missing when the next iteration of the data step starts. The optional initial value on the RETAIN statement sets the value before the first iteration of the data step.
To change the value for each new ID you need an actual assignment statement, which executes while the data step is running.
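A minimal sketch of this point, assuming the sashelp.class table that ships with SAS is available; the variable name counter is made up for illustration. The RETAIN sits in a branch that never executes, yet counter is still retained and starts at 0, so it should count up across the three rows instead of being reset to missing.

```sas
data demo;
  set sashelp.class(obs=3);
  /* This branch never runs, but RETAIN is handled at compile time */
  if 0 then do;
    retain counter 0;
  end;
  /* Without the RETAIN above, counter would be missing on every row */
  counter = counter + 1;
run;
```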
You can use a SUM statement and make your code shorter. Using the SUM statement implies that the variable is retained and initialized to zero.
data want;
set have;
by id;
if first.id then total=0;
total+amount;
run;
Note that the SUM statement will also handle missing values of the AMOUNT variable. It is really the equivalent of:
retain total 0;
total=sum(total,amount);
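To see the difference with a missing AMOUNT, here is a small sketch on made-up data (the dataset name have_demo and the variables plus_total and sum_total are hypothetical). With plain addition the running total turns missing as soon as a missing amount is added and stays missing for the rest of that ID; the SUM statement skips the missing amount and keeps accumulating.

```sas
/* Hypothetical data: ID 1 has a missing amount on its second row */
data have_demo;
  input id amount;
  datalines;
1 10
1 .
1 5
2 7
;
run;

data compare;
  set have_demo;
  by id;
  retain plus_total 0;
  if first.id then do;
    plus_total = 0;
    sum_total  = 0;
  end;
  plus_total = plus_total + amount;  /* missing + anything is missing */
  sum_total + amount;                /* SUM statement: missing amounts are ignored */
run;
```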
I got it working by putting retain total 0 outside of the if-then-else block, like this:
data ex03.try1; set ex03.sorted;
by ID;
retain total 0;
if first.ID then do;
total=0;
total=total+amount;
end;
else do;
total=total+amount;
end;
run;
But still, could anyone explain to me why my previous code didn't work? I was thinking that if it is a new ID, total is set to 0, and otherwise the value just keeps being added. I guess I must be wrong about that.
Thanks again.
I am trying to create a variable that will be flagged (set to 1) when it hits a certain number (when there is improvement in a process). I then want to reset the baseline, so that a new baseline (threshold) has to be hit for it to be flagged again. The data set starts off with just one variable (x). I create another one from the first observation called "baseline", so I can compare all other x values to baseline. Once I hit a threshold, I want to change the baseline to the threshold it just hit.
Here is the relevant part of the code (note that I have already written code that determines baseline earlier in the program).
data combo;
set combo;
if (baseline-x)/8 >1 then do;
flag=1;
baseline=x;
end;
else
flag=0;
run;
Here is the relevant part of the output.
I am expecting flag to be 1 (which it is) for the third observation, because baseline started out at 259 and then moved to 251, as I want it to. But why is flag=1 after that? The threshold is not met. Can anyone help? Thanks, John
I think you need another pair of parentheses in your condition, like below.
I ran it here and afterwards all flags became zero.
if ((baseline-x)/8) >1 then
do;
flag=1;
baseline=x;
end;
else
flag=0;
run;
The data step overwrites the original value of BASELINE after it sets the FLAG variable to 1. So we cannot see what value it had when it was read from the COMBO dataset, but we can assume it was more than 8 greater than X, since that is what sends it down that branch of the IF statement.
You need a separate variable to keep track of the current baseline. You can use RETAIN to do this.
data out;
set combo;
** Keep the value of this for each observation in the data set **;
retain current_baseline;
** Initialize baseline to starting value for data set **;
if _n_ = 1 then current_baseline = baseline;
if (current_baseline - x) / 8 > 1 then do;
flag = 1;
** Update current_baseline to new value since flag has been tripped **;
current_baseline = x;
end;
else flag = 0;
** If you want to store the value of baseline for later viewing, you can **;
baseline = current_baseline;
run;
Note that you really only need the values of x and the initial baseline value to run this. Let's say your initial baseline is x - 8. Then you can simply modify the initialization line to
** Initialize baseline to starting value for data set **;
if _n_ = 1 then current_baseline = x - 8;
Then you can run this with your raw data set with only the values for x.
The code below gets the last 2 observations from a dataset without using loops, the first./last. concept, or sorting.
data a;
set sashelp.cars nobs=_nobs_;/*create the temporary variable to store total no of obs*/
if _N_ ge _nobs_-1;/*Now compare the automatic variable _N_ to _nobs_*/
run;
Not sure there is a question here.
You can also use the POINT= option on the SET statement. You have to explicitly end the data step since most data steps end when they read past the end of the input data and this step cannot do that.
data want;
do p=max(1,nobs-1) to nobs;
set have point=p nobs=nobs;
output;
end;
stop;
run;
A DATA step is an implicit loop with an automatic index variable _N_. You can leverage that fact to implicitly output rows without an explicit DO loop. As per @Tom, the point= option is used so the entire data set does not need to be read to reach the last two rows.
Example:
data want;
if _N_ > min(2,_Z_) then stop;
_P_ = _Z_ - min(2,_Z_) + _N_;
set sashelp.class point=_P_ nobs=_Z_;
run;
I need to introduce a time trend to a regression model for a course but have no idea how to create a variable that's just (1,2,3,4,...,108). In R or Python I would just create an empty vector of 0's and then loop through to fill them with the loop index but I have no clue how to do it in SAS.
Thank you in advance
data want;
set have;
time_trend+1;
run;
SAS is an inherently looping language. The code above does four things:
1. Read a row
2. Add 1 to a variable called time_trend
3. Output the row to a dataset named want
4. Read the next row and execute the statements again
SAS automatically initialized the variable time_trend for us at compilation, so we do not need to declare a length or type. SAS assumes it is a numeric variable by default.
The statement time_trend+1 is a special shortcut of the below logic:
data want;
set have;
retain time_trend 0;
time_trend = time_trend + 1;
run;
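As a related sketch, not part of the answer above: when the step reads rows from a single SET statement, the automatic variable _N_ counts the iterations of the data step and therefore matches the row number, so it can be copied into a regular variable. The dataset names have and want are the same placeholders as above.

```sas
data want;
  set have;
  /* _N_ is 1 on the first iteration, 2 on the second, and so on.     */
  /* It is not written to the output by itself, so copy it explicitly. */
  time_trend = _n_;
run;
```

The sum statement version is the more robust of the two if rows are ever skipped before the counter is assigned, since _N_ counts iterations rather than rows that are kept.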
To 'copy' the PDV structure of a data set, it has been advised to "reference a data set at compile time" using
if 0 then set <data-set>
For example,
data toBeCopied;
length var1 $ 4. var2 $ 4. ;
input var1 $ var2 $;
datalines;
this is
just some
fake data
;
run;
data copyPDV;
if 0 then set toBeCopied;
do var1 = 'cutoff' ;
do var2 = 'words';
output;
end;
end;
run;
When you run this, however, the following NOTE appears in the log:
NOTE: DATA STEP stopped due to looping.
This is because the data step never reaches the EOF marker and gets stuck in an infinite loop, as explained in Data Set Looping. (It turns out the DATA step recognizes this and terminates the loop, hence the NOTE in the log).
It seems like usage of the if 0 then set <data-set> statement is a longstanding practice, dating as far back as 1987. Although it seems hacky to me, I can't think of another way to produce the same result (i.e. copying PDV structure), aside from manually restating the attribute requirements. It also strikes me as poor form to allow ERRORs, WARNINGs, and NOTEs which imply unintended program behavior to remain in the log.
Is there a way to suppress this note, or an altogether better method to achieve the same result (i.e. copying the PDV structure of a data set)?
If you include a stop; statement, as in
if 0 then do;
set toBeCopied;
stop;
end;
the NOTE still persists.
Trying to restrict the SET to a single observation also seems to have no effect:
if 0 then set toBeCopied (obs=1);
Normally SAS will terminate a data step at the point when you read past the end of the input, whether that input is raw data or SAS datasets. For example, this data step will stop when it executes the SET statement for the 6th time and finds there are no more observations to read.
data want;
put _n_=;
set sashelp.class(obs=5);
run;
To prevent loops SAS checks to see if you read any observations in this iteration of the data step. It is smart enough not to warn you if you are not reading from any datasets. So this program does not get a warning.
data want ;
do age=10 to 15;
output;
end;
run;
But by adding that SET statement you triggered the checking. You can prevent the warning by having a dataset that you are actually reading so that it stops when it reads past the end of the actual input data.
data want;
if 0 then set sashelp.class ;
set my_class;
run;
Or a file you are reading.
data want ;
if 0 then set sashelp.class ;
infile 'my_class.csv' dsd firstobs=2 truncover ;
input (_all_) (:) ;
run;
Otherwise add a STOP statement to manually end the data step.
data want ;
if 0 then set sashelp.class;
do age=10 to 15;
do sex='M','F';
output;
end;
end;
stop;
run;
The stop must not be in the if 0 branch, because that branch is never executed. The stop needs to be executed, and executed at the spot where you want execution to stop.
data copyPDV;
if 0 then set toBeCopied;
do var1 = 'cutoff' ;
do var2 = 'words';
output;
end;
end;
stop;
run;
This does not suppress the note in your situation, but it is another method to get the structure of a data set.
data class;
set sashelp.class(obs=0);
run;
or
proc sql;
create table class1 like sashelp.class;
quit;
I need to declare a variable for each iteration of a data step (for each _n_), but when I run the code, SAS outputs only the last variable declared, the one for the greatest _n_.
Declaring a variable for each row may seem stupid, but I need to achieve this result: I'm working on a dataset created by a proc freq, and I need a column for each group (each row of the dataset).
The result will go in a macro, so it has to be completely flexible.
proc freq data=&data noprint ;
table &group / out=frgroup;
run;
data group1;
set group (keep=&group count ) end=eof;
call symput('gr', _n_);
*REQUESTED code will go here;
run;
I tried these:
var&gr.=.;
call missing(var&gr.);
and a lot of other statements, but none worked.
The result is always the same: the dataset includes only var&gr, where &gr is the maximum _n_.
It seems that the PDV is overwriting the new variable on each iteration, even though the name is different.
Please keep the solution in a single data step or, at least, make the code take as little time as possible.
Any idea on how can I achieve the requested result?
Thanks.
Macro variables don't work like you think they do. Any macro variable reference is resolved at compile time, so your call symput is changing the value of the macro variable after all the references have been resolved. The reason you are getting results where the &gr is the maximum n is because that is what &gr was as a result of the last time you ran the code.
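A small sketch of that timing difference, assuming sashelp.class is available; the starting value initial and the variable names compile_time and run_time are made up for illustration. The &gr reference is replaced once, before the step runs, while SYMGET looks the value up during execution.

```sas
%let gr = initial;   /* hypothetical value left over from before the step */

data _null_;
  set sashelp.class(obs=3);
  call symputx('gr', _n_);
  compile_time = "&gr";         /* resolved once, before execution starts */
  run_time     = symget('gr');  /* looked up on every iteration */
  put compile_time= run_time=;
run;
```

On each row compile_time should show the value &gr had before the step started, while run_time follows the value just stored by CALL SYMPUTX.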
If you know you can determine the maximum _n_, you can put the max value into a macro variable and declare an array like so:
Find max _n_ and assign value to maxn:
data _null_;
set have end=eof;
if eof then call symputx('maxn',_n_);
run;
Create variables:
data want;
set have;
array var (&maxn);
run;
If you don't like proc transpose (if you need 3 columns you can always use it once for every column and then put together the outputs) what you ask can be done with arrays.
First thing you need to determine the number of groups (i.e. rows) in the input dataset and then define an array with dimension equal to that number.
Then the i-th element of your array can be recalled using _n_ as index.
In the following code &gr. contains the number of groups:
data group1;
set group;
array arr_counts(&gr.) var1-var&gr.;
arr_counts(_n_)= count;
run;
In SAS there are several methods to determine the number of observations in a dataset. My favorite is the following (it doesn't work with views):
data _null_;
if 0 then set group nobs=n;
call symputx('gr',n);
stop; /* no observation is ever read, so end the step explicitly */
run;