SAS: using the LAG function

I used the following code to create a new variable that holds the previous value of close_midpoint within each _ric group.
data test;
set HAVE;
lric=lag(_ric);
if lric=_ric then lclose_midpoint=lag(close_midpoint);
else lclose_midpoint=.;
run;
However, as the following figure shows, the lagged value of close_midpoint in the red square equals the last value of close_midpoint from the previous _ric group. For example, lclose_midpoint in observation 7 should be 4.675, but the actual result is 4.2. What is the problem in my code? Thanks.

LAG() does not take the value from the previous observation. It maintains its own queue and returns a value from those that were passed to it on previous calls. Since you are only calling LAG() conditionally, there is no way for it to find the values you want.
data test;
set HAVE;
by _ric ;
lclose_midpoint=lag(close_midpoint);
if first._ric then lclose_midpoint=.;
run;
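To make the queue behaviour concrete, here is a minimal sketch that runs a conditional and an unconditional LAG call side by side. The four-row dataset and the variable names lag_conditional and lag_every_row are made up for illustration; only _ric and close_midpoint come from the question.

```sas
/* Hypothetical sample data: two _ric groups, two rows each */
data have;
  input _ric $ close_midpoint;
  datalines;
A 4.1
A 4.2
B 4.675
B 4.7
;
run;

data demo;
  set have;
  by _ric;
  /* Conditional call: this LAG's queue only sees rows 2 and 4 */
  if not first._ric then lag_conditional = lag(close_midpoint);
  /* Unconditional call: this LAG's queue sees every row */
  lag_every_row = lag(close_midpoint);
  if first._ric then lag_every_row = .;
run;

/* Row 4 (B, 4.7): lag_conditional = 4.2 (carried over from group A), */
/*                 lag_every_row   = 4.675 (the true previous row)    */
proc print data=demo;
run;
```

Each LAG call in the program text keeps a separate queue, which is why the two variables can disagree on the same row.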

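@Tom's answer shows LAG being called unconditionally, so on every row it returns the value from the prior row.
IFN can be used to get the same effect in one statement instead of two. Because IFN is a function, all of its arguments, including the LAG call, are evaluated on every row.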
data want;
set sashelp.cars;
by make;
MSRP_lag_within_group = ifn ( first.make , . , lag(MSRP) );
keep make model MSRP MSRP_lag_within_group;
run;

Related

How do I sum similarly named variables in SAS?

I want to remove every observation where the variable starting with R1 has a missing value. In order to do this, I first try to sum every variable with that prefix:
data test
input R1_1 R1_2 R1_3;
datalines;
. . .
;
run;
data test2;
set test;
diagnosis=sum (of R1:);
run;
This syntax should work according to this article. However something seems to be wrong. In the above example, I get an error complaining about the function call not having enough arguments. In other cases, the code seems to run smoothly but my diagnosis variable isn't created.
Can I fix this and in that case how?
Your code does not work because you did not end the DATA statement with a semicolon, so the TEST dataset you created does not have any variables. Instead, you also created datasets named INPUT, R1_1, R1_2 and R1_3, which do not have any variables either.
To your actual question you can use NMISS() to count the number of missing numeric values.
nmiss = nmiss(of R1_:) ;
So you can eliminate observations with ANY missing values by using something like:
data want;
set have;
where nmiss(R1_1, R1_2, R1_3) = 0;
run;
If the goal is to remove observations where ALL of the values are missing you need to know how many variables you are testing. If you don't know that number in advance then you could use an ARRAY to count them. But then you would need to use a subsetting IF instead of WHERE.
data want;
set have;
array x r1_: ;
if nmiss(of r1_:) < dim(x);
run;
If you have a mix of numeric and character variables you can use CMISS() instead.
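A minimal sketch of the CMISS() variant follows, assuming a made-up dataset with two numeric variables and one character variable that all share the R1_ prefix. The name R1_name and the data values are hypothetical.

```sas
/* Hypothetical data: in list input a single period is read as missing, */
/* for character variables as well as numeric ones                      */
data have;
  input R1_1 R1_2 R1_name $;
  datalines;
1 2 a
. 2 b
1 . .
3 4 c
;
run;

data want;
  set have;
  /* CMISS counts missing values of either type; keep complete rows only */
  if cmiss(of R1_:) = 0;
run;

/* want keeps rows 1 and 4 */
proc print data=want;
run;
```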

SAS-Data selection within certain requirements

I have a dataset as shown in this screenshot. In each day, each interval (e.g. 9:30:00) has multiple, repeated _RIC. For example, observation 2 and 3 (DDA211204700) are repeated.
I'd like to select every first _RIC in each interval in each day. For example, for 20120103, 09:30:00, I want to pick up Observation 1, 2, 4, 6, and so on.
I used the following code:
data test1;
do until (last.interval);
set test;
by _ric date_L_ interval;
if first._ric;
output;
end;
run;
Although the code seems to work as shown in this next screenshot, I still hope someone could help me to check my code because I really have little experience with SAS. Thanks!
Your data is not ordered properly to detect the first record for each _RIC within an INTERVAL. Sort the data first, and then your logic might work. There is also a logic error in using a subsetting IF inside a DOW loop: when its condition is false, SAS abandons the rest of the current DATA step iteration, including the DO loop, and returns to the top of the step. You would want a normal IF/THEN statement instead (if first._ric then output;). But you really don't need a DOW loop for this situation, so a plain subsetting IF will do.
You could sort by INTERVAL and then _RIC and date.
data WANT ;
set HAVE ;
by interval _ric date_L_ ;
if first._ric;
run;
Or you could get the same records if you sorted by _RIC and then INTERVAL and date and use FIRST.INTERVAL instead.
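As a rough sketch of the sort-then-subset approach, the example below sorts a small made-up dataset and keeps the first row of each BY group. It assumes "first _RIC in each interval in each day" means one row per date, interval and _RIC, so date_L_ is placed first in the BY list. That ordering, the price variable and the sample values are assumptions, not taken from the question's data.

```sas
/* Hypothetical sample: repeated _ric within the same day and interval */
data have;
  input date_L_ interval :$8. _ric :$12. price;
  datalines;
20120103 09:30:00 DDA211204700 1.0
20120103 09:30:00 DDA211204700 1.1
20120103 09:30:00 AAA211204700 2.0
20120104 09:30:00 DDA211204700 1.2
20120104 09:30:00 DDA211204700 1.3
;
run;

/* BY-group processing needs the data sorted in the same key order */
proc sort data=have out=sorted;
  by date_L_ interval _ric;
run;

data want;
  set sorted;
  by date_L_ interval _ric;
  /* Keep only the first row of each date/interval/_ric combination */
  if first._ric;
run;

/* want has 3 rows, with price = 2.0, 1.0 and 1.2 */
proc print data=want;
run;
```

PROC SORT keeps the original relative order of rows with equal keys by default, so the first row of each group is the one that appeared first in the input.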
If what you want is the earliest time_L_ in each group, you could also try this:
proc sql;
select * from have group by _ric,interval having time_L_=min(time_L_);
quit;

Why does undefined variable not return an error in DATA step?

I wrote the code below to create a "running total" column named "sum". Although it seems to work, I don't understand how SAS is executing this code. When it encounters the statement sum + var, how does it know what to do given that sum is undefined? Based on the book "The Little SAS Book: A Primer", the SAS data step has a built-in loop that executes the program observation by observation. Given this, how does the program know to do the equivalent of sum[row2] = sum[row1] + var[row2] when it gets to the second row?
data df;
input var;
datalines;
1
3
.
5
1
;
run;
data df2;
set df;
sum+var;
run;
This syntax is the sum statement, which carries an implicit retain, and is equivalent to:
retain sum 0;
sum=sum(sum,var);
When you retain a variable, its value is not set to missing when the PDV is reloaded (it 'retains' the previous value). It does NOT read from the previous row, which is a common misconception.
More information on the retain statement is available in the SAS documentation.
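To check the equivalence on the question's own data, here is a small sketch that writes the retain out explicitly. The variable name running and the dataset name df3 are made up for illustration.

```sas
data df;
  input var;
  datalines;
1
3
.
5
1
;
run;

data df3;
  set df;
  /* Explicit form of "running + var;" */
  retain running 0;
  /* SUM() ignores missing arguments, so row 3 leaves the total unchanged */
  running = sum(running, var);
run;

/* running is 1, 4, 4, 9, 10 */
proc print data=df3;
run;
```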

Naming variable using _n_, a column for each iteration of a datastep

I need to declare a variable for each iteration of a datastep (for each n), but when I run the code, SAS will output only the last one variable declared, the greatest n.
It seems stupid declaring a variable for each row, but I need to achieve this result, I'm working on a dataset created by a proc freq, and I need a column for each group (each row of the dataset).
The result will be in a macro, so it has to be completely flexible.
proc freq data=&data noprint ;
table &group / out=frgroup;
run;
data group1;
set group (keep=&group count ) end=eof;
call symput('gr', _n_);
*REQUESTED code will go here;
run;
I tried these:
var&gr.=.;
call missing(var&gr.);
and a lot of other statements, but none worked.
The result is always the same: the dataset includes only var&gr, where &gr is the maximum n.
It seems that the PDV is overwriting the new variable on each iteration, even though the name is different.
Please keep the solution in a single data step or, at least, make the code take as little time as possible.
Any idea on how can I achieve the requested result?
Thanks.
Macro variables don't work like you think they do. Any macro variable reference is resolved at compile time, so your call symput is changing the value of the macro variable after all the references have been resolved. The reason you are getting results where the &gr is the maximum n is because that is what &gr was as a result of the last time you ran the code.
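The timing difference can be seen in a small sketch like the one below. The values old and new and the variable names compile_time and run_time are purely illustrative.

```sas
%let gr = old;

data _null_;
  /* Runs while the step executes, after &gr below has already been resolved */
  call symputx('gr', 'new');
  /* The macro processor substitutes &gr before the step is compiled */
  compile_time = "&gr";
  /* SYMGET looks the macro variable up while the step executes */
  run_time = symget('gr');
  put compile_time= run_time=;
run;

/* Log: compile_time=old run_time=new */
```

After the step finishes, &gr resolves to new, which is why a second run of the original code appeared to pick up the value from the previous run.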
If you know you can determine the maximum _n_, you can put the max value into a macro variable and declare an array like so:
Find max _n_ and assign value to maxn:
data _null_;
set have end=eof;
if eof then call symputx('maxn',_n_);
run;
Create variables:
data want;
set have;
array var (&maxn);
run;
If you don't want to use PROC TRANSPOSE (if you need 3 columns, you can always run it once per column and then combine the outputs), what you ask can be done with arrays.
First, determine the number of groups (i.e. rows) in the input dataset, then define an array whose dimension equals that number.
The i-th element of the array can then be addressed using _n_ as the index.
In the following code &gr. contains the number of groups:
data group1;
set group;
array arr_counts(&gr.) var1-var&gr.;
arr_counts(_n_)= count;
run;
In SAS there are several methods to determine the number of observations in a dataset. My favorite is the following (it doesn't work with views):
data _null_;
if 0 then set group nobs=n;
call symputx('gr',n);
stop;
run;

How to do simple mathematics with min/max variable in SAS

I am currently running a macro code in SAS and I want to do a calculation with regards to max and min. Right now the line of code I have is :
hhincscaled = 100*(hhinc - min(hhinc) )/ (max(hhinc) - min(hhinc));
hhvaluescaled = 100*(hhvalue - min(hhvalue))/ (max(hhvalue) - min(hhvalue));
What I am trying to do is rescale the household income and value variables with the calculations above: subtract the minimum of each variable from each value, divide by the range (maximum minus minimum), and multiply by 100. I'm not sure if this is the right way or if SAS is interpreting the code the way I want.
I assume you are in a Data Step. A Data Step has an implicit loop over the records in the data set. You only have access to the record of the current loop (with some exceptions).
The "SAS" way to do this is to calculate the Min and Max values and then add them to your data set.
Proc sql noprint;
create table want as
select *,
min(hhinc) as min_hhinc,
max(hhinc) as max_hhinc,
min(hhvalue) as min_hhvalue,
max(hhvalue) as max_hhvalue
from have;
quit;
data want;
set want;
hhincscaled = 100*(hhinc - min_hhinc )/ (max_hhinc - min_hhinc);
hhvaluescaled = 100*(hhvalue - min_hhvalue)/ (max_hhvalue - min_hhvalue);
/*Delete this if you want to keep the min max*/
drop min_: max_:;
run;
Another SAS way of doing this is to create the max/min table with PROC MEANS (or PROC SUMMARY or your choice of alternatives) and merge it on. Doesn't require SQL knowledge to do, and probably about the same speed.
proc means data=have;
*use a class value if you have one;
var hhinc hhvalue;
output out=minmax min= max= /autoname;
run;
data want;
if _n_=1 then set minmax; *get the min/max values- they will be retained automatically and available on every row;
set have;
*do your calculations, using the new variables hhinc_max hhinc_min etc.;
run;
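Filled in end to end, the PROC MEANS approach might look like the sketch below. The three-row dataset is made up for illustration, and the names hhinc_Min, hhinc_Max, hhvalue_Min and hhvalue_Max are the ones AUTONAME is expected to generate from the variable and statistic names.

```sas
/* Hypothetical sample data */
data have;
  input hhinc hhvalue;
  datalines;
10 100
20 300
30 200
;
run;

/* One-row table holding the min and max of each variable */
proc means data=have noprint;
  var hhinc hhvalue;
  output out=minmax(drop=_type_ _freq_) min= max= / autoname;
run;

data want;
  /* Read the single summary row once; its values stay available on every row */
  if _n_ = 1 then set minmax;
  set have;
  hhincscaled   = 100 * (hhinc - hhinc_Min) / (hhinc_Max - hhinc_Min);
  hhvaluescaled = 100 * (hhvalue - hhvalue_Min) / (hhvalue_Max - hhvalue_Min);
  drop hhinc_Min hhinc_Max hhvalue_Min hhvalue_Max;
run;

/* hhincscaled: 0, 50, 100   hhvaluescaled: 0, 100, 50 */
proc print data=want;
run;
```

If a variable is constant, its maximum equals its minimum and the division yields a missing value, so that case may need separate handling.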
If you have a CLASS statement (i.e. a grouping such as state), add it in PROC MEANS and then, instead of the second SET in want, merge by your class variable. The merge requires the initial dataset to be sorted by that variable.
You also have the option of doing this in SAS/IML, which works more like what you have in mind above. IML is the SAS Interactive Matrix Language and is more similar to R or MATLAB than to the base SAS language.