trouble with why a variable is taking on a certain value - if-statement

I am trying to create a variable that will flag (to a "1") when it hits a certain number (when there is improvement in a process). I am then trying to reset the baseline, so that a new baseline (threshold) has to be hit for it to be flagged. the data set starts off with just one variable (x). I create another one from the first observation called "baseline", so I will compare all other "x's" to baseline. once I hit a threshold, I want to change the baseline to the threshold it just hit.
here is the relevant part of the code (note I have already created code that determined baseline earlier in program).
data combo;
set combo;
if (baseline-x)/8 >1 then do;
flag=1;
baseline=x;
end;
else
flag=0;
run;
here is the relevant part of the output.
I am expecting flag to be 1 (which it is) for the third observation (because baseline started out at 259, then moved to 251 as I want it to. but why is flag=1 after that? The threshold is not met. can anyone help? thanks John

I think you need another parentheses in your condition like below.
I run here and after all flags became zero.
if ((baseline-x)/8) >1 then
do;
flag=1;
baseline=x;
end;
else
flag=0;
run;

The data step is overwriting the original value of BASELINE after it sets the FLAG variable to 1. So we cannot see what value it had when read from the original value of the COMBO dataset, but we can assume it was at least 8 more than X to cause it to go down that branch of the IF statement.

You need a separate variable to keep track of the current baseline. You can use RETAIN to do this.
data out;
set combo;
** Keep the value of this for each observation in the data set **;
retain current_baseline;
** Initialize baseline to starting value for data set **;
if _n_ = 1 then current_baseline = baseline;
if (current_baseline - x) / 8 < 1 then do;
flag = 1;
** Update current_baseline to new value since flag has been tripped **;
current_baseline = x;
end;
else flag = 0;
** If you want to store the value of baseline for later viewing, you can **;
baseline = current_baseline;
run;
Note that you really only need the values of x and the initial baseline value to run this. Let's say your initial baseline is x - 8. Then you can simply modify the initialization line to
** Initialize baseline to starting value for data set **;
if _n_ = 1 then current_baseline = x - 8;
Then you can run this with your raw data set with only the values for x.

Related

SAS do-Loops and set statements

Why does this macro work? (it does) The loop is able to start, despite the fact that the nrows variable is defined in the set statement inside the loop. Does SAS read the set statement before starting the loop? Where can I find documentation on this issue (which statements inside loops, if any, are executed before the loop starts)?
%macro get_last_n_rows(n, existing, new);
data &new.;
do _i_ = 1 + nrows - &n. to nrows;
set &existing. point = _i_ nobs = nrows;
output;
end;
stop;
run;
%mend get_last_n_rows;
The short answer to your question is: yes, SAS reads the number of rows available prior to the loop executing. In fact, SAS reads the number of rows available before the data step executes; it's determined at data step compile time. See for example, this paper and this paper, among many others.
See specifically the SAS documentation for SET:
At compilation time, SAS reads the descriptor portion of each data set and assigns the value of the NOBS= variable automatically. Thus, you can refer to the NOBS= variable before the SET statement. The variable is available in the DATA step but is not added to any output data set.
Note this has nothing to do with the do loop; this is true for the entire data step (which is itself one large loop, of course).

Easy help about Retain statement in SAS

I am very new to sas so I am still learning retain statement; I have a datasets with ID and total spending; I want to get the cumulative spending of each customer by using the retain statement; my codes were following;
data ex03.try1; set ex03.sorted;
by ID;
if first.ID then do;
retain total 0;
total = total+amount; end;
else do; total=total+amount; end;
run;
However, my codes do not really set the initial value for total to be 0 for each new ID; would anyone please help me to understand where I did wrong please;
Appreciated it;
Thanks again;
The RETAIN statement is evaluated during compilation of the data step. Where you place it doesn't really matter much, it will have the same effect. In particular placing it inside of a conditional does nothing. The RETAIN statement tells SAS that it is not to set the value to missing when the next iteration of the data step starts. The optional initial value on the retain statement will set the value before the first iteration of the data step.
To change the value for each new ID value you need to use an actual assignment statement that will actually do something during the running of the data step.
You can use a SUM statement and make your code shorter. Using the SUM statement implies that the variable is retained and initialized to zero.
data want;
set have;
by id;
if first.id then total=0;
total+amount;
run;
Note that the SUM statement will also handle missing values of the AMOUNT variable. It is really the equivalent of:
retain total 0;
total=sum(total,amount);
I have gotten it worked by putting retain total 0 outside of the if then else statement like
by ID;
retain total 0;
if first.ID then do;
total=0;
total=total+amount;
else do;
total=total+amount;
run;
But still, could anyone explain to me why my pervious codes didn't work. I was thinking if it is a new ID, then set total to be 0 otherwise just keep adding the value. I guess I must be wrong about it
Thanks again;

sas collapsing categorical variables clustering analysis

I came across the following code in the logistic regression modeling course offered by SAS:
data dataset(drop=i);
set data;
array mi{*} mi_Ag mi_Inc
mi_WR;
array x{*} Ag Inc WR;
do i=1 to dim(mi);
mi{i}=(x{i}=.);
end;
run;
I need to understand two things:
1.) there is a column created titled "i" once this data step is run. What does that signify and why is there. The drop "i" essentially drops it but if i don't use drop option the column stays in the data set
2.) this do step is replacing all the missing values with a 1 and rest with 0. How is that happening when nothing is clearly specified in the do step as to what needs to be done. In my eyes, "do i=1 to dim(mi); mi{i}=(x{i}=.);" should simply put dots in mi(i) wherever it finds dots in x(i).
Part 2:
While collapsing the categorical variable, following code has been used:
proc freq data=example1 noprint;
tables CLUSTER_CODE*TARGET_B / chisq;
output out=out_chi(keep=_pchi_) chisq;
run;
data ex_cutoff;
if _n_=1 then set out_chi;
set ex_cluster;
chisquare=_pchi_*rsquared;
degfree=numberofclusters-1;
logpvalue=logsdf('CHISQ',chisquare,degfree);
run;
what is n=1 doing ? and also, why are we creating chisquare=_pchi*rsquared. pchi is already chisquare so whats the point of multiplying it with R square?
Thanks
P.S. The code is from one of the SAS learning courses. Hopefully I am allowed to share it here for discussion/learning purposes.
i is the array iterator (created in the do loop). It's dropped since it's not really intended to be kept on the dataset, it's just an iterator (letting you go through the array one element at a time and during that iteration letting you reference a single element).
mi{i}=(x{i}=.); is assigning 1/0 like this:
x(i)=. is either true or false. If it is true, it evaluates to a 1. If it is false, it evaluates to 0. Thus when it's true that x(i)=. then m(i) is assigned a 1; otherwise it is assigned a 0. That's just how SAS works with boolean (True/False) values; many other langauges work that way as well (True is nonzero, False is zero); and when converted to number, True is converted to 1 (but any nonzero nonmissing value is 'True' when converted the other way around).

SAS: How do I point to a specific observation of a value?

I'm very new to SAS and I'm trying to figure out some basic things available in other languages.
I have a table
ID Number
-- ------
1 2
2 5
3 6
4 1
I would like to create a new variable where I sum the value of one observation of Number to each other observations, like
Number2 = Number + Number[3]
ID Number Number2
-- ------ ------
1 2 8
2 5 11
3 6 12
4 1 7
How to I get the value of third observation of Number and add this to each observation of Number in a new variable?
There are several ways to do this; here is one using the SAS POINT= option:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
run;
data want;
retain adder;
drop adder;
if _n_=1 then do;
adder = 3;
set have point=adder;
adder = number;
end;
set have;
number = number + adder;
run;
The RETAIN and DROP statements define a temp variable to hold the value you want to add. RETAIN means the value is not to be re-initialized to missing each time through the data step and DROP means you do not want to include that variable in the output data set.
The POINT= option allows one to read a specific observation from a SAS data set. The _n_=1 part is a control mechanism to only execute that bit of code once, assigning the variable adder to the value of the third observation.
The next section reads the data set one observation at a time and adds applies your change.
Note that the same data set is read twice; a handy SAS feature.
I'll start by suggesting that Base SAS doesn't really work this way, normally; it's not that it can't, but normally you can solve most problems without pointing to a specific row.
So while this answer will solve your explicit problem, it's probably not something useful in a real world scenario; usually in the real world you'd have a match key or some other element other than 'row number' to combine with, and if you did then you could do it much more efficiently. You also likely could rearrange your data structure in a way that made this operation more convenient.
That said, the specific example you give is trivial:
data have;
input ID Number;
datalines;
1 2
2 5
3 6
4 1
;;;;
run;
data want;
set have;
_t = 3;
set have(rename=number=number3 keep=number) point=_t ;
number2=number+number3;
run;
If you have SAS/IML (SAS's matrix language), which is somewhat similar to R, then this is a very different story both in your likelihood to perform this operation and in how you'd do it.
proc iml;
a= {1 2, 2 5, 3 6, 4 1}; *create initial matrix;
b = a[,2] + a[3,2]; *create a new matrix which is the 2nd column of a added
elementwise to the value in the third row second column;
c = a||b; *append new matrix to a - could be done in same step of course;
print b c;
quit;
To do this with the First observation, it's a lot easier.
data want;
set have;
retain _firstpoint; *prevents _firstpoint from being set to missing each iteration;
if _n_ = 1 then _firstpoint=number; *on the first iteration (usually first row) set to number's value;
number = number - _firstpoint; *now subtract that from number to get relative value;
run;
I'll elaborate a little more on this. SAS works on a record-by-record level, where each record is independently processed in the DATA step. (PROCs on the other hand may not behave this way, though many do at some level). SAS, like SQl and similar databases, doesn't truly acknowledge that any row is "first" or "second" or "nth"; however, unlike SQL, it does let you pretend that it is, based on the current sort. The POINT= random access method is one way to go about doing that.
Most of the time, though, you're going to be using something in the data to determine what you want to do rather than some related to the ordering of the data. Here's a way you could do the same thing as the POINT= method, but using the value of ID:
data want;
if n = 1 then set have(where=(ID=3) rename=number=number3);
set have;
number2=number+number3;
run;
That in the first iteration of the data step (_N_=1) takes the row from HAVE where Id=3, and then takes the lines from have in order (really it does this:)
*check to see if _n_=1; it is; so take row id=3;
*take first row (id=1);
*check to see if _n_=1; it is not;
*take second row (id=2);
... continue ...
Variables that are in a SET statement are automatically retained, so NUMBER3 is automatically retained (yay!) and not set to missing between iterations of the data step loop. As long as you don't modify the value, it will stay for each iteration.

Understanding the SAS PDV in by-group processing

While I've read quite a bit about conceptualizing the Program Data Vector when using a SAS data step, I still don't understand how the PDV works when there is by group processing. For example if I have the dataset olddata
GROUP VAL
A 10
A 5
B 20
And I call a datastep on it with a by statement, such as:
data newdata;
set olddata;
by group;
...
run;
then the compiler adds two temporary variables to the PDV: first.group and last.group. When you read any tutorial on the PDV it will tell you that on the first pass of the SET statement, the PDV will look like:
_N_ _ERROR_ FIRST.GROUP LAST.GROUP GROUP VAL
1 0 1 0 A 10
and that LAST.GROUP is zero because observation 1 is not the last observation in group A.
Herein lies my question: How does SAS know that this is not the last observation?
If SAS is processing olddata row-by-row, how is the PDV aware that the next row holds another group A observation instead of a new group? In other words, it seems like SAS must be using information from previous or future rows to update the FIRST and LAST variables, but I'm not sure how. Is there some trick in how the PDV retains values from row to row when the BY statement is called?
SAS actually looks ahead to the next record to see if it should set LAST.(var) or not. I haven't been able to find an article explaining that in any detail, unfortunately. I was a bit disappointed to see that even papers like http://www.wuss.org/proceedings09/09WUSSProceedings/papers/ess/ESS-Li1.pdf just gloss over how LAST is detemined.
SAS also looks ahead to see if the END= variable should be set, when specified, and a few other things. It's not just using metadata to determine those; you can remove or modify records without modifying the metadata, and it will still work - and SQL tables that don't have the usual SAS metadata will still allow you to perform normal BY group processing and such.
The FIRST variable doesn't need a look-behind, of course; it remembers where it was after all.
Edit: I crossposted this to SAS-L, and got the same answer - there doesn't seem to be any documentation of the subject, but it must read ahead. See http://listserv.uga.edu/cgi-bin/wa?A1=ind1303a&L=sas-l#8 for example.
Edit2: From SAS-L, Dan Nordlund linked to a paper that confirms this. http://support.sas.com/resources/papers/proceedings12/222-2012.pdf
The paper's logic that confirms the lookahead - look at the number of observations read from the data set.
DATA DS_Sample1;
Input Sum_Var
Product;
Cards;
100 3
100 2
100 1
;
*With BY statement - reads 3 observations even though it stops after 2.;
DATA DS_Sample2;
Set DS_Sample1;
by Sum_Var;
cnt+1; If CNT > 1 then stop;
Run;
*no BY statement - reads 2 observations as expected;
DATA DS_Sample2;
Set DS_Sample1;
cnt+1; If CNT > 1 then stop;
Run;
* END statement - again, a lookahead;
DATA DS_Sample2;
Set DS_Sample1 end=eof;
cnt+1; If CNT > 1 then stop;
Run;