Output when using FIRST and LAST - sas

Say we have the SAS code:
data t1 (keep=KEY COUNT C_AMT2 C_AMT);
SET t1;
BY key;
RETAIN COUNT C_AMT;
IF FIRST.KEY THEN
DO;
COUNT=0;
C_AMT2=0;
END;
COUNT+1;
C_AMT=SUM(C_AMT2, C_AMT);
IF LAST.KEY THEN
OUTPUT;
RUN;
What would change here if I were to remove "IF LAST.KEY THEN OUTPUT;"? The documentation says that OUTPUT causes SAS to write the observation to the data set immediately, not at the end of the data step. Since the statement here sits right before the end of the data step, would removing it make no difference?

Removing it would make a difference.
Without it, the implicit output at the end of each iteration writes a record for every input observation, so every row appears with its running count and total. With the conditional OUTPUT you keep only the last record of each key, which carries the final count and total.
It looks like the step is calculating a count and a total, so there are other ways to achieve this. I'm going to assume that there's some other code that you've left out.
The relevant section of the documentation is at the link you have above:
Implicit versus Explicit Output
By default, every DATA step contains an implicit OUTPUT statement at the end of each iteration that tells SAS to write observations to the data set or data sets that are being created. Placing an explicit OUTPUT statement in a DATA step overrides the automatic output, and SAS adds an observation to a data set only when an explicit OUTPUT statement is executed. Once you use an OUTPUT statement to write an observation to any one data set, however, there is no implicit OUTPUT statement at the end of the DATA step. In this situation, a DATA step writes an observation to a data set only when an explicit OUTPUT executes. You can use the OUTPUT statement alone or as part of an IF-THEN or SELECT statement or in DO-loop processing.
Here's some code that simulates your issue:
*Generate random data;
Data have;
do Key=1 to 2;
do i=1 to 3;
Amount=floor(rand('normal', 50, 5));
OUTPUT;
end;
end;
run;
data t1;
set have;
retain count C_Amt;
by Key;
if first.key then do;
count=0;
C_Amt=0;
end;
Count+1;
c_amt=sum(c_amt, amount);
if last.key then output;
run;
proc print data=t1;
run;
data t1;
set have;
retain count C_Amt;
by Key;
if first.key then do;
count=0;
C_Amt=0;
end;
Count+1;
c_amt=sum(c_amt, amount);
*if last.key then output;
run;
proc print data=t1;
run;
And the corresponding output:
With "if last.key then output":
Obs Key i Amount count C_Amt
1 1 3 46 3 147
2 2 3 44 3 154
And without the last.key condition (implicit output on every iteration):
Obs Key i Amount count C_Amt
1 1 1 47 1 47
2 1 2 54 2 101
3 1 3 46 3 147
4 2 1 61 1 61
5 2 2 49 2 110
6 2 3 44 3 154
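As noted above, a per-key count and total can be obtained in other ways. Here is a minimal sketch with PROC SQL against the simulated have data set; the output name t1_sql is only an illustration and is not from the original code.

```sas
proc sql;
  create table t1_sql as
  select Key,
         count(*)    as Count,  /* number of rows per key */
         sum(Amount) as C_Amt   /* total Amount per key */
  from have
  group by Key;
quit;
```

This should give one row per Key, like the LAST.KEY version, and it does not require the input to be sorted.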

Commas are an error here:
(keep=KEY, COUNT, C_AMT2, C_AMT)
Anyway:
RUN;
usually means:
output;
return;
But if SAS encounters an explicit OUTPUT statement anywhere in your code, the implicit output at the end (implied by the RUN statement) is dropped.
Hence, since your OUTPUT statement executes only IF LAST.KEY, your data set will contain only the observations flagged as last.key, because your RUN; now means only return.
Something like:
data want; set have; output; run;
is exactly the same as leaving the OUTPUT implicit:
data want; set have; run;
You can use output as you want:
data want01 want02;
set have;
if a then output want01;
if b then output want02;
run;
data want01;
var=var1;
output;
var=var2;
output;
run;

Related

proc summary with statistic "multiply"

Is it possible to make a new statistic with proc summary that multiplies every value in each column, for example instead of just mean? SAS is so rigid it makes me crazy.
data test;
input b c ;
datalines;
50 11
35 12
75 13
;
Desired output would be 50*35*75 and 11*12*13, plus _FREQ_ (as in the normal proc summary output).
This is an uncommon aggregate, so you essentially need to roll your own. Since a data step loops over the rows, this is easily done by using RETAIN to carry the running product from row to row and outputting the result at the last record.
Data want;
Set test end=eof;
Retain prod_b prod_c 1; /* start both products at 1; a missing start value would stay missing */
prod_b = prod_b * b;
prod_c = prod_c * c;
Freq = _n_;
If eof then OUTPUT;
Keep prod: freq;
Run;
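If every value is strictly positive, the same product can also be sketched in PROC SQL by summing logarithms and exponentiating, since exp(log(a) + log(b)) = a*b. This is an alternative to the data step, not something the original answer proposed. The table name want_sql is made up, and the result is floating-point, so it may need rounding when exact integers are expected.

```sas
proc sql;
  create table want_sql as
  select exp(sum(log(b))) as prod_b,  /* product of b; valid only when all b > 0 */
         exp(sum(log(c))) as prod_c,  /* product of c; valid only when all c > 0 */
         count(*)         as freq     /* number of rows, like _FREQ_ */
  from test;
quit;
```

A zero or negative value makes LOG return missing, so the data step with RETAIN is the safer general-purpose choice.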

does if statement in datastep check for missing value

Just curious is this code:
data Bla.SomeGreatNewDataset;
set WORK.InputTempDataset;
by SomeColumnName;
if first.SomeColumnName then output;
else delete;
run;
the same as:
data Bla.SomeGreatNewDataset;
set WORK.InputTempDataset;
by SomeColumnName;
if not missing(first.SomeColumnName) then output;
else delete;
run;
in other words does:
if first.SomeColumnName
just check if SomeColumnName does not contain a missing value?
Short answer, no.
BY-group processing with first.var and last.var operates on the distinct values of the variable. A missing value is a valid BY value and forms its own group.
first.var and last.var are Boolean values, either 1 or 0, and are never missing. Your code outputs just the first record for each unique value of SomeColumnName.
Note, the data needs to either be sorted by SomeColumnName or have an index on that column.
Here is an example:
data have;
input x;
datalines;
1
2
2
.
3
3
3
;
run;
proc sort data=have;
by x;
run;
data want;
set have;
by x;
if first.x;
run;
proc print data=want;
run;
Produces:
Obs x
1 .
2 1
3 2
4 3
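To look at the flags themselves, one option is to copy them into ordinary variables, because FIRST.x and LAST.x are temporary and are not written to the output data set. Here is a small sketch using the sorted have data set from the example; the names flags, first_x and last_x are chosen here for illustration.

```sas
data flags;
  set have;
  by x;
  first_x = first.x;  /* 1 on the first row of each BY group, else 0 */
  last_x  = last.x;   /* 1 on the last row of each BY group, else 0 */
run;

proc print data=flags;
run;
```

The row where x is missing gets first_x=1 and last_x=1 like any other single-row group, which shows that the flags describe group boundaries rather than missingness.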

Delete the group that none of its observation contain the certain value in SAS

I want to delete every group in which no observation has NUM=14.
So something like this:
Original DATA
ID NUM
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
Since none of the ID=2 observations contain NUM=14, I delete group 2.
And it should look like this:
ID NUM
1 14
1 12
1 10
3 14
3 10
This is what I have so far, but it doesn't seem to work.
data originaldat;
set newdat;
by ID;
If first.ID then do;
IF NUM EQ 14 then Score = 100;
Else Score = 10;
end;
else SCORE+1;
run;
data newdat;
set newdat;
If score LT 50 then delete;
run;
An approach using proc sql would be:
proc sql;
create table newdat as
select *
from originaldat
where ID in (
select ID
from originaldat
where NUM = 14
);
quit;
The sub query selects the IDs for groups that contain an observation where NUM = 14. The where clause then limits the selected data to only these groups.
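The same filter can also be sketched without a subquery, relying on PROC SQL's ability to remerge a group-level statistic back onto the detail rows. The comparison NUM = 14 evaluates to 1 or 0, so its maximum within a group is 1 exactly when the group contains a 14. The table name newdat_having is hypothetical.

```sas
proc sql;
  create table newdat_having as
  select *
  from originaldat
  group by ID
  having max(NUM = 14) = 1;  /* keep groups where at least one row has NUM = 14 */
quit;
```

SAS writes a note to the log about remerging summary statistics, and the row order within each group is not guaranteed, so add an ORDER BY clause if the order matters.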
The equivalent data step approach would be:
/* Get all the groups that contain an observation where NUM = 14 */
data keepGroups;
set originaldat;
if NUM = 14;
keep ID;
run;
/* Sort both data sets to ensure the data step merge works as expected */
proc sort data = originaldat;
by ID;
run;
/* Make sure there are no duplicates values in the groups to be kept */
proc sort data = keepGroups nodupkey;
by ID;
run;
/*
Merge the original data with the groups to keep and only keep records
where an observation exists in the groups to keep dataset
*/
data newdat;
merge
originaldat
keepGroups (in = k);
by ID;
if k;
run;
In both data steps the subsetting IF statement is used to output observations only when the condition is met. In the second case k is a temporary variable with the value 1 (true) when a record is read from keepGroups and 0 (false) otherwise.
You're sort of getting at a DoW loop here, but not quite doing it right. The problem (assuming the DATA/SET names are mistyped and not actually wrong in your program) is that the first data step doesn't attach that 100 to every row of the group, only to the 14 row. What you need is one row per ID value carrying a keep/don't-keep decision.
One way to do this is to keep the idea of your first data step, but RETAIN the flag and output only one row per ID. Your own code would actually work if you just fixed the data/set typo, but only when 14 is the first row of its group.
data originaldat;
input ID NUM ;
datalines;
1 14
1 12
1 10
2 13
2 11
2 10
3 14
3 10
;;;;
run;
data has_fourteen;
set originaldat;
by ID;
retain keep;
If first.ID then keep=0;
if num=14 then keep=1;
if last.id then output;
keep ID keep; /* drop NUM so the merge below does not overwrite NUM from originaldat */
run;
data newdata;
merge originaldat has_fourteen;
by id;
if keep=1;
run;
That works by merging the value from a 1-per-ID to the whole dataset.
A double DoW also works.
data newdata;
keep=0;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if num=14 then keep=1;
end;
do _n_=1 by 1 until (last.id);
set originaldat;
by id;
if keep=1 then output;
end;
run;
This works because it iterates over the dataset twice; for each ID, it iterates once through all records, looking for a 14, if it finds one then setting keep to 1. Then it reads all records again for that ID, and keeps if keep=1. Then it goes on to the next set of records by ID.
data in;
input id num;
cards;
1 14
1 12
1 10
2 16
2 13
3 14
3 67
;
/* To find out the list of groups which contains num=14, use below SQL */
proc sql;
select distinct id into :lst separated by ','
from in
where num = 14;
quit;
/* If you want to create a new data set with only groups containing num=14 then use following data step */
data out;
set in;
where id in (&lst.);
run;

Copy a value from a line below in SAS

I hope this is not a duplicate question. I've searched the forum, and the RETAIN statement seems to be the weapon of choice, but it copies a value down from an earlier observation, and I'm trying to do the following: for a given id, copy the x value from the second line up to the first line. Also, the first value of x is always 2.
Here's my data;
id x
3 2
3 1
3 1
2 2
2 1
2 1
6 2
6 0
6 0
and i want it to look like this;
id x
3 1
3 1
3 1
2 1
2 1
2 1
6 0
6 0
6 0
and here's the starter code;
data have;
input id x;
cards;
3 2
3 1
3 1
2 2
2 1
2 1
6 2
6 0
6 0
;
run;
Lead is tricky in SAS. You can sort in reverse and use a lag function to get around it though, and you are right: a retain statement will allow us to add an order variable so we can sort it back to its original format.
data have;
set have;
retain order;
lagid = lag(id);
if id ne lagid then order = 0;
order = order + 1;
drop lagid;
run;
proc sort data=have; by id descending order; run;
data have;
set have;
leadx = lag(x);
run;
proc sort data=have; by id order; run;
data have;
set have;
if order = 3 then x_fixed = x;
else x_fixed = leadx;
run;
If your data is exactly as you say, then you can use a lookahead merge. It takes the dataset and merges it side by side with a copy of itself that starts on row 2. You just have to check that you're still on the same ID. Note that this changes x on every record to the value from the following record, not just on the first record of each ID; you could add code to restrict that, but you can't use FIRST and LAST because there is no BY statement.
data want;
merge have have(firstobs=2 rename=(id=newid x=newx));
if newid=id then x=newx;
keep x id;
run;
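One way to restrict the lookahead merge to the first row of each id, without FIRST and LAST, is to compare id with the previous row's id via LAG. This is only a sketch under the same assumptions as the merge itself (rows grouped by id, id never missing); the names want_first and previd are made up for illustration.

```sas
data want_first;
  merge have have(firstobs=2 rename=(id=newid x=newx));
  previd = lag(id);  /* id of the previous row; missing on the very first row */
  /* a change of id marks the first row of a group; only then take the next row's x */
  if id ne previd and newid = id then x = newx;
  keep id x;
run;
```

LAG is called unconditionally on every row so that its queue stays in step with the data.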
If you don't have any additional variables of interest, then you can do something even more interesting: duplicate the second row in its entirety and delete the first row.
data want;
set have;
by id notsorted;
if first.id then do;
firstrow+1;
delete;
end;
if firstrow=1 then do;
firstrow=0;
output;
end;
output;
run;
However, the "safest" method (in terms of doing most likely what you want precisely) is the following, which is a DoW loop.
data want;
idcounter=0;
do _n_ = 1 by 1 until (last.id);
set have;
by id notsorted;
idcounter+1;
if idcounter=2 then second_x = x;
end;
do _n_=1 by 1 until (last.id);
set have;
by id notsorted;
if first.id then x=second_x;
output;
end;
run;
This identifies the second x in the first loop, for that BY group, then in the second loop sets it to the correct value for row 1 and outputs.
In both of the latter examples I assume your data is organized by ID but not truly sorted (like your initial example is). If it's not organized by ID, you need to perform a sort first (but then can remove the NOTSORTED).

How to find max value of variable for each unique observation in "stacked" dataset

Sorry for the vague title.
My data set looks essentially like this:
ID X
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
What I want is to find the max value of x for each ID. In this dataset, that would be 2 for ID=18, 3 for ID=369, and 1 for ID=361.
Any feedback would be greatly appreciated.
Proc Means with a class statement (so you don't have to sort) and requesting the max statistic is probably the most straightforward approach (untested):
data sample;
input id x;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;
proc means data=sample noprint max nway missing;
class id;
var x;
output out=sample_max (drop=_type_ _freq_) max=;
run;
Check out the online SAS documentation for more details on Proc Means (http://support.sas.com/onlinedoc/913/docMainpage.jsp).
I don't quite understand your example. I can't imagine that the input data set really has all the values in one observation. Do you instead mean something like this?
data sample;
input myid myvalue;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
proc sort data=sample;
by myid myvalue;
run;
data result;
set sample;
by myid;
if last.myid then output;
run;
proc print data=result;
run;
This would give you this result:
Obs myid myvalue
1 18 2
2 361 1
3 369 3
If you want to keep both all records and the max value of X by id, I would use either the PROC MEANS approach followed by a merge, or sort the data by ID and DESCENDING X first and then use the RETAIN statement to create the max value directly in the data step:
PROC SORT DATA=A; BY ID DESCENDING X; RUN;
DATA B; SET A;
BY ID;
RETAIN X_MAX;
IF FIRST.ID THEN X_MAX = X; /* first row per ID holds the largest X after the sort */
RUN;
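The PROC MEANS plus merge route mentioned above could look roughly like this. It is a sketch that assumes the same input data set A with variables ID and X; the names A_MAX and B2 are hypothetical.

```sas
proc sort data=a;
  by id;
run;

/* one row per ID holding the maximum of X */
proc means data=a noprint nway;
  by id;
  var x;
  output out=a_max (drop=_type_ _freq_) max=x_max;
run;

/* attach the per-ID maximum to every original row */
data b2;
  merge a a_max;
  by id;
run;
```

Because A_MAX contains only ID and X_MAX, the merge adds the new column without touching X on any row.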
You could try this:
PROC SQL;
CREATE TABLE CHCK AS SELECT MYID, MAX(MYVALUE) AS MAX_VALUE FROM SAMPLE
GROUP BY MYID;
QUIT;
A couple of more over-engineered options that might be of interest for anyone who needs to do this with a really big dataset, where performance is more of a concern:
If your dataset is already sorted by ID, but not by X within each ID, you can still do this in a single data step without any sorting, using a retained max within each by group. Alternatively, you can use proc means (as per the top answer) but with a by statement rather than a class statement - this reduces the memory usage.
data sample;
input id x;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;
data want;
do until(last.ID);
set sample;
by ID notsorted; /* sample is grouped by ID but not in ascending order (18, 369, 361) */
xmax = max(x, xmax);
end;
x = xmax;
drop xmax;
run;
Even if your dataset is not sorted by ID, you can still do this in one data step, without sorting it, by using a hash object to keep track of the maximum x value you've found for each ID as you go along. This will be a little faster than proc means and will typically use less memory, as proc means does various calculations in the background which are not needed in the output dataset.
data _null_;
set sample end = eof;
if _n_ = 1 then do;
call missing(xmax);
declare hash h(ordered:'a');
rc = h.definekey('ID');
rc = h.definedata('ID','xmax');
rc = h.definedone();
end;
rc = h.find();
if rc = 0 then do;
if x > xmax then do;
xmax = x;
rc = h.replace();
end;
end;
else do;
xmax = x;
rc = h.add();
end;
if eof then rc = h.output(dataset:'want2');
run;
In this example, on my PC, the hash approach used this much memory:
memory 966.15k
OS Memory 27292.00k
vs. this much for an equivalent proc summary:
memory 8706.90k
OS Memory 35760.00k
Not a bad saving if you really need it to scale up!
Use an appropriate proc with the by statement. For instance,
data sample;
input myid myvalue;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;
proc sort data=sample;
by myid;
run;
proc means data=sample;
var myvalue;
by myid;
run;
I would just sort by x and id, putting the highest value for each ID at the top.
A second sort by id with NODUPKEY then keeps the first row for each id and removes every duplicate below it.
proc sort data=yourstacked_data out=yourstacked_data_sorted;
by DESCENDING x id;
run;
proc sort data=yourstacked_data_sorted NODUPKEY out=top_value_only;
by id;
run;
A multidata hash should be used if you want the result to show every id at the max value, that is, for cases where more than one id shares the maximum.
Example code:
Find the ids associated with the max value of 40 different numeric variables.
The code is a Proc DS2 data program.
data have;
call streaminit(123);
do id = 1 to 1e5; %* 100,000 rows;
array v v1-v40; %* 40 different variables;
do over v; v=ceil(rand('uniform', 2e5)); end;
output;
end;
run;
proc ds2;
data _null_;
declare char(32) _name_ ; %* global declarations;
declare double value id;
declare package hash result();
vararray double v[*] v:; %* variable based array, limit yourself to 1,000;
declare double max[1000]; %* temporary array for holding the vars maximum values;
method init();
declare package sqlstmt s('drop table want'); %* DS2 version of `delete`;
s.execute();
result.keys([_name_]); %* instantiate a multidata hash;
result.data([_name_ value id]);
result.multidata();
result.ordered('ascending');
result.defineDone();
end;
method term();
result.output('want'); %* write the results to a table;
end;
method run();
declare int index;
set have;
%* process each variable being examined for 'id at max';
do index = 1 to dim(v);
if v[index] > max[index] then do; %* new maximum for this variable ?;
_name_ = vname(v[index]); %* retrieve variable name;
value = v[index]; %* move value into hash host variable;
if not missing (max[index]) then do;
result.removeall(); %* remove existing multidata items associated with the variable;
end;
result.add(); %* add new multidata item to hash;
max[index] = v[index]; %* track new maximum;
end;
else
if v[index] = max[index] then do; %* two or more ids have same max;
_name_ = vname(v[index]);
value = v[index];
result.add(); %* add id to the multidata item;
end;
end;
end;
enddata;
run;
quit;
%let syslast=want;
Reminder: by default Proc DS2 does not overwrite existing tables. To 'overwrite' a table you need to either use the table option overwrite=yes where the syntax allows it (the package hash .output() method does not recognize the table option), or drop the table before recreating it.
The above code can be used in a Base SAS DATA step with minor modifications.