I'm writing a program in SAS.
Here's the dataset I have:
id huuse days
1 0 4
1 0 3
1 1 12
1 1 1
1 2 15
2 1 13
2 0 16
2 1 18
2 0 44
For each ID, I want to delete the record if variable huuse ne 1, until I get to the first huuse=1. Then I want to keep that record and all subsequent records for that id, no matter what value huuse is. So for id=1, I want to delete the first two records than keep all records for id=1 starting with the 3rd record. For id=2, the first record has huuse=1, so I want to keep all records for id=2.
The data set I want should look like this:
id huuse days
1 0 4
1 0 3
1 1 12
1 1 1
1 2 15
2 1 13
2 0 16
2 1 18
2 0 44
I tried this code, but it removes all records that have huuse ne 1.
data want;
set have;
by id;
do until (huuse=1);
if huuse = 1 then LEAVE;
if huuse ne 1 then DELETE;
END;
run;
I've tried several variations of do loops, but they all do the same thing.
The DATA step is a program with an implicit loop that reads every record of the data set specified in the SET statement. Any program data vector (pdv) variables not coming from the data set are, by default, reset to missing at the top of the implicit loop. You change that behavior using a RETAIN statement to name variables that should not get reset.
So, in your problem you have two situations when a tracking variable is needed. The variable will track the state of the condition Have I seen huuse=1 yet in this group ?. Call this variable one_flag
RETAIN one_flag; so you control when it's value changes
At the start of a BY group one_flag needs to be reset to false (0)
When huuse is first seen as 1 set the flag to true (1)
Example:
data want(drop=one_flag);
set have;
by id;
retain one_flag 0;
if first.id then one_flag = 0;
if not one_flag and huuse = 1 then one_flag = 1;
if one_flag then OUTPUT; * want all rows in group starting at first huuse=1;
run;
You can place the SET and BY statement inside an explicit DO and that changes the operating behavior of the program, especially if the explicit loop is terminated according to a LAST.<var> automatic variable. Such a loop is commonly called a DOW loop by SAS programmers. There is no phrase DOW loop in the SAS documentation.
Example:
data want;
do until (last.id);
set have;
by id;
if not one_flag and huuse=1 then one_flag = 1;
if one_flag then OUTPUT; * want all rows in group starting at first huuse=1;
end;
run;
Because the looping is explicit and never reaches the TOP of the program with in the loop, there is no need to RETAIN the flag variable, nor reset it. Program variables that are not retained are reset automatically at the top of the program, and the top of the program is only reached at the start of the BY group. Learn more about this programming construct in the SGF 2013 paper "The Magnificent DO", Paul M. Dorfman
Your source and result are same :-)
But if I understood your question correctly the solution is quite simple with a retain solution. I add 2 lines to the example to make it clear that I understood correctly.
The code with example table:
data test;
id=1;huuse=0;days=4;output;
id=1;huuse=0;days=3;output;
id=1;huuse=1;days=12;output;
id=1;huuse=1;days=1;output;
id=1;huuse=2;days=15;output;
id=2;huuse=1;days=13;output;
id=2;huuse=0;days=16;output;
id=2;huuse=1;days=18;output;
id=2;huuse=0;days=44;output;
id=3;huuse=0;days=1;output;
id=3;huuse=1;days=2;output;
run;
data test_output;
set test;
retain keep_id -1;
if (keep_id ne id and huuse ne 0) then keep_id=id;
if keep_id = id then output;
run;
/* the results:
id huuse days
1 1 12 1
1 1 1 1
1 2 15 1
2 1 13 2
2 0 16 2
2 1 18 2
2 0 44 2
3 1 2 3
*/
Related
I have 3 variables and a counter has to be created based on them.
Input:
ID window start window end
1 29oct20 12mar21
1 31oct20 08Feb21
1 31oct21 08feb21
1 31oct21 08feb21
2 06Nov20 11Apr21
2 06Nov20 11Apr21
2 27Nov20 01Apr19
Expected output:
ID window start window end priority_count
1 29oct20 12mar21 1
1 31oct20 08Feb21 2
1 31oct21 08feb21 2
1 31oct21 08feb21 2
2 06Nov20 11Apr21 1
2 06Nov20 11Apr21 1
2 27Nov20 01Apr19 2
So for every ID a new count should start once a new date comes.
I have been using this code
data want;
set have;
by ID window_start window_end;
if first.ID and first.window_start and first.window_endthen priority_count=1;
else priority_count+1;
run;
But it gives:
priority_count
1
2
3
4
1
2
3
Not sure if those are typos but there are several observations for which window_start is after window_end.
Using the LAG function
data want;
set have;
by id;
_lag=lag(window_start);
if first.id then priority_count=1;
else do;
if window_start ne _lag then
priority_count + 1;
end;
drop _lag;
run;
ID window_start window_end priority_count
1 29OCT2020 12MAR2021 1
1 31OCT2020 08FEB2021 2
1 31OCT2020 08FEB2021 2
1 31OCT2020 08FEB2021 2
2 06NOV2020 11APR2021 1
2 06NOV2020 11APR2021 1
2 27NOV2020 01APR2019 2
I think you're on the right track but need a slight modifications on your IF statements to reflect the logic.
Set to 0 at first of each ID
Increment if the window_end changes (or window_start since they're consistent in your example). Setting it to 0 initially means you can increment without worrying if it's the first or not.
data want;
set have;
by ID window_start window_end;
if first.ID then priority_count=0;
if first.window_end then priority_count+1;
run;
I'd like to ask help in this, as I am new to SAS, but a PROC SQL approach is usable as well.
My dataset has IDs, a time variable, and a flag. After I sort by id and time, I need to find the first flagged observation of the last flagged group/streak. As in:
ID TIME FLAG
1 2 1
1 3 1
1 4 1
1 5 0
1 6 1
1 7 0
1 8 1
1 9 1
1 10 1
2 2 0
2 3 1
2 4 1
2 5 1
2 6 1
2 7 1
Here I want my script to return the row where time is 8 for ID 1, as it is the first observation from the last "streak", or flagged group. For ID 2 it should be where time is 3.
Desired output:
ID TIME FLAG
1 8 1
2 3 1
I'm trying to wrap my head around using first. and last. here, but I suppose the problem here is that I view temporally displaced flagged groups/streaks as different groups, while SAS looks at them as they are only separated by flag, so a simple "take first. from last." is not sufficient.
I was also thinking of collapsing the flags to a string and using a regex lookahead, but I couldn't come up with either the method or the pattern.
I would just code a double DOW loop. The first will let you calculate the observation for this ID that you want to output and the second will read through the records again and output the selected observation.
You can use the NOTSORTED keyword on the BY statement to have SAS calculate the FIRST.FLAG variable.
data have;
input ID TIME FLAG;
cards;
1 2 1
1 3 1
1 4 1
1 5 0
1 6 1
1 7 0
1 8 1
1 9 1
1 10 1
2 2 0
2 3 1
2 4 1
2 5 1
2 6 1
2 7 1
;
data want;
do obs=1 by 1 until(last.id);
set have;
by id flag notsorted;
if first.flag then want=obs;
end;
do obs=1 to obs;
set have;
if obs=want then output;
end;
drop obs want;
run;
Loop through the dataset by id. Use the lag function to look at the current and previous value of flag. If the current value is 1 and the previous value is 0, or it's the first observation for that ID, write the value of time to a retained variable. Only output the last observation for each id. The retained variable should contain the time of the first flagged observation of the last flagged group:
data result;
set have;
by id;
retain firstflagged;
prevflag = lag(flag);
if first.id and flag = 1 then firstflagged = time;
else if first.id and flag = 0 then firstflagged = .;
else if flag = 1 and prevflag = 0 then firstflagged = time;
if last.id then output;
keep id firstflagged flag;
rename firstflagged = time;
run;
I'm working in SAS as a novice. I have two datasets:
Dataset1
Unique ID
ColumnA
1
15
1
39
2
20
3
10
Dataset2
Unique ID
ColumnB
1
40
2
55
2
10
For each UniqueID, I want to subtract all values of ColumnB by each value of ColumnA. And I would like to create a NewColumn that is 1 anytime 1>ColumnB-Column >30. For the first row of Dataset 1, where UniqueID= 1, I would want SAS to go through all the rows in Dataset 2 that also have a UniqueID = 1 and determine if there is any rows in Dataset 2 where the difference between ColumnB and ColumnA is greater than 1 or less than 30. For the first row of Dataset 1 the NewColumn should be assigned a value of 1 because 40 - 15 = 25. For the second row of Dataset 1 the NewColumn should be assigned a value of 0 because 40 - 39 = 1 (which is not greater than 1). For the third row of Dataset 1, I again want SAS to go through every row of ColumnB in Dataset 2 that has the same UniqueID as in Dataset1, so 55 - 20 = 35 (which is greater than 30) but NewColumn would still be assigned a value of 1 because (moving to row 3 of Datatset 2 which has UniqueID =2) 20 - 10 = 10 which satisfies the if statement.
So I want my output to be:
Unique ID
ColumnA
NewColumn
1
15
1
1
30
0
2
20
1
I have tried concatenating Dataset1 and Dataset2 into a FullDataset. Then I tried using a do loop statement but I can't figure out how to do the loop for each value of UniqueID. I tried using BY but that of course produces an error because that is only used for increments.
DATA FullDataset;
set Dataset1 Dataset2; /*Concatenate datasets*/
do i=ColumnB-ColumnA by UniqueID;
if 1<ColumnB-ColumnA<30 then NewColumn=1;
output;
end;
RUN;
I know I'm probably way off but any help would be appreciated. Thank you!
So, the way that answers your question most directly is the keyed set. This isn't necessarily how I'd do this, but it is fairly simple to understand (as opposed to a hash table, which is what I'd use, or a SQL join, probably what most people would use). This does exactly what you say: grabs a row of A, says for each matching row of B check a condition. It requires having an index on the datasets (well, at least on the B dataset).
data colA(index=(id));
input ID ColumnA;
datalines;
1 15
1 39
2 20
3 10
;;;;
data colB(index=(id));
input ID ColumnB;
datalines;
1 40
2 55
2 30
;;;;
run;
data want;
*base: the colA dataset - you want to iterate through that once per row;
set colA;
*now, loop while the check variable shows 0 (match found);
do while (_iorc_ = 0);
*bring in other dataset using ID as key;
set colB key=ID ;
* check to see if it matches your requirement, and also only check when _IORC_ is 0;
if _IORC_ eq 0 and 1 lt ColumnB-ColumnA lt 30 then result=1;
* This is just to show you what is going on, can remove;
put _all_;
end;
*reset things for next pass;
_ERROR_=0;
_IORC_=0;
run;
I have a dataset that contains an ID and some additional data. I want to perform transformations based on the ID with a by statement. The transformation works. Unfortunately SAS automatically reduces the dataset to one row per group. Does anybody know how to keep the original (number of) rows and still perform the group actions?
Here is some sample code to illustrate my problem
data dat;
input ID X $;
datalines;
1 a
1 b
1 c
1 d
2 a
2 b
3 a
4 k
5 z
5 a
5 c
;
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
end;
drop x;
run;
Just add an OUTPUT statement inside the DO loop.
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
output;
end;
drop x;
run;
When you do not have an explicit OUTPUT statement in a data step then an implied OUTPUT statement executes at the end of the data step. Your DO loop around the SET statement means that the end of the data step is only reached for the last observation per group.
If you want the final calculated value to be replicated on each observation then just add another loop to re-read the observations and put the OUTPUT statement in that loop.
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
end;
do until(last.ID);
set dat;
by ID notsorted;
output;
end;
drop x;
run;
When you want to associate a group level computation result to EACH row in the group you will need to first iterate over the group to compute the result, and then have a second loop that reads the same rows of the group and outputs each. Use additional variables if you need to know the sequence number within the group and the total number of rows in the group.
data want(keep=id x_csv_list by_group_size seq);
length x_csv_list $2100.;
do by_group_size = 1 by 1 until(last.ID);
set dat;
by ID notsorted;
x_csv_list = catx(',',x_csv_list,x);
end;
do seq = 1 to by_group_size;
set dat;
output;
end;
run;
Also, if you are at the 'never really get it' stage, remember NOTSORTED means contiguous rows with the same by group variable values.
by s
s group first.s last.s
- ----- ------- ------
A 1st 1 0
A 1st 0 0 /* trick knowledge both 0 means row is interior */
A 1st 0 1
B 2nd 1 1 /* trick knowledge both 1 means group size is 1 row */
A 3rd 1 0
A 3rd 0 1
B 4th 1 0
B 4th 0 0
B 4th 0 1
C 5th 1 0
C 5th 0 1
I want to create a column in my dataset that calculates the sum of the current row and next row for another field. There are several groups within the data, and I only want to take the sum of the next row if the next row is part of the current group. If a row is the last record for that group I want to fill with a null value.
I'm referencing reading next observation's value in current observation, but still can't figure out how to obtain the solution I need.
For example:
data have;
input Group ID Salary;
cards;
10 1 1
10 2 2
10 3 2
10 4 1
11 1 2
11 2 2
11 3 1
11 4 1
;
run;
The result I want to obtain here is this:
data want;
input Group ID Salary Sum;
cards;
10 1 1 3
10 2 2 4
10 3 2 3
10 4 1 .
11 1 2 4
11 2 2 3
11 3 1 2
11 4 1 .
;
run;
Similar to Tom's answer, but using a 'look-ahead' merge (without a by statement, and firstobs=2) :
data want ;
merge have
have (firstobs=2
keep=Group Salary
rename=(Group=NextGroup Salary=NextSalary)) ;
if Group = NextGroup then sum = sum(Salary,NextSalary) ;
drop Next: ;
run ;
Use BY group processing and a second SET statement that skips the first observation.
data want ;
set have end=eof;
by group ;
if not eof then set have (keep=Salary rename=(Salary=Sum) firstobs=2);
if last.group then Sum=.;
else sum=sum(sum,salary);
run;
I found a solution using proc expand that produced what I needed:
proc sort data = have;
by Group ID;
run;
proc expand data=have out=want method=none;
by Group;
convert Salary = Next_Sal / transformout=(lead 1);
run;
data want(keep=Group ID Salary Sum);
set want;
Sum = Salary + Next_Sal;
run;