I have some old SAS code to convert into python.
Part of the code does effectively this:
data A (index=(key1=(record_id record_version)));
set table.xxx (where = ...);
run;
data B;
set table.yyy (where = ...);
set A key=key1/unique;
if _ERROR_ = 1 then do;
valueA = "";
_ERROR_ = 0;
end;
run;
I have read the documentation for the SET statement and its KEY= and /UNIQUE options, which says:
By default, SET begins searching at the top of the index only when the KEY= value changes.
If the KEY= value does not change on successive executions of the SET statement, the search begins by following the most recently retrieved observation. In other words, when consecutive duplicate KEY= values appear, the SET statement attempts a one-to-one match with duplicate indexed values in the data set that is being read. If more consecutive duplicate KEY= values are specified than exist in the data set that is being read, the extra duplicates are treated as not found.
When KEY= is a unique value, only the first attempt to read an observation with that key value succeeds; subsequent attempts to read the observation with that value of the key fail. The _IORC_ variable returns a value that corresponds to the SYSRC autocall macro's mnemonic _DSENOM. If you add the /UNIQUE option, subsequent attempts to read the observation with the unique KEY= value succeed. The _IORC_ variable returns a 0.
Huh, "subsequent attempts to read the observation with that value of the key fail". Fail how?
So practically speaking, given A and B are:
A                                    B
record_id  record_version  valueA    record_id  record_version  valueB
1          1               A11       1          1               B10
1          1               A12       1          2               B20
1          2               A22
1          3               A33
My output will definitely include these rows:
record_id record_version valueA valueB
1 1 A11 B10
1 2 A22 B20
What I don't understand is what the if _ERROR_ statement does.
Do I get this?
record_id record_version valueA valueB
1 1 B10
Or this?
record_id record_version valueA valueB
1 1 A12 B10
Do I get this?
record_id record_version valueA valueB
1 3 A33 null
Or this?
record_id record_version valueA valueB
1 3 null
What edge case is the error statement handling?
The code resets the lookup value to missing when a key from the set table.yyy (where = ...) data is NOT present in the lookup table A. If the reset did not occur, the lookup value would be whatever value was retrieved by the previous successful lookup.
The /UNIQUE option tells SET to always retrieve the first match when more than one retrieval is possible (i.e. the lookup table A has repeats of record_id/record_version).
The _ERROR_ check is still needed to reset the lookup value when no match is found.
The issue really only comes to the fore if the master table has more consecutive rows with a repeated key than the non-unique indexed lookup table has rows with that key.
Example:
* lookup indexed, but not unique;
* lookup is more typically a 'transaction' table;
data lookup(index=(IDX_key1key2=(key1 key2)));
input key1 key2 valueA $; datalines;
1 1 A11 1st 1 1
1 1 A12 2nd 1 1
1 2 A22
1 3 A33
;
data master;
input key1 key2 valueB $; datalines;
1 1 B10 1st 1 1
1 2 B20
1 1 B30 1st 1 1
1 1 B40 2nd 1 1
1 1 B50 3rd 1 1
;
* data for 2nd 1 1 lookup is from 2nd lookup;
* data for 3rd 1 1 lookup is from 2nd lookup and PUT will show _ERROR_=1 in log;
* No _ERROR_ check, that cant be good;
data master_with_keyed_lookup;
set master;
set lookup key=IDX_key1key2;
put _all_;
run;
* data retrieved for 2nd and 3rd 1 1 lookup are from 1st lookup row due to unique;
* No _ERROR_ check, that cant be good;
data master_with_unique_keyed_lookup;
set master;
set lookup key=IDX_key1key2/unique;
put _all_;
run;
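Since the end goal is a Python port, the retrieval rules above can be sketched in plain Python. This is only a rough model of the documented behaviour, not a drop-in replacement: the function name keyed_lookup and its unique and reset_on_miss parameters are made up for illustration, missing is modelled as None, and it assumes that once the duplicates for a key are used up, further consecutive reads of that same key keep failing.

```python
def keyed_lookup(master, lookup, unique, reset_on_miss):
    """Rough model of `set master; set lookup key=(key1 key2) [/unique];`.

    master and lookup are lists of (key1, key2, value) tuples.
    reset_on_miss=True plays the role of the `if _ERROR_ = 1` block.
    """
    # Build the "index": key -> lookup values in their original order.
    index = {}
    for key1, key2, value_a in lookup:
        index.setdefault((key1, key2), []).append(value_a)

    out = []
    prev_key = None
    position = 0     # which duplicate to read next for the current key
    value_a = None   # survives between rows, like a variable read by SET
    for key1, key2, value_b in master:
        key = (key1, key2)
        if unique or key != prev_key:
            position = 0              # search restarts at the top of the index
        prev_key = key
        matches = index.get(key, [])
        if position < len(matches):
            value_a = matches[position]   # successful read
            position += 1
        elif reset_on_miss:
            value_a = None                # failed read, value cleared
        # Otherwise the read failed and the stale value_a is kept.
        out.append((key1, key2, value_a, value_b))
    return out


# Same rows as the lookup and master data steps above.
lookup = [(1, 1, "A11"), (1, 1, "A12"), (1, 2, "A22"), (1, 3, "A33")]
master = [(1, 1, "B10"), (1, 2, "B20"), (1, 1, "B30"),
          (1, 1, "B40"), (1, 1, "B50")]

for row in keyed_lookup(master, lookup, unique=False, reset_on_miss=True):
    print(row)
```

Calling it with reset_on_miss=False shows the stale value carried over from the previous successful read, which is the case the _ERROR_ check guards against.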
Related
I'd like to ask for help with this, as I am new to SAS. A PROC SQL approach would be fine as well.
My dataset has IDs, a time variable, and a flag. After I sort by id and time, I need to find the first flagged observation of the last flagged group/streak. As in:
ID TIME FLAG
1 2 1
1 3 1
1 4 1
1 5 0
1 6 1
1 7 0
1 8 1
1 9 1
1 10 1
2 2 0
2 3 1
2 4 1
2 5 1
2 6 1
2 7 1
Here I want my script to return the row where time is 8 for ID 1, as it is the first observation from the last "streak", or flagged group. For ID 2 it should be where time is 3.
Desired output:
ID TIME FLAG
1 8 1
2 3 1
I'm trying to wrap my head around using first. and last. here, but I suppose the problem is that I view flagged groups/streaks that are separated in time as different groups, while SAS distinguishes them only by the value of flag, so a simple "take first. from last." is not sufficient.
I was also thinking of collapsing the flags to a string and using a regex lookahead, but I couldn't come up with either the method or the pattern.
I would just code a double DOW loop. The first will let you calculate the observation for this ID that you want to output and the second will read through the records again and output the selected observation.
You can use the NOTSORTED keyword on the BY statement to have SAS calculate the FIRST.FLAG variable.
data have;
input ID TIME FLAG;
cards;
1 2 1
1 3 1
1 4 1
1 5 0
1 6 1
1 7 0
1 8 1
1 9 1
1 10 1
2 2 0
2 3 1
2 4 1
2 5 1
2 6 1
2 7 1
;
data want;
do obs=1 by 1 until(last.id);
set have;
by id flag notsorted;
* only a FLAG=1 row can start a flagged streak;
if first.flag and flag=1 then want=obs;
end;
do obs=1 to obs;
set have;
if obs=want then output;
end;
drop obs want;
run;
Loop through the dataset by id. Use the lag function to look at the current and previous value of flag. If the current value is 1 and the previous value is 0, or it's the first observation for that ID, write the value of time to a retained variable. Only output the last observation for each id. The retained variable should contain the time of the first flagged observation of the last flagged group:
data result;
set have;
by id;
retain firstflagged;
prevflag = lag(flag);
if first.id and flag = 1 then firstflagged = time;
else if first.id and flag = 0 then firstflagged = .;
else if flag = 1 and prevflag = 0 then firstflagged = time;
if last.id then output;
keep id firstflagged flag;
rename firstflagged = time;
run;
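For readers who think more easily in Python, the same "first row of the last flagged streak" idea can be sketched with itertools.groupby, which groups consecutive equal values much like BY with NOTSORTED. The function name first_of_last_streak is invented here, and the rows are assumed to be already sorted by id and time.

```python
from itertools import groupby

# (id, time, flag) rows, already sorted by id and time.
have = [
    (1, 2, 1), (1, 3, 1), (1, 4, 1), (1, 5, 0), (1, 6, 1),
    (1, 7, 0), (1, 8, 1), (1, 9, 1), (1, 10, 1),
    (2, 2, 0), (2, 3, 1), (2, 4, 1), (2, 5, 1), (2, 6, 1), (2, 7, 1),
]


def first_of_last_streak(rows):
    """Return, per id, the first row of the last run of flag == 1."""
    result = []
    for _, id_rows in groupby(rows, key=lambda r: r[0]):
        start = None
        # Consecutive rows with the same flag form one streak.
        for flag, streak in groupby(id_rows, key=lambda r: r[2]):
            first_row = next(streak)
            if flag == 1:
                start = first_row   # later flagged streaks overwrite earlier ones
        if start is not None:       # ids with no flagged row are skipped
            result.append(start)
    return result


print(first_of_last_streak(have))
```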
I'm working in SAS as a novice. I have two datasets:
Dataset1
Unique ID  ColumnA
1          15
1          39
2          20
3          10
Dataset2
Unique ID  ColumnB
1          40
2          55
2          10
For each UniqueID, I want to subtract each value of ColumnA from all values of ColumnB, and I would like to create a NewColumn that is 1 whenever 1 < ColumnB - ColumnA < 30. For the first row of Dataset1, where UniqueID = 1, I would want SAS to go through all the rows in Dataset2 that also have UniqueID = 1 and determine whether there are any rows in Dataset2 where the difference between ColumnB and ColumnA is greater than 1 and less than 30. For the first row of Dataset1, NewColumn should be assigned a value of 1 because 40 - 15 = 25. For the second row of Dataset1, NewColumn should be assigned a value of 0 because 40 - 39 = 1 (which is not greater than 1). For the third row of Dataset1, I again want SAS to go through every row of ColumnB in Dataset2 that has the same UniqueID as in Dataset1, so 55 - 20 = 35 (which is greater than 30), but NewColumn would still be assigned a value of 1 because (moving to row 3 of Dataset2, which has UniqueID = 2) 20 - 10 = 10, which satisfies the if statement.
So I want my output to be:
Unique ID  ColumnA  NewColumn
1          15       1
1          39       0
2          20       1
I have tried concatenating Dataset1 and Dataset2 into a FullDataset. Then I tried using a do loop statement but I can't figure out how to do the loop for each value of UniqueID. I tried using BY but that of course produces an error because that is only used for increments.
DATA FullDataset;
set Dataset1 Dataset2; /*Concatenate datasets*/
do i=ColumnB-ColumnA by UniqueID;
if 1<ColumnB-ColumnA<30 then NewColumn=1;
output;
end;
RUN;
I know I'm probably way off but any help would be appreciated. Thank you!
So, the way that answers your question most directly is the keyed set. This isn't necessarily how I'd do this, but it is fairly simple to understand (as opposed to a hash table, which is what I'd use, or a SQL join, probably what most people would use). This does exactly what you say: grabs a row of A, says for each matching row of B check a condition. It requires having an index on the datasets (well, at least on the B dataset).
data colA(index=(id));
input ID ColumnA;
datalines;
1 15
1 39
2 20
3 10
;;;;
data colB(index=(id));
input ID ColumnB;
datalines;
1 40
2 55
2 30
;;;;
run;
data want;
*base: the colA dataset - you want to iterate through that once per row;
set colA;
*now, loop while the check variable shows 0 (match found);
do while (_iorc_ = 0);
*bring in other dataset using ID as key;
set colB key=ID ;
* check to see if it matches your requirement, and also only check when _IORC_ is 0;
if _IORC_ eq 0 and 1 lt ColumnB-ColumnA lt 30 then result=1;
* This is just to show you what is going on, can remove;
put _all_;
end;
*reset things for next pass;
_ERROR_=0;
_IORC_=0;
run;
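The hash table alternative mentioned above maps naturally onto a Python dict. The following is a minimal sketch, not a translation of the data step: the function name flag_rows and the low/high parameters are invented for illustration, the data are the colA/colB rows from the example, and unlike the data step it writes 0 rather than missing when no row qualifies.

```python
from collections import defaultdict

col_a = [(1, 15), (1, 39), (2, 20), (3, 10)]   # (ID, ColumnA)
col_b = [(1, 40), (2, 55), (2, 30)]            # (ID, ColumnB)


def flag_rows(col_a, col_b, low=1, high=30):
    """For each colA row, 1 if any colB row with the same ID satisfies
    low < ColumnB - ColumnA < high, else 0."""
    b_by_id = defaultdict(list)      # ID -> all ColumnB values for that ID
    for id_, b in col_b:
        b_by_id[id_].append(b)

    out = []
    for id_, a in col_a:
        candidates = b_by_id.get(id_, [])   # empty when the ID has no colB rows
        hit = any(low < b - a < high for b in candidates)
        out.append((id_, a, int(hit)))
    return out


print(flag_rows(col_a, col_b))
```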
In SAS, a dataset I have is as follows.
id A
1 2
1 3
2 1
3 1
3 2
ID is given to each individual and A is a categorical variable which takes 1, 2 or 3. I want to get the data with one observation per each individual separating A into three indicator variables, say A1, A2 and A3.
The result would look like this:
id A1 A2 A3
1 0 1 1
2 1 0 0
3 1 1 0
Does anyone have any thought how to do this in data step, not in sql? Thanks in advance.
So you're on the right track; PROC TRANSPOSE is definitely the way to go:
data temp;
input id A;
datalines;
1 2
1 3
2 1
3 1
3 2
;
run;
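First you want to transpose by id (the data must already be sorted by id), using the variable A both for the values and, through the ID statement, for the names of the new columns: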
proc transpose data = temp
out = temp2
prefix = A;
by id;
var A;
id A;
run;
And then, for all variables beginning with A, you want to replace all missing values with 0s and all non-missing values with 1s. The retain statement here reorders your variables:
data temp3 (drop = _name_);
retain id A1 A2 A3;
set temp2;
array change A:;
do over change;
if change~=. then change=1;
if change=. then change=0;
end;
run;
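To make the intended result concrete, here is a small Python sketch of the same reshaping: collect the values of A seen for each id, then emit one 0/1 indicator per level. The function name indicators and the levels parameter are assumptions for illustration only.

```python
rows = [(1, 2), (1, 3), (2, 1), (3, 1), (3, 2)]   # (id, A)


def indicators(rows, levels=(1, 2, 3)):
    """Return one (id, A1, A2, A3) tuple per id, in order of first appearance."""
    seen = {}                                  # id -> set of A values observed
    for id_, a in rows:
        seen.setdefault(id_, set()).add(a)
    return [
        (id_, *(int(level in values) for level in levels))
        for id_, values in seen.items()
    ]


for row in indicators(rows):
    print(row)
```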
I have the following dataset :
ID CODE
1 A
1 B
2 A
2 A
2 B
3 A
3 B
I would like to add a third column to this table which gives a sequence no. as given below :
ID CODE SEQ
1 A 1
1 B 2
2 A 1
2 A 1
2 B 2
3 A 1
3 B 2
How can I achieve this with a retain statement, rather than by hard-coding A as 1 and B as 2?
You should look at BY-group processing and the FIRST. variables. Something like this will work; basically, for each ID initialize seq to zero, and for each new code increment it by one.
data want;
set have;
by id code;
if first.id then seq=0;
if first.code then seq+1;
run;
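The two IF statements can be mirrored in Python to see why repeated codes share a number. This is just a sketch with an invented function name (add_seq); it assumes the rows are already sorted by id and code, as the BY statement requires.

```python
have = [(1, "A"), (1, "B"), (2, "A"), (2, "A"), (2, "B"), (3, "A"), (3, "B")]


def add_seq(rows):
    """Append a sequence number that restarts per id and grows per new code."""
    out = []
    prev = None     # previous (id, code) row, None before the first row
    seq = 0         # carried from row to row, like the sum statement seq+1
    for id_, code in rows:
        first_id = prev is None or id_ != prev[0]
        first_code = first_id or code != prev[1]
        if first_id:
            seq = 0
        if first_code:
            seq += 1
        out.append((id_, code, seq))
        prev = (id_, code)
    return out


for row in add_seq(have):
    print(row)
```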
In this data step I do not understand what if last.y does.
Could you tell me?
data stop2;
set stop2;
by x y z t;
if last.y; /*WHAT DOES THIS DO ??*/
if t ne 999999 then
t=t+1;
else do;
t=0;
z=z+1;
end;
run;
LAST.Y refers to the row immediately before a change in the value of Y. So, in the following dataset:
data have;
input x y z;
datalines;
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
1 3 1
1 3 2
1 3 3
2 3 1
2 3 2
2 3 3
;;;;
run;
LAST.Y would occur on the third, sixth, ninth, and twelfth rows in that dataset (on each row where Z=3). The first two times are when Y is about to change from 1 to 2, and when it is about to change from 2 to 3. The third time is when X is about to change - LAST.Y triggers when Y is about to change or when any variable before it in the BY list changes. Finally, the last row in the dataset is always LAST.(whatever).
In the specific dataset above, the subsetting if means you only take the last row for each group of Ys. In this code:
data want;
set have;
by x y z;
if last.y;
run;
You would end up with the following dataset:
data want;
input x y z;
datalines;
1 1 3
1 2 3
1 3 3
2 3 3
;;;;
run;
at the end.
One thing you can do if you want to see how FIRST and LAST operate is to use PUT _ALL_;. For example:
data want;
set have;
by x y z;
put _all_;
if last.y;
run;
It will show you all of the variables, including FIRST.(whatever) and LAST.(whatever) on the dataset. (FIRST.Y and LAST.Y are actually variables.)
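If it helps to see the rule spelled out, the FIRST./LAST. logic described above can be imitated in Python: a row is last for a BY variable when the next row differs in that variable or in any variable listed before it. The helper name by_flags is hypothetical, and the rows are assumed to be in BY order already.

```python
def by_flags(rows, by):
    """Add first.<var> / last.<var> (0 or 1) to each row for the BY variables.

    rows: list of dicts in BY order; by: BY variable names, left to right.
    """
    flagged = []
    n = len(rows)
    for i, row in enumerate(rows):
        flags = {}
        for depth, var in enumerate(by, start=1):
            prefix = by[:depth]     # this variable plus everything before it
            key = [row[v] for v in prefix]
            is_first = i == 0 or [rows[i - 1][v] for v in prefix] != key
            is_last = i == n - 1 or [rows[i + 1][v] for v in prefix] != key
            flags["first." + var] = int(is_first)
            flags["last." + var] = int(is_last)
        flagged.append({**row, **flags})
    return flagged


# Same 12 rows as the HAVE dataset above.
have = [{"x": x, "y": y, "z": z}
        for x, y in [(1, 1), (1, 2), (1, 3), (2, 3)]
        for z in (1, 2, 3)]

flagged = by_flags(have, ["x", "y", "z"])
want = [(r["x"], r["y"], r["z"]) for r in flagged if r["last.y"]]
print(want)
```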
In SAS, first. and last. are variables created implicitly within a data step that uses a BY statement.
Each BY variable gets a first. and a last. variable, with a value for every record in the DATA step. These values will be either 0 or 1. Writing if last.y; is the same as writing if last.y = 1;.
That is an example of a subsetting IF statement, which is different from an IF/THEN statement. It basically means that if the condition is not true, SAS stops the current iteration of the data step right there, without writing the observation.
So
if last.y;
is equivalent to
if not last.y then delete;
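or, in a step that has an explicit OUTPUT statement (without one, RETURN writes the current observation before returning to the top of the step),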
if not last.y then return;
or
if last.y then do;
... rest of the data step before the run ...
end;