I have a large data set that can be dissected into the following:
ID x
1 0
1 0
1 0
1 1
1 1
I have one ID variable telling me which individual the X value corresponds to.
The X variable is 0 if no event has occurred for this individual and 1 if an event has occurred.
I'm interested in creating a variable that tells me whether an event has occurred at all for a customer during my whole time series for that specific ID, as seen in x2 below.
ID x x2
1 0 1
1 0 1
1 0 1
1 1 1
1 1 1
Hence x2 takes the value 1 across all observations because x takes the value 1 in at least one instance.
I have looked at creating a reversed lag via the "SAS leading technique", but it doesn't seem to be able to retain the value, so I would need to apply the reverse lag multiple times. That is not an option, since my actual data set contains thousands of rows and every ID needs a different number of lags.
Does anyone have an idea about how to solve this?
Thanks in advance!
The easiest way to do this is the double DoW loop.
data want;
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
    if x then x2 = 1;
  end;
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
    output;
  end;
run;
This loops through the dataset once per ID to find the value of x2 and set it, and then loops through again to output the data. You don't actually need to retain anything, because it all happens in one data step iteration for a single ID; x2 is only reset between IDs.
This is reasonably fast, as long as you don't have more records per ID than you can fit in the read buffer, as it will buffer the first read and thus not have to re-read from disk a second time.
Try a SQL solution.
proc sql;
  create table flagged as
  select
    a.*,
    b.x2
  from
    have a
  join
    (select
       id,
       max(x) as x2
     from have
     group by id) b
  on a.id = b.id
  ;
quit;
You could do this by merging the data with itself, applying a where= dataset option to the second copy. You will need to keep a copy of the X variable, but renamed, so that it can be used in the where=. You could use this renamed variable as the new X2, but then you would need to convert missings to zeros. Alternatively, you can use the IN= dataset option to generate the new X2 variable with 0/1 values.
data want;
  merge have
        have(in=in2 keep=id x rename=(x=x3) where=(x3=1));
  by id;
  x2 = in2;
  drop x3;
run;
Related
I have the following dataset.
ID var1 var2 var3
1   100  200    .
1   150    .  300
2   120    .    .
2   100  150  200
3   200    .  150
3   250  300    .
I would like to have a new dataset with only the last non-missing value of each variable for each group.
id var1 var2 var3
1 150 200 300
2 100 150 200
3 250 300 150
last. selects the last record, but I need to select the last non-missing value.
Looks like you want the last non-missing value for each non-key variable, so you can let the UPDATE statement do the work for you. Normally, in an update operation you apply transactions to a master dataset; for your application, you can use the OBS=0 dataset option to make your current dataset serve as both the master and the transactions.
data want;
  update have(obs=0) have;
  by id;
run;
Riccardo:
There are many ways to select, within each group, the last non-missing value of each column. Here are three. I would not say one is the best; each approach has its merits depending on the specific data set, coder comfort, and long-term maintainability.
Way 1 - UPDATE statement
Perhaps the simplest of the coding approaches goes like this: make a data set that has one row per id and the same columns as the original data, then use the UPDATE statement to replace each like-named variable with a non-missing value.
Example:
data want_base_table(label="One row per id, original columns");
  set have;
  by id;
  if first.id;
run;

* use have as a transaction data set in the update statement;
data want_by_update;
  update want_base_table have;
  by id;
run;
Way 2 - DOW loop
Other approaches involve arrays and the first. and last. flag variables of the BY group. This example shows a DOW loop that tracks the non-missing values and then uses them for the output of each ID:
data want_dow;
  do until (last.id);
    set have;
    by id;
    array myvar var1-var3;
    array myhas has1-has3;
    do _i = 1 to dim(myvar);
      if not missing(myvar(_i)) then myhas(_i) = myvar(_i);
    end;
  end;
  do _i = 1 to dim(myhas);
    myvar(_i) = myhas(_i);
  end;
  output;
  drop _i has1-has3;
run;
A loop is most often called a DOW loop when there is a SET statement inside the DO/END block and the loop termination is triggered by the last. flag variable. A similar non-DOW approach would use the implicit loop, with first. to initialize the tracking array and last. to copy the values (tracked within the group) into the columns for output; a sketch follows.
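Here is a minimal sketch of that non-DOW variant, assuming the same have dataset with key id and columns var1-var3 (the tracking variables has1-has3 are illustrative names):

data want_non_dow;
  set have;
  by id;
  array myvar var1-var3;
  array myhas has1-has3;
  retain has1-has3;                           /* carry tracked values across rows */
  if first.id then call missing(of has1-has3); /* reset the tracking array per group */
  do _i = 1 to dim(myvar);
    if not missing(myvar(_i)) then myhas(_i) = myvar(_i);
  end;
  if last.id then do;
    do _i = 1 to dim(myvar);
      myvar(_i) = myhas(_i);                  /* copy tracked values into the output columns */
    end;
    output;
  end;
  drop _i has1-has3;
run;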
Way 3 - Merging column slices
data want_by_column_slices;
  merge
    have (keep=id var1 where=(var1 ne .))
    have (keep=id var2 where=(var2 ne .))
    have (keep=id var3 where=(var3 ne .))
  ;
  by id;
  if last.id;
run;
I am simply looking to create a new column retaining the value of a specific column from the previous record, so that I can compare the existing column against the new one. Further down the line, I want to output the records where the values in the two columns differ and drop the records where they are the same.
Essentially I want my dataset to look like this, where the first idr has the retained date set to null:
Idr Date1 Date2
1 20/01/2016 .
1 20/01/2016 20/01/2016
1 18/10/2016 20/01/2016
2 07/03/2016 .
2 18/05/2016 07/03/2016
2 21/10/2016 18/05/2016
3 29/01/2016 .
3 04/02/2016 29/01/2016
3 04/02/2016 04/02/2016
I have used code along the following lines in the past, whereby I created a temporary variable (date_temp=date1;) referencing the data I want to retain:
data example2;
  set example1;
  by idr date1;
  date_temp = date1;
  retain date_temp;
  if first.idr then do;
    date_temp = date1;
  end;
  else do;
    date2 = date_temp;
  end;
run;
I have searched high and low on Google. Any help would be greatly appreciated.
The trick here is to set the value of the retained variable ready for the next row after you've already output the current row, rather than relying on the default implicit output at the end of the data step:
data example2;
  set example1;
  by idr;
  retain date2;
  if first.idr then call missing(date2);
  output;
  date2 = date1;
  format date2 ddmmyy10.;
run;
Logic that executes after the OUTPUT statement doesn't make any difference to the row that's just been output, but until the data step proceeds to the next iteration, everything is still in the PDV (program data vector), including variables that are being retained across to the next row, so this is an opportunity to update them. (See the SAS documentation on the PDV for more details.)
Another way of doing this, using the lag function:
data example3;
  set example1;
  date2 = lag(date1);
  if idr ne lag(idr) then call missing(date2);
run;
Be careful when using lag: it returns the value from the last time that instance of the lag function was executed during your data step, not necessarily the value of that variable on the previous row, so weird things tend to happen if you do something like if condition then mylaggedvar=lag(var);
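A tiny sketch of that pitfall, using hypothetical data: because the lag call only executes when the condition is true, it returns the value from the previous qualifying row, not from the previous row.

data lag_pitfall;
  input x;
  if x > 2 then y = lag(x);  /* lag(x) only executes on rows where x > 2 */
  /* on the x=6 row, y is 5 (the last row where x > 2),
     not 1 (the value of x on the actual previous row) */
  datalines;
5
1
6
7
;
run;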
To achieve your final outcome (removing records where the idr and date are the same as the previous row), you don't even need the extra column. Provided the data is sorted by idr and date1, just use first.date1 to keep the required records.
data have;
  input Idr Date1 :ddmmyy10.;
  format date1 ddmmyy10.;
  datalines;
1 20/01/2016
1 20/01/2016
1 18/10/2016
2 07/03/2016
2 18/05/2016
2 21/10/2016
3 29/01/2016
3 04/02/2016
3 04/02/2016
;
run;
data want;
  set have;
  by idr date1;
  if first.date1 then output;
run;
I need to outline a series of ID numbers that are currently available, based on a data set in which IDs are already assigned (if the ID is on the file then it's in use; if it's not on file, then it's available for use).
The issue is I don't know how to create a data set that displays the ID numbers which fall between two IDs that are currently on file. Let's say I have the data set below:
data have;
  input id;
  datalines;
1
5
6
10
;
run;
What I need is for the new data set to have the following structure:
data need;
  input id;
  datalines;
2
3
4
7
8
9
;
run;
I am not sure how I would produce the observations for IDs 2, 3 and 4, as these would be the "available" IDs.
My initial attempt was going to be subtracting the ID value of one observation from the next to find the difference, but I got stuck on how to use that value and add 1 to the observation before it, and it all became quite messy from there.
Any assistance would be appreciated.
As long as your set of possible IDs is known, this can be done by putting them all in a file and excluding the used ones, e.g.:
data id_set;
  do id = 1 to 10;
    output;
  end;
run;

proc sql;
  create table need as
  select id
  from id_set
  where id not in (select id from have)
  ;
quit;
Create a temporary variable that stores the previous id, then just loop between that and the current id, outputting each iteration.
data have;
  input id;
  datalines;
1
5
6
10
;
run;
data need (rename=(newid=id));
  set have;
  retain _lastid;                                   /* keep previous id value */
  if _n_ > 1 then do newid = _lastid + 1 to id - 1; /* fill in numbers between previous and current ids */
    output;
  end;
  _lastid = id;
  keep newid;
run;
Building on Jetzler's answer: Another option is to use the MERGE statement. In this case:
Note: before the merge, sort both datasets by id (if not already sorted).
data want;
  merge id_set (in=a)
        have (in=b); /* specify datasets and vars to allow the conditional below */
  by id;             /* merge key variable */
  if a and not b;    /* on output keep only records in ID_SET that are not in HAVE */
run;
I have a data set where each observation is a combination of binary indicator variables, but not necessarily all possible combinations. I'd like to eliminate observations that are subsets of other observations. As an example, suppose I had these three observations:
var1 var2 var3 var4
0 0 1 1
1 0 0 1
0 1 1 1
In this case, I would want to eliminate observation 1, because it's a subset of observation 3. Observation 2 isn't a subset of anything else, so my output data set should contain observations 2 and 3.
Is there an elegant, and preferably fast, way to do this in SAS? My best solution thus far is a brute-force loop through the data set using a second SET statement with the POINT= option to see if the current observation is a subset of any other, but these data sets could become huge once I start working with a lot of variables, so I'm hoping to find a better way.
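For reference, here is a sketch of that brute-force POINT= approach (hypothetical code, assuming the var1-var4 layout above); every row is compared against every other row, which is what makes it slow on large data:

data want_brute_force;
  set have nobs=n;
  array vars var1-var4;
  array others o1-o4;
  is_subset = 0;
  do _p = 1 to n while (is_subset = 0);
    if _p ne _n_ then do;
      /* read candidate superset row _p into renamed variables */
      set have (rename=(var1=o1 var2=o2 var3=o3 var4=o4)) point=_p;
      is_subset = 1;
      do _t = 1 to dim(vars);
        if vars[_t] = 1 and others[_t] ne 1 then is_subset = 0;
      end;
      /* note: exact duplicate rows flag each other, so a real version
         also needs a tie-break (e.g. only compare against earlier rows
         when the patterns are identical) */
    end;
  end;
  if not is_subset;
  drop o1-o4 is_subset _t;
run;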
First off, one consideration: is it possible for one row to have 1 for all indicators? You should check for that first - if one row does have all 1s, then it will always be the unique solution.
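A quick way to run that check (a sketch; it assumes the var1-var20 setup used in the generated example further down):

data _null_;
  set have end=eof;
  array vars var1-var20;
  if min(of vars[*]) = 1 then n_all_ones + 1; /* min = 1 means every indicator is set */
  if eof then put n_all_ones=;
run;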
POINT= is inefficient, but loading the rows into a hash table isn't a terribly bad way to do it. Just load up a hash table with a string of the binary indicators CATted together, and then search that table.
First, use PROC SORT NODUPKEY to eliminate the exact matches. Unless you have a very large number of indicator variables, this will eliminate many rows.
Then, sort it in an order where the more "complicated" rows are at the top and the less complicated at the bottom. This might be as simple as making a variable which is the sum of the binary indicators and sorting by that, descending; or, if your data suggests it, sorting by a particular order of indicators (if some are more likely to be present). The purpose of this is to reduce the number of times we search; if the likely matches are on top, we will leave the loop faster.
Finally, use a hash iterator to search the list, in descending order by the indicators variable, for any matches.
See below for a partially-tested example. I didn't verify that it eliminated every valid elimination, but it eliminates around half of the rows, which sounds reasonable.
data have;
  array vars var1-var20;
  do _u = 1 to 1e4;
    do _t = 1 to dim(vars);
      vars[_t] = round(ranuni(7), 1); /* random 0/1 indicators */
    end;
    complexity = sum(of vars[*]);
    indicators = cats(of vars[*]);
    output;
  end;
  drop _:;
run;
proc sort nodupkey data=have;
  by indicators;
run;

proc sort data=have;
  by descending complexity;
run;
data want;
  if _n_ = 1 then do;
    length indicators $20;
    call missing(indicators, complexity);
    /* key on complexity first so the iterator walks in descending complexity
       order, which is what the early-exit condition below relies on */
    declare hash indic(dataset:'have', ordered:'d');
    indic.defineKey('complexity', 'indicators');
    indic.defineData('complexity', 'indicators');
    indic.defineDone();
    declare hiter inditer('indic');
  end;
  /* assumes have has a variable, "indicators", like "0011001" */
  set have(rename=(indicators=thisrow_ind complexity=thisrow_complex));
  array vars var1-var20;
  rc = inditer.first();
  do while (rc = 0 and complexity ge thisrow_complex);
    if indicators ne thisrow_ind then do; /* never compare a row with itself */
      do _t = 1 to dim(vars);
        if vars[_t] = 1 and char(indicators, _t) ne '1' then leave;
      end;
      if _t gt dim(vars) then delete;     /* current row is a subset: discard it */
    end;
    rc = inditer.next();
  end;
run;
I'm pretty sure there is probably a more math-oriented way of doing this, but for now this is what I can think of. Proceed with caution, as I have only checked it on a small number of test cases.
My pseudo-algorithm (a "bit pattern" is the concatenation of all the binary variables into a string):
1. Get a unique list of bit patterns, sorted in ascending order (this is what the PROC SQL step does).
2. Set up two arrays: one to track the variables (var1 to var4) on the current row, and another to track lag(of var1 to var4).
3. If the lagged bit pattern is a subset of the current one, then element-wise multiplication of the two gives back the lagged bit pattern.
4. If bit_pattern = lagged_bit_pattern, then flag that pattern to be excluded.
In the last data step you will get the list of bit patterns you need to exclude. NOTE: this approach does not take care of duplicate patterns such as the following:
record1: 1 0 0 1
record2: 1 0 0 1
which can be easily excluded via PROC SORT & NODUPKEY.
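For instance, a sketch of that de-duplication step (assuming the sample_data set defined below):

/* drop exact duplicate indicator patterns before looking for subsets */
proc sort data=sample_data out=sample_nodup nodupkey;
  by var1-var4;
run;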
/* sample data */
data sample_data;
  input id var1-var4;
  bit_pattern = compress(catx('', var1, var2, var3, var4));
  datalines;
1 0 0 1 1
2 1 0 0 1
3 0 1 1 1
4 0 0 0 1
5 1 1 1 0
;
run;
/* in the above example, 0001 and 0011 need to be eliminated; these are flagged in the last data step */

/* get the unique combinations of patterns in the dataset */
proc sql;
  create table all_poss_patterns as
  select
    var1, var2, var3, var4,
    count(*) as freq
  from sample_data
  group by var1, var2, var3, var4
  order by var1, var2, var3, var4;
quit;
data patterns_to_exclude;
  set all_poss_patterns;
  by var1-var4;
  length lagged1-lagged4 8;
  length bit_pattern $32;
  length lagged_bit_pattern $32;
  array first_array{*} var1-var4;
  array lagged{*} lagged1-lagged4;
  bit_pattern = '';
  lagged_bit_pattern = '';
  do i = 1 to dim(first_array);
    lagged{i} = lag(first_array{i});
  end;
  do i = 1 to dim(first_array);
    bit_pattern = cats(bit_pattern, lagged{i} * first_array{i});
    lagged_bit_pattern = cats(lagged_bit_pattern, lagged{i});
  end;
  if bit_pattern = lagged_bit_pattern then exclude_pattern = 1;
  else exclude_pattern = 0;
  /* uncomment the following two lines to keep only the patterns to exclude */
  /* if bit_pattern ne '....' and exclude_pattern=1; */ /* the number of dots should equal the number of binary vars */
  /* keep bit_pattern; */
run;
To my disappointment, the following code, which sums up 'value' by week from 'master' for the weeks that appear in 'transaction', does not work:
data master;
  input week value;
  datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;

data transaction;
  input change_week;
  datalines;
1
3
;
run;
data _null_;
  set transaction;
  do until(done);
    set master end=done;
    where week = change_week;
    sum = sum(value, sum);
  end;
  file print;
  put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does anyone know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
  input week value;
  datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;

data transaction;
  input week;
  datalines;
1
3
4
;
run;
data _null_;
  set transaction;
  file print;
  do until(done);
    set master key = week end = done;
    /* Prevent implicit retain from the previous row if the key isn't found,
       or we've read past the last record for the current key */
    if _IORC_ ne 0 then do;
      _ERROR_ = 0;
      call missing(value);
    end;
    else sum = sum(value, sum);
  end;
  put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
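A rough sketch of that MODIFY variant (hypothetical: here the "transactional change" is just doubling value for the matched weeks, and the same non-unique index and matching variable names are assumed):

data master;
  set transaction;
  do until(done);
    modify master key = week end = done;
    if _IORC_ = 0 then do;
      value = value * 2; /* hypothetical in-place change for matched rows */
      replace;           /* rewrite the current master observation in place */
    end;
    else _ERROR_ = 0;    /* key not found (or exhausted): clear the error flag */
  end;
run;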
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for every line of "transaction".
(The reason you got the error is that you used where instead of if. In SAS, the sub-setting where in a data step is only aware of columns already existing within the data set it's sub-setting. Both are kept as options because where is faster when it's usable.)
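A minimal illustration of that distinction (a sketch, reusing the master dataset from above):

data demo;
  set master;        /* master contains week and value */
  target = 1;        /* created in the PDV, not present in master */
  if week = target;  /* fine: IF sees everything in the PDV */
  /* where week = target; would fail: WHERE can only reference
     variables that already exist in master */
run;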
An alternative solution would be to use PROC SQL. Hopefully this example is self-explanatory:
proc sql;
  select
    a.change_week,
    sum(b.value) as value
  from
    transaction as a,
    master as b
  where a.change_week = b.week
  group by change_week;
quit;
I don't suggest the solution below (I like @Jeff's SQL solution, or even a hash, better). But just for playing with data step logic, I think the approach below would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so it makes only one pass through each dataset.
On the first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it has found all the matching records for that key. The DATA step then iterates and does the same for the next transaction record.
data _null_;
  set transaction;
  by change_week;

  do until(last.week and _found);
    set master;
    by week;

    if week = change_week then do;
      sum = sum(value, sum);
      _found = 1;
    end;
  end;

  *file print;
  put week= sum=;
run;
week=1 sum=60
week=3 sum=75