Deleting rows based on conditions on multiple columns - sas

Given the table have below, I would like to delete the records that match a row of the to_delete table.
data have;
infile datalines delimiter="|";
input id :8. item :$8. datetime : datetime18.;
format datetime datetime18.;
datalines;
111|Basket|30SEP20:00:00:00
111|Basket|30SEP21:00:00:00
111|Basket|31DEC20:00:00:00
111|Backpack|31MAY22:00:00:00
222|Basket|31DEC20:00:00:00
222|Basket|30JUN20:00:00:00
;
+-----+----------+------------------+
| id | item | datetime |
+-----+----------+------------------+
| 111 | Basket | 30SEP20:00:00:00 |
| 111 | Basket | 30SEP21:00:00:00 |
| 111 | Basket | 31DEC20:00:00:00 |
| 111 | Backpack | 31MAY22:00:00:00 |
| 222 | Basket | 31DEC20:00:00:00 |
| 222 | Basket | 30JUN20:00:00:00 |
+-----+----------+------------------+
data to_delete;
infile datalines delimiter="|";
input id :8. item :$8. datetime : datetime18.;
format datetime datetime18.;
datalines;
111|Basket|30SEP20:00:00:00
111|Backpack|31MAY22:00:00:00
222|Basket|30JUN20:00:00:00
;
+-----+----------+------------------+
| id | item | datetime |
+-----+----------+------------------+
| 111 | Basket | 30SEP20:00:00:00 |
| 111 | Backpack | 31MAY22:00:00:00 |
| 222 | Basket | 30JUN20:00:00:00 |
+-----+----------+------------------+
In the past, I have used the catx() function to concatenate the key columns in a where clause, but I wonder if there is a better way of doing this:
proc sql;
delete from have
where catx('|',id,item,datetime) in
(select catx('|',id,item,datetime) from to_delete);
quit;
+-----+--------+------------------+
| id | item | datetime |
+-----+--------+------------------+
| 111 | Basket | 30SEP21:00:00:00 |
| 111 | Basket | 31DEC20:00:00:00 |
| 222 | Basket | 31DEC20:00:00:00 |
+-----+--------+------------------+
Please note that the solution should allow the have table to have more columns than the to_delete table.

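You can use the except set operator to compute the difference of the two tables. Note that except compares whole rows, so this assumes have and to_delete have the same columns: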
proc sql;
create table want as
select * from have except select * from to_delete
;
quit;
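If have carries extra columns that to_delete lacks, a correlated exists subquery is one possible alternative. This is only a sketch, assuming the three key columns are id, item and datetime as in the question; it compares the keys column by column instead of concatenating them, and it deletes in place rather than creating a new table.

```sas
proc sql;
  /* Delete every row of have whose key columns match a row of to_delete. */
  /* Columns that exist only in have are ignored by the comparison.       */
  delete from have as h
  where exists (
    select 1
    from to_delete as d
    where d.id = h.id
      and d.item = h.item
      and d.datetime = h.datetime
  );
quit;
```

Comparing the columns directly also avoids the implicit numeric-to-character conversion that catx() applies to id and datetime.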

Related

Derive attributes based on multiple events

I have data that I want to transpose to get a view of the status of a single id at any point in time.
I have been trying to follow @Joe's answer from "Aggregating multiple observations depending on validity ranges", but I struggle with attributes that have multiple modalities.
This is the event-based data I have:
data have;
infile datalines delimiter="|";
input attrib :$30. multiple_attr :$1. id :$30. attrib_id :8. member_value :$100. type :$5. dt_event :datetime18.;
format dt_event datetime20.;
datalines;
TYPE|N|ABC123|111|MEDIUM|Start|01DEC2014:00:00:00
TYPE|N|ABC123|111|MEDIUM|End|18APR2021:00:00:00
TYPE|N|ABC123|111|BIG|Start|19APR2021:00:00:00
TYPE|N|ABC123|111|BIG|End|31DEC2030:00:00:00
POSITION|N|ABC123|222|TOP|Start|01DEC2014:00:00:00
POSITION|N|ABC123|222|TOP|End|31DEC2030:00:00:00
IS_ACTIVE|N|ABC123|333|YES|Start|01DEC2014:00:00:00
IS_ACTIVE|N|ABC123|333|YES|End|31DEC2030:00:00:00
LEVELS|Y|ABC123|1|ALONE|Start|01DEC2014:00:00:00
LEVELS|Y|ABC123|1|BOTH|Start|01DEC2014:00:00:00
LEVELS|Y|ABC123|1|BOTH|End|18APR2021:00:00:00
LEVELS|Y|ABC123|1|ALONE|End|31DEC2030:00:00:00
TYPE|N|DEF456|111|MEDIUM|Start|01DEC2014:00:00:00
TYPE|N|DEF456|111|MEDIUM|End|31DEC2030:00:00:00
POSITION|N|DEF456|222|MID|Start|01DEC2014:00:00:00
POSITION|N|DEF456|222|MID|End|31DEC2030:00:00:00
IS_ACTIVE|N|DEF456|333|YES|Start|01MAR2014:00:00:00
IS_ACTIVE|N|DEF456|333|YES|End|31DEC2030:00:00:00
LEVELS|Y|DEF456|1|ALONE|Start|01MAR2014:00:00:00
LEVELS|Y|DEF456|1|BOTH|Start|01MAR2014:00:00:00
LEVELS|Y|DEF456|1|BOTH|End|31MAR2018:00:00:00
LEVELS|Y|DEF456|1|BOTH|Start|20AUG2018:00:00:00
LEVELS|Y|DEF456|1|ALONE|End|31DEC2030:00:00:00
LEVELS|Y|DEF456|1|BOTH|End|31DEC2030:00:00:00
;
Using @Joe's method:
proc sort data=have;
by id attrib_id dt_event member_value;
run;
data want;
set have(rename=member_value=in_value);
by id attrib_id dt_event;
retain start_date end_date member_value orig_value;
format member_value new_value $100.;
* First row per attrib_id is easy, just start it off with a START;
if first.attrib_id then do;
start_date = dt_event;
member_value = in_value;
end;
else do; *Now is the harder part;
* For ENDs, we want to remove the current member_value from the concatenated value string, always, and then if it is the last row for that dt_event, we want to output a new record;
if type='End' then do;
*remove the current (in_)value;
if first.dt_event then orig_value = member_value;
do _i = 1 to countw(member_value,';');
if scan(orig_value,_i,';') ne in_value then do;
if orig_value > scan(orig_value,_i,';') then new_value = catx('; ',scan(orig_value,_i,';'),new_value);
else new_value = catx('; ',new_value,scan(orig_value,_i,';'));
end;
end;
orig_value = new_value;
if last.dt_event then do;
end_date = dt_event;
output;
start_date = dt_event + 86400;
member_value = new_value;
orig_value = ' ';
end;
end;
else do;
* For START, we want to be more careful about outputting, as this will output lots of unwanted rows if we do not take care;
end_date = dt_event - 86400;
if start_date < end_date and not missing(member_value) then output;
if member_value > in_value then member_value = catx('; ',in_value,member_value);
else member_value = catx('; ',member_value,in_value);
start_date = dt_event;
end_date = .;
end;
end;
format start_date end_date datetime20.;
keep id multiple_attr attrib_id member_value start_date end_date;
run;
I end up with:
+---------------+--------+-----------+--------------------+--------------------+-------------------+
| multiple_attr | id | attrib_id | start_date | end_date | member_value |
+---------------+--------+-----------+--------------------+--------------------+-------------------+
| Y | ABC123 | 1 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | ALONE; BOTH |
| Y | ABC123 | 1 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | BOTH; ALONE |
| N | ABC123 | 111 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | MEDIUM |
| N | ABC123 | 111 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | BIG |
| N | ABC123 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | TOP |
| N | ABC123 | 333 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | YES |
| Y | DEF456 | 1 | 01MAR2014:00:00:00 | 31MAR2018:00:00:00 | ALONE; BOTH |
| Y | DEF456 | 1 | 01APR2018:00:00:00 | 19AUG2018:00:00:00 | BOTH; ALONE |
| Y | DEF456 | 1 | 20AUG2018:00:00:00 | 31DEC2030:00:00:00 | BOTH; BOTH; ALONE |
| N | DEF456 | 111 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MEDIUM |
| N | DEF456 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MID |
| N | DEF456 | 333 | 01MAR2014:00:00:00 | 31DEC2030:00:00:00 | YES |
+---------------+--------+-----------+--------------------+--------------------+-------------------+
You can see that multiple modalities attributes (where multiple_attr = "Y") are not handled properly.
The desired output should be like this:
+---------------+--------+-----------+--------------------+--------------------+--------------+
| multiple_attr | id | attrib_id | start_date | end_date | member_value |
+---------------+--------+-----------+--------------------+--------------------+--------------+
| Y | ABC123 | 1 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | ALONE; BOTH |
| Y | ABC123 | 1 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | ALONE |
| N | ABC123 | 111 | 01DEC2014:00:00:00 | 18APR2021:00:00:00 | MEDIUM |
| N | ABC123 | 111 | 19APR2021:00:00:00 | 31DEC2030:00:00:00 | BIG |
| N | ABC123 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | TOP |
| N | ABC123 | 333 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | YES |
| Y | DEF456 | 1 | 01MAR2014:00:00:00 | 31MAR2018:00:00:00 | ALONE; BOTH |
| Y | DEF456 | 1 | 01APR2018:00:00:00 | 19AUG2018:00:00:00 | ALONE |
| Y | DEF456 | 1 | 20AUG2018:00:00:00 | 31DEC2030:00:00:00 | ALONE; BOTH |
| N | DEF456 | 111 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MEDIUM |
| N | DEF456 | 222 | 01DEC2014:00:00:00 | 31DEC2030:00:00:00 | MID |
| N | DEF456 | 333 | 01MAR2014:00:00:00 | 31DEC2030:00:00:00 | YES |
+---------------+--------+-----------+--------------------+--------------------+--------------+
Is there a way to handle multiple modalities attributes? I can't find a way to delete a member value once a modality of that attribute is ending (i.e. switching from ALONE; BOTH to ALONE after it ended).
I am not 100% sure I understand all of this, but I think this is at least one problem.
Looking at where you remove the values, you need to use strip() or similar because of the spaces. I removed the spaces from the catx() delimiter and added strip() to do that here:
if strip(scan(orig_value,_i,';')) ne strip(in_value) then do;
if strip(orig_value) > strip(scan(orig_value,_i,';')) then new_value = catx(';',scan(orig_value,_i,';'),new_value);
else new_value = catx(';',new_value,scan(orig_value,_i,';'));
end;
Otherwise it is comparing words with leading spaces to words without them. In some cases those words are identical (or treated as such by SAS), but in others they are not, which causes some of your issues here. When I run this, I get "ALONE" on the second line, for example.
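The effect of the leading space can be seen in isolation with a small test step. This is only an illustrative sketch; the variable names raw and clean are made up for the example and are not part of the original code.

```sas
data _null_;
  length orig_value $100 raw clean $100;
  orig_value = 'ALONE; BOTH';

  /* With ';' as the only delimiter, the second word keeps its leading blank. */
  raw   = scan(orig_value, 2, ';');
  clean = strip(raw);

  /* Leading blanks are significant in character comparisons. */
  if raw ne 'BOTH' then put 'Unstripped word does not equal BOTH: [' raw $char5. ']';
  if clean = 'BOTH' then put 'Stripped word equals BOTH';
run;
```

Trailing blanks are ignored when SAS compares character values, but leading blanks are not, which is why only some of the comparisons in the original step fail.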

Construct continuous intervals from non-contiguous validity ranges

Given
data have;
infile datalines missover delimiter="|" dsd;
input id :$20. (start_date end_date) (:date9.) (attribute_1 attribute_2 attribute_3 attribute_4) ($);
format start_date end_date date9.;
datalines;
ID1|01MAR2014|31DEC9999|BIG|YES||
ID2|01SEP2015|30NOV2020|||TWO|
ID2|01SEP2015|31DEC9999|SMALL|||
ID2|01AUG2021|31DEC9999|||TWO|
ID3|01DEC2014|31MAY2016||YES||
ID3|01DEC2014|29JUN2017||||OK
ID3|01DEC2014|31DEC9999|MEDIUM|||
ID3|31MAR2015|29SEP2017|||ONE|
ID3|30JUN2017|31DEC9999||YES||TBD
ID3|30SEP2017|31DEC9999|||ONE, TWO|
;
I would like to get continuous validity ranges for each id with the correct attributes at each of them.
The desired output would be like this:
+-----+------------+-----------+-------------+-------------+-------------+-------------+
| id | start_date | end_date | attribute_1 | attribute_2 | attribute_3 | attribute_4 |
+-----+------------+-----------+-------------+-------------+-------------+-------------+
| ID1 | 01MAR2014 | 31DEC9999 | BIG | YES | | |
| ID2 | 01SEP2015 | 30NOV2020 | SMALL | | TWO | |
| ID2 | 01DEC2020 | 31JUL2021 | SMALL | | | |
| ID2 | 01AUG2021 | 31DEC9999 | SMALL | | TWO | |
| ID3 | 01DEC2014 | 30MAR2015 | MEDIUM | YES | | OK |
| ID3 | 31MAR2015 | 31MAY2016 | MEDIUM | YES | ONE | OK |
| ID3 | 01JUN2016 | 29JUN2016 | MEDIUM | | ONE | OK |
| ID3 | 30JUN2016 | 29SEP2017 | MEDIUM | YES | ONE | TBD |
| ID3 | 30SEP2017 | 31DEC9999 | MEDIUM | YES | ONE, TWO | TBD |
+-----+------------+-----------+-------------+-------------+-------------+-------------+
I found a way of doing it using joins, but I would like to know if there is a better way of doing the following:
data all_intervals;
set have(keep= id start_date end_date);
_start = start_date; output;
_end = start_date-1; output;
_end = end_date; output;
if end_date < '31DEC9999'd then do;
_start = end_date+1; output;
end;
run;
proc sql;
create table all_intervals as
select distinct t1.id, t1._start, t2._end
from all_intervals t1, all_intervals t2
where t1.id = t2.id and t2._end > t1._start
;
quit;
data all_intervals;
set all_intervals;
by id _start;
if first.id or first._start;
run;
proc sql noprint;
select 'max(t1.'||NAME||') as '||NAME into :attributes separated by ','
from sashelp.vcolumn
where libname = "WORK" and memname = "HAVE" and upcase(name) not in ('ID', "_START", "_END")
;
quit;
proc sql;
create table merge as
select t2.id, t2._start as start_date, t2._end as end_date, &attributes.
from have t1 right join all_intervals t2
on t1.id = t2.id
and ((t2._start <= t1.start_date <= t2._end)
or (t2._start <= t1.end_date <= t2._end)
or (t2._start >= t1.start_date and t2._end <= t1.end_date))
group by t2.id, t2._start
order by t2.id, t2._start
;
quit;
proc sort data=merge out=want nodupkey; by id start_date end_date; run;
The above produces the expected output.

Grouping child items and displaying parent sum

I have the following table
+-------+--------+---------+
| group | item | value |
+-------+--------+---------+
| 1 | a | 10 |
| 1 | b | 20 |
| 2 | b | 30 |
| 2 | c | 40 |
+-------+--------+---------+
I would like to group the table by group, insert the grouped sum into value, and then ungroup:
+-------+--------+
| item | value |
+-------+--------+
| 1 | 30 |
| a | 10 |
| b | 20 |
| 2 | 70 |
| b | 30 |
| c | 40 |
+-------+--------+
The purpose of the result is to interpret the first column as items a and b belonging to group 1 with sum 30 and items b and c belonging to group 2 with sum 70.
Such a data transformation can be indicative of a reporting requirement more than a useful data structure for downstream processing. Proc REPORT can create output in the form desired.
data have;
infile datalines;
input group $ item $ value @@;
datalines;
1 a 10 1 b 20 2 b 30 2 c 40
;
proc report data=have;
column group item value;
define group / order order=data noprint;
break before group / summarize;
compute item;
if missing(item) then item=group;
endcomp;
run;
I assume that both group and item are character variables:
data have;
infile datalines firstobs=4 dlm='|';
input group $ item $ value;
datalines;
+-------+--------+---------+
| group | item | value |
+-------+--------+---------+
| 1 | a | 10 |
| 1 | b | 20 |
| 2 | b | 30 |
| 2 | c | 40 |
+-------+--------+---------+
;
data want (keep=group value);
do _N_=1 by 1 until (last.group);
set have;
by group;
v + value;
end;
value = v;output;v=0;
do _N_=1 to _N_;
set have;
group = item;
output;
end;
run;

Proc sql and macro variables

I am trying to run code that should work on tables built from different factors. As there can be more than one factor, I decided to list them in a macro variable with %let:
%let list= factor1 factor2 ...;
What I would like to do is run a code to create these tables using different factors. For each factor, I computed using proc means the mean and the standard deviation, so I should have the variables &list._mean and &list._stddev in the table created by the proc means for each factor. This table is labelled as t2 and I need to join to another table, t1. From t1 I am considering all the variables.
My main difficulties are, therefore, in the proc sql:
proc sql;
create table new_table as
select t1.*
, t2.&list._mean as mean
, t2.&list._stddev as stddev
from table1 as t1
left join table2 as t2
on t1.time=t2.time
order by t2.&list.
quit;
This code returns an error, and I think it is because &list. resolves to factor1 factor2, so the t2. prefix is only applied to the first factor, not to the second one.
What I would expect is the following:
proc sql;
create table new_table as
select t1.*
, t2.factor1._mean as mean
, t2.factor1._stddev as stddev
from table1 as t1
left join table2 as t2
on t1.time=t2.time
order by t2.factor1.
quit;
and another one for factor2.
UPDATE CODE:
%macro test_v1(
_dtb
,_input
,_output
,_time
,_factor
);
data &_input.;
set &_dtb..&_input.;
keep &_col_period. &_factor.;
run;
proc sort data = work.&_input.
out = &_input._1;
by &_factor. &_time.;
run;
%put ERROR: 2
proc means data=&_input._1 nonobs mean stddev;
class &_time.;
var &_factor.;
output out=&_input._n (drop=_TYPE_) mean= stddev= /autoname ;
run;
%put ERROR: 3
proc sql;
create table work.&_input._data as
select t1.*
,t2.&_factor._mean as mean
,t2.&_factor._stddev as stddev
from &_input. as t1
left join &_input._n as t2
on t1.&_time.=t2.&_time.
order by &_factor.;
quit;
%mend test_v1;
My question is then how to handle multiple factors, defined as a list in a macro variable, both as table columns and as input to a macro (for example: %test(dataset, tablename, list)).
I suspect that trying to use PROC SQL is what is making the problem hard. If you stick to just using normal SAS syntax your space delimited list of variable names is easy to use.
So taking your code and tweaking it a little:
%macro test_v1
(_dtb /* Input libref */
,_input /* Input member name */
,_output /* Output dataset */
,_time /* Class/By variable(s) */
,_factor /* Analysis variable(s) */
);
proc sort data= &_dtb..&_input. out=_temp1;
by &_time. ;
run;
proc means data=_temp1 nonobs mean stddev;
by &_time.;
var &_factor.;
output out=_temp2 (drop=_TYPE_) mean= stddev= /autoname ;
run;
data &_output. ;
merge _temp1 _temp2 ;
by &_time.;
run;
%mend test_v1;
We can then test it using SASHELP.CLASS by using SEX as the "time" variable and HEIGHT and WEIGHT as the analysis variables.
%test_v1(_dtb=sashelp,_input=class,_output=want,_time=sex,_factor=height weight);
You can add a macro loop to your macro that scans the list of factors. It could look like this:
%macro test(list);
%do i=1 %to %sysfunc(countw(&list,%str( )));
%let factorname=%scan(&list,&i,%str( ));
/* if macro variable list equals factor1 factor2 then there would be
two iterations in loop: i=1 factorname=factor1 and i=2 factorname=factor2 */
/*your code here*/
%end;
%mend test;
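To check how the loop resolves names before plugging in the real steps, a small trial macro that only writes to the log can help. This is a sketch; the macro name show_factors is hypothetical and the factor names are the placeholders from the question.

```sas
%macro show_factors(list);
  %local i factorname;
  %do i = 1 %to %sysfunc(countw(&list, %str( )));
    %let factorname = %scan(&list, &i, %str( ));
    /* The dot ends the macro variable name, so &factorname._mean */
    /* resolves to factor1_mean, not factor1._mean.               */
    %put NOTE: iteration &i uses &factorname._mean and &factorname._stddev;
  %end;
%mend show_factors;

%show_factors(factor1 factor2);
```

The resolved names match what the autoname option of proc means generates (variable name, underscore, statistic), so they can be used directly in the select list.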
UPDATE:
%macro test(_input, _output, factors_list); %macro d; %mend d;
%do i=1 %to %sysfunc(countw(&factors_list,%str( )));
%let tfactor=%scan(&factors_list,&i,%str( ));
proc sort data = work.&_input.
out = &_input._1;
by &factors_list. time;
run;
proc means data=&_input._1 nonobs mean stddev;
class time;
var &tfactor.;
output out=&_input._num (drop=_TYPE_) mean= stddev= /autoname ;
run;
proc sql;
create table &_output._&tfactor as
select t1.*
, t2.&tfactor._mean as mean
, t2.&tfactor._stddev as stddev
from &_input as t1
left join &_input._num as t2
on t1.time=t2.time
order by t1.&tfactor;
quit;
%end;
%mend test;
%test(have,newdata,factor1 factor2);
Have dataset:
+------+---------+---------+
| time | factor1 | factor2 |
+------+---------+---------+
| 1 | 12345 | 1234 |
| 2 | 123 | 12 |
| 3 | 1 | -1 |
| 4 | -12 | -123 |
| 5 | -1234 | -12345 |
| 6 | 9876 | 987 |
| 7 | 98 | 8 |
| 8 | 9 | 7 |
| 1 | 1234 | 123 |
| 2 | 12 | 1 |
| 3 | 12 | -12 |
| 4 | -123 | -1234 |
| 5 | -12345 | -123456 |
| 6 | 987 | 98 |
| 7 | 9 | -9 |
| 8 | 1234 | 1234 |
+------+---------+---------+
NEWDATA_FACTOR1:
+------+---------+---------+---------+--------------+
| time | factor1 | factor2 | mean | stddev |
+------+---------+---------+---------+--------------+
| 5 | -12345 | -123456 | -6789.5 | 7856.6634458 |
| 5 | -1234 | -12345 | -6789.5 | 7856.6634458 |
| 4 | -123 | -1234 | -67.5 | 78.488852712 |
| 4 | -12 | -123 | -67.5 | 78.488852712 |
| 3 | 1 | -1 | 6.5 | 7.7781745931 |
| 7 | 9 | -9 | 53.5 | 62.932503526 |
| 8 | 9 | 7 | 621.5 | 866.20580695 |
| 3 | 12 | -12 | 6.5 | 7.7781745931 |
| 2 | 12 | 1 | 67.5 | 78.488852712 |
| 7 | 98 | 8 | 53.5 | 62.932503526 |
| 2 | 123 | 12 | 67.5 | 78.488852712 |
| 6 | 987 | 98 | 5431.5 | 6285.472178 |
| 1 | 1234 | 123 | 6789.5 | 7856.6634458 |
| 8 | 1234 | 1234 | 621.5 | 866.20580695 |
| 6 | 9876 | 987 | 5431.5 | 6285.472178 |
| 1 | 12345 | 1234 | 6789.5 | 7856.6634458 |
+------+---------+---------+---------+--------------+
NEWDATA_FACTOR2:
+------+---------+---------+----------+--------------+
| time | factor1 | factor2 | mean | stddev |
+------+---------+---------+----------+--------------+
| 5 | -12345 | -123456 | -67900.5 | 78567.341564 |
| 5 | -1234 | -12345 | -67900.5 | 78567.341564 |
| 4 | -123 | -1234 | -678.5 | 785.5956339 |
| 4 | -12 | -123 | -678.5 | 785.5956339 |
| 3 | 12 | -12 | -6.5 | 7.7781745931 |
| 7 | 9 | -9 | -0.5 | 12.02081528 |
| 3 | 1 | -1 | -6.5 | 7.7781745931 |
| 2 | 12 | 1 | 6.5 | 7.7781745931 |
| 8 | 9 | 7 | 620.5 | 867.62002052 |
| 7 | 98 | 8 | -0.5 | 12.02081528 |
| 2 | 123 | 12 | 6.5 | 7.7781745931 |
| 6 | 987 | 98 | 542.5 | 628.61792847 |
| 1 | 1234 | 123 | 678.5 | 785.5956339 |
| 6 | 9876 | 987 | 542.5 | 628.61792847 |
| 1 | 12345 | 1234 | 678.5 | 785.5956339 |
| 8 | 1234 | 1234 | 620.5 | 867.62002052 |
+------+---------+---------+----------+--------------+

Merging as of a date

I am trying to merge two tables. Table A has an id column, a date column, and an amount value for every date in a period.
Table B has both id and date, but also other columns with details. However, there is only one entry any time there is a change in the details, so I do not know how to merge with normal joins. I want that for every entry in A, the details are populated as of the latest day available in B for that ID before the date in A.
Table A
| ID | date | amount |
| 1 | 01JAN| 56 |
| 1 | 02JAN| 54 |
| 1 | 03JAN| 23 |
| 1 | 04JAN| 43 |
Table B
| ID | date | details|
| 1 | 01JAN| x |
| 1 | 03JAN| y |
Wanted Output
Table A
| ID | date | amount | details |
| 1 | 01JAN| 56 | x |
| 1 | 02JAN| 54 | x |
| 1 | 03JAN| 23 | y |
| 1 | 04JAN| 43 | y |
For the 02JAN entry, the latest details available as of that date are 'x'; for 03JAN they are 'y'.
Thank you in advance for any guidance you could provide
This will work for the question as you have literally asked it:
data want;
retain details_last;
merge table1 table2;
by ID date;
if not missing(details) then details_last = details;
else details = details_last;
drop details_last;
run;
But this will only work if your data meet the conditions you have presented: for each ID, the dates in table B must fall within the date range of table A, not outside it (i.e. only interpolation, no extrapolation).
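If the dates in table B do not line up with the dates in table A, a join on "latest B date on or before the A date" is one possible alternative. The following is a sketch, not the method from the answer above; it reuses the table1 and table2 names from that answer and assumes date is a numeric SAS date in both tables. It relies on the PROC SQL remerge behaviour, so the log will contain a note about remerging summary statistics.

```sas
proc sql;
  create table want as
  select a.id, a.date, a.amount, b.details
  from table1 as a
  left join table2 as b
    on a.id = b.id
   and b.date <= a.date          /* only B rows on or before the A date */
  group by a.id, a.date
  having b.date = max(b.date)    /* keep the most recent of those rows  */
  order by a.id, a.date;
quit;
```

Rows of table1 with no earlier entry in table2 are kept with missing details, because SAS treats two missing values as equal in the having comparison. If table2 holds several rows for the same id and date, each of them would be matched.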