I have the following table
+-------+--------+---------+
| group | item | value |
+-------+--------+---------+
| 1 | a | 10 |
| 1 | b | 20 |
| 2 | b | 30 |
| 2 | c | 40 |
+-------+--------+---------+
I would like to group the table by group, insert the grouped sum into value, and then ungroup:
+-------+--------+
| item | value |
+-------+--------+
| 1 | 30 |
| a | 10 |
| b | 20 |
| 2 | 70 |
| b | 30 |
| c | 40 |
+-------+--------+
The purpose of the result is to interpret the first column as items a and b belonging to group 1 with sum 30 and items b and c belonging to group 2 with sum 70.
Such a data transformation can be indicative of a reporting requirement more than a useful data structure for downstream processing. Proc REPORT can create output in the form desired.
data have;
infile datalines;
input group $ item $ value @@; datalines;
1 a 10 1 b 20 2 b 30 2 c 40
;
proc report data=have;
column group item value;
define group / order order=data noprint;
break before group / summarize;
compute item;
if missing(item) then item=group;
endcomp;
run;
I assume that both group and item are character variables:
data have;
infile datalines firstobs=4 dlm='|';
input group $ item $ value;
datalines;
+-------+--------+---------+
| group | item | value |
+-------+--------+---------+
| 1 | a | 10 |
| 1 | b | 20 |
| 2 | b | 30 |
| 2 | c | 40 |
+-------+--------+---------+
;
data want (keep=group value);
do _N_=1 by 1 until (last.group);
set have;
by group;
v + value;
end;
value = v;output;v=0;
do _N_=1 to _N_;
set have;
group = item;
output;
end;
run;
I have a table in SAS dataset that looks like this:
proc sql;
create table my_table
(id char(1),
my_date num format=date9.,
my_col num);
insert into my_table
values('A','01JAN2010'd,.)
values('A','02JAN2010'd,0)
values('A','03DEC2009'd,1)
values('A','04NOV2009'd,1)
values('B','01JAN2010'd,.)
values('B','02NOV2009'd,2)
values('C','01JAN2010'd,.)
values('C','02OCT2009'd,3)
values('D','01JAN2010'd,.)
values('D','02NOV2009'd,2)
values('D','03OCT2009'd,1)
values('D','04AUG2009'd,2)
values('D','05MAY2009'd,3)
values('D','06APR2009'd,1);
quit;
I am trying to create a new column desired that, for each group of id column, flags the row with a value of 1 if the value in my_col is missing or less than 3.
The part I'm having trouble with is that when there is a my_col value that is greater than 2, I need the desired value for that row to be missing and also stop flagging any remaining rows in the id group with a value of 1.
The resulting dataset should look like this:
+----+-----------+--------+---------+
| id | my_date | my_col | desired |
+----+-----------+--------+---------+
| A | 01JAN2010 | . | 1 |
| A | 02JAN2010 | 0 | 1 |
| A | 03DEC2009 | 1 | 1 |
| A | 04NOV2009 | 1 | 1 |
| B | 01JAN2010 | . | 1 |
| B | 02NOV2009 | 2 | 1 |
| C | 01JAN2010 | . | 1 |
| C | 02OCT2009 | 3 | . |
| D | 01JAN2010 | . | 1 |
| D | 02NOV2009 | 2 | 1 |
| D | 03OCT2009 | 1 | 1 |
| D | 04AUG2009 | 2 | 1 |
| D | 05MAY2009 | 3 | . |
| D | 06APR2009 | 1 | . |
+----+-----------+--------+---------+
Looks like a simple application of a retained variable. Set the flag to 1 when you start a new group and then set it to missing when the value of MY_COL is larger than 2.
data want;
set my_table ;
by id;
if first.id then desired=1;
if my_col>2 then desired=.;
retain desired;
run;
Also it is not clear why you used such complicated code to create your example data. Why not a simple data step?
data my_table;
input id :$1. my_date :date. my_col;
format my_date date9.;
cards;
A 01JAN2010 .
A 02JAN2010 0
A 03DEC2009 1
A 04NOV2009 1
B 01JAN2010 .
B 02NOV2009 2
C 01JAN2010 .
C 02OCT2009 3
D 01JAN2010 .
D 02NOV2009 2
D 03OCT2009 1
D 04AUG2009 2
D 05MAY2009 3
D 06APR2009 1
;
I can't think of a simpler way to do it, but this works. You will need to have your data sorted by id.
data my_table2;
set my_table;
by id;
format gt2flag $1.;
retain gt2flag;
if first.id then gt2flag='';
if my_col gt 2 then gt2flag='Y';
if gt2flag = 'Y' then desired=.;
else desired=1;
drop gt2flag;
run;
id my_date my_col desired
A 01JAN2010 . 1
A 02JAN2010 0 1
A 03DEC2009 1 1
A 04NOV2009 1 1
B 01JAN2010 . 1
B 02NOV2009 2 1
C 01JAN2010 . 1
C 02OCT2009 3 .
D 01JAN2010 . 1
D 02NOV2009 2 1
D 03OCT2009 1 1
D 04AUG2009 2 1
D 05MAY2009 3 .
D 06APR2009 1 .
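Since both BY-group steps above rely on the rows being grouped by id while keeping their original order inside each group, a sort along the following lines may be needed first. This is only a sketch: the dataset names demo and demo_sorted are made up for illustration, and it relies on the EQUALS option of PROC SORT (the default), which keeps rows with equal BY values in their original relative order.

```sas
/* Hypothetical unsorted sample: ids interleaved, row order within an id matters */
data demo;
  input id :$1. my_col;
  datalines;
B .
A .
B 3
A 0
;

/* EQUALS keeps rows that share an id in their original relative order */
proc sort data=demo out=demo_sorted equals;
  by id;
run;
```

After the sort, the two A rows come first and the two B rows follow, each pair in the order it had in demo, so a retained flag would still see the rows in the intended sequence.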
I'm looking to add a sequence column to my SAS dataset, based on IDs and transaction dates. To illustrate, below is the table I'm referring to:
ID | TXN_DT |
01 | 01JAN2020 |
01 | 01JAN2020 |
01 | 02JAN2020 |
01 | 03JAN2020 |
02 | 01JAN2020 |
02 | 02JAN2020 |
02 | 02JAN2020 |
02 | 03JAN2020 |
02 | 03JAN2020 |
and I want to add a sequence like so:
ID | TXN_DT | SEQ |
01 | 01JAN2020 | 1 |
01 | 01JAN2020 | 1 |
01 | 02JAN2020 | 2 |
01 | 03JAN2020 | 3 |
02 | 01JAN2020 | 1 |
02 | 02JAN2020 | 2 |
02 | 02JAN2020 | 2 |
02 | 03JAN2020 | 3 |
02 | 03JAN2020 | 3 |
I'm trying to run the following code, but when a date repeats it does not copy the previous row's value; it seems to pick up the value from two rows above instead.
data want;
set have;
by id;
if first.id then seq=1;
else seq+1;
if txn_dt=lag(txn_dt) then seq = lag(seq);
keep id seq txn_dt;
run;
any help? Thanks in advance!
Try
if first.id then seq=0;
seq + (first.id or txn_dt ne lag(txn_dt));
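For context, here is how those two statements might sit in a complete step. This is a sketch under a few assumptions: the sample data is retyped from the question, it is already sorted by id and txn_dt, and lag() is called once per row into a helper variable prev_dt (a name introduced here), because lag() only returns the previous row's value when it executes on every observation.

```sas
/* Sample data retyped from the question */
data have;
  input id $ txn_dt :date9.;
  format txn_dt date9.;
  datalines;
01 01JAN2020
01 01JAN2020
01 02JAN2020
01 03JAN2020
02 01JAN2020
02 02JAN2020
02 02JAN2020
02 03JAN2020
02 03JAN2020
;

data want;
  set have;
  by id;
  /* Call LAG on every row so its queue always holds the previous row's date */
  prev_dt = lag(txn_dt);
  /* Restart the counter at the start of each id */
  if first.id then seq = 0;
  /* Sum statement: adds 1 on a new id or a new date, 0 otherwise */
  seq + (first.id or txn_dt ne prev_dt);
  drop prev_dt;
run;
```

Calling lag() inside a condition, as in the question, skips rows and leaves stale values in its queue, which is the likely cause of the "two rows above" behaviour.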
Try using RETAIN together with FIRST.id:
data want(drop=txn_dt_group);
set have;
by id txn_dt;
retain txn_dt_group seq;
if first.id then do;
txn_dt_group=txn_dt;
seq=1;
end;
if txn_dt ne txn_dt_group then do;
seq=seq+1;
txn_dt_group=txn_dt;
end;
run;
Output:
+-----------+----+-----+
| txn_dt | ID | seq |
+-----------+----+-----+
| 01JAN2020 | 1 | 1 |
| 01JAN2020 | 1 | 1 |
| 02JAN2020 | 1 | 2 |
| 03JAN2020 | 1 | 3 |
| 01JAN2020 | 2 | 1 |
| 02JAN2020 | 2 | 2 |
| 02JAN2020 | 2 | 2 |
| 03JAN2020 | 2 | 3 |
| 03JAN2020 | 2 | 3 |
+-----------+----+-----+
data want;
set have;
by id txn_dt;
if first.id then seq=1;
else if first.txn_dt then seq+1;
run;
I think that should do it.
For completeness, here is a hash solution that does not depend on the order of your data.
data have;
infile datalines dlm='|';
input ID $ TXN_DT :date9.;
format TXN_DT date9.;
datalines;
01|01JAN2020
01|01JAN2020
01|02JAN2020
01|03JAN2020
02|01JAN2020
02|02JAN2020
02|02JAN2020
02|03JAN2020
02|03JAN2020
;
data want(drop=rc);
if _N_ = 1 then do;
dcl hash h1 ();
h1.definekey ('ID', 'TXN_DT');
h1.definedata ('SEQ');
h1.definedone ();
dcl hash h2 ();
h2.definekey ('ID');
h2.definedata ('SEQ');
h2.definedone ();
do until (lr);
set have end=lr;
if h2.find() = 0 then do;
if h1.check() ne 0 then seq + 1;
end;
else seq = 1;
h1.ref();
h2.replace();
end;
end;
set have;
rc = h1.find();
run;
Asked on SAS Communities as well, but I haven't gotten a correct response.
https://communities.sas.com/t5/SAS-Programming/Identifying-overlap-medication-use/m-p/628115#M185541
I have a problem similar to the one in:
https://communities.sas.com/t5/SAS-Programming/Concomitant-drug-medication-use/m-p/339879#M77587
However, my data also has overlapping prescriptions of the same drug, e.g.:
+----+------+-----------+-----------+-----------+
| ID | DRUG | START_DT | DAYS_SUPP | END_DT |
+----+------+-----------+-----------+-----------+
| 1 | A | 2/17/2010 | 30 | 3/19/2010 |
| 1 | A | 3/17/2010 | 30 | 4/16/2010 |
| 1 | A | 4/12/2010 | 30 | 5/12/2010 |
| 1 | A | 8/20/2010 | 30 | 9/19/2010 |
| 1 | B | 5/6/2009 | 30 | 6/5/2009 |
+----+------+-----------+-----------+-----------+
Here the first three A prescriptions overlap.
So using the code in the link gives me combinations like A-A-B, which I don't want.
Instead, I want to account for the overlapping days for drug A. So I want to shift the second prescription to run from 3/20/2010 to 4/19/2010, and similarly for the third A prescription.
The code I have tried:
data have2;
set have_sorted1;
format NEW_START_DT NEW_END_DT _lagEND_DT date9.;
_lagID = lag(patient_ID);
_lagDRUG = lag(drg_cls);
_lagEND_DT = lag(rx_ed_dt);
if patient_ID = _lagID and drg_cls= _lagDRUG and rx_st_dt <= _lagEND_DT then flag=1;
else flag = 0;
retain NEW_START_DT NEW_END_DT;
if flag=0 then do;
NEW_START_DT = rx_st_dt;
NEW_END_DT = rx_ed_dt;
end;
else do;
New_start_dt = NEW_End_DT + 1;
NEW_END_DT = new_start_dt + DAY_SUPP ;
end;
/* drop flag _:;*/
run;
But even then I get an incorrect result:
id Drug drug_start day_supp drug_end New_start New_end
15 A 6-Sep-15 30 5-Oct-15 6-Sep-15 5-Oct-15
15 A 24-Sep-15 90 22-Dec-15 6-Oct-15 4-Jan-16
15 A 6-Dec-15 90 4-Mar-16 5-Jan-16 4-Apr-16
15 A 26-Feb-16 90 25-May-16 5-Apr-16 4-Jul-16
15 A 29-May-16 90 26-Aug-16 29-May-16 26-Aug-16
15 A 7-Dec-16 90 6-Mar-17 7-Dec-16 6-Mar-17
15 A 17-Feb-17 90 17-May-17 7-Mar-17 5-Jun-17
It might be easier to track the 'flag' state implicitly in a shift variable that tracks how many days to shift forward.
Example:
Shift is always applied, but will be zero when no overlap occurs. The prior end, after computation, is tracked in a retained variable. The code does not need to rely on LAG.
data have;
infile cards firstobs=3 dlm='|';
input ID DRUG: $ START_DT: mmddyy10. DAYS_SUPP END_DT: mmddyy10.;
format start_dt end_dt mmddyy10.;
datalines;
| ID | DRUG | START_DT | DAYS_SUPP | END_DT |
+----+------+-----------+-----------+-----------+
| 1 | A | 2/17/2010 | 30 | 3/19/2010 |
| 1 | A | 3/17/2010 | 30 | 4/16/2010 |
| 1 | A | 4/12/2010 | 30 | 5/12/2010 |
| 1 | A | 8/20/2010 | 30 | 9/19/2010 |
| 1 | B | 5/6/2009 | 30 | 6/5/2009 |
;
data want;
set have;
by id drug;
retain shift prior_shifted_end;
select;
when (first.drug) shift = 0;
when (prior_shifted_end > start_dt) shift = prior_shifted_end - start_dt + 1;
otherwise shift = 0;
end;
original_start_dt = start_dt;
original_end_dt = end_dt;
start_dt + shift;
end_dt + shift;
prior_shifted_end = end_dt;
format prior: original: mmddyy10.;
run;
I am trying to write code that works on tables built from different factors. Since there can be more than one factor, I decided to list them in a macro variable using %let:
%let list= factor1 factor2 ...;
What I would like to do is run code that creates these tables for different factors. For each factor, I used proc means to compute the mean and the standard deviation, so the table created by proc means should contain the variables &list._mean and &list._stddev for each factor. This table is aliased as t2, and I need to join it to another table, t1, from which I take all the variables.
My main difficulties are, therefore, in the proc sql:
proc sql;
create table new_table as
select t1.*
, t2.&list._mean as mean
, t2.&list._stddev as stddev
from table1 as t1
left join table2 as t2
on t1.time=t2.time
order by t2.&list.
quit;
This code returns an error, I think because &list. resolves to factor1 factor2, so the t2. prefix applies only to the first factor and not to the second.
What I would expect is the following:
proc sql;
create table new_table as
select t1.*
, t2.factor1._mean as mean
, t2.factor1._stddev as stddev
from table1 as t1
left join table2 as t2
on t1.time=t2.time
order by t2.factor1.
quit;
and another one for factor2.
UPDATE CODE:
%macro test_v1(
_dtb
,_input
,_output
,_time
,_factor
);
data &_input.;
set &_dtb..&_input.;
keep &_col_period. &_factor.;
run;
proc sort data = work.&_input.
out = &_input._1;
by &_factor. &_time.;
run;
%put ERROR: 2;
proc means data=&_input._1 nonobs mean stddev;
class &_time.;
var &_factor.;
output out=&_input._n (drop=_TYPE_) mean= stddev= /autoname ;
run;
%put ERROR: 3;
proc sql;
create table work.&_input._data as
select t1.*
,t2.&_factor._mean as mean
,t2.&_factor._stddev as stddev
from &_input. as t1
left join &_input._n as t2
on t1.&_time.=t2.&_time.
order by &_factor.;
quit;
%mend test_v1;
So my question is how I can handle multiple factors, defined as a list in a macro variable, both as columns of tables and as input to a macro (for example: %test(dataset, tablename, list)).
I suspect that trying to use PROC SQL is what is making the problem hard. If you stick to just using normal SAS syntax your space delimited list of variable names is easy to use.
So taking your code and tweaking it a little:
%macro test_v1
(_dtb /* Input libref */
,_input /* Input member name */
,_output /* Output dataset */
,_time /* Class/By variable(s) */
,_factor /* Analysis variable(s) */
);
proc sort data= &_dtb..&_input. out=_temp1;
by &_time. ;
run;
proc means data=_temp1 nonobs mean stddev;
by &_time.;
var &_factor.;
output out=_temp2 (drop=_TYPE_) mean= stddev= /autoname ;
run;
data &_output. ;
merge _temp1 _temp2 ;
by &_time.;
run;
%mend test_v1;
We can then test it using SASHELP.CLASS by using SEX as the "time" variable and HEIGHT and WEIGHT as the analysis variables.
%test_v1(_dtb=sashelp,_input=class,_output=want,_time=sex,_factor=height weight);
You can add a macro loop to your macro that scans the list of factors. It could look like:
%macro test(list);
%do i=1 %to %sysfunc(countw(&list,%str( )));
%let factorname=%scan(&list,&i,%str( ));
/* if macro variable list equals factor1 factor2 then there would be
two iterations in loop, i=1 factorname=factor1 and i=2 factorname=factor2*/
/*your code here*/
%end;
%mend test;
UPDATE:
%macro test(_input, _output, factors_list); %macro d; %mend d;
%do i=1 %to %sysfunc(countw(&factors_list,%str( )));
%let tfactor=%scan(&factors_list,&i,%str( ));
proc sort data = work.&_input.
out = &_input._1;
by &factors_list. time;
run;
proc means data=&_input._1 nonobs mean stddev;
class time;
var &tfactor.;
output out=&_input._num (drop=_TYPE_) mean= stddev= /autoname ;
run;
proc sql;
create table &_output._&tfactor as
select t1.*
, t2.&tfactor._mean as mean
, t2.&tfactor._stddev as stddev
from &_input as t1
left join &_input._num as t2
on t1.time=t2.time
order by t1.&tfactor;
quit;
%end;
%mend test;
%test(have,newdata,factor1 factor2);
Have dataset:
+------+---------+---------+
| time | factor1 | factor2 |
+------+---------+---------+
| 1 | 12345 | 1234 |
| 2 | 123 | 12 |
| 3 | 1 | -1 |
| 4 | -12 | -123 |
| 5 | -1234 | -12345 |
| 6 | 9876 | 987 |
| 7 | 98 | 8 |
| 8 | 9 | 7 |
| 1 | 1234 | 123 |
| 2 | 12 | 1 |
| 3 | 12 | -12 |
| 4 | -123 | -1234 |
| 5 | -12345 | -123456 |
| 6 | 987 | 98 |
| 7 | 9 | -9 |
| 8 | 1234 | 1234 |
+------+---------+---------+
NEWDATA_FACTOR1:
+------+---------+---------+---------+--------------+
| time | factor1 | factor2 | mean | stddev |
+------+---------+---------+---------+--------------+
| 5 | -12345 | -123456 | -6789.5 | 7856.6634458 |
| 5 | -1234 | -12345 | -6789.5 | 7856.6634458 |
| 4 | -123 | -1234 | -67.5 | 78.488852712 |
| 4 | -12 | -123 | -67.5 | 78.488852712 |
| 3 | 1 | -1 | 6.5 | 7.7781745931 |
| 7 | 9 | -9 | 53.5 | 62.932503526 |
| 8 | 9 | 7 | 621.5 | 866.20580695 |
| 3 | 12 | -12 | 6.5 | 7.7781745931 |
| 2 | 12 | 1 | 67.5 | 78.488852712 |
| 7 | 98 | 8 | 53.5 | 62.932503526 |
| 2 | 123 | 12 | 67.5 | 78.488852712 |
| 6 | 987 | 98 | 5431.5 | 6285.472178 |
| 1 | 1234 | 123 | 6789.5 | 7856.6634458 |
| 8 | 1234 | 1234 | 621.5 | 866.20580695 |
| 6 | 9876 | 987 | 5431.5 | 6285.472178 |
| 1 | 12345 | 1234 | 6789.5 | 7856.6634458 |
+------+---------+---------+---------+--------------+
NEWDATA_FACTOR2:
+------+---------+---------+----------+--------------+
| time | factor1 | factor2 | mean | stddev |
+------+---------+---------+----------+--------------+
| 5 | -12345 | -123456 | -67900.5 | 78567.341564 |
| 5 | -1234 | -12345 | -67900.5 | 78567.341564 |
| 4 | -123 | -1234 | -678.5 | 785.5956339 |
| 4 | -12 | -123 | -678.5 | 785.5956339 |
| 3 | 12 | -12 | -6.5 | 7.7781745931 |
| 7 | 9 | -9 | -0.5 | 12.02081528 |
| 3 | 1 | -1 | -6.5 | 7.7781745931 |
| 2 | 12 | 1 | 6.5 | 7.7781745931 |
| 8 | 9 | 7 | 620.5 | 867.62002052 |
| 7 | 98 | 8 | -0.5 | 12.02081528 |
| 2 | 123 | 12 | 6.5 | 7.7781745931 |
| 6 | 987 | 98 | 542.5 | 628.61792847 |
| 1 | 1234 | 123 | 678.5 | 785.5956339 |
| 6 | 9876 | 987 | 542.5 | 628.61792847 |
| 1 | 12345 | 1234 | 678.5 | 785.5956339 |
| 8 | 1234 | 1234 | 620.5 | 867.62002052 |
+------+---------+---------+----------+--------------+
I have a database with 3 columns. ID, Date and amount. It is ordered by ID and Date. All I want to do is to add a row after the latest occurrence of every ID with the same ID, Date = Date + 1 Month and Amount = 0.
As an Illustration I want to go from this:
id | Date |amount |
A | 01JAN| 1 |
A | 01FEB| 1 |
B | 01FEB| 0 |
B | 01MAR| 1 |
to this:
id | Date |amount |
A | 01JAN| 1 |
A | 01FEB| 1 |
A | 01MAR| 0 | <- ADD THIS ROW
B | 01FEB| 0 |
B | 01MAR| 1 |
B | 01APR| 0 |<- ADD THIS ROW
I know I should use intnx, but beyond that I don't really know what to do. I appreciate any input.
Assuming that the DATE variable has actual date values in it you just need to output twice on the last observation in each group.
data want;
set have;
by id;
output;
if last.id then do;
date=intnx('month',date,1,'b');
amount=0;
output;
end;
run;
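One detail worth checking in the step above is the fourth argument of INTNX. With 'b' the new date lands on the first day of the next month, which matches the sample data because every date there is already the first of a month. If the real dates can fall mid-month, the 's' (same) alignment may be closer to "Date + 1 month". A small sketch, using a made-up date of 15JAN2020, shows the difference in the log:

```sas
data _null_;
  d = '15JAN2020'd;                     /* hypothetical mid-month date */
  next_b = intnx('month', d, 1, 'b');   /* beginning of the next month */
  next_s = intnx('month', d, 1, 's');   /* same day of the next month */
  put next_b= date9. next_s= date9.;
run;
```

Which alignment is right depends on how the dates in have are stored, so treat this as something to verify against the actual data rather than a required change.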