I have a big panel dataset that looks somewhat like this:
data have;
input id t a b ;
datalines;
1 1 0 0
1 2 0 0
1 3 1 0
1 4 0 0
1 5 0 1
1 6 1 0
1 7 0 0
1 8 0 0
1 9 0 0
1 10 0 1
2 1 0 0
2 2 1 0
2 3 0 0
2 4 0 0
2 5 0 1
2 6 0 1
2 7 0 1
2 8 0 1
2 9 1 0
2 10 0 1
3 1 0 0
3 2 0 0
3 3 0 0
3 4 0 0
3 5 0 0
3 6 0 0
3 7 1 0
3 8 0 0
3 9 0 0
3 10 0 0
;
run;
For every ID I want to record all 'trigger' events, namely when a=1, and then I need to know how long it takes until the next occurrence of b=1. The final output should then give me the following:
data want;
input id a_no a_t b_t diff ;
datalines;
1 1 3 5 2
1 2 6 10 4
2 1 2 5 3
2 2 9 10 1
3 1 7 . .
;
run;
It is of course no problem to get all a=1 and b=1 events, but as it is a very big dataset with a lot of both events for every ID, I am searching for an elegant and straightforward solution. Any ideas?
Here's a fairly simple SQL approach that gives more or less the desired output:
proc sql;
create table want
as select
t1.id,
t1.t as a_t,
t2.t as b_t,
t2.t - t1.t as diff
from
have(where = (a=1)) t1
left join
have(where = (b=1)) t2
on
t1.id = t2.id
and t2.t > t1.t
group by t1.id, t1.t
having diff = min(diff)
;
quit;
The only part missing is a_no. This sort of row-incrementing ID is quite a lot of work to generate consistently in SQL, but it's trivial with an extra data step:
data want;
set want;
by id;
if first.id then a_no = 0;
a_no + 1;
run;
An elegant DATA step way can use nested DOW loops. It's straightforward once you understand DOW loops.
data want(keep=id--diff);
length id a_no a_t b_t diff 8;
do until (last.id); * process each group;
do a_no = 1 by 1 until(last.id); * counter for each output;
do until ( output_condition or last.id); * process each triggering state change, never reading past the current id;
SET have; * read data;
by id; * setup first. last. variables for group;
if a=1 then a_t = t; * detect and record start of trigger state;
output_condition = (b=1 and t > a_t > 0); * evaluate for proper end of trigger state;
end;
if output_condition then do;
b_t = t; * compute remaining info at output point;
diff = b_t - a_t;
OUTPUT;
a_t = .; * reset trigger state tracking variables;
b_t = .;
end;
else if not missing(a_t) then
OUTPUT; * end of group reached with a trigger still waiting for b=1;
end;
end;
run;
Note: A SQL way (not shown) can use self join within groups.
Suppose I have arbitrary diagnostic codes, such as 001, v58, ..., 142, ... How can I construct one column per code that is 1 for the records where that code was found?
Input:
id found code
1 1 001
2 0 v58
3 1 v58
4 1 003
5 0 v58
......
......
15000 0 v58
Output:
id code_001 code_v58 code_003 .......
1 1 0 0
2 0 0 0
3 0 1 0
4 1 0 0
5 0 0 0
.........
.........
You will want to TRANSPOSE the values and name the pivoted columns according to data (value of code) with an ID statement.
Example:
In real world data it is often the case that missing diagnoses will be flagged zero, and that has to be done in a subsequent step.
data have;
input id found code $;
datalines;
1 1 001
2 0 v58
2 1 003 /* second diagnosis result for patient 2 */
3 1 v58
4 1 003
5 0 v58
;
proc transpose data=have out=want(drop=_name_) prefix=code_;
by id;
id code; * column name becomes <prefix><code>;
var found;
run;
* missing occurs when an id was not diagnosed with a code;
* if that means the flag should be zero (for logistic modeling perhaps)
* the missings need to be changed to zeroes;
data want;
set want;
array codes code_:;
do _n_ = 1 to dim(codes); /* repurpose automatic variable _n_ for loop index */
if missing(codes(_n_)) then codes(_n_) = 0;
end;
run;
I want to find the number of unique ids for every subset combination of the variables. For example
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
run;
I want the result to be
var1 var2 var3 count
. . 0 5
. . 1 5
. 0 . 7
. 0 0 5
. 0 1 3
. 1 . 2
. 1 1 2
0 . . 3
0 . 0 2
0 . 1 1
0 0 . 3
0 0 0 2
0 0 1 1
1 . . 7
1 . 0 4
1 . 1 4
1 0 . 5
1 0 0 4
1 0 1 2
1 1 . 2
1 1 1 2
which is the result of appending all the possible proc sql; group bys (var1 is shown below)
proc sql;
create table sub1 as
select var1, count(distinct id) as count
from have
where not missing(var1)
group by var1
;
quit;
I don't care about the case where all variables are missing or when any of the variables in the group by are missing. Is there a more efficient way of doing this?
You can use Proc SUMMARY to compute the combinations of var1-var3 values for each id by group. From the SUMMARY output a SQL query can count the distinct ids per combination.
Example:
data have;
input id var1 var2 var3;
datalines;
5 1 0 0
5 1 1 1
5 1 0 1
5 0 0 0
6 1 0 0
7 1 1 1
8 1 0 1
9 0 0 0
10 1 0 0
11 1 0 0
12 1 . 1
13 0 0 1
;
proc summary noprint missing data=have;
by id;
class var1-var3;
output out=combos;
run;
proc sql;
create table want as
select var1, var2, var3, count(distinct id) as count
from combos
group by var1, var2, var3
;
quit;
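The question also asks to leave out the case where every variable is missing, which the query above still returns as a (., ., .) row. A minimal sketch of one way to drop it, assuming the combos table from the PROC SUMMARY step and using the N function to count the non-missing values in each row (the table name want_nonmissing is only illustrative):

```sas
proc sql;
  /* keep only combinations with at least one non-missing class variable */
  create table want_nonmissing as
  select var1, var2, var3, count(distinct id) as count
  from combos
  where n(var1, var2, var3) > 0
  group by var1, var2, var3
  ;
quit;
```

Filtering in the WHERE clause removes the all-missing rows before grouping, so the distinct id counts of the remaining combinations are not affected.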
I have this database:
data temp;
input ID monitoring_date :ddmmyy10. score ;
format monitoring_date ddmmyy10.;
datalines;
1 10/11/2006 0
1 10/12/2006 0
1 15/01/2007 1
1 20/01/2007 1
1 20/04/2007 1
2 10/08/2008 0
2 11/09/2008 0
2 17/10/2008 1
2 12/11/2008 0
3 10/12/2008 0
3 10/08/2008 0
3 11/09/2008 0
3 17/10/2009 1
3 12/12/2009 1
3 05/01/2010 0
4 10/12/2006 0
4 10/08/2006 0
4 11/09/2006 0
4 17/10/2007 0
4 12/12/2007 0
4 09/04/2008 1
4 05/08/2008 1
5 10/12/2013 0
5 03/09/2013 0
5 11/09/2013 0
5 19/10/2014 0
5 10/12/2014 1
5 14/01/2015 1
6 10/12/2017 0
6 10/08/2018 0
6 11/09/2018 0
6 17/10/2018 1
6 12/12/2018 1
6 09/04/2019 1
6 25/07/2019 0
6 05/08/2019 1
6 15/09/2019 0
;
I would like to create a new dataset with a new column where I note, for each ID, the date of the first progression of the score from 0 to 1, provided this progression stays stable for at least 3 months until the end of monitoring; otherwise date_progression = . :
data want;
input ID date_progression :ddmmyy10.;
format date_progression ddmmyy10.;
datalines;
1 15/01/2007
2 .
3 .
4 09/04/2008
5 .
6 .
;
I really have no idea how to code this, and I would like to get the wanted data to fit a Cox model where the progression (Yes/No) is my event.
I am really stuck !
Thank you in advance for your help!
A DOW loop can process the ID groups, tracking a single active run of 1s. A run has a start date and a duration.
Example:
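data want;
do _n_ = 1 by 1 until (last.id);
set temp; * dataset from the question, with monitoring_date read as a SAS date;
by id;
select;
when (pscore = 0 and score = 1) do; state = 1; start = monitoring_date; dur = 0; end;
when (pscore = 1 and score = 1) do; state = 2; if not missing(start) then dur = intck('month', start, monitoring_date, 'continuous'); end;
when (pscore = 1 and score = 0) do; state = 3; start = .; dur = .; end;
when (pscore = 0 and score = 0) do; state = 4; end;
otherwise;
end;
pscore = score;
end;
* keep the start date only if the run of 1s is still active at the last visit and spans at least 3 months;
if state = 2 and dur >= 3 then progression_date = start;
keep ID progression_date;
format progression_date yymmdd10.;
run;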
data test;
input Index Indicator value FinalValue;
datalines;
1 0 5 21
1 1 21 21
2 1 0 0
3 0 4 7
3 1 7 7
3 0 8 7
3 0 2 7
4 1 1 1
4 0 4 1
;
run;
I have a data set with the first 3 columns. How do I get the 4th column based on the indicator? For example, for Index 1 the value is 21 on the row where Indicator = 1, so I put 21 as the FinalValue on all rows for Index 1.
Use the SAS RETAIN statement.
You can do this in a data step by retaining the value where indicator = 1.
Steps:
Sort your data by Index and Indicator
Group by the Index & Retain the Value where Indicator=1
Code:
/*Sort data by Index and descending Indicator & remove the hardcoded FinalValue*/
proc sort data=test (keep= Index Indicator value);
by index descending indicator ;
run;
/*Retain the FinalValue*/
data want;
set test;
by index;
retain FinalValue;
keep Index Indicator value FinalValue;
if indicator = 1 then FinalValue = value;
/*The IF statement below assigns . to groups that have no record with an indicator value of 1*/
if indicator ne 1 and first.index then FinalValue = .;
run;
Output:
Index=1 Indicator=1 value=21 FinalValue=21
Index=1 Indicator=0 value=5 FinalValue=21
Index=2 Indicator=1 value=0 FinalValue=0
Index=3 Indicator=1 value=7 FinalValue=7
Index=3 Indicator=0 value=4 FinalValue=7
Index=3 Indicator=0 value=8 FinalValue=7
Index=3 Indicator=0 value=2 FinalValue=7
Index=4 Indicator=1 value=1 FinalValue=1
Index=4 Indicator=0 value=4 FinalValue=1
Use a left join in proc sql. Select the value where indicator=1 for each index, then left join the result to the original dataset. It seems that the first row of index=3 should be 7, not 0.
proc sql;
select a.index, a.indicator, a.value, b.finalvalue
from test a
left join (select index, value as finalvalue from test where indicator = 1) b
on a.index = b.index;
quit;
This is rather old school but should be adequate. I reckon you call it a self merge or something.
data test;
input Index Indicator value;* FinalValue;
datalines;
1 0 5 21
1 1 21 21
2 1 0 0
3 0 4 7
3 1 7 7
3 0 8 7
3 0 2 7
4 1 1 1
4 0 4 1
;;;;
run;
data final;
if 0 then set test;
merge test(where=(indicator eq 1) rename=(value=FinalValue)) test;
by index;
run;
proc print;
run;
Obs Index Indicator value FinalValue
1 1 0 5 21
2 1 1 21 21
3 2 1 0 0
4 3 0 4 7
5 3 1 7 7
6 3 0 8 7
7 3 0 2 7
8 4 1 1 1
9 4 0 4 1
I wish to get the last number in a sequence of equal numbers. For example, I have the following dataset:
X
1
1
0
0
0
1
1
0
Given that sequence of numbers, I need to extract the last number of each sequence of ones, that is, the count reached just before a 0 appears. This is what I want:
X Seq
1 1
1 2
1 3
0 1
0 2
0 3
1 1
1 2
0 1
1 1
1 2
1 3
0 1
I need to create a new dataset with the numbers in bold (the last Seq value of each run of ones), that is:
Seq1
3
2
3
Thanks for any advice.
One more option - use the NOTSORTED BY group option.
Data want;
Set have;
By x NOTSORTED;
Retain count;
If first.x then count=1;
Else count+1;
If last.x and x = 1 then output; /* keep only the runs of ones */
Keep count;
Run;
You can create group variables using a lag and then keep the last observation of each group you have created:
data temp;
input x;
datalines;
1
1
1
0
0
0
1
1
0
1
1
1
0
;
run;
data temp2;
set temp;
retain flag;
if lag(x) > x then flag = _n_;
if x = 0 then delete;
run;
data temp3 (keep = seq1);
set temp2;
seq1 + 1;
by flag;
if first.flag then seq1 = 1;
if last.flag then save = 1;
if missing(save) then delete;
run;
Use a proc summary with notsorted option here:
data math;
input x;
datalines;
1
1
1
0
0
0
1
1
0
1
0
1
1
1
1
0
1
;
run;
proc summary data=math;
by x notsorted;
class x;
output Out=z;
run;
data z (drop=_type_ x);
set z (rename=(_FREQ_=COUNT));
where _type_=1 and x=1;/*if you are looking for a number other than 1, replace it here*/
run;
proc print data=z noobs;
run;
The result is one COUNT value per run of ones: 3, 2, 1, 4, 1.
Here's a solution using only one data step with a retain statement.
data have;
input x @@;
output;
datalines;
1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 0 1
;
data want(keep = count);
set have end = last;
retain x_previous . count .;
if x = 0 then do;
if x_previous = 1 then do;
output;
count = 0;
end;
end;
else if x = 1 then count + 1;
if last = 1 and count > 0 then output;
x_previous = x;
run;
Results
count
-----
3
2
1
4
1