I have a dataset with 4 observations (rows) per person.
I want to create three new variables that calculate the difference between the second and first, third and second, and fourth and third rows.
I think retain can do this, but I'm not sure how.
Or do I need an array?
Thanks!
data test;
input person var;
datalines;
1 5
1 10
1 12
1 20
2 1
2 3
2 5
2 90
;
run;
data test;
set test;
by person notsorted;
retain pos;
array diffs{*} diff0-diff3;
retain diff0-diff3;
if first.person then do;
pos = 0;
end;
pos + 1;
diffs{pos} = dif(var);
if last.person then output;
drop var diff0 pos;
run;
Why not use The Lag function.
data test; input person var;
cards;
1 5
1 10
1 12
1 20
2 1
2 3
2 5
2 90
run;
data test; set test;
by person;
LagVar=Lag(Var);
difference=var-Lagvar;
if first.person then difference=.;
run;
An alternative approach without arrays.
/*-- Data from simonn's answer --*/
data SO1019005;
input person var;
datalines;
1 5
1 10
1 12
1 20
2 1
2 3
2 5
2 90
;
run;
/*-- Why not just do a transpose? --*/
proc transpose data=SO1019005 out=NewData;
by person;
run;
/*-- Now calculate your new vars --*/
data NewDataWithVars;
set NewData;
NewVar1 = Col2 - Col1;
NewVar2 = Col3 - Col2;
Newvar3 = Col4 - Col3;
run;
Why not use the dif() function instead?
/* test data */
data one;
do id = 1 to 2;
do v = 1 to 4 by 1;
output;
end;
end;
run;
/* check */
proc print data=one;
run;
/* on lst
Obs id v
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
*/
/* now create diff within id */
data two;
set one;
by id notsorted; /* assuming already in order */
dif = ifn(first.id, ., dif(v));
run;
proc print data=two;
run;
/* on lst
Obs id v dif
1 1 1 .
2 1 2 1
3 1 3 1
4 1 4 1
5 2 1 .
6 2 2 1
7 2 3 1
8 2 4 1
*/
data output_data;
retain count previous_value diff1 diff2 diff3;
set data input_data
by person;
if first.person then do;
count = 0;
end;
else do;
count = count + 1;
if count = 1 then diff1 = abs(value - previous_value);
if count = 2 then diff2 = abs(value - previous_value);
if count = 3 then do;
diff3 = abs(value - previous_value);
output output_data;
end;
end;
previous_value = value;
run;
Related
I have a dataset that looks like:
Hour Flag
1 1
2 1
3 .
4 1
5 1
6 .
7 1
8 1
9 1
10 .
11 1
12 1
13 1
14 1
I want to have an output dataset like:
Total_Hours Count
2 2
3 1
4 1
As you can see, I want to count the number of hours included in each period with consecutive "1s". A missing value ends the consecutive sequence.
How should I go about doing this? Thanks!
You'll need to do this in two steps. First step is making sure the data is sorted properly and determining the number of hours in a consecutive period:
PROC SORT DATA = <your dataset>;
BY hour;
RUN;
DATA work.consecutive_hours;
SET <your dataset> END = lastrec;
RETAIN
total_hours 0
;
IF flag = 1 THEN total_hours = total_hours + 1;
ELSE
DO;
IF total_hours > 0 THEN output;
total_hours = 0;
END;
/* Need to output last record */
IF lastrec AND total_hours > 0 THEN output;
KEEP
total_hours
;
RUN;
Now a simple SQL statement:
PROC SQL;
CREATE TABLE work.hour_summary AS
SELECT
total_hours
,COUNT(*) AS count
FROM
work.consecutive_hours
GROUP BY
total_hours
;
QUIT;
You will have to do two things:
compute the run lengths
compute the frequency of the run lengths
For the case of using the implict loop
Each run length occurnece can be computed and maintained in a retained tracking variable, testing for a missing value or end of data for output and a non missing value for run length reset or increment.
Proc FREQ
An alternative is to use an explicit loop and a hash for frequency counts.
Example:
data have; input
Hour Flag; datalines;
1 1
2 1
3 .
4 1
5 1
6 .
7 1
8 1
9 1
10 .
11 1
12 1
13 1
14 1
;
data _null_;
declare hash counts(ordered:'a');
counts.defineKey('length');
counts.defineData('length', 'count');
counts.defineDone();
do until (end);
set have end=end;
if not missing(flag) then
length + 1;
if missing(flag) or end then do;
if length > 0 then do;
if counts.find() eq 0
then count+1;
else count=1;
counts.replace();
length = 0;
end;
end;
end;
counts.output(dataset:'want');
run;
An alternative
data _null_;
if _N_ = 1 then do;
dcl hash h(ordered : "a");
h.definekey("Total_Hours");
h.definedata("Total_Hours", "Count");
h.definedone();
end;
do Total_Hours = 1 by 1 until (last.Flag);
set have end=lr;
by Flag notsorted;
end;
Count = 1;
if Flag then do;
if h.find() = 0 then Count+1;
h.replace();
end;
if lr then h.output(dataset : "want");
run;
Several weeks ago, #Richard taught me how to use DOW-loop and direct addressing array. Today, I give it to you.
data want(keep=Total_Hours Count);
array bin[99]_temporary_;
do until(eof1);
set have end=eof1;
if Flag then count + 1;
if ^Flag or eof1 then do;
bin[count] + 1;
count = .;
end;
end;
do i = 1 to dim(bin);
Total_Hours = i;
Count = bin[i];
if Count then output;
end;
run;
And Thanks Richard again, he also suggested me this article.
In a summarized dataset, I have the status of an event at each hour after baseline in which it was recorded. I also have the last hour the event could have been recorded. I want to create a new dataset with one record for each hour from the first through the last hour, with the status for each record being the one from the last recorded status.
Here is an example dataset:
data new;
input hour status last_hour;
cards;
2 1 12
4 1 12
5 1 12
6 1 12
7 0 12
9 1 12
10 0 12
;
run;
In this case, the first recorded hour was the second, and the last recorded hour was the 10th. The last possible hour to record data was the 12th.
The final dataset should look like so:
0 . 12
1 . 12
2 1 12
3 1 12
4 1 12
5 1 12
6 1 12
7 0 12
8 0 12
9 1 12
10 0 12
11 0 12
12 0 12
I sort of have it working with this series of data steps, but I'm not sure if there's a cleaner way I'm not seeing.
data step1;
set new (keep=id hour);
by id;
do hour = 0 to last_hour;
output;
end;
run;
proc sort data=step1;
by id hour;
run;
proc sql;
create table step2 as
select distinct a.id, a.hour, b.status
from step1 as a
left join new as b
on a.id = b.id
and a.hour = b.hour
order by a.id, a.hour;
quit;
data step3;
set step2;
by id hour;
retain previous_status;
if first.id then do;
previous_status = .;
if status > . then previous_status = status;
end;
if not first.id then do;
if status = . and previous_status > . then status = previous_status;
if status > . then previous_status = status;
end;
run;
Seeing your code, it seems you left out of your question the fact that you also have id's. So this is a newer solution that deals with different id's. See further below for my first solution ignoring id's.
Since last_hour is always 12, I left it out of the have dataset. It will be added later on.
data have;
input id hour status;
cards;
1 2 1
1 4 1
1 5 1
1 6 1
1 7 0
1 9 1
1 10 0
2 2 1
2 4 1
2 5 1
2 6 1
2 7 0
2 9 1
2 10 0
;
Create a hours dataset, just containing numbers 0 thru 12;
data hours;
do i = 0 to 12;
hour = i;
output;
end;
drop i;
run;
Create a temporary dataset that will have the right number of rows (13 rows for every id, with valid hour values where they exist in the have table).
proc sql;
create table tmp as
select distinct t1.id, t2.hour, 12 as last_hour
from have as t1
cross join
(select hour from hours) as t2;
quit;
Then use merge and retain to fill in the missing hour column where appropriate.
data want;
merge have
tmp;
by id hour;
retain status_previous;
if not first.id then do;
if status ne . then status_previous = status;
else if status_previous ne . then status = status_previous;
end;
if last.id then status_previous = .;
drop status_previous;
run;
Previous solution (no id's)
If last_hour is always 12, then this should do it:
data have;
input hour status last_hour;
datalines;
2 1 12
4 1 12
5 1 12
6 1 12
7 0 12
9 1 12
10 0 12
;
data hours;
do i = 0 to 12;
hour = i;
last_hour = 12;
output;
end;
drop i;
run;
data want;
merge have
hours;
by hour;
retain status_previous;
if status ne . then status_previous = status;
else if status_previous ne . then status = status_previous;
drop status_previous;
run;
I have a time series SAS dataset and I want to transfer it to vertical dataset.
My data looks like..
ID A2009 A2010 A2011 A2012
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 1 2 3 4
5 1 2 3 4
data multcol;
infile datalines;
input ID A2009 A2010 A2011 A2012 A2013;
return;
datalines;
1 1 2 3 4 5
2 1 2 3 4 5
3 1 2 3 4 5
4 1 2 3 4 5
5 1 2 3 4 5
;
run;
proc print data=multcol noobs;
run;
I search the web only find someone's solution as following.Not worked.
But my dataset is too large, this method shut down my computer.
data cmbcol(keep=a orig_varname orig_obsnum);
set multcol;
array myvars _numeric_;
do i = 2 to dim(myvars);
orig_varname = vname(myvars(i));
orig_obsnum = _n_;
A = myvars(i);
output;
end;
run;
proc print data=cmbcol ;
title 'cmbcol';
run;
proc sort data=cmbcol;
by orig_varname a;
run;
proc print data=cmbcol noobs;
title 'cmbcol';
run;
And I want them to become like this.
ID t t+1
1 1 2
2 1 2
3 1 2
4 1 2
5 1 2
1 2 3
2 2 3
3 2 3
4 2 3
5 2 3
1 3 4
2 3 4
3 3 4
4 3 4
5 3 4
How can we do that?
Thanks in advance.
That is an unusual data structure for sure, but you could achieve this using the following macro (adjust to your needs).
options validvarname = any;
%macro transp;
%let i = 2009;
%do %while (&i <= 2011);
%let j = %eval(&i + 1);
data part_&i(rename = (A&i = t A&j = 't+1'n));
set multcol(keep = ID A&i A&j);
run;
%let i = %eval(&i + 1);
%end;
data combined;
set part_:;
run;
proc datasets nolist nodetails;
delete part_:;
quit;
%mend transp;
%transp
I checked out this previous post (LINK) for potential solution, but still not working. I want to sum across rows using the ID as the common identifier. The num variable is constant. The id and comp the two variables I want to use to creat a pct variable, which = sum of [comp = 1] / num
Have:
id Comp Num
1 1 2
2 0 3
3 1 1
2 1 3
1 1 2
2 1 3
Want:
id tot pct
1 2 100
2 3 0.666666667
3 1 100
Currently have:
proc sort data=have;
by id;
run;
data want;
retain tot 0;
set have;
by id;
if first.id then do;
tot = 0;
end;
if comp in (1) then tot + 1;
else tot + 0;
if last.id;
pct = tot / num;
keep id tot pct;
output;
run;
I use SQL for things like this. You can do it in a Data Step, but the SQL is more compact.
data have;
input id Comp Num;
datalines;
1 1 2
2 0 3
3 1 1
2 1 3
1 1 2
2 1 3
;
run;
proc sql noprint;
create table want as
select id,
sum(comp) as tot,
sum(comp)/count(id) as pct
from have
group by id;
quit;
Hi there is a much more elegant solution to your problem :)
proc sort data = have;
by id;
run;
data want;
do _n_ = 1 by 1 until (last.id);
set have ;
by id ;
tot = sum (tot, comp) ;
end ;
pct = tot / num ;
run;
I hope it is clear. I use sql too because I am new and the DOW loop is rather complicated but in your case its pretty straightforward.
I'm working on a project in SAS and I wanted to create a dummy variable that accounted for ``preferences in medicine''. I have a long data-set, by time period, of individuals taking either medicine type 1 or type 2. For my research, I want to create a variable to represent if individuals who take type 1 medicine, then switched to type 2, but went back to type 1. I am unconcerned with the time interval that the individual was on the medication for, just that they followed this pattern.
id month type
1 1 2
1 2 2
1 3 2
2 1 1
2 2 2
2 3 1
...
I have more months, but just wanted to provide something to elucidate what I'm trying to get. Basically, I want to tally those subjects who are like subject 2.
well, nothing fancy, but it works for me:
DATA LONG1;
input id month type;
cards;
1 1 2
1 2 2
1 3 2
1 4 2
1 5 2
1 6 2
1 7 2
1 8 2
1 9 2
1 10 2
2 1 1
2 2 1
2 3 1
2 4 1
2 5 1
2 6 1
2 7 1
2 8 1
2 9 1
2 10 1
3 1 1
3 2 1
3 3 1
3 4 2
3 5 1
3 6 1
3 7 1
3 8 1
3 9 1
3 10 1
;
Proc Print; run;
* 1) make a wide dataset by deconstructing the initial long data by month & rejoining by id
2) then use if/then statements to create your dummy variable,
3) then merge the dummy variable back into your long dataset using ID;
DATA month1; set long1; where month=1; rename month=month_1 type=type_1; Proc Sort; by ID; run;
DATA month2; set long1; where month=2; rename month=month_2 type=type_2; Proc Sort; by ID; run;
DATA month3; set long1; where month=3; rename month=month_3 type=type_3; Proc Sort; by ID; run;
DATA month4; set long1; where month=4; rename month=month_4 type=type_4; Proc Sort; by ID; run;
DATA month5; set long1; where month=5; rename month=month_5 type=type_5; Proc Sort; by ID; run;
DATA month6; set long1; where month=6; rename month=month_6 type=type_6; Proc Sort; by ID; run;
DATA month7; set long1; where month=7; rename month=month_7 type=type_7; Proc Sort; by ID; run;
DATA month8; set long1; where month=8; rename month=month_8 type=type_8; Proc Sort; by ID; run;
DATA month9; set long1; where month=9; rename month=month_9 type=type_9; Proc Sort; by ID; run;
DATA month10; set long1; where month=10; rename month=month_10 type=type_10; Proc Sort; by ID; run;
DATA WIDE;
merge month1 month2 month3 month4 month5 month6 month7 month8 month9 month10; by ID;
if (type_1=1 and type_2=1 and type_3=1 and type_4=1 and type_5=1
and type_6=1 and type_7=1 and type_8=1 and type_9=1 and type_10=1) or
(type_1=2 and type_2=2 and type_3=2 and type_4=2 and type_5=2
and type_6=2 and type_7=2 and type_8=2 and type_9=2 and type_10=2)
then switch='no '; else switch='yes '; keep ID switch; run;
DATA LONG2;
merge wide long1; by ID;
Proc Print; run;
btw: also go to the SAS listserv, they love stuff like this:
http://www.listserv.uga.edu/archives/sas-l.html
This worked on the limited data I used:
DATA Have;
input id month type;
datalines;
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
2 1 1
2 2 2
2 3 1
2 4 1
2 5 1
3 1 1
3 2 1
3 3 2
3 4 2
3 5 1
4 1 2
4 2 2
4 3 2
4 4 2
4 5 2
;
Data Temp(keep=id dummy);
length dummy $15;
retain Start Type2 dummy;
set Have;
by id;
if first.id then Do;
Start=0;
Type2=0;
Dummy="";
end;
If Type=1 then do;
If Start=0 then Start=1;
else if Start=1 and Type2=1 then Dummy="Switch-er-Roo";
end;
else do;
if Start=1 then Type2=1;
end;
if last.id then output;
run;
Data Want;
merge temp(in=a) have(in=b);
by id;
run;
I prefer #CarolinaJay65 approach, it's a lot cleaner and just involves one pass of the data. If all you are interested in are the patients who start and finish on Type1, but use Type2 at some point, then the code can be simplified slightly. The following code (using #CarolinaJay65 source data) will only output the patient_id's matching this criteria.
data switch_id (keep=id);
set have;
by id month;
retain switch;
if first.id then do;
call missing(switch);
if type=1 then switch=0;
end;
else if not missing(switch) and type=2 then switch=1;
if last.id and type=1 and switch=1 then output;
run;
If you just wanted the number of patients who match the criteria then you could tweak this code further.
data switch (keep=count);
set have end=final;
by id month;
retain switch count 0;
if first.id then do;
call missing(switch);
if type=1 then switch=0;
end;
else if not missing(switch) and type=2 then switch=1;
if last.id and type=1 and switch=1 then count+1;
if final then output;
run;
I think the following should work:
DATA Have;
input id month type;
if _n_ ^= 1 and id ^= lag(id) then diftype = .;
else diftype = dif(type);
datalines;
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
2 1 1
2 2 2
2 3 1
2 4 1
2 5 1
3 1 1
3 2 1
3 3 2
3 4 2
3 5 1
4 1 2
4 2 2
4 3 2
4 4 2
4 5 2
;
proc sql;
select case when max(diftype) = 1 and min(diftype) = -1 then 1 else 0 end as flag, * from have
group by id
;
quit;