In a clinical trial, systolic and diastolic blood pressure are measured pre-dose (0 h) and at 1, 2, 4, and 8 hours post-dose.
Twelve subjects were studied. The SAS dataset has the following structure:
Variable: Vol, Length: 8, Label: Subject Number
Variable: Ntime, Length: 8, Label: Nominal time post-dose (hours)
Variable: Sups, Length: 8, Label: Supine Systolic BP (mmHg)
What SAS code could I use to calculate the change from baseline (0 h) at each time point, and then calculate the mean, minimum, and maximum change from baseline for the 12 subjects?
Edit: This is what I've tried so far:
data postbase;
do until (last.vol);
*** Only keep pre-dose values;
set save.vitals (where=(not(ntime <= 0 )));
by Vol Ntime;
if Ntime <= 0 then bl = Sups;
else do;
chgbl = Sups - bl;
output;
end;
end;
run;
data postbase;
set save.vitals;
by subject time volume;
retain baseline;
if time=0 then baseline=volume;
else change = volume - baseline;
run;
I think your code is far too complex, and I couldn't parse your variable names, so I just made them up.
I set the baseline to volume whenever time = 0 and calculate the change at every other time.
RETAIN causes the value to persist until it is reset. If the baseline may be recorded at a time other than 0, or may be missing for a subject, you will need to modify the step; otherwise the previous subject's baseline carries over.
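As a sketch using the variable names from the question (Vol, Ntime, Sups) instead of the made-up ones, the RETAIN step can be followed by PROC MEANS to get the mean, minimum, and maximum change at each post-dose time. The dataset name vitals and its six rows are invented for illustration, and the step assumes each subject's baseline is the record with Ntime = 0.

```sas
/* Hypothetical sample data: two subjects, question's variable names */
data vitals;
  input Vol Ntime Sups;
  datalines;
1 0 120
1 1 118
1 2 115
2 0 130
2 1 133
2 2 128
;
run;

proc sort data=vitals;
  by Vol Ntime;
run;

data postbase;
  set vitals;
  by Vol Ntime;
  retain bl;
  if first.Vol then bl = .;     /* never carry a baseline across subjects */
  if Ntime = 0 then bl = Sups;  /* pre-dose record sets the baseline */
  else do;
    chgbl = Sups - bl;          /* change from baseline */
    output;                     /* keep post-dose records only */
  end;
run;

/* Mean, minimum and maximum change across subjects at each time point */
proc means data=postbase n mean min max maxdec=1;
  class Ntime;
  var chgbl;
run;
```

Resetting bl at first.Vol means a subject without a 0 h record gets a missing change instead of silently inheriting the previous subject's baseline.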
I have an edit check:
"If Period = 1,2,3 or 4 and Study Hour = 1 then the Time should be 1 hour plus or minus 15 minutes post-dose of study drug from the same period". These are to be programmed with a +/- 20-minute window of Study Hour 1.00 (relative to their dosing time). It is the protocol window, so even if the event was not scheduled at exactly 1 hour, we are looking for the deviation from the 1-hour nominal time, not from the time point of the event. Here is the merged data
This is my code. I'm getting a lot of flags here, so what am I doing wrong? For context, there is a prothour variable that is 1, but the actual hour time point is 0.77. Should I adjust the 0.77 somehow to account for this?
data medfst;
set dm.ex;
ptno=strip(compress(clientid,'-'))+0;
if ex_stdat=. or ex_sttim=. then delete;
medday= day;
rename hour=medhour;
proc sort;
by ptno period day medhour;
run;
data medfst;
set medfst;
by ptno period;
if first.period;
ex_datetime1=put(ex_stdat,date9.-r)||' '||put(ex_sttim,time8.-l);
ex_datetime=input(ex_datetime1,datetime20.);
keep scrid clientid ptno period ex_datetime ex_stdat ex_sttim medhour day;
format ex_datetime datetime20.;
proc sort;
by ptno period day medhour;
run;
data vs;
set dm.vs;
ptno=strip(compress(clientid,'-'))+0;
if VS_TEST in ('SYSTOLIC');
if prothour in ('1');
proc sort nodupkey;
by ptno period day hour;
run;
data vs1;
set vs;
vs_datetime1=put(vs_dat,date9.-r)||' '||put(vs_tim,time8.-l);
vs_datetime=input(vs_datetime1,datetime20.);
keep scrid clientid day hour ptno period vs_dat vs_tim vs_datetime vs_com;
format vs_datetime datetime20.;
proc sort;
by ptno period day;
run;
data temp;
merge medfst (in=a) vs1;
by ptno period;
if a;
run;
data final_temp;
set temp;
newhour=hour-medhour;
datediff=vs_dat-ex_stdat;
timediff=vs_tim-ex_sttim;
diff=datediff*24*3600+timediff;
newdiff=round(diff-newhour*(60*60));
format diff time8. newdiff time8. timediff time8.;
run;
data final;
set final_temp;
%inc_subjs;
***** *****;
*********************************************************************************************************;
attrib extra reason length=$5000.;
*********************************************************************************************************;
* Edit check code and footnote *;
***** *****;
if abs(diff) lt '00:45:00't or abs(diff) gt '01:15:00't then do;
reason=trim(reason)||'If Period = 1,2,3 or 4 and Study Hour = 1 then the Time should be 1 hour plus or minus 15 minutes post dose of study drug from the same period#';
extra = trim(extra)||', Hour based on Dose = '||trim(left(medhour))||', Vital Signs hour = '||trim(left(prothour))||', Time deviated = '||trim(put(diff,time8.))||', comment = '||trim(left(vs_com));
end;
You can round to a nearest multiple using the second argument of ROUND function.
ROUND(argument <, rounding-unit>)
Required Argument
argument
is a numeric constant, variable, or expression to be rounded.
Optional Argument
rounding-unit
is a positive, numeric constant, variable, or expression that specifies the rounding unit.
Round a time value to the nearest hour (time is seconds, hour is 3600 seconds)
closest_hour = ROUND(mytime, 3600);
Round hour (number) to nearest hour (time value)
closest_hour = ROUND(myhour*3600, 3600);
and of course, round hour (number) to nearest whole hour (number)
closest_hr = ROUND(myhour); * default rounding unit is 1;
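To tie this back to the 0.77 hour in the question, here is a minimal sketch that rounds the elapsed time to the nearest protocol hour and then measures the deviation from that nominal point. The variable names elapsed, nominal, deviation, and flag are made up, and the 15-minute window is an assumption (the question mentions both 15 and 20 minutes).

```sas
data _null_;
  elapsed   = '00:46:12't;                  /* 0.77 h after dosing = 2772 seconds */
  nominal   = round(elapsed, 3600);         /* nearest whole hour, in seconds */
  deviation = elapsed - nominal;            /* negative means early */
  flag      = (abs(deviation) > '00:15:00't); /* 1 if outside the assumed window */
  put nominal= time8. deviation= time8. flag=;
run;
```

Here nominal is 3600 seconds, the deviation is -828 seconds (13 minutes 48 seconds early), and flag is 0, so this record would fall inside a 15-minute window.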
I'd like to set all values in an array to 1 if some sort of condition is met, and perform a calculation if the condition isn't met. I'm using a do loop at the moment which is very slow.
I was wondering if there was a faster way.
data test2;
set test1;
array blah_{*} blah1-blah100;
array a_{*} a1-a100;
array b_{*} b1-b100;
do i=1 to 100;
blah_{i}=a_{i}/b_{i};
if b1=0 then blah_{i}=1;
end;
run;
I feel like the if statement is inefficient as I am setting the value 1 cell at a time. Is there a better way?
There are already several good answers, but for the sake of completeness, here is an extremely silly and dangerous way of changing all the array values at once without using a loop:
data test2;
set test1;
array blah_{*} blah1-blah100 (100*1);
array a_{*} a1-a100;
array b_{*} b1-b100;
/*Make a character copy of what an array of 100 1s looks like*/
length temp $800; *Allow 8 bytes per numeric variable;
retain temp;
if _n_ = 1 then temp = peekclong(addrlong(blah1), 800);
do i=1 to 100;
blah_{i}=a_{i}/b_{i};
end;
/*Overwrite the array using the stored value from earlier*/
if b1=0 then call pokelong(temp,addrlong(blah1),800);
run;
You have 100*NOBS assignments to do. I don't see how using a DO loop over an ARRAY is any less efficient than any other way.
But there is no need to do the calculation when you know it will not be needed.
do i=1 to 100;
if b1=0 then blah_{i}=1;
else blah_{i}=a_{i}/b_{i};
end;
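Since the condition tests b1 only and does not depend on i, one variation is to test it once per observation and run whichever loop applies. This is only a sketch: the test1 data below is invented so the step can run on its own, and whether it is measurably faster is something you would have to time yourself.

```sas
/* Hypothetical input: three observations, the second has b1 = 0 */
data test1;
  array a_{*} a1-a100;
  array b_{*} b1-b100;
  do obs = 1 to 3;
    do i = 1 to 100;
      a_{i} = i;
      b_{i} = (obs ne 2) * 2;   /* 2 for obs 1 and 3, 0 for obs 2 */
    end;
    output;
  end;
  drop i obs;
run;

data test2;
  set test1;
  array blah_{*} blah1-blah100;
  array a_{*} a1-a100;
  array b_{*} b1-b100;
  /* Evaluate the condition once per observation, not once per element */
  if b1 = 0 then do i = 1 to dim(blah_);
    blah_{i} = 1;
  end;
  else do i = 1 to dim(blah_);
    blah_{i} = a_{i} / b_{i};
  end;
  drop i;
run;
```

The number of assignments is still 100 per observation; only the 100 repeated comparisons are saved.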
This example uses a data set to "set" all values of an array without looping over the array with DO. Note that reading the array variables with SET means they are no longer reset to missing at the start of each iteration (they are implicitly retained). I cannot comment on performance; you will need to do your own testing.
data one;
array blah[10];
retain blah 1;
run;
proc print;
run;
data test1;
do b1=0,1,0;
output;
end;
run;
data test2;
set test1;
array blah[10];
array a[10];
array b[10];
if b1 eq 0 then set one nobs=nobs point=nobs;
else do i = 1 to dim(blah);
blah[i] = i;
end;
run;
proc print;
run;
This is not a response to the original question, but to the discussion about the efficiency of using loops vs. SET to set the values of multiple variables.
Here is a simple experiment that I ran:
%let size = 100; /* Controls size of dataset */
%let iter = 1; /* Just to emulate different number of records in the base dataset */
data static;
array aa{&size} aa1 - aa&size (&size * 1);
run;
data inp;
do ii = 1 to &iter;
x = ranuni(234234);
output;
end;
run;
data eg1;
set inp;
array aa{&size} aa1 - aa&size;
set static nobs=nobs point=nobs;
run;
data eg2;
set inp;
array aa{&size} aa1 - aa&size;
do ii = 1 to &size;
aa(ii) = 1;
end;
run;
What I see when I run this with various values of &iter and &size is as follows:
With &iter fixed at 1, the assignment method is faster than SET as &size increases.
However, for a given &size, as &iter increases (i.e. the number of times the SET statement or the loop executes), the SET approach becomes relatively faster and the assignment method relatively slower, until at some point they cross. I think this is because the transfer from physical disk to buffer happens just once (since static is a relatively small dataset), whereas the cost of the assignment loop is fixed for each record.
For this use case, where the fixed dataset used to set the values is small, I admit that SET will be faster, especially when the logic has to execute on many input records and relatively few variables need to be assigned. This will not be the case, however, if the dataset cannot be cached in memory between two records; then the additional overhead of reading it into the buffer can slow it down.
I think this test isolates the statements of interest.
SUMMARY:
SET+create init array 0.40 sec. + 0.03 sec,
DO OVER array 11.64 sec.
NOTE: Additional host information:
X64_SRV12 WIN 6.2.9200 Server
NOTE: SAS initialization used:
real time 4.70 seconds
cpu time 0.07 seconds
1 options fullstimer=1;
2 %let d=1e4; /*array size*/
3 %let s=1e5; /*reps (obs)*/
4 data one;
5 array blah[%sysevalf(&d,integer)];
6 retain blah 1;
7 run;
NOTE: The data set WORK.ONE has 1 observations and 10000 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
user cpu time 0.03 seconds
system cpu time 0.00 seconds
memory 7788.90k
OS Memory 15232.00k
Timestamp 08/17/2019 06:57:48 AM
Step Count 1 Switch Count 0
8
9 sasfile one open;
NOTE: The file WORK.ONE.DATA has been opened by the SASFILE statement.
10 data _null_;
11 array blah[%sysevalf(&d,integer)];
12 do _n_ = 1 to &s;
13 set one nobs=nobs point=nobs;
14 end;
15 stop;
16 run;
NOTE: DATA statement used (Total process time):
real time 0.40 seconds
user cpu time 0.40 seconds
system cpu time 0.00 seconds
memory 7615.31k
OS Memory 16980.00k
Timestamp 08/17/2019 06:57:48 AM
Step Count 2 Switch Count 0
2 The SAS System 06:57 Saturday, August 17, 2019
17 sasfile one close;
NOTE: The file WORK.ONE.DATA has been closed by the SASFILE statement.
18
19 data _null_;
20 array blah[%sysevalf(&d,integer)];
21 do _n_ = 1 to &s;
22 do i=1 to dim(blah); blah[i]=1; end;
23 end;
24 stop;
25 run;
NOTE: DATA statement used (Total process time):
real time 11.64 seconds
user cpu time 11.64 seconds
system cpu time 0.00 seconds
memory 3540.65k
OS Memory 11084.00k
Timestamp 08/17/2019 06:58:00 AM
Step Count 3 Switch Count 0
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 16.78 seconds
user cpu time 12.10 seconds
system cpu time 0.04 seconds
memory 15840.62k
OS Memory 16980.00k
Timestamp 08/17/2019 06:58:00 AM
Step Count 3 Switch Count 16
Some more interesting test results, based on data _null_'s original test. I also added the following test:
%macro loop;
data _null_;
array blah[%sysevalf(&d,integer)] blah1 - blah&d;
do _n_ = 1 to &s;
%do i = 1 %to &d;
blah&i = 1;
%end;
end;
stop;
run;
%mend;
%loop;
d     s     SET method (real/cpu)   %loop (real/cpu)   Array-based (real/cpu)
100   1e5   0.03/0.01               0.00/0.00          0.07/0.07
100   1e8   11.16/9.51              4.78/4.78          1:22.38/1:21.81
500   1e5   0.03/0.04               0.02/0.01          Did not measure
500   1e8   16.53/15.18             32.17/31.62        Did not measure
1000  1e5   0.03/0.03               0.04/0.03          0.74/0.70
1000  1e8   20.24/18.65             42.58/42.46        Did not measure
So with array-based assignments, the assignment itself is not the big culprit. Since arrays use a memory map to map the original memory locations, it appears that the memory-location lookup for a given subscript is what really impacts performance. A direct assignment avoids this and significantly improves performance.
So if your array size is in the low hundreds, direct assignment may not be a bad way to go. SET becomes effective when the array size goes beyond a few hundred.
I have some extreme outliers throwing my regression model off, and I removed them using If-Then-Else statements. However, SAS eliminated those data points completely and found new outliers in the ones remaining. Is there a way to remove the outliers from analysis without it throwing more into the mix?
I calculated Q3 + 1.5 * IQR and used that value like so:
Data lungcancer; input trt surv age sex @@;
/* create a new variable diff */
diff = surv - 365;
/* create a new categorical variable resp */
If diff > 0 then resp= 1;
If diff <= 0 then resp= 0;
/* create a new categorical variable sev */
if 2276 > surv >= 1621 then sev=0;
Else If 456 <= surv <= 1620 then sev=1;
Else if 181 <= surv <= 455 then sev=2;
Else if 1 <= surv <= 180 then sev=3;
Else if surv > 2276 then delete; /* Remove outliers */
So, you removed some data points that were on the edge of your data, and then got a new set of data, and recalculated IQR, and ... are surprised that there are new "outliers"?
This isn't SAS doing anything particular; it's doing what it's asked, identifying points that fall more than 1.5*IQR beyond the quartiles. Outlier removal is always up to you (when you're doing things this way, anyway, and not using one of the more advanced procs, I suppose): you decide what's an outlier and remove it or not, depending on your data. So, do you think these new data points are outliers? Remove them or not depending on that.
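If you do decide that only the original outliers should go, one option is to compute the fence once from the full data, store a flag, and filter on that flag so the cutoff is never recalculated on the reduced data. The sketch below assumes that approach suits your analysis; the survival times and the names fences, flagged, upper, and outlier are all invented.

```sas
/* Hypothetical survival times; only surv matters for this sketch */
data lungcancer;
  input surv @@;
  datalines;
12 95 180 240 365 410 455 600 900 1620 2276 4000
;
run;

/* Quartiles computed once, on the complete data */
proc means data=lungcancer noprint;
  var surv;
  output out=fences q1=q1 q3=q3;
run;

data flagged;
  if _n_ = 1 then set fences(keep=q1 q3);  /* q1 and q3 stay on every row */
  set lungcancer;
  upper   = q3 + 1.5 * (q3 - q1);          /* fixed upper fence */
  outlier = (surv > upper);                /* flag instead of deleting */
  drop q1 q3;
run;

/* Analyse the non-outliers; the stored flag is what defines them */
proc means data=flagged n mean median;
  where outlier = 0;
  var surv;
run;
```

Procedures that draw their own box plots or diagnostics on the filtered data will still apply their own rules to what remains, so new points may be marked there even though your flag has not changed.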
I would like to assign IDs with blank Sizes a size based on the frequency distribution of their Group.
Dataset A contains a snapshot of my data:
ID Group Size
1 A Large
2 B Small
3 C Small
5 D Medium
6 C Large
7 B Medium
8 B -
Dataset B shows the frequency distribution of the Sizes among the Groups:
Group Small Medium Large
A 0.31 0.25 0.44
B 0.43 0.22 0.35
C 0.10 0.13 0.78
D 0.29 0.27 0.44
For ID 8, we know that it has a 43% probability of being "small", a 22% probability of being "medium" and a 35% probability of being "large". That's because these are the Size distributions for Group B.
How do I assign ID 8 (and other blank IDs) a Size based on the Group distributions in Dataset B? I'm using SAS 9.4. Macros, SQL, anything is welcome!
The table distribution is ideal for this. The last data step here shows that; before it, I set things up to create random data and build the frequency table, so you can skip that part if you already have both.
See Rick Wicklin's blog about simulating multinomial data for an example of this in other use cases (and more information about the function).
*Setting this up to help generate random data;
proc format;
value sizef
low - 1.3 = 'Small'
1.3 <-<2.3 = 'Medium'
2.3 - high = 'Large'
;
quit;
*Generating random data;
data have;
call streaminit(7);
do id = 1 to 1e5;
group = byte(65+rand('Uniform')*4); *A = 65, B = 66, etc.;
size = put((rank(group)-66)*0.5 + rand('Uniform')*3,sizef.); *Intentionally making size somewhat linked to group to allow for differences in the frequency;
if rand('Uniform') < 0.05 then call missing(size); *A separate call to set missingness;
output;
end;
run;
proc sort data=have;
by group;
run;
title "Initial frequency of size by group";
proc freq data=have;
by group;
tables size/list out=freq_size;
run;
title;
*Transpose to one row per group, needed for table distribution;
proc transpose data=freq_size out=table_size prefix=pct_;
var percent;
id size;
by group;
run;
data want;
merge have table_size;
by group;
array pcts pct_:; *convenience array;
if first.group then do _i = 1 to dim(pcts); *must divide by 100 but only once!;
pcts[_i] = pcts[_i]/100;
end;
if missing(size) then do;
size_new = rand('table',of pcts[*]); *table uses the pcts[] array to tell SAS the table of probabilities;
size = scan(vname(pcts[size_new]),2,'_');
end;
run;
title "Final frequency of size by group";
proc freq data=want;
by group;
tables size/list;
run;
title;
You can also do this with a random value and some if-else logic:
proc sql;
create table temp_assigned as select
a.*, rand("Uniform") as random_roll, /*generate a random number from 0 to 1*/
case when missing(size) then
case when calculated random_roll < small then small
when calculated random_roll < sum(small, medium) then medium
when calculated random_roll < sum(small, medium, large) then large
end end as value_selected, /*pick the value of the size associated with that value in each group*/
coalesce(case when calculated value_selected = small then "Small"
when calculated value_selected = medium then "Medium"
when calculated value_selected = large then "Large" end, size) as group_assigned /*pick the value associated with that size*/
from temp as a
left join freqs as b
on a.group = b.group;
quit;
Obviously you can do this without creating the value_selected variable, but I thought showing it for demonstrative purposes would be helpful.
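For reference, here is a compact sketch that applies the table distribution directly to Dataset A and Dataset B as shown in the question. The dataset names a, b, and want, the variable pick, and the seed are made up; the MIN guard is there because RAND('TABLE') can return one index past the last when the supplied probabilities sum to slightly less than 1.

```sas
data a;
  input ID Group $ Size $;
  if Size = '-' then call missing(Size);  /* '-' marks a blank Size */
  datalines;
1 A Large
2 B Small
3 C Small
5 D Medium
6 C Large
7 B Medium
8 B -
;
run;

data b;
  input Group $ Small Medium Large;
  datalines;
A 0.31 0.25 0.44
B 0.43 0.22 0.35
C 0.10 0.13 0.78
D 0.29 0.27 0.44
;
run;

proc sort data=a;
  by Group;
run;

data want;
  if _n_ = 1 then call streaminit(2024);  /* arbitrary seed, for repeatability */
  merge a(in=ina) b;
  by Group;
  if ina;
  if missing(Size) then do;
    pick = min(rand('TABLE', Small, Medium, Large), 3); /* index 1, 2 or 3 */
    Size = choosec(pick, 'Small', 'Medium', 'Large');
  end;
  drop Small Medium Large pick;
run;
```

Only ID 8 is blank here, so it is the only row that receives a randomly drawn Size, using the Group B probabilities.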
I have a SAS issue that I know is probably fairly straightforward for SAS users who are familiar with array programming, but I am new to this aspect.
My dataset looks like this:
Data have;
Input group $ size price;
Datalines;
A 24 5
A 28 10
A 30 14
A 32 16
B 26 10
B 28 12
B 32 13
C 10 100
C 11 130
C 12 140
;
Run;
What I want to do is determine the rate at which price changes for the first two items in the family and apply that rate to every other member in the family.
So, I’ll end up with something that looks like this (for A only…):
Data want;
Input group $ size price newprice;
Datalines;
A 24 5 5
A 28 10 10
A 30 14 12.5
A 32 16 15
;
Run;
The technique you'll need to learn is either retain or diff/lag. Both methods would work here.
The following illustrates one way to solve this, but would need additional work by you to deal with things like size not changing (meaning a 0 denominator) and other potential exceptions.
Basically, we use retain to cause a value to persist across records, and use that in the calculations.
data want;
set have;
by group;
retain lastprice rateprice lastsize;
if first.group then do;
counter=0;
call missing(of lastprice rateprice lastsize); *clear these out;
end;
counter+1; *Increment the counter;
if counter=2 then do;
rateprice=(price-lastprice)/(size-lastsize); *Calculate the rate over 2;
end;
if counter le 2 then newprice=price; *For the first two just move price into newprice;
else if counter>2 then newprice=lastprice+(size-lastsize)*rateprice; *Else set it to the change;
output;
lastprice=newprice; *save the price and size in the retained vars;
lastsize=size;
run;
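For comparison, the same calculation can be sketched with DIF instead of retained "last" values. This is a variation, not part of the answer above: the dataset name want_dif and the variables dprice and dsize are made up, and like the RETAIN version it does not guard against a zero change in size.

```sas
data have;
  input group $ size price;
  datalines;
A 24 5
A 28 10
A 30 14
A 32 16
B 26 10
B 28 12
B 32 13
C 10 100
C 11 130
C 12 140
;
run;

data want_dif;
  set have;
  by group;
  retain rateprice newprice;
  dprice = dif(price);   /* DIF must run on every row to keep its queue in step */
  dsize  = dif(size);
  if first.group then do;
    counter = 0;
    call missing(rateprice);
  end;
  counter + 1;
  if counter = 2 then rateprice = dprice / dsize;  /* rate from the first two rows */
  if counter le 2 then newprice = price;
  else newprice = newprice + dsize * rateprice;    /* extend the previous newprice */
  drop dprice dsize counter;
run;
```

For group A the rate is (10 - 5) / (28 - 24) = 1.25, which gives newprice values of 12.5 and 15 for sizes 30 and 32, matching the desired output in the question.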
Here is a different approach that is obviously longer than Joe's, but could be generalized to other similar situations where the calculation is different or depends on more values.
Add a sequence number to your data set:
data have2;
set have;
by group;
if first.group then seq = 0;
seq + 1;
run;
Use proc reg to calculate the intercept and slope for the first two rows of each group, outputting the estimates with outest:
proc reg data=have2 outest=est;
by group;
model price = size;
where seq le 2;
run;
Join the original table to the parameter estimates and calculate the predicted values:
proc sql;
create table want as
select
h.*,
e.intercept + h.size * e.size as newprice
from
have h
left join est e
on h.group = e.group
order by
group,
size
;
quit;