I have a large dataset, with hundreds of variables and hundreds of observations, coming from a clinical trial. Variable V1 is a yes/no variable, indicating some condition. V2 is numeric, and representing a dose. T is a time variable. The dataset is "long" shaped, every subject has few observations, one for each time point. For every subject, I want to create a new yes/no variable (can be in a new dataset), which is yes if: V1 is "yes" in at least one time point, OR, V2 is above 0 in at least one time point. How do I do that? Thank you.
Try the following:
data ds;
set ds;
if V1="yes" or V2>0 then do;
flag="yes;
end;
else do;
flag= "no";
end;
summarize the dataset to ID level:
proc sql;
create table summary as
select ID, count(flag) as flag_cnt
from ds
where flag="yes"
group by ID;
quit;
These are the IDs which satisfy the condition
You can submit the code on the example below to verify.
Here (V1="yes" or V2>0) gives a dummy variable for eauch row. When we sum it we have the number of rows satisfying the condition you mentioned for each ID.
To have a flag, we compare the sum to 0 and put it between () to create a 0/1 variable that you want to have.
hope it helps !
MK
data have;
input ID V1 $ V2;
cards;
1 yes 0
1 no 0
1 no 0
2 no 0
2 no 0
2 no 0
3 no 1
3 no 0
4 yes 0
4 yes 0
5 yes 1
5 no 1
5 yes 0
;
run;
proc sql;
select ID
, (sum((V1="yes")or(V2>0))>0) as new_flag
from have
group by ID;
quit;
data want (keep=id flag);
flag = 'no ';
do until (last.id);
set have;
by id;
if v1 = 'yes' or v2 > 0 then flag = 'yes';
end;
output;
run;
Related
I have a dataset that contains an ID and some additional data. I want to perform transformations based on the ID with a by statement. The transformation works. Unfortunately SAS automatically reduces the dataset to one row per group. Does anybody know how to keep the original (number of) rows and still perform the group actions?
Here is some sample code to illustrate my problem
data dat;
input ID X $;
datalines;
1 a
1 b
1 c
1 d
2 a
2 b
3 a
4 k
5 z
5 a
5 c
;
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
end;
drop x;
run;
Just add an OUTPUT statement inside the DO loop.
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
output;
end;
drop x;
run;
When you do not have an explicit OUTPUT statement in a data step then an implied OUTPUT statement executes at the end of the data step. Your DO loop around the SET statement means that the end of the data step is only reached for the last observation per group.
If you want the final calculated value to be replicated on each observation then just add another loop to re-read the observations and put the OUTPUT statement in that loop.
data dat_new;
length x_new $2100.;
do until(last.ID);
set dat;
by ID notsorted;
x_new = ',' ||catx(',',x,x_new);
end;
do until(last.ID);
set dat;
by ID notsorted;
output;
end;
drop x;
run;
When you want to associate a group level computation result to EACH row in the group you will need to first iterate over the group to compute the result, and then have a second loop that reads the same rows of the group and outputs each. Use additional variables if you need to know the sequence number within the group and the total number of rows in the group.
data want(keep=id x_csv_list by_group_size seq);
length x_csv_list $2100.;
do by_group_size = 1 by 1 until(last.ID);
set dat;
by ID notsorted;
x_csv_list = catx(',',x_csv_list,x);
end;
do seq = 1 to by_group_size;
set dat;
output;
end;
run;
Also, if you are at the 'never really get it' stage, remember NOTSORTED means contiguous rows with the same by group variable values.
by s
s group first.s last.s
- ----- ------- ------
A 1st 1 0
A 1st 0 0 /* trick knowledge both 0 means row is interior */
A 1st 0 1
B 2nd 1 1 /* trick knowledge both 1 means group size is 1 row */
A 3rd 1 0
A 3rd 0 1
B 4th 1 0
B 4th 0 0
B 4th 0 1
C 5th 1 0
C 5th 0 1
I want to recode the max value of a variable as 1 and 0 when it is not. For each variable, there may be multiple observations with the max value. The max value for each value is not fixed, i.e. from cycle to cycle the max value for each variable may change. And there are hundreds of variables, cannot "hard-code" anything.
The final product would have the same dimensions as the original table, i.e. equal number of rows and columns as a matrix of 0s and 1s.
This is within SAS. I attempted to calculate the max of each variable and then append these max as a new observation into the data. Then comparing down the column of each variable against the "max" observation... looking into examples of the following did not help:
SQL
Array in datastep
proc transpose
formatting
Any insight would be much appreciated.
Here is a version done with SQL:
The idea is that we first calculate the maximum. The Latter select. Then we join the data to original and the outer the case-select specifies if the flag is set up or not.
data begin;
input var value;
cards;
1 1
1 2
1 3
1 2.5
1 1.7
1 3
2 34
2 33
2 33
2 33.7
2 34
2 34
; run;
proc sql;
create table result as
select a.var, a.value, case when a.value = b.maximum then 1 else 0 end as is_max from
(select * from begin) a
left join
(select max(value) as maximum, var from begin group by var) b
on a.var = b.var
;
quit;
To avoid "hard-code" you need to use some code generation.
First let's figure out what code you could use to solve the problem. Later we can look into ways to generate that code.
It is probably easiest to do this with PROC SQL code. SAS will allow you to reference the MAX() value of a variable. Also note that SAS evaluates boolean expressions to 1 (TRUE) or 0 (FALSE). So you just want to generate code like:
proc sql;
create table want as
select var1=max(var1) as var1
, var2=max(var2) as var2
from have
;
quit;
To generate the code you need a list of the variables in your source dataset. You can get those with PROC CONTENTS but also with the metadata table (view) DICTIONARY.COLUMNS (also accessible as SASHELP.VCOLUMN from outside PROC SQL).
If the list of variables is small then you could generate the code into a single macro variable.
proc sql noprint;
select catx(' ',cats(name,'=max(',name,')'),'as',name)
into :varlist separated by ','
from dictionary.columns
where libname='WORK' and memname='HAVE'
order by varnum
;
create table want as
select &varlist
from have
;
quit;
The maximum number of characters that will fit into a macro variable is 64K. So long enough for about 2,000 variables with names of 8 characters each.
Here is little more complex way that uses PROC SUMMARY and a data step with a temporary array. It does not really need any code generation.
%let dsin=sashelp.class(obs=10);
%let dsout=want;
%let varlist=_numeric_;
proc summary data=&dsin nway ;
var &varlist;
output out=summary(drop=_type_ _freq_) max= ;
run;
data &dsout;
if 0 then set &dsin;
array vars &varlist;
array max [10000] _temporary_;
if _n_=1 then do;
set summary ;
do _n_=1 to dim(vars);
max[_n_]=vars[_n_];
end;
end;
set &dsin;
do _n_=1 to dim(vars);
vars[_n_]=vars[_n_]=max[_n_];
end;
run;
Results:
Obs Name Sex Age Height Weight
1 Alfred M 0 1 1
2 Alice F 0 0 0
3 Barbara F 0 0 0
4 Carol F 0 0 0
5 Henry M 0 0 0
6 James M 0 0 0
7 Jane F 0 0 0
8 Janet F 1 0 1
9 Jeffrey M 0 0 0
10 John M 0 0 0
I have a situation where I would like to put the value of a variable in the label in SAS.
Example: Median for Total_Days is 2. I would like to put this value in Days_Median_Split label. The median keeps on changing with varying data, so I would like to automate it.
Phy_Activity Total_Days "Days_Median_Split: Number of Days with Median 2"
No 0 0
No 0 0
Yes 2 1
Yes 3 1
Yes 5 1
Sample Dataset
Thanks so much!
* step 1 create data;
data have;
input Phy_Activity $ Total_Days Days_Median_Split;
datalines;
No 0 0
No 0 0
Yes 2 1
Yes 3 1
Yes 5 1
run;
*step 2 sort data on Total_days;
proc sort data = have;
by Total_days;
run;
*step 3 get count of obs;
proc sql noprint;
select count(*) into: cnt
from have;quit;
* step 4 calulate median;
%let median = %sysevalf(&cnt/2 + .5);
*step 5 get median obsevation;
proc sql noprint;
select Total_days into: medianValue
from have
where monotonic()=&median;quit;
*step 6 create label;
data have;
set have;
label Days_Median_split = 'Days_Median_split: Number of Days with Median '
%trim(&medianValue);
run;
Similar to here, I can count the number of missing observations:
data dataset;
input a b c;
cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;
proc means data=dataset NMISS N;
run;
But how can I also count the number of observations that are 0?
If you want to count the number of observations that are 0, you'd want to use proc tabulate or proc freq, and do a frequency count.
If you have a lot of values and you just want "0/not 0", that's easy to do with a format.
data have;
input a b c;
cards;
1 2 3
0 1 0
0 0 0
7 6 .
. 3 0
0 0 .
;
run;
proc format;
value zerof
0='Zero'
.='Missing'
other='Not Zero';
quit;
proc freq data=have;
format _numeric_ zerof.;
tables _numeric_/missing;
run;
Something along those lines. Obviously be careful about _numeric_ as that's all numeric variables and could get messy quickly if you have a lot of them...
I add this as an additional answer. It requires you to have PROC IML.
This uses matrix manipulation to do the count.
(ds=0) -- creates a matrix of 0/1 values (false/true) of values = 0
[+,] -- sums the rows for all columns. If we have 0/1 values, then this is the number of value=0 for each column.
' -- operator is transpose.
|| -- merge matrices {0} || {1} = {0 1}
Then we just print the values.
proc iml;
use dataset;
read all var _num_ into ds[colname=names];
close dataset;
ds2 = ((ds=0)[+,])`;
n = nrow(ds);
ds2 = ds2 || repeat(n,ncol(ds),1);
cnames = {"N = 0", "Count"};
mattrib ds2 rowname=names colname=cnames;
print ds2;
quit;
Easiest to use PROC SQL. You will have to use a UNION to replicate the MEANS output;
Each section of the first FROM counts the 0 values for each variable and UNION stacks them up.
The last section just counts the number of observations in DATASET.
proc sql;
select n0.Variable,
n0.N_0 label="Number 0",
n.count as N
from (
select "A" as Variable,
count(a) as N_0
from dataset
where a=0
UNION
select "B" as Variable,
count(b) as N_0
from dataset
where b=0
UNION
select "C" as Variable,
count(c) as N_0
from dataset
where c=0
) as n0,
(
select count(*) as count
from dataset
) as n;
quit;
there is levels options in proc freq you could use.
proc freq data=dataset levels;
table _numeric_;
run;
I hope this is not a duplicate question. I've searched the forum and retain function seems to be choice of weapon but it copies down an observation, and I'm trying to do the following; for a given id, copy the second line to the first line for the x value. also first value of x is always 2.
Here's my data;
id x
3 2
3 1
3 1
2 2
2 1
2 1
6 2
6 0
6 0
and i want it to look like this;
id x
3 1
3 1
3 1
2 1
2 1
2 1
6 0
6 0
6 0
and here's the starter code;
data have;
input id x;
cards;
3 2
3 1
3 1
2 2
2 1
2 1
6 2
6 0
6 0
;
run;
Lead is tricky in SAS. You can sort in reverse and use a lag function to get around it though, and you are right: a retain statement will allow us to add an order variable so we can sort it back to its original format.
data have;
set have;
retain order;
lagid = lag(id);
if id ne lagid then order = 0;
order = order + 1;
drop lagid;
run;
proc sort data=have; by id descending order; run;
data have;
set have;
leadx = lag(x);
run;
proc sort data=have; by id order; run;
data have;
set have;
if order = 3 then x_fixed = x;
else x_fixed = leadx;
run;
If your data is exactly as you say, then you can use a lookahead merge. It literally takes the dataset and merges itself to a copy of the dataset that starts on row 2, side-to-side. You just have to check that you're still on the same ID. This does change the value of x for all records to the value one hence, not just the first; you could add additional code to pay attention to that (but can't use FIRST and LAST).
data want;
merge have have(firstobs=2 rename=(id=newid x=newx));
if newid=id then x=newx;
keep x id;
run;
If you don't have any additional variables of interest, then you can do something even more interesting: duplicate the second row in its entirety and delete the first row.
data want;
set have;
by id notsorted;
if first.id then do;
firstrow+1;
delete;
end;
if firstrow=1 then do;
firstrow=0;
output;
end;
output;
run;
However, the "safest" method (in terms of doing most likely what you want precisely) is the following, which is a DoW loop.
data want;
idcounter=0;
do _n_ = 1 by 1 until (last.id);
set have;
by id notsorted;
idcounter+1;
if idcounter=2 then second_x = x;
end;
do _n_=1 by 1 until (last.id);
set have;
by id notsorted;
if first.id then x=second_x;
output;
end;
run;
This identifies the second x in the first loop, for that BY group, then in the second loop sets it to the correct value for row 1 and outputs.
In both of the latter examples I assume your data is organized by ID but not truly sorted (like your initial example is). If it's not organized by ID, you need to perform a sort first (but then can remove the NOTSORTED).