In SAS, find mean value of all other observations in groups - sas

I am trying to find the mean value of all other observations in the same group.
My data is like
Value Name Group Mean_all_other
544 Pete 1 ....
997 Sara 1 ....
772 Tom 1 ....
725 Tris 2 ....
872 Lulu 2 ....
434 Mica 2 ....
728 Tina 2 ....
827 Bo 3 ....
322 Zu 3 ....
..... ... ... ...
I know that proc means can give you the mean value within groups.
But here I want o create the Mean value of all other in the same group.
In this case, Pete under Mean_all_other will show 884.5, which equals to (997+772)/2.
And Sara= (544+772)/2=658; Tris=(872+434+728)/3=678
Anyone has any idea ?

Consider a proc sql solution using a subquery to average same Group for each row and conditioning out the current Name. Below query uses SAS's not equal operator ^= and mean() function which in regular SQL would use the <> operator and avg() -both still compliant in SAS).
proc sql;
create table NewTable as
select * from
(select main.Value, main.Name, main.Group,
(select mean(sub.Value)
from CurrentTable sub
where sub.Group = main.Group
and sub.Name ^= main.Name) As Mean_all_other
from CurrentTable as main)
quit;
* Value Name Group Mean_all_other
* 544 Pete 1 884.5
* 997 Sara 1 658
* 772 Tom 1 770.5
* 725 Tris 2 678
* 872 Lulu 2 629
* 434 Mica 2 775
* 728 Tina 2 677
* 827 Bo 3 322
* 322 Zu 3 827

Once you have the mean for each whole group, the case-deleted mean for each observation is much easier to calculate. I'd suggest doing this via double DOW loop:
data have;
input Value Name $ Group;
cards;
544 Pete 1
997 Sara 1
772 Tom 1
725 Tris 2
872 Lulu 2
434 Mica 2
728 Tina 2
827 Bo 3
322 Zu 3
;
run;
data want;
do _N_ = 1 by 1 until(last.group);
set have;
by group;
value_sum = sum(value_sum,value);
value_count = sum(value_count,1);
end;
do _N_ = 1 to _N_;
set have;
mean_all_other = (value_sum - value)/(value_count - 1);
output;
end;
drop value_:;
run;

PROC SQL will happily remerge the summary statistics for you. Note that this syntax probably will not work in other SQL implementations, but works fine in SAS. You can use the DIVIDE function to avoid dividing by zero for groups with only one member.
create table want as
select *
, divide(sum(value) - value, n(value) - 1) as mean_all_other
from have
group by group
;
For other SQL implementations you will need to re-merge the aggregate results yourself.
create table want as
select a.*
, divide(b.sum_value - a.value, b.n_value - 1) as mean_all_other
from have a
, (select group,sum(value) as sum_value,n(value) as n_value
from have
group by group
) b
where a.group = b.group
;
If the value of VALUE could be missing then you need to add a CASE statement to handle those cases.
create table want as
select *
, case when (missing(value)) then mean(value)
else divide(sum(value) - value, n(value) - 1)
end as mean_all_other
from have
group by group
;

Related

How do I get the following combined table through SAS?

My goal is to combine these tables into one without having to manually run my macro each time for each column.
The code I currently have with me is the following:
%macro task_Oct(set,col_name);
data _type_;
set &set;
call symputx('col_type', vtype(&col_name));
run;
proc sql;
create table work.oct27 as
select "&col_name" as variable,
"&col_type" as type,
nmiss(&col_name) as missing_val,
count(distinct &col_name) as distinct_val
from &set;
quit;
%mend task_Oct;
%task_Oct(sashelp.cars,Origin)
The above code gives me the following output:
|Var |Type |missing_val|distinct_val|
|Origin|Character|0 | 3 |
But the sashelp.cars data sheet has 15 columns and so I would like to output a new data sheet which has 15 rows with 4 columns.
I would like to get the following combined table as the output of my code:
|Var |Type |missing_val|distinct_val|
|Make |Character|0 | 38 |
|Model |Character|0 | 425 |
|Type |Character|0 | 6 |
|Origin|Character|0 | 3 |
...
...
...
Since I'm using a macro, I could run my code 15 different times by manually entering the names of the columns and then merging the tables into 1; and it wouldn't be a problem. But what if I have a table with 100s of columns? I could use some loop statement but I'm not sure how to go about that in this case. Help would be appreciated. Thank you.
The main output you appear to want can be generated directly by PROC FREQ with the NLEVELS option. If you want to add in the variable type then just merge it with the output of PROC CONTENTS.
ods exclude all;
ods output nlevels=nlevels;
proc freq data=sashelp.cars nlevels;
tables _all_ / noprint ;
run;
ods select all;
proc contents data=sashelp.cars noprint out=contents;
run;
proc sql;
create table want(drop=table:) as
select c.varnum,c.name,c.type,n.*
from contents c inner join nlevels n
on c.name=n.TableVar
order by varnum
;
quit;
Result
NNon
NMiss Miss
Obs VARNUM NAME TYPE NLevels Levels Levels
1 1 Make 2 38 0 38
2 2 Model 2 425 0 425
3 3 Type 2 6 0 6
4 4 Origin 2 3 0 3
5 5 DriveTrain 2 3 0 3
6 6 MSRP 1 410 0 410
7 7 Invoice 1 425 0 425
8 8 EngineSize 1 43 0 43
9 9 Cylinders 1 8 1 7
10 10 Horsepower 1 110 0 110
11 11 MPG_City 1 28 0 28
12 12 MPG_Highway 1 33 0 33
13 13 Weight 1 348 0 348
14 14 Wheelbase 1 40 0 40
15 15 Length 1 67 0 67
The NMissLevels variable counts the number of distinct types of missing values.
If instead you want to count the number of observations with (any) missing value you will need to use code generation. So use the CONTENTS data to generate an SQL query to generate all of the counts you want into a single observation with only one pass through the data. You can then transpose that to make it usable for re-merging with the CONTENTS data.
filename code temp;
data _null_;
set contents end=eof;
length nliteral $65 dsname $80;
nliteral=nliteral(name);
dsname = catx('.',libname,nliteral(memname));
file code;
if _n_=1 then put 'create table counts as select ' / ' ' # ;
else put ',' #;
put 'nmiss(' nliteral ') as missing_' varnum
/',count(distinct ' nliteral ') as distinct_' varnum
;
if eof then put 'from ' dsname ';';
run;
proc sql;
%include code /source2;
quit;
proc transpose data=counts out=count2 name=name ;
run;
proc sql ;
create table want as
select c.varnum, c.name, c.type
, m.col1 as nmissing
, d.col1 as ndistinct
from contents c
left join count2 m on m.name like 'missing%' and c.varnum=input(scan(m.name,-1,'_'),32.)
left join count2 d on d.name like 'distinct%' and c.varnum=input(scan(d.name,-1,'_'),32.)
order by varnum
;
quit;
Result
Obs VARNUM NAME TYPE nmissing ndistinct
1 1 Make 2 0 38
2 2 Model 2 0 425
3 3 Type 2 0 6
4 4 Origin 2 0 3
5 5 DriveTrain 2 0 3
6 6 MSRP 1 0 410
7 7 Invoice 1 0 425
8 8 EngineSize 1 0 43
9 9 Cylinders 1 2 7
10 10 Horsepower 1 0 110
11 11 MPG_City 1 0 28
12 12 MPG_Highway 1 0 33
13 13 Weight 1 0 348
14 14 Wheelbase 1 0 40
15 15 Length 1 0 67

SAS problem: sum up rows and divide till it reach a specific value

I have the following problem, I would like to sum up a column and divide the sum every line through the sum of the whole column till a specific value is reached. so in Pseudocode it would look like that:
data;
set auto;
sum_of_whole_column = sum(price);
subtotal[i] = 0;
i =1;
do until (subtotal[i] = 70000)
subtotal[i] = (subtotal[i] + subtotal[i+1])/sum_of_whole_column
i = i+1
end;
run;
I get the error that I haven't defined an array... so can I use something else instead of subtotal[i]?and how can I put a column in an array? I tried but it doesn't work (data = auto and price the column I want to put into an array)
data invent_array;
set auto;
array price_array {1} price;
run;
EDIT: maybe the dataset I used is helpful :)
DATA auto ;
LENGTH make $ 20 ;
INPUT make $ 1-17 price mpg rep78 ;
CARDS;
AMC Concord 4099 22 3
AMC Pacer 4749 17 3
Audi 5000 9690 17 5
Audi Fox 6295 23 3
BMW 320i 9735 25 4
Buick Century 4816 20 3
Buick Electra 7827 15 4
Buick LeSabre 5788 18 3
Cad. Eldorado 14500 14 2
Olds Starfire 4195 24 1
Olds Toronado 10371 16 3
Plym. Volare 4060 18 2
Pont. Catalina 5798 18 4
Pont. Firebird 4934 18 1
Pont. Grand Prix 5222 19 3
Pont. Le Mans 4723 19 3
;
RUN;
Perhaps I am missing your point but your subtotal will never be equal to 70 000 if you divide by the sum of its column. The maximum value will be 1. Your incremental sum however can be equal or superior to 70 000.
data stage1;
retain _sum 0;
set auto;
_sum = sum(_sum, price);
if _sum < 70000 then output;
run;
proc sql;
create table want as
select t1.*, t1._sum/sum(price) as subtotal
from stage1 as t1;
quit;
subtotal
0.0607268256
0.1310834235
0.2746411058
0.3679017467
0.5121261056
0.5834753107
0.6994325842
0.7851820027
1

how to extract specific variable to to fill another cell value?

I have the following dataset created for this example
/*sample data*/
data have;
input subj param value visit$ base;
cards;
1 1 50 scr .
1 1 55 rand 55
1 1 . 1 55
1 1 . 2 55
1 2 120 scr .
1 2 125 rand 125
1 2 . 1 125
1 2 . 2 125
;
run;
I want to make sure that the base value of scr is the same as 'base' when visit='rand' so that it looks like the following
/*sample data*/
data want;
set have;
input subj param value visit$ base;
cards;
1 1 50 scr 55
1 1 55 rand 55
1 1 . 1 55
1 1 . 2 55
1 2 120 scr 125
1 2 125 rand 125
1 2 . 1 125
1 2 . 2 125
;
run;
Double do-loop:
data want;
do until(last.param);
set have;
by subj param notsorted;
if visit='rand' then temp=base;
end;
do until(last.param);
set have;
by subj param notsorted;
if visit='scr' then base=temp;
output;
end;
drop temp;
run;
Here's a different approach, which uses SQL and joins the table with itself.
Not quite sure how to get back your original order, if that's important you may need to add a sort variable ahead of time.
proc sql;
create table want as
select t1.subj, t1.param, t1.value, t1.visit,
case when visit='scr' then t2.base
else t1.base
end as base
from have as t1
left join (select subj, param, base from have where visit = 'rand') as t2
on t1.subj=t2.subj
and t1.param=t2.param
order by 1, 2, 3, 4;
quit;
Lots of ways to do this. Assuming the data is not huge in size, use a hash table to update the values.
Get the %create_hash() macro from here
data have;
input subj param value visit$ base;
cards;
1 1 50 scr .
1 1 55 rand 55
1 1 . 1 55
1 1 . 2 55
1 2 120 scr .
1 2 125 rand 125
1 2 . 1 125
1 2 . 2 125
;
run;
data want;
set have;
if _n_ =1 then do;
%create_hash(lk,subj param,base,"have(where=(visit='rand'))");
/* The macro above generates the following code
declare hash lk(dataset:"have(where=(visit='rand'))");
rc = lk.definekey( "subj", "param" );
rc = lk.definedata("base");
rc = lk.definedone();
*/
end;
rc = lk.find();
drop rc;
run;
This reads the data set into memory and gives you a hash based look up into the values. Use a subsetting where clause to only load the rand records into the hash.

Automatically replace outlying values with missing values

Suppose the data set have contains various outliers which have been identified in an outliers data set. These outliers need to be replaced with missing values, as demonstrated below.
Have
Obs group replicate height weight bp cholesterol
1 1 A 0.406 0.887 0.262 0.683
2 1 B 0.656 0.700 0.083 0.836
3 1 C 0.645 0.711 0.349 0.383
4 1 D 0.115 0.266 666.000 0.015
5 2 A 0.607 0.247 0.644 0.915
6 2 B 0.172 333.000 555.000 0.924
7 2 C 0.680 0.417 0.269 0.499
8 2 D 0.787 0.260 0.610 0.142
9 3 A 0.406 0.099 0.263 111.000
10 3 B 0.981 444.000 0.971 0.894
11 3 C 0.436 0.502 0.563 0.580
12 3 D 0.814 0.959 0.829 0.245
13 4 A 0.488 0.273 0.463 0.784
14 4 B 0.141 0.117 0.674 0.103
15 4 C 0.152 0.935 0.250 0.800
16 4 D 222.000 0.247 0.778 0.941
Want
Obs group replicate height weight bp cholesterol
1 1 A 0.4056 0.8870 0.2615 0.6827
2 1 B 0.6556 0.6995 0.0829 0.8356
3 1 C 0.6445 0.7110 0.3492 0.3826
4 1 D 0.1146 0.2655 . 0.0152
5 2 A 0.6072 0.2474 0.6444 0.9154
6 2 B 0.1720 . . 0.9241
7 2 C 0.6800 0.4166 0.2686 0.4992
8 2 D 0.7874 0.2595 0.6099 0.1418
9 3 A 0.4057 0.0988 0.2632 .
10 3 B 0.9805 . 0.9712 0.8937
11 3 C 0.4358 0.5023 0.5626 0.5799
12 3 D 0.8138 0.9588 0.8293 0.2448
13 4 A 0.4881 0.2731 0.4633 0.7839
14 4 B 0.1413 0.1166 0.6743 0.1032
15 4 C 0.1522 0.9351 0.2504 0.8003
16 4 D . 0.2465 0.7782 0.9412
The "get it done" approach is to manually enter each variable/value combination in a conditional which replaces with missing when true.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
;
run;
data outliers;
input parameter $ 11. group replicate $ measurement;
datalines;
cholesterol 3 A 111
height 4 D 222
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
EDIT: Updated outliers so that parameter avoids truncation and changed measurement to be numeric type so as to match the corresponding height, weight, bp, cholesterol. This shouldn't change the responses.
data want;
set have;
if group = 3 and replicate = 'A' and cholesterol = 111 then cholesterol = .;
if group = 4 and replicate = 'D' and height = 222 then height = .;
if group = 2 and replicate = 'B' and weight = 333 then weight = .;
if group = 3 and replicate = 'B' and weight = 444 then weight = .;
if group = 2 and replicate = 'B' and bp = 555 then bp = .;
if group = 1 and replicate = 'D' and bp = 666 then bp = .;
run;
This, however, doesn't utilize the outliers data set. How can the replacement process be made automatic?
I immediately think of the IN= operator, but that won't work. It's not the entire row which needs to be matched. Perhaps an SQL key matching approach would work? But to match the key, don't I need to use a where statement? I'd then effectively be writing everything out manually again. I could probably create macro variables which contain the various if or where statements, but that seems excessive.
I don't think generating statements is excessive in this case. The complexity arises here because your outlier dataset cannot be merged easily since the parameter values represent variable names in the have dataset. If it is possible to reorient the outliers dataset so you have a 1 to 1 merge, this logic would be simpler.
Let's assume you cannot. There are a few ways to use a variable in a dataset that corresponds to a variable in another.
You could use an array like array params{*} height -- cholesterol; and then use the vname function as you loop through the array to compare to the value in the parameter variable, but this gets complicated in your case because you have a one to many merge, so you would have to retain the replacements and only output the last record for each by group... so it gets complicated.
You could transpose the outliers data using proc transpose, but that will get lengthy because you will need a transpose for each parameter, and then you'd need to merge all the transposed datasets back to the have dataset. My main issue with this method is that code with a bunch of transposes like that gets unwieldy.
You create the macro variable logic you are thinking might be excessive. But compared to the other ways of getting the values of the parameter variable to match up with the variable names in the have dataset, I don't think something like this is excessive:
data _null_;
set outliers;
call symput("outlierstatement"||_n_,"if group = "||group||" and replicate = '"||replicate||"' and "||parameter||" = "||measurement||" then "|| parameter ||" = .;");
call symput("outliercount",_n_);
run;
%macro makewant();
data want;
set have;
%do i = 1 %to &outliercount;
&&outlierstatement&i;
%end;
run;
%mend;
Lorem:
Transposition is the key to a fully automatic programmatic approach. The transposition that will occur is of the filter data, not the original data. The transposed filter data will have fewer rows than the original. As John indicated, transposition of the want data can create a very tall table and has to be transposed back after applying the filters.
As to the the filter data, the presence of a filter row for a specific group, replicate and parameter should be enough to mark a cell for filtering. This is on the presumption that you have a system for automatic outlier detection and the filter values will always be in concordance with the original values.
So, what has to be done to automate the filter application process without code generating a wall of test and assign statements ?
Transpose filter data into same form as want data, call it Filter^
Merge Want and Filter^ by record key (which is the by group of Group and Replicate)
Array process the data elements, looking for filtering conditions.
For your consideration, try the following SAS code. There is an erroneous filter record added to the mix.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
5 E 222 0.2465 0.7782 0.9412 /* test record for filter value misalignment test */
;
run;
data outliers;
length parameter $32; %* <--- widened parameter so it can transposed into column via id;
input parameter $ group replicate $ measurement ; %* <--- changed measurement to numeric variable;
datalines;
cholesterol 3 A 111
height 4 D 222
height 5 E 223 /* test record for filter value misalignment test */
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
data want;
set have;
if group = 3 and replicate = 'A' and cholesterol = 111 then cholesterol = .;
if group = 4 and replicate = 'D' and height = 222 then height = .;
if group = 2 and replicate = 'B' and weight = 333 then weight = .;
if group = 3 and replicate = 'B' and weight = 444 then weight = .;
if group = 2 and replicate = 'B' and bp = 555 then bp = .;
if group = 1 and replicate = 'D' and bp = 666 then bp = .;
run;
/* Create a view with 1st row having all the filtered parameters
* This is necessary so that the first transposed filter row
* will have the parameters as columns in alphabetic order;
*/
proc sql noprint;
create view outliers_transpose_ready as
select distinct parameter from outliers
union
select * from outliers
order by group, replicate, parameter
;
/* Generate a alphabetic ordered list of parameters for use
* as a variable (aka column) list in the filter application step */
select distinct parameter
into :parameters separated by ' '
from outliers
order by parameter
;
quit;
%put NOTE: &=parameters;
/* tranpose the filter data
* The ID statement pivots row data into column names.
* The prefix=_filter_ ensure the new column names
* will not collide with the original data, and can be
* the shortcut listed with _filter_: in an array statement.
*/
proc transpose data=outliers_transpose_ready out=outliers_apply_ready prefix=_filter_;
by group replicate notsorted;
id parameter;
var measurement;
run;
/* Robust production code should contain a bin for
* data that does not conform to the filter application conditions
*/
data
want2(label="Outlier filtering applied" drop=_i_ _filter_:)
want2_warnings(label="Outlier filtering: misaligned values")
;
merge have outliers_apply_ready(keep=group replicate _filter_:);
by group replicate;
/* The arrays are for like named columns
* due to the alphabetic ordering enforced in data and codegen preparation
*/
array value_filter_check _filter_:;
array value &parameters;
if group ne .;
do _i_ = 1 to dim(value);
if value(_i_) EQ value_filter_check(_i_) then
value(_i_) = .;
else
if not missing(value_filter_check(_i_)) AND
value(_i_) NE value_filter_check(_i_)
then do;
put 'WARNING: Filtering expected but values do not match. ' group= replicate= value(_i_)= value_filter_check(_i_)=;
output want2_warnings;
end;
end;
output want2;
run;
Confirm your want and automated want2 agree.
proc compare noprint data=want compare=want2 outnoequal out=diffs;
by group replicate;
run;
Enjoy your SAS
You could use a hash table. Load a hash table with the outlier dataset, with parameter-group-replicate defined as the key. Then read in the data, and as you read each record, check each of the variables to see if that combination of parameter-group-replicate can be found in the hash table. I think below works (I'm no hash expert):
data want;
if 0 then set outliers (keep=parameter group replicate);
if _N_ = 1 then
do;
declare hash h(dataset:'outliers') ;
h.defineKey('parameter', 'group', 'replicate') ;
h.defineDone() ;
end;
set have ;
array vars {*} height weight bp cholesterol ;
do i=1 to dim(vars);
parameter=vname(vars{i});
if h.check()=0 then call missing(vars{i});
end;
drop i parameter;
run;
I like #John's suggestion:
You could use an array like array params{*} height -- cholesterol; and
then use the vname function as you loop through the array to compare
to the value in the parameter variable, but this gets complicated in
your case because you have a one to many merge, so you would have to
retain the replacements and only output the last record for each by
group... so it gets complicated.
Generally in a one to many merge I would avoid recoding variables from the dataset that is unique, because variables are retained within BY groups. But in this case, it works out well.
proc sort data=outliers;
by group replicate;
run;
data want (keep=group replicate height weight bp cholesterol);
merge have (in=a)
outliers (keep=group replicate parameter in=b)
;
by group replicate;
array vars {*} height weight bp cholesterol ;
do i=1 to dim(vars);
if vname(vars{i})=parameter then call missing(vars{i});
end;
if last.replicate;
run;
Thank you #John for providing a proof of concept. My implementation is a little different and I think worth making a separate entry for posterity. I went with a macro variable approach because I feel it is the most intuitive, being a simple text replacement. However, since a macro variable can contain only 65534 characters, it is conceivable that there could be sufficient outliers to exceed this limit. In such a case, any of the other solutions would make fine alternatives. Note that it is important that the put statement use something like best32. Too short a width will truncate the value.
If you desire to have a dataset containing the if statements (perhaps for verification), simply remove the into : statement and place a create table statements as line at the beginning of the PROC SQL step.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
;
run;
data outliers;
input parameter $ 11. group replicate $ measurement;
datalines;
cholesterol 3 A 111
height 4 D 222
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
proc sql noprint;
select
cat('if group = '
, strip(put(group, best32.))
, " and replicate = '"
, strip(replicate)
, "' and "
, strip(parameter)
, ' = '
, strip(put(measurement, best32.))
, ' then '
, strip(parameter)
, ' = . ;')
into : listIfs separated by ' '
from outliers
;
quit;
%put %quote(&listIfs);
data want;
set have;
&listIfs;
run;

SAS: code to create 'ever' variable for subsequent obsevations once event occurs

data have;
input ID Herpes;
datalines;
111 1
111 .
111 1
111 1
111 1
111 .
111 .
254 0
254 0
254 1
254 .
254 1
331 1
331 1
331 1
331 0
331 1
331 1
;
Where 1=Positive, 0=Negative, .=Missing/Not Indicated
Observations are sorted by ID (random numbers, no meaning) and date of visit (not included because not needed from here forward). Once you have Herpes, you always have Herpes. How do I adjust the Herpes variable (or create a new one) so that once a Positive is indicated (Herpes=1), all following obs will show Herpes=1 for that ID?
I want the resulting set to look like this:
111 1
111 1 (missing changed to 1)
111 1
111 1
111 1 (missing changed to 1)
111 1 (missing changed to 1)
111 1
254 0
254 0
254 1
254 1 (missing changed to 1 following positive at prior visit)
254 1
331 1
331 1
331 1
331 1 (patient-indicated negative/0 changed to 1 because of prior + visit)
331 1
331 1
The below code should do the trick. The trick is to use by-group processing in conjunction with the retain statement.
proc sort data=have;
by id;
run;
data want;
set have;
by id;
retain uh_oh .;
if first.id then do;
uh_oh = .;
end;
if herpes then do;
uh_oh = 1;
end;
if uh_oh then do;
herpes = 1;
end;
drop uh_oh;
run;
You could create a new variable that sums the herpes flag within ID:-
proc sort data=have;
by id;
data have_too;
set have;
by id;
if first.id then sum_herpes_in_id = 0;
sum_herpes_in_id ++ herpes;
run;
That way it's always positive from the first time herpes=1 within id. You can access these observations in other datasteps / procs with where sum_herpes_in_id;.
And for free, you also have the total number of herpes flags per id (if that's of any use).
This can also be done in SQL. Here is an example using UPDATE to update the table in place. (This could also be done in base SAS with MODIFY.)
proc sql undopolicy=none;
update have H
set herpes=1 where exists (
select 1 from have V
where h.id=v.id
and h.dtvar ge v.dtvar
and v.herpes=1
);
quit;
The SAS version using modify. BY doesn't work in a one-dataset modify for some reason, so you have to do your own version of first.id.
data have;
modify have;
drop _:;
retain _t _i;
if _i ne id then _t=.;
_i=id;
_t = _t or herpes;
if _t then herpes=1;
run;