Creating a new variable by comparing others - sas

I have a dataset that consists of a series of readings made by different people/instruments, of a bunch of different dimensions. It looks like this:
SUBJECT DIM1_1 DIM1_2 DIM1_3 DIM1_4 DIM1_5 DIM2_1 DIM2_2 DIM2_3 DIM3_1 DIM3_2
1 1 . 1 1 2 3 3 3 2 .
2 1 1 . 1 1 2 2 3 1 1
3 2 2 2 . . 1 . . 5 5
... ... ... ... ... ... ... ... ... ... ...
My real dataset contains around 190 dimensions, with up to 5 measures in each one.
I have to obey a set of rules to create a new variable for each dimension:
If there are 2 different values in the same dimension (missings excluded), the new variable is a missing.
If all values are the same (missings excluded), the new variable assumes the same value.
My new variables should look like this:
SUBJECT ... DIM1_X DIM2_X DIM3_X
1 ... . 3 2
2 ... 1 . 1
3 ... 2 1 5
The problem here is that I don't have the same number of measures for each dimension. Also, I could only come up with a lot of IFs (and I mean a LOT, since each extra measure in a dimension increases the number of comparisons), so I wonder if there is an easier way to handle this particular problem.
Any help would be appreciated.
Thanks in advance.

The easiest way is to transpose it to vertical (one row per DIMx_y), summarize, set the ones you want missing to missing, then retranspose (and merge back on if needed).
data have;
input SUBJECT DIM1_1 DIM1_2 DIM1_3 DIM1_4 DIM1_5 DIM2_1 DIM2_2 DIM2_3 DIM3_1 DIM3_2;
datalines;
1 1 . 1 1 2 3 3 3 2 .
2 1 1 . 1 1 2 2 3 1 1
3 2 2 2 . . 1 . . 5 5
;;;;
run;
data have_pret;
set have;
array dim_data DIM:;
do _t = 1 to dim(dim_Data); *dim function is not related to the name - it gives # of vars in array;
dim_Group = scan(vname(dim_data[_t]),1,'_');
dim_num = input(scan(vname(dim_data[_t]),2,'_'),BEST12.);
dim_val=dim_data[_t];
output;
end;
keep dim_group dim_num subject dim_val;
run;
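proc freq data=have_pret noprint;
by subject dim_group notsorted; *notsorted: with many dimensions DIM10 follows DIM9, which is not alphabetical order;
tables dim_val/out=want_pret(where=(not missing(dim_val)));
run;
data want_pret2;
set want_pret;
by subject dim_Group notsorted;
if percent ne 100 then dim_val=.; *more than one distinct nonmissing value;
idval = cats(dim_Group,'_X');
if last.dim_Group; *keep one row per subject and dimension;
run;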
proc transpose data=want_pret2 out=want;
by subject;
id idval;
var dim_val;
run;
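For comparison, here is a minimal sketch of a wide-format alternative that skips the transpose. It assumes the three dimensions from the sample data and uses one array per dimension. The array names d1-d3 and the output name want_alt are made up for illustration. RANGE ignores missing values, so a range of 0 means every nonmissing reading agrees.

```sas
data want_alt;
  set have;
  /* one array per dimension; sizes differ, which is fine */
  array d1 DIM1_1-DIM1_5;
  array d2 DIM2_1-DIM2_3;
  array d3 DIM3_1-DIM3_2;
  /* range = 0 means all nonmissing values are equal;        */
  /* otherwise the new variable is left at its default (.)   */
  if range(of d1[*]) = 0 then DIM1_X = max(of d1[*]);
  if range(of d2[*]) = 0 then DIM2_X = max(of d2[*]);
  if range(of d3[*]) = 0 then DIM3_X = max(of d3[*]);
run;
```

With 190 dimensions you would probably want to generate these statements rather than type them, so the transpose approach above may still be less work.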

Related

page break by length and group sas proc report

I would like to create a page break value that can help me break the page when I use proc report.
Now my data looks like this:
Group Value
a 1
a 2
a 3
...
b 1
b 2
...
c 1
c 2
c 3
And suppose I only want two lines per page, and break if the group changed.
So I need a dataset like this:
Group Value Page
a 1 1
a 2 1
a 3 2
...
b 1 3
b 2 3
...
c 1 4
c 2 4
c 3 5
Can anyone help me with this? Thanks!
RETAIN holds values across rows. Create a counter that tracks the number of records per group; this lets you split each group into pages of N rows.
Use BY and FIRST. to reset the counter at the start of each group.
Check whether you need to increment the page.
data have;
input Group $ Value;
cards;
a 1
a 2
a 3
b 1
b 2
c 1
c 2
c 3
;;;;
data want;
set have;
by group;
retain counter page;
if first.group then counter=0;
counter+1;
if mod(counter, 2) =1 or first.group then page+1;
run;
proc print data=want;
run;
Results:
Obs Group Value counter page
1 a 1 1 1
2 a 2 2 1
3 a 3 3 2
4 b 1 1 3
5 b 2 2 3
6 c 1 1 4
7 c 2 2 4
8 c 3 3 5
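Once the page variable exists, it can drive the page breaks in PROC REPORT. The following is a sketch only, assuming the want data set created above and default output settings:

```sas
proc report data=want;
  column page group value;
  /* page must be an ORDER (or GROUP) variable to be used in BREAK */
  define page  / order noprint;
  define group / display;
  define value / display;
  /* start a new page whenever the value of page changes */
  break after page / page;
run;
```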

Calculating median across multiple rows and columns in SAS 9.4

I tried searching multiple places but have not been able to find a solution yet. I was wondering if someone here would be able to please help me?
I am trying to calculate a median value (with Q1 and Q3) across multiple rows and columns in SAS 9.4. The dataset I am working with looks like the following:
Obs tumor_size_1 tumor_size_2 tumor_size_3 tumor_size_4
1 4 1.5 1 1
2 2.5 2 . .
3 3 . . .
4 4 . . .
5 3.5 1 . .
The context is this is for a medical condition where a person may have 1 (or more) tumors. Each row represents 1 person. Each person may have up to 4 tumors. I would like to determine the median size of all tumors for the entire cohort (not just the median size for each person). Is there a way to calculate this? Thank you in advance.
A transpose of the data will yield a data structure (form) that is amenable to median and quartile computations, at a variety of aggregate combinations, made with PROC SUMMARY and a CLASS statement.
Example:
data have;
input
patient tumor_size_1 tumor_size_2 tumor_size_3 tumor_size_4; datalines;
1 4 1.5 1 1
2 2.5 2 . .
3 3 . . .
4 4 . . .
5 3.5 1 . .
;
proc transpose data=have out=new_have;
by patient;
var tumor:;
run;
proc summary data=new_have;
class patient;
var col1;
output out=want Q1=Q1 Q3=Q3 MEDIAN=MEDIAN N=N;
run;
Results
patient _TYPE_ _FREQ_ Q1 Q3 MEDIAN N
. 0 20 1 3.50 2.25 10
1 1 4 1 2.75 1.25 4
2 1 4 2 2.50 2.25 2
3 1 4 3 3.00 3.00 1
4 1 4 4 4.00 4.00 1
5 1 4 1 3.50 2.25 2
The _TYPE_ column describes the ways in which the CLASS variables are combined in order to achieve the results for the requested statistics. The _TYPE_ = 0 case is for all values, and, in this problem, the _FREQ_ = 20 indicates 20 inputs went into the computation consideration, and that N = 10 of those were non-missing and were involved in the actual computation. The role of _TYPE_ becomes more obvious when there is more than one CLASS variable.
From the Output Data Set documentation:
the variable _TYPE_ that contains information about the class variables. By default _TYPE_ is a numeric variable. If you specify CHARTYPE in the PROC statement, then _TYPE_ is a character variable. When you use more than 32 class variables, _TYPE_ is automatically a character variable.
and
The value of _TYPE_ indicates which combination of the class variables PROC MEANS uses to compute the statistics. The character value of _TYPE_ is a series of zeros and ones, where each value of one indicates an active class variable in the type. For example, with three class variables, PROC MEANS represents type 1 as 001, type 5 as 101, and so on.
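If only the cohort-wide row is needed, one option is to filter the output data set on _TYPE_ = 0. This is a small sketch based on the example above; the data set name overall is just an illustrative choice.

```sas
proc summary data=new_have;
  class patient;
  var col1;
  /* keep only the all-values row (_TYPE_ = 0) */
  output out=overall(where=(_type_ = 0)) Q1=Q1 Q3=Q3 MEDIAN=MEDIAN N=N;
run;
```

Dropping the CLASS statement altogether should give the same single row, since no per-patient breakdown is then requested.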
A far less elegant way to compute the median of all is to store all the values in an oversized array and use the MEDIAN function on the array after the last row is read in:
data median_all;
set have end=lastrow;
array values [1000000] _temporary_;
array sizes tumor_size_1-tumor_size_4;
do sIndex = 1 to dim(sizes);
/* if not missing (sizes[sIndex]) then do; */ %* uncomment for dense fill;
vIndex + 1;
values[vIndex] = sizes[sIndex];
/* end; */ %* uncomment for dense fill;
end;
if lastrow then do;
median_all_tumor_sizes = median (of values(*));
output;
put (median:) (=);
end;
keep median:;
run;
-------- LOG -------
median_all_tumor_sizes=2.25

An alternative to a PROC SUMMARY approach when summing variables across multiple observations

I am dealing with a repeated measures dataset in a wide format. Each observation represents one measurement for one subject and each subject is measured six times. The data contains mainly dummy variables.
I am looking to do a count of unique dummy variable values across all six observations for each subject.
Have:
MeasurementNum SubjectID Dummy0 Dummy1 Dummy2 Dummy3 Dummy4
-----------------------------------------------------------------------------
1 1 1 1 0 0 0
2 1 0 1 0 1 0
3 1 - - - - -
4 1 0 0 1 1 0
5 1 - - - - -
6 1 0 0 0 1 0
1 2 1 0 0 1 0
2 2 0 0 0 0 0
3 2 0 1 0 0 0
4 2 1 1 0 1 0
5 2 - - - - -
6 2 1 1 1 0 0
Want:
Total for Overall
MeasurementNum SubjectID ... MeasurementNUM Total
--------------------------------...-----------------------------
1 1 ... 2 4
2 1 ... 2 4
3 1 ... - 4
4 1 ... 2 4
5 1 ... - 4
6 1 ... 1 4
1 2 ... 2 4
2 2 ... 0 4
3 2 ... 1 4
4 2 ... 3 4
5 2 ... - 4
6 2 ... 3 4
My current approach is to consolidate all six rows within each subject into one row, retaining the value 1, using PROC MEANS with BY and OUTPUT statements, as described in this related question. I then use PROC SUMMARY to get the values listed under the variable 'Total' in the want table.
proc summary
data=have;
by SubjectID;
class Dummy1-Dummy4;
output out=want sum=sum;
run;
Is there a way to get the distinct/unique counts across observations without consolidating rows first?
I prefer PROC SQL as it will also allow me to do conditional counts according to subject covariates present in my working dataset. I.e. producing the want descriptives on condition of a covariate specific to the subject.
I suspect that using PROC SUMMARY (aka PROC MEANS) will be the easiest way. Sounds like you want to find the MAX for each SUBJECT and then SUM those to get the subject totals.
proc summary data=have nway ;
class SubjectID ;
var Dummy0-Dummy999;
output out=any(drop=_type_ _freq_) n=n_reps max= ;
run;
data want ;
set any ;
total = sum(of Dummy0-Dummy999) ;
run;
Not sure how SQL helps any with conditional counts. But you could generate the counts and total in one step with PROC SQL, but it would require wallpaper code like this:
proc sql ;
create table want as
select SubjectID
, count(*) as n_reps
, max(dummy0) as dummy0
, max(dummy1) as dummy1
...
, max(dummy999) as dummy999
, sum
( max(dummy0)
, max(dummy1)
...
, max(dummy999)
) as Total
from have
group by 1
;
quit;
You could probably define a macro (or some other tool) to generate that wallpaper code for you from a list of variable names.
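As a rough sketch of such a macro, assume the variables share a prefix and a numeric suffix, as in Dummy0-Dummy4 from the question. The macro names max_cols and max_args are hypothetical, not part of SAS.

```sas
/* emits ", max(Dummy0) as Dummy0 , max(Dummy1) as Dummy1 ..." */
%macro max_cols(prefix, first, last);
  %local i;
  %do i = &first %to &last;
    , max(&prefix.&i) as &prefix.&i
  %end;
%mend max_cols;

/* emits "max(Dummy0) , max(Dummy1) ..." for use inside sum() */
%macro max_args(prefix, first, last);
  %local i;
  %do i = &first %to &last;
    %if &i > &first %then %str(,);
    max(&prefix.&i)
  %end;
%mend max_args;

proc sql;
  create table want as
  select SubjectID
       , count(*) as n_reps
       %max_cols(Dummy, 0, 4)
       , sum(%max_args(Dummy, 0, 4)) as Total
  from have
  group by SubjectID
  ;
quit;
```

The macros only generate text, so the query PROC SQL sees is the same wallpaper code as above, just not typed by hand.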

SAS - Replicate multiple observations across rows

I have a data structure that looks like this:
DATA have ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 1000
1 2 2 2000
1 2 3 3000
1 2 4 4000
1 2 5 5000
1 3 1 .
1 3 2 .
1 3 3 .
1 3 4 .
1 3 5 .
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 .
2 2 2 .
2 2 3 .
2 2 4 .
2 2 5 .
2 3 1 41000
2 3 2 39000
2 3 3 24000
2 3 4 32000
2 3 5 53000
RUN ;
So, we have family id, individual id, implicate number and imputed income for each implicate.
What I need is to replicate the results of the first individual in each family (all five implicates) for the remaining individuals within each family, replacing whatever values we previously had in those cells, like this:
DATA want ;
INPUT famid indid implicate imp_inc;
CARDS ;
1 1 1 40000
1 1 2 25000
1 1 3 34000
1 1 4 23555
1 1 5 49850
1 2 1 40000
1 2 2 25000
1 2 3 34000
1 2 4 23555
1 2 5 49850
1 3 1 40000
1 3 2 25000
1 3 3 34000
1 3 4 23555
1 3 5 49850
2 1 1 40000
2 1 2 45000
2 1 3 50000
2 1 4 34000
2 1 5 23500
2 2 1 40000
2 2 2 45000
2 2 3 50000
2 2 4 34000
2 2 5 23500
2 3 1 40000
2 3 2 45000
2 3 3 50000
2 3 4 34000
2 3 5 23500
RUN ;
In this example I'm trying to replicate only one variable but in my project I will have to do this for dozens of variables.
So far, I came up with this solution:
%let implist_1=imp_inc;
%macro copyv1(list);
%let nwords=%sysfunc(countw(&list));
%do i=1 %to &nwords;
%let varl=%scan(&list, &i);
proc means data=have max noprint;
var &varl;
by famid implicate;
where indid=1;
OUTPUT OUT=copy max=max_&varl;
run;
data want;
set have;
drop &varl;
run;
data want (drop=_TYPE_ _FREQ_);
merge want copy;
by famid implicate;
rename max_&varl=&varl;
run;
%end;
%mend;
%copyv1(&implist_1);
This works well for one or two variables. However it is tremendously slow once you do it for 400 variables in a data-set with the size of 1.5 GB.
I'm pretty sure there is a faster way to do this with some form of proc sql or first.var etc., but I'm relatively new to SAS and so far I couldn't come up with a better solution.
Thank you very much for your support.
Best regards
Yes, this can be done in a DATA step using a first. reference, which the BY statement makes available.
data want;
set have (keep=famid indid implicate imp_inc /* other vars */);
by famid indid implicate; /* implicate is in the BY list so that the step logs a run-time error if the data are not sorted */
if first.famid then if indid ne 1 then abort;
array across imp_inc /* other vars */;
array hold [1,5] _temporary_; /* or [<n>,5] where <n> means the number of variables in the across array */
if indid = 1 then do; /* hold data for 1st individuals implicate across data */
do _n_ = 1 to dim(across);
hold[_n_,implicate] = across[_n_]; /* store info of each implicate of first individual */
end;
end;
else do;
do _n_ = 1 to dim(across);
across[_n_] = hold[_n_,implicate]; /* apply 1st persons info to subsequent persons */
end;
end;
run;
The DATA step could be significantly faster because it makes a single pass through the data. However, there is an internal processing cost for calculating all those pesky [] array addresses at run time, and that cost could become noticeable at some <n>.
SQL has simpler syntax, is easier to understand, and works if the have data set is unsorted or has some peculiar sequencing in the by group.
This is fairly straightforward with a bit of SQL:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have a
left join (
select * from have
group by famid
having indid = min(indid)
) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
The idea is to join the table to a subset of itself containing only the rows corresponding to the first individual within each family.
It is set up to pick the lowest numbered individual within each family, so it will work even if there is no row with indid = 1. If you are sure that there will always be such a row, you can use a slightly simpler query:
proc sql;
create table want as
select a.famid, a.indid, a.implicate, b.* from
have(sortedby = famid) a
left join have(where = (indid = 1)) b
on
a.famid = b.famid
and a.implicate = b.implicate
order by a.famid, a.indid, a.implicate
;
quit;
Specifying sortedby = famid provides a hint to the query optimiser that it can skip one of the initial sorts required for the join, which may improve performance a bit.
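Another option, not taken from the answers above, is a DATA step hash lookup. It avoids both the sort requirement and the join. The sketch below assumes every family has rows with indid = 1; the names want_hash and h are illustrative. To handle more variables, list them in defineData and in call missing.

```sas
data want_hash;
  set have;
  if _n_ = 1 then do;
    /* load only the first individual's rows into memory */
    declare hash h(dataset: "have(where=(indid=1))");
    h.defineKey("famid", "implicate");
    h.defineData("imp_inc");
    h.defineDone();
  end;
  /* find() overwrites imp_inc with the first individual's value; */
  /* if no match exists, blank it out like the left join would    */
  if h.find() ne 0 then call missing(imp_inc);
run;
```

The lookup table holds only the indid = 1 rows, so memory use depends on the number of families and implicates rather than on the full 1.5 GB data set.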

How do you replace missing even values in a variable having 1 to 100 values?

The dataset looks like this:
variable
1
.
3
.
5
.
7
.
9
How do you replace the missing even values with the correct ones?
The resulting data should appear as:
1
2
3
4
5
6
7
8
9
Do you mean that the data looks like this?
var
---
1
.
3
.
etc
And you want the ones in between to be one more than the one before? If so:
data one;
input var;
datalines;
1
.
3
.
5
;
run;
data two (drop=prev_var);
set one;
retain prev_var;
if missing(var) then do;
var = prev_var + 1;
end;
prev_var=var;
run;
proc print data = two noobs; run;