Flattening a file while preserving duplicate contents - sas

I want to flatten a file to consolidate the variable contents for any occurrence of an ID into one record. Consider the example below...
I have:
ID Date Color Letter
1012 01/23 Red X
1012 10/17 Blu F
1012 07/28 Red N
1012 04/09 Ylw G
1392 04/12 Ylw P
1392 03/11 Blu A
1001 03/11 Blu E
I want:
ID Date1 Date2 Date3 Date4 Clr1 Clr2 Clr3 Clr4 Ltr1 Ltr2 Ltr3 Ltr4
1012 01/23 10/17 07/28 04/09 Red Blu Red Ylw X F N G
1392 04/12 03/11 . . Ylw Blu P A
1001 03/11 . . . Blu E
What is an efficient way to do this?

This works well if you have 100 or fewer obs per group (ID). It transposes both character and numeric variables at the same time. If you want to preserve the original order of ID, you can add the PROC statement option ORDER=DATA.
data tall;
input (ID Date Color Letter)($);
cards;
1012 01/23 Red X
1012 10/17 Blu F
1012 07/28 Red N
1012 04/09 Ylw G
1392 04/12 Ylw P
1392 03/11 Blu A
1001 03/11 Blu E
;;;;
run;
proc sql noprint;
select max(obs) into :obs
from (select count(*) as obs from tall group by id);
quit;
%put NOTE: &=obs;
proc summary data=tall nway;
class id;
output out=wide(drop=_: id_:) idgroup(out[&obs](_all_)=);
run;
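As a minimal sketch of the ORDER=DATA variation mentioned above: this assumes the same seven rows and hard-codes the largest group size (4) instead of computing it. The data set names tall_demo and wide_in_order are made up for illustration.

```sas
/* Same sample rows as above; tall_demo is a hypothetical name */
data tall_demo;
  input (ID Date Color Letter) ($);
  cards;
1012 01/23 Red X
1012 10/17 Blu F
1012 07/28 Red N
1012 04/09 Ylw G
1392 04/12 Ylw P
1392 03/11 Blu A
1001 03/11 Blu E
;;;;
run;

/* ORDER=DATA keeps the class levels in order of first appearance */
/* out[4] is hard-coded here: the largest ID group has 4 rows     */
proc summary data=tall_demo nway order=data;
  class id;
  output out=wide_in_order(drop=_: id_:) idgroup(out[4](_all_)=);
run;

proc print data=wide_in_order;
run;
```

With ORDER=DATA the rows should come out in the order the IDs first appear in the input rather than in sorted order.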

I currently do this by transposing every variable (in a macro when there are more than a few), then merging the resulting datasets (that just contain IDs and the transposed variable of choice) together.
Transposing:
%macro flattener(minids= , fix= , trnvar= );
proc transpose data=have out=&minids prefix=&fix;
by ID;
var &trnvar;
run;
%mend flattener;
%flattener(minids=datDS, fix=Date, trnvar=Date );
%flattener(minids=clrDS, fix=Clr , trnvar=Color );
%flattener(minids=ltrDS, fix=Ltr , trnvar=Letter);
Merging the result:
data ostudentflat;
merge datDS (drop=_NAME_ _LABEL_)
clrDS (drop=_NAME_ _LABEL_)
ltrDS (drop=_NAME_ _LABEL_);
by ID;
run;
I feel like there has to be an easier and faster way to do this, but it gets the job done.

Related

Populating a dataset depending on the values of a variable in another dataset

I have two data sets INPUT and OUTPUT.
data INPUT;
input
id 1-4
var1 $ 6-10
var2 $ 12-17
var3 $ 19-22
transformation $ 24-26
;
datalines;
1023 apple banana oats 1:1
1049 12 22 8 2x
1219 milk cream fish 1:1
;
run;
The OUTPUT dataset has a different structure. The variables do not have the same name.
data work.output;
attrib
variable_1 length=8 format=best12. label="Variable 1"
variable_2 length=$50 format=$50. label="Variable 2"
Variable_3 length=8 format=date9. label="Variable 3";
stop;
run;
OUTPUT will be filled with the values from input based on what is specified in column "transformation" in table INPUT: when "transformation" equals "1:1", I want to fill the OUTPUT ds with the values of the corresponding INPUT dataset. If this were a small excel, I would do copy & paste or a lookup.
For example, obs1 of dataset INPUT has transformation = 1:1, so I want to fill variable_1 of dataset OUTPUT with "apple", variable_2 with "banana" and variable_3 with "oats".
For the second observation of ds INPUT I want to multiply each variable with two and assign them to variable_1 - variable_3 respectively.
In my real dataset I have many more columns, so I need to automate this, probably via index, since the variable names do not correspond.
You probably need to code each transformation rule separately.
This works for your example. But you did not include any date transformations so variable3 is not used.
data INPUT;
input
id 1-4
var1 $ 6-10
var2 $ 12-17
var3 $ 19-22
transformation $ 24-26
;
datalines;
1023 apple banana oats 1:1
1049 12 22 8 2x
1219 milk cream fish 1:1
;
proc transpose data=input prefix=value out=step1;
by id transformation;
var var1-var3 ;
run;
data output;
set step1;
length variable1 8 variable2 $50 variable3 8;
format variable3 date9.;
if transformation='1:1' then variable2=value1;
if transformation='2x' then variable1 = 2*input(value1,32.);
run;
Result
Obs id transformation _NAME_ value1 variable1 variable2 variable3
1 1023 1:1 var1 apple . apple .
2 1023 1:1 var2 banana . banana .
3 1023 1:1 var3 oats . oats .
4 1049 2x var1 12 24 .
5 1049 2x var2 22 44 .
6 1049 2x var3 8 16 .
7 1219 1:1 var1 milk . milk .
8 1219 1:1 var2 cream . cream .
9 1219 1:1 var3 fish . fish .
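If you would rather keep one row per id than transpose, the same rules can be applied across columns by position with arrays. This is only a sketch: the target names num1-num3 and chr1-chr3 are hypothetical stand-ins for the real OUTPUT variables, the data is read with list input under the made-up name input_demo, and it assumes every source column is character, as in the example.

```sas
/* Sample rows from the question, read with list input */
data input_demo;
  input id var1 :$5. var2 :$6. var3 :$4. transformation :$3.;
  datalines;
1023 apple banana oats 1:1
1049 12 22 8 2x
1219 milk cream fish 1:1
;
run;

data output_wide;
  set input_demo;
  array src {*} var1-var3;        /* source columns, addressed by position */
  array num {3} num1-num3;        /* hypothetical numeric targets          */
  array chr {3} $50 chr1-chr3;    /* hypothetical character targets        */
  do i = 1 to dim(src);
    select (transformation);
      when ('1:1') chr{i} = src{i};                  /* copy as is       */
      when ('2x')  num{i} = 2 * input(src{i}, 32.);  /* double the value */
      otherwise put 'NOTE: unknown rule ' transformation= id=;
    end;
  end;
  drop i;
run;
```

Each new transformation rule still needs its own WHEN branch, but the branch is written once and applies to every column in the arrays.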

SGPLOT Two Y Axes

I have a dataset that looks like the following:
pt_fin Admit_Type MONTH_YEAR BED_ORDERED_TO_DISPO (minutes)
1 Acute Jan 214
2 Acute Jan 628
3 ICU Jan 300
4 ICU Feb 99
I already have code (see below) that produces a plot with an x axis (admit type grouped by month) and a y axis (median bed to dispo time), but I want to add a secondary Y axis that counts the number of patients used to compute each respective median.
For example, I want a secondary Y axis data point for each month and admit type, so Jan would have two separate counts: 1) the patients admitted to acute and 2) the patients admitted to ICU.
proc sgplot data=Combined;
title "Median Bed Order To Dispo By Month, Admit Location";
vbar MONTH_YEAR / response=BED_ORDERED_TO_DISPO stat=median
group = Admit_Type groupdisplay=cluster ;
run;
I've been trying to adapt what I've found here but the plots my code produces are super messy and incorrect.
https://blogs.sas.com/content/iml/2019/01/14/align-y-y2-axes-sgplot.html
Desired output(pretend X's and *'s, respectively, are connected in a line graph corresponding to the Y axis):
| * |
m | | | X | | #
e | x | | * |
d | | | | | |
|-------------------------------|
Acute ICU Acute ICU
Jan FEb
Code which I've tried that produces rubbish:
proc sgplot data=Combined;
vbarbasic MONTH_YEAR/ response=Bed_Order_Hour y2axis; /*needs to be on y axis 1*/
group = Admit_Type
series x=MONTH_YEAR y=Pt_fin/ markers; *Pt_fin needs to be on y axis 2*/
run;
Your visualization explanation is weak. You might want to use two plotting statements in your SGPLOT, VBAR and VLINE.
data have;
do type = 'Acute', 'ICU';
do month = '01jan2018'd to '31dec2018'd;
do _n_ = 1 to floor (50 * ranuni(123));
patid + 1;
minutes = 10 + floor(1000 * ranuni(123));
output;
end;
month = intnx ('month', month, 0, 'e');
end;
end;
format month monname3.;
run;
ods html5 file="plot.html" path="c:\temp";
proc sgplot data=have;
title "Median of patient minutes by month";
vbar month / group=type groupdisplay=cluster response=minutes stat=median;
vline month / group=type groupdisplay=cluster response=minutes stat=freq y2axis ;
run;
ods html5 close;
The vline gives the viewer a secondary focus on the frequency behind each median. The same information could instead be communicated as an aspect of the bars themselves by varying the vbar intensity: the bars with the highest freq would get the strongest shade and the bars with lower freq would be faded.

Automatically replace outlying values with missing values

Suppose the data set have contains various outliers which have been identified in an outliers data set. These outliers need to be replaced with missing values, as demonstrated below.
Have
Obs group replicate height weight bp cholesterol
1 1 A 0.406 0.887 0.262 0.683
2 1 B 0.656 0.700 0.083 0.836
3 1 C 0.645 0.711 0.349 0.383
4 1 D 0.115 0.266 666.000 0.015
5 2 A 0.607 0.247 0.644 0.915
6 2 B 0.172 333.000 555.000 0.924
7 2 C 0.680 0.417 0.269 0.499
8 2 D 0.787 0.260 0.610 0.142
9 3 A 0.406 0.099 0.263 111.000
10 3 B 0.981 444.000 0.971 0.894
11 3 C 0.436 0.502 0.563 0.580
12 3 D 0.814 0.959 0.829 0.245
13 4 A 0.488 0.273 0.463 0.784
14 4 B 0.141 0.117 0.674 0.103
15 4 C 0.152 0.935 0.250 0.800
16 4 D 222.000 0.247 0.778 0.941
Want
Obs group replicate height weight bp cholesterol
1 1 A 0.4056 0.8870 0.2615 0.6827
2 1 B 0.6556 0.6995 0.0829 0.8356
3 1 C 0.6445 0.7110 0.3492 0.3826
4 1 D 0.1146 0.2655 . 0.0152
5 2 A 0.6072 0.2474 0.6444 0.9154
6 2 B 0.1720 . . 0.9241
7 2 C 0.6800 0.4166 0.2686 0.4992
8 2 D 0.7874 0.2595 0.6099 0.1418
9 3 A 0.4057 0.0988 0.2632 .
10 3 B 0.9805 . 0.9712 0.8937
11 3 C 0.4358 0.5023 0.5626 0.5799
12 3 D 0.8138 0.9588 0.8293 0.2448
13 4 A 0.4881 0.2731 0.4633 0.7839
14 4 B 0.1413 0.1166 0.6743 0.1032
15 4 C 0.1522 0.9351 0.2504 0.8003
16 4 D . 0.2465 0.7782 0.9412
The "get it done" approach is to manually enter each variable/value combination in a conditional which replaces with missing when true.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
;
run;
data outliers;
input parameter :$11. group replicate $ measurement;
datalines;
cholesterol 3 A 111
height 4 D 222
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
EDIT: Updated outliers so that parameter avoids truncation and changed measurement to be numeric type so as to match the corresponding height, weight, bp, cholesterol. This shouldn't change the responses.
data want;
set have;
if group = 3 and replicate = 'A' and cholesterol = 111 then cholesterol = .;
if group = 4 and replicate = 'D' and height = 222 then height = .;
if group = 2 and replicate = 'B' and weight = 333 then weight = .;
if group = 3 and replicate = 'B' and weight = 444 then weight = .;
if group = 2 and replicate = 'B' and bp = 555 then bp = .;
if group = 1 and replicate = 'D' and bp = 666 then bp = .;
run;
This, however, doesn't utilize the outliers data set. How can the replacement process be made automatic?
I immediately think of the IN= operator, but that won't work. It's not the entire row which needs to be matched. Perhaps an SQL key matching approach would work? But to match the key, don't I need to use a where statement? I'd then effectively be writing everything out manually again. I could probably create macro variables which contain the various if or where statements, but that seems excessive.
I don't think generating statements is excessive in this case. The complexity arises here because your outlier dataset cannot be merged easily since the parameter values represent variable names in the have dataset. If it is possible to reorient the outliers dataset so you have a 1 to 1 merge, this logic would be simpler.
Let's assume you cannot. There are a few ways to use a variable in a dataset that corresponds to a variable in another.
You could use an array like array params{*} height -- cholesterol; and then use the vname function as you loop through the array to compare to the value in the parameter variable, but this gets complicated in your case because you have a one to many merge, so you would have to retain the replacements and only output the last record for each by group... so it gets complicated.
You could transpose the outliers data using proc transpose, but that will get lengthy because you will need a transpose for each parameter, and then you'd need to merge all the transposed datasets back to the have dataset. My main issue with this method is that code with a bunch of transposes like that gets unwieldy.
You could create the macro variable logic you are thinking might be excessive. But compared to the other ways of getting the values of the parameter variable to match up with the variable names in the have dataset, I don't think something like this is excessive:
data _null_;
set outliers;
/* cats() keeps blanks out of the macro variable name; put()/strip() avoid implicit conversions */
call symputx(cats("outlierstatement",_n_),"if group = "||strip(put(group,best32.))||" and replicate = '"||strip(replicate)||"' and "||strip(parameter)||" = "||strip(put(measurement,best32.))||" then "||strip(parameter)||" = .;");
call symputx("outliercount",_n_);
run;
%macro makewant();
data want;
set have;
%do i = 1 %to &outliercount;
&&outlierstatement&i;
%end;
run;
%mend;
%makewant()
Lorem:
Transposition is the key to a fully automatic programmatic approach. The transposition that will occur is of the filter data, not the original data. The transposed filter data will have fewer rows than the original. As John indicated, transposition of the want data can create a very tall table and has to be transposed back after applying the filters.
As to the filter data, the presence of a filter row for a specific group, replicate and parameter should be enough to mark a cell for filtering. This is on the presumption that you have a system for automatic outlier detection and the filter values will always be in concordance with the original values.
So, what has to be done to automate the filter application process without code generating a wall of test and assign statements?
Transpose filter data into same form as want data, call it Filter^
Merge Want and Filter^ by record key (which is the by group of Group and Replicate)
Array process the data elements, looking for filtering conditions.
For your consideration, try the following SAS code. There is an erroneous filter record added to the mix.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
5 E 222 0.2465 0.7782 0.9412 /* test record for filter value misalignment test */
;
run;
data outliers;
length parameter $32; %* <--- widened parameter so it can be transposed into column via id;
input parameter $ group replicate $ measurement ; %* <--- changed measurement to numeric variable;
datalines;
cholesterol 3 A 111
height 4 D 222
height 5 E 223 /* test record for filter value misalignment test */
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
data want;
set have;
if group = 3 and replicate = 'A' and cholesterol = 111 then cholesterol = .;
if group = 4 and replicate = 'D' and height = 222 then height = .;
if group = 2 and replicate = 'B' and weight = 333 then weight = .;
if group = 3 and replicate = 'B' and weight = 444 then weight = .;
if group = 2 and replicate = 'B' and bp = 555 then bp = .;
if group = 1 and replicate = 'D' and bp = 666 then bp = .;
run;
/* Create a view with 1st row having all the filtered parameters
* This is necessary so that the first transposed filter row
* will have the parameters as columns in alphabetic order;
*/
proc sql noprint;
create view outliers_transpose_ready as
select distinct parameter from outliers
union
select * from outliers
order by group, replicate, parameter
;
/* Generate an alphabetically ordered list of parameters for use
* as a variable (aka column) list in the filter application step */
select distinct parameter
into :parameters separated by ' '
from outliers
order by parameter
;
quit;
%put NOTE: &=parameters;
/* transpose the filter data
* The ID statement pivots row data into column names.
* The prefix=_filter_ ensures the new column names
* will not collide with the original data, and can be
* the shortcut listed with _filter_: in an array statement.
*/
proc transpose data=outliers_transpose_ready out=outliers_apply_ready prefix=_filter_;
by group replicate notsorted;
id parameter;
var measurement;
run;
/* Robust production code should contain a bin for
* data that does not conform to the filter application conditions
*/
data
want2(label="Outlier filtering applied" drop=_i_ _filter_:)
want2_warnings(label="Outlier filtering: misaligned values")
;
merge have outliers_apply_ready(keep=group replicate _filter_:);
by group replicate;
/* The arrays are for like named columns
* due to the alphabetic ordering enforced in data and codegen preparation
*/
array value_filter_check _filter_:;
array value &parameters;
if group ne .;
do _i_ = 1 to dim(value);
if value(_i_) EQ value_filter_check(_i_) then
value(_i_) = .;
else
if not missing(value_filter_check(_i_)) AND
value(_i_) NE value_filter_check(_i_)
then do;
put 'WARNING: Filtering expected but values do not match. ' group= replicate= value(_i_)= value_filter_check(_i_)=;
output want2_warnings;
end;
end;
output want2;
run;
Confirm your want and automated want2 agree.
proc compare noprint data=want compare=want2 outnoequal out=diffs;
by group replicate;
run;
Enjoy your SAS
You could use a hash table. Load a hash table with the outlier dataset, with parameter-group-replicate defined as the key. Then read in the data, and as you read each record, check each of the variables to see if that combination of parameter-group-replicate can be found in the hash table. I think below works (I'm no hash expert):
data want;
if 0 then set outliers (keep=parameter group replicate);
if _N_ = 1 then
do;
declare hash h(dataset:'outliers') ;
h.defineKey('parameter', 'group', 'replicate') ;
h.defineDone() ;
end;
set have ;
array vars {*} height weight bp cholesterol ;
do i=1 to dim(vars);
parameter=vname(vars{i});
if h.check()=0 then call missing(vars{i});
end;
drop i parameter;
run;
I like @John's suggestion:
You could use an array like array params{*} height -- cholesterol; and
then use the vname function as you loop through the array to compare
to the value in the parameter variable, but this gets complicated in
your case because you have a one to many merge, so you would have to
retain the replacements and only output the last record for each by
group... so it gets complicated.
Generally in a one to many merge I would avoid recoding variables from the dataset that is unique, because variables are retained within BY groups. But in this case, it works out well.
proc sort data=outliers;
by group replicate;
run;
data want (keep=group replicate height weight bp cholesterol);
merge have (in=a)
outliers (keep=group replicate parameter in=b)
;
by group replicate;
array vars {*} height weight bp cholesterol ;
do i=1 to dim(vars);
if vname(vars{i})=parameter then call missing(vars{i});
end;
if last.replicate;
run;
Thank you @John for providing a proof of concept. My implementation is a little different and I think worth making a separate entry for posterity. I went with a macro variable approach because I feel it is the most intuitive, being a simple text replacement. However, since a macro variable can contain only 65534 characters, it is conceivable that there could be sufficient outliers to exceed this limit. In such a case, any of the other solutions would make fine alternatives. Note that it is important that the PUT function use something like best32. Too short a width will truncate the value.
If you desire to have a dataset containing the if statements (perhaps for verification), simply remove the into : clause and place a "create table statements as" line before the select in the PROC SQL step.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
;
run;
data outliers;
input parameter :$11. group replicate $ measurement;
datalines;
cholesterol 3 A 111
height 4 D 222
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
proc sql noprint;
select
cat('if group = '
, strip(put(group, best32.))
, " and replicate = '"
, strip(replicate)
, "' and "
, strip(parameter)
, ' = '
, strip(put(measurement, best32.))
, ' then '
, strip(parameter)
, ' = . ;')
into : listIfs separated by ' '
from outliers
;
quit;
%put %quote(&listIfs);
data want;
set have;
&listIfs;
run;
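The verification variant described above might look roughly like the sketch below. It is an assumption-laden illustration rather than the original code: the table name statements and column name if_statement are made up, and catx() is swapped in for cat()/strip() to keep it short.

```sas
/* Same outliers rows as above */
data outliers;
  input parameter :$11. group replicate $ measurement;
  datalines;
cholesterol 3 A 111
height 4 D 222
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;

/* One generated IF statement per outlier row, stored in a table */
/* instead of a macro variable so it can be inspected first      */
proc sql;
  create table statements as
  select catx(' ',
              'if group =', put(group, best32.),
              'and replicate =', quote(strip(replicate)),
              'and', parameter, '=', put(measurement, best32.),
              'then', parameter, '= . ;') as if_statement length=200
  from outliers
  ;
quit;

proc print data=statements;
run;
```

Storing the statements in a table also sidesteps the macro variable length limit, since the rows could later be written to a file and pulled in with %include.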

How to run ttest in SAS with selected groups as data set?

I have a group of numbers, each labeled by a group letter, like
Group | x | y
A 135 12
B 281 32
C 221 2
A 201 4
B 294 4
C 950 ... etc
I am trying to run ttest on it, but ONLY on groups with prefix A or C
I cannot use "data = " statement.
So far I have
proc ttest where group = 'A', 'C'
var x y;
run;
But this doesn't work. Any help?
Here you go:
proc ttest data=dataname;
where Group="A" OR Group="C";
var x y;
run;
You can use OR but then you need to list the variable each time:
Where Group = 'A' OR Group = 'B';
Or you can use IN
Where Group in ('A', 'B');
Here's a worked example. Check the results of the check_where table. And look at the different results for the t-test, specifically the different p-values and N to show that you're using different data. Good Luck.
data have;
input Group $ x y;
cards;
A 135 12
B 281 32
C 221 2
A 201 4
B 294 4
C 950 8
;
run;
data check_where;
set have;
where group='A' or group='C';
run;
proc ttest data=have;
where group = 'A' or group = 'C';
var x y;
run;
proc ttest data=have;
where group in ('A', 'B');
var x y;
run;
/* no DATA= option: uses the most recently created data set, here check_where */
proc ttest;
where group = 'A' or group = 'C';
var x y;
run;

SAS Transpose and Summarize entries

I have a data set that has quantitative values by time and type. I wish to summarize these in a different way, so that I can see the percentage breakdown of type by time. Here is the data set:
data have;
input loc $ prod $ time $ type $ total;
cards;
L1 P1 1 xxx 10
L1 P1 1 yyy 30
L1 P1 1 yyy 60
L1 P2 1 xxx 20
L1 P1 2 xxx 25
L1 P2 2 yyy 60
;
run;
I want to end up with something like this:
loc prod type time1 time2
L1 P1 xxx .1 1
L1 P1 yyy .9 0
L1 P2 xxx 1 0
L1 P2 yyy 0 1
I imagine this will require an array of some sort, but am having trouble sorting out how to get the syntax right. I also thought maybe proc report may work, but not sure. I will need the output to be a dataset.
Thanks for help.
Pyll
I will put what I think you want in an answer.
proc sort data=have;
by loc prod;
run;
proc freq data=have noprint;
by loc prod;
tables time*type /out=statout outpct;
weight total;
run;
proc sort data=statout;
by loc prod type;
run;
proc transpose data=statout out=statout2 prefix=time;
by loc prod type;
id time;
var pct_row;
run;
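To get from the transposed percentages to the layout in the question (proportions, with 0 where a type never occurs at a time), one more DATA step could be added. The sketch below repeats the whole pipeline so it runs on its own. It assumes prefix=time on the transpose so the columns are named time1 and time2, and the final data set name want is made up.

```sas
data have;
  input loc $ prod $ time $ type $ total;
  cards;
L1 P1 1 xxx 10
L1 P1 1 yyy 30
L1 P1 1 yyy 60
L1 P2 1 xxx 20
L1 P1 2 xxx 25
L1 P2 2 yyy 60
;
run;

proc sort data=have;
  by loc prod;
run;

/* Row percentages of type within each time, weighted by total */
proc freq data=have noprint;
  by loc prod;
  tables time*type / out=statout outpct;
  weight total;
run;

proc sort data=statout;
  by loc prod type;
run;

/* prefix=time turns the time values 1 and 2 into time1 and time2 */
proc transpose data=statout out=statout2 prefix=time;
  by loc prod type;
  id time;
  var pct_row;
run;

/* Percent -> proportion; missing combinations become 0 */
data want;
  set statout2;
  array t {*} time:;
  do i = 1 to dim(t);
    t{i} = coalesce(t{i}, 0) / 100;
  end;
  drop i _name_ _label_;
run;
```

An alternative would be to keep the percentages as they are and only replace the missing values, depending on how the numbers will be used downstream.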