Is there a way to simply transpose data in SAS

I'm a newbie to SAS. I am trying to document the table structure of 50+ data sets, so I want to take the top 5 rows from each data set and print them to the console. However, since many of these data sets have many columns, I would like to transpose them. I tried to use proc transpose, but apparently it doesn't just flip the results and keeps dropping columns.
For example, the following code produces results with only MSGID and LINENO:
proc print data=sashelp.smemsg;
run;
proc transpose data=sashelp.smemsg out=work.test;
run;
proc print data=work.test;
run;
Update:
I think it didn't work because SAS doesn't know how to "normalize" the data types after the transformation. I would like to do something similar to this in R, where all numbers become strings.
> df <- data.frame(x=11:20, y=letters[1:10])
> df
x y
1 11 a
2 12 b
3 13 c
4 14 d
5 15 e
6 16 f
7 17 g
8 18 h
9 19 i
10 20 j
> t(df)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
y "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

To quickly look at the data in a SAS dataset, I normally just use a PUT statement and look at the log.
data _null_;
set have (obs=5);
put (_all_) (=/);
run;
If you just want to transpose the data, use PROC TRANSPOSE. You need to list the variables on a VAR statement, or you will get only the numeric ones.
proc transpose data=have (obs=5) out=want ;
var _all_ ;
run;
proc print data=want ;
run;
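Applied to the data set from the question, this might look like the sketch below. Since SASHELP.SMEMSG mixes character and numeric variables, PROC TRANSPOSE makes every transposed column character and converts the numeric values to text, which is close to what t() does in the R example.

```sas
/* Sketch: transpose the first 5 rows of the data set from the question. */
/* VAR _ALL_ keeps the character variables as well as the numeric ones.  */
proc transpose data=sashelp.smemsg(obs=5) out=work.test;
  var _all_;
run;

proc print data=work.test;
run;
```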

Here is roughly how to do it.
Generate sample dataset having 25 rows, 6 numeric vars and 6 string vars
data sample;
array num_col_(6);
array str_col_(6) $;
do row_number = 1 to 25;
do col_number = 1 to 6;
num_col_(col_number) = round(ranuni(0),.01);
str_col_(col_number) = byte(ceil(ranuni(0)*10)+97);
end;
output;
end;
drop row_number col_number;
run;
Transpose data, keeping only 5 first rows
proc transpose data=sample(obs=5) prefix=row
out=sample_tr(rename=(_name_=column));
var num_col_: str_col_:;
/* You could also use keywords on the var statement */
* var _character_ _numeric_; * Lets you decide which type to show first;
* var _all_; * Keeps original order of variables;
run;
Show the results
proc print data=sample_tr noobs;
id column;
var row1-row5;
run;
Results
column row1 row2 row3 row4 row5
--------- ---- ---- ---- ---- ----
num_col_1 0.66 0.96 0.85 0.45 0.32
num_col_2 0.78 0.79 0.64 0.85 0.74
num_col_3 0.23 0.62 0.46 0.46 0.51
num_col_4 0.91 0.15 0.16 0.77 0.13
num_col_5 0.6 0.48 0.32 0.6 0.77
num_col_6 0.13 0.76 0.67 0.16 0.67
str_col_1 c i i i c
str_col_2 j k f f c
str_col_3 e g k h i
str_col_4 b h d k e
str_col_5 c h f e f
str_col_6 i b k i f

Related

Is it possible to get specified values in the stat variable of proc means output?

I am working on generating a summary dataset. I need only the values of N, MIN, MEDIAN, MAX and STD. It would be convenient to get the statistics in a single variable, stat. But if I use proc means without specifying the statistics on the output statement, I just get the default statistics. Is there any way of doing this? This is what I tried.
PROC MEANS DATA=sashelp.class NWAY N MIN MAX MEDIAN STD;
CLASS name;
VAR height weight;
OUTPUT OUT=output (DROP=_type_ _freq_ RENAME=(_stat_=stat)) ;
RUN;
It only shows the default stats.
I know I can specify the statistics on the output statement, but I want the output shaped like that of the code I have provided.
Thanks in advance for helping.
No.
Generate them in "wide" format and then transpose to "tall" if you want.
PROC SUMMARY DATA=sashelp.class NWAY N MIN MAX MEDIAN STD;
CLASS name;
VAR height weight;
OUTPUT OUT=wide n= min= max= median= std= /autoname ;
RUN;
proc transpose data=wide(drop=_type_ _freq_) out=tall ;
by name ;
run;
data tall ;
set tall ;
length Stat $32 ;
stat = scan(_name_,-1,'_');
_name_=substr(_name_,1,length(_name_)-length(stat)-1);
run;
proc transpose data=tall out=want(drop=_name_);
by name stat notsorted;
id _name_;
var col1 ;
run;
Results:
Obs Name Stat Height Weight
1 Alfred N 1.0 1.0
2 Alfred Min 69.0 112.5
3 Alfred Max 69.0 112.5
4 Alfred Median 69.0 112.5
5 Alfred StdDev . .
6 Alice N 1.0 1.0
7 Alice Min 56.5 84.0
8 Alice Max 56.5 84.0
9 Alice Median 56.5 84.0
10 Alice StdDev . .
...

Make the output of proc tabulate vertical?

My SAS code is as follows:
DATA CLASS;
INPUT NAME $ SEX $ AGE HEIGHT WEIGHT;
CARDS;
ALFRED M 14 69.0 112.5
ALICE F 13 56.5 84.0
BARBARA F 13 65.3 98.0
CAROL F 14 62.8 102.5
HENRY M 14 63.5 102.5
RUN;
PROC PRINT;
TITLE 'DATA';
RUN;
proc print data=CLASS;run;
proc tabulate data=CLASS;
var AGE HEIGHT WEIGHT;
table (AGE HEIGHT WEIGHT)*(MEAN STD MEDIAN Q1 Q3 MIN MAX n NMISS);
title 'summary';
run;
The output is one wide horizontal table.
How can I make the output list in the vertical direction?
A TABLE statement without a comma (,) is specifying only a column expression.
Use a comma in your table statement
table <row-expression> , <column-expression> ;
Example:
DATA CLASS;
INPUT NAME $ SEX $ AGE HEIGHT WEIGHT;
CARDS;
ALFRED M 14 69.0 112.5
ALICE F 13 56.5 84.0
BARBARA F 13 65.3 98.0
CAROL F 14 62.8 102.5
HENRY M 14 63.5 102.5
;
ods html file='tabulate.html' style=plateau;
TITLE 'DATA';
proc print data=CLASS;
run;
proc tabulate data=CLASS;
var AGE HEIGHT WEIGHT;
table (AGE HEIGHT WEIGHT)*(MEAN STD MEDIAN Q1 Q3 MIN MAX n NMISS);
* comma being used;
table (AGE HEIGHT WEIGHT),(MEAN STD MEDIAN Q1 Q3 MIN MAX n NMISS);
* comma being used, swapping row and column expressions;
table (MEAN STD MEDIAN Q1 Q3 MIN MAX n NMISS),(AGE HEIGHT WEIGHT);
title 'summary';
run;

How does SAS proc stdize method=range work?

How does PROC STDIZE METHOD = RANGE work?
I thought that it would work like this:
Score = (Observation - Min) / ( Max - Min)
However, the range is [1,100] and there is never a 0, which is what you would get when subtracting the minimum observation from itself in the numerator.
I've tried reading the SAS documentation and running some trials in an Excel workbook.
PROC STDIZE
DATA = SASHELP.BASEBALL
METHOD = RANGE
OUT = BASEBALL_STDIZE
;
VAR CRHITS;
RUN;
range [0,100] expected, range [1,100] found
Obs _TYPE_ crhit2
1 LOCATION 34
2 SCALE 4222
3 ADD 0
4 MULT 1
5 N 322
6 NObsRead 322
7 NObsUsed 322
8 NObsMiss 0
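A sketch for checking the arithmetic, assuming the table above is OUTSTAT= output. PROC STDIZE documents the standardized value as ADD + MULT * (original - LOCATION) / SCALE, and with METHOD=RANGE the location is the minimum and the scale is the range. The listed values (LOCATION 34, SCALE 4222, ADD 0, MULT 1) would therefore give (CRHITS - 34) / 4222, which runs from 0 to 1. The names stats, check, lo, rng, stdized, manual and diff are made up for illustration.

```sas
/* Let PROC STDIZE standardize CrHits and save the statistics it used */
proc stdize data=sashelp.baseball method=range
            out=baseball_stdize outstat=stats;
  var crhits;
run;

/* Fetch the minimum and the range for a manual recomputation */
proc sql noprint;
  select min(crhits), max(crhits) - min(crhits)
    into :lo, :rng
    from sashelp.baseball;
quit;

/* One-to-one merge by observation order: OUT= keeps the input order */
data check;
  merge sashelp.baseball(keep=crhits)
        baseball_stdize(keep=crhits rename=(crhits=stdized));
  manual = (crhits - &lo) / &rng;   /* (x - min) / (max - min) */
  diff   = manual - stdized;        /* should be 0 up to rounding */
run;

proc means data=check min max;
  var stdized manual diff;
run;
```

If a 0 to 100 scale is wanted, the MULT= option on the PROC STDIZE statement (for example MULT=100) scales the result accordingly.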

Reading horizontal data then transposing it. Looking for a more elegant solution

I have this horizontal data:
Placebo 0.90 0.37 1.63 0.83 0.95 0.78 0.86 0.61 0.38 1.97
Alcohol 1.46 1.45 1.76 1.44 1.11 3.07 0.98 1.27 2.56 1.32
But I want it to be vertical:
Placebo Alcohol
0.90 1.46
0.37 1.45
... ...
I successfully read and transpose the data this way, but I'm searching for a more elegant solution that does the same thing without creating 2 unnecessary datasets:
data female;
input cost_female :comma. @@;
datalines;
871 684 795 838 1,033 917 1,047 723 1,179 707 817 846 975 868 1,323 791 1,157 932 1,089 770
;
data male;
input cost_male :comma. @@;
datalines;
792 765 511 520 618 447 548 720 899 788 927 657 851 702 918 528 884 702 839 878
;
data repair_costs;
merge female male;
run;
You can use proc transpose to do the same.
data have;
input medicine :$7. a1-a10;
datalines;
Placebo 0.90 0.37 1.63 0.83 0.95 0.78 0.86 0.61 0.38 1.97
Alcohol 1.46 1.45 1.76 1.44 1.11 3.07 0.98 1.27 2.56 1.32
;
run;
proc transpose data=have out=want(drop=_name_);
id medicine;
var a1-a10;
run;
Let me know in case of any doubts.
For arbitrarily wide input data you will have to use binary mode input, which is specified with RECFM=N.
This sample code creates a wide data file in transposed form. Thus the data file has one row per final dataset column and one column per final dataset row.
The code presumes CRLF line termination and tests for it explicitly. The input data set is reshaped using a single Proc TRANSPOSE.
filename flipflop 'c:\temp\rowdata-across.txt';
%let NUM_ROWS = 10000; * thus 10,000 columns of data in flipflop;
%let NUM_COLS = 30;
* simulate input data where row data is across a line of arbitrary length (that means > 32K);
* recfm=n means binary mode output, hence no LRECL limit;
data _null_;
file flipflop recfm=n;
do colindex = 1 to &NUM_COLS;
put 'column' +(-1) colindex @; * first column of output data is column name;
do rowindex=1 to &NUM_ROWS;
value = (rowindex-1) * 10 ** floor(log10(&NUM_COLS)) * 10 + colindex;
put value @; * data for rows goes across;
end;
put '0d0a'x;
end;
run;
* recfm=n means binary mode input, hence no LRECL limit;
* as filesize increases, binary mode will become slower than <32K line orientated input;
data flipflop(keep=id rowseq colseq value);
length id $32 value 8;
infile flipflop unbuffered recfm=n col=p;
colseq+1;
input id +(-1);
do rowseq=1 by 1;
input value;
output;
input test $char2.;
if test = '0d0a'x then leave;
input @+(-2);
end;
run;
proc sort data=flipflop;
by rowseq colseq;
run;
proc transpose data=flipflop out=want(drop=_name_ rowseq);
by rowseq;
id id;
var value;
run;
There might be a way to speed up reading larger (say, a file with dataline width > 32k) files in binary mode, but I have not investigated such.
Other variations could utilize a hash object, however, the entire data set would have to fit in memory.

Automatically replace outlying values with missing values

Suppose the data set have contains various outliers which have been identified in an outliers data set. These outliers need to be replaced with missing values, as demonstrated below.
Have
Obs group replicate height weight bp cholesterol
1 1 A 0.406 0.887 0.262 0.683
2 1 B 0.656 0.700 0.083 0.836
3 1 C 0.645 0.711 0.349 0.383
4 1 D 0.115 0.266 666.000 0.015
5 2 A 0.607 0.247 0.644 0.915
6 2 B 0.172 333.000 555.000 0.924
7 2 C 0.680 0.417 0.269 0.499
8 2 D 0.787 0.260 0.610 0.142
9 3 A 0.406 0.099 0.263 111.000
10 3 B 0.981 444.000 0.971 0.894
11 3 C 0.436 0.502 0.563 0.580
12 3 D 0.814 0.959 0.829 0.245
13 4 A 0.488 0.273 0.463 0.784
14 4 B 0.141 0.117 0.674 0.103
15 4 C 0.152 0.935 0.250 0.800
16 4 D 222.000 0.247 0.778 0.941
Want
Obs group replicate height weight bp cholesterol
1 1 A 0.4056 0.8870 0.2615 0.6827
2 1 B 0.6556 0.6995 0.0829 0.8356
3 1 C 0.6445 0.7110 0.3492 0.3826
4 1 D 0.1146 0.2655 . 0.0152
5 2 A 0.6072 0.2474 0.6444 0.9154
6 2 B 0.1720 . . 0.9241
7 2 C 0.6800 0.4166 0.2686 0.4992
8 2 D 0.7874 0.2595 0.6099 0.1418
9 3 A 0.4057 0.0988 0.2632 .
10 3 B 0.9805 . 0.9712 0.8937
11 3 C 0.4358 0.5023 0.5626 0.5799
12 3 D 0.8138 0.9588 0.8293 0.2448
13 4 A 0.4881 0.2731 0.4633 0.7839
14 4 B 0.1413 0.1166 0.6743 0.1032
15 4 C 0.1522 0.9351 0.2504 0.8003
16 4 D . 0.2465 0.7782 0.9412
The "get it done" approach is to manually enter each variable/value combination in a conditional which replaces with missing when true.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
;
run;
data outliers;
input parameter :$11. group replicate $ measurement;
datalines;
cholesterol 3 A 111
height 4 D 222
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
EDIT: Updated outliers so that parameter avoids truncation and changed measurement to be numeric type so as to match the corresponding height, weight, bp, cholesterol. This shouldn't change the responses.
data want;
set have;
if group = 3 and replicate = 'A' and cholesterol = 111 then cholesterol = .;
if group = 4 and replicate = 'D' and height = 222 then height = .;
if group = 2 and replicate = 'B' and weight = 333 then weight = .;
if group = 3 and replicate = 'B' and weight = 444 then weight = .;
if group = 2 and replicate = 'B' and bp = 555 then bp = .;
if group = 1 and replicate = 'D' and bp = 666 then bp = .;
run;
This, however, doesn't utilize the outliers data set. How can the replacement process be made automatic?
I immediately think of the IN= operator, but that won't work. It's not the entire row which needs to be matched. Perhaps an SQL key matching approach would work? But to match the key, don't I need to use a where statement? I'd then effectively be writing everything out manually again. I could probably create macro variables which contain the various if or where statements, but that seems excessive.
I don't think generating statements is excessive in this case. The complexity arises here because your outlier dataset cannot be merged easily since the parameter values represent variable names in the have dataset. If it is possible to reorient the outliers dataset so you have a 1 to 1 merge, this logic would be simpler.
Let's assume you cannot. There are a few ways to use a variable in a dataset that corresponds to a variable in another.
You could use an array like array params{*} height -- cholesterol; and then use the vname function as you loop through the array to compare to the value in the parameter variable, but this gets complicated in your case because you have a one to many merge, so you would have to retain the replacements and only output the last record for each by group... so it gets complicated.
You could transpose the outliers data using proc transpose, but that will get lengthy because you will need a transpose for each parameter, and then you'd need to merge all the transposed datasets back to the have dataset. My main issue with this method is that code with a bunch of transposes like that gets unwieldy.
You could create the macro variable logic that you think might be excessive. But compared to the other ways of getting the values of the parameter variable to match up with the variable names in the have dataset, I don't think something like this is excessive:
data _null_;
set outliers;
/* cats() builds the macro variable name without the blanks that concatenating _n_ directly would insert */
call symputx(cats("outlierstatement",_n_),"if group = "||group||" and replicate = '"||replicate||"' and "||parameter||" = "||measurement||" then "|| parameter ||" = .;");
call symputx("outliercount",_n_);
run;
%macro makewant();
data want;
set have;
%do i = 1 %to &outliercount;
&&outlierstatement&i;
%end;
run;
%mend;
/* run the macro to create want */
%makewant()
Lorem:
Transposition is the key to a fully automatic programmatic approach. The transposition that will occur is of the filter data, not the original data. The transposed filter data will have fewer rows than the original. As John indicated, transposition of the want data can create a very tall table and has to be transposed back after applying the filters.
As to the filter data, the presence of a filter row for a specific group, replicate and parameter should be enough to mark a cell for filtering. This is on the presumption that you have a system for automatic outlier detection and the filter values will always be in concordance with the original values.
So, what has to be done to automate the filter application process without code-generating a wall of test-and-assign statements?
Transpose filter data into same form as want data, call it Filter^
Merge Want and Filter^ by record key (which is the by group of Group and Replicate)
Array process the data elements, looking for filtering conditions.
For your consideration, try the following SAS code. There is an erroneous filter record added to the mix.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
5 E 222 0.2465 0.7782 0.9412 /* test record for filter value misalignment test */
;
run;
data outliers;
length parameter $32; %* <--- widened parameter so it can be transposed into a column via id;
input parameter $ group replicate $ measurement ; %* <--- changed measurement to numeric variable;
datalines;
cholesterol 3 A 111
height 4 D 222
height 5 E 223 /* test record for filter value misalignment test */
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
data want;
set have;
if group = 3 and replicate = 'A' and cholesterol = 111 then cholesterol = .;
if group = 4 and replicate = 'D' and height = 222 then height = .;
if group = 2 and replicate = 'B' and weight = 333 then weight = .;
if group = 3 and replicate = 'B' and weight = 444 then weight = .;
if group = 2 and replicate = 'B' and bp = 555 then bp = .;
if group = 1 and replicate = 'D' and bp = 666 then bp = .;
run;
/* Create a view with 1st row having all the filtered parameters
* This is necessary so that the first transposed filter row
* will have the parameters as columns in alphabetic order;
*/
proc sql noprint;
create view outliers_transpose_ready as
select distinct parameter from outliers
union
select * from outliers
order by group, replicate, parameter
;
/* Generate an alphabetically ordered list of parameters for use
* as a variable (aka column) list in the filter application step */
select distinct parameter
into :parameters separated by ' '
from outliers
order by parameter
;
quit;
%put NOTE: &=parameters;
/* transpose the filter data
* The ID statement pivots row data into column names.
* The prefix=_filter_ ensures the new column names
* will not collide with the original data, and can be
* listed with the shortcut _filter_: in an array statement.
*/
proc transpose data=outliers_transpose_ready out=outliers_apply_ready prefix=_filter_;
by group replicate notsorted;
id parameter;
var measurement;
run;
/* Robust production code should contain a bin for
* data that does not conform to the filter application conditions
*/
data
want2(label="Outlier filtering applied" drop=_i_ _filter_:)
want2_warnings(label="Outlier filtering: misaligned values")
;
merge have outliers_apply_ready(keep=group replicate _filter_:);
by group replicate;
/* The arrays are for like named columns
* due to the alphabetic ordering enforced in data and codegen preparation
*/
array value_filter_check _filter_:;
array value &parameters;
if group ne .;
do _i_ = 1 to dim(value);
if value(_i_) EQ value_filter_check(_i_) then
value(_i_) = .;
else
if not missing(value_filter_check(_i_)) AND
value(_i_) NE value_filter_check(_i_)
then do;
put 'WARNING: Filtering expected but values do not match. ' group= replicate= value(_i_)= value_filter_check(_i_)=;
output want2_warnings;
end;
end;
output want2;
run;
Confirm your want and automated want2 agree.
proc compare noprint data=want compare=want2 outnoequal out=diffs;
by group replicate;
run;
Enjoy your SAS
You could use a hash table. Load a hash table with the outlier dataset, with parameter-group-replicate defined as the key. Then read in the data, and as you read each record, check each of the variables to see if that combination of parameter-group-replicate can be found in the hash table. I think below works (I'm no hash expert):
data want;
if 0 then set outliers (keep=parameter group replicate);
if _N_ = 1 then
do;
declare hash h(dataset:'outliers') ;
h.defineKey('parameter', 'group', 'replicate') ;
h.defineDone() ;
end;
set have ;
array vars {*} height weight bp cholesterol ;
do i=1 to dim(vars);
parameter=vname(vars{i});
if h.check()=0 then call missing(vars{i});
end;
drop i parameter;
run;
I like @John's suggestion:
You could use an array like array params{*} height -- cholesterol; and
then use the vname function as you loop through the array to compare
to the value in the parameter variable, but this gets complicated in
your case because you have a one to many merge, so you would have to
retain the replacements and only output the last record for each by
group... so it gets complicated.
Generally in a one to many merge I would avoid recoding variables from the dataset that is unique, because variables are retained within BY groups. But in this case, it works out well.
proc sort data=outliers;
by group replicate;
run;
data want (keep=group replicate height weight bp cholesterol);
merge have (in=a)
outliers (keep=group replicate parameter in=b)
;
by group replicate;
array vars {*} height weight bp cholesterol ;
do i=1 to dim(vars);
if vname(vars{i})=parameter then call missing(vars{i});
end;
if last.replicate;
run;
Thank you @John for providing a proof of concept. My implementation is a little different and I think worth a separate entry for posterity. I went with a macro variable approach because I feel it is the most intuitive, being a simple text replacement. However, since a macro variable can hold at most 65534 characters, enough outliers could conceivably exceed this limit. In such a case, any of the other solutions would make a fine alternative. Note that it is important that the put function use a wide format such as best32.; too short a width will truncate the value.
If you want a dataset containing the if statements (perhaps for verification), remove the into : clause and place a create table ... as line at the beginning of the query in the PROC SQL step.
data have;
input group replicate $ height weight bp cholesterol;
datalines;
1 A 0.4056 0.8870 0.2615 0.6827
1 B 0.6556 0.6995 0.0829 0.8356
1 C 0.6445 0.7110 0.3492 0.3826
1 D 0.1146 0.2655 666 0.0152
2 A 0.6072 0.2474 0.6444 0.9154
2 B 0.1720 333 555 0.9241
2 C 0.6800 0.4166 0.2686 0.4992
2 D 0.7874 0.2595 0.6099 0.1418
3 A 0.4057 0.0988 0.2632 111
3 B 0.9805 444 0.9712 0.8937
3 C 0.4358 0.5023 0.5626 0.5799
3 D 0.8138 0.9588 0.8293 0.2448
4 A 0.4881 0.2731 0.4633 0.7839
4 B 0.1413 0.1166 0.6743 0.1032
4 C 0.1522 0.9351 0.2504 0.8003
4 D 222 0.2465 0.7782 0.9412
;
run;
data outliers;
input parameter :$11. group replicate $ measurement;
datalines;
cholesterol 3 A 111
height 4 D 222
weight 2 B 333
weight 3 B 444
bp 2 B 555
bp 1 D 666
;
run;
proc sql noprint;
select
cat('if group = '
, strip(put(group, best32.))
, " and replicate = '"
, strip(replicate)
, "' and "
, strip(parameter)
, ' = '
, strip(put(measurement, best32.))
, ' then '
, strip(parameter)
, ' = . ;')
into : listIfs separated by ' '
from outliers
;
quit;
%put %quote(&listIfs);
data want;
set have;
&listIfs;
run;
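The dataset variant mentioned above (keeping the generated if statements in a table instead of a macro variable) could be sketched as follows. The table name listIfs_ds and the column name if_statement are hypothetical, chosen only for this illustration.

```sas
/* Sketch: store the generated IF statements in a table for inspection */
proc sql;
  create table listIfs_ds as
  select
    cat('if group = '
      , strip(put(group, best32.))
      , " and replicate = '"
      , strip(replicate)
      , "' and "
      , strip(parameter)
      , ' = '
      , strip(put(measurement, best32.))
      , ' then '
      , strip(parameter)
      , ' = . ;') as if_statement length=200
  from outliers
  ;
quit;

/* One row per outlier, each holding one complete IF statement */
proc print data=listIfs_ds noobs;
run;
```

Reviewing this table before running the data step is a cheap way to spot a truncated parameter name or a badly formatted measurement.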