(I find it hard to give a good descriptive title, so I'll just ask by means of an example.)
I have a data set like this:
|ID | A1 A2 A3 | B1 B2 B3 | C1 C2 C3 |
+---+----------+----------+----------+
| 1 | a aa aaa| b bb bbb| c cc ccc|
| 2 | (... some values, etc ...)
What I want to do is, given an "ID", make a table output with the values A1,A2,etc for that ID, something like this:
| | A's | B's | C's |
+---+-----+-----+-----+
| 1 | a | b | c |
| 2 | aa | bb | cc |
| 3 | aaa | bbb | ccc |
So, to recap: I want to pick a row, and output a table with certain variables displayed in columns. I've tried to wrap my mind around how proc tabulate works, but haven't managed to wrangle it into giving me what I want; it may be I'm barking up the wrong tree. Is there a way to do this?
I don't need this to return a data table, just some screen output.
You can reshape the data by creating a transposing view that operates on the three arrays in parallel. Proc REPORT or PRINT can then be used to generate the presentation output.
Sample Data
data have;
do id = 1 to 10;
array a a1-a3;
array b b1-b3;
array c c1-c3;
do i = 1 to dim(a);
a(i) = 10 ** i + id;
b(i) = 2 * 10 ** i + id;
c(i) = 3 * 10 ** i + id;
end;
output;
keep id a: b: c:;
end;
run;
Transposing view
data have_v / view = have_v;
set have;
array as a1-a3;
array bs b1-b3;
array cs c1-c3;
do seq = 1 to dim(as);
a = as(seq);
b = bs(seq);
c = cs(seq);
output;
end;
keep id seq a b c;
run;
Output with where clause. BY statement used to show id value in output.
proc report data=have_v;
by id;
where id = 3;
column id seq a b c;
define id / display noprint;
run;
You could use VIEWTABLE and issue a WHERE command if you don't want to produce output.
If each row encompasses an arbitrary number of 'arrays' (say a to z) of arbitrary but equal length (say 1 to 15), you would want to write a macro that performs some meta-data examination of the data set in question. The examination would attempt to discover the array 'names' and number of elements in each. This say would need to discover and output 15 rows by 26 columns for a given id.
Sounds like something that on old style data _null_ report could produce.
data _null_;
set have ;
where id=1 ;
array a a1-a3 ;
array b b1-b3 ;
array c c1-c3 ;
file print;
put #10 'A' #20 'B' #30 'C'
/ #10 8*'-' #20 8*'-' #30 8*'-'
;
do i=1 to dim(a);
put i 8. #10 a(i) #20 b(i) #30 c(i) ;
end;
run;
Results
A B C
-------- -------- --------
1 a b c
2 aa bb cc
3 aaa bbb ccc
Related
I have table in SAS Enterprise Guide like below:
COL1 | COL2 | COL3
-----|-------|------
111 | A | C
111 | B | C
222 | A | D
333 | A | D
And I need to aggregate abve table to know how many each value in columns occured, so as to have something like below:
COL2_A | COL2_B | COL3_C | COL3_D
--------|--------|--------|--------
3 | 1 | 2 | 2
Because:
COL_2A = 3, because in COL2 value "A" exists 3 times
and so on...
How can I do that in SAS Enterprise Guide or in PROC SQL ?
I need the output as SAS dataset
Try this
data have;
input COL1 COL2 $ COL3 $;
datalines;
111 A C
111 B C
222 A D
333 A D
;
data long;
set have;
array col COL2 COL3;
do over col;
c = col;
n = cats(vname(col), '_', c);
output;
end;
run;
proc summary data = long nway;
class n;
output out = freq(drop = _TYPE_);
run;
proc transpose data = freq out = wide_freq(drop = _:);
id n;
run;
I have some data that is structured as below. I need to create a table with subtotals, a total column that's TypeA + TypeB and a header that spans the columns as a table title. Also, it would be ideal to show different names in the column headings rather than the variable name from the dataset.
I cobbled together some preliminary code to get the subtotals and total, but not the rest.
data tabletest;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
;
run;
Code that I wrote:
proc report data=tabletest nofs headline headskip;
column referral_total referral_source TypeA TypeB;
define referral_total / group ;
define referral_source / group;
define TypeA / sum ' ';
define TypeB / sum ' ';
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
run;
This is the desired output:
The DEFINE statement has an option NOPRINT that causes the column to not be rendered, however, the variables for it are still available (in a left to right manner) for use in a compute block.
Stacking in the column statement allows you to customize the column headers and spans. In a compute block for non-group columns, the Proc REPORT data vector only allows access to the aggregate values at the detail or total line, so you need to specify .
This sample code shows how the _total column is hidden and the _source cells in the sub- and report- total lines are 'injected' with the hidden _total value. The _source variable has to be lengthened to accommodate the longer values that are in the _total variable.
data tabletest;
* ensure referral_source big enough to accommodate _total || ' TOTAL';
length referral_total $50 referral_source $60;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
run;
proc report data=tabletest;
column
( 'Table 1 - Stacking gives you custom headers and hierarchies'
referral_total
referral_source
TypeA TypeB
TypeTotal
);
define referral_total / group noprint; * hide this column;
define referral_source / group;
define TypeA / sum 'Freq(A)'; * field labels are column headers;
define TypeB / sum 'Freq(B)';
define TypeTotal / computed 'Freq(ALL)'; * specify custom computation;
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
/*
* no thanks, doing this in the _source compute block instead;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
*/
compute referral_source;
* the referral_total value is available because it is left of me. It just happens to be invisible;
* at the break lines override the value that appears in the _source cell, effectively 'moving it over';
select (_break_);
when ('referral_total') referral_source = catx(' ', referral_total, 'Total');
when ('_RBREAK_') referral_source = 'Total';
otherwise;
end;
endcomp;
compute TypeTotal;
* .sum is needed because the left of me are groups and only aggregate values available here;
TypeTotal = Sum(TypeA.sum,TypeB.sum);
endcomp;
run;
I have two datasets, one for male and one for female, which contain identical variables. I need to find the percent difference between the sexes on each variable by group.
The datasets look something like this, but with more variables and groups,
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | F | 8 | 5 |
| 2 | F | 6 | 3 |
| 3 | F | 7 | 0 |
|-------+-----+------+------|
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | M | 9 | 7 |
| 2 | M | 8 | 5 |
| 3 | M | 6 | 3 |
|-------+-----+------+------|
The result I need is this:
| Group | percent_diffA | percent_diffB |
|-------+---------------+---------------|
| 1 | -0.117647059 | -0.333333333 |
| 2 | -0.285714286 | -0.5 |
| 3 | 0.153846154 | -2 |
|-------+---------------+---------------|
I could solve this via a merge by renaming each variable.
data difference;
merge
females (rename = (VarA = VarA_F VarB = VarB_F)
males (rename = (VarA = VarA_M VarB = VarB_M)
;
by group;
percent_diffA = (VarA_F - VarA_M) / ( (VarA_F + VarA_M) / 2 );
percent_diffB = (VarB_F - VarB_M) / ( (VarB_F + VarB_M) / 2 );
drop sex;
run;
However, this approach requires me to rename everything manually. With several variables, the rename statement becomes cumbersome. Unfortunately, this calculation is being interjected into some old code, so renaming the original datasets is not practical.
I'm wondering if there is another way to solve this problem which is less cumbersome.
EDIT: I have updated the variable names because that appears to have caused people confusion. They were originally called Var1 and Var2. They are now VarA and VarB. The real variable names are descriptive, for instance body_weight_g or gonadal_somatic_index. The variables are not simply listed with sequential numbers.
For a data set that contains variables that are sequentially numbered there is variable list syntax for renaming the whole range of variables:
This example creates sample that has 100 variables.
data have1 have2;
do group = 1 to 100;
sex = 'M';
array var(100);
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 42 then output have1;
sex = 'F';
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 100-42 then output have2;
end;
run;
The rename option works on all 100 variables.
data want;
merge
have1(rename=var1-var100=mvar1-mvar100 in=_M)
have2(rename=var1-var100=fvar1-fvar100 in=_F)
;
by group;
if _M & _F & first.group & last.group then do;
array one mvar1-mvar100;
array two fvar1-fvar100;
array results result1-result100;
do i = 1 to dim(results);
diff = one(i) - two(i);
mean = mean (one(i), two(i));
results(i) = diff / mean * 100;
end;
end;
keep group result:;
run;
Shenglin's answer is a nice and concise use of SQL.
An alternative method is constructing a macro variable specifying the renames to be used in the rename DSO (data set option). This can be done with an SQL query to the dictionary table containing the column names.
* This macro creates the macro variable rename_suffix, to be used in a rename statement or data set option ;
* It will be of form: var1 = var1_suffix var2 = var2_suffix ... ;
* &inset is the input set. &suffix is the suffix to added to all variables except for the variables specified in &keys. ;
* &keys variables should be given each in quotation marks, and separated by spaces. ;
%macro rename_list(inset, suffix, keys) ;
%global rename_&inset ; * So that this macro variable is accessable outside the macro ;
proc sql ;
select strip(name) || ' = ' || strip(name) || "_&suffix"
into :rename_&inset separated by ' '
from sashelp.vcolumn /* dictionary.columns can be used in place of sashelp.vcolumn */
where libname = 'WORK' & memname = "%sysfunc(upcase(&inset))"
& upcase(strip(name)) not in (' ' %sysfunc(upcase(&keys))); * The ' ' is included, so there is no error if no keys are given ;
quit ;
%mend rename_list ;
%rename_list(females, F, 'GROUP' 'SEX')
%rename_list(males , M, 'GROUP' 'SEX')
%put &rename_females ; * Check that the macro variables are correct ;
%put &rename_males ;
%macro pct_diff(num) ;
percent_diff&num = (Var&num._F - Var&num._M) / ( (Var&num._F + Var&num._M) / 2 ) ;
%mend pct_diff ;
data difference ;
merge females(rename = (&rename_females), drop = sex)
males (rename = (&rename_males ), drop = sex) ;
by group ;
pct_diff(1) ;
pct_diff(2) ;
run ;
dm 'vt difference';
The percent_diff variable creation can also be shortened with a macro (as shown). If you had a large and/or variable number of variables to compare, then you could further shorten it by automatically detecting the number of comparisons, by running the same SQL query with the select into part modified to be
select count(name) into :varct trimmed
to count the number of variables, and then use a do loop in the data step:
do i = 1 to &varct ;
%pct_diff(i) ;
end ;
Use table alias in proc sql to avoid name change:
proc sql;
select a.group,(a.var1-b.var1)/((a.var1+b.var1)/2) as percent_diff1,
(a.var2-b.var2)/((a.var2+b.var2)/2) as percent_diff2
from female as a,male as b
where a.group=b.group;
quit;
I have a dataset with a lot of lines and I'm studying a group of variables.
For each line and each variable, I want to know if the value is equal to the max for this variable or more than or equal to 10.
Expected output (with input as all variables without _B) :
(you can replace T/F by TRUE/FALSE or 1/0 as you wish)
+----+------+--------+------+--------+------+--------+
| ID | Var1 | Var1_B | Var2 | Var2_B | Var3 | Var3_B |
+----+------+--------+------+--------+------+--------+
| A | 1 | F | 5 | F | 15 | T |
| B | 1 | F | 5 | F | 7 | F |
| C | 2 | T | 5 | F | 15 | T |
| D | 2 | T | 6 | T | 10 | T |
+----+------+--------+------+--------+------+--------+
Note that for Var3, the max is 15 but since 15>=10, any value >=10 will be counted as TRUE.
Here is what I've maid up so far (doubt it will be any help but still) :
%macro pleaseWorkLittleMacro(table, var, suffix);
proc means NOPRINT data=&table;
var &var;
output out=Varmax(drop=_TYPE_ _FREQ_) max=;
run;
proc transpose data=Varmax out=Varmax(rename=(COL1=varmax));
run;
data Varmax;
set Varmax;
varmax = ifn(varmax<10, varmax, 10);
run; /* this outputs the max for every column, but how to use it afterward ? */
%mend;
%pleaseWorkLittleMacro(MY_TABLE, VAR1 VAR2 VAR3 VAR4, _B);
I have the code in R, works like a charm but I really have to translate it to SAS :
#in a for loop over variable names, db is my data.frame, x is the
#current variable name and x2 is the new variable name
x.max = max(db[[x]], na.rm=T)
x.max = ifelse(x.max<10, x.max, 10)
db[[x2]] = (db[[x]] >= x.max) %>% mean(na.rm=T) %>% percent(2)
An old school sollution would be to read the data twice in one data step;
data expect ;
input ID $ Var1 Var1_B $ Var2 Var2_B $ Var3 Var3_B $ ;
cards;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
run;
data my_input;
set expect;
keep ID Var1 Var2 Var3 ;
proc print;
run;
It is a good habit to declare the most volatile things in your code as macro variables.;
%let varList = Var1 Var2 Var3;
%let markList = Var1_B Var2_B Var3_B;
%let varCount = 3;
Read the data twice;
data my_result;
set my_input (in=maximizing)
my_input (in=marking);
Decklare and Initialize arrays;
format &markList $1.;
array _vars [&&varCount] &varList;
array _maxs [&&varCount] _temporary_;
array _B [&&varCount] &markList;
if _N_ eq 1 then do _varNr = 1 to &varCount;
_maxs(_varNr) = -1E15;
end;
While reading the first time, Calculate the maxima;
if maximizing then do _varNr = 1 to &varCount;
if _vars(_varNr) gt _maxs(_varNr) then _maxs(_varNr) = _vars(_varNr);
end;
While reading the second time, mark upt to &maxMarks maxima;
if marking then do _varNr = 1 to &varCount;
if _vars(_varNr) eq _maxs(_varNr) or _vars(_varNr) ge 10
then _B(_varNr) = 'T';
else _B(_varNr) = 'F';
end;
Drop all variables starting with an underscore, i.e. all my working variables;
drop _:;
Only keep results when reading for the second time;
if marking;
run;
Check results;
proc print;
var ID Var1 Var1_B Var2 Var2_B Var3 Var3_B;
proc compare base=expect compare=my_result;
run;
This is quite simple to solve in sql
proc sql;
create table my_result as
select *
, Var1_B = (Var1 eq max_Var1)
, Var1_B = (Var2 eq max_Var2)
, Var1_B = (Var3 eq max_Var3)
from my_input
, (select max(Var1) as max_Var1
, max(Var2) as max_Var2
, max(Var3) as max_Var3)
;
quit;
(Not tested, as our SAS server is currently down, which is the reason I pass my time on Stack Overflow)
If you need that for a lot of variables, consult the system view VCOLUMN of SAS:
proc sql;
select ''|| name ||'_B = ('|| name ||' eq max_'|| name ||')'
, 'max('|| name ||') as max_'|| name
from sasHelp.vcolumn
where libName eq 'WORK'
and memName eq 'MY_RESULT'
and type eq 'num'
and upcase(name) like 'VAR%'
;
into : create_B separated by ', '
, : select_max separated by ', '
create table my_result as
select *, &create_B
, Var1_B = (Var1 eq max_Var1)
, Var1_B = (Var2 eq max_Var2)
, Var1_B = (Var3 eq max_Var3)
from my_input
, (select max(Var1) as max_Var1
, max(Var2) as max_Var2
, max(Var3) as max_Var3)
;
quit;
(Again not tested)
After Proc MEANS computes the maximum value for each column you can run a data step that combines the original data with the maximums.
data want;
length
ID $1 Var1 8 Var1_B $1. Var2 8 Var2_B $1. Var3 8 Var3_B $1. var4 8 var4_B $1;
input
ID Var1 Var1_B Var2 Var2_B Var3 Var3_B ; datalines;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
run;
data have;
set want;
drop var1_b var2_b var3_b var4_b;
run;
proc means NOPRINT data=have;
var var1-var4;
output out=Varmax(drop=_TYPE_ _FREQ_) max= / autoname;
run;
The neat thing the VAR statement is that you can easily list numerically suffixed variable names. The autoname option automatically appends _ to the names of the variables in the output.
Now combine the maxes with the original (have). The set varmax automatically retains the *_max variables, and they will not get overwritten by values from the original data because the varmax variable names are different.
Arrays are used to iterate over the values and apply the business logic of flagging a row as at max or above 10.
data want;
if _n_ = 1 then set varmax; * read maxes once from MEANS output;
set have;
array values var1-var4;
array maxes var1_max var2_max var3_max var4_max;
array flags $1 var1_b var2_b var3_b var4_b;
do i = 1 to dim(values); drop i;
flags(i) = ifc(min(10,maxes(i)) <= values(i),'T','F');
end;
run;
The difficult part above is that the MEANS output creates variables that can not be listed using the var1 - varN syntax.
When you adjust the naming convention to have all your conceptually grouped variable names end in numeric suffixes the code is simpler.
* number suffixed variable names;
* no autoname, group rename on output;
proc means NOPRINT data=have;
var var1-var4;
output out=Varmax(drop=_TYPE_ _FREQ_ rename=var1-var4=max_var1-max_var4) max= ;
run;
* all arrays simpler and use var1-varN;
data want;
if _n_ = 1 then set varmax;
set have;
array values var1-var4;
array maxes max_var1-max_var4;
array flags $1 flag_var1-flag_var4;
do i = 1 to dim(values); drop i;
flags(i) = ifc(min(10,maxes(i)) <= values(i),'T','F');
end;
run;
You can use macro code or arrays, but it might just be easier to transform your data into a tall variable/value structure.
So let's input your test data as an actual SAS dataset.
data expect ;
input ID $ Var1 Var1_B $ Var2 Var2_B $ Var3 Var3_B $ ;
cards;
A 1 F 5 F 15 T
B 1 F 5 F 7 F
C 2 T 5 F 15 T
D 2 T 6 T 10 T
;
First you can use PROC TRANSPOSE to make the tall structure.
proc transpose data=expect out=tall ;
by id ;
var var1-var3 ;
run;
Now your rules are easy to apply in PROC SQL step. You can derive a new name for the flag variable by appending a suffix to the original variable's name.
proc sql ;
create table want_tall as
select id
, cats(_name_,'_Flag') as new_name
, case when col1 >= min(max(col1),10) then 'T' else 'F' end as value
from tall
group by 2
order by 1,2
;
quit;
Then just flip it back to horizontal and merge with the original data.
proc transpose data=want_tall out=flags (drop=_name_);
by id;
id new_name ;
var value ;
run;
data want ;
merge expect flags;
by id;
run;
I am working with data that derives from an 'indicate all that apply' question. Two raters were asked to complete the question for a unique subject list. The data looks something like this.
ID| Rater|Q1A|Q1B|Q1C|Q1D
------------------------
1 | 1 | A | F | E | B
1 | 2 | E | G |
2 | 1 | D | C | A
2 | 2 | C | D | A
I want to compare the two raters' answers for each ID and determine whether answers for Q1A-Q1D are the same. I am not interested in the direct comparisons between each rater by ID for Q1A, Q1B, etc. individually. I want to know if all the values in Q1A-Q1D as a set are the same. (E.g., in the example data above, the raters for ID 2 would be identical). I am assuming I would do this with an array. Thanks.
Here is a similar solution also using call sortc, but rather using vectors and retain variables.
Create example dataset
data ratings;
infile datalines truncover;
input ID Rater (Q1A Q1B Q1C Q1D) ($);
datalines;
1 1 A F E B
1 2 E G
2 1 D C A
2 2 C D A
3 1 A B C
3 2 A B D
;
Do the comparison
data compare(keep=ID EQUAL);
set ratings;
by ID;
format PREV_1A PREV_Q1B PREV_Q1C PREV_Q1D $1.
EQUAL 1.;
retain PREV_:;
call sortc(of Q1:);
array Q(4) Q1:;
array PREV(4) PREV_:;
if first.ID then do;
do _i = 1 to 4;
PREV(_i) = Q(_i);
end;
end;
else do;
EQUAL = 1;
do _i = 1 to 4;
if Q(_i) NE PREV(_i) then EQUAL = 0;
end;
output;
end;
run;
Results
ID EQUAL
1 0
2 1
3 0
This looks like a job for call sortc:
data have;
infile cards missover;
input ID Rater (Q1A Q1B Q1C Q1D) ($);
cards;
1 1 A F E B
1 2 E G
2 1 D C A
2 2 C D A
3 1 A B C
3 2 A B D
;
run;
/*You can use an array if you like, but this works fine too*/
data temp /view = temp;
set have;
call sortc(of q:);
run;
data want;
set temp;
/*If you have more questions, extend the double-dash list to cover all of them*/
by ID Q1A--Q1D notsorted;
/*Replace Q1D with the name of the variable for the last question*/
IDENTICAL_RATERS = not(first.Q1D and last.Q1D);
run;
Sort, Concatenate, then compare.
data want ;
set ratings;
by id;
call sortc(of Q1A -- Q1D);
rating = cats(of Q1A -- Q1D);
retain rater1 rating1 ;
if first.id then rater1=rater;
if first.id then rating1=rating;
if not first.id ;
rater2 = rater ;
rating2 = rating;
match = rating1=rating2 ;
keep id rater1 rater2 rating1 rating2 match;
run;