Compare averages of 2 columns - sas

For example, i have a data set like this (the value a1 a2 a3 b1 b2 b3 are numeric):
A B
a1 b1
a2 b2
a3 b3
I want to compare the average of 2 class A and B using proc ttest. But it seems that i have to change my data set in order to use this proc. I read lots of tutorials about the proc ttest and all of them use the data sets in this form below:
class value
A a1
A a2
A a3
B b1
B b2
B b3
So my question is: Does it exist a method to do the proc ttest without changing my data set?
Thank you and sorry for my bad english :D

The short answer is no, you can't run a ttest in SAS that compares multiple columns. proc ttest, when used for 2 samples, relies on the variable in the class statement to compare the groups. Only one variable can be entered and it must have 2 levels, therefore the structure of your data is not compatible with this.
You will therefore need to change the data layout, although you could do this in a view so that you don't create a new physical dataset. Here's one way to do that.
/* create dummy data */
data have;
input A B;
datalines;
10 11
15 14
20 21
25 24
;
run;
/* create a view that turns vars A and B into a single variable */
data have_trans / view=have_trans;
set have;
array vals{2} A B;
length grouping $2;
do i = 1 to 2;
grouping = vname(vals{i}); /* extracts the current variable name (A or B) */
value = vals{i}; /* extracts the current value */
output;
end;
drop A B i; /* drop unwanted variables */
run;
/* perform ttest */
proc ttest data=have_trans;
class grouping;
var value;
run;

Related

Naming columns using values from data table

Let's say I have two columns A and B.
A B
12 "randstr"
39 "randstr"
2 "randstr"
This random string is repeated in each row.
I'm interested in how I can get the table below:
randstr B
12 "randstr"
39 "randstr"
2 "randstr"
The value in the column B was used to rename column A. I have tried using rename and all sorts of macro magic but failed. I have no idea how to proceed.
I've tried the answers below and they just don't allow for reading the value from the data and then using the value as a column name:
https://communities.sas.com/t5/General-SAS-Programming/dates-used-as-column-names/td-p/168803
https://stats.idre.ucla.edu/sas/code/a-few-sas-macro-programs-for-renaming-variables-dynamically/
SAS - Dynamically create column names using the values from another column
Renaming Column with Dynamic Name
The transformation could also be seen as a row-wise transposition.
data have;
attrib A length=8 B length=$32;
row+1;
input
A & B; datalines;
12 xyz-123-abc
39 xyz-123-abc
2 xyz-123-abc
run;
proc transpose data=have out=want(drop=row _name_);
by row;
var A;
id B;
copy B;
run;
In non-toy scenarios the B column is often not a single value. Try the same transpose with data having variation in B. The procedure will create two new columns from the values of B.
A & B; datalines;
12 xyz-123-abc
39 xyz-123-abc
2 xyz-123-abc
3141 xyz-456-def
Using this macro, it's fairly straightforward:
/* get first value in the dataset */
%let new_col=%mf_getvalue(work.YOURDATA,B);
/* rename variable A */
proc datasets library=work nolist;
modify YOURDATA;
rename A=%sysfunc(dequote(&new_col));
quit;

Combining columns in SAS

I just started using SAS and I'm trying to combine columns.
I've got table mainData
A1 A2 A3 A4
1 4 7 10
2 5 8 11
3 6 9 12
I want to create a new table rearrangedData
Type Value
A1 1
A1 2
A1 3
A2 4
A2 5
A2 6
A3 7
A3 8
A3 9
A4 10
A4 11
A4 12
There must be a simple solution to this I just can't figure this out. I'm thinking of writing do loop, but what if I don't know size of a table or amount of lines in a specific column. I can't figure how I would get such information in SAS.
This somewhat unusual transformation can be done via a transpose and some array logic:
data have;
input A1 A2 A3 A4;
cards;
1 4 7 10
2 5 8 11
3 6 9 12
;
run;
proc transpose data = have out = tr name=type prefix = r;
run;
data want;
set tr;
array r{*} r:;
do i = 1 to dim(r);
value = r[i];
output;
end;
drop i r:;
run;
Also, this preserves the original order without requiring a sort.
Make a dummy variable, then transpose data.
data have;
set have;
id=_n_;
run;
proc transpose data=have out=temp;
by id;
var A1-A4;
run;
proc sort data=temp out=want(rename=(_name_=type col1=value) drop=id);
by _name_;
run;
If you want to preserve the original order then you could use the POINT= option on the SET statement to loop over the data set once per variable (column).
So this data set will read the first observations just to get the variables defined. Then define the array VALUES so that we can use DIM(VALUES) to know how many columns. Then it uses the POINT= and NOBS= options on the SET statement to control the other loop. It uses the VNAME() function to find the name of the current variable in the array.
data want ;
set have ;
array values _numeric_;
do col=1 to dim(values);
length type $32 value 8;
type=vname(values(col));
do row=1 to nobs ;
set have point=row nobs=nobs ;
value=values(col);
output;
keep type value;
end;
end;
stop;
run;

modify multiple observations in a by variable

data a1
col1 col2 flag
a 2 .
b 3 .
a 4 .
c 1 .
For data a1, flag is always missing. I want to update multiple rows using a2.
data a2
col1 flag
a 1
Ideal output:
col1 col2 flag
a 2 1
b 3 .
a 4 1
c 1 .
But this doesn't update all the records in by statement.
data a1;
modify a1 a2;
by col1;
run;
Question edited
Actually a1 is a very large data set on server. Hence I prefer to modify it (if possible) instead of creating a new one. Otherwise I have to drop previous a1 first and copy a new a1 from local to server, which will take much more time.
If you want to do this with MODIFY, you have to loop over the modify dataset in some fashion or it will only replace the first row (because the other dataset will then run out of records - normally this behaves like merge, where once it finds a match it advances to next record). Here's one option - there are others.
data a1(index=(col1));
input col1 $ col2 flag;
datalines;
a 2 .
b 3 .
a 4 .
c 1 .
;;;;
run;
data a2(index=(col1));
col1='a';
flag=1;
run;
data a1;
set a2(rename=flag=flag2);
do _n_ = 1 to nobs_a1;
modify a1 key=col1 nobs=nobs_a1;
if _iorc_=0 then do;
flag=flag2;
replace;
end;
end;
if _iorc_=%sysrc(_DSENOM) then _error_=0;
run;
If you're not using Merge statement for the sorting problem, you can simply change your merging approach.
If flag in A1 is always missing, you can drop it, otherwise you should temporary rename it for not losing those informations.
Here I will merge A1 and A2 using hash objects, this approach doesn't require any prior sorting on datasets.
data final_merged(drop = finder);
length flag 8.; /*please change length with the real one, use $ if char*/
if _N_ = 1 then do;
declare hash merger(dataset:'A2');
merger.definekey('col1');
merger.DefineData ('flag');
merger.definedone();
end;
set A1(drop=flag);
finder = merger.find();
if finder ne 0 then flag = .;
/*then flag='' or then flag='unknown' as you want if flag is a character var*/
run;
Please, let me know if this will help.
You could do the following but SQL sorts the observations so not sure how useful this would be for you? (you could always preprocess with ordvar=_n_; and then sort the SQL statement on it if that helps):
Data:
data a1 ;
input col1 $ col2 flag ;
cards ;
a 2 .
b 3 .
a 4 .
c 1 .
;run ;
data a2 ;
input col1 $ flag ;
cards ;
a 1
;run ;
Merge:
proc sql ;
create table output as
select a.col1, a.col2, b.flag
from a1 a
left join
a2 b
on a.col1=b.col1
;quit ;
To try and do it in one pass, how about creating two macros variables containing the mapping from a2?
proc sql ;
select distinct col1, flag
into :colvals separated by '', :flagvals separated by ''
from a2
;quit ;
Set flag to the corresponding character position between the two macro variables:
data a1 ;
set a1 ;
if findc("&colvals",col1) then
flag=input(substr("&flagvals", findc("&colvals",col1),1),8.) ;
run ;

SAS merge datasets to overwrite values

I want to use dataset B to overwrite some values in dataset A by merging dataset A & B with a merging ID. However it doesn't work as expected. Here is the test I did:
/* create table A */
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ var1 var2;
datalines;
1 20 30
2 20 30
;
run;
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a b;
by id1;
run;
Table C looks like this:
ID1 ID2 VAR1 VAR2
1 a 20 30
1 b 10 30
2 a 20 30
2 b 10 30
Why the 10s in row 2&4 didn't get replaced by 20 from table B? While var2 works as expected?
I know I can do this simply using proc SQL, and that's what I did to solve the problem. But I still quite curious if there is a way to do what I wanted using merge? And why this wasn't working? I prefer merge over SQL in this circumstance because the logic is easier to implement (util I found this not working properly).
I use SAS 9.4.
This has to do with how SAS iterates over the data sets during the merge. Basically, the second record for each of A doesn't get lined up with a record from B. The value of VAR2 is carried over from the previous record. VAR1 gets its value from A (because there is no B).
IF there is record in B for EVERY ID1, then you can rewrite your merge like this to achieve what you want.
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a(drop=var1) b;
by id1;
run;
This drops the VAR1 from A so that it is carried down from the record in B.
Otherwise you will need more complex logic (might I suggest an SQL left join with the coalesce() function?).
Like DomPazz suggests, proc sql is the way to do this. merge will only keep one value from each data set. The coalesce function pick the first non-missing value from the list, so it uses var1 from b, but if b.var1 is null then it uses a.var1.
proc sql;
create table c as
select
a.id1,
a.id2,
coalesce(b.var1,a.var1) as var1,
b.var2
from
a
left join b
on a.id1 = b.id1
;
quit;
The merge method could still work fine, you would just need to be more explicit about how to choose the 'best' value for var1, such as:
data c (drop = a_var1 b_var1);
merge a(rename=(var1 = a_var1))
b(rename=(var1 = b_var1));
by id1;
* Now you have two different variables named a_var1 and b_var1;
* Implement logic to choose your favorite;
if NOT MISSING(b_var1) Then DO;
var1 = b_var1;
var1_source='B';
END;
else DO;
var1 = a_var1;
var1_source='A';
END;
run;
If your criteria for which 'var1' to choose is as simple as 'If b exists, use it' then this is identical to the the SQL method with coalesce().
Where I've found this method useful is for more complicated criteria, plus its always nice to know the source of the data (which doesn't happen with coalesce()).

how to beat computational complexity in call execute(catt(datastep))

in a datastep of this kind
ID VAR_1 VAR_2 VAR_3 ...
1 a1 b1 mv ...
2 a2 b2 mv ...
3 a3 b3 c3 ...
4 a4 mv mv ...
5 a5 b5 mv ...
6 a6 b6 mv ...
where the number of the variables are not known (i want to generalize as more as possible my code) I want to obtain a dataset like this (something like an inverted proc transpose):
ID VAR
1 a1
1 b1
2 a2
2 b2
3 a3
3 b3
3 c3
....
So i'm splitting the dataset in a nonfixed number of temp datasets, which one contains ID and only one column, trashing observation with missing values, then I'll merge all these temporary datasets obtaining my result. And this works.
But the call execute has a very high computational complexity, I mean, if I try to do this operation in a dataset with only one column (dropping missing values) my garbage computer takes 0.1 secs, while using a call execute in a dataset with 6 columns it won't take 0.1*6=0.6 secs, It will take some minutes. This because it won't work in column but in row, and this is SAS and I must get over it. But I'm asking myself (and now I'm asking to you) if there are some other ways for obtaining my results without this computational time. Here a focus on the code:
data _null_;
set old;
array try[*] VAR: ;
do i=1 to DIM(try);
call execute(catt("data var",i,"; set old; if var_",i," = ' ' then delete; allvarnew= col",i,"; ` `drop COL:; run;" ));
end;
run;
columns are char $1 (ID is char $4).
columns are the result of a proc transpose.
thanks.
I'm not sure of the efficiency of this, but it requires only one data-step as opposed to the multiple data-steps in the call execute approach described:
data new (drop=var_: i);
set test;
array try[*] VAR_: ;
do i=1 to DIM(try);
var=try[i]; output;
end;
run;