Combining different variables into one new variable in SAS

I would like to combine different observations from different variables in one variable. The observations are all related to each other and are repeated measurements.
Example:
Current database layout.
**Var1 Dat1 Var2 Dat2 Var3 Dat3**
Obs1a Dat1a Obs2a Dat2a Obs3a Dat3a
Obs1b Dat1b Obs2b Dat2b Obs3b Dat3b
Obs1c Dat1c Obs2c Dat2c Obs3c Dat3c
Obs1d Dat1d Obs2d Dat2d Obs3d Dat3d
I want to create a new variable with combined observations:
Var Dat
Obs 1a Dat 1a
Obs 1b Dat 1b
.... ...
Obs 2a Dat 2a
Obs 2b Dat 2b
.... ...
Obs 3c Dat 3c
Obs 3d Dat 3d
Could somebody explain to me how to do this in SAS?

Create 2 arrays of the "var" and "dat" values. Loop through them and use the OUTPUT statement to create 1 observation for each value in the arrays.
data test;
input var1 $ dat1 $ var2 $ dat2 $ var3 $ dat3 $;
datalines;
Obs1a Dat1a Obs2a Dat2a Obs3a Dat3a
Obs1b Dat1b Obs2b Dat2b Obs3b Dat3b
Obs1c Dat1c Obs2c Dat2c Obs3c Dat3c
Obs1d Dat1d Obs2d Dat2d Obs3d Dat3d
;
run;
data test2(keep=var dat);
set test;
array v[3] var1-var3 ;
array d[3] dat1-dat3;
format var dat $8.;
do i=1 to 3;
var = v[i];
dat = d[i];
output;
end;
run;
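The step above writes the rows in input order (the Var1, Var2, Var3 values of the first row, then those of the second row, and so on), whereas the layout in the question lists all Var1 values first. A minimal sketch of one way to get that order, assuming it matters: keep the array index in a helper variable and sort by it. The names src and test2_long are illustrative and not part of the original answer.

```sas
/* Sketch: src is a hypothetical helper variable holding the array index. */
data test2_long(keep=src var dat);
  set test;
  array v[3] var1-var3;
  array d[3] dat1-dat3;
  length var dat $8;
  do src = 1 to 3;
    var = v[src];
    dat = d[src];
    output;
  end;
run;

/* Group the rows by source pair: all Var1/Dat1 rows, then Var2/Dat2, then Var3/Dat3. */
proc sort data=test2_long;
  by src;
run;
```

With its default behaviour PROC SORT keeps rows that share the same src value in their original relative order, so within each group the a, b, c, d sequence is preserved.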


PROC SQL equivalent of data step merge using IN

I have to merge two very large files, and I want to avoid doing this in a DATA step because that would mean sorting the data. I need all observations for all IDs from the left file, excluding IDs that are in the second file.
data leftdata;
input id $ y;
datalines;
AA 10
AA 20
BB 30
BB 40
CC 50
CC 80
DD 60
;run;
data rightdata;
input id $ ;
datalines;
AA
BB
;
run;
*Using datastep;
PROC SORT DATA=leftdata; BY id;
PROC SORT DATA=rightdata; BY id; RUN;
DATA datastep;
MERGE leftdata(IN=a) rightdata(IN=b);
BY id; IF a and b=0;
RUN;
How can the same be achieved using PROC SQL?
Final output must include the following observations:
CC 50
CC 80
DD 60
Here are two ways:
WHERE NOT IN (SELECT ...) filtering, or
LEFT JOIN where missing the right id.
Example:
data have;
input id $ y;
datalines;
AA 10
AA 20
BB 30
BB 40
CC 50
CC 80
DD 60
;
data excluded_ids;
input id $ ;
datalines;
AA
BB
;
proc sql;
create table want as
select * from have
where id not in (select id from excluded_ids)
;
create table want as
select have.* from have
left join excluded_ids as remove
on have.id = remove.id
where remove.id is null
;
quit;
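For the second way, a repeated id in the exclusion list is harmless: every matched row is removed by where remove.id is null, and each unmatched row of have appears exactly once, so no SELECT DISTINCT is needed.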
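A third SQL formulation, not part of the original answer, is a correlated NOT EXISTS subquery. The sketch below assumes the have and excluded_ids data sets from the example; the output name want_exists and the aliases h and e are illustrative.

```sas
/* Sketch: keep rows of have whose id has no match in excluded_ids. */
proc sql;
  create table want_exists as
  select h.*
  from have as h
  where not exists
    (select 1 from excluded_ids as e where e.id = h.id)
  ;
quit;
```

As with the NOT IN version, each row of have is tested once, so repeated ids in the exclusion list do not change the result.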
Data step
Use a hash object to store the exclusion list and check method to test for removal
Example:
data want;
set have;
if _n_ = 1 then do;
declare hash exclude(dataset:'excluded_ids');
exclude.defineKey('id');
exclude.defineDone();
end;
if exclude.check() = 0 then delete;
run;

How to transpose dataset more simply

I'd like to make a dataset like the one below. I got it working, but it's a long program.
I think it could be simpler. If you have a good idea, please give me some advice.
This is the data.
data test;
input ID $ NO DAT1 $ TIM1 $ DAT2 $ TIM2 $;
cards;
1 1 2020/8/4 8:30 2020/8/5 8:30
1 2 2020/8/18 8:30 2020/8/19 8:30
1 3 2020/9/1 8:30 2020/9/2 8:30
1 4 2020/9/15 8:30 2020/9/16 8:30
2 1 2020/8/4 8:34 2020/8/5 8:34
2 2 2020/8/18 8:34 2020/8/19 8:34
2 3 2020/9/1 8:34 2020/9/2 8:34
2 4 2020/9/15 8:34 2020/9/16 8:34
3 1 2020/8/4 8:46 2020/8/5 8:46
3 2 2020/8/18 8:46 2020/8/19 8:46
3 3 2020/9/1 8:46 2020/9/2 8:46
3 4 2020/9/15 8:46 2020/9/16 8:46
;
run;
This is my program.
data
t1(keep = ID A1 A2 A3 A4)
t2(keep = ID B1 B2 B3 B4)
t3(keep = ID C1 C2 C3 C4)
t4(keep = ID D1 D2 D3 D4);
set test;
if NO = 1 then do;
A1 = DAT1;
A2 = TIM1;
A3 = DAT2;
A4 = TIM2;
end;
*--- cut (NO = 2, 3, 4 are same as NO = 1)--- ;
end;
if NO = 1 then output t1;
if NO = 2 then output t2;
if NO = 3 then output t3;
if NO = 4 then output t4;
run;
proc sort data = t1;by ID; run;
proc sort data = t2;by ID; run;
proc sort data = t3;by ID; run;
proc sort data = t4;by ID; run;
data test2;
merge t1 t2 t3 t4;
by ID;
run;
Since the result looks like a report, use a reporting tool.
proc report data=test ;
column id no,(dat1 tim1 dat2 tim2 n) ;
define id / group width=5;
define no / across ' ' ;
define n / noprint;
run;
Tall to very wide data transformations are typically either
sketchy, because you put data into metadata (column names or labels) or lose referential context, or
a reporting layout for human consumption.
Presuming your "dataset like the below" is accurate and you do want to pivot your data in such a manner, here are some ways.
Way 1 - self merging subsets with renaming
You should see that the NO field is a sequence number that can be used as a BY variable when merging data sets.
Consider this example code as a template for the source code that a macro could generate.
NO has been renamed to seq for clarity.
data want;
merge
have (where=(seq=1) rename=(dat1=A1 tim1=B1 dat2=C1 tim2=D1))
have (where=(seq=2) rename=(dat1=A2 tim1=B2 dat2=C2 tim2=D2))
have (where=(seq=3) rename=(dat1=A3 tim1=B3 dat2=C3 tim2=D3))
have (where=(seq=4) rename=(dat1=A4 tim1=B4 dat2=C4 tim2=D4))
;
by id;
run;
For unknown data sets organized like the above pattern, the code generation requirements should be obvious: determine the maximum seq, and have the names of the variables to pivot specified as macro parameters, with the macro looping over the names.
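A rough sketch of such a macro follows. The macro name pivot_merge and all of its parameters are hypothetical, and to keep it short the four variable names (dat1, tim1, dat2, tim2) are hard-coded rather than passed in and looped over. It assumes the input is sorted by the BY variable and that seq runs from 1 to the maximum without gaps.

```sas
/* Hypothetical macro: generates one merge member per sequence value. */
%macro pivot_merge(data=, out=, by=, seq=, maxseq=);
  %local i;
  data &out;
    merge
    %do i = 1 %to &maxseq;
      &data (where=(&seq=&i)
             rename=(dat1=A&i tim1=B&i dat2=C&i tim2=D&i))
    %end;
    ;
    by &by;
    drop &seq;
  run;
%mend pivot_merge;

/* Determine the largest sequence number, then generate the merge. */
proc sql noprint;
  select max(seq) into :maxseq trimmed from have;
quit;

%pivot_merge(data=have, out=want, by=id, seq=seq, maxseq=&maxseq)
```

Turning on options mprint before the call shows the generated DATA step in the log, which is a convenient way to compare it with the hand-written template.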
Way 2 - multiple transposes
Caution, all pivoted columns will be character type and contain the formatted result of original values.
proc transpose data=have(rename=(dat1=A tim1=B dat2=C tim2=D)) out=stage1;
by id seq;
var a b c d;
run;
proc transpose data=stage1 out=want;
by id;
var col1;
id _name_ seq;
run;
Way 3 - Use array and DOW loop
* presume SEQ is indeed a unit monotonic sequence value;
data want (keep=id a1--d4);
do until (last.id);
* read one ID group per pass of the DATA step;
set have;
by id;
* declare the array with $ and a length if dat1, tim1, dat2, tim2 are character;
array wide A1-A4 B1-B4 C1-C4 D1-D4;
wide [ (seq-1)*4 + 1 ] = dat1;
wide [ (seq-1)*4 + 2 ] = tim1;
wide [ (seq-1)*4 + 3 ] = dat2;
wide [ (seq-1)*4 + 4 ] = tim2;
end;
* format A1 A3 B1 B3 C1 C3 D1 D3 your-date-format;
* format A2 A4 ................. your-time-format;
run;
Way 4 - change your data values to datetime
I'll leave this to esteemed others

In SAS how to transpose a table producing a dummy variable for each unique value in a column

Using SAS, I am trying to transpose the data in a table so that each unique value for variables Class and Subclass become a dummy variable, by variable ID.
Have:
ID Class Subclass
-------------------------------
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
Want:
ID Class_1 Class_2 Class_3 Subclass_1a ... Subclass_3b
----------------------------------------------------...---------------
ID1 1 1 0 1 ... 0
ID2 1 1 1 1 ... 0
ID3 1 1 1 1 ... 0
I have tried transposing the data by variable ID with Class and Subclass in the ID-statement of the transpose procedure. This however produces variables consisting of concatenations of unique combinations of the values of Class and Subclass. Neither does that approach produce 0 and 1 values where no VAR is defined in the transpose procedure.
Do I need to create the actual dummy variables first before transposing the data to achieve the want table, or is there a more straightforward way?
Seems like you need the help of PROC TRANSREG to generate a design matrix that is reduced.
data id;
infile datalines firstobs=3;
input ID :$3. class subclass :$2.;
datalines;
ID Class Subclass
-------------------------------
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
;;;;
run;
proc print;
run;
proc transreg;
id id;
model class(class subclass / zero=none);
output design out=dummy(drop=class subclass);
run;
proc print;
run;
proc summary nway;
class id;
output out=want(drop=_type_) max(class: subclass:)=;
run;
proc print;
run;
You can also select the distinct values, use PROC TRANSPOSE for each variable, and merge the results back together.
data have;
input ID $ Class $ Subclass $ ;
datalines;
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
;
proc sql;
create table want1 as
select distinct id, class from have;
proc transpose data = want1 out=want1a(drop =_name_) prefix = class_;
by id;
id class;
var class;
run;
proc sql;
create table want2 as
select distinct id, subclass from have;
proc transpose data = want2 out=want2a(drop =_name_) prefix = Subclass_;
by id;
id subclass;
var Subclass;
run;
data want;
merge want1a want2a;
by id;
array class(*) class_: subclass_:;
do i = 1 to dim(class);
if missing(class(i)) then class(i)= "0";
else class(i) ="1";
end;
drop i;
run;
Here is some tricky code generation that uses a hash to map a value to an array index corresponding to a flag variable representing the existential state of <name>_<value>
data have;
input ID $ Class Subclass $; datalines;
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
run;
* create indexed name_value data for variable name construction and hash initialization;
proc sql ; * fresh proc to reset within proc monotonic tracker;
create table map1 as
select class, monotonic() as index
from (select distinct class from have);
proc sql noprint;
create table map2 as
select subclass, monotonic() as index
from (select distinct subclass from have);
* populate macro variable with pdv target variable names to be arrayed;
proc sql noprint;
select catx('_','class',class)
into :map1vars separated by ' '
from map1 order by index;
select catx('_','subclass',subclass)
into :map2vars separated by ' '
from map2 order by index;
* group wise flag <variable>_<value> combinations;
data want;
if _n_ = 1 then do;
if 0 then set map1 map2; * prep pdv with hash variables;
declare hash map1(dataset:'map1');
declare hash map2(dataset:'map2');
map1.defineKey('class');
map1.defineData('index');
map1.defineDone();
map2.defineKey('subclass');
map2.defineData('index');
map2.defineDone();
end;
* group wise flag pivot vars (existential extrusion);
do until (last.id);
set have;
by id;
array map1_ &map1vars; * array for <name>_<value> combinations;
array map2_ &map2vars;
* use hash lookup on value to find index into target array;
map1.find(); put index=; map1_[index] = 1;
map2.find(); put index=; map2_[index] = 1;
end;
keep id &map1vars &map2vars;
run;
Proc REPORT can show values across with counts of occurrence within the group.
proc report data=have;
define id / group;
define class / across;
define subclass / across;
run;

How to modify the SNP values?

My dataset has 3 SNPs which looks like below
Id SNP1 SNP2 SNP3
1 AA AA AA
2 AG AC AG
3 GG CC GG
4
5
6 So on
In SNP1, I would like to recode the values as AA = 2, AG = 1, GG = 0, and likewise in SNP2 and SNP3.
How can I do this?
I would put the new values in a proc format, so that you can either keep the existing values but displayed with the formatted value, or convert the existing values using the format. Here are both ways to do this.
/* create format */
proc format;
value $snpfmt 'AA' = '2'
'AG' = '1'
'GG' = '0'
;
run;
/* create initial dataset */
data have;
input Id SNP1 $ SNP2 $ SNP3 $;
datalines;
1 AA AA AA
2 AG AC AG
3 GG CC GG
;
/* option1 - format the values */
proc datasets lib=work nodetails nolist;
modify have;
format snp1 snp2 snp3 $snpfmt2. ;
quit;
/* option2 - change the values using the format */
data want;
set have;
snp1 = put(snp1,$snpfmt2.);
snp2 = put(snp2,$snpfmt2.);
snp3 = put(snp3,$snpfmt2.);
run;
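Both options above leave the SNP variables as character. If numeric codes are wanted, for example for later modelling, one possible variation is a numeric informat used with INPUT. This is a sketch under that assumption: the informat name snpnum and the new variables snp1_n to snp3_n are illustrative, and any genotype not listed (such as AC or CC in SNP2) becomes missing.

```sas
/* Sketch: snpnum is a hypothetical informat mapping genotypes to numbers. */
proc format;
  invalue snpnum
    'AA' = 2
    'AG' = 1
    'GG' = 0
    other = .;
run;

data want_num;
  set have;
  array snp[3] $ snp1-snp3;                 /* existing character variables */
  array snp_n[3] snp1_n snp2_n snp3_n;      /* new numeric variables */
  do i = 1 to 3;
    snp_n[i] = input(snp[i], snpnum.);
  end;
  drop i;
run;
```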

Concatenate duplicate values

I have a table with some variables, say var1 and var2 and an identifier, and for some reasons, some identifiers have 2 observations.
I would like to know if there is a simple way to put back the second observation of the same identifier into the first one, that is
instead of having two observations, each with var1 var2 variables for the same identifier value
ID var1 var2
------------------
A1 12 13
A1 43 53
having just one, but with something like var1 var2 var1_2 var2_2.
ID var1 var2 var1_2 var2_2
--------------------------------------
A1 12 13 43 53
I can probably do that with renaming all my variables, then merging the table with the renamed one and dropping duplicates, but I assume there must be a simpler version.
Actually, your suggestion of merging the values back is probably the best.
This works if you have, at most, 1 duplicate for any given ID.
data first dups;
set have;
by id;
if first.id then output first;
else output dups;
run;
proc sql noprint;
create table want as
select a.id,
a.var1,
a.var2,
b.var1 as var1_2,
b.var2 as var2_2
from first as a
left join
dups as b
on a.id=b.id;
quit;
Another method makes use of PROC TRANSPOSE and a data-step merge:
/* You can experiment by adding more data to this datalines step */
data tab;
infile datalines;
input ID : $2. var1 var2;
datalines;
A1 12 13
A1 43 53
;
run;
/* This step puts the var1 values onto one line */
proc transpose data=tab out=new1 (drop=_NAME_) prefix=var1_;
by id;
var var1;
run;
/* This does the same for the var2 values */
proc transpose data=tab out=new2 (drop=_NAME_) prefix=var2_;
by id;
var var2;
run;
/* The two transposed datasets are then merged together to give one line */
data want;
merge new1 new2;
by id;
run;
As an example:
data tab;
infile datalines;
input ID : $2. var1 var2;
datalines;
A1 12 13
A1 43 53
A2 199 342
A2 1132 111
A2 91913 199191
B1 1212 43214
;
run;
Gives:
ID var1_1 var1_2 var1_3 var2_1 var2_2 var2_3
---------------------------------------------------
A1 12 43 . 13 53 .
A2 199 1132 91913 342 111 199191
B1 1212 . . 43214 . .
There's a very simple way of doing this, using the IDGROUP option of the OUTPUT statement in PROC SUMMARY.
data have;
input ID $ var1 $ var2 $;
datalines;
A1 12 13
A1 43 53
;
run;
proc summary data=have nway;
class id;
output out=want (drop=_:)
idgroup(out[2] (var1 var2)=);
run;