PROC SQL equivalent of data step merge using IN - sas

I have to merge two very large files and I want to avoid this doing in Data step as that would mean sorting the data. I need all observations for all IDs from left file excluding IDs that are not in the second.
data leftdata;
input id $ y;
datalines;
AA 10
AA 20
BB 30
BB 40
CC 50
CC 80
DD 60
;run;
data rightdata;
input id $ ;
datalines;
AA
BB
;
run;
*Using datastep;
PROC SORT DATA=leftdata; BY id;
PROC SORT DATA=rightdata; BY id; RUN;
DATA datastep;
MERGE leftdata(IN=a) rightdata(IN=b);
BY id; IF a and b=0;
RUN;
How can the same be achieved using PROC SQL?
Final output must include the following observations:
CC 50
CC 80
DD 60

Here are two ways:
WHERE NOT IN (SELECT ...) filtering, or
LEFT JOIN where missing the right id.
Example:
data have;
input id $ y;
datalines;
AA 10
AA 20
BB 30
BB 40
CC 50
CC 80
DD 60
;
data excluded_ids;
input id $ ;
datalines;
AA
BB
;
proc sql;
create table want as
select * from have
where id not in (select id from excluded_ids)
;
create table want as
select have.* from have
left join excluded_ids as remove
on have.id = remove.id
where remove.id is null
;
For the second way you will need a SELECT DISTINCT if the exclusion list has a repeated id.
Data step
Use a hash object to store the exclusion list and check method to test for removal
Example:
data want;
set have;
if _n_ = 1 then do;
declare hash exclude(dataset:'excluded_ids');
exclude.defineKey('id');
exclude.defineDone();
end;
if exclude.check() = 0 then delete;
run;

Related

Assign order variable by SAS

I have a dataset like as below, by using SAS, I need to assign the order variable based on descending count order to this dataset, when the category is missing, it should be always in the last whatever the count is. All other category above the missing one should be order by descending count.
Category Count
aa 10
bb 9
cc 8
6
ab 3
Desired output:
Category Count Order
aa 10 1
bb 9 2
cc 8 3
ab 3 4
6 5
You can use Proc DS2 to compute a sequence number for a result set.
Example:
data have;
input s $ f;
datalines;
aa 10
bb 9
cc 8
. 6
ab 3
;
proc ds2;
data want(overwrite=yes);
declare int sequence ;
method run();
set
{
select s,f from have
order by case when s is not null then f else -1e9-f end desc
};
sequence + 1;
end;
run;
quit;
Sort and split all the datasets into descending value order by missing and not-missing category, then stack them on top of each other.
/* Sort the non-missing values */
proc sort data=have out=have_notmissing;
by descending value;
where NOT missing(category);
run;
/* Sort the missing values */
proc sort data=have out=have_missing;
by descending value;
where missing(category);
run;
/* Stack them on top of each other */
data want;
set have_notmissing
have_missing
;
rank+1;
run;
Output:
category value rank
aa 10 1
bb 9 2
cc 8 3
ab 3 4
6 5
Perhaps what you really want is a NOTSORTED format and a procedure that supports the PRELOADFMT option and ORDER=DATA.
data test;
input category $ count;
cards;
aa 10
bb 9
cc 8
. 6
ab 3
;;;;
run;
proc print;
run;
proc format;
value $cat(notsorted)
'bb'='Bee Bee'
'aa'='Aha'
'cc'='CC'
'dd'='D D'
'ab'='AB'
;
quit;
proc summary data=test nway missing;
class category / order=data preloadfmt;
format category $cat.;
freq count;
output out=summary(drop=_type_) / levels;
run;
proc print;
run;

values of commun column in A replaced by that in B with function merge in SAS

I want merge two tables, but they have 2 columns in commun, and i do not want value of var1 in A replaced by that in B, if we don't use drop or rename, does anyone know it?
I can fix it with sql but just curious with Merge!
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 30 50
2 b 30 50
;
run;
/* Marge A and B */
data c;
merge a (in=N) b(in=M);
if N;
by id1;
run;
but what i like is:
data C;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 10 50
1 b 10 50
2 a 10 50
2 b 10 50
;
run;
Use rename
data c;
merge a (in=N) b(in=M rename=(var1=var1_2));
by id1;
if N;
run;
If you don't want to use rename / drop etc., then you could just flip the merge order such that the datasets whose var1 should be retained overwrites the other:
data c;
merge b (in=M) a(in=N);
by id1;
if N;
run;
When the data step loads data from the datasets mentioned it does it in the order that they appear on the MERGE (or SET or UPDATE) statement. So if you are merging two dataset and the BY variables match values then the record from the first is loaded and the record from the second is loaded, overwriting the values read from the first.
For 1 to 1 matching you can just change the order that the datasets are mentioned.
merge b(in=M) a(in=N) ;
If you really want the variables defined in the output dataset in the order they appear in A then add a SET statement that the compiler will process but that can never execute before your MERGE statement.
if 0 then set a b ;
If you are doing a 1 to many matching then you might have other trouble since when a dataset stops contributing values to the current BY group then SAS does not re-read the last observation. In that case you will have to use some combination of RENAME=, DROP= or KEEP= dataset options.
In PROC SQL when you have duplicate names for selected columns (and are trying to create an output dataset instead of report) then SAS ignores the second copy of the named variable. So in a sense it is the reverse of what happens with the MERGE statement.

In SAS how to transpose a table producing a dummy variable for each unique value in a column

Using SAS, I am trying to transpose the data in a table so that each unique value for variables Class and Subclass become a dummy variable, by variable ID.
Have:
ID Class Subclass
-------------------------------
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
Want:
ID Class_1 Class_2 Class_3 Subclass_1a ... Subclass_3b
----------------------------------------------------...---------------
ID1 1 1 0 1 ... 0
ID2 1 1 1 1 ... 0
ID3 1 1 1 1 ... 0
I have tried transposing the data by variable ID with Class and Subclass in the ID-statement of the transpose procedure. This however produces variables consisting of concatenations of unique combinations of the values of Class and Subclass. Neither does that approach produce 0 and 1 values where no VAR is defined in the transpose procedure.
Do I need to create the actual dummy variables first before transposing the data to achieve the want table, or is there a more straightforward way?
Seems like you need the help of PROC TRANSREG to generate a design matrix that is reduced.
data id;
infile datalines firstobs=3;
input ID :$3. class subclass :$2.;
datalines;
ID Class Subclass
-------------------------------
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
;;;;
run;
proc print;
run;
proc transreg;
id id;
model class(class subclass / zero=none);
output design out=dummy(drop=class subclass);
run;
proc print;
run;
proc summary nway;
class id;
output out=want(drop=_type_) max(class: subclass:)=;
run;
proc print;
run;
you can also do distinct and use tranpose for each variable and merge it back.
data have;
input ID $ Class $ Subclass $ ;
datalines;
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
;
proc sql;
create table want1 as
select distinct id, class from have;
proc transpose data = want1 out=want1a(drop =_name_) prefix = class_;
by id;
id class;
var class;
run;
proc sql;
create table want2 as
select distinct id, subclass from have;
proc transpose data = want2 out=want2a(drop =_name_) prefix = Subclass_;
by id;
id subclass;
var Subclass;
run;
data want;
merge want1a want2a;
by id;
array class(*) class_: subclass_:;
do i = 1 to dim(class);
if missing(class(i)) then class(i)= "0";
else class(i) ="1";
end;
drop i;
run;
Here is some tricky code generation that uses a hash to map a value to an array index corresponding to a flag variable representing the existential state of <name>_<value>
data have;
input ID $ Class Subclass $; datalines;
ID1 1 1a
ID1 1 1b
ID1 1 1c
ID1 2 2a
ID2 1 1a
ID2 1 1b
ID2 2 2a
ID2 2 2b
ID2 3 3a
ID3 1 1a
ID3 1 1d
ID3 2 2a
ID3 3 3a
ID3 3 3b
run;
* create indexed name_value data for variable name construction and hash initialization;
proc sql ; * fresh proc to reset within proc monotonic tracker;
create table map1 as
select class, monotonic() as index
from (select distinct class from have);
proc sql noprint;
create table map2 as
select subclass, monotonic() as index
from (select distinct subclass from have);
* populate macro variable with pdv target variable names to be arrayed;
proc sql noprint;
select catx('_','class',class)
into :map1vars separated by ' '
from map1 order by index;
select catx('_','subclass',subclass)
into :map2vars separated by ' '
from map2 order by index;
* group wise flag <variable>_<value> combinations;
data want;
if _n_ = 1 then do;
if 0 then set map1 map2; * prep pdv with hash variables;
declare hash map1(dataset:'map1');
declare hash map2(dataset:'map2');
map1.defineKey('class');
map1.defineData('index');
map1.defineDone();
map2.defineKey('subclass');
map2.defineData('index');
map2.defineDone();
end;
* group wise flag pivot vars (existential extrusion);
do until (last.id);
set have;
by id;
array map1_ &map1vars; * array for <name>_<value> combinations;
array map2_ &map2vars;
* use hash lookup on value to find index into target array;
map1.find(); put index=; map1_[index] = 1;
map2.find(); put index=; map2_[index] = 1;
end;
keep id &map1vars &map2vars;
run;
Proc REPORT can show values across with counts of occurrence within the group.
proc report data=have;
define id / group;
define class / across;
define subclass / across;
run;

How to modify the SNP values?

My dataset has 3 SNPs which looks like below
Id SNP1 SNP 2 SNP3
1 AA AA AA
2 AG AC AG
3 GG CC GG
4
5
6 So on
In SNP1 - I would like to modify the values AA =2, AG =1, GG = 0 and Likewise in SNP1 and SNP2
How can I do this?
I would put the new values in a proc format, so that you can either keep the existing values but displayed with the formatted value, or convert the existing values using the format. Here are both ways to do this.
/* create format */
proc format;
value $snpfmt 'AA' = '2'
'AG' = '1'
'GG' = '0'
;
run;
/* create initial dataset */
data have;
input Id SNP1 $ SNP2 $ SNP3 $;
datalines;
1 AA AA AA
2 AG AC AG
3 GG CC GG
;
/* option1 - format the values */
proc datasets lib=work nodetails nolist;
modify have;
format snp1 snp2 snp3 $snpfmt2. ;
quit;
/* option2 - change the values using the format */
data want;
set have;
snp1 = put(snp1,$snpfmt2.);
snp2 = put(snp2,$snpfmt2.);
snp3 = put(snp3,$snpfmt2.);
run;

Concatenate duplicate values

I have a table with some variables, say var1 and var2 and an identifier, and for some reasons, some identifiers have 2 observations.
I would like to know if there is a simple way to put back the second observation of the same identifier into the first one, that is
instead of having two observations, each with var1 var2 variables for the same identifier value
ID var1 var2
------------------
A1 12 13
A1 43 53
having just one, but with something like var1 var2 var1_2 var2_2.
ID var1 var2 var1_2 var2_2
--------------------------------------
A1 12 13 43 53
I can probably do that with renaming all my variables, then merging the table with the renamed one and dropping duplicates, but I assume there must be a simpler version.
Actually, your suggestion of merging the values back is probably the best.
This works if you have, at most, 1 duplicate for any given ID.
data first dups;
set have;
by id;
if first.id then output first;
else output dups;
run;
proc sql noprint;
create table want as
select a.id,
a.var1,
a.var2,
b.var1 as var1_2,
b.var2 as var2_2
from first as a
left join
dups as b
on a.id=b.id;
quit;
Another method makes use of PROC TRANSPOSE and a data-step merge:
/* You can experiment by adding more data to this datalines step */
data have;
infile datalines;
input ID : $2. var1 var2;
datalines;
A1 12 13
A1 43 53
;
run;
/* This step puts the var1 values onto one line */
proc transpose data=tab out=new1 (drop=_NAME_) prefix=var1_;
by id;
var var1;
run;
/* This does the same for the var2 values */
proc transpose data=tab out=new2 (drop=_NAME_) prefix=var2_;
by id;
var var2;
run;
/* The two transposed datasets are then merged together to give one line */
data want;
merge new1 new2;
by id;
run;
As an example:
data tab;
infile datalines;
input ID : $2. var1 var2;
datalines;
A1 12 13
A1 43 53
A2 199 342
A2 1132 111
A2 91913 199191
B1 1212 43214
;
run;
Gives:
ID var1_1 var1_2 var1_3 var2_1 var2_2 var2_3
---------------------------------------------------
A1 12 43 . 13 53 .
A2 199 1132 91913 342 111 199191
B1 1212 . . 43214 . .
There's a very simple way of doing this, using the IDGROUP function within PROC SUMMARY.
data have;
input ID $ var1 $ var2 $;
datalines;
A1 12 13
A1 43 53
;
run;
proc summary data=have nway;
class id;
output out=want (drop=_:)
idgroup(out[2] (var1 var2)=);
run;