SAS MERGE and IN= logic explanation: subsetting IF statement

Query 1:
data orders;
merge items(In=a) address(in=b);
by id;
if a and b;
run;
Query 2:
data orders;
merge items(In=a) address(in=b);
by id;
if a ;
run;
Query 3:
data orders;
merge items(In=a) address(in=b);
by id;
if b;
run;
Can you please explain which join each of Queries 1, 2 and 3 performs, given its IF condition?

Merges are not joins in the SQL sense because they handle duplicate keys differently. But you can think of the first one, where the key has to be in both sources, as an INNER JOIN, and the other two as LEFT and RIGHT joins, respectively.
To perform a UNION you would want to use the SET statement instead of the MERGE statement.
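As a rough illustration of the IN= logic, here is a minimal sketch with made-up data. The rows in items and address and the output dataset names (inner_like, left_like, right_like) are assumptions for the example, not part of the question. It writes the rows selected by each of the three conditions to its own dataset in a single pass.

```sas
/* Made-up sample data: id 1 is only in items, id 4 is only in address */
data items;
  input id item $;
  datalines;
1 pen
2 ink
3 pad
;
run;

data address;
  input id city $;
  datalines;
2 Oslo
3 Rome
4 Lima
;
run;

/* MERGE with BY expects both inputs sorted by the BY variable */
proc sort data=items; by id; run;
proc sort data=address; by id; run;

data inner_like left_like right_like;
  merge items(in=a) address(in=b);
  by id;
  if a and b then output inner_like;  /* id present in both inputs */
  if a then output left_like;         /* every id from items */
  if b then output right_like;        /* every id from address */
run;
```

With these rows, inner_like should hold ids 2 and 3, left_like ids 1 to 3, and right_like ids 2 to 4. Because every id is unique in each input, the result matches the corresponding SQL join; with duplicate keys on both sides it would not.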

Related

Persist Variable in Base SAS

I am using SAS on OS390.
I have an INFILE1, some treatment, then another INFILE2, other treatment.
I want to use variables from INFILE1 to compare with INFILE2.
examples:
INFILE1.DATE1 > INFILE2.DATE2 THEN OUTPUT;
My issue is that DATE1 is always empty no matter what.
I've tried....
%LET DATETEMP = INFILE1.DATE1
...but DATETEMP is empty as well.
Is there any way in SAS to make a variable carry its value from an INFILE to another...so to speak?
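A DATA step cannot reference a variable with a dataset prefix such as INFILE1.DATE1, so you cannot compare variables across datasets that way. Combine the datasets on a common key first (ID in the examples below), then make the comparison.
You can use either a DATA step merge or an SQL join.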
Merge Example:
DATA want ;
MERGE INFILE1 INFILE2;
BY ID ;
RUN;
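Once the rows are merged, the comparison can go in the same step. A minimal sketch, assuming both datasets share an ID key and carry DATE1 and DATE2 as SAS date values; the sample rows are made up for illustration:

```sas
/* Made-up sample data; ID, DATE1 and DATE2 follow the names used above */
data infile1;
  input id date1 :date9.;
  format date1 date9.;
  datalines;
1 01JAN2020
2 15MAR2020
;
run;

data infile2;
  input id date2 :date9.;
  format date2 date9.;
  datalines;
1 01DEC2019
2 01APR2020
;
run;

/* MERGE with BY expects both inputs sorted by the BY variable */
proc sort data=infile1; by id; run;
proc sort data=infile2; by id; run;

data want;
  merge infile1(in=a) infile2(in=b);
  by id;
  /* keep matched rows where DATE1 is later than DATE2 */
  if a and b and date1 > date2;
run;
```

With this data only ID 1 is kept, since 01JAN2020 is later than 01DEC2019 while 15MAR2020 is not later than 01APR2020.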
SQL Join Example: (you can do either inner or left join based on what you want)
proc sql;
create table work.want as
select t1.date1 , t2.date2, t1.id
from INFILE1 as t1
left join INFILE2 as t2 on t1.id=t2.id;
/*inner join INFILE2 as t2 on t1.id=t2.id;*/
quit;

subset data by thousands of observations

I have two data sets:
Data set 1: This dataset has 2300 rows. The jobID is the same throughout the dataset, but the Hash is unique throughout the dataset
Hash jobID
3456343454 1077
3453454 1077
43673 1077
.... and so on
Data set 2: This dataset has 5838918 rows. Different JobID values and different Hash values such as the following:
Hash jobID
2223422 2
233435 155
2344322 1171
... and so on
What I am trying to do is see whether any of the Hash values in the first dataset also exist in the second dataset. Since the first dataset has over a thousand unique Hash values, I cannot type each one of them into a condition like the following:
if hash in (value1 value2...etc)
I also want to show all the Hash values that exist in dataset 1 but not in dataset 2.
What is the best way to go about doing this?
Also, Hash is character ($32. format and informat), while JobID is numeric (format BEST12., informat 12.).
Use a SQL query to create the second list; you don't have to list the values manually.
proc sql;
create table in1_not2 as
select *
from table1 as a
where a.hash not in
(select b.hash from table2 as b);
quit;
If I understood correctly, you could check with a simple merge.
Order the two datasets by Hash:
proc sort data=dataset1; by Hash; run;
proc sort data=dataset2; by Hash; run;
Check if the Hash is in both datasets:
data check;
merge dataset1 (in=a keep=Hash)
dataset2 (in=b keep=Hash);
by Hash;
if a and b;
run;
Note that all I'm doing is checking the hash, I'm not bringing any other variable to the final dataset.
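The same merge can also list the Hash values that are in dataset 1 but not in dataset 2 by flipping the condition to `if a and not b;`. A minimal sketch with made-up rows (one Hash value is deliberately shared so that there is a match); the output name in1_not2 is borrowed from the SQL answer above:

```sas
/* Made-up sample data; Hash is character, jobID is numeric */
data dataset1;
  length Hash $32;
  input Hash $ jobID;
  datalines;
3456343454 1077
3453454 1077
43673 1077
;
run;

data dataset2;
  length Hash $32;
  input Hash $ jobID;
  datalines;
2223422 2
3453454 155
2344322 1171
;
run;

proc sort data=dataset1; by Hash; run;
proc sort data=dataset2; by Hash; run;

data in1_not2;
  merge dataset1(in=a keep=Hash)
        dataset2(in=b keep=Hash);
  by Hash;
  /* keep Hash values found in dataset1 with no match in dataset2 */
  if a and not b;
run;
```

Here in1_not2 should contain 3456343454 and 43673, because 3453454 appears in both inputs.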

select only a few columns from a large table in SAS

I have to join two tables on a key (say ABC). I have to update a single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A:
Contains some 100 columns. Key column: ABC.
TABLE B:
Contains just 2 columns: the key column ABC and status_cd.
Table A, which I use in this left join query, has more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL without creating a new column from the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.
You can take advantage of dataset options to make it so you can use wildcards in the select statement. Note that the order of the columns could change doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd= coalesce(status_cd,
(select status_cd
from transaction b
where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted then the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.

Max number of arguments to `where ... in (...)` clause in Proc SQL?

Suppose I'm subsetting a table and summarizing it in proc sql. The code uses a where ... in clause and a subquery to do the subsetting. I know that some SQL engines set a limit on the number of arguments to the where ... in clause. Does SAS have a limit on this? This question would apply to a program like this:
proc sql;
create table want as
select
ID,
sum(var1) as var1,
sum(var2) as var2,
sum(var3) as var3
from largetable
where ID in (select ID from longlist)
group by ID;
quit;
What if longlist returns 10,000 IDs? How about 10,000,000?
I'm not aware of any explicit limit on this. SAS's SQL parser seems to convert these often to JOINs, when they're not explicitly coded in the table; that means there are some limitations, but not particularly small ones.
I do believe there is a limit to the length of a SQL statement in total, so if you were trying to include an extremely long list in text you might run into problems, but in the example above I don't see a problem with 10,000,000 IDs. I just tested it with 250,000,000 IDs in the longlist table, and SAS had no problem with it:
data largetable;
do id=1 to 1e8;
if mod(id,7)=0 then output;
end;
run;
data ids;
do id = 1 to 1e9;
if mod(id,4)=0 then output;
end;
run;
proc sql _method;
create table want as
select
ID
from largetable
where ID in (select ID from IDs)
group by ID;
quit;
Interestingly, adding _method indicates it does not do this as a join, but as a subquery. I'm not sure why, at least in this case; everything I've been told says that it should convert this to a join implicitly.
As Joe has said, there should probably be no problems with any reasonable number of rows in the longlist table. However, although this may be readable, a join may perform better.
Do you have a strong preference for running the query as written rather than doing a join? An inner join gives the same result as the IN subquery as long as the IDs in longlist are unique, e.g.
proc sql;
create table want as
select
b.ID,
sum(b.var1) as var1,
sum(b.var2) as var2,
sum(b.var3) as var3
from longlist a inner join largetable b
on a.ID = b.ID
group by b.ID;
quit;
Elaborating a bit on entering a long list as text - I'm not aware of any limit on the length of any one statement in SAS, but there are various limits on the length of individual lines of code, depending on your version and how you're submitting it. I suspect it's possible to split a long statement over several lines each approaching the maximum allowed length.

Update the values of a column in a dataset with another table

If I have Table A with two columns, ID and Mean, and Table B with a long list of columns including Mean, how can I replace the values of the Mean column in Table B with the values from Table A for the IDs that exist in Table A?
I've tried PROC SQL UPDATE and both DATASET MERGE and DATASET UPDATE but they keep adding rows when the number of columns is not equal in both tables.
data want;
merge have1(in=H1) have2(in=H2);
by mergevar;
if H1;
run;
That will guarantee that H2 does not add any rows, unless there are duplicate values for one of the by values. Other conditions can be used as well; if h2; would do about the same thing for the right-hand dataset, and if h1 and h2; would only keep records that come from both tables.
PROC SQL join should also work fairly easily.
proc sql;
create table want as
select A.id, coalesce(B.mean, A.mean) as mean
from A left join B
on A.id=B.id;
quit;
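If the goal is to change Mean in place without adding rows, a PROC SQL UPDATE with a correlated subquery is another option. A minimal sketch, assuming Table B also has an ID column and that ID is unique in Table A; the sample rows and the extra column named other are made up:

```sas
/* Made-up sample data: A holds the new means, B is the wide table */
data A;
  input id mean;
  datalines;
1 10
2 20
;
run;

data B;
  input id mean other;
  datalines;
1 . 5
2 99 6
3 30 7
;
run;

proc sql;
  /* overwrite B.mean only for ids that exist in A; no rows are added */
  update B
    set mean = (select A.mean from A where A.id = B.id)
    where id in (select id from A);
quit;
```

After the update, B should still have three rows, with Mean equal to 10 and 20 for ids 1 and 2 and the original 30 for id 3. If ID were not unique in A, the subquery would return more than one value for a row and the update would fail.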