SAS merge datasets to overwrite values - sas

I want to use dataset B to overwrite some values in dataset A by merging dataset A & B with a merging ID. However it doesn't work as expected. Here is the test I did:
/* create table A */
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ var1 var2;
datalines;
1 20 30
2 20 30
;
run;
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a b;
by id1;
run;
Table C looks like this:
ID1 ID2 VAR1 VAR2
1 a 20 30
1 b 10 30
2 a 20 30
2 b 10 30
Why the 10s in row 2&4 didn't get replaced by 20 from table B? While var2 works as expected?
I know I can do this simply using proc SQL, and that's what I did to solve the problem. But I still quite curious if there is a way to do what I wanted using merge? And why this wasn't working? I prefer merge over SQL in this circumstance because the logic is easier to implement (util I found this not working properly).
I use SAS 9.4.

This has to do with how SAS iterates over the data sets during the merge. Basically, the second record for each of A doesn't get lined up with a record from B. The value of VAR2 is carried over from the previous record. VAR1 gets its value from A (because there is no B).
IF there is record in B for EVERY ID1, then you can rewrite your merge like this to achieve what you want.
/* merge A&B to overwrite var1 in table A using values in table B */
data c;
merge a(drop=var1) b;
by id1;
run;
This drops the VAR1 from A so that it is carried down from the record in B.
Otherwise you will need more complex logic (might I suggest an SQL left join with the coalesce() function?).

Like DomPazz suggests, proc sql is the way to do this. merge will only keep one value from each data set. The coalesce function pick the first non-missing value from the list, so it uses var1 from b, but if b.var1 is null then it uses a.var1.
proc sql;
create table c as
select
a.id1,
a.id2,
coalesce(b.var1,a.var1) as var1,
b.var2
from
a
left join b
on a.id1 = b.id1
;
quit;

The merge method could still work fine, you would just need to be more explicit about how to choose the 'best' value for var1, such as:
data c (drop = a_var1 b_var1);
merge a(rename=(var1 = a_var1))
b(rename=(var1 = b_var1));
by id1;
* Now you have two different variables named a_var1 and b_var1;
* Implement logic to choose your favorite;
if NOT MISSING(b_var1) Then DO;
var1 = b_var1;
var1_source='B';
END;
else DO;
var1 = a_var1;
var1_source='A';
END;
run;
If your criteria for which 'var1' to choose is as simple as 'If b exists, use it' then this is identical to the the SQL method with coalesce().
Where I've found this method useful is for more complicated criteria, plus its always nice to know the source of the data (which doesn't happen with coalesce()).

Related

values of commun column in A replaced by that in B with function merge in SAS

I want merge two tables, but they have 2 columns in commun, and i do not want value of var1 in A replaced by that in B, if we don't use drop or rename, does anyone know it?
I can fix it with sql but just curious with Merge!
data a;
infile datalines;
input id1 $ id2 $ var1;
datalines;
1 a 10
1 b 10
2 a 10
2 b 10
;
run;
/* create table B */
data b;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 30 50
2 b 30 50
;
run;
/* Marge A and B */
data c;
merge a (in=N) b(in=M);
if N;
by id1;
run;
but what i like is:
data C;
infile datalines;
input id1 $ id2 $ var1 var2;
datalines;
1 a 10 50
1 b 10 50
2 a 10 50
2 b 10 50
;
run;
Use rename
data c;
merge a (in=N) b(in=M rename=(var1=var1_2));
by id1;
if N;
run;
If you don't want to use rename / drop etc., then you could just flip the merge order such that the datasets whose var1 should be retained overwrites the other:
data c;
merge b (in=M) a(in=N);
by id1;
if N;
run;
When the data step loads data from the datasets mentioned it does it in the order that they appear on the MERGE (or SET or UPDATE) statement. So if you are merging two dataset and the BY variables match values then the record from the first is loaded and the record from the second is loaded, overwriting the values read from the first.
For 1 to 1 matching you can just change the order that the datasets are mentioned.
merge b(in=M) a(in=N) ;
If you really want the variables defined in the output dataset in the order they appear in A then add a SET statement that the compiler will process but that can never execute before your MERGE statement.
if 0 then set a b ;
If you are doing a 1 to many matching then you might have other trouble since when a dataset stops contributing values to the current BY group then SAS does not re-read the last observation. In that case you will have to use some combination of RENAME=, DROP= or KEEP= dataset options.
In PROC SQL when you have duplicate names for selected columns (and are trying to create an output dataset instead of report) then SAS ignores the second copy of the named variable. So in a sense it is the reverse of what happens with the MERGE statement.

Select an observation if it has another within 24 hours of it

I am trying to create a table that only populates entries of a contact to a customer at a business number if they were NOT first contacted at a home number within 24 hours prior to the attempt at the business number.
So if I have
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
I want to be able to get
1 20MAY2018:06:24:28 B
2 24MAY2018:06:24:28 B
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
I have tried adding a count to the ID but I'm not sure how I'd go about using that, or if there's a way to use a subquery within a proc sql to create a count of observations that have more than one in a 24 hour period.
So, your approach will work, but will be quite messy with large numbers - as you're doing a cartesian join within ID. If each ID has few records it's not so bad, but if each ID has many records you make a lot of connections.
Fortunately, there's an easy way to do this in SAS!
data want;
do _n_ = 1 by 1 until (last.id); *for each ID:;
set have;
by id;
if first.id then last_home=0; *initialize last_home to 0;
if type='H' then last_home = record; *if it is a home then save it aside;
if type='B' and intck('Hour',last_home,record,'c') gt 24 then output; *if it is business then check if 24 hours have passed;
end;
format last_home datetime.;
run;
A few notes:
I use a DoW loop, but that really isn't mandatory, I just like it from a clarity perspective (it makes it clear I'm doing something at an ID-repetition level). You could remove that loop and add a RETAIN for last_home and it would be the same.
I use INTCK instead of INTNX - again this is for clarity, your INTNX is fine too, but INTCK just does the comparison, while INTNX is for advancing dates by an amount. I use the one that matches what I am trying to do, so someone reading the code can see easily what I'm doing.
This will be much faster than SQL on larger datasets, if for no other reason than it only passes the data once. SQL will necessarily do it multiple times, even if you don't separate HAVEA/HAVEB and do that within the SQL query.
I believe I figured it out!
I have HAVEA and HAVEB tables hosting type H and type B entries respectively.
Then I ran the following PROC SQL's.
PROC SQL;
CREATE TABLE WANTA AS
SELECT A.RECORD AS PREVIOUS_CALL, B.* FROM HAVEB B
JOIN HAVEA A ON (B.ID=A.ID AND A.RECORD LE B.RECORD);
CREATE TABLE WANTB AS
SELECT * FROM WANTA
GROUP BY ID, RECORD
HAVING PREVIOUS_CALL = MAX(PREVIOUS_CALL);
CREATE TABLE WANTC AS
SELECT * FROM WANTB
WHERE INTNX('HOUR',RECORD,-24,'SAME') GT PREVIOUS_CALL;
QUIT;
Please let me know if this is not a sustainable answer for larger sums of data or if there is a much better method of approaching this.
You perform a selection to get the final result set with out creating intermediate tables. Here are two alternatives:
First way
Similar to your 'figuring it out'. A reflexive join with grouping detects the "to_home" calls prior to the "to_business" calls that did NOT occur in the last 24 hours (86,400 seconds)
proc sql;
create table want as
select distinct
business.*
from have as business
join have as home
on business.id = home.id
& business.type = 'B'
& home.type = 'H'
& home.CALL_DT < business.CALL_DT
group by
business.call_dt
having
max(home.call_dt) < business.call_dt - 86400
;
Second way
Perform a NOT existential check, for a to_home call in prior 24hr, for every to_business call.
create table want2 as
select
business.*
from
have as business
where
business.type = 'B'
and
not exists (
select * from have as home
where home.id = business.id
and home.type = 'H'
and home.call_dt < business.call_dt
and home.call_dt >= business.call_dt - 86400
)
;
A HASH solution does have some dependencies (amount of data and RAM)...but it is another alternative
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE $;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
/* Keep only HOME TYPE records and
rename RECORD for using in comparision */
Data HOME(Keep=ID RECORD rename=(record=hrecord));
Set HAVE(where=(Type="H"));
Run;
Data WANT(Keep=ID RECORD TYPE);
/* Use only BUSINESS TYPE records */
Set HAVE(where=(Type="B"));
/* Set up HASH object */
If _N_=1 Then Do;
/* Multidata:YES for looping through
all successful FINDs */
Declare HASH HOME(dataset:"HOME", multidata:'yes');
home.DEFINEKEY('id');
home.DEFINEDATA('hrecord');
home.DEFINEDONE();
/* To prevent warnings in the log */
Call Missing(HRECORD);
End;
/* FIND first KEY match */
rc=home.FIND();
/* Successful FINDs result in RC=0 */
Do While (RC=0);
/* This will keep the result of the most recent, in datetime,
HOME/BUS record comparision */
If intck('Hour',hrecord,record,'c') > 24 Then Good_For_Output=1;
Else Good_For_Output=0;
/* Keep comparing HOME/BUS for all HOME records */
rc=home.FIND_NEXT();
End;
If Good_For_Output=1 Then Output;
Run;

Naming columns using values from data table

Let's say I have two columns A and B.
A B
12 "randstr"
39 "randstr"
2 "randstr"
This random string is repeated in each row.
I'm interested in how I can get the table below:
randstr B
12 "randstr"
39 "randstr"
2 "randstr"
The value in the column B was used to rename column A. I have tried using rename and all sorts of macro magic but failed. I have no idea how to proceed.
I've tried the answers below and they just don't allow for reading the value from the data and then using the value as a column name:
https://communities.sas.com/t5/General-SAS-Programming/dates-used-as-column-names/td-p/168803
https://stats.idre.ucla.edu/sas/code/a-few-sas-macro-programs-for-renaming-variables-dynamically/
SAS - Dynamically create column names using the values from another column
Renaming Column with Dynamic Name
The transformation could also be seen as a row-wise transposition.
data have;
attrib A length=8 B length=$32;
row+1;
input
A & B; datalines;
12 xyz-123-abc
39 xyz-123-abc
2 xyz-123-abc
run;
proc transpose data=have out=want(drop=row _name_);
by row;
var A;
id B;
copy B;
run;
In non-toy scenarios the B column is often not a single value. Try the same transpose with data having variation in B. The procedure will create two new columns from the values of B.
A & B; datalines;
12 xyz-123-abc
39 xyz-123-abc
2 xyz-123-abc
3141 xyz-456-def
Using this macro, it's fairly straightforward:
/* get first value in the dataset */
%let new_col=%mf_getvalue(work.YOURDATA,B);
/* rename variable A */
proc datasets library=work nolist;
modify YOURDATA;
rename A=%sysfunc(dequote(&new_col));
quit;

How to use proc compare to update dataset

I want to use proc compare to update dataset on a daily basis.
work.HAVE1
Date Key Var1 Var2
01Aug2013 K1 a 2
01Aug2013 K2 a 3
02Aug2013 K1 b 4
work.HAVE2
Date Key Var1 Var2
01Aug2013 K1 a 3
01Aug2013 K2 a 3
02Aug2013 K1 b 4
03Aug2013 K2 c 1
Date and Key are uniquely determine one record.
How can I use the above two tables to construct the following
work.WANT
Date Key Var1 Var2
01Aug2013 K1 a 3
01Aug2013 K2 a 3
02Aug2013 K1 b 4
03Aug2013 K2 c 1
I don't want to delete the previous data and then rebuild it. I want to modify it via append new records at the bottom and adjust the values in VAR1 or VAR2.
I'm struggling with proc compare but it just doesn't return what I want.
proc compare base=work.HAVE1 compare=work.HAVE2 out=WORK.DIFF outnoequal outcomp;
id Date Key;
run;
This will give you new and changed (unequal records) in single dataset WORK.DIFF. You'll have to distinguish new vs changed yourself.
However, what you want to achieve is actually a MERGE - inserts new, overwrites existing, though maybe due to performance reasons etc. you don't want to re-create the full table.
data work.WANT;
merge work.HAVE1 work.HAVE2;
by Date Key;
run;
Edit1:
/* outdiff option will produce records with _type_ = 'DIF' for matched keys */
proc compare base=work.HAVE1 compare=work.HAVE2 out=WORK.RESULT outnoequal outcomp outdiff;
id Date Key;
run;
data WORK.DIFF_KEYS; /* keys of changed records */
set WORK.RESULT;
where _type_ = 'DIF';
keep Date Key;
run;
/* split NEW and CHANGED */
data
WORK.NEW
WORK.CHANGED
;
merge
WORK.RESULT (where=( _type_ ne 'DIF'));
WORK.DIFF_KEYS (in = d)
;
by Date Key;
if d then output WORK.CHANGED;
else output WORK.NEW;
run;
Edit2:
Now you can just APPEND the WORK.NEW to target table.
For WORK.CHANGED - either use MODIFY or UPDATE statement to update the records.
Depending on the size of the changes, you can also think about PROC SQL; DELETE to delete old records and PROC APPEND to add new values.
All a PROC COMPARE will do will tell you the differences between 2 datasets. To achieve your goal you need to use an UPDATE statement in a data step. This way, values in HAVE1 are updated with HAVE2 where the date and key match, or a new record inserted if there are no matches.
data have1;
input Date :date9. Key $ Var1 $ Var2;
format date date9.;
datalines;
01Aug2013 K1 a 2
01Aug2013 K2 a 3
02Aug2013 K1 b 4
;
run;
data have2;
input Date :date9. Key $ Var1 $ Var2;
format date date9.;
datalines;
01Aug2013 K1 a 3
01Aug2013 K2 a 3
02Aug2013 K1 b 4
03Aug2013 K2 c 1
;
run;
data want;
update have1 have2;
by date key;
run;

Update one dataset with another without using PROC SQL

I have the below two datasets
Dataset A
id age mark
1 . .
2 . .
1 . .
Dataset B
id age mark
2 20 200
1 10 100
I need the below dataset as output
Output Dataset
id age mark
1 10 100
2 20 200
1 10 100
How to carry out this without using PROC SQL i.e. using DATA STEP?
There are many ways to do this. The easiest is to sort the two data sets and then use MERGE. For example:
proc sort data=A;
by id;
run;
proc sort data=B;
by id;
run;
data WANT;
merge A(drop=age mark) B;
by ID;
run;
The trick is to drop the variables you are adding from the first data set A; the new variables will come from the second data set B.
Of course, this solution does not preserve the original order of the observations in your data set AND only works because your second data set contains unique values of id.
I tried this and it worked for me, even if you have data you would like to preserve in that column. Just for completeness sake I added an SQL variant too.
data a;
input id a;
datalines;
1 10
2 20
;
data b;
input id a;
datalines;
1 .
1 5
1 .
2 .
3 4
;
data c (drop=b);
merge a (rename = (a=b) in=ina) b (in = inb);
by id;
if b ne . then a = b;
run;
proc sql;
create table d as
select a.id, a.a from a right join b on a.id=b.id where a.id is not null
union all
select b.id, b.a from a right join b on a.id = b.id where a.id is null
;
quit;