Conditional Merge in SAS data step

I have two tables, which for the sake of simplicity are:
Table 1:
Date ID
20170401 X
20170501 Y
20170601 Z
Table 2:
Date ID
20170201 Z
20170301 Y
20170501 X
I want to create a new table which has everything from table 1, unless the ID occurred at a previous date in table 2.
The desired output for table 1 and 2 would be:
Date ID
20170401 X
This is what I currently have. I'm not sure where to put the conditional merge:
data new;
merge table1 table2(in=b);
by date ID;
if not b [where table2.date is before table1.date];
run;
Thanks

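SAS overwrites variables that share a name, so rename date in table2. Sort both tables by ID, merge on ID, and keep a row only if it comes from table1 and table2 has no earlier date for that ID. The in=b check matters: for an ID that is absent from table2, date2 is missing, and a missing value compares as less than any number, so without the check that row would be dropped. Then drop the date2 column as it's not needed.
proc sort data=table1; by ID; run;
proc sort data=table2; by ID; run;
data new;
merge table1(in=a) table2(in=b rename=(date=date2));
by ID;
/* keep table1 rows unless table2 has an earlier date for the same ID */
if a and not (b and date2 < date);
drop date2;
run;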
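If sorting is inconvenient, or table2 can hold several rows per ID, the same rule can also be written as a PROC SQL anti-join. This is only a sketch: it assumes date is stored as a number (or a SAS date) in both tables, and it rebuilds the sample tables from the question so that it can be run on its own.

```sas
/* Sample data from the question; date is read as a plain number (assumption) */
data table1;
  input date ID $;
  datalines;
20170401 X
20170501 Y
20170601 Z
;
run;

data table2;
  input date ID $;
  datalines;
20170201 Z
20170301 Y
20170501 X
;
run;

/* Keep a table1 row only if no table2 row has the same ID and an earlier date */
proc sql;
  create table new as
  select t1.*
  from table1 as t1
  where not exists
    (select 1
     from table2 as t2
     where t2.ID = t1.ID
       and t2.date < t1.date);
quit;
```

For the sample data this leaves only the 20170401 X row. Because nothing is merged, no renaming is needed and an ID that never appears in table2 is kept automatically.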

Related

Updating and inserting new records into Teradata table from SAS

I have a state table sitting in Teradata with 11 million rows and a unique row for every ID. I run logic in SAS that if a column (class) is updated, it updates the Teradata with the new record. Table structure in Teradata and the table generated in SAS is:
id class updated_at
1 X date1
2 Y date2
If the class is updated in the SAS created table for an id, the class and updated_at columns are updated in Teradata (more columns can be updated as well). Moreover, if a new record (id) is added, it is inserted into Teradata.
I want to achieve this in SAS, without having to push the SAS table into Teradata, and use merge into. Every table created in SAS will be 11 million+ rows.
To update a record manually, I can just use this:
proc sql;
update TD.TABLE_IN_TERADATA
set class = 'Z'
where updated_at = date3;
quit;

As far as I understand, you have a Teradata master table with all your data, and new SAS tables with data to update that master table.
To generate some sample data (SAS tables only, as I don't have Teradata at hand):
data test_data;
input id 2. class $2. updated_at date9.;
format updated_at date9.;
datalines;
1 X 01jan2020
2 Y 12feb2020
3 Z 01jan2020
4 X 16mar2020
5 Y 23jun2020
6 Z 23jun2020
7 X 31dec2020
;
run;
data sas_data;
input id 2. class $2. ;
format updated_at date9.;
updated_at=today();
datalines;
1 Z
3 Z
5 Z
7 Z
8 Y
9 Z
;
run;
So, we have changes in id=1, 5 and 7, whereas 3 is unchanged and 8 and 9 are new.
In pure SAS code you can use a data step with update to update and insert in one step, see here:
/* Any data row without change has to be eliminated, */
/* here id=3, otherwise updated_at will be updated there */
proc sql;
create table changed_data as
select s.*
from sas_data s
left join test_data t
on s.id eq t.id
where s.class ne t.class
/* the data step UPDATE below needs its input in id order */
order by s.id;
quit;
/* in sas update and insert via data-update-step */
data test_data1;
update test_data changed_data;
by id;
run;
As the comment notes, the first SQL step is only needed if you don't want updated_at to change for id=3, where nothing else changed. If you do want it refreshed as well, you can remove this step.
By the way, a precondition here is that both tables are sorted by id, or that there is an index on id.
The SAS data step may not work against the Teradata table, though. In that case you could use the following steps in "pure" SQL (starting with the first step above to generate the table changed_data) plus an append step:
/* Alternative steps in pure SQL */
/* Step1: SQL-update, no insert */
proc sql;
update test_data t
set class=(select class from changed_data s where t.id=s.id),
updated_at=(select updated_at from changed_data s where t.id=s.id)
where id in (select id from changed_data)
;
quit;
/* Preparation for step2: extract completely new data */
proc sql;
create table new_data as
select s.*
from sas_data s
where id not in (select id from test_data)
;
quit;
/* Step2: insert new data via proc-append */
proc append base=test_data
data=new_data;
run;
Generally, performance may be poor with big data sets. In that case, consider using explicit pass-through to the database and the Teradata "upsert", but then you will have to move your SAS data into Teradata.

SAS Set a variable equal to another variable, rename it, and then merge

I am trying to set the value of a variable to the value of another variable, then rename the original variable, then merge, using the following code. (MK_RETURN_DATA is a subset of RETURNOUTSET. I just want to merge MK_RETURN_DATA with RETURNOUTSET, with one variable in MK_RETURN_DATA renamed.)
data RETURNOUTSET;
CUM_RETURN = return_sec;
run;
PROC SQL;
CREATE TABLE MK_RETURN AS
SELECT a.*
FROM
RETURNOUTSET a
WHERE a.SYMBOL = 'SPY';
QUIT;
DATA MK_RETURN_DATA;
SET MK_RETURN;
RENAME RETURN_SEC=MK_RETURN_RATE;
DROP SYMBOL;
RUN;
proc sort data=MK_RETURN_DATA; by Date Time; run;
proc sort data=RETURNOUTSET; by Date Time; run;
data WITH_MARKET;
merge RETURNOUTSET(IN=C) MK_RETURN_DATA(IN=D);
by Date Time;
if C;
run;
However, I am getting very weird results in the first block of data with symbol "A" in WITH_MARKET. The value of CUM_RETURN is actually equal to the value of MK_RETURN_RATE, while I wanted it to be return_sec.
What happened?

You should be able to do this with dataset options.
First make sure the data is sorted.
proc sort data=RETURNOUTSET; by Date Time; run;
Then merge that dataset back with itself and use the appropriate KEEP, RENAME and WHERE dataset options to select the correct records to merge onto the original data.
data WITH_MARKET;
merge RETURNOUTSET(IN=C)
RETURNOUTSET(IN=D
keep=symbol return_sec date time
rename=(symbol=x_symbol return_sec=MK_RETURN_RATE)
where=(x_symbol='SPY')
)
;
by Date Time;
if C;
drop x_symbol ;
run;
If you do not have SYMBOL='SPY' records for all of the DATE TIME values in your original data then the merge might not work. Or if you have multiple SYMBOL='SPY' records for the same DATE TIME values then you also might have trouble with this merge.

All of what you did up to this point is the same as this one datastep. You put RETURN_SEC in CUM_RETURN, you filtered down to SYMBOL='SPY', and you renamed RETURN_SEC to MK_RETURN_RATE.
DATA MK_RETURN_DATA;
SET returnoutset(where=(symbol='SPY'));
cum_return = return_sec;
RENAME RETURN_SEC=MK_RETURN_RATE;
DROP SYMBOL;
RUN;
So ... CUM_RETURN equals MK_RETURN_RATE equals the former RETURN_SEC, as far as I can tell. What were you actually trying to do?
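One plausible explanation for the symptom in the question, assuming CUM_RETURN ends up in both RETURNOUTSET and MK_RETURN_DATA: when two datasets in a MERGE share a variable that is not a BY variable, the value from the right-hand dataset overwrites the left-hand one on the first row of each BY group. The toy datasets below (left_ds, right_ds and their values are made up for illustration) sketch that behaviour.

```sas
/* Hypothetical toy data: two symbols per key on the left, one row per key on the right */
data left_ds;
  input key symbol $ x;
  datalines;
1 A 10
1 B 20
2 A 30
2 B 40
;
run;

data right_ds;
  input key x;
  datalines;
1 99
2 77
;
run;

/* x exists in both inputs and is not a BY variable */
data merged;
  merge left_ds right_ds;
  by key;
run;

proc print data=merged;
run;
```

Here the first row of each key takes x from right_ds (99 and 77), while the remaining rows keep the left-hand values (20 and 40). That matches the pattern of only the first symbol in each Date/Time group being affected. Dropping or renaming the shared variable on one side with a dataset option avoids the clash.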

select only a few columns from a large table in SAS

I have to join 2 tables on a key (say ABC). I have to update one single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A:
Contains some 100 columns. Key column: ABC.
TABLE B:
Contains just 2 columns: the key column ABC and status_cd.
Table A, which I use in this left join query, has more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL without creating a new column from the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.

You can take advantage of dataset options to make it so you can use wildcards in the select statement. Note that the order of the columns could change doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;

I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd= coalesce(status_cd,
(select status_cd
from transaction b
where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted, the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric, you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.

Update the values of a column in a dataset with another table

If I have Table A with two columns: ID and Mean, and Table B with a long list of columns including Mean, how can I replace the values of the Mean column in Table B with the IDs that exist in Table A?
I've tried PROC SQL UPDATE and both DATASET MERGE and DATASET UPDATE but they keep adding rows when the number of columns is not equal in both tables.

data want;
merge have1(in=H1) have2(in=H2);
by mergevar;
if H1;
run;
That will guarantee that H2 does not add any rows, unless there are duplicate values for one of the by values. Other conditions can be used as well; if h2; would do about the same thing for the right-hand dataset, and if h1 and h2; would only keep records that come from both tables.
PROC SQL join should also work fairly easily.
proc sql;
create table want as
select A.id, coalesce(B.mean, A.mean) as mean
from A left join B
on A.id=B.id;
quit;
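If the goal is to change table B in place rather than build a new table, a correlated PROC SQL UPDATE is one option. The sketch below assumes that B also carries the ID column and that ID is unique in A; the sample values are invented.

```sas
/* Invented sample data: A holds the replacement means, B is the wide table */
data A;
  input id mean;
  datalines;
1 5.5
3 7.5
;
run;

data B;
  input id mean other $;
  datalines;
1 1.0 a
2 2.0 b
3 3.0 c
;
run;

/* Overwrite B.mean only where the ID exists in A; no rows are added */
proc sql;
  update B
  set mean = (select A.mean from A where A.id = B.id)
  where exists (select 1 from A where A.id = B.id);
quit;
```

After this step B still has three rows: the means for id 1 and 3 come from A, and id 2 keeps its original value.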

SAS data step/ proc sql insert rows from another table with auto increment primary key

I have 2 datasets as below
id name status
1 A a
2 B b
3 C c
Another dataset
name status new
C c 0
D d 1
E e 1
F f 1
How do I insert all rows from 2nd table to 1st table? The situation is that the first table is permanent. The 2nd table is updated monthly, so I would like to add all rows from the monthly updated table to the permanent table, so that it would look like this
id name status
1 A a
2 B b
3 C c
4 D d
5 E e
6 F f
The problem I'm facing is that I cannot increment the id from dataset 1. As far as I could find, SAS datasets do not have an auto-increment property. Auto-increment can be done with a data step, but I don't know whether a data step can be used in a case with 2 tables like this.
The usual sql would be
Insert into table1 (name, status)
select name, status from table2 where new = 1;
But SAS datasets do not support an auto-increment column, hence the problem I'm facing.
I could solve it by using SAS data step as below after the above proc sql
data table1;
set table1;
if _n_ > 3 then id = _n_;
run;
This would increase the value of id column, but the code is kinda ugly, and also the id is a primary key, and being used as a foreign key in other table, so I don't want to mess up the ids of old rows.
I'm in the process of both learning and working with SAS so help is really appreciated. Thanks in advance.
Extra question:
If the 2nd table does not have the new column, is there any way to complete what I want (add new row from monthly table (2nd) to permanent table (1st)) with data step? Currently, I use this ugly proc sql/data step to create new column
proc sql; /* create a temp table from table2 */
create table t2temp as select t2.*,
(case when t2.name = t1.name and t2.status = t1.status then 0 else 1 end) as new
from table2 as t2
left join table1 as t1
on t2.name = t1.name and t2.status = t1.status;
drop table table2; /* drop the old table2 with no column "new" */
quit;
data table2; /* rename the t2temp as table2 */
set t2temp;
run;

You can do it in the datastep. BTW, if you were creating it entirely anew, you could just use
id+1;
to create an autonumbered field (assuming your data step wasn't too complicated). This will keep track of the current highest ID number and assign one higher to each row as you go if it is in the new dataset.
data have;
input id name $ status $;
datalines;
2 A a
3 B b
1 C c
;;;;
run;
data addon;
input name $ status $ new;
datalines;
C c 0
D d 1
E e 1
F f 1
;;;;
run;
data want;
retain _maxID; *keep the value of _maxID from one row to the next,
do not reset it;
set have(in=old) addon(in=add); *in= creates a temporary variable indicating which
dataset a row came from;
if (old) or (add and new); *as in C etc., 0 and missing (null) are false,
any other number is true;
if add then ID = _maxID+1; *assigns ID to the new records;
_maxID = max(id,_maxID); *determines the new maximum ID -
this structure guarantees it works
even if the old DS is not sorted;
put id= name=;
drop _maxID;
run;
Response to second question:
Yes, you can still do that. One of the easiest ways is, if you have the datasets sorted by NAME:
data want;
retain _maxID;
set have(in=old) addon(in=add);
by name;
if (old) or (add and first.name);
if add then ID = _maxID+1;
_maxID = max(id,_maxID);
put id= name=;
run;
first.name will be true for the first record with the same value of name; so if HAVE has a value of that name, then ADDON will not be permitted to add a new record.
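This does require name to be unique in HAVE, or you might delete some records. If that is not true then you need a more complicated solution. Also note that the BY statement interleaves the two datasets, so _maxID only reflects the HAVE rows read so far; a new ID can collide with an existing one if a HAVE row with a higher ID sorts after the new name.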
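If name is not unique in HAVE, or the datasets are not sorted, one alternative is to find the current maximum ID first and number only the genuinely new rows. This is a sketch, not the method used above: the macro variable maxid and the datasets new_rows and new_rows_numbered are names made up here, and it assumes a row counts as new when its name and status pair does not already appear in HAVE.

```sas
data have;
  input id name $ status $;
  datalines;
2 A a
3 B b
1 C c
;
run;

data addon;
  input name $ status $;
  datalines;
C c
D d
E e
F f
;
run;

proc sql noprint;
  /* current highest ID in the permanent table */
  select max(id) into :maxid from have;
  /* rows of ADDON whose name/status pair is not yet in HAVE */
  create table new_rows as
  select a.name, a.status
  from addon as a
  where not exists
    (select 1
     from have as h
     where h.name = a.name and h.status = a.status);
quit;

/* number the new rows starting after the current maximum */
data new_rows_numbered;
  set new_rows;
  id = &maxid + _n_;
run;

/* add them to the permanent table without touching existing IDs */
proc append base=have data=new_rows_numbered;
run;
```

With the sample data the three new rows D, E and F receive the IDs 4 to 6, and the IDs of the existing rows are left alone. Because the maximum is computed before any row is numbered, the result does not depend on the sort order of either dataset.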
