Assume we have a table P_DEF in which we want to update the value of column RUN_ID for a certain subset, which we have stored in another table TMP. Here is how I would do it in SQL:
update P_DEF
set RUN_ID = (-1) * TMP.RUN_ID /* change the sign of the value */
from P_DEF
inner join TMP
on P_DEF.RUN_ID = TMP.RUN_ID
and P_DEF.ITEM_ID = TMP.ITEM_ID
and P_DEF.ITEM_TITLE = TMP.ITEM_TITLE
Now the big question: to my knowledge, PROC SQL does not support this kind of filtered update. So how do I do this with a minimal number of transformations in SAS DI(S)?
Update by join is not supported in SAS SQL, but you can do a correlated update: update with values from a correlated subquery:
data P_DEF;
infile cards;
length RUN_ID_ORIG 8;
input RUN_ID ITEM_ID ITEM_TITLE $20.;
RUN_ID_ORIG = RUN_ID;
cards;
1 1 some title
1 1 should be negative
1 2 another title
1 3 should be negative
4 44 another title
5 44 should be negative
;
run;
data TMP;
infile cards;
input RUN_ID ITEM_ID ITEM_TITLE $20. @30 NEW_ID;
cards;
1 1 should be negative       100
1 3 should be negative       123
5 44 should be negative      188
;
run;
proc sql;
/* this unintentionally updates all records; non-matched rows are set to null */
update P_DEF
set RUN_ID = (select NEW_ID from TMP
where P_DEF.RUN_ID = TMP.RUN_ID
and P_DEF.ITEM_ID = TMP.ITEM_ID
and P_DEF.ITEM_TITLE = TMP.ITEM_TITLE )
;
select * from P_DEF
;
quit;
The correlated update is not sufficient when there are non-matches, so you need to add a filter so that only matched rows are updated.
When joining on multiple columns, I usually rely on catx to get unique values
(depending on your data, you might need to use different numeric formats in the put functions):
proc sql;
update P_DEF set RUN_ID = RUN_ID_ORIG; /* reset RUN_ID */
quit;
/* correct "inner join" update */
proc sql;
update P_DEF
set RUN_ID = (select NEW_ID from TMP
where P_DEF.RUN_ID = TMP.RUN_ID
and P_DEF.ITEM_ID = TMP.ITEM_ID
and P_DEF.ITEM_TITLE = TMP.ITEM_TITLE )
where
catx('#', put(RUN_ID, 16.), put(ITEM_ID, 16.), ITEM_TITLE)
in ( select catx('#', put(RUN_ID, 16.), put(ITEM_ID, 16.), ITEM_TITLE)
from TMP )
;
select * from P_DEF;
quit;
The version above differs slightly from your exact example, to show how to get a value from the subquery - the NEW_ID column.
A simplified version, where you only use the lookup table to identify the rows to be updated, is this:
proc sql;
update P_DEF set RUN_ID = RUN_ID_ORIG; /* reset RUN_ID */
quit;
proc sql;
/* simplified for your case:
you don't actually use a value from TMP that doesn't exist in P_DEF */
update P_DEF
set RUN_ID = -1 * RUN_ID
where
RUN_ID > 0 /* so we can rerun this if needed */
and catx('#', put(RUN_ID, 16.), put(ITEM_ID, 16.), ITEM_TITLE)
in ( select catx('#', put(RUN_ID, 16.), put(ITEM_ID, 16.), ITEM_TITLE)
from TMP )
;
select * from P_DEF;
quit;
As you can see, the correlated update might need two subqueries to update a single column, so don't expect it to be performant on bigger tables. You might be better off with data step methods: the MERGE, MODIFY or UPDATE statements.
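For instance, a sorted MERGE can do the same filtered update in one pass. This is a sketch only, assuming the P_DEF/TMP layout from the example above, with NEW_ID holding the replacement value:

```sas
/* sketch: data-step MERGE alternative to the correlated update */
proc sort data=P_DEF; by RUN_ID ITEM_ID ITEM_TITLE; run;
proc sort data=TMP;   by RUN_ID ITEM_ID ITEM_TITLE; run;

data P_DEF;
merge P_DEF(in=a) TMP(in=b keep=RUN_ID ITEM_ID ITEM_TITLE NEW_ID);
by RUN_ID ITEM_ID ITEM_TITLE;
if a;                       /* keep only P_DEF rows        */
if b then RUN_ID = NEW_ID;  /* overwrite matched rows only */
drop NEW_ID;
run;
```

The two sorts are the main cost; on large tables this is usually still cheaper than two correlated subqueries per row.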
As for the SAS Data Integration Studio transformation you asked about, I believe you can achieve this with the SCD Type 1 Loader; it will generate some of the code I mentioned.
Related
I have a state table sitting in Teradata with 11 million rows and a unique row for every ID. I run logic in SAS that if a column (class) is updated, it updates the Teradata with the new record. Table structure in Teradata and the table generated in SAS is:
id  class  updated_at
1   X      date1
2   Y      date2
If the class is updated in the SAS created table for an id, the class and updated_at columns are updated in Teradata (more columns can be updated as well). Moreover, if a new record (id) is added, it is inserted into Teradata.
I want to achieve this in SAS, without having to push the SAS table into Teradata, and use merge into. Every table created in SAS will be 11 million+ rows.
To update a record manually, I can just use this:
proc sql;
update TD.TABLE_IN_TERADATA
set class = 'Z'
where updated_at = date3;
quit;
As far as I understand, you have a Teradata master table with all your data. Then you have new SAS tables with data to update your master data.
To generate some sample data (only SAS tables, I don't have Teradata at hand...):
data test_data;
input id 2. class $2. updated_at date9.;
format updated_at date9.;
datalines;
1 X 01jan2020
2 Y 12feb2020
3 Z 01jan2020
4 X 16mar2020
5 Y 23jun2020
6 Z 23jun2020
7 X 31dec2020
;
run;
data sas_data;
input id 2. class $2. ;
format updated_at date9.;
updated_at=today();
datalines;
1 Z
3 Z
5 Z
7 Z
8 Y
9 Z
;
run;
So, we have changes in id=1, 5 and 7, whereas 3 is unchanged and 8 and 9 are new.
In pure SAS code you can use a data step with update to update and insert in one step, see here:
/* Any data row without change has to be eliminated, */
/* here id=3, otherwise updated_at will be updated there */
proc sql;
create table changed_data as
select s.*
from sas_data s
left join test_data t
on s.id eq t.id
where s.class ne t.class;
quit;
/* in sas update and insert via data-update-step */
data test_data1;
update test_data changed_data;
by id;
run;
As the comment notes, the first SQL step is only needed if you don't want updated_at to be updated for id=3, where there is no change. But maybe you want that updated as well; then you can remove this step.
By the way, a precondition here is that the table is sorted by id, or that there is an index on id in the table.
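If you'd rather not sort, an index can be created instead. A sketch, assuming the table sits in the WORK library:

```sas
/* sketch: create a unique index on id so the data-step
   update works without a prior sort */
proc datasets library=work nolist;
modify test_data;
index create id / unique;
run;
quit;
```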
But it might be that the SAS data step will not work with the Teradata table. Then you could use the following steps in "pure" SQL (starting with the first step above to generate the table changed_data), plus an append step:
/* Alternative steps in pure SQL */
/* Step1: SQL-update, no insert */
proc sql;
update test_data t
set class=(select class from changed_data s where t.id=s.id),
updated_at=(select updated_at from changed_data s where t.id=s.id)
where id in (select id from changed_data)
;
quit;
/* Preparation for step2: extract completely new data */
proc sql;
create table new_data as
select s.*
from sas_data s
where id not in (select id from test_data)
;
quit;
/* Step2: insert new data via proc-append */
proc append base=test_data
data=new_data;
run;
Generally, your performance might be poor with big data sets. Then consider using pass-through SQL to the database and the Teradata "upsert" (MERGE), but then you will have to move your SAS data into Teradata.
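Such an upsert could be sketched roughly as below. The connection options and the assumption that changed_data has already been staged in Teradata are placeholders, not tested code:

```sas
/* sketch: explicit pass-through "upsert" via Teradata MERGE;
   user/password/server and the staged changed_data table
   are assumptions */
proc sql;
connect to teradata (user=myuser password=mypass server=myserver);
execute (
merge into test_data t
using changed_data s
on t.id = s.id
when matched then update
set class = s.class, updated_at = s.updated_at
when not matched then insert (id, class, updated_at)
values (s.id, s.class, s.updated_at)
) by teradata;
disconnect from teradata;
quit;
```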
I have a dataset with the first 4 columns and I want to create the last column. My dataset has millions of records.
ID  Date      Code  Event of Interest  Want to Create
1   1/1/2022  101   *                  201
1   1/1/2022  201   yes                201
1   1/1/2022  301   *                  201
1   1/1/2022  401   *                  201
2   1/5/2022  101   *                  301
2   1/5/2022  201   *                  301
2   1/5/2022  301   yes                301
I want to group records by ID and date. If one of the records in the grouping has a 'yes' in the event of interest variable, I want to assign that code to the entire grouping. I am using base SAS.
Any ideas?
Assuming that you will only have one yes value for each id and date, you can use a lookup table and merge them together. Here are a few ways to do it.
1. Self-merge
Simply merge the data onto itself where event = yes.
data want;
merge have
have(rename=(code = new_code
event = _event_)
where =(upcase(_event_) = 'YES')
)
;
by id date;
drop _event_;
run;
2. SQL Self-join
Same as above, but using a SQL inner join.
proc sql;
create table want as
select t1.*
, t2.code as new_code
from have as t1
INNER JOIN
have as t2
ON t1.id = t2.id
AND t1.date = t2.date
where upcase(t2.event) = 'YES'
;
quit;
3. Hash lookup table
This is more advanced but can be quite performant if you have the memory. Notice that it looks very similar to our merge statement in Option 1. We're creating a lookup table, loading it into memory, and using a hash join to pull values from that in-memory table. h.Find() checks the unique combination of (id, date) on the row read by the set statement against the hash table in memory. If a match is found, it pulls the value of new_code.
data want;
set have;
if(_N_ = 1) then do;
dcl hash h(dataset: "have(rename=(code= new_code)
where =(upcase(event) = 'YES')
)"
, hashexp:20);
h.defineKey('id', 'date');
h.defineData('new_code');
h.defineDone();
call missing(new_code);
end;
rc = h.Find();
drop rc;
run;
You could just remember the last value of CODE you want for the group by using a double DOW loop.
In the first loop copy the code value to the new variable. The second loop can re-read the observations and write them out with the extra variable filled in.
data want;
do until (last.date);
set have;
by id date ;
if 'Event of Interest'n='yes' then 'Want to Create'n=code;
end;
do until (last.date);
set have;
by id date;
output;
end;
run;
Given the dataset below, I am trying to find New Users vs. Repeated Users.
DATE ID Unique_Event
20200901 a12345 1
20200902 a12345 1
20200903 b12345 1
20200903 a12345 1
20200904 c12345 1
In the above dataset, since a12345 appeared on multiple dates, it should be counted as a "repeated" user, whereas b12345 only appeared once, so it is a "new" user. Please note, this is only sample data; the actual data is quite large. I tried the code below, but I am not getting the correct count. Ideally, tot_num_users - num_new_users should give the repeated users, but I am getting incorrect counts. Am I missing something?
Expected Output:
Month new_users repeated_users
9 2 1
Code:
data user_events;
set user_events;
new_date=input(date,yymmdd10.);
run;
proc sql;select month(new_date) as mm,
count(distinct vv.id) as total_num_users,
count(distinct case when v.new_date = vv.minva then v.id end) as num_new_users,
(count(distinct vv.id) - count(distinct case when v.new_date = vv.minva then id end)
) as num_repeated_users
from user_events v inner join
(select t.id, min(new_date) as minva
from user_events t
group by t.id
) vv
on v.id = vv.id
group by 1
order by 1;quit;
In a sub-select, you can count the number of distinct DATE values for each ID to determine its new/repeated status. The aggregate computations across all ids are then made from the sub-select.
proc sql;
create table freq as
select
count(*) as id_count
, sum (status='repeated') as id_repeated_count /* sum counts rows where the logical test is true */
, sum (status='new') as id_new_count
from
( select
id
, case
when count(distinct date) > 1 then 'repeated'
else 'new'
end as status
from
user_events
group by
id
) as statuses
;
quit;
An alternative solution not using proc sql (though I'm aware you tagged this with "proc sql").
data final;
set user_events;
Month=month(new_date);
run;
proc sort data=final; by Month ID; run;
data final;
set final;
by Month ID;
if first.Month then do;
new_users=0;
repeated_users=0;
end;
if last.ID then do;
if first.ID then
new_users+1;
else
repeated_users+1;
end;
if last.Month then
output;
keep Month new_users repeated_users;
run;
Since you are using proc sql, this is a sql question, not a SAS question.
Try something like:
proc sql;
select ID,count(Unique_Event)
from <that table>
group by ID
order by ID
quit;
Here is the structure of my data below:
I only need ID 2, since all its patients are alive; therefore I am trying to delete ID 1:
ID sex status
1 2 A
1 2 A
1 2 A
1 2 D
2 1 A
2 1 A
2 1 A
If you truly want to delete the records from your source dataset, you can do so with:
PROC SQL;
DELETE FROM MyData WHERE ID = 1;
QUIT;
However, if you want to retain the source dataset as-is (maybe you will use it again), it would be best to create a new dataset from it, like so:
PROC SQL;
CREATE TABLE MyFilteredData AS
SELECT ID, sex, status
FROM MyData
WHERE ID = 2;
QUIT;
or
DATA MyFilteredData;
SET MyData;
IF ID = 2;
RUN;
proc sql;
delete from your_data where id ~= 2;
quit;
This PROC SQL will create a new dataset Want from the original dataset Have including only IDs that have no status="D":
proc sql;
create table Want as
select *
from Have
where ID not in
(select distinct ID
from Have
where status="D")
;
quit;
I have to join 2 tables on a key column (ABC). I have to update one single column in table A using a coalesce function: coalesce(a.status_cd, b.status_cd).
TABLE A:
Contains some 100 columns. Key column: ABC.
TABLE B:
Contains just 2 columns: the key column ABC and status_cd.
Table A, which I use in this left join query, has more than 100 columns. Is there a way to use a.* followed by this coalesce function in my PROC SQL, without creating a new column via the PROC SQL; CREATE TABLE AS ... step?
Thanks in advance.
You can take advantage of dataset options to make it so you can use wildcards in the select statement. Note that the order of the columns could change doing this.
proc sql ;
create table want as
select a.*
, coalesce(a.old_status,b.status_cd) as status_cd
from tableA(rename=(status_cd=old_status)) a
left join tableB b
on a.abc = b.abc
;
quit;
I eventually found a fairly simple way of doing this in proc sql after working through several more complex approaches:
proc sql noprint;
update master a
set status_cd= coalesce(status_cd,
(select status_cd
from transaction b
where a.ABC = b.ABC))
where exists (select 1
from transaction b
where a.ABC = b.ABC);
quit;
This will update just the one column you're interested in and will only update it for rows with key values that match in the transaction dataset.
Earlier attempts:
The most obvious bit of more general SQL syntax would seem to be the update...set...from...where pattern as used in the top few answers to this question. However, this syntax is not currently supported - the documentation for the SQL update statement only allows for a where clause, not a from clause.
If you are running a pass-through query to another database that does support this syntax, it might still be a viable option.
Alternatively, there is a way to do this within SAS via a data step, provided that the master dataset is indexed on your key variable:
/*Create indexed master dataset with some missing values*/
data master(index = (name));
set sashelp.class;
if _n_ <= 5 then call missing(weight);
run;
/*Create transaction dataset with some missing values*/
data transaction;
set sashelp.class(obs = 10 keep = name weight);
if _n_ > 5 then call missing(weight);
run;
data master;
set transaction;
t_weight = weight;
modify master key = name;
if _IORC_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
/*Suppress log messages if there are key values in transaction but not master*/
else _ERROR_ = 0;
run;
A standard warning relating to the modify statement: if this data step is interrupted, the master dataset may be irreparably damaged, so make sure you have a backup first.
In this case I've assumed that the key variable is unique - a slightly more complex data step is needed if it isn't.
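If it isn't unique, one commonly used pattern (again a sketch only; verify the _IORC_ handling against your SAS version's documentation) loops so that every master observation matching the key gets updated:

```sas
/* sketch: modify with a loop to handle duplicate key
   values in the master dataset */
data master;
set transaction;
t_weight = weight;
do until (_iorc_ ne 0);
modify master key = name;
if _iorc_ = 0 then do;
weight = coalesce(weight, t_weight);
replace;
end;
else _error_ = 0; /* suppress messages for unmatched keys */
end;
run;
```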
Another way to work around the lack of a from clause in the proc sql update statement would be to set up a format merge, e.g.
data v_format_def /view = v_format_def;
set transaction(rename = (name = start weight = label));
retain fmtname 'key' type 'i';
end = start;
run;
proc format cntlin = v_format_def; run;
proc sql noprint;
update master
set weight = coalesce(weight,input(name,key.))
where master.name in (select name from transaction);
quit;
In this scenario I've used type = 'i' in the format definition to create a numeric informat, which proc sql uses to convert the character variable name to the numeric variable weight. Depending on whether your key and status_cd columns are character or numeric, you may need to do this slightly differently.
This approach effectively loads the entire transaction dataset into memory when using the format, which might be a problem if you have a very large transaction dataset. The data step approach should hardly use any memory as it only has to load 1 row at a time.