substituting values in column under if-condition SAS - if-statement

I am fairly new to SAS and I am trying to wrap my head around this problem. I have two tables, A and B.
The first one collects information about customers in terms of their purchases and the specific products they purchased.
A
Customer_id product_code
1111111 12345
1111111 34523
The second one is some sort of a dictionary, i.e. contains old and new codes, the old code being the key and the new one the updated version.
B
old new
34523 22256
89765 76576
My goal is to replace, in table A, every reference to an old code with its updated (new) version. In the end A should look like
Customer_id product_code
1111111 12345
1111111 22256
The approach that I would take in this example is the following (pseudo-code)
if A.product_code in B.old then
A.product_code = B.new
else
nothing
But I am struggling a bit with the SAS syntax to implement that.
I hope my issue is clear enough; do not hesitate to ask for further clarification if necessary.
Thanks to anyone who is willing to help.

How about
data a;
input Customer_id product_code;
datalines;
1111111 12345
1111111 34523
;
data b;
input old new;
datalines;
34523 22256
89765 76576
;
proc sql;
update a
set product_code =
(select new from b
where a.product_code = b.old)
where exists (
select 1
from b
where a.product_code = b.old)
;
quit;
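If you would rather leave A untouched, a join that builds a new table is one alternative. This is only a sketch: a_updated is a hypothetical output name, and the row order of the result is not guaranteed to match A.

```sas
proc sql;
  /* keep the new code when a mapping exists, otherwise keep the original code */
  create table a_updated as
  select a.Customer_id,
         coalesce(b.new, a.product_code) as product_code
  from a
  left join b
    on a.product_code = b.old;
quit;
```

Because B is keyed on old, each row of A matches at most one row of B, so the join should not multiply rows as long as old is unique in B.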

A very SAS way is to use the MODIFY statement with a hash lookup.
data master;
modify master;
if _n_ = 1 then do;
declare hash mappings(dataset:'code_changes(rename=new_code=code)');
mappings.defineKey('old_code');
mappings.defineData('code');
mappings.defineDone();
call missing(old_code);
end;
if mappings.find(key:code)=0 then replace;
run;
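The step above assumes a master dataset with a code column and a code_changes dataset with old_code and new_code columns, which are not the names used in the question. A minimal setup using the question's values, run before the MODIFY step, might look like this (customer_id is an assumed column name):

```sas
/* table to be updated in place; code plays the role of product_code */
data master;
  input customer_id code;
  datalines;
1111111 12345
1111111 34523
;
run;

/* lookup table of old -> new codes */
data code_changes;
  input old_code new_code;
  datalines;
34523 22256
89765 76576
;
run;
```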
Another MODIFY way is to read changes with a SET statement.
This example requires an index on the master table.
proc sql;
create index code on master(code);
quit;
data master;
set mappings;
reset = 1;
do until (_iorc_);
code = old_code;
modify master key=code keyreset=reset;
if _iorc_ = 0 then do;
code = new_code;
replace;
end;
reset = 0;
end;
_error_ = 0;
run;

Related

Flagging values based on subsequent occurrences using first. retain etc

Thanks in advance to anyone who can help. I have a dataset as below:
data smp;
infile datalines dlm=',';
informat identifier $7. trx_date $9. transaction_id $13. product_description $50. ;
input identifier $ trx_date transaction_id $ product_description $ ;
datalines;
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT F/FREE STRAWBERRY
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT F/FREE STRAWBERRY
Cust1,11Aug2016,20-0030417313,ONKEN BIOPOT FULL STRAWB/GRAIN
Cust1,11Aug2016,20-0030417313,RACHELS YOG GREEK NAT F/F/ORG
Cust1,03Nov2016,23-0040737060,RACHELS YOG GREEK NAT F/F/ORG
Cust3,13Feb2016,39-0070595440,COLLECT YOG LEMON
Cust3,21Jun2016,34-0050769524,AF YOG FARMHOUSE STRAWB/REDCUR
Cust3,21Jun2016,34-0050769524,Y/VALLEY GREEK HONEY ORGANIC
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK LEMON CURD ORG
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK YOG FRUITY FAVS
Cust3,21Jun2016,34-0050769524,Y/VALLEY THICK YOG STRAWB ORG
Cust3,26Jun2016,39-0430106897,TOTAL GREEK YOGURT 0%
Cust3,14Aug2016,54-0040266755,M/BUNCH SQUASHUMS STRAW/RASP
Cust3,14Aug2016,54-0040266755,MULLER CORNER STRAWBERRY
Cust3,14Aug2016,54-0040266755,TOTAL GREEK YOGURT 0%
Cust3,22Aug2016,54-0050447336,M/BUNCH SQUASHUMS STRAW/RASP
;
For each customer (and each of their purchases, identified by transaction_id), I want to flag each product that is repurchased on their next visit (only the next visit), on a rolling basis. So in the above dataset, the correct flags would be on rows 4, 12 and 13, because these products are bought on the customer's next visit.
I'm trying to do it with the following program:
proc sort data = smp out = td;
by descending identifier transaction_id product_description;
run;
DATA TD2(DROP=tmp_product);
SET td;
BY identifier transaction_id product_description;
RETAIN tmp_product;
IF FIRST.product_description and first.transaction_id THEN DO;
tmp_product = product_description;
END;
ATTRIB repeat_flag FORMAT=$1.;
IF NOT FIRST.product_description THEN DO;
IF tmp_product EQ product_description THEN repeat_flag ='Y';
ELSE repeat_flag = 'N';
END;
RUN;
proc sort data = td2;
by descending identifier transaction_id product_description;
run;
But it's not working. If someone could please help, it would be fab.
Best Wishes
Another method is to create a dummy group variable in both the original dataset and a temporary dataset. In the original dataset, group numbers each customer's visits in visit-date order. The temporary dataset starts from each customer's SECOND visit and numbers the visits one lower, so a row in the temporary dataset with a given group number belongs to the visit that follows the visit with the same group number in the original dataset. With this dummy group, a hash table lookup easily finds the products that were repurchased on the next visit.
proc sort data=smp;
by identifier trx_date;
run;
data have(drop=_group) temp(drop=group rename=(_group=group));
set smp;
by identifier trx_date;
if first.identifier then do;
group=1; _group=0;
end;
if dif(trx_date)>0 then do;
group+1; _group+1;
end;
if _group^=0 then output temp;
output have;
run;
data want;
if 0 then set temp;
if _n_=1 then do;
declare hash h(dataset:'temp');
h.definekey('identifier','group','product_description');
h.definedata('product_description');
h.definedone();
end;
set have;
flag=(h.find()=0);
drop group;
run;
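The steps above sort by trx_date and apply dif() to it, which only behaves chronologically if trx_date is a numeric SAS date. The question reads it as character ($9.), so a conversion along these lines may be needed first. This is a sketch; smp_num and trx_date_char are hypothetical names, and the later steps would then read smp_num instead of smp.

```sas
data smp_num;
  /* rename the character date so the numeric one can reuse the name */
  set smp(rename=(trx_date=trx_date_char));
  /* values such as 11Aug2016 match the DATE9. informat */
  trx_date = input(trx_date_char, date9.);
  format trx_date date9.;
  drop trx_date_char;
run;
```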
The method below will "look ahead" to the next row (opposite to LAG) after sorting so you can bring comparisons onto the same row for simple logic:
** convert character date to numeric **;
data smp1; set smp;
TRX_DATE_NUM = input(TRX_DATE,ANYDTDTE10.);
format TRX_DATE_NUM mmddyy10.;
run;
** sort **;
proc sort data = smp1;
by IDENTIFIER PRODUCT_DESCRIPTION TRX_DATE_NUM;
run;
** look ahead at the next observations and use logic to identify flags **;
data look_ahead;
set smp1;
by IDENTIFIER;
set smp1 (firstobs = 2
keep = IDENTIFIER PRODUCT_DESCRIPTION TRX_DATE_NUM
rename = (IDENTIFIER = NEXT_ID PRODUCT_DESCRIPTION = NEXT_PROD TRX_DATE_NUM = NEXT_DT))
smp1 (obs = 1 drop = _ALL_);
if last.IDENTIFIER then do;
NEXT_ID = "";
NEXT_PROD = "";
NEXT_DT = .;
end;
run;
** logic says if the next row is the same customer who bought the same product on a different date then flag **;
data look_ahead_final; set look_ahead;
if IDENTIFIER = NEXT_ID and NEXT_PROD = PRODUCT_DESCRIPTION and TRX_DATE_NUM ne NEXT_DT then FLAG = 1;
else FLAG = 0;
run;
There are a few ways to do this; I think the simplest to understand, while still having a reasonable level of performance, is to sort the data in descending date order and then use an array to store the product_descriptions of the last trx_date.
Here I use a 2 dimensional array where the first dimension is just a 1/2 value; each trx_date simultaneously loads one row of the array and checks against the other row of the array (using _array_switch to determine which is being loaded/checked).
You could do the same thing with a hash table, and it would be appreciably faster along with perhaps a bit less complicated in some ways; if you are familiar with hash tables and want to see that solution comment and I or someone else can provide it.
You also could use SQL to do this, and I suspect that is the most common solution overall, but I couldn't quite get it to work, as it has some complexity with subqueries within subqueries the way I was approaching it, and I'm apparently not good enough with those.
Here's the array solution. Set the second dimension of prods to a reasonable maximum for your data. It could even be in the thousands: this is a temporary array and does not use much memory, so setting it to 32000 or so would not be a big deal.
proc sort data=smp;
by identifier descending trx_date ;
run;
data want;
array prods[2,20] $255. _temporary_;
retain _array_switch 2;
do _n_ = 1 by 1 until (last.trx_date);
set smp;
by identifier descending trx_date;
/* for first row for an identifier, clear out the whole thing */
if first.identifier then do;
call missing(of prods[*]);
end;
/* for first row of a trx_date, clear out the array-row we were looking at last time, and switch _array_switch to the other value */
if first.trx_date then do;
do _i = 1 to dim(prods,2);
if missing(prods[_array_switch,_i]) then leave;
call missing(prods[_array_switch,_i]);
end;
_array_switch = 3-_array_switch;
end;
*now check the array to see if we should set next_trans_flag;
next_trans_flag='N';
do _i = 1 to dim(prods,2);
if missing(prods[_array_switch,_i]) then leave; *for speed;
if prods[_array_switch,_i] = product_description then next_trans_flag='Y';
end;
prods[3-_array_switch,_n_] = product_description; *set for next trx_date;
output;
end;
drop _:;
run;
I think to really answer this you need to generate a list of distinct visit*product combinations. And also a list of the distinct products bought on particular visits.
proc sql noprint ;
create table bought as
select distinct identifier, product_description, trx_date, transaction_id
from smp
order by 1,2,3,4
;
create table all_visits as
select a.identifier, product_description, trx_date, transaction_id
from (select distinct identifier,product_description from bought) a
natural join (select distinct identifier,transaction_id,trx_date from bought) b
order by 1,2,3,4
;
quit;
You can then combine them and make a flag for whether the product was bought on that visit.
data check ;
merge all_visits bought(in=in1) ;
by identifier product_description trx_date transaction_id ;
bought=in1;
run;
You can now use a lead technique to figure out whether they also bought the product on the next visit.
data flag ;
set check ;
by identifier product_description trx_date transaction_id ;
set check(firstobs=2 keep=bought rename=(bought=bought_next)) check(drop=_all_ obs=1);
if last.product_description then bought_next=0;
run;
You can then combine back with the actual purchases and eliminate the extra dummy records.
proc sort data=smp;
by identifier product_description trx_date transaction_id ;
run;
data want ;
merge flag smp (in=in1);
by identifier product_description trx_date transaction_id ;
if in1 ;
run;
Let's put the records back into the original order so we can check the results. This assumes the input dataset has a row-number variable named row.
proc sort; by row; run;
proc print; run;

Delete all observations, which are doubled on some variable

Suppose I have a table:
Name Age
Bob 4
Pop 5
Yoy 6
Bob 5
I want to delete all rows whose name is not unique in the table:
Name Age
Pop 5
Yoy 6
At the moment, my solution is to make a new table with the count of each name:
Name Count
Bob 2
Pop 1
Yoy 1
And then keep only the names whose Count = 1.
I believe there are much more elegant solutions.
If I understand you correctly there are two ways to do it:
The SQL Procedure
In SAS you may not need to use a summarisation function such as MIN() as I have here, but when there is only one of name then min(age) = age anyway, and when migrating this to another RDBMS (e.g. Oracle, SQL Server) it may be required:
proc sql;
create table want as
select name, min(age) as age
from have
group by name
having count(*) = 1;
quit;
Data Step
Requires the data to be pre-sorted:
proc sort data=have out=have_stg;
by name;
run;
When doing SAS data-step by group processing, the first. (first-dot) and last. (last-dot) variables are generated which denote whether the current observation is the first and/or last in the by-group. Using SAS conditional logic one can simply test if first.name = 1 and last.name = 1. Reducing this using logical shorthand becomes:
data want;
set have_stg;
by name;
if first.name and last.name;
/* Equivalent to:*/
*if first.name = 1 and last.name = 1;
run;
I left both versions in the code above, use whichever version you find more readable.
You can use proc sort with the nouniquekey option. Then use uniqueout= to output the observations with unique keys and out= to output the duplicates (the out= option is necessary if you don't want to overwrite your original dataset).
proc sort data = have nouniquekey uniqueout = unique out = dups;
by name;
run;
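The count-based idea from the question can also be written as a single query that keeps every column without an aggregate. A minimal sketch, assuming the input table is called have as in the answers above:

```sas
proc sql;
  /* keep only rows whose name occurs exactly once in the table */
  create table want as
  select *
  from have
  where name in (select name
                 from have
                 group by name
                 having count(*) = 1);
quit;
```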

I want to add auto_increment column in a table in SAS

I want to add an auto-increment column to a table in SAS. The following code adds a column but does not increment its value.
Thanks in advance.
proc sql;
alter table pmt.W_cur_qtr_recoveries
add ID integer;
quit;
Wow, going to try for my second "SAS doesn't do that" answer this morning. Risky stuff.
A SAS dataset cannot define an auto-increment column. Whether you are creating a new dataset or inserting records into an existing dataset, you are responsible for creating any increment counters (ie they are just normal numeric vars where you have set the values to what you want).
That said, there are DATA step statements such as the sum statement (e.g. MyCounter+1) that make it easier to implement counters. If you describe more details of your problem, people could provide some alternatives.
The correct answer at this time is to create the ID yourself, BUT the discussion wouldn't be complete without mentioning that there is an unsupported SQL function Monotonic that can do what you want. It's not reliable, yet it persists.
The code pattern for its usage is
select monotonic() as ID, ....
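For illustration only, a complete query using that pattern might look like the sketch below. monotonic() is undocumented and unsupported, so treat the numbering as unreliable, especially with joins, sorting, or where clauses. class_id is a hypothetical output name.

```sas
proc sql;
  /* number the rows as they are read from the source table */
  create table class_id as
  select monotonic() as ID, *
  from sashelp.class;
quit;
```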
Use the _N_ automatic variable in a data step like:
DATA TEMPLIB.my_dataset (label="my dataset with auto increment variables");
SET TEMPREP.my_dataset;
sas_incr_num = _N_; * add an auto increment 'sas_incr_num' variable;
sas_incr_cat = cat("AB.",cats(repeat("0",5-ceil(log10(sas_incr_num+1))),sas_incr_num),".YZ"); * auto increment the sas_incr_num variable and add 5 leading zeros and concatenate strings on either end;
LABEL
sas_incr_num="auto number each row"
sas_incr_cat="auto number each row, leading zeros, and add strings along for fun"
...
There is no such thing as an auto increment column in a SAS dataset. You can use a data step to create a new dataset that has the new variable. You can use the same name to have it replace the old one when done.
data pmt.W_cur_qtr_recoveries;
set pmt.W_cur_qtr_recoveries;
ID+1;
run;
It really depends on what your intended outcome is, but I have thrown together an example of how you may want to tackle this. It is a little rough, but gives you something to work from.
/*JUST SETTING UP THE DAY ONE DATA WITH AN ID ATTACHED
YOU WOULD MAKE THE FIRST RUN EXECUTE DIFFERENTLY TO SUBSEQUENT RUNS BY USING THE EXISTS FUNCTION AND MACRO LANGUAGE,
BUT I WILL LET YOU INVESTIGATE THIS FURTHER AS IT MAY BE IRRELEVANT.*/
DATA DAY1;
SET SASHELP.CLASS;
ID+1;
RUN;
/*ON DAY 2 WE ARE APPENDING ADDITIONAL RECORDS TO THE EXISTING DATASET*/
DATA DAY2;
/*APPEND DATASETS*/
SET DAY1 SASHELP.CLASS;
/*HOLD VALUE IN PROGRAM DATA VECTOR (PDV) UNTIL EXPLICITLY CHANGED*/
RETAIN _ID;
/*ADD VARIABLE _ID AND POPULATE WITH ID. IN DOING THIS THE LAST INSTANCE OF THE ID WILL BE HELD IN THE PDV FOR THE
FIRST OF THE NEW RECORDS*/
IF ID ~= . THEN _ID = ID;
/*INCREMENT THE VALUE IN _ID BY 1 AND DO SO FOR EACH RECORD ADDED*/
ELSE DO;
_ID+1;
END;
/*DROP THE ORIGINAL ID;*/
DROP ID;
/*RENAME _ID TO ID*/
RENAME _ID = ID;
RUN;
Thanks to user2337871. In the code below, "W_prv_qtr_recoveries" is the table name and "pmt" is the library name.
DATA pmt.W_prv_qtr_recoveries;
SET pmt.W_prv_qtr_recoveries;
RETAIN _ID;
IF ID ~= . THEN _ID = ID;
ELSE DO;
_ID+1;
END;
DROP ID;
RENAME _ID = ID;
RUN;
This assumes that the auto-increment column is to be filled in for every record that is inserted.
We can accomplish that as follows.
First we check the latest key in the dataset
PROC SQL;
SELECT MAX(KEY) INTO :MK FROM MYDATA;
QUIT;
%put KeyOld=&MK;
Then we increment this key
Data _NULL_;
call symput('KeyNew',&MK+1);
run;
%put KeyNew=&KeyNew;
Here we hold the new record that we want to insert, and add the corresponding key
Data TEMP1;
set TEMP;
Key=&KeyNew;
run;
Finally we load the new record in our dataset
PROC APPEND BASE=MYDATA DATA=TEMP1 FORCE;
RUN;
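The steps above give every row of TEMP the same key, which suits a single new record. If TEMP may hold several records, one possible variant is to offset the current maximum by the row number. This is a sketch, assuming TEMP has no Key variable yet and that nothing else inserts into MYDATA between the query and the append.

```sas
proc sql noprint;
  /* current maximum key, or 0 when MYDATA is empty */
  select coalesce(max(key), 0) into :mk trimmed from mydata;
quit;

data temp1;
  set temp;
  /* _n_ counts the rows read, so keys continue from the old maximum */
  key = &mk + _n_;
run;

proc append base=mydata data=temp1 force;
run;
```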

update statement in data step

In the database we have email address dataset as following. Please notice that there are two observations for id 1003
data Email;
input id$ email $20.;
datalines;
1001 1001#gmail.com
1002 1002#gmail.com
1003 1003#gmail.com
1003 2003#gmail.com
;
run;
And we receive user request to change the email address as following,
data amendEmail;
input id$ email $20.;
datalines;
1003 1003#yahoo.com
;
run;
I attempted to use the update statement in the data step
data newEmail;
update Email amendEmail;
by id;
run;
However, it only changes the first observation for id 1003.
My desired output would be
1001 1001#gmail.com
1002 1002#gmail.com
1003 1003#yahoo.com
1003 1003#yahoo.com
Is this possible using a non-proc sql method?
Vasilij's merge-based data-step answer will give you the dataset you want, but not in the most efficient way, as it will overwrite the whole email dataset, rather than updating just the rows you want to change.
You can use a modify statement to change the email address for just the rows from email with matching ids in the amendEmail dataset.
First, you need to make sure you have an index on id in the email dataset. This is just a one-off task - as long as you don't overwrite the email dataset (e.g. with another data step that doesn't use a modify statement, or by sorting it) the index will still be there.
proc datasets lib = work nolist;
modify email;
index create id;
run;
quit;
Now you can do updates using the index:
data email;
set amendEmail(rename = (email = new_email));
do until(eof);
modify email key = id end = eof;
if _IORC_ then _ERROR_ = 0;
else do;
email = new_email;
replace;
end;
end;
run;
You should see some output in the log that looks like this, indicating that your dataset has been updated rather than overwritten:
NOTE: There were 1 observations read from the data set WORK.AMENDEMAIL.
NOTE: The data set WORK.EMAIL has been updated. There were 2 observations rewritten, 0 observations added and 0 observations
deleted.
N.B. before you use a modify statement like this, make sure that your master email dataset is backed up. If the data step is interrupted, it may become corrupt.
If you want to change both rows, you will end up with duplicates. You should probably address the issue of duplicates in your source table to begin with.
If you need a working solution with duplicated results, consider using PROC SQL with LEFT JOIN and conditional clause for email address.
PROC SQL;
CREATE TABLE EGTASK.QUERY_FOR_EMAIL AS
SELECT t1.id,
/* email */
(CASE WHEN t1.id = t2.id THEN t2.email
ELSE t1.email
END) AS email
FROM WORK.EMAIL t1
LEFT JOIN WORK.AMENDEMAIL t2 ON (t1.id = t2.id);
QUIT;
As per comments, if you prefer to use data step, you can use the following:
data want (drop=email2);
merge Email amendEmail (rename=(email=email2));
by id;
if email2 ne "" then email=email2;
run;
Ideally you should have unique values in the by variable. When the master dataset contains duplicates, the update statement only updates the first observation of the by group. Please refer to the link below
http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001329152.htm

SAS data step/ proc sql insert rows from another table with auto increment primary key

I have 2 datasets as below
id name status
1 A a
2 B b
3 C c
Another dataset
name status new
C c 0
D d 1
E e 1
F f 1
How do I insert all rows from 2nd table to 1st table? The situation is that the first table is permanent. The 2nd table is updated monthly, so I would like to add all rows from the monthly updated table to the permanent table, so that it would look like this
id name status
1 A a
2 B b
3 C c
4 D d
5 E e
6 F f
The problem I'm facing is that I cannot increment the id from dataset 1. As far as I can find, a SAS dataset does not have an auto-increment property. The auto-increment can be done in a data step, but I don't know whether a data step can be used in a case with 2 tables like this.
The usual sql would be
Insert into table1 (name, status)
select name, status from table2 where new = 1;
But a SAS dataset does not support an auto-increment column, hence the problem I'm facing.
I could solve it by using SAS data step as below after the above proc sql
data table1;
set table1;
if _n_ > 3 then id = _n_;
run;
This would increase the value of id column, but the code is kinda ugly, and also the id is a primary key, and being used as a foreign key in other table, so I don't want to mess up the ids of old rows.
I'm in the process of both learning and working with SAS so help is really appreciated. Thanks in advance.
Extra question:
If the 2nd table does not have the new column, is there any way to complete what I want (add new row from monthly table (2nd) to permanent table (1st)) with data step? Currently, I use this ugly proc sql/data step to create new column
proc sql; /* create a temp table from table2 */
create table t2temp as select t2.*,
(case when t2.name = t1.name and t2.status = t1.status then 0 else 1 end) as new
from table2 as t2
left join table1 as t1
on t2.name = t1.name and t2.status = t1.status;
drop table table2; /* drop the old table2 with no column "new" */
quit;
data table2; /* rename the t2temp as table2 */
set t2temp;
run;
You can do it in the datastep. BTW, if you were creating it entirely anew, you could just use
id+1;
to create an autonumbered field (assuming your data step wasn't too complicated). This will keep track of the current highest ID number and assign one higher to each row as you go if it is in the new dataset.
data have;
input id name $ status $;
datalines;
2 A a
3 B b
1 C c
;;;;
run;
data addon;
input name $ status $ new;
datalines;
C c 0
D d 1
E e 1
F f 1
;;;;
run;
data want;
retain _maxID; *keep the value of _maxID from one row to the next,
do not reset it;
set have(in=old) addon(in=add); *in= creates a temporary variable indicating which
dataset a row came from;
if (old) or (add and new); *in SAS like in c/etc., 0/missing(null) is
false negative/positive numbers are true;
if add then ID = _maxID+1; *assigns ID to the new records;
_maxID = max(id,_maxID); *determines the new maximum ID -
this structure guarantees it works
even if the old DS is not sorted;
put id= name=;
drop _maxID;
run;
Response to second question:
Yes, you can still do that. One of the easiest ways is, if you have the datasets sorted by NAME:
data want;
retain _maxID;
set have(in=old) addon(in=add);
by name;
if (old) or (add and first.name);
if add then ID = _maxID+1;
_maxID = max(id,_maxID);
put id= name=;
run;
first.name will be true for the first record with the same value of name; so if HAVE has a value of that name, then ADDON will not be permitted to add a new record.
This does require name to be unique in HAVE, or you might delete some records. If that is not true then you have a more complicated solution.