In the database we have an email address dataset as follows. Please note that there are two observations for id 1003.
data Email;
input id$ email $20.;
datalines;
1001 1001#gmail.com
1002 1002#gmail.com
1003 1003#gmail.com
1003 2003#gmail.com
;
run;
We then receive a user request to change the email address as follows:
data amendEmail;
input id$ email $20.;
datalines;
1003 1003#yahoo.com
;
run;
I attempted to use the update statement in a data step:
data newEmail;
update Email amendEmail;
by id;
run;
However, it only changes the first observation for id 1003.
My desired output would be
1001 1001#gmail.com
1002 1002#gmail.com
1003 1003#yahoo.com
1003 1003#yahoo.com
Is this possible using a non-PROC SQL method?
Vasilij's merge-based data-step answer will give you the dataset you want, but not in the most efficient way, as it will overwrite the whole email dataset, rather than updating just the rows you want to change.
You can use a modify statement to change the email address for just the rows from email with matching ids in the amendEmail dataset.
First, you need to make sure you have an index on id in the email dataset. This is just a one-off task - as long as you don't overwrite the email dataset (e.g. with another data step that doesn't use a modify statement, or by sorting it) the index will still be there.
proc datasets lib = work nolist;
modify email;
index create id;
run;
quit;
Now you can do updates using the index:
data email;
set amendEmail(rename = (email = new_email));
do until(eof);
modify email key = id end = eof;
if _IORC_ then _ERROR_ = 0;
else do;
email = new_email;
replace;
end;
end;
run;
You should see some output in the log that looks like this, indicating that your dataset has been updated rather than overwritten:
NOTE: There were 1 observations read from the data set WORK.AMENDEMAIL.
NOTE: The data set WORK.EMAIL has been updated. There were 2 observations rewritten, 0 observations added and 0 observations
deleted.
N.B. Before you use a modify statement like this, make sure that your master email dataset is backed up. If the data step is interrupted, the dataset may become corrupt.
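A minimal sketch of one way to take that backup, assuming the master dataset is work.email as above. The name email_backup is hypothetical and chosen only for illustration:

```sas
/* Copy the master dataset before running the MODIFY step.     */
/* email_backup is a hypothetical name, not from the question. */
data work.email_backup;
    set work.email;
run;
```

A plain data step copy like this does not carry the index across, so if you ever restore from the backup you would need to re-create the index on id.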
If you want to change both rows, you will end up with duplicates. You should probably address the issue of duplicates in your source table to begin with.
If you need a working solution that keeps the duplicated rows, consider using PROC SQL with a LEFT JOIN and a CASE expression for the email address.
PROC SQL;
CREATE TABLE EGTASK.QUERY_FOR_EMAIL AS
SELECT t1.id,
/* email */
(CASE WHEN t1.id = t2.id THEN t2.email
ELSE t1.email
END) AS email
FROM WORK.EMAIL t1
LEFT JOIN WORK.AMENDEMAIL t2 ON (t1.id = t2.id);
QUIT;
As per comments, if you prefer to use data step, you can use the following:
data want (drop=email2);
merge Email amendEmail (rename=(email=email2));
by id;
if email2 ne "" then email=email2;
run;
Ideally you should have unique values in the BY variable. When the master dataset contains duplicates, the UPDATE statement only updates the first observation in the BY group. Please refer to the link below.
http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001329152.htm
Related
I have a dataset with the first 4 columns and I want to create the last column. My dataset has millions of records.
ID  Date      Code  Event of Interest  Want to Create
1   1/1/2022  101   *                  201
1   1/1/2022  201   yes                201
1   1/1/2022  301   *                  201
1   1/1/2022  401   *                  201
2   1/5/2022  101   *                  301
2   1/5/2022  201   *                  301
2   1/5/2022  301   yes                301
I want to group records by ID and date. If one of the records in the grouping has a 'yes' in the event of interest variable, I want to assign that code to the entire grouping. I am using base SAS.
Any ideas?
Assuming that you will only have one yes value for each id and date, you can use a lookup table and merge them together. Here are a few ways to do it.
1. Self-merge
Simply merge the data onto itself where event = yes.
data want;
merge have
have(rename=(code = new_code
event = _event_)
where =(upcase(_event_) = 'YES')
)
;
by id date;
drop _event_;
run;
2. SQL Self-join
Same as above, but using a SQL inner join.
proc sql;
create table want as
select t1.*
, t2.code as new_code
from have as t1
INNER JOIN
have as t2
ON t1.id = t2.id
AND t1.date = t2.date
where upcase(t2.event) = 'YES'
;
quit;
3. Hash lookup table
This is more advanced but can be quite performant if you have the memory. Notice that it looks very similar to our merge statement in Option 1. We're creating a lookup table, loading it to memory, and using a hash join to pull values from that in-memory table. h.Find() will check the unique combination of (id, date) in the value read from the set statement against the hash table in memory. If a match is found, it will pull the value of new_code.
data want;
set have;
if(_N_ = 1) then do;
dcl hash h(dataset: "have(rename=(code= new_code)
where =(upcase(event) = 'YES')
)"
, hashexp:20);
h.defineKey('id', 'date');
h.defineData('new_code');
h.defineDone();
call missing(new_code);
end;
rc = h.Find();
drop rc;
run;
You could just remember the last value of CODE you want for the group by using a double DOW loop.
In the first loop copy the code value to the new variable. The second loop can re-read the observations and write them out with the extra variable filled in.
data want;
do until (last.date);
set have;
by id date ;
if 'Event of Interest'n='yes' then 'Want to Create'n=code;
end;
do until (last.date);
set have;
by id date;
output;
end;
run;
I have 33 different datasets with one column and all share the same column name/variable name;
net_worth
I want to load the values into arrays and use them in a data step. But the array that I use should depend on the by groups in the data step (country by city). There are a total of 33 datasets and 33 groups (country by city). Each dataset corresponds to exactly one by group.
here is an example what the by groups look like in the dataset: customers
UK 105 (other fields)
UK 102 (other fields)
US 291 (other fields)
US 292 (other fields)
Could I get some advice on how to load the columns into arrays and then use them in a data step? Or would you suggest doing it another way?
%let var1 = uk105;
%let var2 = uk102;
.....
%let var33 = jk12;
data want;
set customers;
by country city;
if _n_ = 1 then do;
*set datasets and create and populate arrays*;
* use array values in calculations with fields from dataset customers, depending on which by group. if the by group is uk and city is 105 then i need to use the created array corresponding to that by group;
It is a little hard to understand what you want.
It sounds like you have one dataset named CUSTOMERS that has all of the main variables, and a bunch of single-variable datasets that hold the values of NET_WORTH for a lot of different things (countries?).
Assuming that the observations in all of the datasets are in the same order then I think you are asking for how to generate a data step like this:
data want;
set customers;
set uk105 (rename=(net_worth=uk105));
set uk102 (rename=(net_worth=uk102));
....
run;
That code is probably easiest to generate with a data step that writes the SET statements to a temporary file:
filename code temp;
data _null_;
input name $32. ;
file code ;
put ' set ' name '(rename=(net_worth=' name '));' ;
cards;
uk105
uk102
;;;;
data want;
set customers;
%include code / source2;
run;
I have a table of customer purchases. The goal is to be able to pull summary statistics on the last 20 purchases for each customer and update them as each new order comes in. What is the best way to do this? Do I need a table for each customer? Keep in mind there are over 500 customers. Thanks.
This is asked at a high level, so I'll answer it at that level. If you want more detailed help, you'll want to give more detailed information, and make an attempt to solve the problem yourself.
In SAS, you have the BY statement available in every PROC or DATA step, as well as the CLASS statement, available in most PROCs. These both are useful for doing data analysis at a level below global. For many basic uses they give a similar result, although not in all cases; look up the particular PROC you're using to do your analysis for more detailed information.
Presumably, you'd create one table containing the twenty most recent records per customer, or even one view (a view is like a table, except its data is not written to disk), and then run your analysis PROC with a BY statement on your customer ID variable. If you set it up as a view, you don't even have to rerun that part: you can create a permanent view pointing to your constantly updating data, and the subsetting to the last 20 records will happen every time you run the analysis PROC.
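As a rough sketch of the view idea, assuming a purchases dataset work.have with variables id, purchase_date and amount (these names are assumptions, as are the view names purchases_sorted and last20), it could look something like this:

```sas
/* Assumed names: work.have with variables id, purchase_date, amount. */

/* SQL view that presents the purchases newest-first within each customer */
proc sql;
    create view work.purchases_sorted as
    select *
    from work.have
    order by id, purchase_date desc;
quit;

/* Data step view that keeps only the first 20 rows per customer */
data work.last20 / view=work.last20;
    set work.purchases_sorted;
    by id descending purchase_date;
    if first.id then n = 0;   /* restart the counter for each customer */
    n + 1;                    /* sum statement: n is retained across rows */
    if n <= 20;               /* subsetting IF keeps the 20 newest rows */
    drop n;
run;

/* The subsetting is re-evaluated every time the view is read */
proc means data=work.last20 n mean min max;
    by id;
    var amount;
run;
```

The trade-off is that the sort runs each time the view is read, so for a large purchases table a physical table refreshed on a schedule may be the cheaper choice.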
Yes, you can either add a rank to your existing table or create another table containing the last 20 purchases for each customer.
My recommendation is to use a data step to select the most recent purchases per customer and then compute your summary statistics. My code below creates a table called "WANT" with the most recent purchases and a rank field.
Sample Data:
data have;
input id $ purchase_date amount;
informat purchase_date datetime19.;
format purchase_date datetime19.;
datalines;
cust01 21dec2017:12:12:30 234.57
cust01 23dec2017:12:12:30 2.88
cust01 24dec2017:12:12:30 4.99
cust02 21nov2017:12:12:30 34.5
cust02 23nov2017:12:12:30 12.6
cust02 24nov2017:12:12:30 14.01
;
run;
Sort Data in Descending order by ID and Date:
proc sort data=have ;
by id descending purchase_date ;
run;
Select the top 2 (change 2 to 20 in your case):
/*Top 2*/
%let top=2;
data want (where=(Rank ne .));
set have;
by id;
retain i;
/*reset counter for top */
if first.id then do; i=1; end;
if i <= &top then do; Rank= &top+1-i; output; i=i+1;end;
drop i;
run;
Output: Last 2 Customer Purchases:
id=cust01 purchase_date=24DEC2017:12:12:30 amount=4.99 Rank=2
id=cust01 purchase_date=23DEC2017:12:12:30 amount=2.88 Rank=1
id=cust02 purchase_date=24NOV2017:12:12:30 amount=14.01 Rank=2
id=cust02 purchase_date=23NOV2017:12:12:30 amount=12.6 Rank=1
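From there, the summary statistics can be produced per customer. A minimal sketch, assuming the WANT table created above and that amount is the variable to summarise (the choice of statistics is only an illustration):

```sas
/* WANT is still ordered by id after the data step above, so BY id works */
proc means data=want n mean sum min max;
    by id;
    var amount;
run;
```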
I need to list the ID numbers that are currently available, based on a data set in which IDs are already assigned (if the ID is on the file then it's in use; if it's not on file, then it's available for use).
The issue is I don't know how to create a data set that displays the ID numbers that fall between two IDs currently on file. Let's say I have the data set below:
data have;
input id;
datalines;
1
5
6
10
;
run;
What I need is for the new data set to be in the following structure of this data set -
data need;
input id;
datalines;
2
3
4
7
8
9
;
run;
I am not sure how I would produce the observations of ID #'s 2, 3 and 4 as these would be scenarios of "available ID's"...
My initial attempt was going to be subtracting the ID values from one observation to the next in order to find the difference, but I am stuck from there on how to use that value and add 1 to the observation before it...and it all became quite messy from there.
Any assistance would be appreciated.
As long as your set of possible IDs is known, this can be done by putting them all in a dataset and excluding the used ones.
e.g.
data id_set;
do id = 1 to 10;
output;
end;
run;
proc sql;
create table need as
select id
from id_set
where id not in (select id from have)
;
quit;
Create a temporary variable that stores the previous id, then just loop between that and the current id, outputting each iteration.
data have;
input id;
datalines;
1
5
6
10
;
run;
data need (rename=(newid=id));
set have;
retain _lastid; /* keep previous id value */
if _n_>1 then do newid=_lastid+1 to id-1; /* fill in numbers between previous and current ids */
output;
end;
_lastid=id;
keep newid;
run;
Building on Jetzler's answer: another option is to use the MERGE statement. In this case:
/* note: before the merge, sort both datasets by id (if not already sorted) */
data want;
merge id_set (in=a)
have (in=b); /*specify datasets and vars to allow the conditional below*/
by id; /*merge key variable*/
if a and not b; /*on output keep only records in ID_SET that are not in HAVE*/
run;
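The sort that the merge relies on could be done as follows, assuming the id_set and have datasets from the earlier answers:

```sas
/* A BY-group merge requires both inputs to be ordered by the BY variable */
proc sort data=id_set;
    by id;
run;

proc sort data=have;
    by id;
run;
```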
Probably a simple question. I have a simple dataset with scheduled payment dates in it.
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
11/16/2015 12/16/2015
12/17/2015 01/16/2016
01/17/2016 02/16/2016
;
What I'm trying to do is to create a binary latest-row indicator. For example, if I wanted to know the latest row as of 1/31/2016, I'd want row 2 to be flagged as the latest row. What I had been doing before is checking whether 1/31/2016 falls between the previous_pmt_date and the scheduled_pmt_date, but that isn't correct for my purposes. I'd like to do this in a data step as opposed to SQL subqueries. Any ideas?
Want:
previous_pmt_date scheduled_pmt_date latest_row_ind
11/16/2015 12/16/2015 0
12/17/2015 01/16/2016 1
01/17/2016 02/16/2016 0
Here's a solution that does it all in the single existing datastep without any additional sorting. First I'm going to modify your data slightly to include account as the solution really should take that into account as well:
DATA INFORM2;
INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
INPUT account previous_pmt_date scheduled_pmt_date;
FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
1 11/16/2015 12/16/2015
1 12/17/2015 01/16/2016
1 01/17/2016 02/16/2016
2 11/16/2015 12/16/2015
2 12/17/2015 01/16/2016
2 01/17/2016 02/16/2016
;
run;
Specify a cutoff date:
%let cutoff_date = %sysfunc(mdy(1,31,2016));
This solution uses the approach from this question to save the variables in the next row of data, into the current row. You can drop the vars at the end if desired (I've commented out for the purposes of testing).
data want;
set inform2 end=eof;
by account scheduled_pmt_date;
recno = _n_ + 1;
if not eof then do;
set inform2 (keep=account previous_pmt_date scheduled_pmt_date
rename=(account = next_account
previous_pmt_date = next_previous_pmt_date
scheduled_pmt_date = next_scheduled_pmt_date)
) point=recno;
end;
else do;
call missing(next_account, next_previous_pmt_date, next_scheduled_pmt_date);
end;
select;
when ( next_account eq account and next_scheduled_pmt_date gt &cutoff_date ) flag='a';
when ( next_account ne account ) flag='b';
otherwise flag = 'z';
end;
*drop next:;
run;
This approach works by using the current observation in the dataset (obtained via _n_) and adding 1 to it to get the next observation. We then use a second set statement with the point= option to load in that next observation and rename the variables at the same time so that they don't overwrite the current variables.
We then use some logic to flag the necessary records. I'm not 100% sure of the logic you require for your purposes, so I've provided some sample logic and used different flags to show which logic is being triggered.
Some notes...
The by statement isn't strictly necessary but I'm including it to (a) ensure that the data is sorted correctly, and (b) help future readers understand the intent of the datastep as some of the logic requires this sort order.
The call missing statement is simply there to clean up the log. SAS doesn't like it when you have variables that don't get assigned values, and this will happen on the very last observation so this is why we include this. Comment it out to see what happens.
The end=eof syntax basically creates a temporary variable called eof that has a value of 1 when we get to the last observation on that set statement. We simply use this to determine if we're at the last row or not.
Finally, and very importantly, make sure you keep only the variables you need when you load in the second dataset, and rename them; otherwise you will overwrite existing variables from the original data.
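For comparison, here is a sketch of a different look-ahead technique (a one-to-one merge of the dataset with itself offset by one row via firstobs=2), not the point= approach used above. The flag rule is an assumption about what the question wants: a row is the latest if it is scheduled on or before the cutoff and the next row is either a different account or scheduled after the cutoff. The names want_alt and latest_row_ind in this form are illustrative.

```sas
/* Cutoff date as a SAS date value, as in the answer above */
%let cutoff_date = %sysfunc(mdy(1,31,2016));

data want_alt;
    /* One-to-one merge of the data with itself shifted up by one row. */
    /* Deliberately no BY statement.                                   */
    merge inform2
          inform2(firstobs=2
                  keep=account scheduled_pmt_date
                  rename=(account=next_account
                          scheduled_pmt_date=next_scheduled_pmt_date));
    /* Assumed rule: due on or before the cutoff, and the next row is  */
    /* either another account or due after the cutoff.                 */
    latest_row_ind = (scheduled_pmt_date <= &cutoff_date)
                     and (next_account ne account
                          or next_scheduled_pmt_date > &cutoff_date);
    drop next_:;
run;
```

With the two-account sample data and the 1/31/2016 cutoff, this should flag the 01/16/2016 row of each account. On the last observation the look-ahead variables are missing, which the rule treats as a change of account.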