I'm a beginner with SAS, especially when it comes to computing aggregates across rows.
Here is a question that I believe some of you may have encountered before.
My data relate to insurance policies. Here is an example dataset; the columns, from left to right, are customer number, policy number, policy status, policy start date and policy cancel date (missing if the policy is still active).
data have;
informat cust_id 8. pol_num $10. status $10. start_date can_date DDMMYY10.;
input cust_id pol_num status start_date can_date;
format start_date can_date date9.;
datalines;
110 P110001 Cancelled 04/12/2004 10/10/2013
110 P110002 Active 01/03/2005 .
123 P123001 Cancelled 21/07/1998 23/04/2013
123 P123003 Cancelled 22/10/1987 01/11/2011
133 P133001 Active 19/02/2001 .
133 P133001 Active 20/02/2002 .
;
run;
Basically I want to roll this policy-level information up to customer level: if a customer holds at least one active policy, then his status is 'Active'; otherwise, if all his policies are Cancelled, his status becomes 'Inactive'. I also need a customer "start date", which picks up the earliest policy start date under that customer. If the customer is 'Inactive', then I need the customer's latest policy cancel date as the customer's exit date.
Below is what I need:
data want;
informat cust_id 8. status $10. start_date exit_date DDMMYY10.;
input cust_id status start_date exit_date;
format start_date exit_date date9.;
datalines;
110 Active 04/12/2004 .
123 Inactive 22/10/1987 23/04/2013
133 Active 19/02/2001 .
;
run;
Solution in any form would be much appreciated! Either DATA step or PROC SQL is fine.
Thank you so much.
You can do something like this:
proc sql;
create table want as
select cust_id,
case when count(case when status='Active' then 1 end) > 0
then 'Active'
else 'Inactive'
end as status,
min(start_date) as start_date format=date9.,
case when count(case when status='Active' then 1 end) = 0
then max(can_date)
end as exit_date format=date9.
from have
group by cust_id;
quit;
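You could also attack the question in a DATA step. Here's one simple way, assuming your data are sorted by cust_id and start_date (run PROC SORT first if they are not; the sample data above are not)...
data want (keep=cust_id status start_date exit_date);
    set have (rename=(status=pol_status start_date=pol_start));
    by cust_id start_date;
    length status $10;
    retain start_date exit_date;
    format start_date exit_date date9.;
    if first.cust_id then do;
        n_active = 0;
        start_date = pol_start;
        exit_date = .;
    end;
    n_active + (upcase(pol_status) = 'ACTIVE');
    exit_date = max(exit_date, can_date);
    if last.cust_id;
    status = ifc(n_active > 0, 'Active', 'Inactive');
    if n_active > 0 then exit_date = .;
run;
/*BEGINNER NOTES*/
*1] The BY statement creates the automatic flags FIRST.cust_id and
LAST.cust_id, which equal 1 on the first and last observation of
each customer;
*2] Because the data are sorted by start_date within cust_id, the
first observation of each customer carries the earliest start date;
*3] I use UPCASE() to standardize the contents of status, as character
comparisons are case-sensitive;
*4] n_active + (...) is a sum statement, so n_active is retained across
observations, and the subsetting IF writes one row per customer;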
I have a dataset with id and date. I want a dataset that keeps two records for each duplicated id, on the condition that one is before 8th June and the other is after 8th June.
To take the first date and the first date after 2021-06-08 you can sort by ID and DATE and use LAG() to detect when you cross the date boundary.
data have ;
input id date :date. ;
format date date9.;
cards;
1 01jun2021
1 07jun2021
1 08jun2021
1 09jun2021
;
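data want;
set have ;
by id date;
before = (date <= '08JUN2021'd);
prev_before = lag(before); * call LAG on every row so its queue stays in step with the data;
if first.id or (before ne prev_before);
drop before prev_before;
run;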
results
Obs id date
1 1 01JUN2021
2 1 09JUN2021
Given the dataset below, I am trying to count new users versus repeated users.
DATE ID Unique_Event
20200901 a12345 1
20200902 a12345 1
20200903 b12345 1
20200903 a12345 1
20200904 c12345 1
In the above dataset, since a12345 appeared on multiple dates, it should be counted as a "repeated" user, whereas b12345 appeared only once, so it is a "new" user. Please note that this is only sample data; the actual data is quite large. I tried the code below, but I am not getting the correct count. Ideally, tot_num_users-num_new_users should give the repeated users, but the counts are incorrect. Am I missing something?
Expected Output:
Month new_users repeated_users
9 2 1
Code:
data user_events;
set user_events;
new_date=input(date,yymmdd10.);
run;
proc sql;
select month(new_date) as mm,
count(distinct vv.id) as total_num_users,
count(distinct case when v.new_date = vv.minva then v.id end) as num_new_users,
(count(distinct vv.id) - count(distinct case when v.new_date = vv.minva then id end)
) as num_repeated_users
from user_events v inner join
(select t.id, min(new_date) as minva
from user_events t
group by t.id
) vv
on v.id = vv.id
group by 1
order by 1;
quit;
In a sub-select, count the number of distinct DATE values for each ID to determine its new / repeated status. The aggregates over all IDs are then computed from the sub-select.
proc sql;
create table freq as
select
count(*) as id_count
, sum (status='repeated') as id_repeated_count /* sum counts a logic eval state */
, sum (status='new') as id_new_count
from
( select
id
, case
when count(distinct date) > 1 then 'repeated'
else 'new'
end as status
from
user_events
group by
id
) as statuses
;
quit;
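The expected output in the question is broken down by month, which the query above does not do. A minimal sketch of the same sub-select idea with a month column added, assuming the numeric new_date variable created in the question's DATA step (the names monthly_freq, per_id and n_dates are made up for illustration):

```sas
proc sql;
    create table monthly_freq as
    select mm as Month,
           sum(n_dates = 1) as new_users,       /* ids seen on exactly one date in the month */
           sum(n_dates > 1) as repeated_users   /* ids seen on several dates in the month */
    from ( select month(new_date) as mm,
                  id,
                  count(distinct new_date) as n_dates
           from user_events
           group by calculated mm, id
         ) as per_id
    group by mm;
quit;
```

Note that under this definition an id is classified separately within each month, so a user who appears once in each of two months would count as "new" in both. Whether that is the intended meaning depends on the reporting requirement.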
An alternative solution not using proc sql (though I'm aware you tagged this with "proc sql").
data final;
set user_events;
Month=month(new_date);
run;
proc sort data=final; by Month ID; run;
data final;
set final;
by Month ID;
if first.Month then do;
new_users=0;
repeated_users=0;
end;
if last.ID then do;
if first.ID then
new_users+1;
else
repeated_users+1;
end;
if last.Month then
output;
keep Month new_users repeated_users;
run;
Since you are using proc sql, this is a sql question, not a SAS question.
Try something like:
proc sql;
select ID,count(Unique_Event)
from user_events
group by ID
order by ID;
quit;
I have merged two datasets. One data set has the date a project was submitted and the other has when a project was ended. I want to have a new dataset that has only the projects where the ending date is before the submission date. I'm basically trying to identify where projects are being properly closed out before we submit them for outside reviews. Both date variables are date9. formats.
The data looks something like this (edit: there are no missing dates)
Service Submission_date End_date
1 1/1/2010 2/1/2009
2 2/1/2010 12/31/2010
3 5/1/2012 3/1/2010
I used a simple where statement, but I am still seeing incorrect dates. I used code like this:
data correctsubmission;
set projects;
where end_date < submit_date;
run;
Any ideas?
Make sure your variables actually contain dates.
data have;
input Service Submission_date End_date ;
informat Submission_date End_date mmddyy.;
format Submission_date End_date yymmdd10.;
cards;
1 1/1/2010 2/1/2009
2 2/1/2010 12/31/2010
3 5/1/2012 3/1/2010
;
They should be numeric variables that contain the number of days since 1960, preferably with a date format attached (like DATE, YYMMDD, etc.) so that humans can read the displayed value.
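One way to check this is sketched below. PROC CONTENTS reports each variable's type, and if a date turns out to be stored as text it can be converted with INPUT(). The variable names end_date_char and end_date_num and the dataset name projects_fixed are hypothetical, and the mmddyy10. informat assumes text shaped like the sample rows.

```sas
/* Look at the Type column: date variables should be listed as Num, not Char */
proc contents data=projects;
run;

/* Hypothetical conversion, assuming end_date_char holds text such as '2/1/2009' */
data projects_fixed;
    set projects;
    end_date_num = input(end_date_char, mmddyy10.);   /* text -> SAS date value */
    format end_date_num date9.;
run;
```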
Also make sure to account for missing values.
data want;
set have;
where .Z < end_date < submission_date;
run;
Or reverse the test.
data want ;
set have;
where Submission_date > End_date ;
run;
In the database we have an email address dataset as follows. Please notice that there are two observations for id 1003.
data Email;
input id$ email $20.;
datalines;
1001 1001#gmail.com
1002 1002#gmail.com
1003 1003#gmail.com
1003 2003#gmail.com
;
run;
And we receive a user request to change the email address as follows:
data amendEmail;
input id$ email $20.;
datalines;
1003 1003#yahoo.com
;
run;
I attempted to use the update statement in the data step:
data newEmail;
update Email amendEmail;
by id;
run;
However, it only changes the first observation for id 1003.
My desired output would be
1001 1001#gmail.com
1002 1002#gmail.com
1003 1003#yahoo.com
1003 1003#yahoo.com
Is this possible using a non-PROC SQL method?
Vasilij's merge-based data-step answer will give you the dataset you want, but not in the most efficient way, as it will overwrite the whole email dataset, rather than updating just the rows you want to change.
You can use a modify statement to change the email address for just the rows from email with matching ids in the amendEmail dataset.
First, you need to make sure you have an index on id in the email dataset. This is just a one-off task - as long as you don't overwrite the email dataset (e.g. with another data step that doesn't use a modify statement, or by sorting it) the index will still be there.
proc datasets lib = work nolist;
modify email;
index create id;
run;
quit;
Now you can do updates using the index:
data email;
set amendEmail(rename = (email = new_email));
do until(eof);
modify email key = id end = eof;
if _IORC_ then _ERROR_ = 0;
else do;
email = new_email;
replace;
end;
end;
run;
You should see some output in the log that looks like this, indicating that your dataset has been updated rather than overwritten:
NOTE: There were 1 observations read from the data set WORK.AMENDEMAIL.
NOTE: The data set WORK.EMAIL has been updated. There were 2 observations rewritten, 0 observations added and 0 observations
deleted.
N.B. before you use a modify statement like this, make sure that your master email dataset is backed up. If the data step is interrupted, it may become corrupt.
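For comparison, here is a sketch of the same indexed update written with the `%SYSRC` return-code mnemonics, which is the pattern I would reach for when the master dataset can hold several rows per key. It assumes the index on id already exists and reuses the new_email rename from above; the message text in the OTHERWISE branch is made up.

```sas
data email;
    set amendEmail(rename=(email=new_email));
    /* Keep looking up the same id until the index reports no further match */
    do until (_iorc_ = %sysrc(_dsenom));
        modify email key=id;
        select (_iorc_);
            when (%sysrc(_sok)) do;                 /* a matching row was found */
                email = new_email;
                replace;
            end;
            when (%sysrc(_dsenom)) _error_ = 0;     /* no (more) matches: clear the error flag */
            otherwise do;                           /* anything else is unexpected: stop */
                put 'Unexpected _IORC_ value: ' _iorc_=;
                stop;
            end;
        end;
    end;
run;
```

The loop ends on the explicit "no match" return code rather than on an end-of-file flag, which makes the stopping condition easier to read. The same backup advice applies.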
If you want to change both rows, you will end up with duplicates. You should probably address the issue of duplicates in your source table to begin with.
If you need a working solution with duplicated results, consider using PROC SQL with LEFT JOIN and conditional clause for email address.
PROC SQL;
CREATE TABLE EGTASK.QUERY_FOR_EMAIL AS
SELECT t1.id,
/* email */
(CASE WHEN t1.id = t2.id THEN t2.email
ELSE t1.email
END) AS email
FROM WORK.EMAIL t1
LEFT JOIN WORK.AMENDEMAIL t2 ON (t1.id = t2.id);
QUIT;
As per comments, if you prefer to use data step, you can use the following:
data want (drop=email2);
merge Email amendEmail (rename=(email=email2));
by id;
if email2 ne "" then email=email2;
run;
Ideally you should have unique values in the BY variable. When there are duplicates, the UPDATE statement only updates the first observation. Please refer to the link below:
http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001329152.htm
I have a credit card transaction dataset (let's call it "Trans") with transaction amount, zip code, and date. I have another dataset (let's call it "Key") that lists sales tax rates based on date and geocode. The Key dataset also includes a range of zip codes associated with each geocode represented by 2 variables: Zip Start and Zip End.
Because Geocodes don't align with zip codes, some of the zip code ranges overlap. If this happens, I want to use the lowest sales tax rate associated with the zip code shown in Trans.
Trans dataset:
TransAmount TransDate TransZip
$200 01/07/1998 90010
$12 02/09/2002 90022
Key dataset:
Geocode Rate StartDate EndDate ZipStart ZipEnd
1001 .0825 199701 200012 90001 90084
1001 .085 200101 200812 90001 90084
1002 .0825 199701 200012 90022 90024
1002 .08 200101 200812 90022 90024
Desired output:
TransAmount TransDate TransZip Rate
$200 01/07/1998 90010 .0825
$12 02/09/2002 90022 .08
I used this basic SQL code in SAS, but I run into the problem of overlapping zip codes.
proc sql;
create table output as
select a.*, b.zipstart, b.zipend, b.startdate, b.enddate, b.rate
from Trans.CA_Zip_Cd_Testing a left join Key.CA_rates b
on a.TransZip ge b.zipstart
and a.TransZip le b.zipend
and a.TransDate ge b.StartDate
and a.transDate le b.EndDate
;
quit;
Well, the easiest way to do this, as far as the query portion goes, is to just add a subquery to get the min rate.
Select t.transamount, t.transdate, t.transzip
,(Select MIN(rate) from Key where t.transzip between ZipStart and ZipEnd and t.transdate between startdate and enddate) as Rate
from trans t;
You could also do it as subquery and join on it.
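A related variant, sketched here rather than taken from the answer, keeps the question's range join and collapses overlapping matches with MIN() and GROUP BY. It uses the table names from the question and assumes that TransDate and StartDate/EndDate are stored on a comparable scale; if the key dates are YYYYMM numbers while TransDate is a SAS date, they would need converting first.

```sas
proc sql;
    create table output as
    select a.TransAmount, a.TransDate, a.TransZip,
           min(b.Rate) as Rate            /* lowest rate among all overlapping zip ranges */
    from Trans.CA_Zip_Cd_Testing a
         left join Key.CA_rates b
           on  a.TransZip  between b.ZipStart  and b.ZipEnd
           and a.TransDate between b.StartDate and b.EndDate
    group by a.TransAmount, a.TransDate, a.TransZip;
quit;
```

One caveat: because of the GROUP BY, two transactions with identical amount, date and zip collapse into a single row. If that can happen in your data, add a unique transaction id to the select and group by lists.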
The SAS SQL optimizer can be good sometimes; other times it can be a challenge. The code below is a bit more complicated, but it will likely be faster. It loads the whole key table into a hash object, so it is subject to memory constraints on the size of that table.
data key;
set key;
dummy_key=1;
run;
data want(drop=dummy_key geocode rate startDate endDate zipStart zipEnd rc
               transZipNum zipStartNum zipEndNum);
    if _n_ = 1 then do;
        if 0 then set key; /* defines the key variables without reading any rows */
        declare hash k (dataset:'key',multidata:'y');
        k.defineKey('dummy_key');
        k.defineData('geocode','rate','startdate','enddate','zipstart','zipend');
        k.defineDone();
    end;
    call missing (of _all_); /* clear the key variables, which would otherwise carry over between rows */
    set trans;
    dummy_key=1;
    transZipNum = input(transZip,8.); /* converts a character zip to a number; remove the INPUT calls if the zips are already numeric */
    rc = k.find();
    do while (rc=0); /* walk every key row; they all share the same dummy key */
        zipStartNum = input(zipStart,8.);
        zipEndNum = input(zipEnd,8.);
        if startDate <= transDate <= endDate
           and zipStartNum <= transZipNum <= zipEndNum then
            rate_out = min(rate_out,rate); /* keep the lowest matching rate */
        rc=k.find_next();
    end;
run;