I have a dataset like this:
CustomerID AccountManager TransactionID Transaction_Time
1111111111 FA001 TR2016001 08SEP16:11:19:25
1111111111 FA001 TR2016002 26OCT16:08:22:49
1111111111 FA002 TR2016003 04NOV16:08:05:36
1111111111 FA003 TR2016004 04NOV16:17:15:52
1111111111 FA004 TR2016005 25NOV16:13:04:16
1231231234 FA005 TR2016006 25AUG15:08:03:29
1231231234 FA005 TR2016007 16SEP15:08:24:24
1231231234 FA008 TR2016008 18SEP15:14:42:29
CustomerID represents each customer, each customer could have multiple transactions. Each account manager could deal with multiple transactions too. But transactionID is unique in this table.
Now I would like to count, for each customer at the time each transaction happened, looking back over the previous 90 days, how many distinct account managers were involved and how many transactions occurred. The result I am looking for is like this:
CustomerID Manager TransacID Transaction_Time CountTransac CountManager
1111111111 FA001 TR2016001 08SEP16:11:19:25 1 1
1111111111 FA001 TR2016002 26OCT16:08:22:49 2 1
1111111111 FA002 TR2016003 04NOV16:08:05:36 3 2
1111111111 FA003 TR2016004 04NOV16:17:15:52 4 3
1111111111 FA004 TR2016005 25NOV16:13:04:16 5 4
1231231234 FA005 TR2016006 25AUG15:08:03:29 1 1
1231231234 FA005 TR2016007 16SEP15:08:24:24 2 1
1231231234 FA008 TR2016008 18SEP15:14:42:29 3 2
Using the following code, I figured out how to calculate the transaction count, but I do not know how to calculate the distinct manager count. It would be highly appreciated if someone could help me out. Thanks a lot.
DATA want;
SET transaction;
COUNT=1;
DO point=_n_-1 TO 1 BY -1;
SET want(KEEP=CustomerID Transaction_Time COUNT POINT=point
RENAME=(CustomerID =SAME_ID Transaction_Time =OTHER_TIME COUNT=OTHER_COUNT));
IF CustomerID NE SAME_ID
OR INTCK ("DAY", DATEPART(OTHER_TIME), DATEPART(Transaction_Time )) > 90
THEN LEAVE;
COUNT + OTHER_COUNT;
END;
DROP SAME_ID OTHER_TIME OTHER_COUNT;
RENAME COUNT=COUNT_TRANSAC;
RUN;
Your code does not work at all as it is, but I see what you want to do. Here is something that does work. I commented out the WHERE statement so you can see that it produces the result you asked for. You need the WHERE statement if you really want just the last 90 days.
* Always a good idea to sort first unless you are CERTAIN that
* your values are in the order you want.;
proc sort data=have;
by customerid AccountManager transactionid;
run;
DATA want;
SET have;
* Uncomment the WHERE statement to activate the 90-day time frame.;
* where today()-datepart(transaction_time)<=90;
by customerid AccountManager transactionid;
if first.customerid
then do;
counttransac=0;
countmanager=0;
end;
if first.AccountManager
then countmanager+1;
counttransac+1;
RUN;
Taking advantage of SAS's BY statement and the first. and last. variable modifiers, you can reset your counters each time you see a new customer ID or manager ID.
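If it helps to see the BY-group counting pattern outside SAS, here is a small illustrative Python sketch of the same first.customerid / first.AccountManager logic, using the sample data from the question (variable names are my own):

```python
from itertools import groupby

# Transactions sorted by customer and manager, as after the PROC SORT above.
rows = [
    ("1111111111", "FA001"), ("1111111111", "FA001"), ("1111111111", "FA002"),
    ("1111111111", "FA003"), ("1111111111", "FA004"),
    ("1231231234", "FA005"), ("1231231234", "FA005"), ("1231231234", "FA008"),
]
out = []
for cust, grp in groupby(rows, key=lambda r: r[0]):   # mimics first.customerid
    counttransac = countmanager = 0
    prev_mgr = None
    for _, mgr in grp:
        if mgr != prev_mgr:                           # mimics first.AccountManager
            countmanager += 1
            prev_mgr = mgr
        counttransac += 1
        out.append((cust, mgr, counttransac, countmanager))
print(out)
```

As in the SAS step, this relies entirely on the sort order: a manager appearing in two non-adjacent runs would be counted twice.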
[EDIT] Okay, that's much more difficult. Here is code that looks back at the history before each transaction. I see why you were using two SET statements: you have to join the dataset to itself. You could probably do this with PROC SQL, but I didn't have time to check it out. Let me know if this works for you.
* Sort each customer's and manager's transactions;
proc sort data=transaction;
by customerid accountmanager;
run;
DATA want;
SET transaction nobs=pmax;
by customerid;
length lastmgr $ 100;
retain pstart; * Starting row for each customer;
* Save starting row for each customer;
if first.customerid
then pstart=_n_;
* Initialize current account manager and counters for
* managers and transactions. The current transaction always
* counts as one transaction and one manager.
* Save the beginning of the 90-day period to avoid
* recalculating it each time.;
lastmgr=accountmanager;
mgrct=1;
tranct=1;
ninetyday=datepart(transaction_time)-90;
* Set the starting row to search for each transaction;
p=pstart;
* Loop through all rows for the customer and only count
* those that occur before the current transaction and
* after the 90-day period before it.;
* Note that the transactions are not necessarily sorted
* in chronological order but rather in groups by customer
* and manager, so we have to look through all of the
* customer's transactions each time.;
* DO UNTIL(0) means loop forever, so be careful that
* there is always a LEAVE statement executed.;
do until(0);
* p > pmax means the end of the transaction list, so stop.;
if p > pmax
then leave;
set transaction (keep=customerid accountmanager transaction_time
rename=(customerid=cust2 accountmanager=mgr2 transaction_time=tt2))
point=p;
* When customer ID changes, we are done with the loop.;
if cust2 ~= customerid
then leave;
else do;
* To be counted, the transaction needs to be within the
* 90-day period. Using "<" for the transaction time pre-
* vents counting the current transaction twice.;
if datepart(tt2) >= ninetyday and tt2 < transaction_time
then do;
tranct=tranct+1;
if mgr2 ~= lastmgr
then do;
mgrct=mgrct+1;
lastmgr=mgr2;
end;
end;
end;
* Look at the next transaction.;
p=p+1;
end;
keep CustomerID AccountManager TransactionID Transaction_Time tranct mgrct;
RUN;
[EDIT] Here is a PROC SQL approach that works. It's by Tom, in answer to my question about how to create an elegant query to accomplish your task:
proc sql noprint ;
create table want as
select a.*
, count(distinct b.accountmanager) as mgrct
, count(*) as tranct
from transaction a
left join transaction b
on a.customerid = b.customerid
and b.transaction_time <= a.transaction_time
and datepart(a.transaction_time)-datepart(b.transaction_time)
between 0 and 90
group by 1,2,3,4
;
quit;
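To sanity-check the 90-day window logic outside SAS, here is a brute-force illustrative Python sketch (the function name and structure are my own; it compares full timestamps rather than DATEPART values, which agrees for this sample) that reproduces the requested counts:

```python
from datetime import datetime

def window_counts(rows, days=90):
    """Per transaction: count of same-customer transactions, and of
    distinct managers, in the trailing `days`-day window (inclusive
    of the current transaction)."""
    out = []
    for cust, _mgr, ts in rows:
        window = [(c, m, t) for c, m, t in rows
                  if c == cust and t <= ts and (ts - t).days <= days]
        out.append((len(window), len({m for _, m, _ in window})))
    return out

rows = [
    ("1111111111", "FA001", datetime(2016, 9, 8, 11, 19, 25)),
    ("1111111111", "FA001", datetime(2016, 10, 26, 8, 22, 49)),
    ("1111111111", "FA002", datetime(2016, 11, 4, 8, 5, 36)),
    ("1111111111", "FA003", datetime(2016, 11, 4, 17, 15, 52)),
    ("1111111111", "FA004", datetime(2016, 11, 25, 13, 4, 16)),
    ("1231231234", "FA005", datetime(2015, 8, 25, 8, 3, 29)),
    ("1231231234", "FA005", datetime(2015, 9, 16, 8, 24, 24)),
    ("1231231234", "FA008", datetime(2015, 9, 18, 14, 42, 29)),
]
print(window_counts(rows))
```

This is O(n^2) per customer, like the self-join, so it is only a correctness check, not a performance recommendation.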
I have a dataset with the first 4 columns and I want to create the last column. My dataset has millions of records.
ID  Date      Code  Event of Interest  Want to Create
1   1/1/2022  101   *                  201
1   1/1/2022  201   yes                201
1   1/1/2022  301   *                  201
1   1/1/2022  401   *                  201
2   1/5/2022  101   *                  301
2   1/5/2022  201   *                  301
2   1/5/2022  301   yes                301
I want to group records by ID and date. If one of the records in the grouping has a 'yes' in the event of interest variable, I want to assign that code to the entire grouping. I am using base SAS.
Any ideas?
Assuming that you will only have one yes value for each id and date, you can use a lookup table and merge them together. Here are a few ways to do it.
1. Self-merge
Simply merge the data onto itself where event = yes.
data want;
merge have
have(rename=(code = new_code
event = _event_)
where =(upcase(_event_) = 'YES')
)
;
by id date;
drop _event_;
run;
2. SQL Self-join
Same as above, but using a SQL inner join.
proc sql;
create table want as
select t1.*
, t2.code as new_code
from have as t1
INNER JOIN
have as t2
ON t1.id = t2.id
AND t1.date = t2.date
where upcase(t2.event) = 'YES'
;
quit;
3. Hash lookup table
This is more advanced but can be quite performant if you have the memory. Notice that it looks very similar to our merge statement in Option 1. We're creating a lookup table, loading it to memory, and using a hash join to pull values from that in-memory table. h.Find() will check the unique combination of (id, date) in the value read from the set statement against the hash table in memory. If a match is found, it will pull the value of new_code.
data want;
set have;
if(_N_ = 1) then do;
dcl hash h(dataset: "have(rename=(code= new_code)
where =(upcase(event) = 'YES')
)"
, hashexp:20);
h.defineKey('id', 'date');
h.defineData('new_code');
h.defineDone();
call missing(new_code);
end;
rc = h.Find();
drop rc;
run;
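The (id, date) lookup at the heart of all three options can be sketched outside SAS as well. Here is an illustrative Python version (variable names are my own) that mirrors the hash approach: build the lookup from the "yes" rows, then apply it to every row of the group:

```python
rows = [  # (id, date, code, event_of_interest)
    (1, "1/1/2022", 101, "*"),
    (1, "1/1/2022", 201, "yes"),
    (1, "1/1/2022", 301, "*"),
    (1, "1/1/2022", 401, "*"),
    (2, "1/5/2022", 101, "*"),
    (2, "1/5/2022", 201, "*"),
    (2, "1/5/2022", 301, "yes"),
]

# Build the lookup from the "yes" rows, keyed on (id, date) -- the same
# role the hash object's defineKey plays in the SAS step.
lookup = {(i, d): code for i, d, code, event in rows if event == "yes"}

# Apply the lookup to every row; rows with no "yes" in their group get None.
want = [(i, d, code, event, lookup.get((i, d))) for i, d, code, event in rows]
print(want)
```

As with the SAS hash, this assumes at most one "yes" per (id, date) group; a second "yes" would silently overwrite the first.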
You could just remember the last value of CODE you want for the group by using a double DOW loop.
In the first loop copy the code value to the new variable. The second loop can re-read the observations and write them out with the extra variable filled in.
data want;
do until (last.date);
set have;
by id date ;
if 'Event of Interest'n='yes' then 'Want to Create'n=code;
end;
do until (last.date);
set have;
by id date;
output;
end;
run;
I am working with crime data. Now, I have the following table crimes. Each row contains a specific crime (e.g. assault): the date it was committed (date) and a person-ID of the offender (person).
date person
------------------------------
02JAN2017 1
03FEB2017 1
04JAN2018 1 --> not to be counted (more than a year after 02JAN2017)
27NOV2017 2
28NOV2018 2 --> should not be counted (more than a year after 27NOV2017)
01MAY2017 3
24FEB2018 3
10OCT2017 4
I am interested in whether each person has committed (relapse=1) or not committed (relapse=0) another crime within 1 year after the first crime committed by the same person. Another condition is that the first crime has to be committed within a specific year (here 2017).
The result should therefore look like this:
date person relapse
------------------------------
02JAN2017 1 1
03FEB2017 1 1
04JAN2018 1 1
27NOV2017 2 0
28NOV2018 2 0
01MAY2017 3 1
24FEB2018 3 1
10OCT2017 4 0
Can anyone please give me a hint on how to do this in SAS?
Obviously, the real data are much larger, so I cannot do it manually.
One approach is to use DATA step by group processing.
The BY <var> statement sets up binary variables first.<var> and last.<var> that flag the first row in a group and the last row in a group.
You appear to be assigning the computed relapse flag over the entire group, and that kind of computation can be done with what SAS coders call a DOW loop: a loop with the SET statement inside it, followed by a second loop that assigns the computation to each row in the group.
The INTCK function can compute the number of years between two dates.
For example:
data want(keep=person date relapse);
* DOW loop computes assertion that relapse occurred;
relapse = 0;
do _n_ = 1 by 1 until (last.person);
set crimes; * <-------------- CRIMES;
by person date;
* check if persons first crime was in 2017;
if _n_ = 1 and year(date) = 2017 then _first = date;
* check if persons second crime was within 1 year of first;
if _n_ = 2 and _first then relapse = intck('year', _first, date, 'C') < 1;
end;
* at this point the relapse flag has been computed, and its value
* will be repeated for each row output;
* serial loop over same number of rows in the group, but
* read in through a second SET statement;
do _n_ = 1 to _n_;
set crimes; * <-------------- CRIMES;
output;
end;
run;
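For a quick cross-check of the relapse logic outside SAS, here is an illustrative Python sketch. Note one assumption: it uses a plain 365-day window for "within 1 year", whereas the SAS code above uses INTCK year counting:

```python
from datetime import date

def relapse_flags(crimes):
    """Flag all of a person's crimes with 1 when their first crime fell
    in 2017 and another crime followed within 365 days of it.
    (Assumption: a 365-day window stands in for INTCK's year counting.)"""
    by_person = {}
    for person, d in crimes:
        by_person.setdefault(person, []).append(d)
    flags = {}
    for person, dates in by_person.items():
        first = min(dates)
        flags[person] = int(first.year == 2017 and
                            any(0 < (d - first).days <= 365 for d in dates))
    return [(person, d, flags[person]) for person, d in crimes]

crimes = [
    (1, date(2017, 1, 2)), (1, date(2017, 2, 3)), (1, date(2018, 1, 4)),
    (2, date(2017, 11, 27)), (2, date(2018, 11, 28)),
    (3, date(2017, 5, 1)), (3, date(2018, 2, 24)),
    (4, date(2017, 10, 10)),
]
print(relapse_flags(crimes))
```

This reproduces the question's expected relapse column, including person 2, whose second crime falls 366 days after the first and so is not counted.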
The process would be more complex, with more bookkeeping variables, if the actual process is to classify different time frames of a person as either relapsed or reformed based on rules more nuanced than "1st in 2017 and next within 1 year".
I started using SAS relatively recently - I'm not by any means attempting to create perfect code here.
I'd sort the data by id/person and date first (date should be numeric), and then use RETAIN statements to check each row against the date of the first crime. It's not perfect, but if your data is good (no missing dates), it'll work, and it is easy to follow imho.
This only works if the first recorded act of crime is supposed to happen in 2017. If you have crimes happening in 2016 and want to check whether a crime is committed in 2017 and then check the relapse, this code is not going to work - but I think that is covered in the comments beneath your question.
data test;
input tmp_year $ 1-9 person;
datalines;
02JAN2017 1
03FEB2017 1
04JAN2018 1
27NOV2017 2
28NOV2018 2
01MAY2017 3
24FEB2018 3
10OCT2017 4
;
run;
data test2;
set test;
crime_date = input(tmp_year, date9.);
act_year = year(crime_date);
run;
proc sort data=test2;
by person crime_date ;
run;
data want;
set test2;
by person crime_date;
retain date_of_crime;
if first.person and act_year = 2017 then date_of_crime = crime_date;
else if first.person then call missing(date_of_crime);
* Use the CONTINUOUS ('C') method so "within 1 year" means a full
* 365/366 days rather than crossing a JAN1 boundary, and guard against
* a missing date_of_crime (missing values compare low in SAS);
if not first.person and not missing(date_of_crime)
and intck('YEAR', date_of_crime, crime_date, 'C') < 1
then relapse = 1;
else relapse = 0;
run;
The above code flags the act of crimes committed one year after an act of crime in 2017. You can then retrieve the unique persons with a proc sql statement, and join them with whatever dataset you have.
I have consumer panel data with weekly recorded spending at a retail store. The unique identifier is household ID. I would like to delete observations if more than five zeros occur in spending. That is, the household did not make any purchase for five weeks. Once identified, I will delete all observations associated with the household ID. Does anyone know how I can implement this procedure in SAS? Thanks.
I think proc SQL would be good here.
This could be done in a single step with a more complex subquery but it is probably better to break it down into 2 steps.
Count how many zeroes each household ID has.
Filter to only include household IDs that have five or fewer zeroes.
proc sql;
create table zero_cnt as
select distinct household_id,
sum(case when spending = 0 then 1 else 0 end) as num_zeroes
from original_data
group by household_id;
create table wanted as
select *
from original_data
where household_id in (select distinct household_id from zero_cnt where num_zeroes <= 5);
quit;
Edit:
If the zeroes have to be consecutive then the method of building the list of IDs to exclude is different.
* Sort by ID and date;
proc sort data = original_data out = sorted_data;
by household_id date;
run;
Use the LAG function to check the previous spending amounts.
More info on LAG here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm
data exclude;
set sorted;
by household_id;
array prev{*} _L1-_L4;
_L1 = lag(spending);
_L2 = lag2(spending);
_L3 = lag3(spending);
_L4 = lag4(spending);
* Create running count for the number of observations for each ID;
if first.household_id then spend_cnt = 0;
spend_cnt + 1;
* Check if current ID has at least 5 observations to check. If so, add up current spending and previous 4 and output if they are all zero/missing;
if spend_cnt >= 5 then do;
if spending + sum(of prev[*]) = 0 then output;
end;
keep household_id;
run;
Then just use a subquery or match merge to remove the IDs in the 'excluded' dataset.
proc sql;
create table wanted as
select *
from original_data
where household_id not in (select distinct household_id from excluded);
quit;
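If the consecutive-zeros rule is the one you need, the core detection logic can be sketched outside SAS like this (illustrative Python with made-up sample data):

```python
def households_with_zero_run(records, run_len=5):
    """Return household IDs whose spending contains `run_len` or more
    consecutive zero weeks. `records` maps id -> spending in week order."""
    flagged = set()
    for hid, spend in records.items():
        run = 0
        for amt in spend:
            run = run + 1 if amt == 0 else 0   # reset the run on any spend
            if run >= run_len:
                flagged.add(hid)
                break
    return flagged

panel = {
    "A": [5, 0, 0, 0, 0, 0, 3],   # five consecutive zeros -> exclude
    "B": [0, 2, 0, 0, 4, 0, 0],   # five zeros total, never consecutive -> keep
}
kept = {hid: s for hid, s in panel.items()
        if hid not in households_with_zero_run(panel)}
print(sorted(kept))
```

Household "B" illustrates why the consecutive-run rule and the simple zero count from the first PROC SQL step can give different answers.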
I have a credit card transaction dataset (let's call it "Trans") with transaction amount, zip code, and date. I have another dataset (let's call it "Key") that lists sales tax rates based on date and geocode. The Key dataset also includes a range of zip codes associated with each geocode represented by 2 variables: Zip Start and Zip End.
Because Geocodes don't align with zip codes, some of the zip code ranges overlap. If this happens, I want to use the lowest sales tax rate associated with the zip code shown in Trans.
Trans dataset:
TransAmount TransDate TransZip
$200 01/07/1998 90010
$12 02/09/2002 90022
Key dataset:
Geocode Rate StartDate EndDate ZipStart ZipEnd
1001 .0825 199701 200012 90001 90084
1001 .085 200101 200812 90001 90084
1002 .0825 199701 200012 90022 90024
1002 .08 200101 200812 90022 90024
Desired output:
TransAmount TransDate TransZip Rate
$200 01/07/1998 90010 .0825
$12 02/09/2002 90022 .08
I used this basic SQL code in SAS, but I run into the problem of overlapping zip codes.
proc sql;
create table output as
select a.*, b.zipstart, b.zipend, b.startdate, b.enddate, b.rate
from Trans.CA_Zip_Cd_Testing a left join Key.CA_rates b
on a.TransZip ge b.zipstart
and a.TransZip le b.zipend
and a.TransDate ge b.StartDate
and a.transDate le b.EndDate
;
quit;
Well, the easiest way to handle the query portion is to just add a subquery to get the minimum rate.
proc sql;
select t.transamount, t.transdate, t.transzip,
(select min(rate) from key where t.transzip between zipstart and zipend
and t.transdate between startdate and enddate) as rate
from trans t;
quit;
You could also do it as subquery and join on it.
The SAS SQL optimizer can be good sometimes. Other times, it can be a challenge. This code is a bit more complicated, but it will likely be faster, subject to a size constraint on your key table: it must fit in memory.
data key;
set key;
dummy_key=1;
run;
data want(drop=dummy_key geocode rate startDate endDate zipStart zipEnd rc i);
if _n_ = 1 then do;
if 0 then set key;
declare hash k (dataset:'key',multidata:'y');
k.defineKey('dummy_key');
k.defineData('geocode','rate','startdate','enddate','zipstart','zipend');
k.defineDone();
end;
call missing (of _all_);
set trans;
dummy_key=1;
rc = k.find();
do i=1 to 1000 while (rc=0);
transZipNum = input(transZip,8.); *converts character zip to number. if its already a number then remove;
zipStartNum = input(zipStart,8.);
zipEndNum = input(zipEnd,8.);
if startDate <= transDate <= endDate then do;
if zipStartNum <= transZipNum <= zipEndNum then do;
rate_out = min(rate_out,rate);
end;
end;
rc=k.find_next();
end;
run;
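To see the min-rate-over-overlapping-ranges logic in isolation, here is an illustrative Python sketch. One assumption: the YYYYMM date ranges in Key are expanded to full first-of-month / end-of-month dates:

```python
from datetime import date

# Key rows from the question, with the YYYYMM ranges expanded to full
# dates (an assumption about how the ranges are interpreted).
key = [
    (0.0825, date(1997, 1, 1), date(2000, 12, 31), 90001, 90084),
    (0.085,  date(2001, 1, 1), date(2008, 12, 31), 90001, 90084),
    (0.0825, date(1997, 1, 1), date(2000, 12, 31), 90022, 90024),
    (0.08,   date(2001, 1, 1), date(2008, 12, 31), 90022, 90024),
]

def min_rate(trans_date, trans_zip):
    """Lowest rate among key rows whose date and zip ranges cover the
    transaction; None when nothing matches."""
    rates = [r for r, s, e, z1, z2 in key
             if s <= trans_date <= e and z1 <= trans_zip <= z2]
    return min(rates) if rates else None

print(min_rate(date(1998, 1, 7), 90010))   # one matching range
print(min_rate(date(2002, 2, 9), 90022))   # overlapping ranges: lowest wins
```

The second call is the interesting case: zip 90022 falls in both the 90001-90084 and 90022-90024 ranges, and the lower of the two 2001-2008 rates is returned, matching the desired output.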
The question might be quite vague but I could not come up with a decent concise title.
I have data with id, date, amountA and amountB as my variables. The task is to pick the dates that are within 10 days of each other, check whether their amountA values are within 20% of each other, and if they are, keep the one with the highest amountB. I have used the code below to attempt this.
id date amountA amountB
1 1/15/2014 1000 79
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/05/2014 710 80
1 2/25/2014 720 50
This is what I need
id date amountA amountB
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/25/2014 720 50
I wrote this code, but the problem is that it is not automatic and has to be done on a case-by-case basis. I need a way to loop it so that it automatically outputs the results. I am no pro at looping and hence am stuck. Any help is greatly appreciated.
data test2;
set test1;
diff_days=abs(intck('days',first_dt,date));
if diff_days<=10 then flag=1;
else if diff_days>10 then flag=0;
run;
data test3 rem_test3;
set test2;
if flag=1 then output test3;
else output rem_test3;
run;
proc sort data=test3;
by id amountA;
run;
data all_within;
set test3;
by id amountA;
amtA_lag=lag1(amountA);
if first.id then
do;
counter=1;
flag1=1;
end;
if first.id=0 then
do;
counter+1;
diff=abs(amountA-amtA_lag);
if diff<(10/100*amountA) then flag1+1;
else flag1=0;
end;
if last.id and flag1=counter then output all_within;
run;
If I understand the problem correctly, you want to group together all consecutive records that have no gap of 10+ days between them and amountA within 20%?
Looping isn't your problem - no explicitly coded loop is needed to do this (or at least, the way I think of it). SAS does the data step loop for you.
What you want to do is:
Identify groups. A group is the consecutive records that you want to, among them, collapse to one row. It's not perfectly clear to me how amountA has to behave here - does the whole group need to have less than a maximum difference of 10%, or a record to next record difference of < 10%, or a (current highest amtB of group) < 10% - but you can easily identify all of these rules. Use a RETAINed variable to keep track of the previous amountA, previous date, highest amountB, date associated with the highest amountB, amountA associated with highest amountB.
When you find a record that doesn't fit in the current group, output a record with the values of the previous group.
You shouldn't need two steps for this, although you can if you want to see it more easily - this may be helpful for debugging your rules. Set it so that you have a GroupNum variable, which you RETAIN, and you increment that any time you see a record that causes a new group to start.
I had trouble figuring out the rules...but here is some code that checks each record against the previous for the criteria I think you want.
Data HAVE;
input id date :mmddyy10. amountA amountB ;
format date mmddyy10.;
datalines;
1 1/15/2014 1000 79
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/05/2014 710 80
1 2/25/2014 720 50
;
Proc Sort data=HAVE;
by id date;
Run;
Data WANT(drop=Prev_:);
Set HAVE;
Prev_Date=lag(date);
Prev_amounta=lag(amounta);
Prev_amountb=lag(amountb);
If not missing(prev_date);
If date-prev_date<=10 then do;
If (amounta-prev_amounta)/amounta<=.1 then do;
If amountb<prev_amountb then do;
Date=prev_date;
AmountA=prev_amounta;
AmountB=prev_amountb;
end;
end;
end;
Else delete;
Run;
Here is a method that I think should work. The basic approach is:
Find all the pairs of sufficiently close observations
Join the pairs with themselves to get all connected ids
Reduce the groups
Join to the original data and get the desired values
data have;
input
id
date :mmddyy10.
amountA
amountB;
format date mmddyy10.;
datalines;
1 1/15/2014 1000 79
2 1/16/2014 1100 81
3 1/30/2014 700 50
4 2/05/2014 710 80
5 2/25/2014 720 50
;
run;
/* Count the observations */
%let dsid = %sysfunc(open(have));
%let nobs = %sysfunc(attrn(&dsid., nobs));
%let rc = %sysfunc(close(&dsid.));
/* Output any connected pairs */
data map;
array vals[3, &nobs.] _temporary_;
set have;
/* Put all the values in an array for comparison */
vals[1, _N_] = id;
vals[2, _N_] = date;
vals[3, _N_] = amountA;
/* Output all pairs of ids which form an acceptable pair */
do i = 1 to _N_;
if
abs(vals[2, i] - date) < 10 and
abs((vals[3, i] - amountA) / amountA) < 0.2
then do;
id2 = vals[1, i];
output;
end;
end;
keep id id2;
run;
proc sql;
/* Reduce the connections into groups */
create table groups as
select
a.id,
min(min(a.id, a.id2, b.id)) as group
from map as a
left join map as b
on a.id = b.id2
group by a.id;
/* Get the final output */
create table lookup (where = (amountB = maxB)) as
select
have.*,
groups.group,
max(have.amountB) as maxB
from have
left join groups
on have.id = groups.id
group by groups.group;
quit;
The code works for the example data. However, the group reduction is insufficient for more complicated data: a single self-join over the pairs will not fully collapse longer chains into one group. Approaches for finding all the subgraphs (connected components) given a set of edges are well documented, including solutions using SAS/OR.
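One general fix for the reduction step is a connected-components (union-find) pass over the pairs. Here is an illustrative Python sketch that groups rows by the literal rules (within 10 days and amountA within 20%) and keeps the max-amountB row per group; note that these literal rules do not exactly reproduce the asker's expected output, which the answers above already flag as ambiguous:

```python
def group_and_pick(rows, day_window=10, pct=0.2):
    """Union-find grouping of rows whose dates are within `day_window`
    days and whose amountA values are within `pct` of each other, then
    keep the max-amountB row per group."""
    parent = list(range(len(rows)))

    def find(x):                      # with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            di, ai, _ = rows[i]
            dj, aj, _ = rows[j]
            if abs(di - dj) < day_window and abs(ai - aj) / aj < pct:
                parent[find(i)] = find(j)   # union the pair

    best = {}
    for idx, row in enumerate(rows):
        g = find(idx)
        if g not in best or row[2] > best[g][2]:   # highest amountB wins
            best[g] = row
    return sorted(best.values())

# (day-of-year, amountA, amountB) from the question's example
rows = [(15, 1000, 79), (16, 1100, 81), (30, 700, 50), (36, 710, 80), (56, 720, 50)]
print(group_and_pick(rows))
```

Unlike the single self-join, union-find collapses arbitrarily long chains of pairwise-connected rows into one group.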