sas: count number of rows according to date - sas

I am working with crime data. Now, I have the following table crimes. Each row contains a specific crime (e.g. assault): the date it was committed (date) and a person-ID of the offender (person).
date person
------------------------------
02JAN2017 1
03FEB2017 1
04JAN2018 1 --> not to be counted (more than a year after 02JAN2017)
27NOV2017 2
28NOV2018 2 --> should not be counted (more than a year after 27NOV2017)
01MAY2017 3
24FEB2018 3
10OCT2017 4
I am interested in whether each person has committed (relapse=1) or not committed (relapse=0) another crime within 1 year after the first crime committed by the same person. Another condition is that the first crime has to be committed within a specific year (here 2017).
The result should therefore look like this:
date person relapse
------------------------------
02JAN2017 1 1
03FEB2017 1 1
04JAN2018 1 1
27NOV2017 2 0
28NOV2018 2 0
01MAY2017 3 1
24FEB2018 3 1
10OCT2017 4 0
Can anyone please give me a hint on how to do this in SAS?
Obviously, the real data are much larger, so I cannot do it manually.

One approach is to use DATA step by group processing.
The BY <var> statement sets up binary variables first.<var> and last.<var> that flag the first row in a group and the last row in a group.
You appear to be assigning the computed relapse flag over the entire group, and that kind of computation can be done with what SAS coders call a DOW loop -- a loop with the SET statement inside loop, with a follow up loop that assigns the computation to each row in the group.
The INTCK function can compute the number of years between two dates.
For example:
data want(keep=person date relapse);
* DOW loop computes assertion that relapse occurred;
relapse = 0;
do _n_ = 1 by 1 until (last.person);
set crimes; * <-------------- CRIMES;
by person date;
* check if persons first crime was in 2017;
if _n_ = 1 and year(date) = 2017 then _first = date;
* check if persons second crime was within 1 year of first;
if _n_ = 2 and _first then relapse = intck('year', _first, date, 'C') < 1;
end;
* at this point the relapse flag has been computed, and its value
* will be repeated for each row output;
* serial loop over same number of rows in the group, but
* read in through a second SET statement;
do _n_ = 1 to _n_;
set crimes; * <-------------- CRIMES;
output;
end;
run;
The process would be more complex, with more bookkeeping variables, if the actual process is to classify different time frames of a person as either relapsed or reformed based on rules more nuanced than "1st in 2017 and next within 1 year".

I started using sas relatively recent - I'm not by any means attempting to create perfect code here.
I'd sort the data by id/person and date first (date should be numeric), and then use retain statements check against the date of the first crime. It's not perfect, but if your data is good (no missing dates), it'll work, and it is easy to follow imho.
This only works if the first record and act of crime is supposed to happen in 2017. If you have crimes happening in 2016, and want to check whether 'a crime' is committed in 2017 and then check the relapse, then this code is not going to work - but I think that is covered in the comments beneath your question.
data test;
input tmp_year $ 1-9 person;
datalines;
02JAN2017 1
03FEB2017 1
04JAN2018 1
27NOV2017 2
28NOV2018 2
01MAY2017 3
24FEB2018 3
10OCT2017 4
;
run;
data test2;
set test;
crime_date = input(tmp_year, date9.);
act_year = year(crime_date);
run;
proc sort data=test2;
by person crime_date ;
run;
data want;
set test2;
by person crime_date;
retain date_of_crime;
if first.person and act_year = 2017 then date_of_crime = crime_date;
else if first.person then call missing(date_of_crime);
if intck('YEAR', date_of_crime, crime_date) =< 1 and not first.person
then relapse = 1;
else relapse = 0;
run;
The above code flags the act of crimes committed one year after an act of crime in 2017. You can then retrieve the unique persons with a proc sql statement, and join them with whatever dataset you have.

Related

SAS group by counters per variable - primary key creation

I have some data which needs to be split into 12 or so different groups, there is no key and the order the data is in is important.
The data has a number of groups and those groups have singular and / or nested groups within that. Each group will be split out as the data is in a hierarchical format. so each "GROUP" then has its own format which then all needs to be joined up on one line (or many) rows.
Sample data file:
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","REMVE","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""
The hierarchy that should exist when data is input. I am thinking there could be several tables that can be joined together later. (numbers for illustration of parent child levels)
1. Transaction [TRANS]
1.1. Meter Point [MTPNT]
1.1.1. Asset [ASSET]
1.1.1.1. Meter [METER]
1.1.1.2. Converter [CONVE]
1.1.1.3. Register Details [REGST]
1.1.1.3.1. Reading [READG]
1.1.1.4. Market Participant [MKPRT]
1.1.1.5. Name [NAME]
1.1.1.5.1. Address [ADDRS]
1.1.1.5.2. Contact Mechanism [CONTM]
1.2. Appointment [APPNT]
1.3. Name [NAME]
1.3.1. Address [ADDRS]
1.3.2. Contact Mechanism [CONTM]
1.4. Market Participant [MKPRT]
The industry GAS data, so in this flow you can have many ASSET per MTPNT, and those many ASSET can have many REGST because this is where the meter reading is kept for READG
I have tried using by groups and iterative first. processing, but i have not worked with this type of data before. I need a way to split create a key per grouping, which when split up and the fields are defined, can be joined back together.
I have tried manipulating the infile so that all the data appears on one line per TRANS, but then i still have the issue of applying the fields, and ordering is paramount.
I have managed to get a few keys for some of the groups, but after splitting they dont quite join back together.
data TRANS;
set mpancreate_a;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "TRANS" then
TRANSKey+1;
end;
run;
data TRANS;
set TRANS;
TRANSKey2 + 1;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "TRANS" then
TRANSKEY2=1;
end;
run;
data MTPNT;
set TRANS;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "MTPNT" then
MTPNTKEY+1;
end;
run;
data MTPNT;
set MTPNT;
by MTPNTKEY NOTSORTED;
if first.MTPNTKEY and DataItmGrp = "MTPNT" then
MTPNTKEY2=0;
MTPNTKEY2+1;
run;
data ASSET;
set MTPNT;
IF MTPNTKEY = 0 THEN
MTPNTKEY2=0;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "ASSET" then
ASSETKEY+1;
end;
run;
data ASSET;
set ASSET;
by ASSETKEY NOTSORTED;
if first.ASSETKEY and DataItmGrp = "ASSET" then
ASSETKEY2=0;
ASSETKEY2+1;
IF ASSETKEY =0 THEN
ASSETKEY2=0;
run;
i want a counter for each group found, and a retained counter for that particular group - but i cannot work out how to get in and out of the groupings based on the hierarchy above
i'm hoping that once i have these keys, i can split the data by group and then left join back together
_n_ TRANS TRANS2 MTPNT MTPNT2
TRANS 1 1 0 0 0
MTPNT 2 2 1 1 1
ASSET 3 3 1 2 1
METER 4 4 1 3 1
READG 5 5 1 4 1
MTPNT 6 6 1 1 2
ASSET 7 7 1 2 2
METER 8 8 1 3 2
READG 9 9 1 4 2
APPNT 10 10 1 5 2
TRANS 11 1 2 6 2
MTPNT 12 2 2 1 3
ASSET 13 3 2 2 3
METER 14 4 2 3 3
READG 15 5 2 4 3
MTPNT 16 6 2 1 4
ASSET 17 7 2 2 4
METER 18 8 2 3 4
READG 19 9 2 4 4
APPNT 20 10 2 5 4
The input of hierarchical data from a data file that has no definitive markers is problematic. The best suggestion I have is to understand what are the salient values you want to extract and in what context do you want to know them. For this problem a simplest first approach would be to have a single monolithic table with categorical variables to capture the path that descends to the salient value (meter reading).
A more complex situation would be the first token in each line drives the input for that line and the output table it belongs to. Since there are no landmarks as to hierarchy absolute or relative position (as in the NAME and MKPRT) there is no 100% confident way to place them in the hierarchy and that can also affect the placement of items read-in from subsequent data lines.
Depending on the true complexity and adherence to rules in the real world you may or may not 'miss out' the reading of some values.
Suppose there is the simpler goal of just getting the meter readings.
data want;
length tier level1-level6 $8 path $64 meterReadingString $8 dummy $1;
retain level1-level5 path;
attrib readingdate informat=yymmdd10. format=yymmdd10.;
infile cards dsd missover;
input #1 tier #; * held input - dont advance read line yet;
if tier="TRANS" then do;
level1 = tier;
call missing (of level2-level6);
path = catx("/", of level:);
end;
if tier="MTPNT" and path="TRANS" then do;
level2 = tier;
call missing (of level3-level6);
path = catx("/", of level:);
end;
if tier="ASSET" and path="TRANS/MTPNT" then do;
level3 = tier;
call missing (of level4-level6);
path = catx("/", of level:);
end;
if tier="METER" and path="TRANS/MTPNT/ASSET" then do;
level4 = tier;
call missing (of level5-level6);
path = catx("/", of level:);
end;
if tier="REGST" and path="TRANS/MTPNT/ASSET/METER" then do;
level5 = tier;
call missing (of level6-level6);
path = catx("/", of level:);
end;
if tier="READG" and path="TRANS/MTPNT/ASSET/METER/REGST" then do;
level6 = tier;
path = catx("/", of level:);
input #1 tier readingdate dummy meterReadingString #; * reread line according to tier;
meterReading = input(meterReadingString, best12.);
if path = "TRANS/MTPNT/ASSET/METER/REGST/READG" then OUTPUT;
end;
datalines;
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","REMVE","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""
run;
You can use this as the basis of a more complicated reader that has a different output <tier> data set for each tier or path to tier encountered. You would need a different input statement per tier, similar to how READG is read.

use data step generate next observation

Case 1
Suppose the data are sorted by year then by month (always have 3 observations in data).
Year Month Index
2014 11 1.1
2014 12 1.5
2015 1 1.2
I need to copy the Index of last month to new observation
Year Month Index
2014 11 1.1
2014 12 1.5
2015 1 1.2
2015 2 1.2
Case 2
Year is removed from data. So we only have Month and Index.
Month Index
1 1.2
11 1.1
12 1.5
Data is always collected from consecutive 3 months. So 1 is the last month.
Still, ideal output is
Month Index
1 1.2
2 1.2
11 1.1
12 1.5
I solve it by creating another dataset only contains Month (1,2...12). Then right join the original dataset twice. But I think there's more elegant way to deal with this.
Case 1 can be a straight-forward data step. Add end=eof to the set statement to initialize a variable eof that returns value 1 when the data step is reading the last row of the data set. An output statement in the data step outputs a row during each iteration. If eof=1, a do block runs that increments the month by 1 and outputs another row.
data want;
set have end=eof;
output;
if eof then do;
month=mod(month+1,12);
output;
end;
run;
For case 2, I would switch to an sql solution. Self join the table to itself on month, incremented by 1 in the second table. Use the coalesce function to keep the values from the existing table if it exists. If not, use the values from the second table. Since a case crossing December-January will produce 5 months, limit the output to four rows using the outobs= option in proc sql to exclude the unwanted second January.
proc sql outobs=4;
create table want as
select
coalesce(t1.month,mod(t2.month+1,12)) as month,
coalesce(t1.index,t2.index) as index
from
have t1
full outer join have t2
on t1.month = t2.month+1
order by
coalesce(t1.month,t2.month+1)
;
quit;

Create a loop in SAS which filters 2 variables simultaneously

The question might be quite vague but I could not come up with a decent concise title.
I have data where there are id ,date, amountA and AmtB as my variables. The task is to pick the dates that are within 10 days of each other and then see if their amountA are within 20% and if they are then pick the one with highest amountB. I have used to this code to achieve this
id date amountA amountB
1 1/15/2014 1000 79
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/05/2014 710 80
1 2/25/2014 720 50
This is what I need
id date amountA amountB
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/25/2014 720 50
I wrote this code but the problem with this code is its not automatic and has to be done on a case to case basis.I need a way to loop it so that it automatically outputs the results.I am no pro at looping and hence am stuck.Any help is greatly appreciated
data test2;
set test1;
diff_days=abs(intck('days',first_dt,date));
if diff_days<=10 then flag=1;
else if diff_days>10 then flag=0;
run;
data test3 rem_test3;
set test2;
if flag=1 then output test3;
else output rem_test3;
run;
proc sort data=test3;
by id amountA;
run;
data all_within;
set test3;
by id amountA;
amtA_lag=lag1(amountA);
if first.id then
do;
counter=1;
flag1=1;
end;
if first.id=0 then
do;
counter+1;
diff=abs(amountA-amtA_lag);
if diff<(10/100*amountA) then flag1+1;
else flag1=0;
end;
if last.stay and flag1=counter then output all_within;
run;
If I understand the problem correctly, you want to group all records together that have (no skip of 10+ days) and (amt A w/in 20%)?
Looping isn't your problem - no explicitly coded loop is needed to do this (or at least, the way I think of it). SAS does the data step loop for you.
What you want to do is:
Identify groups. A group is the consecutive records that you want to, among them, collapse to one row. It's not perfectly clear to me how amountA has to behave here - does the whole group need to have less than a maximum difference of 10%, or a record to next record difference of < 10%, or a (current highest amtB of group) < 10% - but you can easily identify all of these rules. Use a RETAINed variable to keep track of the previous amountA, previous date, highest amountB, date associated with the highest amountB, amountA associated with highest amountB.
When you find a record that doesn't fit in the current group, output a record with the values of the previous group.
You shouldn't need two steps for this, although you can if you want to see it more easily - this may be helpful for debugging your rules. Set it so that you have a GroupNum variable, which you RETAIN, and you increment that any time you see a record that causes a new group to start.
I had trouble figuring out the rules...but here is some code that checks each record against the previous for the criteria I think you want.
Data HAVE;
input id date :mmddyy10. amountA amountB ;
format date mmddyy10.;
datalines;
1 1/15/2014 1000 79
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/05/2014 710 80
1 2/25/2014 720 50
;
Proc Sort data=HAVE;
by id date;
Run;
Data WANT(drop=Prev_:);
Set HAVE;
Prev_Date=lag(date);
Prev_amounta=lag(amounta);
Prev_amountb=lag(amountb);
If not missing(prev_date);
If date-prev_date<=10 then do;
If (amounta-prev_amounta)/amounta<=.1 then;
If amountb<prev_amountb then do;
Date=prev_date;
AmountA=prev_amounta;
AmountB=prev_amountb;
end;
end;
Else delete;
Run;
Here is a method that I think should work. The basic approach is:
Find all the pairs of sufficiently close observations
Join the pairs with themselves to get all connected ids
Reduce the groups
Join to the original data and get the desired values
data have;
input
id
date :mmddyy10.
amountA
amountB;
format date mmddyy10.;
datalines;
1 1/15/2014 1000 79
2 1/16/2014 1100 81
3 1/30/2014 700 50
4 2/05/2014 710 80
5 2/25/2014 720 50
;
run;
/* Count the observations */
%let dsid = %sysfunc(open(have));
%let nobs = %sysfunc(attrn(&dsid., nobs));
%let rc = %sysfunc(close(&dsid.));
/* Output any connected pairs */
data map;
array vals[3, &nobs.] _temporary_;
set have;
/* Put all the values in an array for comparison */
vals[1, _N_] = id;
vals[2, _N_] = date;
vals[3, _N_] = amountA;
/* Output all pairs of ids which form an acceptable pair */
do i = 1 to _N_;
if
abs(vals[2, i] - date) < 10 and
abs((vals[3, i] - amountA) / amountA) < 0.2
then do;
id2 = vals[1, i];
output;
end;
end;
keep id id2;
run;
proc sql;
/* Reduce the connections into groups */
create table groups as
select
a.id,
min(min(a.id, a.id2, b.id)) as group
from map as a
left join map as b
on a.id = b.id2
group by a.id;
/* Get the final output */
create table lookup (where = (amountB = maxB)) as
select
have.*,
groups.group,
max(have.amountB) as maxB
from have
left join groups
on have.id = groups.id
group by groups.group;
quit;
The code works for the example data. However, the group reduction is insufficient for more complicated data. Fortunately, approaches for finding all the subgraphs given a set of edges can be found here, here, here or here (using SAS/OR).

SAS and Date operations

I've tried googling and I haven't turned up any luck to my current problem. Perhaps someone can help?
I have a dataset with the following variables:
ID, AccidentDate
It's in long format, and each participant can have more than 1 accident, with participants having not necessarily an equal number of accidents. Here is a sample:
Code:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
What I need to do is count the number of days between each individuals First and Last recorded accident date. I've been playing around with first.byvariable and last.byvariable commands, but I'm just not making any progress. Any tips? or Any links to a source?
Thank you,
Also. I posted this originally over at Talkstats.com (cross-posting etiquette)
Not sure what you mean by in long format
long format should be like this
id accident date
1 1 1JAN2001
1 2 1JAN2002
2 1 1JAN2001
2 2 1JAN2003
Then you can try proc sql like this
Proc Sql;
select id, max(date)-min(date) from table;
group by id;
run;
By long format I think you mean it is a "stacked" dataset with each person having multiple observations (instead of one row per person with multiple columns). In your situation, it is probably the correct way to have the data stored.
To do it with data steps, I think you are on the right track with first. and last.
I would do it like this:
proc sort data=accidents;
by id date;
run;
data accidents; set accidents;
by id accident; *this is important-it makes first. and last. available for use;
retain first last;
if first.date then first=date;
if last.date then last=date;
run;
Now you have a dataset with ID, Date, Date of First Accident, Date of Last Accident
You could calculate the time between with
data accidents; set accidents;
timebetween = last-first;
run;
You can't do this directly in the same data step since the "last" variable won't be accurate until it has parsed the last line and as such the data will be wrong for anything but the last accident observation.
Assuming the data looks like:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
You have the right idea. Retain the first accident date in order to have access to both the first and last dates. Then calculate the difference.
proc sort data=accidents;
by id accidentdate
run;
data accidents;
set accidents;
by id;
retain first_accidentdate;
if first.id then first_accidentdate = accidentdate;
if last.id then do;
daysbetween = date - first_accidentdate
output;
end;
run;

Dynamic Where in SAS - Conditional Operation based on another set

To my disappointment, the following code, which sums up 'value' by week from 'master' for weeks which appear in 'transaction' does not work -
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for ever line of "transaction".
(The reason you got the error was because you used where instead of if. In SAS, the sub-setting where in a data step is only aware of columns already existing within the data set it's sub-setting. They keep two options because there where is faster when it's usable.)
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't suggest below solution (would like #Jeff's SQL solution or even a hash better). But just for playing with data step logic, I think below approach would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so only makes one pass of each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
1003 data _null_;
1004 set transaction;
1005 by change_week;
1006
1007 do until(last.week and _found);
1008 set master;
1009 by week;
1010
1011 if week=change_week then do;
1012 sum = sum(value, sum);
1013 _found=1;
1014 end;
1015 end;
1016
1017 *file print;
1018 put week= sum= ;
1019 run;
week=1 sum=60
week=3 sum=75