I have some data which needs to be split into 12 or so different groups. There is no key, and the order the data is in is important.
The data contains a number of groups, and those groups can have single and/or nested groups within them. Each group will be split out, as the data is in a hierarchical format, so each "GROUP" then has its own record layout, which all needs to be joined back up into one (or many) rows.
Sample data file:
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","REMVE","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""
This is the hierarchy that should exist when the data is read in. I am thinking there could be several tables that can be joined together later (the numbers illustrate the parent/child levels):
1. Transaction [TRANS]
1.1. Meter Point [MTPNT]
1.1.1. Asset [ASSET]
1.1.1.1. Meter [METER]
1.1.1.2. Converter [CONVE]
1.1.1.3. Register Details [REGST]
1.1.1.3.1. Reading [READG]
1.1.1.4. Market Participant [MKPRT]
1.1.1.5. Name [NAME]
1.1.1.5.1. Address [ADDRS]
1.1.1.5.2. Contact Mechanism [CONTM]
1.2. Appointment [APPNT]
1.3. Name [NAME]
1.3.1. Address [ADDRS]
1.3.2. Contact Mechanism [CONTM]
1.4. Market Participant [MKPRT]
This is industry gas data, so in this flow you can have many ASSET per MTPNT, and each of those ASSET can have many REGST, because that is where the meter reading for READG is kept.
I have tried using BY groups and iterative FIRST. processing, but I have not worked with this type of data before. I need a way to create a key per grouping so that, once the data is split up and the fields are defined, everything can be joined back together.
I have tried manipulating the infile so that all the data appears on one line per TRANS, but then I still have the issue of applying the fields, and ordering is paramount.
I have managed to get a few keys for some of the groups, but after splitting they don't quite join back together.
data TRANS;
   set mpancreate_a;
   by DataItmGrp NOTSORTED;
   if first.DataItmGrp then do;
      if DataItmGrp = "TRANS" then TRANSKey + 1;
   end;
run;

data TRANS;
   set TRANS;
   TRANSKey2 + 1;
   by DataItmGrp NOTSORTED;
   if first.DataItmGrp then do;
      if DataItmGrp = "TRANS" then TRANSKEY2 = 1;
   end;
run;

data MTPNT;
   set TRANS;
   by DataItmGrp NOTSORTED;
   if first.DataItmGrp then do;
      if DataItmGrp = "MTPNT" then MTPNTKEY + 1;
   end;
run;

data MTPNT;
   set MTPNT;
   by MTPNTKEY NOTSORTED;
   if first.MTPNTKEY and DataItmGrp = "MTPNT" then MTPNTKEY2 = 0;
   MTPNTKEY2 + 1;
run;

data ASSET;
   set MTPNT;
   if MTPNTKEY = 0 then MTPNTKEY2 = 0;
   by DataItmGrp NOTSORTED;
   if first.DataItmGrp then do;
      if DataItmGrp = "ASSET" then ASSETKEY + 1;
   end;
run;

data ASSET;
   set ASSET;
   by ASSETKEY NOTSORTED;
   if first.ASSETKEY and DataItmGrp = "ASSET" then ASSETKEY2 = 0;
   ASSETKEY2 + 1;
   if ASSETKEY = 0 then ASSETKEY2 = 0;
run;
I want a counter for each group found, and a retained counter for that particular group, but I cannot work out how to get in and out of the groupings based on the hierarchy above.
I'm hoping that once I have these keys, I can split the data by group and then left join everything back together. The table below shows the keys I am after; a sketch of the kind of step I am imagining follows it.
Group  _n_  TRANS  TRANS2  MTPNT  MTPNT2
TRANS 1 1 0 0 0
MTPNT 2 2 1 1 1
ASSET 3 3 1 2 1
METER 4 4 1 3 1
READG 5 5 1 4 1
MTPNT 6 6 1 1 2
ASSET 7 7 1 2 2
METER 8 8 1 3 2
READG 9 9 1 4 2
APPNT 10 10 1 5 2
TRANS 11 1 2 6 2
MTPNT 12 2 2 1 3
ASSET 13 3 2 2 3
METER 14 4 2 3 3
READG 15 5 2 4 3
MTPNT 16 6 2 1 4
ASSET 17 7 2 2 4
METER 18 8 2 3 4
READG 19 9 2 4 4
APPNT 20 10 2 5 4
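Something like this single pass is the shape of what I am imagining: bump a retained counter when a group tag appears, and reset the counters of that group's children (a sketch only, using the mpancreate_a and DataItmGrp names from my code above; the WHEN branches for the remaining tiers are omitted):

data keyed;
   set mpancreate_a;
   select (DataItmGrp);
      when ("TRANS") do;
         TRANSKey + 1;   /* new transaction          */
         MTPNTKey = 0;   /* reset the child counters */
         ASSETKey = 0;
      end;
      when ("MTPNT") do;
         MTPNTKey + 1;
         ASSETKey = 0;
      end;
      when ("ASSET") do;
         ASSETKey + 1;
      end;
      otherwise;         /* METER, REGST, READG, ... inherit the current keys */
   end;
run;

That way every detail row carries the keys of its ancestors, which is what the split and left join needs.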
The input of hierarchical data from a data file that has no definitive markers is problematic. The best suggestion I have is to work out which salient values you want to extract and in what context you want to know them. For this problem, the simplest first approach is a single monolithic table with categorical variables that capture the path descending to the salient value (the meter reading).
A more complex approach is to let the first token in each line drive both the input for that line and the output table it belongs to. Since there are no landmarks for absolute or relative position in the hierarchy (as with NAME and MKPRT), there is no way to place them in the hierarchy with 100% confidence, and that can also affect the placement of items read from subsequent data lines.
Depending on the true complexity of, and adherence to rules in, the real-world data, you may or may not miss the reading of some values.
Suppose there is the simpler goal of just getting the meter readings.
data want;
   length tier level1-level6 $8 path $64 meterReadingString $8 dummy $1;
   retain level1-level5 path;
   attrib readingdate informat=yymmdd10. format=yymmdd10.;
   infile cards dsd missover;
   input tier @;   * held input - don't advance to the next line yet;
   if tier = "TRANS" then do;
      level1 = tier;
      call missing(of level2-level6);
      path = catx("/", of level:);
   end;
   if tier = "MTPNT" and path = "TRANS" then do;
      level2 = tier;
      call missing(of level3-level6);
      path = catx("/", of level:);
   end;
   if tier = "ASSET" and path = "TRANS/MTPNT" then do;
      level3 = tier;
      call missing(of level4-level6);
      path = catx("/", of level:);
   end;
   if tier = "METER" and path = "TRANS/MTPNT/ASSET" then do;
      level4 = tier;
      call missing(of level5-level6);
      path = catx("/", of level:);
   end;
   if tier = "REGST" and path = "TRANS/MTPNT/ASSET/METER" then do;
      level5 = tier;
      call missing(level6);
      path = catx("/", of level:);
   end;
   if tier = "READG" and path = "TRANS/MTPNT/ASSET/METER/REGST" then do;
      level6 = tier;
      path = catx("/", of level:);
      input @1 tier readingdate dummy meterReadingString @;   * reread the held line according to tier;
      meterReading = input(meterReadingString, best12.);
      if path = "TRANS/MTPNT/ASSET/METER/REGST/READG" then OUTPUT;
   end;
datalines;
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","REMVE","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""
run;
You can use this as the basis of a more complicated reader that has a different output <tier> data set for each tier, or path to tier, encountered. You would need a different INPUT statement per tier, similar to how READG is read.
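For instance, a two-tier sketch of that idea (transKey, transRef and serialNo are my illustrative names, and the ASSET field positions are guesses from the sample rows rather than the real layout spec):

data trans(keep=transKey transRef)
     asset(keep=transKey assetKey serialNo);
   length tier $8 transRef $20 dum1-dum8 $14 serialNo $20;
   infile datalines dsd missover;
   input tier @;                                /* held input: peek at the tier tag */
   if tier = "TRANS" then do;
      transKey + 1;                             /* running key, retained            */
      input @1 tier transRef @;                 /* reread with the TRANS layout     */
      output trans;
   end;
   else if tier = "ASSET" then do;
      assetKey + 1;
      input @1 tier (dum1-dum8) ($) serialNo @; /* serial number is the 10th field  */
      output asset;
   end;
   input;                                       /* release the held line            */
datalines;
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
;
run;

The per-tier tables can then be joined back together on the running transKey.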
The question might be quite vague, but I could not come up with a decent concise title.
I have data with id, date, amountA and amountB as my variables. The task is to pick the dates that are within 10 days of each other, then check whether their amountA values are within 20% of each other, and if they are, pick the row with the highest amountB.
id date amountA amountB
1 1/15/2014 1000 79
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/05/2014 710 80
1 2/25/2014 720 50
This is what I need
id date amountA amountB
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/25/2014 720 50
I wrote the code below, but the problem with it is that it's not automatic and has to be adapted case by case. I need a way to loop it so that it outputs the results automatically. I am no pro at looping and hence am stuck. Any help is greatly appreciated.
data test2;
   set test1;
   diff_days = abs(intck('day', first_dt, date));
   if diff_days <= 10 then flag = 1;
   else if diff_days > 10 then flag = 0;
run;

data test3 rem_test3;
   set test2;
   if flag = 1 then output test3;
   else output rem_test3;
run;
proc sort data=test3;
by id amountA;
run;
data all_within;
   set test3;
   by id amountA;
   amtA_lag = lag1(amountA);
   if first.id then do;
      counter = 1;
      flag1 = 1;
   end;
   if first.id = 0 then do;
      counter + 1;
      diff = abs(amountA - amtA_lag);
      if diff < (10/100 * amountA) then flag1 + 1;
      else flag1 = 0;
   end;
   if last.id and flag1 = counter then output all_within;
run;
If I understand the problem correctly, you want to group together all consecutive records that have no gap of 10+ days and amountA within 20%?
Looping isn't your problem - no explicitly coded loop is needed to do this (or at least, the way I think of it). SAS does the data step loop for you.
What you want to do is:
Identify groups. A group is the set of consecutive records that you want to collapse, among them, to one row. It's not perfectly clear to me how amountA has to behave here: does the whole group need to stay within a maximum difference of 10%, or each record within 10% of the next record, or each record within 10% of the amountA associated with the group's current highest amountB? But you can easily implement any of these rules. Use RETAINed variables to keep track of the previous amountA, the previous date, the highest amountB, the date associated with the highest amountB, and the amountA associated with the highest amountB.
When you find a record that doesn't fit in the current group, output a record with the values of the previous group.
You shouldn't need two steps for this, although you can use two if you want to see it more easily; that may be helpful for debugging your rules. Set it up so that you have a GroupNum variable, which you RETAIN, and increment it any time you see a record that causes a new group to start (see the sketch below).
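A minimal sketch of that GroupNum idea, assuming the simplest record-to-record rule (have, the 10-day gap and the 20% threshold come from the question; the prev_ variables and GroupNum are illustrative names):

data grouped;
   set have;
   by id;
   prev_date = lag(date);
   prev_amounta = lag(amounta);
   /* start a new group at each new id, after a 10+ day gap, or after a
      record-to-record jump in amountA of more than 20% */
   if first.id
      or date - prev_date > 10
      or abs(amounta - prev_amounta) > 0.2 * amounta
      then GroupNum + 1;
   drop prev_:;
run;

Collapsing each group to the row with the highest amountB is then just a sort by GroupNum and descending amountB, followed by an IF FIRST.GroupNum subset.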
I had trouble figuring out the rules, but here is some code that checks each record against the previous one for the criteria I think you want.
Data HAVE;
input id date :mmddyy10. amountA amountB ;
format date mmddyy10.;
datalines;
1 1/15/2014 1000 79
1 1/16/2014 1100 81
1 1/30/2014 700 50
1 2/05/2014 710 80
1 2/25/2014 720 50
;
Proc Sort data=HAVE;
by id date;
Run;
Data WANT(drop=Prev_:);
   Set HAVE;
   Prev_Date = lag(date);
   Prev_amounta = lag(amounta);
   Prev_amountb = lag(amountb);
   If not missing(prev_date);
   If date - prev_date <= 10 then do;
      If (amounta - prev_amounta) / amounta <= .1 then do;
         If amountb < prev_amountb then do;
            Date = prev_date;
            AmountA = prev_amounta;
            AmountB = prev_amountb;
         end;
      end;
   end;
   Else delete;
Run;
Here is a method that I think should work. The basic approach is:
Find all the pairs of sufficiently close observations
Join the pairs with themselves to get all connected ids
Reduce the groups
Join to the original data and get the desired values
data have;
input
id
date :mmddyy10.
amountA
amountB;
format date mmddyy10.;
datalines;
1 1/15/2014 1000 79
2 1/16/2014 1100 81
3 1/30/2014 700 50
4 2/05/2014 710 80
5 2/25/2014 720 50
;
run;
/* Count the observations */
%let dsid = %sysfunc(open(have));
%let nobs = %sysfunc(attrn(&dsid., nobs));
%let rc = %sysfunc(close(&dsid.));
/* Output any connected pairs */
data map;
array vals[3, &nobs.] _temporary_;
set have;
/* Put all the values in an array for comparison */
vals[1, _N_] = id;
vals[2, _N_] = date;
vals[3, _N_] = amountA;
/* Output all pairs of ids which form an acceptable pair */
do i = 1 to _N_;
if
abs(vals[2, i] - date) < 10 and
abs((vals[3, i] - amountA) / amountA) < 0.2
then do;
id2 = vals[1, i];
output;
end;
end;
keep id id2;
run;
proc sql;
/* Reduce the connections into groups */
create table groups as
select
a.id,
min(min(a.id, a.id2, b.id)) as group
from map as a
left join map as b
on a.id = b.id2
group by a.id;
/* Get the final output */
create table lookup (where = (amountB = maxB)) as
select
have.*,
groups.group,
max(have.amountB) as maxB
from have
left join groups
on have.id = groups.id
group by groups.group;
quit;
The code works for the example data. However, the group reduction is insufficient for more complicated data. Fortunately, approaches for finding all the subgraphs (connected components) given a set of edges are covered in several existing posts, including one using SAS/OR.
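For example, a fixed-point sketch of such a reduction (my own illustration, not taken from those posts; it assumes the map dataset from the step above, which includes self-pairs): symmetrise the edges, then repeatedly give every id the smallest group label among its neighbours until nothing changes.

proc sql;
   /* Symmetrise the edges so labels can flow in both directions */
   create table edges as
   select id, id2 from map
   union
   select id2 as id, id as id2 from map;
   /* Initial label: smallest neighbouring id (self-pairs included) */
   create table groups as
   select id, min(id2) as group
   from edges
   group by id;
quit;

%macro reduce_groups(maxiter=20);
   %do i = 1 %to &maxiter.;
      /* Propagate the smallest label across each edge */
      proc sql;
         create table groups_next as
         select e.id, min(g.group) as group
         from edges as e
         inner join groups as g
            on g.id = e.id2
         group by e.id;
      quit;
      proc compare base=groups compare=groups_next noprint;
      run;
      %if &sysinfo. = 0 %then %goto done;  /* labels stable: components found */
      proc datasets lib=work nolist;
         delete groups;
         change groups_next = groups;
      quit;
   %end;
   %done:
%mend reduce_groups;

%reduce_groups()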
I've tried googling and I haven't had any luck with my current problem. Perhaps someone can help?
I have a dataset with the following variables:
ID, AccidentDate
It's in long format, and each participant can have more than one accident, with participants not necessarily having an equal number of accidents. Here is a sample:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
What I need to do is count the number of days between each individual's first and last recorded accident dates. I've been playing around with first.byvariable and last.byvariable processing, but I'm just not making any progress. Any tips, or any links to a source?
Thank you.
Also: I posted this originally over at Talkstats.com (cross-posting etiquette).
Not sure what you mean by long format.
Long format should be like this:
id accident date
1 1 1JAN2001
1 2 1JAN2002
2 1 1JAN2001
2 2 1JAN2003
Then you can try PROC SQL like this:
proc sql;
   select id, max(date) - min(date) as days_between
   from table
   group by id;
quit;
By long format I think you mean it is a "stacked" dataset with each person having multiple observations (instead of one row per person with multiple columns). In your situation, it is probably the correct way to have the data stored.
To do it with data steps, I think you are on the right track with first. and last.
I would do it like this:
proc sort data=accidents;
   by id date;
run;

data accidents;
   set accidents;
   by id; *this is important - it makes first. and last. available for use;
   retain first last;
   if first.id then first = date;
   if last.id then last = date;
run;
Now you have a dataset with ID, Date, Date of First Accident, Date of Last Accident
You could calculate the time between with
data accidents; set accidents;
timebetween = last-first;
run;
You can't do this directly in the same data step, since the "last" variable won't be accurate until the step has parsed the last line for each ID; as such the value will be wrong for anything but the last accident observation.
Assuming the data looks like:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
You have the right idea. Retain the first accident date in order to have access to both the first and last dates. Then calculate the difference.
proc sort data=accidents;
   by id accidentdate;
run;

data accidents;
   set accidents;
   by id;
   retain first_accidentdate;
   if first.id then first_accidentdate = accidentdate;
   if last.id then do;
      daysbetween = accidentdate - first_accidentdate;
      output;
   end;
run;
To my disappointment, the following code, which sums up value by week from master for the weeks that appear in transaction, does not work:
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
   set transaction;
   do until(done);
      set master end=done;
      where week = change_week;
      sum = sum(value, sum);
   end;
   file print;
   put week= sum=;
run;
SAS complains, rightly, because it doesn't see change_week in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master dataset (suitably indexed, of course), given a transaction dataset... Does anyone know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
   set transaction;
   file print;
   do until(done);
      set master key = week end = done;
      /* Prevent implicit retain from the previous row if the key isn't
         found, or we've read past the last record for the current key */
      if _IORC_ ne 0 then do;
         _ERROR_ = 0;
         call missing(value);
      end;
      else sum = sum(value, sum);
   end;
   put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
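For instance, if master already exists without the index, a simple (by default non-unique) index can be added afterwards with PROC DATASETS:

proc datasets lib=work nolist;
   modify master;
   index create week;   /* simple index; allows duplicate key values */
quit;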
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
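A hedged sketch of that MODIFY variant (the update itself, zeroing value for each matched week, is purely illustrative):

data master;
   set transaction;
   do until (_iorc_ ne 0);
      modify master key = week;   /* keyed lookup via the index     */
      if _iorc_ = 0 then do;
         value = 0;               /* illustrative in-place change   */
         replace;                 /* rewrite the observation        */
      end;
      else _error_ = 0;           /* clear the key-not-found flag   */
   end;
run;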
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of master for every line of transaction.
(The reason you got the error was because you used WHERE instead of IF. In SAS, the subsetting WHERE in a data step is only aware of columns that already exist in the data set it's subsetting. Both options are kept because WHERE is faster when it's usable.)
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't suggest the solution below (I'd like #Jeff's SQL solution, or even a hash, better). But just for playing with data step logic, I think the approach below would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so it only makes one pass of each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
data _null_;
   set transaction;
   by change_week;

   do until(last.week and _found);
      set master;
      by week;

      if week = change_week then do;
         sum = sum(value, sum);
         _found = 1;
      end;
   end;

   *file print;
   put week= sum=;
run;
week=1 sum=60
week=3 sum=75
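For completeness, a sketch of the hash alternative mentioned above: pre-aggregate value by week from master into a hash table, then look up each transaction week (variable names follow the code above):

data _null_;
   if _n_ = 1 then do;
      /* one pass of master to build week -> sum(value) */
      declare hash h();
      h.defineKey("week");
      h.defineData("week", "sum");
      h.defineDone();
      do until (done);
         set master end=done;
         if h.find() ne 0 then sum = 0;  /* first time this week is seen */
         sum = sum + value;
         h.replace();                    /* store the running total      */
      end;
   end;
   set transaction;
   week = change_week;
   if h.find() = 0 then put week= sum=;
run;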