I have my data sorted by ID and date. I have converted the date to a single numeric value that preserves ordering (the year followed by the week of the year). I want to make a new variable that is a function of the minimum value in the finest partition. An example follows:
ID Start listen
1 201134 201138
1 201204 201150
2 200905 200910
2 201005 201020
I want something like
ID Start listen weekSincestart
1 201134 201138 4
1 201204 201150 54
2 200905 200910 5
2 201005 201020 15
All I'm doing is taking (listen - min(start)), but I am assuming min() takes the minimum start for a given ID. So I am asking whether there is a "by statement" for the min function.
In my opinion there is no need to convert your START and LISTEN values away from dates using the method you have.
I converted your data back into dates using INTNX, building the first day of the year from each START and LISTEN value and incrementing by the week number held in the same variables. The dates might not be exactly what you have in your dataset, but the result should be similar.
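For reference, the conversion I used looks something like this (a sketch; RAW is a hypothetical name for your incoming dataset of numeric YYYYWW values):
data have;
set raw; /* hypothetical input holding numeric YYYYWW values */
/* split YYYYWW into year and week, build 01JAN of that year, then advance by the week number */
start = intnx('week', mdy(1, 1, int(start / 100)), mod(start, 100));
listen = intnx('week', mdy(1, 1, int(listen / 100)), mod(listen, 100));
format start listen date9.;
run;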
The following should do what you want, if I understand you correctly.
DATA WANT2;
SET HAVE;
BY ID START;
RETAIN _START;
FORMAT _START DATE9.;
IF FIRST.ID THEN _START = START;
WEEKSINCESTART = INTCK("WEEK",_START,LISTEN);
RUN;
In this instance your sample is sorted; however, if you wish to conduct BY-statement processing to identify the first instance of a value in ID, you will need to sort your dataset first. The RETAIN statement holds a value across observations, and by using the BY statement we can specify when the retained value is altered. In this instance we want to alter the _START variable when the first instance of an ID is encountered. I use the underscore prefix because it makes it easier to drop these variables en masse if necessary. This value will not be replaced until the next ID begins, which means it carries forward to all subsequent observations for ID 1, and so on. The INTCK function measures the number of intervals, in this example the number of WEEKs, between period one and period two; in this instance, between the first START for each ID, captured in _START, and the LISTEN value of each observation.
The end result is:
ID START LISTEN _START WEEKSINCESTART
1 21AUG2011 18SEP2011 21AUG2011 4
1 29JAN2012 11DEC2011 21AUG2011 16
2 01FEB2009 08MAR2009 01FEB2009 5
2 31JAN2010 16MAY2010 01FEB2009 67
I hope this is useful.
Regards,
Scott
You could easily do it with proc sql:
proc sql;
create table RESULT as
select *, listen-min(start) as weekSincestart
from INPUT
group by id;
quit;
It will take the min of each ID group to calculate min(start).
And since you select variables that are neither listed in the GROUP BY nor wrapped in an aggregate function, PROC SQL will not collapse each group into one row; instead it remerges the summary statistic back onto every row of the group (the log will note that the query requires remerging summary statistics back with the original data).
Your question is a bit confusing. If you want just listen minus start (which is what your 'result' shows), then just do that. The min function does not cross rows; in SAS, crossing rows is difficult (or at least it is something you have to do intentionally). Of course, you do need to figure out how to deal with the year boundary; if I were you, I'd leave the dates as actual dates and use INTCK to determine the difference in weeks.
If you actually do want the minimum for the entire ID, here is the data step solution (not quite as neat as the SQL solution, but it works roughly the same way):
data want;
set have;
by id start;
retain _initial_start;
if first.id then _initial_start=start;
weeksincestart=listen-_initial_start; *or whatever you intended - this does not seem right;
drop _initial_start;
run;
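If you keep the dates as true SAS date values as suggested above, the calculation line could instead use INTCK (a sketch, assuming start and listen are actual dates):
weeksincestart=intck('week',_initial_start,listen);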
I'm very new to SAS, trying to learn everything I need for my analytical task. The task I have now is to create a flag for ongoing applications. I think it might be easier to show it in a table, just to illustrate my problem:
[Update 2017.10.27] data sample in code, big thanks to Richard :)
data sample;
input PeopleID ApplicationID Applied_date :yymmdd10. Decision_date :yymmdd10. Ongoing_flag_wanted;
format Applied_date Decision_date yymmdd10.;
datalines;
1 6 2017.10.1 2017.10.1 1
1 5 2017.10.1 2017.10.4 0
1 3 2017.9.28 2017.9.29 1
1 2 2017.9.26 2017.9.26 1
1 1 2017.9.25 2017.9.30 0
2 8 2017.10.7 2017.10.7 1
2 7 2017.10.2 . 0
3 4 2017.9.30 2017.10.3 0
run;
In the system, people apply for the service. When a person does that, he gets a PeopleID, which does not change when the person applies again. Each application also gets an ApplicationID, which is unique; later applications get larger ApplicationIDs. What I want is to create an ongoing flag for each application. Its purpose is to show whether, by the time a given application came in, the same person had an ongoing application (an application that had not yet received a decision). See some examples from the table above:
Person #2 has two applications, #8 and #7. By the time he submitted #8, #7 had not been decided, so #8 should get the ongoing flag.
Person #1 applied multiple times. Applications #3 and #2 have an ongoing application due to App #1. Applications #6 and #5 came in on the same date, but from the ApplicationIDs we can tell that #6 came in later than #5, and since #5 had not been decided by then, #6 gets the ongoing flag.
As you might notice, applications with a positive ongoing flag always receive a decision on the same date they came in. That is because applications with ongoing cases are automatically declined. However, I cannot use this as an indicator: there are many other reasons that trigger an automatic decline.
The Ongoing_flag is what I want to create in my dataset. I have tried sorting by (1) PeopleID, (2) descending ApplicationID, and (3) descending Applied_date, so my entire dataset looks like the small example table above. But I don't know how to make SAS compare, within the same PeopleID, values on different rows (ApplicationIDs) and in different columns (Applied_date against Decision_date). I want to compare, for each person, every application's Applied_date with all previous applications' Decision_dates, so I can tell whether there was an ongoing application in the system at the time each application came in.
I know I used too many words to explain my problem. For those who read through, thank you for reading! For those who have any idea on what might be a good approach, please leave your comments! Millions of thanks!
Min:
For problems of this type you want to mentally break the data structure into different parts.
BY GROUP
The variables whose unique combination defines the group. There are one or more rows in a group. Let's call them items.
GROUP DETAILS
Variables that are observational in nature. They may be numbers such as temperature, weight, or dollars, or characters or strings that represent some state being tracked. The details (at the level you are working) might themselves be aggregates of a deeper level of detail.
GOAL
Compute additional variables that further elucidate an aspect of the details over the group. For numeric details the goal might be statistical, such as MIN, MAX, MEAN, MEDIAN, RANGE, etc. Or it might be identificational, such as which ID had the highest dollar amount, which name was longest, or any other business rule.
Your specific problem is one of determining application activity on a given date. I think of it as a coverage type of problem because the dates in question cover a range. The BY GROUP is person and an 'activity' date.
Here is one data-centric approach. The original data is expanded to have one row per date from applied to decided. Then simple BY-group processing and the automatic first. variable are used to determine whether an application arrives while another is as yet undecided.
data have;
input PeopleID ApplicationID Applied_date :yymmdd10. Decision_date :yymmdd10. Ongoing_flag_wanted;
format Applied_date Decision_date yymmdd10.;
datalines;
1 6 2017.10.1 2017.10.1 1
1 5 2017.10.1 2017.10.4 0
1 3 2017.9.28 2017.9.29 1
1 2 2017.9.26 2017.9.26 1
1 1 2017.9.25 2017.9.30 0
2 8 2017.10.7 2017.10.7 1
2 7 2017.10.2 . 0
3 4 2017.9.30 2017.10.3 0
run;
data coverage;
/* Pass 1 (DOW loop): scan the whole BY group to find this person's latest decision date */
do _n_ = 1 by 1 until (last.PeopleID);
set have;
by PeopleID;
if Decision_date > Max_date then Max_date = Decision_date;
end;
put 'NOTE: ' PeopleID= Max_date= yymmdd10.;
/* Pass 2: re-read the same rows, outputting one row per day each application was open */
do _n_ = 1 to _n_;
set have;
do Activity_date = Applied_date to ifn(missing(Decision_date),Max_date,Decision_date);
if missing(Decision_date) then Decision_date = Max_date;
output;
end;
end;
keep PeopleID ApplicationID Applied_date Decision_date Activity_date;
format Activity_date yymmdd10.;
run;
proc sort data=coverage;
by PeopleID Activity_date ApplicationID ;
run;
data overlap;
set coverage;
by PeopleID Activity_date;
Ongoing_flag = not (first.Activity_date);
if Activity_date = Applied_date then
output;
run;
proc sort data=overlap;
by PeopleID descending ApplicationID ;
run;
Other approaches could involve arrays, hashes, or SQL. SQL is very different from DATA step code, and some consider it to be clearer.
proc sql;
create table want as
select
PeopleID, ApplicationID, Applied_date, Decision_date
, case
when exists (
select * from have as inner
where inner.PeopleID = outer.PeopleID
and inner.ApplicationID < outer.ApplicationID
and
case
when inner.Decision_date is null and outer.Decision_date is null then 1
when inner.Decision_date is null then 1
when outer.Decision_date is null then 0
else outer.Decision_date < inner.Decision_date
end
)
then 1
else 0
end as Ongoing_flag
from have as outer
;
quit;
I have hundreds of thousands of IDs in a large dataset.
Some records have the same ID but different data points. Some of these IDs need to be merged into a single ID. People registered for a system more than once should be just one person in the database.
I also have a separate file that tells me which IDs need to be merged, but it's not always a one-to-one relationship. For example, in many cases I have x->y and then y->z because they registered three times. I had a macro that essentially was the following set of if-then statements:
if ID='1111111' then do; ID='2222222'; end;
if ID='2222222' then do; ID='3333333'; end;
I believe SAS runs this one record at a time. My list of merged IDs is almost 15k long, so it takes forever to run, and the list just keeps getting longer. Is there a faster method of updating these IDs?
Thanks
EDIT: Here is an example of the situation, except the macro is over 15k lines long due to all the merges.
data one;
input ID $5. v1 $ v2 $;
cards;
11111 a b
11111 c d
22222 e f
33333 g h
44444 i j
55555 k l
66666 m n
66666 o p
;
run;
%macro ID_Change;
if ID='11111' then do; ID='77777'; end; *77777 is a brand new ID;
if ID='22222' then do; ID='88888'; end; *88888 is a new ID but is merged below;
if ID='88888' then do; ID='99999'; end; *99999 becomes the newer ID;
%mend;
data two; set one; %ID_Change; run;
A hash table will greatly speed up the process. Hash tables are one of the little-used, but highly effective, tools in SAS. They're a bit bizarre since the syntax is very different from standard SAS programming. For now, think of it as a way to merge data together in-memory (a big reason as to why it's so fast).
First, create a dataset that has the conversions that you need. We want to match up by ID, then convert it to New_ID. Consider ID as your key column, and New_ID as your data column.
dataset: translate
ID New_ID
111111 222222
222222 333333
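For completeness, here is a minimal step to create that lookup dataset (a sketch using the six-digit IDs shown above; in practice, ID must have the same type and length as in your main dataset):
data translate;
input ID :$6. New_ID :$6.;
datalines;
111111 222222
222222 333333
;
run;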
In a hash table, you need to consider two things:
The Key column(s)
The Data column(s)
The Data column is what will be replacing observations matched by the Key column. In other words, New_ID will be populated every time there's a match for ID.
Next, you'll want to do your hash merge. This is performed in the data step.
data want;
set have;
/* Only declare the hash object on the first iteration.
Otherwise it will do this every record. */
if(_N_ = 1) then do;
declare hash id_h(dataset: 'translate'); *Declare a hash object called 'id_h';
id_h.defineKey('ID'); *Define key for matching;
id_h.defineData('New_ID'); *The new ID after matching;
id_h.defineDone(); *Done declaring this hash object;
call missing(New_ID); *Prevents a warning in the log;
end;
/* If a customer has changed multiple times, keep iterating until
there is no longer a match between tables */
do while(id_h.Find() = 0);
_loop_count+1; *Tells us how long we've been in the loop;
/* Just in case the while loop gets to 500 iterations, then
there's likely a problem and you don't want the data step to get stuck */
if(_loop_count > 500) then do;
put 'WARNING: ' ID ' iterated 500 times. The loop will stop. Check observation ' _N_;
leave;
end;
/* If the ID of the hash table matches the ID of the dataset, then
we'll set ID to be New_ID from the hash object */
ID = New_ID;
end;
_loop_count = 0;
drop _loop_count;
run;
This should run very quickly and provide the desired output, assuming that your lookup table is coded in the way that you need it to be.
Use PROC SQL or a MERGE step against your separate file (after you have created a SAS dataset from it, using INFILE or PROC IMPORT) to append the unique ID to all records. If your separate file contains only the duplicates, you will need to create a dummy unique ID for the non-duplicates.
Do a PROC SORT BY the unique ID and the timestamp of signup.
Use a DATA step with the same BY variables. Depending on whether you want to keep the first or last signup, do if first.timestamp then output; (or last., etc.)
Or you could do it all in one PROC SQL using a left join to the separate file, a COALESCE to return a dummy unique ID when it is not in the separate file, a GROUP BY on the unique ID, and a HAVING max(timestamp) (or min). You can also COALESCE any other variables you might want to preserve across signups; for example, if the first signup contained a phone number and successive signups were missing that data point. A sketch of this single-query approach follows below.
Without a reproducible example it's hard to be more specific.
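Still, a minimal sketch of the single-query approach, under assumed names (SIGNUPS is the main table with columns id and signup_ts; XWALK is the separate file with columns old_id and unique_id):
proc sql;
create table deduped as
select coalesce(x.unique_id, s.id) as final_id, s.*
from signups as s
left join xwalk as x on s.id = x.old_id
group by calculated final_id
having signup_ts = max(signup_ts); /* keep the latest signup per final_id */
quit;
Note that chained mappings (x->y and then y->z) would still need to be resolved in the crosswalk first, as discussed above.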
%let months_back = %sysget(months_back);
data;
m = intnx('month', "&sysdate9"d, -&months_back - 2, 'begin');
m = intnx('day', put(m, date9.), 26, 'same');
m2back = put(m, yymmddd10.);
put m2back;
run;
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
5:19
NOTE: Invalid numeric data, '01OCT2012' , at line 5 column 19.
I really don't know why this goes wrong. Isn't the date string numeric data?
PUT(m, date9.) is the culprit here. The second argument of INTNX needs to be numeric (i.e. a date); the PUT function always returns a character value, in this instance '01OCT2012'. Just take out the PUT function completely and the code will work:
m = intnx('day', m, 26, 'same');
SAS stores dates as numbers, and in fact does not have a truly separate type for them. A SAS date is the number of days since 1/1/1960, so a bit over 19000 for today. The date format is entirely irrelevant to any date calculations; it is solely for human readability.
The bit where you say:
"&sysdate9"d
actually converts the string "01JAN2012" to a numeric value (18993).
There's actually a quicker way to accomplish what you're trying to do. Because days correspond to whole numbers in SAS, to increment by one day you can simply add one to the value.
For example:
%let months_back=5;
data _null_;
m = intnx('month', today(), -&months_back - 2, 'begin');
m2 = intnx('day', m, 26, 'same');
m3 = intnx('month',"&sysdate9"d, -&months_back - 2)+26;
m2back = put(m2, yymmdd10.);
put m= date9. m2= yymmdd10. m3= yymmdd10.;
run;
M3 does your entire calculation in one step, by using the MONTH interval and then adding 26. INTNX('day', ...) is basically pointless unless there's some other value to using the function (a shift index, for example).
You can also see the use of a format in the PUT statement (which writes to the log) here: you don't have to PUT the value into a character variable and then write that to the log to get the formatted value; just use put (var) (format.); and string together as many variable/format pairs as you want that way.
Also, "&sysdate9."d is not the best way to get the current date. &sysdate. is only defined on startup of SAS, so if your session ran for 3 days you would not be on the current day (though perhaps that's desired?). Instead, the TODAY() function gets the current date, up to date no matter how long your SAS session has been running.
Finally, I recommend data _null_; if you don't want a dataset (and naming the result dataset if you do want it). data _null_; does not create a dataset, whereas data; creates an ever-growing series of datasets (DATA1, DATA2, ...) that quickly fill up your workspace and make it hard to tell what you're doing.
I've tried googling and I haven't turned up any luck to my current problem. Perhaps someone can help?
I have a dataset with the following variables:
ID, AccidentDate
It's in long format, and each participant can have more than 1 accident, with participants having not necessarily an equal number of accidents. Here is a sample:
Code:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
What I need to do is count the number of days between each individual's first and last recorded accident dates. I've been playing around with the first.byvariable and last.byvariable constructs, but I'm just not making any progress. Any tips, or any links to a source?
Thank you,
Also. I posted this originally over at Talkstats.com (cross-posting etiquette)
Not sure what you mean by long format. Long format should look like this:
id accident date
1 1 1JAN2001
1 2 1JAN2002
2 1 1JAN2001
2 2 1JAN2003
Then you can try proc sql like this
Proc Sql;
select id, max(date) - min(date) as days_between
from table
group by id;
quit;
By long format I think you mean it is a "stacked" dataset with each person having multiple observations (instead of one row per person with multiple columns). In your situation, it is probably the correct way to have the data stored.
To do it with data steps, I think you are on the right track with first. and last.
I would do it like this:
proc sort data=accidents;
by id date;
run;
data accidents; set accidents;
by id date; *this is important - it makes first. and last. available for use;
retain first last;
if first.id then first=date;
if last.id then last=date;
run;
Now you have a dataset with ID, Date, Date of First Accident, Date of Last Accident
You could calculate the time between with
data accidents; set accidents;
timebetween = last-first;
run;
You can't do this directly in the same data step, since the "last" variable won't be accurate until the step has read the final line of each BY group; before that, the value would be wrong for anything but the last accident observation.
Assuming the data looks like:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
You have the right idea. Retain the first accident date in order to have access to both the first and last dates. Then calculate the difference.
proc sort data=accidents;
by id accidentdate;
run;
data accidents;
set accidents;
by id;
retain first_accidentdate;
if first.id then first_accidentdate = accidentdate;
if last.id then do;
daysbetween = accidentdate - first_accidentdate;
output;
end;
run;
To my disappointment, the following code, which sums up 'value' by week from 'master' for the weeks that appear in 'transaction', does not work:
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
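As a rough sketch of that MODIFY variant (an illustration of intent with a hypothetical in-place adjustment, since the original question only sums values):
data master;
set transaction;
/* walk through every master row whose week matches this transaction row;
_IORC_ becomes nonzero once no more matches remain */
do until(_iorc_ ne 0);
modify master key = week;
if _iorc_ = 0 then do;
value = value + 100; /* hypothetical in-place change */
replace;
end;
else _error_ = 0;
end;
run;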
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for every line of "transaction".
(The reason you got the error is that you used where instead of if. In a data step, the subsetting WHERE is only aware of columns that already exist in the dataset being read. Both statements are kept as options because WHERE is faster when it's usable.)
An alternative solution would be to use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't suggest the solution below (I'd prefer Jeff's SQL solution, or even a hash). But just for playing with data step logic, I think the approach below would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so it makes only one pass through each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
data _null_;
set transaction;
by change_week;

do until(last.week and _found);
set master;
by week;

if week=change_week then do;
sum = sum(value, sum);
_found=1;
end;
end;

*file print;
put week= sum= ;
run;
week=1 sum=60
week=3 sum=75