I'm very new to SAS, trying to learn everything I need for my analytical task. The task I have now is to create a flag for the ongoing application. I think it might be easier to show it in a table, just to illustrate my problem:enter image description here
[Update 2017.10.27] data sample in code, big thanks to Richard :)
data sample;
input PeopleID ApplicationID Applied_date yymmdd10. Decision_date yymmdd10. Ongoing_flag_wanted;
format Applied_date Decision_date yymmdd10.;
datalines;
1 6 2017.10.1 2017.10.1 1
1 5 2017.10.1 2017.10.4 0
1 3 2017.9.28 2017.9.29 1
1 2 2017.9.26 2017.9.26 1
1 1 2017.9.25 2017.9.30 0
2 8 2017.10.7 2017.10.7 1
2 7 2017.10.2 . 0
3 4 2017.9.30 2017.10.3 0
run;
In the system, people apply for the service. When a person does that, he gets a PeopleID, which does not change when the person applies again. And also each application gets an applicationID, which is unique and later applications have larger applicationID. What I want is to create an Ongoing flag for each application. The propose is to show that: by the time this application came in, the same person has or does not have an ongoing application (application which has not received a decision). See some examples from the table above:
Person#2 has two applications #8 and #7, by the time he applied #8, #7 has not been decided, therefore #8 should get ongoing flag.
Person#1 applied multiple times. Application #3 and #2 have ongoing application due to App#1. Application #6 and #5 came in at the same date, but according to application ID, we can tell that #6 came in later than #5, and as #5 have not been decided by then, #6 gets ongoing flag.
As you might notice, application with a positive ongoing flag always receives decisions on the same date as it came in. That is because applications with ongoing cases are automatically declined. However, I cannot use this as an indicator: there are many other reasons that trigger an automatic decline.
The ongoing_flag is what I want to create in my dataset. I have tried to sort by 1.peopleID, 2.descending applicationID, 3. descending applied_date, so my entire dataset looks like the small example table above. But then I don't know how to make SAS compare within the same variable (peopleID) but different lines (applicationID) and columns (compare Applied_date with Decision_date). I want to compare, for each person, every application's applied_date with all the previous applications' decision_date, such that I can tell by the time this application came in, whether or not there is an ongoing application from previously in the system.
I know I used too many words to explain my problem. For those who read through, thank you for reading! For those who have any idea on what might be a good approach, please leave your comments! Millions of thanks!
Min:
For problems of this type you want to mentally break the data structure into different parts.
BY GROUP
The variables whose unique combination defines the group. There are one or more rows in a group. Let's call them items.
GROUP DETAILS
Variables that are observational in nature. They may be numbers such as temperature, weight or dollars, or, characters or strings that represent some state being tracked. The details (at the state you are working) themselves might be aggregates for a deeper level of detail.
GOAL
Compute additional variables that further elucidate an aspect of the details over the group. For numeric the goal might be statistical such as MIN, MAX, MEAN, MEDIAN, RANGE, etc. Or it might be identificational such as which ID had
highest $, or which name was longest, or any other business rule.
Your specific problem is one of determining claim activity on a given date. I think of it as a coverage type of problem because the dates in question cover a range. The BY GROUP is person and an 'Activity' date.
Here is one data-centric approach. The original data is expanded to have one row per date from applied to decided. Then simple BY group processing and the automatic first. are used to determine if an application is during one as yet undecided.
data have;
input PeopleID ApplicationID Applied_date yymmdd10. Decision_date yymmdd10. Ongoing_flag_wanted;
format Applied_date Decision_date yymmdd10.;
datalines;
1 6 2017.10.1 2017.10.1 1
1 5 2017.10.1 2017.10.4 0
1 3 2017.9.28 2017.9.29 1
1 2 2017.9.26 2017.9.26 1
1 1 2017.9.25 2017.9.30 0
2 8 2017.10.7 2017.10.7 1
2 7 2017.10.2 . 0
3 4 2017.9.30 2017.10.3 0
run;
data coverage;
do _n_ = 1 by 1 until (last.PeopleID);
set have;
by PeopleID;
if Decision_date > Max_date then Max_date = Decision_date;
end;
put 'NOTE: ' PeopleID= Max_date= yymmdd10.;
do _n_ = 1 to _n_;
set have;
do Activity_date = Applied_date to ifn(missing(Decision_date),Max_date,Decision_date);
if missing(Decision_date) then Decision_date = Max_date;
output;
end;
end;
keep PeopleID ApplicationID Applied_date Decision_date Activity_date;
format Activity_date yymmdd10.;
run;
proc sort data=coverage;
by PeopleID Activity_date ApplicationID ;
run;
data overlap;
set coverage;
by PeopleID Activity_date;
Ongoing_flag = not (first.Activity_date);
if Activity_date = Applied_date then
output;
run;
proc sort data=overlap;
by PeopleID descending ApplicationID ;
run;
Other approaches could involve arrays, hashes, or SQL. SQL is very different from DATA Step code and some consider it to be more clear.
proc sql;
create table want as
select
PeopleID, ApplicationID, Applied_date, Decision_date
, case
when exists (
select * from have as inner
where inner.PeopleID = outer.PeopleID
and inner.ApplicationID < outer.ApplicationID
and
case
when inner.Decision_date is null and outer.Decision_date is null then 1
when inner.Decision_date is null then 1
when outer.Decision_date is null then 0
else outer.Decision_date < inner.Decision_date
end
)
then 1
else 0
end as Ongoing_flag
from have as outer
;
Related
I want converted my data from long to wide format using data step. The problem is that due to missing values the values are not placed in the correct cells. I think to solve the problem I have to include placeholder for missing values.
The problem is I don't know how to do. Can someone please give me tip on how to go about it.
data tic;
input id country$ month math;
datalines;
1 uk 1 10
1 uk 2 15
1 uk 3 24
2 us 2 15
2 us 4 12
3 fl 1 15
3 fl 2 16
3 fl 3 17
3 fl 4 15
;
run;
proc sort data=tic;
by id;
run;
data tot(drop=month math);
retain month1-month4 math1-math4;
array tat{4} month1-month4;
array kat{4} math1-math4;
set tic;
by id;
if first.id then do;
i=1;
do j=1 to 4;
tat{j}=.;
kat{j}=.;
end;
end;
tat(i)=month;
kat(i)=math;
if last.id then output;
i+1;
run;
Edit
I finally figured out what the problem is:
changed this lines of code
tat(i)=month;
kat(i)=math;
to:
tat(month)=month;
kat(month)=math;
and it fixed the problem.
Data transformations from tall and skinny to short and wide often mean that categorical data ends up as column names. This is a process of moving data to metadata, which can be a problem later on for dealing with BY or CLASS groups.
SAS has Proc TABULATE and Proc REPORT for creating pivoted output. Proc TRANSPOSE is also a good standard way of creating pivoted data.
I did notice that you are pivoting two columns at once. TRANSPOSE can't multi-pivot. The DATA Step approach you showed is a typical way for doing a transpose transform when the indices lie within known ranges. In your case the array declaration must be such that 'direct-addressing' via index can to handle the minimal and maximal month values that occur over all the data.
I have three different questions about modifying a dataset in SAS. My data contains: the day and the specific number belonging to the tag which was registred by an antenna on a specific day.
I have three separate questions:
1) The tag numbers are continuous and range from 1 to 560. Can I easily add numbers within this range which have not been registred on a specific day. So, if 160-280 is not registered for 23-May and 40-190 for 24-May to add these non-registered numbers only for that specific day? (The non registered numbers are much more scattered and for a dataset encompassing a few weeks to much to do by hand).
2) Furthermore, I want to make a new variable saying a tag has been registered (1) or not (0). Would it work to make this variable and set it to 1, then add the missing variables and (assuming the new variable is not set for the new number) set the missing values to 0.
3) the last question would be in regard to the format of the registered numbers which is along the line of 528 000000000400 and 000 000000000054. I am only interested in the last three digits of the number and want to remove the others. If I could add the missing numbers I could make a new variable after the data has been sorted by date and the original transponder code but otherwise what would you suggest?
I would love some suggestions and thank you in advance.
I am inventing some data here, I hope I got your questions right.
data chickens;
do tag=1 to 560;
output;
end;
run;
data registered;
input date mmddyy8. antenna tag;
format date date7.;
datalines;
01012014 1 1
01012014 1 2
01012014 1 6
01012014 1 8
01022014 1 1
01022014 1 2
01022014 1 7
01022014 1 9
01012014 2 2
01012014 2 3
01012014 2 4
01012014 2 7
01022014 2 4
01022014 2 5
01022014 2 8
01022014 2 9
;
run;
proc sql;
create table dates as
select distinct date, antenna
from registered;
create table DatesChickens as
select date, antenna, tag
from dates, chickens
order by date, antenna, tag;
quit;
proc sort data=registered;
by date antenna tag;
run;
data registered;
merge registered(in=INR) DatesChickens;
by date antenna tag;
Registered=INR;
run;
data registeredNumbers;
input Numbers $16.;
datalines;
528 000000000400
000 000000000054
;
run;
data registeredNumbers;
set registeredNumbers;
NewNumbers=substr(Numbers,14);
run;
I do not know SAS, but here is how I would do it in SQL - may give you an idea of how to start.
1 - Birds that have not registered through pophole that day
SELECT b.BirdId
FROM Birds b
WHERE NOT EXISTS
(SELECT 1 FROM Pophole_Visits p WHERE b.BirdId = p.BirdId AND p.date = ????)
2 - Birds registered through pophole
If you have a dataset with pophole data you can query that to find if a bird has been through. What would you flag be doing - finding a bird that has never been through any popholes? Looking for dodgy sensor tags or dead birds?
3 - Data code
You might have more joy with the SUBSTRING function
Good luck
I have my data sorted by ID and date. I have converted the date to a single numeric which has ordering (the year followed by the week in the year). i want to make a new variable that is a function of the minimum value in the finest partition. and example follows
ID Start listen
1 201134 201138
1 201204 201150
2 200905 200910
2 201005 201020
I want something like
ID Start listen weekSincestart
1 201134 201138 4
1 201204 201150 54
2 200905 200910 5
2 201005 201020 15
all im doing is taking (listen-min(start)) but i am assuming min() is taking the minimum start for a given ID. So, i am asking if there is a "by statement" for the min function
In my opinion there is no need to convert your start and listen values from dates using the method you have.
I converted your data back into dates using INTNX using the first day of the year in START and LISTEN variables and increment by week in the same variables. The dates might not be exactly what you have on your dataset, however it should result in something similar.
The following should do what you want, if I understand you correctly.
DATA WANT2;
SET HAVE;
BY ID START;
RETAIN _START;
FORMAT _START DATE9.;
IF FIRST.ID THEN _START = START;
WEEKSINCESTART = INTCK("WEEK",_START,LISTEN);
RUN;
In this instance your sample is sorted, however if you wish to conduct by statement processing to identify the first instance of a value in ID you will need to sort your dataset first. The retain statement will hold a value and by using the by statement we can specify when the value in the retained variable is altered. In this instance we want to alter the _START variablee when the first instance of an ID is encountered. I use the underscore prefix because it makes it easier to drop these variables en masse if necessary. This value will not be replaced until the next instance of an ID, which means it will be the value of subsequent observations for ID 1 and so on. The INTCK function measures the number of intervals, in this example the number of WEEKS, between period one and period two, in this instance between the first instance of START for each ID captured in _START and LISTEN for each observation.
The end result is:
ID START LISTEN _START WEEKSINCESTART
1 21AUG2011 18SEP2011 21AUG2011 4
1 29JAN2012 11DEC2011 21AUG2011 16
2 01FEB2009 08MAR2009 01FEB2009 5
2 31JAN2010 16MAY2010 01FEB2009 67
I hope this is useful.
Regards,
Scott
You could easily do it with proc sql:
proc sql;
create table RESULT as
select *, listen-min(start) as weekSincestart
from INPUT
group by id;
quit;
It will take the min of each id group to calculate min(start).
And since you select variables which are not in the group by, nor have an aggregation function, it will not aggregate multiple rows into one on the group by.
Your question is a bit confusing. If you want just listen minus start (what your 'result' is), then just do that. The min function does not cross rows; in SAS it's difficult to cross rows (or at least it's something you would have to do intentionally). Of course you do need to figure out how to deal with the year barrier; if I were you I'd leave the dates as actual dates and use INTCK to determine the difference in weeks.
If you actually do want the minimum for the entire ID, the data step solution (not quite as neat as the SQL solution but works roughly the same):
data want;
set have;
by id start;
retain _initial_start;
if first.id then _initial_start=start;
weeksincestart=listen-_initial_start; *or whatever you intended - this does not seem right;
drop _initial_start;
run;
I've tried googling and I haven't turned up any luck to my current problem. Perhaps someone can help?
I have a dataset with the following variables:
ID, AccidentDate
It's in long format, and each participant can have more than 1 accident, with participants having not necessarily an equal number of accidents. Here is a sample:
Code:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
What I need to do is count the number of days between each individuals First and Last recorded accident date. I've been playing around with first.byvariable and last.byvariable commands, but I'm just not making any progress. Any tips? or Any links to a source?
Thank you,
Also. I posted this originally over at Talkstats.com (cross-posting etiquette)
Not sure what you mean by in long format
long format should be like this
id accident date
1 1 1JAN2001
1 2 1JAN2002
2 1 1JAN2001
2 2 1JAN2003
Then you can try proc sql like this
Proc Sql;
select id, max(date)-min(date) from table;
group by id;
run;
By long format I think you mean it is a "stacked" dataset with each person having multiple observations (instead of one row per person with multiple columns). In your situation, it is probably the correct way to have the data stored.
To do it with data steps, I think you are on the right track with first. and last.
I would do it like this:
proc sort data=accidents;
by id date;
run;
data accidents; set accidents;
by id accident; *this is important-it makes first. and last. available for use;
retain first last;
if first.date then first=date;
if last.date then last=date;
run;
Now you have a dataset with ID, Date, Date of First Accident, Date of Last Accident
You could calculate the time between with
data accidents; set accidents;
timebetween = last-first;
run;
You can't do this directly in the same data step since the "last" variable won't be accurate until it has parsed the last line and as such the data will be wrong for anything but the last accident observation.
Assuming the data looks like:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
You have the right idea. Retain the first accident date in order to have access to both the first and last dates. Then calculate the difference.
proc sort data=accidents;
by id accidentdate
run;
data accidents;
set accidents;
by id;
retain first_accidentdate;
if first.id then first_accidentdate = accidentdate;
if last.id then do;
daysbetween = date - first_accidentdate
output;
end;
run;
To my disappointment, the following code, which sums up 'value' by week from 'master' for weeks which appear in 'transaction' does not work -
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for ever line of "transaction".
(The reason you got the error was because you used where instead of if. In SAS, the sub-setting where in a data step is only aware of columns already existing within the data set it's sub-setting. They keep two options because there where is faster when it's usable.)
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't suggest below solution (would like #Jeff's SQL solution or even a hash better). But just for playing with data step logic, I think below approach would work, if you trust that every key in transaction will exist in master. It relies on the fact that both datasets are sorted, so only makes one pass of each dataset.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
1003 data _null_;
1004 set transaction;
1005 by change_week;
1006
1007 do until(last.week and _found);
1008 set master;
1009 by week;
1010
1011 if week=change_week then do;
1012 sum = sum(value, sum);
1013 _found=1;
1014 end;
1015 end;
1016
1017 *file print;
1018 put week= sum= ;
1019 run;
week=1 sum=60
week=3 sum=75