SAS and Date operations - sas

I've tried googling and I haven't turned up any luck to my current problem. Perhaps someone can help?
I have a dataset with the following variables:
ID, AccidentDate
It's in long format, and each participant can have more than 1 accident, with participants having not necessarily an equal number of accidents. Here is a sample:
Code:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
What I need to do is count the number of days between each individuals First and Last recorded accident date. I've been playing around with first.byvariable and last.byvariable commands, but I'm just not making any progress. Any tips? or Any links to a source?
Thank you,
Also. I posted this originally over at Talkstats.com (cross-posting etiquette)

Not sure what you mean by in long format
long format should be like this
id accident date
1 1 1JAN2001
1 2 1JAN2002
2 1 1JAN2001
2 2 1JAN2003
Then you can try proc sql like this
Proc Sql;
select id, max(date)-min(date) from table;
group by id;
run;

By long format I think you mean it is a "stacked" dataset with each person having multiple observations (instead of one row per person with multiple columns). In your situation, it is probably the correct way to have the data stored.
To do it with data steps, I think you are on the right track with first. and last.
I would do it like this:
proc sort data=accidents;
by id date;
run;
data accidents; set accidents;
by id accident; *this is important-it makes first. and last. available for use;
retain first last;
if first.date then first=date;
if last.date then last=date;
run;
Now you have a dataset with ID, Date, Date of First Accident, Date of Last Accident
You could calculate the time between with
data accidents; set accidents;
timebetween = last-first;
run;
You can't do this directly in the same data step since the "last" variable won't be accurate until it has parsed the last line and as such the data will be wrong for anything but the last accident observation.

Assuming the data looks like:
ID AccidentDate
1 1JAN2001
2 4MAY2001
2 16MAY2001
3 15JUN2002
3 19JUN2002
3 05DEC2002
4 04JAN2003
You have the right idea. Retain the first accident date in order to have access to both the first and last dates. Then calculate the difference.
proc sort data=accidents;
by id accidentdate
run;
data accidents;
set accidents;
by id;
retain first_accidentdate;
if first.id then first_accidentdate = accidentdate;
if last.id then do;
daysbetween = date - first_accidentdate
output;
end;
run;

Related

Collapsing a large dataset while conditionally preserving some missing values

Dataset HAVE includes id values and a character variable of names. Values in names are usually missing. If names is missing for all values of an id EXCEPT one, the obs for IDs with missing values in names can be deleted. If names is completely missing for all id of a certain value (like id = 2 or 5 below), one record for this id value must be preserved.
In other words, I need to turn HAVE:
id names
1
1
1 Matt, Lisa, Dan
1
2
2
2
3
3
3 Emily, Nate
3
4
4
4 Bob
5
into WANT:
id names
1 Matt, Lisa, Dan
2
3 Emily, Nate
4 Bob
5
I currently do this by deleting all records where names is missing, then merging the results onto a new dataset KEY with one variable id that contains all original values (1, 2, 3, 4, and 5):
data WANT_pre;
set HAVE;
if names = " " then delete;
run;
data WANT;
merge KEY
WANT_pre;
by id;
run;
This is ideal for HAVE because I know that id is a set of numeric values ranging from 1 to 5. But I am less sure how I could do this efficiently (A) on a much larger file, and (B) if if I couldn't simply create an id KEY dataset by counting from 1 to n. If your HAVE had a few million observations and your id values were more complex (e.g., hexadecimal values like XR4GN), how would you produce WANT?
You can use SQL here easily, MAX() applies to character variables within SQL.
proc sql;
create table want as
select id, max(names) as names
from have
group by ID;
quit;
Another option is to use an UPDATE statement instead.
data want;
update have (obs=0) have;
by ID;
run;
This seems like a good candidate for a DOW-loop, assuming that your dataset is sorted by id:
data want;
do until(last.id);
set have;
by id;
length t_names $50; /*Set this to at least the same length as names unless you want the default length of 200 from coalescec*/
t_names = coalescec(t_names,names);
end;
names = t_names;
drop t_names;
run;
proc summary data=have nway missing;
class id;
output out=want(drop=_:) idgroup(max(names) out(names)=);
run;
Use the UPDATE statement. That will ignore the missing values and keep the last non-missing value. It normally requires a master and transaction dataset, but you can use your single dataset for both.
data want;
update have(obs=0) have ;
by id;
run;

How can I pull certain records and update data in SAS?

I have a table of customer purchases. The goal is to be able to pull summary statistics on the last 20 purchases for each customer and update them as each new order comes in. What is the best way to do this? Do I need to a table for each customer? Keep in mind there are over 500 customers. Thanks.
This is asked at a high level, so I'll answer it at that level. If you want more detailed help, you'll want to give more detailed information, and make an attempt to solve the problem yourself.
In SAS, you have the BY statement available in every PROC or DATA step, as well as the CLASS statement, available in most PROCs. These both are useful for doing data analysis at a level below global. For many basic uses they give a similar result, although not in all cases; look up the particular PROC you're using to do your analysis for more detailed information.
Presumably, you'd create one table containing your most twenty recent records per customer, or even one view (a view is like a table, except it's not written to disk), and then run your analysis PROC BY your customer ID variable. If you set it up as a view, you don't even have to rerun that part - you can create a permanent view pointing to your constantly updating data, and the subsetting to last 20 records will happen every time you run the analysis PROC.
Yes, You can either add a Rank to your existing table or create another table containing the last 20 purchases for each customer.
My recommendation is to use a datasetp to select the top20 purchasers per customer then do your summary statistics. My Code below will create a table called "WANT" with the top 20 and a rank field.
Sample Data:
data have;
input id $ purchase_date amount;
informat purchase_date datetime19.;
format purchase_date datetime19.;
datalines;
cust01 21dec2017:12:12:30 234.57
cust01 23dec2017:12:12:30 2.88
cust01 24dec2017:12:12:30 4.99
cust02 21nov2017:12:12:30 34.5
cust02 23nov2017:12:12:30 12.6
cust02 24nov2017:12:12:30 14.01
;
run;
Sort Data in Descending order by ID and Date:
proc sort data=have ;
by id descending purchase_date ;
run;
Select Top 2: Change my 2 to 20 in your case
/*Top 2*/
%let top=2;
data want (where=(Rank ne .));
set have;
by id;
retain i;
/*reset counter for top */
if first.id then do; i=1; end;
if i <= &top then do; Rank= &top+1-i; output; i=i+1;end;
drop i;
run;
Output: Last 2 Customer Purchases:
id=cust01 purchase_date=24DEC2017:12:12:30 amount=4.99 Rank=2
id=cust01 purchase_date=23DEC2017:12:12:30 amount=2.88 Rank=1
id=cust02 purchase_date=24NOV2017:12:12:30 amount=14.01 Rank=2
id=cust02 purchase_date=23NOV2017:12:12:30 amount=12.6 Rank=1

SAS Retain statement - how to retain previous non missing values into new column for comparison

I am simply looking to create a new column retaining the value of a specific column within the previous record, so that I can compare the existing column against the new column. Further down the line I want to be able to output records whereby the values in both columns are different and lose the records where the values are the same
Essentially I want my dataset to look like this, where the first idr has the retained date set to null:
Idr Date1 Date2
1 20/01/2016 .
1 20/01/2016 20/01/2016
1 18/10/2016 20/01/2016
2 07/03/2016 .
2 18/05/2016 07/03/2016
2 21/10/2016 18/05/2016
3 29/01/2016 .
3 04/02/2016 29/01/2016
3 04/02/2016 04/02/2016
I have used code along the following lines in the past whereby I have created a temporary variable referencing the data I want to retain:
date_temp=date1;
data example2;
set example1;
by idr date1;
date_temp=date1;
retain date_temp ;
if first.idr then do;
date_temp=date1;
end;else do;
date2=date_temp;
end;
run;
I have searched the highs and lows of google - Any help would be greatly appreciated
The trick here is to set the value of the retained variable ready for the next row after you've already output the current row, rather than relying on the default implicit output at the end of the data step:
data example2;
set example1;
by idr;
retain date2;
if first.idr then call missing(date2);
output;
date2 = date1;
format date2 ddmmyy10.;
run;
Logic that executes after the output statement doesn't make any difference to the row that's just been output, but until the data step proceeds to the next iteration, everything is still in the PDV, including variables that are being retained across to the next row, so this is an opportunity to update them. More details on the PDV.
Another way of doing this, using the lag function:
data example3;
set example1;
date2 = lag(date1);
if idr ne lag(idr) then call missing(date2);
run;
Be careful when using lag - it returns the value from the last time that instance of the lag function was executed during your data step, not necessarily the value of that variable from the previous row, so weird things tend to happen if you do something like if condition then mylaggedvar=lag(var);
To achieve your final outcome (remove records where the idr and date are the same as the previous row), you can easily achieve this without creating the extra column. Providing the data is sorted by idr and date1, then just use first.date1 to keep the required records.
data have;
input Idr Date1 :ddmmyy10.;
format date1 ddmmyy10.;
datalines;
1 20/01/2016
1 20/01/2016
1 18/10/2016
2 07/03/2016
2 18/05/2016
2 21/10/2016
3 29/01/2016
3 04/02/2016
3 04/02/2016
;
run;
data want;
set have;
by idr date1;
if first.date1 then output;
run;

SAS Create missing numeric ids into individual observations.

I need to outline a series of ID numbers that are currently available based on a data set in which ID's are already assigned (if the ID is on the file then its in use...if its not on file, then its available for use).
The issue is I don't know how to create a data set that displays ID numbers which are between two ID #'s that are currently on file - Lets say I have the data set below -
data have;
input id;
datalines;
1
5
6
10
;
run;
What I need is for the new data set to be in the following structure of this data set -
data need;
input id;
datalines;
2
3
4
7
8
9
;
run;
I am not sure how I would produce the observations of ID #'s 2, 3 and 4 as these would be scenarios of "available ID's"...
My initial attempt was going to be subtracting the ID values from one observation to the next in order to find the difference, but I am stuck from there on how to use that value and add 1 to the observation before it...and it all became quite messy from there.
Any assistance would be appreciated.
As long as your set of possible IDs is know, this can be done by putting them all in a file and excluding the used ones.
e.g.
data id_set;
do id = 1 to 10;
output;
end;
run;
proc sql;
create table need as
select id
from id_set
where id not in (select id from have)
;
quit;
Create a temporary variable that stores the previous id, then just loop between that and the current id, outputting each iteration.
data have;
input id;
datalines;
1
5
6
10
;
run;
data need (rename=(newid=id));
set have;
retain _lastid; /* keep previous id value */
if _n_>1 then do newid=_lastid+1 to id-1; /* fill in numbers between previous and current ids */
output;
end;
_lastid=id;
keep newid;
run;
Building on Jetzler's answer: Another option is to use the MERGE statement. In this case:
note: before merge, sort both datasets by id (if not already sorted);
data want;
merge id_set (in=a)
have (in=b); /*specify datasets and vars to allow the conditional below*/
by id; /*merge key variable*/
if a and not b; /*on output keep only records in ID_SET that are not in HAVE*/
run;

Contingency table in SAS

I have data on exam results for 2 years for a number of students. I have a column with the year, the students name and the mark. Some students don't appear in year 2 because they don't sit any exams in the second year. I want to show whether the performance of students persists or whether there's any pattern in their subsequent performance. I can split the data into two halves of equal size to account for the 'first-half' and 'second-half' marks. I can also split the first half into quintiles according to the exam results using 'proc rank'
I know the output I want is a 5 X 5 table that has the original 5 quintiles on one axis and the 5 subsequent quintiles plus a 'dropped out' category as well, so a 5 x 6 matrix. There will obviously be around 20% of the total number of students in each quintile in the first exam, and if there's no relationship there should be 16.67% in each of the 6 susequent categories. But I don't know how to proceed to show whether this is the case of not with this data.
How can I go about doing this in SAS, please? Could someone point me towards a good tutorial that would show how to set this up? I've been searching for terms like 'performance persistence' etc, but to no avail. . .
I've been proceeding like this to set up my dataset. I've added a column with 0 or 1 for the first or second half of the data using the first procedure below. I've also added a column with the quintile rank in terms of marks for all the students. But I think I've gone about this the wrong way. Shoudn't I be dividing the data into quintiles in each half, rather than across the whole two periods?
Proc rank groups=2;
var yearquarter;
ranks ExamRank;
run;
Proc rank groups=5;
var percentageResult;
ranks PerformanceRank;
run;
Thanks in advance.
Why are you dividing the data into quintiles?
I would leave the scores as they are, then make a scatterplot with
PROC SGPLOT data = dataset;
x = year1;
y = year2;
loess x = year1 y = year2;
run;
Here's a fairly basic example of the simple tabulation. I transpose your quintile data and then make a table. Here there is basically no relationship, except that I only allow a 5% DNF so you have more like 19% 19% 19% 19% 19% 5%.
data have;
do i = 1 to 10000;
do year = 1 to 2;
if year=2 and ranuni(7) < 0.05 then call missing(quintile);
else quintile = ceil(5*ranuni(7));
output;
end;
end;
run;
proc transpose data=have prefix=year out=have_t;
by i;
var quintile;
id year;
run;
proc tabulate data=have_t missing;
class year1 year2;
tables year1,year2*rowpctn;
run;
PROC CORRESP might be helpful for the analysis, though it doesn't look like it exactly does what you want.
proc corresp data=have_t outc=want outf=want2 missing;
tables year1,year2;
run;