Good day, I am looking through an archive of policies and want to create a variable (column) that shows the price of the policy from 1 year ago.
Every policy has a Policy ID, and the archive has every policy (including renewals). So the same Policy ID can appear more than once in the archive but have different values in every other column. For example, say I have this
Policy_ID Start_Date End_Date Premium LYPremium15 LYPremium16
1 01/01/2015 31/12/2015 500 . .
2 04/03/2015 03/03/2016 450 . .
3 03/02/2015 02/02/2016 600 . .
4 07/04/2015 06/04/2016 470 . .
5 01/01/2015 31/12/2015 500 . .
2 04/03/2016 03/03/2017 510 . .
I would like to fill the columns LYPremium15, LYPremium16, LYPremium17 with the premium from the year before. So it will look like this,
Policy_ID Start_Date End_Date Premium LYPremium15 LYPremium16
1 01/01/2015 31/12/2015 500 . .
2 04/03/2015 03/03/2016 450 . .
3 03/02/2015 02/02/2016 600 . .
4 07/04/2015 06/04/2016 470 . .
5 01/01/2015 31/12/2015 500 . .
2 04/03/2016 03/03/2017 510 450 .
This is because Policy ID 2 is a renewal, so it does have data from last year.
I am new to SAS, and not sure how I can code this. I was thinking of using where combined with if and contains but I am not sure that is an option.
Can I use the standard way of creating a variable?
data mylib.van_LYprem;
set mylib.van_combined_total;
LYPrem15=...;
run;
Or will I have to approach this in a more advanced way?
SAS processes your dataset record by record, so you have to carry the previous year's values forward yourself.
I assume the start date is what determines the year.
If we sort the dataset like this:
proc sort data=work.van_combined_total;
by Policy_ID start_date;
run;
We can then use a BY statement and RETAIN the values:
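data work.van_LYprem;
  set work.van_combined_total;
  by Policy_ID start_date;
  /* keep the stored premiums across records of the same policy */
  retain LYPrem15 LYPrem16 LYPrem17;
  if (first.Policy_ID) then do;
    /* new policy: clear the values left by the previous policy */
    LYPrem15=.;
    LYPrem16=.;
    LYPrem17=.;
  end;
  /* write the record before storing its own premium, so it only sees earlier records */
  output;
  if (year(start_date) eq 2015) then do;
    LYPrem15=Premium;
  end;
  if (year(start_date) eq 2016) then do;
    LYPrem16=Premium;
  end;
  if (year(start_date) eq 2017) then do;
    LYPrem17=Premium;
  end;
run;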
After this you will have records with Premium and LYPremXX. If a policy has more than one renewal in a year, LYPremXX will hold only the last value.
You could make it more dynamic using macros.
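A less repetitive variant, short of writing macros, is to index an array by year. This is only a sketch: it assumes start_date is a numeric SAS date, the data is sorted by Policy_ID and start_date as above, and the years of interest are 2015 to 2017. The array name ly is made up for illustration.

```sas
data work.van_LYprem;
  set work.van_combined_total;
  by Policy_ID start_date;
  /* ly is a hypothetical array name; its index is the calendar year */
  array ly[2015:2017] LYPrem15-LYPrem17;
  retain LYPrem15-LYPrem17;
  /* new policy: clear whatever the previous policy left behind */
  if first.Policy_ID then call missing(of ly[*]);
  /* write the record first, so it only sees premiums from earlier records */
  output;
  /* then remember this record's premium under its start year */
  if 2015 <= year(start_date) <= 2017 then ly[year(start_date)] = Premium;
run;
```

Adding another year then means widening the array bounds and the variable list rather than copying another if block.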
I have a table in the following structure:
month filter1 filter2 account payment users
Jan 1 0 4535 564 5000
Jan 1 1 7387 787 8000
Feb 1 0 567 5 600
In SAS I am looking to convert it into the following format:
month filter1 filter2 value amount
Jan 1 0 account 4535
Jan 1 0 payment 564
Jan 1 0 users 5000
Jan 1 1 account 7387
I am looking at proc transpose but I do not have same prefixes in the values, and I will need to confirm amounts in another column. Would anyone have any suggestions? Thank you very much!
PROC TRANSPOSE will handle your example:
proc transpose data=have out=want(rename=(col1=amount)) name=value ;
by month filter1 filter2 NOTSORTED ;
var account payment users ;
run;
In this example since you want to transpose all of the numeric variables you can just skip the VAR statement.
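As a rough illustration of the version without VAR, using the sample rows from the question (this assumes month is a character variable and everything else is numeric, so only account, payment and users are left to transpose):

```sas
/* sample data typed in from the question */
data have;
  input month $ filter1 filter2 account payment users;
  datalines;
Jan 1 0 4535 564 5000
Jan 1 1 7387 787 8000
Feb 1 0 567 5 600
;
run;

/* no VAR statement: every numeric variable not named in BY is transposed */
proc transpose data=have out=want(rename=(col1=amount)) name=value;
  by month filter1 filter2 notsorted;
run;
```

If the table later gains numeric columns that should not be transposed, the explicit VAR statement is the safer choice.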
Another option is a data step transpose, but this requires all your variables to be the same type or to be stored as character, so I can see this being problematic as well. You may want to store the type to make your life easier later on.
data want;
set have;
array _vars(*) account payment users;
do i=1 to dim(_vars);
value = vname (_vars(i));
amount = _vars(i);
output;
end;
*drop account payment users;
run;
UCLA has good tutorials on both the proc transpose option and the data step option.
I am having difficulty coming up with clean code for counting the number of times date intervals include the 15th of the month between 01.01.2019 and 31.12.2020. As a simple example we can consider the two intervals:
Obs | DateStart | DateEnd
1 14Oct2019 20Mar2020
2 13Nov2018 29Jan2020
I want to determine how many overlaps these have with the 15th of the months between 01Jan2019-31Dec2020. In the end I would like to produce a crosstable showing year in rows and months in columns with a count for each time the above intervals included the 15th of the month. From the above two date intervals, I would like an output like the following:
Months: 1 2 3 4 5 6 7 8 9 10 11 12
2019 1 1 1 1 1 1 1 1 1 2 2 2
2020 2 1 1 . . . . . . . . .
I am currently trying to set up a dataset with columns 1 through 24, which I will then reformat and later cross in a proc freq. This seems like the long way around, and I am having trouble identifying when the date intervals include the 15th of any month.
I have ~100 observations to do this for. Any help would be appreciated.
You can iterate over the date range and output to a second data set (or view) when the date is a 15th. The second data set can be tabulated for frequency counts.
Example:
data have;
call streaminit(2021);
do _n_ = 1 to 1000;
date1 = today() - rand('integer', 2500);
date2 = date1 + rand('integer', 720);
output;
end;
run;
data _15ths(keep=year month);
set have;
do date = date1 to date2;
if day(date) ne 15 then continue;
year = year(date);
month = month(date);
output;
end;
run;
proc tabulate data=_15ths;
title 'Number of date ranges with a 15th';
class year month;
table year='',month*n=''/nocellmerge box='year';
run;
Will produce the following output
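Stepping through every day works, but the loop can also jump from one 15th to the next. The following is a sketch under the same assumptions as the example above (numeric SAS dates named date1 and date2 in have); it also restricts the output to the 2019-2020 window from the question, which the day-by-day version does not do.

```sas
data _15ths(keep=year month);
  set have;
  /* 15th of the month that contains date1 */
  date = intnx('month', date1, 0, 'b') + 14;
  /* the missing() check stops the loop if date1 is missing */
  do while (not missing(date) and date <= date2);
    /* skip a first 15th that falls before the range starts,
       and anything outside the 2019-2020 reporting window */
    if date >= date1 and '01JAN2019'd <= date <= '31DEC2020'd then do;
      year = year(date);
      month = month(date);
      output;
    end;
    /* jump straight to the 15th of the following month */
    date = intnx('month', date, 1, 'b') + 14;
  end;
run;
```

The resulting _15ths data set can be passed to the same PROC TABULATE step.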
Hi, I am trying to calculate how much the customer paid in a month from the difference between that month's balance and the next month's balance.
Data looks like this: I want to calculate PaidAmount for A111 in Jun-20 by Balance in Jul-20 - Balance in June-20. Can anyone help, please? Thank you
For this situation there is no need to look ahead as you can create the output you want just by looking back.
data have;
input id date balance ;
informat date yymmdd10.;
format date yymmdd10.;
cards;
1 2020-06-01 10000
1 2020-07-01 8000
1 2020-08-01 5000
2 2020-06-01 10000
2 2020-07-01 8000
3 2020-08-01 5000
;
data want;
set have ;
by id date;
lag_date=lag(date);
format lag_date yymmdd10.;
lag_balance=lag(balance);
payment = lag_balance - balance ;
if not first.id then output;
if last.id then do;
payment=.;
lag_balance=balance;
lag_date=date;
output;
end;
drop date balance;
rename lag_date = date lag_balance=balance;
run;
proc print;
run;
Result:
Obs id date balance payment
1 1 2020-06-01 10000 2000
2 1 2020-07-01 8000 3000
3 1 2020-08-01 5000 .
4 2 2020-06-01 10000 2000
5 2 2020-07-01 8000 .
6 3 2020-08-01 5000 .
This is looking for a LEAD calculation, which is typically done via PROC EXPAND, but that requires a SAS/ETS license, which not many users have. Another option is to merge the data with itself, offsetting the records by one so that the next month's record is on the same line.
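data want;
  /* Look-ahead merge: pair each record with the one after it. */
  /* No BY statement here, because a match-merge would undo the one-record offset. */
  merge have
        have(firstobs=2 keep=clientID balance
             rename=(clientID=next_id balance=next_balance));
  /* only use the next record when it belongs to the same client */
  if clientID = next_id then PaidAmount = Balance - next_balance;
  drop next_id;
run;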
If months can be missing from your series, this is not a good approach. In that case you want to do an explicit join using SQL instead. This assumes you have month as a SAS date as well.
proc sql;
create table want as
select t1.*, t1.balance - t2.balance as paidAmount
from have as t1
left join have as t2
on t1.clientID = t2.ClientID
/* joins the current month (t1) with the next month (t2) */
and intnx('month', t1.month, 1, 'b') = intnx('month', t2.month, 0, 'b');
quit;
Code is untested as no test data was provided (I won't type out your data to test code).
I want to add an observation in SAS per group at a certain time and fill forward all values (except the time). I don't want to do it manually with datalines and proc append. Is there another way?
In the example: always insert a row per security at exactly 10:00am and use the value from the one above:
Security Time Value
ABC 9:59 2
ABC 10:01 3
.
.
.
DCE 9:58 9
DCE 10:01 3
.
.
Output:
Security Time Value
ABC 9:59 2
ABC 10:00 2
ABC 10:01 3
.
.
.
DCE 9:58 9
DCE 10:00 9
DCE 10:01 3
.
.
Thankful for any help!
Best
You can also use PROC SQL to insert a row:
PROC SQL;
INSERT INTO table_name
VALUES (value1,value2,value3,...);
QUIT;
OR
PROC SQL;
INSERT INTO table_name (column1,column2,column3,...)
VALUES (value1,value2,value3,...);
QUIT;
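INSERT takes literal values, so it does not fill forward by itself. One possible data step sketch for the 10:00 case is to build one placeholder row per security and interleave it with the original data. The assumptions here are loud ones: the input is called have, Time is a numeric SAS time value, and no security already has a row at exactly 10:00 (otherwise a duplicate is created). The names ten, want, is_new and last_value are made up for illustration.

```sas
proc sort data=have;
  by Security Time;
run;

/* one placeholder row per security, stamped 10:00 */
data ten;
  set have(keep=Security);
  by Security;
  if first.Security;
  Time = '10:00't;
run;

/* interleave the placeholders and carry the latest value into them */
data want;
  set have ten(in=is_new);
  by Security Time;
  retain last_value;
  if first.Security then last_value = .;
  if is_new then Value = last_value;
  else last_value = Value;
  drop last_value;
run;
```

If a security has no row before 10:00, the inserted row gets a missing Value, since there is nothing to carry forward.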
I'm very new to SAS. I used it last year during my studies, but theoretical knowledge is not the same as hands-on knowledge.
Here is my problem.
I have tables on SAS, that look like the example below:
table1
date var_1 var_2 var_3 var_4 var_5
1957M1 . . . . .
1957M2 . . . . 23.5
1957M3 . 1.2 . . 23.6
1957M4 . 1.3 . . 23.7
1957M5 . 1.4 . 0.123 23.8
1957M6 . 1.5 . 0.124 23.9
1957M7 . 1.6 3.0 0.125 23.10
1957M8 . 1.7 3.1 0.126 23.11
1957M9 . 1.8 3.2 0.127 23.12
1957M10 2.1 1.9 3.3 0.128 23.13
1957M11 2.2 1.10 3.4 0.129 23.14
1957M12 2.3 1.11 3.5 0.130 23.15
As you may have guessed, each variable is a time series in its own right, and date is a time series as well. All columns are numeric except the date column, which is character.
My aim is to find, for each of these variables, its start date and its latest date.
var_1 would start in October 1957 (or M10) and its latest date would be December 1957 (or M12).
var_4 would start in May 1957 (or M5) and its latest date would be December 1957 (or M12).
I've tried the following in SAS for one column of one table as a test, but it is taking a hell of a long time, without results.
PROC SQL NOPRINT;
SELECT
MIN(input(substr(date,1,4),date4.)),
MAX(input(substr(date,1,4),date4.))
FROM
table1
WHERE
var_2 <> "."
quit;
In my query, the date column is text. I'm trying to convert it into a date, but so far with the year only; keeping the months as well would be great.
My boss told me about PROC FREQ to achieve the results I want but I have no idea how.
If you have any clues, I'm taking it.
Cheers.
Your problem is that your data structure isn't really right for your problem.
The right data structure is a more vertical structure, with DATE, VAR, VALUE. Then PROC MEANS is perfect for your needs.
data have;
input date $ var_1 var_2 var_3 var_4 var_5;
datalines;
1957M1 . . . . .
1957M2 . . . . 23.5
1957M3 . 1.2 . . 23.6
1957M4 . 1.3 . . 23.7
1957M5 . 1.4 . 0.123 23.8
1957M6 . 1.5 . 0.124 23.9
1957M7 . 1.6 3.0 0.125 23.10
1957M8 . 1.7 3.1 0.126 23.11
1957M9 . 1.8 3.2 0.127 23.12
1957M10 2.1 1.9 3.3 0.128 23.13
1957M11 2.2 1.10 3.4 0.129 23.14
1957M12 2.3 1.11 3.5 0.130 23.15
;;;;
run;
data want;
set have;
array var_[5];
date_num = mdy(input(substr(date,6),2.),1,input(substr(date,1,4),4.)); * EXPLICIT INPUT AVOIDS CHARACTER-TO-NUMERIC CONVERSION NOTES IN THE LOG;
do _iter= 1 to dim(var_);
if not missing(var_[_iter]) then do;
var = vname(var_[_iter]);
value = var_[_iter];
output;
end;
end;
format date_num MONYY.;
run;
proc means data=want;
class var;
var date_num;
output out=edge_dates min= max= /autoname;
run;
If performance is an issue, this is the fastest way I know as it only needs to read the data once. Use Joe's code to create the 'have' dataset:
data want;
format date_num start1-start5 end1-end5 monyy.;
set have end=eof;
retain start1-start5 end1-end5 .; * RETAIN THE VALUES WE WILL BE CALCULATING AS WE ITERATE ACROSS ROWS IN THE DATASET;
array arr_var [*] var_1-var_5 ; * ARRAY FOR EXISTING VARIABLES;
array arr_start[*] start1-start5; * ARRAY FOR NEW VARIABLES THAT WILL CONTAIN START DATE;
array arr_end [*] end1-end5 ; * ARRAY FOR NEW VARIABLES THAT WILL CONTAIN END DATE;
date_num = mdy(input(substr(date,6),best.),1,input(substr(date,1,4),best.));
do iter=1 to dim(arr_var); * LOOPING FOR THE NUMBER OF VARIABLES IN ARR_VAR;
if arr_var[iter] ne . then do; * ONLY GOING TO PERFORM CALCS WHEN THE VARIABLE IS NOT MISSING;
if arr_start[iter] eq . then do;
arr_start[iter] = date_num; * ONLY UPDATE THE START DATE IF IT HASNT ALREADY BEEN SET;
end;
arr_end[iter] = date_num; * IF ITS NOT MISSING, ALWAYS UPDATE THE END DATE;
end;
end;
if eof then do;
output; * ONLY OUTPUT THE CALCULATED VALUES ONCE WE HIT THE END OF THE DATASET;
end;
keep start: end:; * KEEP ONLY VARS STARTING WITH START OR END;
run;
Normally I wouldn't recommend calculating start and end dates this way unless performance is a consideration, or if this code ends up being simpler than the alternative.
Most of the time you're better off prepping the data structure differently, although there are certainly circumstances where having the data in the above format is advantageous as well.