Left Join collapses data


I am working with some bond data and want to left join the interest rate projections onto it. My bond dataset looks like this:
data have;
input ID Vintage Reference_Rate :$10. Base2017;
Datalines;
1 2017 LIBOR_001M 0.01
1 2018 LIBOR_001M 0.01
1 2019 LIBOR_001M 0.01
1 2020 LIBOR_001M 0.01
2 2017 LIBOR_003M 0.012
2 2018 LIBOR_003M 0.012
2 2019 LIBOR_003M 0.012
2 2020 LIBOR_003M 0.012
3 2017 LIBOR_006M 0.014
3 2018 LIBOR_006M 0.014
3 2019 LIBOR_006M 0.014
3 2020 LIBOR_006M 0.014
;
run;
The second dataset, which I want to left join (or even full join), looks like this:
data have2;
input Reference_rate :$10. Base2018 Base2019 Base2020;
datalines;
LIBOR_001M 0.011 0.012 0.013
LIBOR_003M 0.013 0.014 0.015
LIBOR_006M 0.015 0.017 0.019
;
run;
The dataset I've been getting collapses the vintages into a single row per ID, which breaks the rest of my analysis. It looks like this:
data dontwant;
input ID Vintage Reference_rate :$10. Base2017 Base2018 Base2019 Base2020;
datalines;
1 2017 LIBOR_001M 0.01 0.011 0.012 0.013
2 2017 LIBOR_003M 0.012 0.013 0.014 0.015
3 2017 LIBOR_006M 0.014 0.015 0.017 0.019
;
run;
The dataset I would like looks like this:
data want;
input ID Vintage Reference_rate :$10. Base2017 Base2018 Base2019 Base2020;
datalines;
1 2017 LIBOR_001M 0.01 0.011 0.012 0.013
1 2018 LIBOR_001M 0.01 0.011 0.012 0.013
1 2019 LIBOR_001M 0.01 0.011 0.012 0.013
1 2020 LIBOR_001M 0.01 0.011 0.012 0.013
2 2017 LIBOR_003M 0.012 0.013 0.014 0.015
2 2018 LIBOR_003M 0.012 0.013 0.014 0.015
2 2019 LIBOR_003M 0.012 0.013 0.014 0.015
2 2020 LIBOR_003M 0.012 0.013 0.014 0.015
3 2017 LIBOR_006M 0.014 0.015 0.017 0.019
3 2018 LIBOR_006M 0.014 0.015 0.017 0.019
3 2019 LIBOR_006M 0.014 0.015 0.017 0.019
3 2020 LIBOR_006M 0.014 0.015 0.017 0.019
;
run;
The code I have been using is a fairly standard PROC SQL step:
PROC SQL;
CREATE TABLE want AS
SELECT a.*, b.*
FROM have A LEFT JOIN have2 B
ON A.reference_rate = B.reference_rate
ORDER BY reference_rate;
QUIT;

It's good practice to avoid SELECT *: listing the columns explicitly can help query performance and avoids problems when the same column name exists in both tables.
I ran your code as posted and it worked fine, apart from one warning: because you select both a.* and b.*, the field "Reference_Rate" is present in both tables.
Solution:
PROC SQL;
CREATE TABLE want AS
SELECT
a.ID,
a.Vintage,
a.Reference_Rate,
b.Base2018,
b.Base2019,
b.Base2020
FROM have A LEFT JOIN have2 B
ON A.reference_rate = B.reference_rate
ORDER BY reference_rate;
QUIT;
Tip:
You can print the values of a SAS table to the log using PUT _ALL_.
The code below does not create a table. It only prints the table to the log, which is handy for debugging small tables.
data _null_;
set want;
put _all_;
run;
Log:
ID=1 Vintage=2019 Reference_Rate=LIBOR_001M Base2018=0.011 Base2019=0.012 Base2020=0.013 _ERROR_=0 _N_=1
ID=1 Vintage=2020 Reference_Rate=LIBOR_001M Base2018=0.011 Base2019=0.012 Base2020=0.013 _ERROR_=0 _N_=2
ID=1 Vintage=2017 Reference_Rate=LIBOR_001M Base2018=0.011 Base2019=0.012 Base2020=0.013 _ERROR_=0 _N_=3
ID=1 Vintage=2018 Reference_Rate=LIBOR_001M Base2018=0.011 Base2019=0.012 Base2020=0.013 _ERROR_=0 _N_=4
ID=2 Vintage=2019 Reference_Rate=LIBOR_003M Base2018=0.013 Base2019=0.014 Base2020=0.015 _ERROR_=0 _N_=5
ID=2 Vintage=2018 Reference_Rate=LIBOR_003M Base2018=0.013 Base2019=0.014 Base2020=0.015 _ERROR_=0 _N_=6
ID=2 Vintage=2017 Reference_Rate=LIBOR_003M Base2018=0.013 Base2019=0.014 Base2020=0.015 _ERROR_=0 _N_=7
ID=2 Vintage=2020 Reference_Rate=LIBOR_003M Base2018=0.013 Base2019=0.014 Base2020=0.015 _ERROR_=0 _N_=8
ID=3 Vintage=2020 Reference_Rate=LIBOR_006M Base2018=0.015 Base2019=0.017 Base2020=0.019 _ERROR_=0 _N_=9
ID=3 Vintage=2019 Reference_Rate=LIBOR_006M Base2018=0.015 Base2019=0.017 Base2020=0.019 _ERROR_=0 _N_=10
ID=3 Vintage=2018 Reference_Rate=LIBOR_006M Base2018=0.015 Base2019=0.017 Base2020=0.019 _ERROR_=0 _N_=11
ID=3 Vintage=2017 Reference_Rate=LIBOR_006M Base2018=0.015 Base2019=0.017 Base2020=0.019 _ERROR_=0 _N_=12
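The desired dataset in the question also carries Base2017 from the first table, which the explicit column list above leaves out. A minimal variation that keeps it might look like the sketch below. The output name want_full is only an illustrative assumption, and the ORDER BY on ID and Vintage is a choice made here so that the rows come back in the same order as the input, not something the original answer does.

```sas
proc sql;
  /* want_full is a hypothetical name used only for this sketch */
  create table want_full as
  select a.ID,
         a.Vintage,
         a.Reference_Rate,
         a.Base2017,          /* carried over from the bond table */
         b.Base2018,
         b.Base2019,
         b.Base2020
  from have as a
       left join have2 as b
       on a.Reference_Rate = b.Reference_Rate
  order by a.ID, a.Vintage;   /* keep every vintage row, in input order */
quit;
```

Because have2 has exactly one row per Reference_Rate, the join should return one output row for every row of have, so the vintages are not collapsed.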

Related

Parsing date periods and summarise days

I need help with something I know how to do in R but not in SAS, and I have to do it in SAS.
Suppose I have the following dataset:
ID Start End Label
subjectA 01/01/2020 15/01/2020 holidays
subjectA 16/01/2020 20/01/2020 holidays
subjectB 01/05/2020 30/05/2020 holidays
subjectB 01/06/2020 07/06/2020 holidays
subjectC 01/02/2020 01/02/2020 work_permit
subjectD 01/03/2020 01/09/2020 maternity
subjectE 03/01/2020 09/01/2020 disease
subjectE 11/01/2020 13/01/2020 disease
subjectF 12/02/2020 12/02/2020 work_permit
subjectG 11/09/2020 20/09/2020 course
....... ........ ........ ..........
I need the following:
for repeated entries after sorting Start and End so that the previous Start-End represents a time period before the subsequent:
if the difference between the Start and End is 1 day for the same repeated entry (ID) then sum the number of days, otherwise (>1 days) count without sum. This for holidays, maternity and disease.
if the difference between the Start and End is 0 days (consecutive) for the same repeated entry (ID) then sum the number of days, otherwise if > 1 days sum without count. This for holidays, maternity and disease.
For not repeated entries:
count the days;
count 1 day if the same Start-End.
Desired output:
ID Days Label Flag
subjectA 20 holidays summarised
subjectB 37 holidays summarised
subjectC 1 work_permit single_day
subjectD 184 maternity consecutive_not_summarized
subjectE 19 disease summarised
subjectF 1 work_permit single_day
subjectG 10 course consecutive_not_summarized
....... ........ ........ ..........
For ranges involving February, note that February 2020 had 29 days. There might also be more than two periods per repeated record.
Sorry if this seems complex. I have no idea how to start writing this in SAS, so I need some support and guidance.
Thank you in advance

To calculate the days in the period just subtract start from end and add one. To calculate the gap between periods use the LAG() of end. Make sure to reset the calculated gap when starting a new ID.
data have;
input ID :$20. Start :ddmmyy. End :ddmmyy. Label :$20.;
format start end yymmdd10.;
cards;
subjectA 01/01/2020 15/01/2020 holidays
subjectA 16/01/2020 20/01/2020 holidays
subjectB 01/05/2020 30/05/2020 holidays
subjectB 01/06/2020 07/06/2020 holidays
subjectC 01/02/2020 01/02/2020 work_permit
subjectD 01/03/2020 01/09/2020 maternity
subjectE 03/01/2020 09/01/2020 disease
subjectE 11/01/2020 13/01/2020 disease
subjectF 12/02/2020 12/02/2020 work_permit
subjectG 11/09/2020 20/09/2020 course
;
data days;
set have;
by id;
days = end - start + 1 ;
gap = start - lag(end);
if first.id then gap=. ;
run;
Result
Obs ID Start End Label days gap
1 subjectA 2020-01-01 2020-01-15 holidays 15 .
2 subjectA 2020-01-16 2020-01-20 holidays 5 1
3 subjectB 2020-05-01 2020-05-30 holidays 30 .
4 subjectB 2020-06-01 2020-06-07 holidays 7 2
5 subjectC 2020-02-01 2020-02-01 work_permit 1 .
6 subjectD 2020-03-01 2020-09-01 maternity 185 .
7 subjectE 2020-01-03 2020-01-09 disease 7 .
8 subjectE 2020-01-11 2020-01-13 disease 3 2
9 subjectF 2020-02-12 2020-02-12 work_permit 1 .
10 subjectG 2020-09-11 2020-09-20 course 10 .
But I cannot figure out what you want to do when the gap is larger than 1 and your example data and results do not really provide any real guidance. For most cases you seem to just want the sum of the days whether or not there is a large gap.
proc summary data=days;
by id label;
var days;
output out=want sum= ;
run;
Result:
Obs ID Label _TYPE_ _FREQ_ days
1 subjectA holidays 0 2 20
2 subjectB holidays 0 2 37
3 subjectC work_permit 0 1 1
4 subjectD maternity 0 1 185
5 subjectE disease 0 2 10
6 subjectF work_permit 0 1 1
7 subjectG course 0 1 10
If you want to exclude periods that are more than 1 day after the previous period you could just add a WHERE clause.
proc summary data=days;
where gap < 2;
by id label;
var days;
output out=want sum= ;
run;
Results:
Obs ID Label _TYPE_ _FREQ_ days
1 subjectA holidays 0 2 20
2 subjectB holidays 0 1 30
3 subjectC work_permit 0 1 1
4 subjectD maternity 0 1 185
5 subjectE disease 0 1 7
6 subjectF work_permit 0 1 1
7 subjectG course 0 1 10
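One detail about the WHERE clause above is worth spelling out: SAS treats a missing numeric value as smaller than any number, so gap < 2 also keeps the first row of each ID, where gap was reset to missing. A sketch that states this explicitly, assuming the same days dataset and using a hypothetical output name want_explicit, could be:

```sas
proc summary data=days;
  /* keep the first period of each ID (gap is missing)
     and any period that starts at most 1 day after the previous end */
  where missing(gap) or gap <= 1;
  by id label;
  var days;
  output out=want_explicit sum=;  /* want_explicit is a hypothetical name */
run;
```

Since the gaps here are whole days, gap <= 1 selects the same rows as gap < 2, so this should give the same totals as the step above.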
If the goal is to collapse the intervals into periods without gaps, then make a new variable to indicate when a new period starts.
data days;
set have;
by id;
days = end - start + 1 ;
gap = start - lag(end);
period + (gap > 1);
if first.id then do;
gap=. ;
period=1;
end;
run;
proc summary data=days ;
by id period label ;
var days;
output out=want sum=;
run;
Now subjects B and E have two periods and the other examples only one.
Results
Obs ID period Label _TYPE_ _FREQ_ days
1 subjectA 1 holidays 0 2 20
2 subjectB 1 holidays 0 1 30
3 subjectB 2 holidays 0 1 7
4 subjectC 1 work_permit 0 1 1
5 subjectD 1 maternity 0 1 185
6 subjectE 1 disease 0 1 7
7 subjectE 2 disease 0 1 3
8 subjectF 1 work_permit 0 1 1
9 subjectG 1 course 0 1 10

I strongly suspect this won't scale, because you're not accounting for weekends or other days off allowed at work (statutory holidays, office closures), and the consecutive-days rule doesn't seem to be applied consistently.
But here's a start. You can modify as needed or update your question with more details. Please include data as a data step in future questions (like what I have in the first step of my answer below).
data have;
input ID : $14. Start : ddmmyy10. End : ddmmyy10. Label : $20.;
format start end date9.;
cards;
subjectA 01/01/2020 15/01/2020 holidays
subjectA 16/01/2020 20/01/2020 holidays
subjectB 01/05/2020 30/05/2020 holidays
subjectB 01/06/2020 07/06/2020 holidays
subjectC 01/02/2020 01/02/2020 work_permit
subjectD 01/03/2020 01/09/2020 maternity
subjectE 03/01/2020 09/01/2020 disease
subjectE 11/01/2020 13/01/2020 disease
subjectF 12/02/2020 12/02/2020 work_permit
subjectG 11/09/2020 20/09/2020 course
;
run;
data groups;
set have;
by id label notsorted;
prev_end=lag(end);
if first.id then
do;
group=0;
call missing(prev_end);
end;
if first.label or prev_end+1 ne start then
group+1;
duration=end-start+1;
run;
proc means data=groups noprint nway;
class id label group;
var duration;
output out=summarized N=Number_Events Sum=Total_Duration;
run;
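data want;
length flag $30; /* long enough for the longest flag value */
set summarized; /* one row per id/label/group from PROC MEANS */
/* Assumption: "summarized" means more than one interval was merged into the group */
if Number_Events > 1 and label not in ('work_permit', 'maternity') then flag = 'Summarized';
else if Total_Duration = 1 then flag = 'Single Day';
else flag = 'consecutive_not_summarized';
run;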

DAX SUM between Dates is not working as expected

I have two tables:
DateDim
Time
I am trying to get the sum of hours_actual from my Time table where they are between two dates from my DateDim. They have a relationship on the date shown in the following:
I am currently using the following DAX formula:
PreviousPeriod_Hours = CALCULATE(SUM('Time'[hours_actual])
,DATESBETWEEN(
DateDim[FullDateAlternateKey],
[Start of Previous Period],
[End of Previous Period]),
ALL(DateDim)
)
The values for [Start of Previous Period] and [End of Previous Period] are calculated DAX dates, that are showing as I would expect.
In order to arrive at those dates I create a few DAX functions first:
Start of This Period = FIRSTDATE(DateDim[FullDateAlternateKey])
End of This Period = LASTDATE(DateDim[FullDateAlternateKey])
Days in This Period = DATEDIFF([Start of This Period],[End of This Period],DAY)
End of Previous Period = PREVIOUSDAY(LASTDATE(DATEADD(DateDim[FullDateAlternateKey],-1*[Days in This Period],DAY)))
Start of Previous Period = PREVIOUSDAY(FIRSTDATE(DATEADD(DateDim[FullDateAlternateKey],-1*[Days in This Period] + IF(MOD(Year('MeasureTable'[End of This Period]),4) == 0,1,0),DAY)))
To quickly summarize the above, it is finding the days between a start and end date, and then subtracting these days from my start and end dates that are selected. If it is a leap year, then add a day.
The DAX formula gives me the correct total. However, when I display the hours by month between the two dates, the monthly values are different from what they should be and don't add up to the total it displays.
I was expecting the following values:
I am not sure where the 13 is coming from, and the 28.25 looks like a repeat from the previous month of the following year. What am I missing here? Is my current approach correct and I am just doing something wrong, or am I taking the wrong approach altogether?
UPDATE - Adding in some of the data I am working with:
Then the DateDim is just a generated date table, for example, a row looks like the following (2016-2021): 
FullDateAlternateKey Year Month Month Name Quarter Week of Year Week of Month Day Day of Week Day of Year Day Name Fiscal Year Fiscal Period Fiscal Quarter
2016-01-02 2016 1 January 1 1 1 2 6 2 Saturday 2016 5 2
And the hours_actual and date look like the following: 
Date_Start hours_actual
2019-03-05 12:00:00 AM 5
2019-03-26 12:00:00 AM 3
2019-04-23 12:00:00 AM 0.75
2019-04-24 12:00:00 AM 0.08
2019-05-22 12:00:00 AM 4
2019-05-22 12:00:00 AM 2
2019-05-22 12:00:00 AM 1.75
2019-05-27 12:00:00 AM 8
2019-05-31 12:00:00 AM 0.25
2019-06-03 12:00:00 AM 0.25
2019-06-05 12:00:00 AM 0.25
2019-06-21 12:00:00 AM 1
2019-06-27 12:00:00 AM 2
2019-06-27 12:00:00 AM 0.5
2019-06-28 12:00:00 AM 1
2019-06-28 12:00:00 AM 3
2019-07-04 12:00:00 AM 3
2019-07-05 12:00:00 AM 3
2019-07-10 12:00:00 AM 2.5
2019-07-10 12:00:00 AM 0.5
2019-07-10 12:00:00 AM 1.5
2019-07-10 12:00:00 AM 0.5
2019-07-10 12:00:00 AM 2
2019-07-12 12:00:00 AM 2.5
2019-07-17 12:00:00 AM 1
2019-07-18 12:00:00 AM 0.5
2019-07-24 12:00:00 AM 0.5
2019-07-24 12:00:00 AM 1
2019-07-24 12:00:00 AM 1.5
2019-07-24 12:00:00 AM 1
2019-07-25 12:00:00 AM 1
2019-07-25 12:00:00 AM 0.5
2019-07-31 12:00:00 AM 1
2019-07-31 12:00:00 AM 1.5
2019-07-31 12:00:00 AM 1
2019-07-31 12:00:00 AM 0.5
2019-08-01 12:00:00 AM 2
2019-08-07 12:00:00 AM 4
2019-08-07 12:00:00 AM 3.75
2019-08-08 12:00:00 AM 4
2019-08-14 12:00:00 AM 1.25
2019-09-11 12:00:00 AM 3.5
2019-09-11 12:00:00 AM 2.5
2019-09-12 12:00:00 AM 3
2019-09-12 12:00:00 AM 1.75
2019-09-13 12:00:00 AM 4
2019-09-13 12:00:00 AM 1.75
2019-09-13 12:00:00 AM 3
2019-09-14 12:00:00 AM 2
2019-09-14 12:00:00 AM 3.25
2019-09-16 12:00:00 AM 0.5
2019-09-16 12:00:00 AM 0.5
2019-09-26 12:00:00 AM 2.5

After experimenting a little more, I found that the DAX measures for the previous start and end dates were being evaluated on a monthly basis as well as a yearly basis. My mistake was assuming the measures would be evaluated only against the slicers and not against the values presented in the table.
I took a different approach: I created a reference table of the Time table and added a column that adds a year to the date on each row. I then joined the reference table to my DateDim table on this future_date column. With that I was finally able to show the values for the current period and the previous period, and it gave the results I was looking for.

Summing results by month in SAS

I have this kind of data
Configuration Retail Price month
1 450 Jan
1 520 Feb
1 630 Mar
5 650 Jan
5 320 Feb
5 480 Mar
9 770 Jan
9 180 Feb
9 320 Mar
I want my data to look like this
Configuration Jan Feb Mar
1 450 520 630
5 650 320 480
9 770 180 320

Generating some data to tinker with:
data begin;
length Configuration 3 Retail_Price 3 month $3;
input Configuration Retail_Price month;
datalines;
1 450 Jan
1 520 Feb
1 630 Mar
5 650 Jan
5 320 Feb
5 480 Mar
9 770 Jan
9 180 Feb
9 320 Mar
;
run;
BY-group processing in SAS requires the data to be sorted by the BY variables. (You can also use an index, but that is a different topic.) We use PROC SORT for this.
proc sort data= begin; by Configuration month; run;
A way to calculate sum is to utilize PROC MEANS.
proc means data=begin noprint; /* NOPRINT suppresses the printed report */
by Configuration month; /* These are the subsets */
output out = From_means(drop=_TYPE_ _FREQ_) /* Drop the automatic variables for simplicity */
sum(Retail_Price)=
;
run;
At this point we have the data in narrow format. A way to transform this to wide format is PROC TRANSPOSE.
proc transpose data=From_means out=Wide_format;
by Configuration;
id month;
run;
Bear in mind that there are several other ways to accomplish the same thing. A popular one is to use PROC SQL for almost everything, but in my experience large datasets are better handled by the dedicated SAS procedures.
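For comparison, here is a rough sketch of the PROC SQL route mentioned above, using conditional sums. It assumes the begin dataset from this answer and that the three month values are known in advance and hard-coded. The table name wide_sql is a hypothetical name chosen for illustration.

```sas
proc sql;
  /* wide_sql is a hypothetical name; the month list is hard-coded on purpose */
  create table wide_sql as
  select Configuration,
         sum(case when month = 'Jan' then Retail_Price else . end) as Jan,
         sum(case when month = 'Feb' then Retail_Price else . end) as Feb,
         sum(case when month = 'Mar' then Retail_Price else . end) as Mar
  from begin
  group by Configuration;
quit;
```

One trade-off to be aware of: this version fixes the column order as Jan, Feb, Mar, whereas the PROC TRANSPOSE route should create the columns in the order the months are first encountered in the sorted data, which is alphabetical here. The SQL version needs editing whenever a new month appears, so it is only a reasonable choice when the set of months is small and stable.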

How to change sas plot x axis order

I have a dataset like the following:
x y
16:00 1
17:00 2
18:00 2
19:00 3
20:00 4
21:00 5
22:00 6
23:00 1
24:00 1
01:00 2
02:00 3
03:00 1
04:00 7
...
I want to plot the relationship between x and y using the code below. I want the x axis to start at 16:00 and end at 04:00, but with this code it starts at 00:00 and ends at 16:00. Can anyone show me how to adjust my code? (I don't want to type the order out value by value, as in order = ("16:00" ... "04:00").)
PROC SGPLOT DATA = data;
SERIES X = x Y = y;
axis order=("16:00:00"t to "03:00:00"t by hour);
TITLE 'Plot';
RUN;

The problem is that values on a numeric X axis cannot be placed out of order, and as SAS time values 1am < 11pm. So you cannot go around the clock, so to speak.
A workaround is to turn the time values into datetimes, that is, add a day component to them, and then display only the time portion.
data have;
informat x time5. y best.;
format x time5.;
input x y;
datalines;
16:00 1
17:00 2
18:00 2
19:00 3
20:00 4
21:00 5
22:00 6
23:00 1
24:00 1
01:00 2
02:00 3
03:00 1
04:00 7
;
run;
data have;
retain day 0;
set have;
format x_new datetime.;
/*Count Days*/
if x = "24:00"t then
day = day + 1;
x_new = dhms(day,hour(x),minute(x),second(x));
run;
proc sgplot data=have;
series x=x_new y=y;
xaxis valuesformat=tod5.;
run;
Here I look for the 24-hour mark to increment the day count, then create a new variable that holds the day plus the time.
When plotting, tell SAS to use the TODw.d format, which displays only the time portion.
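The step above depends on the data containing an explicit 24:00 row. If that cannot be guaranteed, one possible variation is to start a new day whenever the time goes backwards compared with the previous row. This is only a sketch under the assumption that the rows are already in chronological order. The names have_wrapped, wrap_day, prev_x and x_dt are hypothetical and chosen so they do not collide with the variables created earlier.

```sas
data have_wrapped;              /* hypothetical output name */
  set have;
  retain wrap_day 0;            /* number of midnights crossed so far */
  prev_x = lag(x);              /* LAG is called on every row, unconditionally */
  if _n_ > 1 and x < prev_x then wrap_day + 1;  /* time went backwards: new day */
  x_dt = dhms(wrap_day, 0, 0, x);               /* day offset plus seconds since midnight */
  format x_dt datetime19.;
  drop prev_x;
run;

proc sgplot data=have_wrapped;
  series x=x_dt y=y;
  xaxis valuesformat=tod5.;     /* show only the time-of-day part */
run;
```

With the sample data, the 24:00 row stays on the first day (as 86400 seconds, which is midnight of the next day) and the counter increments at 01:00, so the points should remain in the same order as with the original approach.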

Proc Rank-whole dataset

I am trying to create ranks for 2 variables, which I will then sum to create a score.
Issue: I need to rank the whole dataset (i.e. into k quantile groups where k=n).
I'm using proc rank right now to calculate the rank for 1 variable. The variable is called first and I want to generate the rank called firstrank.
proc rank data = moo out= outmoo;
var first;
ranks firstrank;
run;
My output looks like
Obs first firstrank
1 0.000 9.5
2 0.000 9.5
3 0.000 9.5
4 0.000 9.5
5 0.000 9.5
6 0.000 9.5
7 0.000 9.5
8 0.000 9.5
9 0.000 9.5
10 0.000 9.5
11 0.000 9.5
12 0.000 9.5
13 0.000 9.5
14 0.000 9.5
15 0.000 9.5
16 0.000 9.5
17 0.000 9.5
18 0.000 9.5
19 0.105 19.5
20 0.105 19.5
21 0.210 23.5
22 0.210 23.5
23 0.210 23.5
24 0.210 23.5
25 0.210 23.5
26 0.210 23.5
As you can see the ranks are being averaged across ties in the variable first.
What I am trying to achieve is firstrank=1 for all the values where first=0, firstrank=2 where first=0.105, and so on.
Is there a way using SAS proc rank to do this? Or is there another proc to do this?

If I understand your question, you need the TIES=DENSE option (or CONDENSE, its alias). See the documentation on PROC RANK.
data test;
do x = 1 to 8;
do y = 1 to 3;
output;
end;
end;
run;
proc rank data=test out=want ties=dense;
var x;
ranks r;
run;
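Applied to the dataset from the question, the same option might be used as in the sketch below. It assumes the input dataset is moo, the variable to rank is first, and the rank should be stored in firstrank, as described in the question.

```sas
proc rank data=moo out=outmoo ties=dense;
  var first;          /* variable to rank */
  ranks firstrank;    /* new variable that receives the dense rank */
run;
```

With TIES=DENSE, tied values share one rank and the next distinct value gets the next integer, so the rows with first=0 should get firstrank=1, first=0.105 should get 2, and first=0.210 should get 3.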
