Proc Rank-whole dataset - sas

I am trying to create ranks for 2 variables, which I will then sum to create a score.
Issue: I need to rank the whole dataset (i.e. into k quantile groups where k=n).
I'm using proc rank right now to calculate the rank for 1 variable. The variable is called first and I want to generate the rank called firstrank.
proc rank data = moo out= outmoo;
var first;
ranks firstrank;
run;
My output looks like
Obs first firstrank
1 0.000 9.5
2 0.000 9.5
3 0.000 9.5
4 0.000 9.5
5 0.000 9.5
6 0.000 9.5
7 0.000 9.5
8 0.000 9.5
9 0.000 9.5
10 0.000 9.5
11 0.000 9.5
12 0.000 9.5
13 0.000 9.5
14 0.000 9.5
15 0.000 9.5
16 0.000 9.5
17 0.000 9.5
18 0.000 9.5
19 0.105 19.5
20 0.105 19.5
21 0.210 23.5
22 0.210 23.5
23 0.210 23.5
24 0.210 23.5
25 0.210 23.5
26 0.210 23.5
As you can see the ranks are being averaged across ties in the variable first.
What I want instead is for every row where first=0 to get firstrank=1, every row where first=0.105 to get firstrank=2, and so on.
Is there a way using SAS proc rank to do this? Or is there another proc to do this?

If I understand your question, you need the TIES=DENSE option (or CONDENSE, its alias). See the documentation on PROC RANK.
data test;
do x = 1 to 8;
do y = 1 to 3;
output;
end;
end;
run;
proc rank data=test out=want ties=dense;
var x;
ranks r;
run;
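Applied to the dataset in the question, this might look like the sketch below. It assumes the input variable is first, as described in the question, and that the second variable to be ranked is called second. The names second, secondrank, scored and score are hypothetical. The two dense ranks are then summed in a DATA step.

```sas
/* Dense ranks: tied values share one rank and no ranks are skipped. */
proc rank data=moo out=outmoo ties=dense;
  var first second;             /* "second" is an assumed variable name */
  ranks firstrank secondrank;   /* one rank variable per VAR variable   */
run;

/* Sum the two ranks into a single score. */
data scored;
  set outmoo;
  score = firstrank + secondrank;
run;
```

With TIES=DENSE, all rows with first=0 should get firstrank=1 and the rows with first=0.105 should get firstrank=2.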

Related

Left Join collapses data

I am working with some bond data and want to left join the interest rate projections onto it. My bond dataset looks like this:
data have;
input ID Vintage Reference_Rate :$10. Base2017;
Datalines;
1 2017 LIBOR_001M 0.01
1 2018 LIBOR_001M 0.01
1 2019 LIBOR_001M 0.01
1 2020 LIBOR_001M 0.01
2 2017 LIBOR_003M 0.012
2 2018 LIBOR_003M 0.012
2 2019 LIBOR_003M 0.012
2 2020 LIBOR_003M 0.012
3 2017 LIBOR_006M 0.014
3 2018 LIBOR_006M 0.014
3 2019 LIBOR_006M 0.014
3 2020 LIBOR_006M 0.014
;
run;
The second dataset, which I want to left join (or even full join), looks like this:
data have2;
input Reference_rate :$10. Base2018 Base2019 Base2020;
datalines;
LIBOR_001M 0.011 0.012 0.013
LIBOR_003M 0.013 0.014 0.015
LIBOR_006M 0.015 0.017 0.019
;
run;
The dataset I've been getting collapses the vintages into one row per ID, which breaks the rest of my analysis. It looks like this:
data dontwant;
input ID Vintage Reference_rate :$10. Base2017 Base2018 Base2019 Base2020;
datalines;
1 2017 LIBOR_001M 0.01 0.011 0.012 0.013
2 2017 LIBOR_003M 0.012 0.013 0.014 0.015
3 2017 LIBOR_006M 0.014 0.015 0.017 0.019
;
run;
The dataset I would like looks like this:
data want;
input ID Vintage Reference_rate :$10. Base2017 Base2018 Base2019 Base2020;
datalines;
1 2017 LIBOR_001M 0.01 0.011 0.012 0.013
1 2018 LIBOR_001M 0.01 0.011 0.012 0.013
1 2019 LIBOR_001M 0.01 0.011 0.012 0.013
1 2020 LIBOR_001M 0.01 0.011 0.012 0.013
2 2017 LIBOR_003M 0.012 0.013 0.014 0.015
2 2018 LIBOR_003M 0.012 0.013 0.014 0.015
2 2019 LIBOR_003M 0.012 0.013 0.014 0.015
2 2020 LIBOR_003M 0.012 0.013 0.014 0.015
3 2017 LIBOR_006M 0.014 0.015 0.017 0.019
3 2018 LIBOR_006M 0.014 0.015 0.017 0.019
3 2019 LIBOR_006M 0.014 0.015 0.017 0.019
3 2020 LIBOR_006M 0.014 0.015 0.017 0.019
;
run;
The code I have been using is a pretty standard PROC SQL:
PROC SQL;
CREATE TABLE want AS
SELECT a.*, b.*
FROM have A LEFT JOIN have2 B
ON A.reference_rate = B.reference_rate
ORDER BY reference_rate;
QUIT;
It's good practice to avoid SELECT *: listing the columns explicitly can help query performance and avoids clashes when the same column name exists in both tables.
I ran your code as posted and it worked fine, apart from one warning: because you select a.* and b.*, the column Reference_Rate comes from both tables.
Solution (add a.Base2017 to the list if you also need that column):
PROC SQL;
CREATE TABLE want AS
SELECT
a.ID,
a.Vintage,
a.Reference_Rate,
b.Base2018,
b.Base2019,
b.Base2020
FROM have A LEFT JOIN have2 B
ON A.reference_rate = B.reference_rate
ORDER BY reference_rate;
QUIT;
Tip:
You can print the values of a SAS table to the log using PUT _ALL_.
The code below does not create a table; it only writes each row to the log, which is handy for debugging small tables.
data _null_;
set want;
put _all_;
run;
Log:
ID=1 Vintage=2019 Reference_Rate=LIBOR_001M Base2018=0.011 Base2019=0.012 Base2020=0.013 _ERROR_=0 _N_=1
ID=1 Vintage=2020 Reference_Rate=LIBOR_001M Base2018=0.011 Base2019=0.012 Base2020=0.013 _ERROR_=0 _N_=2
ID=1 Vintage=2017 Reference_Rate=LIBOR_001M Base2018=0.011 Base2019=0.012 Base2020=0.013 _ERROR_=0 _N_=3
ID=1 Vintage=2018 Reference_Rate=LIBOR_001M Base2018=0.011 Base2019=0.012 Base2020=0.013 _ERROR_=0 _N_=4
ID=2 Vintage=2019 Reference_Rate=LIBOR_003M Base2018=0.013 Base2019=0.014 Base2020=0.015 _ERROR_=0 _N_=5
ID=2 Vintage=2018 Reference_Rate=LIBOR_003M Base2018=0.013 Base2019=0.014 Base2020=0.015 _ERROR_=0 _N_=6
ID=2 Vintage=2017 Reference_Rate=LIBOR_003M Base2018=0.013 Base2019=0.014 Base2020=0.015 _ERROR_=0 _N_=7
ID=2 Vintage=2020 Reference_Rate=LIBOR_003M Base2018=0.013 Base2019=0.014 Base2020=0.015 _ERROR_=0 _N_=8
ID=3 Vintage=2020 Reference_Rate=LIBOR_006M Base2018=0.015 Base2019=0.017 Base2020=0.019 _ERROR_=0 _N_=9
ID=3 Vintage=2019 Reference_Rate=LIBOR_006M Base2018=0.015 Base2019=0.017 Base2020=0.019 _ERROR_=0 _N_=10
ID=3 Vintage=2018 Reference_Rate=LIBOR_006M Base2018=0.015 Base2019=0.017 Base2020=0.019 _ERROR_=0 _N_=11
ID=3 Vintage=2017 Reference_Rate=LIBOR_006M Base2018=0.015 Base2019=0.017 Base2020=0.019 _ERROR_=0 _N_=12
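To confirm that the join does not collapse any rows, a quick row-count comparison can help. This is only a sketch using the have and want tables from above; the column aliases n_have and n_want are made up for illustration.

```sas
/* Each Reference_Rate appears once in have2, so a left join
   from have should return exactly as many rows as have.     */
proc sql;
  select count(*) as n_have from have;
  select count(*) as n_want from want;
quit;
```

With the sample data both counts should be 12, matching the 12 rows in the log above.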

A dynamic SAS program to consolidate dates of events that are nested within each other

Hello,
I want to write a dynamic program that flags events whose start and end dates are nested within the consolidated periods listed at the top of each Pt.ID in the attached example. I can do this easily when there is only one consolidated period per Pt.ID, but there can be more than one (as shown for the second Pt.ID, 1002). In the example, events that fall within a consolidated period are flagged "Y" in the flag variable, and events that do not are flagged "N". How can I write a program that accounts for all consolidated periods per Pt.ID, compares them with the dates of that patient's other events, and flags the events that fall within any of those periods?
Thank you.
So join the event records with the period records and calculate whether the event is within the period. Then you could take the MAX over all periods.
For example here is code for your sample that creates a binary 1/0 flag variable called INCLUDED.
data Sample;
infile datalines missover;
input Pt_ID Event_ID Category $ Start_Date : mmddyy10.
Start_Day End_date : mmddyy10. End_day Duration
;
format Start_date End_date mmddyy10.;
datalines;
1001 . Moderate 8/5/2016 256 9/3/2016 285 30
1001 1 Moderate 3/8/2016 106 3/16/2016 114 9
1001 2 Moderate 8/5/2016 256 8/14/2016 265 10
1001 3 Moderate 8/21/2016 272 8/24/2016 275 4
1001 4 Moderate 8/23/2016 274 9/3/2016 285 12
1002 . Severe 11/28/2016 13 12/19/2016 34 22
1002 . Severe 2/6/2017 83 2/28/2017 105 23
1002 1 Severe 11/28/2016 13 12/5/2016 20 8
1002 2 Severe 12/12/2016 27 12/19/2016 34 8
1002 3 Severe 1/9/2017 55 1/12/2017 58 4
1002 4 Severe 2/6/2017 83 2/13/2017 90 8
1002 5 Severe 2/20/2017 97 2/28/2017 105 9
1002 6 Severe 3/17/2017 122 3/24/2017 129 8
1002 7 Severe 5/4/2017 170 5/13/2017 179 10
1002 8 Severe 5/24/2017 190 5/30/2017 196 7
1002 9 Severe 6/9/2017 206 6/13/2017 210 5
;
proc sql ;
create table want as
select a.*
, max(b.start_date <= a.start_date and b.end_date >= a.end_date ) as Included
from sample a
left join sample b
on a.pt_id = b.pt_id and missing(b.event_id)
group by 1,2,3,4,5,6,7,8
order by a.pt_id, a.event_id, a.start_date , a.end_date
;
quit;
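If a "Y"/"N" flag is wanted, as in the question, the 1/0 variable can be converted in a follow-up step. The sketch below assumes the want table produced above. The names want_flag and Flag are hypothetical, and dropping the period rows (those with a missing Event_ID) is an assumption about the desired output.

```sas
data want_flag;
  set want;
  where not missing(event_id);          /* keep event rows only (assumption) */
  length Flag $1;
  Flag = ifc(Included = 1, 'Y', 'N');   /* 1 -> Y, 0 -> N */
run;
```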

Incorrect SAS freq table output with ODS Excel

I am trying to export SAS PROC FREQ results to an Excel file (xlsx), using Enterprise Guide 7.12 with SAS 9.4 on Windows.
The following code example:
ODS EXCEL
file='C:\Download\example.xlsx'
STYLE=HtmlBlue
OPTIONS ( sheet_interval="none" sheet_name="Results" );
data example;
input ins_cd$ 1-2 decl_aatrim $ 4-8 prog $ 10-13 compt $ 15-18;
cards;
02 20153 7646 XC12
02 20153 7646 AB02
02 20153 7646 CC13
02 20153 9999
02 20153 7595 PS03
02 20153 7595 PS04
02 20153 6080 XC12
02 20153 6080 XC15
02 20153 6080 CC18
02 20153 6080 DC08
;
proc sort data=example;
by ins_cd decl_aatrim prog compt;
run;
data example2;
set example;
by ins_cd decl_aatrim prog compt;
if first.prog=1 then do;
test=first.prog;
rank=1;
retain rank 1;
end;
else rank=rank+1;
run;
proc freq data=example2;
tables prog*compt;
run;
ods EXCEL close;
outputs the freq table as expected in the results viewer, with four rows per prog, like so (truncated to limit the copy-paste, with the freq row labels roughly translated):
compt
AB02 CC13 CC18 [...]
prog
6080 Freq 0 0 1 1 0 0 1 1
Pct 0.00 0.00 11.11 11.11 0.00 0.00 11.11 11.11
row pct 0.00 0.00 25.00 25.00 0.00 0.00 25.00 25.00
col.pct 0.00 0.00 100.00 100.00 0.00 0.00 50.00 100.00
7595 Freq 0 0 0 0 0 [...]
[...]
But when the xlsx file produced by ODS is opened in Excel, the freq table looks like this:
prog compt
Freq
Pct
row pct
col.pct AB02 CC13 CC18 DC08 PS03 PS04 XC12 XC15 Total
6080 0 0 1 1
0.00 0 11.11 [...]
0.00 0.00 25.00
0.00 0.00 100.00
7595 0
0.00 [...]
The four cells with the freq calculations are merged into a single cell and row for each prog.
This http://support.sas.com/kb/32/115.html seems to be related to my problem, but the proposed crosslist solution does not give the wanted output in Excel either.
Any ideas? Thanks!
This is caused by how PROC FREQ works, and the ODS HTML solution (what you refer to as the results viewer) is no different. Notice that it has:
<td class="r t stacked_cell data"><table width="100%" border="0" cellpadding="7" cellspacing="0">
<tr>
<td class="r t data top_stacked_value">1</td>
</tr>
<tr>
<td class="r t data bottom_stacked_value">11.11</td>
</tr>
</table></td>
inside each cell. So each main table cell contains a mini-table holding the freq/rowpct/colpct/totalpct values (or, in the snippet above, the two elements on a bottom header).
You can solve this a number of ways. One option is, as Reeza notes in another answer, to use PROC TABULATE.
Another option would be to write your own table template via PROC TEMPLATE. That is how PROC FREQ's crosstab is done, after all, so you could look at how they did it and perhaps change it.
A third option would be to postprocess this output; since the resulting table has all of the data you want, just not in rows, you could easily write a VBA routine to change the format to the desired one.
If you can, use PROC TABULATE instead. You have more control over your table and its appearance anyway.
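A minimal PROC TABULATE sketch for the same crosstab could look like this. It assumes the example2 dataset from the question; the output file name example_tabulate.xlsx is hypothetical. Each statistic is requested as its own row under prog, so each one should land in its own row and cell in Excel rather than in a stacked cell.

```sas
ods excel file='C:\Download\example_tabulate.xlsx'   /* hypothetical file name */
    style=HtmlBlue
    options(sheet_interval="none" sheet_name="Results");

proc tabulate data=example2;
  class prog compt;
  /* Rows: each prog crossed with four statistics.
     Columns: each compt value plus a total column. */
  table prog * (n pctn rowpctn colpctn), compt all;
run;

ods excel close;
```

As in PROC FREQ, rows with a missing compt are left out by default; adding the MISSING option to the PROC TABULATE statement would include them.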

Counting how many times a condition succeeds in SAS

I have a table A with real time values in it.
Amount Count Pct1 Pct2
300 2 0.000 100.000
1,891 2 0.001 100.000
500 2 0.000 100.000
100 2 0.000 100.000
1,350 2 0.001 100.000
2,648 2 0.001 100.000
2,255 2 0.001 100.000
500 2 0.000 100.000
200 2 0.000 30.441
10 2 0.000 100.000
1,928 2 0.001 100.000
40 2 0.000 100.000
200 2 0.000 100.000
256 2 0.000 100.000
254 2 0.000 100.000
100 2 0.001 100.000
50 1 0.000 33.333
1,512 2 0.001 100.000
I have a table B with a set of conditions, and I want to generate the condition success count in SAS. That is, if I pass row 1 of the table below as a condition to table A, it succeeds 2 times. I am currently using a join that generates a Cartesian product, which is not efficient. I want an efficient way to solve this problem (similar to what the COUNTIFS function does in Excel). Thanks a lot for your help.
Amount Count Pct1 Pct2 Condition Success Count
1,576 2 0 100 4
1,537 2 0 100 4
1,484 2 0 100 5
1,405 2 0 100 5
1,290 2 0 100 6
1,095 2 0 100 6
948 2 0 100 6
932 2 0 100 6
914 2 0 100 6
887 2 0 100 6
850 2 0 100 6
774 2 0 100 6
707 2 0 100 6
704 2 0 100 6
695 2 0 100 6
646 2 0 100 6
50 1 0 5.42 16
50 1 0 5.42 16
You said that you tried a join that makes a Cartesian product. However, since you didn't post any code, I am not sure whether you built the full product first and then counted the rows. Doing the counting in one SQL statement is much faster, since the full Cartesian product is never actually written anywhere. Like this:
proc sql;
create table tableC as
select c.*, coalesce(s,0) as SuccessCount from TableB c
left join (
select id, count(*) as s from TableA a,TableB b
where
a.amount >= b.amount and
a.count >= b.count and
a.pct1 >= b.pct1 and
a.pct2 >= b.pct2
group by id
) as d
on c.id = d.id
;
quit;
Note that TableB needs to have a unique id column. You should always have some column to use as an id, but if you don't have one already, simply create it like this, for example:
data tableB;
set tableB;
id = _N_;
run;
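To sanity-check the counting logic, a single condition can be tested on its own. The sketch below hard-codes the values from the last row of table B (Amount 50, Count 1, Pct1 0, Pct2 5.42) and assumes TableA holds the sample rows from the question with numeric columns.

```sas
/* Count the rows of TableA that meet one condition from TableB. */
proc sql;
  select count(*) as SuccessCount
  from TableA a
  where a.amount >= 50
    and a.count  >= 1
    and a.pct1   >= 0
    and a.pct2   >= 5.42;
quit;
```

With the sample rows shown for table A, this should agree with the success count of 16 listed for that row in table B.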

SAS SGPlot scatterplot options--Grouping by attribute

I have two questions concerning SGPLOT in SAS. I have a large dataset and am trying to create a scatterplot that highlights differences between brands of similar products. I am able to get different colors for each brand, but for some reason the symbols will not show up. I had to close ODS LISTING and ODS HTML because I was getting this error: "ERROR: Cannot write image to filename.png. Please ensure that proper disk permissions are set." I'm not sure this can be fixed. Is there another way to get the symbols?
Also, I have a 95% prediction ellipse, but am wondering if there is a way to have an ellipse for each brand. Thanks.
data example;
length product brand $20;
input product $ brand $ item price trans;
cards;
sconces AllenRoth 1 1.5 300
sconces AllenRoth 2 2.75 350
sconces AllenRoth 3 1.75 300
sconces AllenRoth 4 0.75 400
sconces GardenTreasures 1 3 200
sconces GardenTreasures 2 3.25 175
sconces GardenTreasures 3 2.75 100
sconces GardenTreasures 4 3.5 100
sconces GardenTreasures 5 4 150
sconces OtherBrand 1 0.5 850
sconces OtherBrand 2 0.45 875
sconces OtherBrand 3 0.75 900
sconces OtherBrand 4 1 650
sconces OtherBrand 5 0.75 700
sconces BrandX 1 1 200
sconces BrandX 2 1.25 500
sconces BrandX 3 1.2 400
sconces BrandX 4 0.95 375
sconces BrandX 5 1 300
sconces BrandX 6 1 200
sconces BrandX 7 1.35 400
sconces BrandX 8 1.5 350
curtains AllenRoth 1 10 200
curtains AllenRoth 2 12 250
curtains AllenRoth 3 11.5 200
curtains AllenRoth 4 10 400
curtains AllenRoth 5 17 500
curtains AllenRoth 6 15 100
curtains AllenRoth 7 29 50
curtains AllenRoth 8 50 12
curtains GardenTreasures 1 80 150
curtains GardenTreasures 2 60 75
curtains GardenTreasures 3 100 50
curtains BrandX 1 9 300
curtains BrandX 2 12 350
curtains BrandX 3 10 275
curtains BrandX 4 7.5 400
curtains BrandX 5 12 200
curtains BrandX 6 8.5 500
;
run;
proc format;
value legfmt
1 = 'legend value 1'
2 = 'legend value 2'
3 = 'legend value 3'
4 = 'legend value 4';
run;
proc sort data= example;
by product brand;
run;
ods listing close;
ods html close;
proc sgplot data= example ;
title1 "plotting trans by price";
footnote1 "final";
by product;
scatter x= trans y= price / datalabel= item group= brand name= "scp";
ellipse x=trans y=price;
xaxis label= "Number of Transactions";
yaxis label= "Average Selling Price";
keylegend "scp" / noborder across= 1 down= 4 location= outside position= topright
title= "Legend";
run;
ods graphics off;
ods listing close;
Likely your gpath is not set properly. This works for me, for example:
ods html path='c:\temp' file='test.html' gpath='c:\temp\' style=htmlblue;
proc sgplot data= example ;
title1 "plotting trans by price";
footnote1 "final";
by product;
scatter x= trans y= price / datalabel= item group= brand name= "scp";
ellipse x=trans y=price;
xaxis label= "Number of Transactions";
yaxis label= "Average Selling Price";
keylegend "scp" / noborder across= 1 down= 4 location= outside position= topright
title= "Legend";
run;
ods html close;
If you don't set GPATH to a path you can write to, it may default to a location where you don't have write access, especially on a server installation. PATH and GPATH can be set to the same path or to different paths.
I don't believe you can have an ellipse for each brand, largely because it would look terrible. Having four prediction ellipses, even with different colors, would be very difficult to visually distinguish. A different chart type may be appropriate if you are trying to show that (perhaps a bar graph for example with brand as the bar type and selling price buckets as a group variable).