I have a dataset that looks like the following:
pt_fin  Admit_Type  MONTH_YEAR  BED_ORDERED_TO_DISPO (minutes)
1       Acute       Jan         214
2       Acute       Jan         628
3       ICU         Jan         300
4       ICU         Feb         99
I already have code (see below) that produces a plot with an x axis (admit type grouped by month) and a y axis (median bed-to-dispo time), but I want to add a secondary Y axis that shows the number of patients used to compute each median.
For example, I want a secondary Y axis data point for each month and admit type, so for Jan there would be two separate counts: 1) the patients admitted to Acute and 2) the patients admitted to ICU.
proc sgplot data=Combined;
title "Median Bed Order To Dispo By Month, Admit Location";
vbar MONTH_YEAR / response=BED_ORDERED_TO_DISPO stat=median
group = Admit_Type groupdisplay=cluster ;
run;
I've been trying to adapt what I've found here but the plots my code produces are super messy and incorrect.
https://blogs.sas.com/content/iml/2019/01/14/align-y-y2-axes-sgplot.html
Desired output (pretend the X's and *'s, respectively, are connected in a line graph corresponding to the Y axis):
| * |
m | | | X | | #
e | x | | * |
d | | | | | |
|-------------------------------|
Acute ICU Acute ICU
Jan Feb
Code I've tried, which produces rubbish:
proc sgplot data=Combined;
vbarbasic MONTH_YEAR/ response=Bed_Order_Hour y2axis; /*needs to be on y axis 1*/
group = Admit_Type
series x=MONTH_YEAR y=Pt_fin/ markers; *Pt_fin needs to be on y axis 2*/
run;
Your visualization explanation is weak. You might want to use two plotting statements in your SGPLOT, VBAR and VLINE.
data have;
do type = 'Acute', 'ICU';
do month = '01jan2018'd to '31dec2018'd;
do _n_ = 1 to floor (50 * ranuni(123));
patid + 1;
minutes = 10 + floor(1000 * ranuni(123));
output;
end;
month = intnx ('month', month, 0, 'e');
end;
end;
format month monname3.;
run;
ods html5 file="plot.html" path="c:\temp";
proc sgplot data=have;
title "Median of patient minutes by month";
vbar month / group=type groupdisplay=cluster response=minutes stat=median;
vline month / group=type groupdisplay=cluster response=minutes stat=freq y2axis ;
run;
ods html5 close;
The VLINE gives the viewer a secondary focus: the number of patients (frequency) behind each median. The same information could instead be conveyed by varying the intensity of the VBAR fill: bars whose medians are based on the most patients would get the strongest shade, and bars based on fewer patients would be faded.
I am currently trying to use PROC SGPLOT in SAS to create a series plot with five lines (8th grade, 10th grade, 12th grade, College Students, and Young Adults). The yaxis is a percentage of prevalence in drug use ranging from 0-100. The xaxis is the year 1975-2019, but formatted (using proc format) so that it shows the value of year as '75-'19. I would like to label each line using its respective group (8th grade - Young Adult). But when I use:
proc sgplot data = save.fig2_1data noautolegend ;
series x=year y=eighth / lineattrs=(color=orange) curvelabel='8th Grade' curvelabelpos=start ;
series x=year y=tenth / lineattrs=(color=green) curvelabel='10th Grade' curvelabelpos=start ;
series x=year y=twelfth / lineattrs=(color=blue) curvelabel='12th Grade' curvelabelpos=start;
series x=year y=college / lineattrs=(color=red) curvelabel='College Students' curvelabelpos=start;
series x=year y=youngadult / lineattrs=(color=purple) curvelabel='Young Adults' curvelabelpos=start ;
xaxis label="YEAR" values=(1975 to 2019 by 2) minor;
yaxis label="PERCENT" max=100 min=0 ;
format year yr. ; run ;
[Figure: series plot produced by the code above]
The "curvelabelpos=" does not give the option to place my label above the first data point of "12th Grade" and "College Students" so that my xaxis does not have all of the space on the left side of the plot. How do I move these two labels above the first data point of each line so that the xaxis does not have empty space?
There are no series statement options that will produce the labeling you want.
You will have to create an annotation data set for the sgplot.
In this sample code the curvelabel= option was set to '' so the procedure generates a series line that uses the widest amount of horizontal drawing space. The sganno data set contains the annotation functions that will draw your own curvelabel text near the first data point of the series with the blank curvelabel. Adjust the %sgtext anchor= value as needed. Be sure to read the SG Annotation Macro Dictionary documentation to understand all the text annotation capabilities.
For the case of wanting an artificial split in the series lines there are two things to try:
1. Introduce a fake year, 2012.5, for which none of the series variables have a value. I tried this, but only 1 of the 5 series drew with a 'fake' split.
2. Introduce N new variables for the N lines needing a split. For the post-split time frame, copy the data into the new variables and set the originals to missing. Then add SERIES statements for the new variables.
data have;
call streaminit(1234);
do year = 1975 to 2019;
array response eighth tenth twelfth college youngadult;
if year >= 1991 then do;
eighth = round (10 + rand('uniform',10), .1);
tenth = eighth + round (5 + rand('uniform',5), .1);
twelfth = tenth + round (5 + rand('uniform',5), .1);
if year in (1998:2001) then tenth = .;
end;
else do;
twelfth = 20 + round (10 + rand('uniform',25), .1);
end;
if year >= 1985 then do;
youngadult = 25 + round (5 + rand('uniform',20), .1);
end;
if year >= 1980 then do;
college = 35 + round (7 + rand('uniform',25), .1);
end;
if year >= 2013 then do _n_ = 1 to dim(response);
%* simulate inflated response level;
if response[_n_] then response[_n_] = 1.35 * response[_n_];
end;
output;
end;
run;
data have_split;
set have;
array response eighth tenth twelfth college youngadult;
array response2 eighth2 tenth2 twelfth2 college2 youngadult2;
if year >= 2013 then do _n_ = 1 to dim(response);
response2[_n_] = response[_n_];
response [_n_] = .;
end;
run;
ods graphics on;
ods html;
%sganno;
data sganno;
%* these variables are used to track '1st' or 'start' point
%* of series being annotated
;
retain y12 ycl;
set have;
if missing(y12) and not missing(twelfth) then do;
y12=twelfth;
%sgtext(label="12th Grade", textcolor="blue", drawspace="datavalue", anchor="top", x1=year, y1=y12, width=100, widthunit='pixel')
end;
if missing(ycl) and not missing(college) then do;
ycl=college;
%sgtext(label="College Students", textcolor="red", drawspace="datavalue", anchor="bottom", x1=year, y1=ycl, width=100, widthunit='pixel')
end;
run;
proc sgplot data=have_split noautolegend sganno=sganno;
series x=year y=eighth / lineattrs=(color=orange) curvelabel='8th Grade' curvelabelpos=start;*auto curvelabelloc=outside ;
series x=year y=tenth / lineattrs=(color=green) curvelabel='10th Grade' curvelabelpos=start;*auto curvelabelloc=outside ;
series x=year y=twelfth / lineattrs=(color=blue) curvelabel='' curvelabelpos=start;*auto curvelabelloc=outside ;
series x=year y=college / lineattrs=(color=red) curvelabel='' curvelabelpos=start;*auto curvelabelloc=outside ;
series x=year y=youngadult / lineattrs=(color=purple) curvelabel='Young Adults' curvelabelpos=start;*auto curvelabelloc=outside ;
* series for the 'shifted' time period use the new variables;
series x=year y=eighth2 / lineattrs=(color=orange) ;
series x=year y=tenth2 / lineattrs=(color=green) ;
series x=year y=twelfth2 / lineattrs=(color=blue) ;
series x=year y=college2 / lineattrs=(color=red) ;
series x=year y=youngadult2 / lineattrs=(color=purple) ;
xaxis label="YEAR" values=(1975 to 2019 by 2) minor;
yaxis label="PERCENT" max=100 min=0 ;
run ;
ods html close;
ods html;
Richard's answered what you explicitly want, but I think what you want isn't ideal from a graphical standpoint - and that's why SAS won't do it for you.
Labelling over a line is hard to read, especially when you use the same color as the line. Labelling outside the chart is much cleaner, as is placing the labels in a keylegend.
In this case, I would use CURVELABELLOC=OUTSIDE, and either use CURVELABELPOS=MAX (default, which places them to the right of the chart), or CURVELABELPOS=MIN, which places them nearer the start as you prefer but also overlays the axis (which is not as clean-looking).
See the example below. It is highly legible: the curve labels sit where the eye naturally travels, and they don't alter the size of the axis. Putting them at the right also means they're in the same spot for all of the lines, which is cleaner than placing them at the starts of the lines, which are staggered.
data fig2_1data;
call streaminit(7);
tenth = 0.5;
twelfth= 0.6;
do year=1975 to 2019;
if year eq 1987 then eighth=0.4;
eighth = rand('Uniform',0.2)-0.1 + eighth;
tenth = rand('Uniform',0.2)-0.1 + tenth;
twelfth = rand('Uniform',0.2)-0.1 + twelfth;
output;
end;
run;
proc sgplot data = fig2_1data noautolegend ;
series x=year y=eighth / lineattrs=(color=orange)
curvelabel='8th Grade' curvelabelpos=max curvelabelloc=outside;
series x=year y=tenth / lineattrs=(color=green)
curvelabel='10th Grade' curvelabelpos=max curvelabelloc=outside;
series x=year y=twelfth / lineattrs=(color=blue)
curvelabel='12th Grade' curvelabelpos=max curvelabelloc=outside;
xaxis label="YEAR" values=(1975 to 2019 by 2) minor;
yaxis label="PERCENT" max=1 min=0 ;
format year yr. ; run ;
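The other option mentioned above, placing the labels in a keylegend, could look roughly like the sketch below. The data set `demo_lines`, its simulated values, and the plot names `g8`, `g10` and `g12` are made up for illustration; `LEGENDLABEL=`, `NAME=` and the `KEYLEGEND` statement are standard SGPLOT features. The legend is drawn inside the plot area here so the axis keeps its full width, which is a design choice rather than a requirement.

```sas
/* Hypothetical demo data: three series with simulated values */
data demo_lines;
  call streaminit(7);
  do year = 1975 to 2019;
    eighth  = 0.30 + 0.10 * rand('uniform');
    tenth   = 0.45 + 0.10 * rand('uniform');
    twelfth = 0.60 + 0.10 * rand('uniform');
    output;
  end;
run;

proc sgplot data=demo_lines;
  /* name= identifies each plot so KEYLEGEND can refer to it */
  series x=year y=eighth  / lineattrs=(color=orange) legendlabel='8th Grade'  name='g8';
  series x=year y=tenth   / lineattrs=(color=green)  legendlabel='10th Grade' name='g10';
  series x=year y=twelfth / lineattrs=(color=blue)   legendlabel='12th Grade' name='g12';
  /* one-column legend inside the plot area, upper right corner */
  keylegend 'g8' 'g10' 'g12' / location=inside position=topright across=1;
  xaxis label="YEAR" values=(1975 to 2019 by 2) minor;
  yaxis label="PERCENT" max=1 min=0;
run;
```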
My data are structured as follows (these are just sample data as the original data are secret)
id | crime | location | crimedate
------------------------------
1 | Theft | public | 2019-01-04
1 | Theft | public | 2019-02-06
1 | Theft | public | 2019-02-20
1 | Theft | private | 2019-03-10
1 | Theft | private | 2019-03-21
1 | Theft | public | 2019-03-01
1 | Theft | private | 2019-03-14
1 | Theft | public | 2019-06-15
1 | Murder | private | 2019-01-04
1 | Murder | private | 2019-10-20
1 | Murder | private | 2019-11-18
1 | Murder | private | 2019-01-01
1 | Assault | private | 2019-03-19
1 | Assault | private | 2019-01-21
1 | Assault | public | 2019-04-11
1 | Assault | public | 2019-01-10
… | … | … | …
My goal is to create a line plot (time series plot) showing how the numbers of the three crimes have changed over the year. On the x-axis I would like to show the months (1-12) and on the y-axis the number of crimes in each month. There should be two lines (one for each location).
I started with this code:
DATA new;
SET old;
month=month(datepart(crimedate));
RUN;
PROC sgplot DATA=new;
series x=month y=no_of_crimes / group=location;
run;
But I have no idea how to aggregate the number of crimes per month. Could anyone please give me a hint? I have been looking on the internet for a solution, but the examples usually use data that are already aggregated.
The SG routines will aggregate Y axis values for a VBAR or HBAR statement. To display the same aggregate information with a SERIES statement, you have to compute the aggregates beforehand, which is easily done with Proc SUMMARY.
Additionally, to plot the counts for each crime in a separate visual, you would want a BY CRIME statement, or Proc SGPANEL with PANELBY crime.
The crime datetime value does not have to be converted down to a date value; you can apply an appropriate datetime format in the procedures, and they will aggregate automatically based on the formatted value.
Example with some simulated crime data:
data have;
do precinct = 1 to 10;
do date = '01jan2018'd to '31dec2018'd;
do seq = 1 to 20*ranuni(123);
length crime $10 location $8;
crime = scan('theft,assault,robbery,dnd', ceil(4*ranuni(123)));
location = scan ('public,private', ceil(2*ranuni(123)));
crime_dt = dhms(date,0,0,floor('24:00't*ranuni(123)));
output;
end;
end;
end;
drop date;
format crime_dt datetime19.;
run;
* shorter graphs for SO answer;
ods graphics / height=300px;
proc sgplot data=have;
title "VBAR all crimes combined by location";
vbar crime_dt
/ group=location
groupdisplay=cluster
;
format crime_dt dtmonyy7.;
run;
proc sgpanel data=have;
title "VBAR crime * location";
panelby crime;
vbar crime_dt
/ group=location
groupdisplay=cluster
;
format crime_dt dtmonyy7.;
run;
proc summary data=have noprint;
class crime_dt crime location;
format crime_dt dtmonyy7.;
output out=freqs;
run;
proc sgplot data=freqs;
title "SERIES all crimes,summary _FREQ_ * location";
where _type_ = 5;
series x=crime_dt y=_freq_ / group=location;
xaxis type=discrete;
run;
proc sgpanel data=freqs;
title "SERIES all crimes,summary _FREQ_ * crime * location";
where _type_ = 7;
panelby crime;
series x=crime_dt y=_freq_ / group=location;
rowaxis min=0;
colaxis type=discrete;
run;
If you want to group by location without definition by type of crime:
proc sql noprint;
create table new as
select id,location
, month(crimedate) as month,count(crime) as crime_n
from old
group by id,location,CALCULATED month;
quit;
proc sgplot data=new;
series x=month y=crime_n /group=location;
run;
To show different series by type of crime you could use sgpanel:
proc sql noprint;
create table new as
select id,crime,location, month(crimedate) as month,count(crime) as crime_n
from old
group by id,crime,location,CALCULATED month;
quit;
proc sgpanel DATA=new;
panelby location;
series x=month y=crime_n /group=crime;
run;
One more variant of presenting this data:
proc sql noprint;
create table new as
select id,crime,location, month(crimedate) as month,count(crime) as crime_n
from old
group by id,crime,location,CALCULATED month;
quit;
proc sgpanel DATA=new;
panelby crime;
series x=month y=crime_n /group=location GROUPDISPLAY=cluster;
run;
Of course, you can customize these plots however you want.
To perhaps answer the question more directly: the VLINE and HLINE statements summarize the data for you, much like running a summary procedure (such as PROC FREQ) and then PROC SGPLOT with SERIES.
Using Richard's test data, you'll see this is identical to the plot his PROC SUMMARY -> SERIES approach gives:
data have;
do precinct = 1 to 10;
do date = '01jan2018'd to '31dec2018'd;
do seq = 1 to 20*ranuni(123);
length crime $10 location $8;
crime = scan('theft,assault,robbery,dnd', ceil(4*ranuni(123)));
location = scan ('public,private', ceil(2*ranuni(123)));
crime_dt = dhms(date,0,0,floor('24:00't*ranuni(123)));
output;
end;
end;
end;
drop date;
format crime_dt datetime19.;
run;
proc sgplot data=have;
vline crime_dt/group=location groupdisplay=cluster;
format crime_dt dtmonyy7.;
run;
I want to create a bar chart on yearly death count (based on gender). I want to plot gender and year on x axis and count on Y axis. Can you kindly help how to modify the below code?
TITLE 'DEATH GRAPH BY GENDER';
PROC SGPLOT DATA = DREPORT;
VBAR deathcount / GROUP = gender GROUPDISPLAY = CLUSTER;
RUN;
I am not able to put deathyear in the Y axis. Kindly frame the code.
The VBAR variable supplies the midpoint values shown on the horizontal axis.
Are you sure that is what you want?
Do you really want to know how many times a given death count occurred over all the years?
You probably want deathcount as the response.
Consider this example:
data have_raw;
do id = 1 to 1000;
gender = substr('MF',1 + 2 * ranuni(123),1);
year = 2019 - floor (30 * ranuni(123));
output;
end;
run;
proc sql;
create table have as
select year, gender, count(*) as deathcount
from have_raw
group by year, gender
;
quit;
proc sgplot data=have;
vbar gender
/ response=deathcount
group=year
groupdisplay=cluster
;
run;
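If the goal is the layout described in the question (year on the x axis, one bar per gender within each year), the category and group roles can be swapped. The following is a minimal sketch; the data set `deaths_by_year` and its counts are invented purely for illustration.

```sas
/* Hypothetical pre-aggregated data: one row per year and gender */
data deaths_by_year;
  input year gender $ deathcount;
  datalines;
2017 F 12
2017 M 15
2018 F 10
2018 M 18
2019 F 14
2019 M 11
;
run;

proc sgplot data=deaths_by_year;
  /* year is the category (x axis); gender bars are clustered within each year */
  vbar year / response=deathcount group=gender groupdisplay=cluster;
  xaxis label="Year";
  yaxis label="Death count";
run;
```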
I have a dataset called stores. I want to extract total sales (Retail_Price), proportion of sales, and cumulative proportion of sales for each store in SAS.
Sample dataset: Stores
Date        Store_Postcode  Retail_Price  month  Distance
08/31/2013  CR7 8LE         470           8      7057.8
10/26/2013  CR7 8LE         640           10     7057.8
08/19/2013  CR7 8LE         500           8      7057.8
08/17/2013  E2 0RY          365           8      1702.2
09/22/2013  W4 3PH          395.5         12     2522
06/19/2013  W4 3PH          360.5         6      1280.9
11/15/2013  W10 6HQ         475           12     3213.5
06/20/2013  W10 6HQ         500           1      3213.5
09/18/2013  E7 8NW          315           9      2154.8
10/23/2013  E7 8NW          570           10     5777.9
11/18/2013  W10 6HQ         455           11     3213.5
08/21/2013  W10 6HQ         530           8      3213.5
Code I tried:
Proc sql;
Create table work.Top_sellers as
Select Store_postcode as Stores,SUM(Retail_price) as Total_Sales,Round((Retail_price/Sum(Retail_price)),0.01) as
Proportion_of_sales
From work.stores
Group by Store_postcode
Order by total_sales;
Quit;
I've no idea on how to calculate cumulative variable in proc sql...
Please help me improve my code!!
Computing a cumulative result in SQL requires the data to have an explicit unique ordered key and the query involves a reflexive join with 'triangular' criteria for the cumulative aspect.
data have;
do id = 100 to 120;
sales = ceil (10 + 25 * ranuni(123));
output;
end;
run;
proc sql;
create table want as
select
have1.id
, have1.sales
, sum(have2.sales) as sales_cusum
from
have as have1
join
have as have2
on
have1.id >= have2.id /* 'triangle' criteria */
group by
have1.id, have1.sales
order by
have1.id
;
quit;
A second way is to recompute the cusum row by row with a correlated subquery:
proc sql;
create table want as
select have.id, have.sales,
( select sum(inner.sales)
from (select * from have) as inner
where inner.id <= have.id
)
as cusum
from
have;
quit;
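Applied to the store data in the question, one possible route is to let PROC SQL compute the per-store total and proportion and then accumulate the proportion in a DATA step, which avoids the triangular join. This is only a sketch under stated assumptions: the table names `stores`, `store_totals` and `top_sellers` are illustrative, only the two relevant columns of the sample rows are read, and sorting by descending total is a choice made here (the original attempt sorted ascending).

```sas
/* Sample rows from the question (postcode and price only) */
data stores;
  infile datalines dsd;
  input Store_Postcode :$8. Retail_Price;
  datalines;
CR7 8LE,470
CR7 8LE,640
CR7 8LE,500
E2 0RY,365
W4 3PH,395.5
W4 3PH,360.5
W10 6HQ,475
W10 6HQ,500
E7 8NW,315
E7 8NW,570
W10 6HQ,455
W10 6HQ,530
;
run;

proc sql;
  create table store_totals as
  select Store_Postcode as Store,
         sum(Retail_Price) as Total_Sales,
         /* divide each store total by the grand total */
         calculated Total_Sales / (select sum(Retail_Price) from stores)
           as Proportion_of_Sales
  from stores
  group by Store_Postcode
  order by Total_Sales desc;
quit;

data top_sellers;
  set store_totals;
  /* sum statement: the variable is retained and accumulates row by row */
  Cumulative_Proportion + Proportion_of_Sales;
run;
```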
I changed my mind; CDF is a different calculation.
Here's how to do this via a data step. First calculate the cumulative totals (I used a data step here, but you could use PROC EXPAND if you have SAS/ETS).
*sort demo data;
proc sort data=sashelp.shoes out=shoes;
by region sales;
run;
data cTotal last (keep = region cTotal);
set shoes;
by region;
*calculate running total;
if first.region then cTotal=0;
cTotal = cTotal + sales;
*output every record to cTotal, but only the last record per region (which holds the total) to Last;
if last.region then output last;
output cTotal;
retain cTotal;
run;
*merge in results and calculate percentages;
data calcs;
merge cTotal Last (rename=cTotal=Total);
by region;
percent = cTotal/Total;
run;
If you need a more efficient solution, I'd try a DOW-loop solution.
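For reference, a double DOW loop (the community name for wrapping SET in an explicit DO loop) might be sketched as follows. It reuses the sorted `sashelp.shoes` demo data from above; the output name `calcs_dow` is an assumption made for this sketch. The first loop reads one region to get its total, and the second loop re-reads the same rows to build the running total and percentage, so no separate merge step is needed.

```sas
proc sort data=sashelp.shoes out=shoes;
  by region sales;
run;

data calcs_dow;
  /* Loop 1: read every row of the current region and accumulate its total. */
  /* Total is not retained, so it starts as missing for each new region.    */
  do _n_ = 1 by 1 until (last.region);
    set shoes;
    by region;
    Total = sum(Total, sales);
  end;

  /* Loop 2: re-read the same _n_ rows and compute the running values. */
  cTotal = 0;
  do _n_ = 1 to _n_;
    set shoes;
    cTotal  = sum(cTotal, sales);
    percent = cTotal / Total;
    output;
  end;
run;
```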
There is an example "Tornado Diagram" here. I am trying to modify that code. Here is my modified version:
%let name=ex_17;
goptions reset=(global goptions);
GOPTIONS DEVICE=png xpixels=800 ypixels=600;
goptions gunit=pct border cback=lightgray colors=(blacks) ctext=black
htitle=6.5 htext=3 ftitle="albany amt" ftext="albany amt";
data mileage;
input factor $ level $ value;
datalines;
Screening M 7199
Diagnosis F 4502
Biopsy M 12304
Treatment F 5428
Recovery M 15701
Metastasis F 6915
;
data convert;
set mileage;
if level='F' then value=-value;
run;
proc format;
picture posval low-high='000,009';
run;
data anlabels(drop=factor level value);
length text $ 24;
retain function 'label' when 'a' xsys ysys '2' hsys '3' size 2;
set convert;
midpoint=factor; subgroup=level;
text=left(put(value, posval.));
if level ='F' then position='>';
else position='<'; output;
run;
title1 'One-Way Sensitivity Analysis on NNS to Gain 1 QALY';
*axis1 label=(justify=left 'Disutility') style=0 color=black;
axis1 label=(justify=left '') style=0 color=black;
axis2 label=none value=(tick=3 '') minor=none major=none
width=3 order=(-10000 to 20000 by 10000) color=black;
pattern1 value=solid color=green;
pattern2 value=solid color=blue;
proc gchart data=convert;
format value posval.;
note move=(25,80) height=3 'Women' move=(+10,+0) 'Men';
hbar factor / sumvar=value discrete nostat subgroup=level
maxis=axis1 raxis=axis2 nolegend annotate=anlabels
coutline=same des='';
run;
quit;
However, as you can see by running this code, the labels for each bar are cut off, not fully visible. Also, some halves of the bars aren't visible.
What am I doing to make these things not visible, and how can I fix this?
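Your axis labels are getting cut off in the input dataset: with list input, a character variable read with $ and no declared length defaults to 8 characters, so values such as Metastasis are truncated. Declare a longer length before the INPUT statement: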
data mileage;
length factor $20;
input factor $ level $ value;
datalines;
Screening M 7199
Diagnosis F 4502
Biopsy M 12304
Treatment F 5428
Recovery M 15701
Metastasis F 6915
;
run;
As far as "some halves are not visible", what halves aren't visible? You only have either M or F for each factor, so you aren't going to get two bars on each factor. You're getting all of the bars you're asking for, or at least I see all of them (6 bars, some on left some on right).