I know how to add labels to reference lines, but how can I add labels between reference lines?
Here is my code right now:
proc sgplot data=biosc.Summary;
series x=day y=Mean / group=Treat;
scatter x=day y=Mean / group=Treat
yerrorlower=LowerSD yerrorupper=UpperSD;
where day in (3:10);
xaxistable N / location=inside class=Treat colorgroup=Treat
title="Number of Patients Participating by Treatment Day"
valueattrs=(size=10) labelattrs=(size=10);
yaxis label='Mean +/- SD';
xaxis label='Study Day' values=(3 4 5 6 7 8 9 10);
refline 4 8 / axis=x;
run;
And here is my graph:
What I want to do is have a label "Phase 1" to the left of the reference line at 4, "Phase 2" between the two reference lines, and "Phase 3" to the right of the reference line at 8.
How can I do this?
Add them using the LABEL option on the REFLINE statement.
LABEL <=variable> | <=("text-string-1" ... "text-string-n")>
creates labels for each reference line. If you do not specify a label value, the reference value for that line is used as the label.
If you specify a label value, the following options are available.
You can use labelloc and labelpos to help control the location.
refline 4 8 / axis=x label = ("Phase 1" "Phase 2") labelloc=inside;
If you cannot get them placed exactly where you want, you can specify an X/Y location and use a TEXT statement instead. This does require you to add the data to your graphing data set though.
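data refLabels;
infile cards dlm=',';
input label_x label_y label_text $;
cards;
4, 45, Phase 1
5, 45, Phase 2
;;;;
data biosc.summary1;
set biosc.summary refLabels;
run;
Then add a TEXT statement to your SGPLOT and point the procedure at the combined data set. Make sure the WHERE statement does not filter out the label rows, since their day value is missing.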
text x = label_x y = label_y text = label_text;
I'm new to SAS and would like to get help with the question as follows:
1: Sample table shows as below
Time Color Food label
2020 red Apple A
2019 red Orange A,B
2018 blue Apple A,B
2017 blue Orange B
Logic to return label is:
when color = 'red' then 'A'
when color = 'blue' then 'B'
when food = 'orange' then 'B'
when food = 'apple' then 'A'
Since row 2 has both red and orange, its label should contain both: 'A,B'. The same goes for row 3.
The requirement is to print the label for each combination. I know that we can use a CASE WHEN statement to define the label based on color and food. Here we only have 2 colors and 2 foods, but with, say, 7 colors and 10 foods there would be 7*10 combinations, and I don't want to list all of them in a CASE WHEN statement.
Is there a convenient way to return the label? Thanks for any ideas! (I'd prefer to do it in PROC SQL, but other SAS approaches are also welcome.)
This looks like a simple application of formats. So define a format that converts COLOR to a code letter and a second one that converts FOOD to a code letter.
proc format ;
value $color 'red'='A' 'blue'='B';
value $food 'Apple'='A' 'Orange'='B' ;
run;
Then use those to convert the actual values of the COLOR and FOOD variables into the labels. Either in a data step:
data want;
set have ;
length label $5 ;
label=catx(',',put(color,$color.),put(food,$food.));
run;
Or an SQL query:
proc sql ;
create table want as
select *
, catx(',',put(color,$color.),put(food,$food.)) as label length=5
from have
;
quit;
You do not need to re-create the format if the data changes, only if the list of possible values changes.
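If the list of colors and foods is long, typing the VALUE statements by hand gets tedious. As a rough sketch, the formats could instead be loaded from a lookup table with the CNTLIN= option of PROC FORMAT. The data set name color_map and its rows are made up for illustration; FMTNAME, TYPE, START and LABEL are the variable names PROC FORMAT expects in a CNTLIN data set.

```sas
/* Hypothetical lookup table: one row per color and its code letter */
data color_map;
  length fmtname $8 type $1 start $10 label $1;
  retain fmtname 'color' type 'C';   /* type 'C' = character format, i.e. $color. */
  input start $ label $;
  datalines;
red A
blue B
;
run;

/* Build the $color. format from the lookup table */
proc format cntlin=color_map;
run;

/* Quick check: writes code=A to the log */
data _null_;
  code = put('red', $color.);
  put code=;
run;
```

A second lookup table for the foods would be built the same way, so adding a new color or food only means adding a row to the table.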
EDIT!!!! GO TO BOTTOM FOR BETTER REPRODUCIBLE CODE!
I have a data set with a quantitative variable that's missing 65 values that I need to impute. I used the ODS output and proc glm to simultaneously fit a model for this variable and predict values:
ODS output
predictedvalues=pred_val;
proc glm data=Six_min_miss;
class nyha_4_enroll;
model SIX_MIN_WALK_z= nyha_4_enroll kccq12sf_both_base /p solution;
run;
ODS output close;
However, I am missing 21 predicted values because 21 of my observations are missing either of the two independent predictors.
If SAS can't make a prediction because of this missingness, it leaves an underscore (not a period) to show that it didn't make a prediction.
For some reason, if it can't make a prediction, SAS also puts an underscore for the 'observed' value--even if an observed value is present (the value in the highlighted cell under 'observed' should be 181.0512):
The following code merges the ODS output data set with the observed and predicted values, and the original data. The second data step attempts to create a new 'imputed' version of the variable that will use the original observation if it's not missing, but uses the predicted value if it is missing:
data PT_INFO_6MIN_IMP_temp;
merge PT_INFO pred_val;
drop dependent observation biased residual;
run;
data PT_INFO_6MIN_IMP_temp2;
set PT_INFO_6MIN_IMP_temp;
if missing (SIX_MIN_WALK_z) then observed=predicted;
rename observed=SIX_MIN_WALK_z_IMPUTED;
run;
However, as you can see, SAS is putting an underscore in the imputed column, when there was an original value that should have been used:
In other words, because the original variable value is not missing (it's 181.0512), SAS should have taken that value and copied it to the imputed value column. Instead, it put an underscore.
I've also tried if SIX_MIN_WALK_z =. then observed=predicted, with the same result.
Please let me know what I'm doing wrong and/or how to fix. I hope this all makes sense.
Thanks
EDIT!!!!! EDIT!!!!! EDIT!!!!!
See below for a truncated data set so that one can reproduce what's in the pictures. I took only the first 30 rows of my data set. There are three missing observations for the dependent variable that I'm trying to impute (obs 8, 11, 26). There are one of each of the independent variables missing, such that it can't make a prediction (obs 8 & 24). You'll notice that the "_IMP" version of the dependent variable mirrors the original. When it gets to missing obs #8, it doesn't impute a value because it wasn't able to predict a value. When it gets to #11 and #26, it WAS able to predict a value, so it added the predicted value to "_IMP." HOWEVER, for obs #24, it was NOT able to predict a value, but I didn't need it to, because we already have an observed value in the original variable (181.0512). I expected SAS to put this value in the "_IMP" column, but instead, it put an underscore.
data test;
input Study_ID $ nyha_4_enroll kccq12sf_both_base SIX_MIN_WALK_z;
cards;
01-001 3 87.5 399.288
01-002 4 83.333333333 411.48
01-003 2 87.5 365.76
01-005 4 14.583333333 0
01-006 3 52.083333333 362.1024
01-008 3 52.083333333 160.3248
01-009 2 56.25 426.72
01-010 4 75 .
01-011 3 79.166666667 156.3624
01-012 3 27.083333333 0
01-013 4 45.833333333 0
01-014 4 54.166666667 .
01-015 2 68.75 317.2968
01-017 3 29.166666667 196.2912
01-019 4 100 141.732
01-020 4 33.333333333 0
01-021 2 83.333333333 222.504
01-022 4 20.833333333 389.8392
01-025 4 0 0
01-029 4 43.75 0
01-030 3 83.333333333 236.22
01-031 2 35.416666667 302.0568
01-032 4 64.583333333 0
01-033 4 33.333333333 0
01-034 . 100 181.0512
01-035 4 12.5 0
01-036 4 66.666666667 .
01-041 4 75 0
01-042 4 43.75 0
01-043 4 72.916666667 0
;
run;
data test2;
set test;
drop Study_ID;
run;
ODS output
predictedvalues=pred_val;
proc glm data=test2;
class nyha_4_enroll;
model SIX_MIN_WALK_z= nyha_4_enroll kccq12sf_both_base /p solution;
run;
ODS output close;
data combine;
merge test2 pred_val;
drop dependent observation biased residual;
run;
data combine_imp;
set combine;
if missing (SIX_MIN_WALK_z) then observed=predicted;
rename observed=SIX_MIN_WALK_z_IMPUTED;
run;
The special missing values (._) mark the observations excluded from the model because of missing values of the independent variables.
Try a simple example:
data class;
set sashelp.class(obs=10) ;
keep name sex age height;
if _n_=3 then age=.;
if _n_=4 then height=.;
run;
ods output predictedvalues=pred_val;
proc glm data=class;
class sex;
model height = sex age /p solution;
run; quit;
proc print data=pred_val; run;
Since the value of the independent variable AGE was missing for observation #3, the values of Observed, Predicted and Residual for that observation are set to ._ in the predicted-values dataset.
Obs Dependent Observation Biased Observed Predicted Residual
1 Height 1 0 69.00000000 64.77538462 4.22461538
2 Height 2 0 56.50000000 58.76153846 -2.26153846
3 Height 3 1 _ _ _
4 Height 4 1 . 61.27692308 .
5 Height 5 0 63.50000000 64.77538462 -1.27538462
6 Height 6 0 57.30000000 59.74461538 -2.44461538
7 Height 7 0 59.80000000 56.24615385 3.55384615
8 Height 8 0 62.50000000 63.79230769 -1.29230769
9 Height 9 0 62.50000000 62.26000000 0.24000000
10 Height 10 0 59.00000000 59.74461538 -0.74461538
If you really want to just replace the values of OBSERVED or PREDICTED in the output with the values of the original variable, that is pretty easy to do. Just re-combine with the source dataset. You can use the ID statement of PROC GLM to have it include any variables you want in the output. For example:
id name sex age height;
Now you can use a data step to make any adjustments. For example, to make a new height variable that is either the original or the predicted value, you could use:
data want ;
set pred_val ;
NEW_HEIGHT = coalesce(height,predicted);
run;
proc print data=want width=min;
var name height age predicted new_height ;
run;
Results:
NEW_
Obs Name Height Age Predicted HEIGHT
1 Alfred 69.0 14 64.77538462 69.0000
2 Alice 56.5 13 58.76153846 56.5000
3 Barbara 65.3 . _ 65.3000
4 Carol . 14 61.27692308 61.2769
5 Henry 63.5 14 64.77538462 63.5000
6 James 57.3 12 59.74461538 57.3000
7 Jane 59.8 12 56.24615385 59.8000
8 Janet 62.5 15 63.79230769 62.5000
9 Jeffrey 62.5 13 62.26000000 62.5000
10 John 59.0 12 59.74461538 59.0000
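Applied to the data in the question, the same idea might look like the sketch below. It assumes the test2 data set from the question and relies on the ID statement behaviour described above, so the original SIX_MIN_WALK_z travels along into pred_val and no separate merge is needed.

```sas
/* Fit the model and capture the predicted values, carrying the
   original dependent variable along via the ID statement */
ods output predictedvalues=pred_val;
proc glm data=test2;
  class nyha_4_enroll;
  model SIX_MIN_WALK_z = nyha_4_enroll kccq12sf_both_base / p solution;
  id SIX_MIN_WALK_z;
run; quit;

/* Use the original value when it exists, otherwise the prediction.
   COALESCE skips ordinary (.) and special (._) missing values alike. */
data combine_imp;
  set pred_val;
  SIX_MIN_WALK_z_IMPUTED = coalesce(SIX_MIN_WALK_z, predicted);
run;
```

Rows where both the original value and the prediction are missing stay missing in SIX_MIN_WALK_z_IMPUTED.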
I want to plot Y by X, grouped by year, but color-code each year based on a different variable (dry). So each year shows as a separate line, but dry=1 years plot in one color and dry=0 years plot in a different color. I actually figured out one option (yeah!), which is below. But this doesn't give me much control.
Is there a way to put a where clause in the series statement to select specific categories so that I can specifically assign a color (or other format)? Or is there another way? This would be analogous to R where one can use multiple line statements for different subsets of data.
Thanks!!
This code works.
proc sgplot data = tmp;
where microsite_id = "&msit";
by microsite_id ;
yaxis label= "Pct. Stakes" values = (0 to 100 by 20);
xaxis label= 'Date' values = (121 to 288 by 15);
series y=tpctwett x=jday / markers markerattrs=(symbol=plus) group = year grouplc=dry groupmc=dry;
format jday tadjday metajday jdyfmt.;
label tpctwett='%surface water' tadval1='breed' metaval1='meta';
run;
Use an attribute map; see the documentation.
You can use the DRY variable to set the specific colours. For each year, assign the colour using the DRY variable in a data step.
proc sort data=tmp out=attr_data nodupkey; by year; run;
data attrs;
set attr_data;
id='year';
/* a discrete attribute map needs a character VALUE column holding the group value */
value=cats(year);
if dry=0 then linecolor='green';
if dry=1 then linecolor='red';
keep id value linecolor;
run;
Then add dattrmap=attrs to the PROC SGPLOT statement and attrid=year to the options of the SERIES statement.
ods graphics / attrpriority=none;
proc sgplot data = tmp dattrmap=attrs;
where microsite_id = "&msit";
by microsite_id ;
yaxis label= "Pct. Stakes" values = (0 to 100 by 20);
xaxis label= 'Date' values = (121 to 288 by 15);
series y=tpctwett x=jday / markers markerattrs=(symbol=plus) group = year grouplc=dry groupmc=dry attrid=year;
format jday tadjday metajday jdyfmt.;
label tpctwett='%surface water' tadval1='breed' metaval1='meta';
run;
Note that I tested and edited this post so it should work now.
Say that my data set has quite a lot of missing/invalid values and I would like to remove (or drop) the entire variable (or column) if it contains too many invalid values.
Take the following example: the variable 'gender' has quite a lot of "#N/A"s. I would like to remove that variable if a certain percentage of its data points are "#N/A"s, say more than 50% or more than 30%.
In addition, I would like to make the percentage a configurable value, i.e., I am willing to remove the entire variable if more than x% of the observations under that variable are "#N/A". And I also want to be able to define what an invalid value is, could be "#N/A", could be "Invalid Value", could be " ", could be anything else that I pre-define.
data dat;
input id score gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
Please make the solution as generalized as possible. For example, if the real data set contains thousands of variables, I need to be able to loop through all those variables instead of referencing their variable names one by one. Furthermore, the data set could contain more than just "#N/A" as bad values, other things like ".", "Invalid Obs", "N.A." could also exist at the same time.
PS: Actually I thought of a way to make this problem easier. We could probably read in all the data points as numerical values, so that all the "#N/A", "N.A.", " " stuff get turned into ".", which makes the drop criterion easier. Hope that helps you solve this problem for me ...
Update: below is the code I am working on. Got stuck at the last block.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
proc contents data=dat out=test0(keep=name type) noprint;
/*A DATA step is used to subset the test0 data set to keep only the character */
/*variables and exclude the one ID character variable. A new list of numeric*/
/*variable names is created from the character variable name with a "_n" */
/*appended to the end of each name. */
data test0;
set test0;
if type=2;
newname=trim(left(name))||"_n";
/*The macro system option SYMBOLGEN is set to be able to see what the macro*/
/*variables resolved to in the SAS log. */
options symbolgen;
/*PROC SQL is used to create three macro variables with the INTO clause. One */
/*macro variable named c_list will contain a list of each character variable */
/*separated by a blank space. The next macro variable named n_list will */
/*contain a list of each new numeric variable separated by a blank space. The */
/*last macro variable named renam_list will contain a list of each new numeric */
/*variable and each character variable separated by an equal sign to be used on*/
/*the RENAME statement. */
proc sql noprint;
select trim(left(name)), trim(left(newname)),
trim(left(newname))||'='||trim(left(name))
into :c_list separated by ' ', :n_list separated by ' ',
:renam_list separated by ' '
from test0;
quit;
/*The DATA step is used to convert the character values to numeric. An ARRAY */
/*statement is used for the list of character variables and another ARRAY for */
/*the list of numeric variables. A DO loop is used to process each variable */
/*to convert the value from character to numeric with the INPUT function. The */
/*DROP statement is used to prevent the character variables from being written */
/*to the output data set, and the RENAME statement is used to rename the new */
/*numeric variable names back to the original character variable names. */
data test2;
set dat;
array ch(*) $ &c_list;
array nu(*) &n_list;
do i = 1 to dim(ch);
nu(i)=input(ch(i),8.);
end;
drop i &c_list;
rename &renam_list;
run;
data test3;
set test2;
array myVars(*) &c_list;
countTotal=1;
do i = 1 to dim(myVars);
myCounter = count(.,myVars(i));
/* if sum(countMissing)/sum(countTotal) lt 0.5 then drop VNAME(myVars(i)); */
end;
run;
The problem, and where I got stuck, is that I am not able to drop the variables that I want to drop, because I do not want to name the variables in the DROP statement. Instead, I want it done in a loop where I can reference the variable names with the loop index "i". I tried to use the array reference "myVars(i)", but it doesn't seem to work with the DROP statement.
My understanding is that SAS processes drop statements during data step compilation, i.e. before it looks at any of the data from any input datasets. Therefore, you cannot use the vname function like that to select variables to drop, as it doesn't evaluate the variable names until the data step has finished compiling and has moved on to execution.
You will need to output a temporary dataset or view containing all your variables, including the ones you don't want, build up a list of variables that you want to drop, in a macro variable, then drop them in a subsequent data step.
Refer to this paper and page 3 in particular for more details of which things run during compilation rather than execution:
http://www.lexjansen.com/nesug/nesug11/ds/ds04.pdf
In general, you'll find this sort of thing simplified using built in procs - this is SAS's bread and butter. You just need to restate the question.
What you want is to drop variables with a % of missing/bad data higher than 50%, so you need a frequency table of variables, right?
So - use PROC FREQ. This is the simplified version (only looks for "#N/A"), but it should be easy to modify the last step to make it look for other values (and to sum up the percents for them). Or, like you'll see in the linked question (from my comment on the question), you can use a special format that puts all invalid values to one formatted value, and all valid values to another formatted value. (You'll have to construct this format.)
Concept: use PROC FREQ to get frequency table, then look at that dataset to find the rows with > 50% of the rows and an invalid value in the F_ column.
This won't work with actual missing (" " or .); you'll need to add the /MISSING option to PROC FREQ if you have those also.
data dat;
input id $ score $ gender $;
cards;
1 10 1
1 10 1
1 9 #N/A
1 9 #N/A
1 9 #N/A
1 8 #N/A
2 9 #N/A
2 8 #N/A
2 9 #N/A
2 9 2
2 10 2
;
run;
*shut off ODS for the moment, and only use ODS OUTPUT, so we do not get a mess in our results window;
ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
tables id score gender;
run;
ods output close;
ods exclude none;
*now we check for variables that match our criteria;
data has_missing;
set freq_tables;
if coalescec(of f_:) ='#N/A' and percent>50;
varname = substr(table,7);
run;
*now we put those into a macro variable to drop;
proc sql;
select varname
into :droplist separated by ' '
from has_missing;
quit;
*and we drop them;
data dat_fixed;
set dat;
drop &droplist.;
run;
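The format-based variant mentioned above, together with a configurable cutoff, could be sketched roughly as follows. The macro variable threshold, the format name $badfmt, the list of invalid strings and the data set name has_bad are all assumptions for illustration; the sketch also assumes every variable was read as character, as in the dat data set above.

```sas
%let threshold = 50;  /* hypothetical cutoff, in percent */

/* Map every value treated as invalid to 'BAD', everything else to 'OK' */
proc format;
  value $badfmt '#N/A', 'N.A.', 'Invalid Obs', ' ' = 'BAD'
                other = 'OK';
run;

/* One frequency table per character variable, counted on the
   formatted values; MISSING keeps blank values in the percentages */
ods exclude all;
ods output onewayfreqs=freq_tables;
proc freq data=dat;
  tables _character_ / missing;
  format _character_ $badfmt.;
run;
ods exclude none;

/* Keep the variables whose share of 'BAD' values exceeds the cutoff */
data has_bad;
  set freq_tables;
  if coalescec(of f_:) = 'BAD' and percent > &threshold;
  varname = substr(table, 7);
run;
```

The remaining steps (building the droplist macro variable and the final DROP) can be reused as written, reading from has_bad instead of has_missing. With the sample data, gender has 7 of 11 values equal to #N/A (about 64%), so it should be the only variable flagged at a cutoff of 50.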
I want to create a 'nice looking table' using the SAS ODS RTF output and the PROC REPORT procedure. After spending the whole day on Google I've managed to produce the following:
The dataset
DATA survey;
INPUT id var1 var2 var3 var4 var5 var6 ;
DATALINES;
1 1 35 17 7 2 2
17 1 50 14 5 5 3
33 1 45 6 7 2 7
49 1 24 14 7 5 7
65 2 52 9 4 7 7
81 2 44 11 7 7 7
2 2 34 17 6 5 3
18 2 40 14 7 5 2
34 2 47 6 6 5 6
50 2 35 17 5 7 5
;
RUN;
DATA survey;
SET survey;
LABEL var1 ='Variable 1';
LABEL var2 ='Fancy variable 2';
LABEL var3 ='Another variable no 3';
RUN;
LIBNAME mylib 'C:\my_libs';
RUN;
PROC FORMAT LIBRARY = mylib.survey;
VALUE groups 1 = 'Group A'
2 = 'Group B'
;
OPTIONS FMTSEARCH = (mylib.survey);
DATA survey;
SET survey;
FORMAT var1 groups.;
RUN;
** The code for creating the rtf-file **
ods listing close;
ods escapechar = '^';
ods noproctitle;
options nodate number;
footnote;
ODS RTF FILE = 'C:\my_workdir\output.rtf'
author = 'NN'
title = 'Table 1 name'
bodytitle
startpage = no
style = journal;
options papersize = A4
orientation = landscape;
title1 /*bold*/ /*italic*/ font = 'Times New Roman' height = 12pt justify = center underlin = 0 color = black bcolor = white 'Table 1 name';
footnote1 /*bold*/ /*italic*/ font = 'Times New Roman' height = 9pt justify = center underlin = 0 color = black bcolor = white 'Note: Created on January 2012';
PROC REPORT DATA = survey nowindows headline headskip MISSING
style(header) = {/*font_weight = bold*/ font_face = 'Times New Roman' font_size = 12pt just = left}
style(column) = {font_face = 'Times New Roman' font_size = 12pt just = left /*asis = on*/};
COLUMN var1 var1=var1_n var1=var1_pctn;
DEFINE var1 / GROUP ORDER=FREQ DESCENDING 'Variable';
DEFINE var1_n / ANALYSIS N 'Data/(N=)';
DEFINE var1_pctn / ANALYSIS PCTN format = percent8. '';
RUN;
ODS RTF CLOSE;
This generates an RTF table in Word something like the following (a little simplified):
However, I want to add a variable label 'Variable 1, n (%)' above the groups in the variable name column as a separate row (NOT in the header row). I also want to add additional variables and statistics in an aggregated table.
In the end, I want something that looks like this:
I have tried "everything" - is there anyone who knows how to do this?
I know this has been open for awhile, but I too was struggling with this for awhile, and this is what I figured out. So...
In short, SAS has trouble outputting nicely formatted tables that contain more than one type of table "format" in them. For instance, a table where the columns change midway through (like you commonly find in the "Table 1" of a research study describing the study population).
In this case, you're trying to use PROC REPORT, but I don't think it's going to work here. What you want to do is stack two different reports on top of each other, really. You're changing the column value midway through and SAS doesn't natively support that.
Some alternative approaches are:
Perform all your calculations and carefully output them to a data set in SAS, in the positions you want. Then, use PROC PRINT to print them. This is what I can only describe as a tremendous effort.
Create a new TAGSET that allows you to output multiple files, but removes the spacing between each one and aligns them to the same width, effectively creating a single table. This is also quite time consuming; I attempted it using HTML with a custom CSS file and tagset, and it wasn't terribly easy.
Use a different procedure (in this case, PROC TABULATE) and then manually delete the spacing between each table and fiddle with the width to get a final table. This isn't fully automated, but it's probably the quickest option.
PROC TABULATE is cool because you can use multiple table statements in a single example. Below, I put some code in that shows what I'm talking about.
DATA survey;
INPUT id grp var1 var2 var3 var4 var5;
DATALINES;
1 1 35 17 7 2 2
17 1 50 14 5 5 3
33 1 45 6 7 2 7
49 1 24 14 7 5 7
65 2 52 9 4 7 7
81 2 44 11 7 7 7
2 2 34 17 6 5 3
18 2 40 14 7 5 2
34 2 47 6 6 5 6
50 2 35 17 5 7 5
;
RUN;
I found your example code to be a little confusing; var1 looked like a grouping variable, and var2 looked like the first actual analysis variable, so I slightly changed the code. Next, I quickly created the same format you were using before.
PROC FORMAT;
VALUE groupft 1 = 'Group A' 2 = 'Group B';
RUN;
DATA survey;
SET survey;
LABEL var1 ='Variable 1';
LABEL var2 ='Fancy variable 2';
LABEL var3 ='Another variable no 3';
FORMAT grp groupft.;
RUN;
Now, the meat of the PROC TABULATE statement.
PROC TABULATE DATA=survey;
CLASS grp;
VAR var1--var5;
TABLE MEDIAN QRANGE,var1;
TABLE grp,var2*(N PCTN);
RUN;
TABULATE basically works with commas and asterisks to separate things. The default for something like grp*var1 is an output where the column is the first variable and then there are subcolumns for each subgroup. To add rows, you use a comma; to specify which statistics you want, you add a keyword.
This above code gets you something close to what you had in your first example (not ODS formatted, but I figure you can add that back in); it's just in two different tables.
I found the following papers useful when I was tackling this problem:
http://www.lexjansen.com/pharmasug/2005/applicationsdevelopment/ad16.pdf
http://www2.sas.com/proceedings/sugi31/089-31.pdf
1. ODS has some interesting formatting features (like aligning the numbers so a decimal point goes at the same column) but their usefulness is limited for more complex cases. The most flexible solution is to create a formatted string yourself and bypass PROC REPORT's formatting facility completely, like:
data out;
length str $25;
set statistics;
varnum = 1;
group = 1;
str = put( median, 3. );
output;
group = 2;
str = put( q1, 3. ) || " - " || put( q3, 3. );
output;
run;
You can set varnum and group as ORDER variables in PROC REPORT and add headings like "Variable 1" or "Fancy variable 2" via COMPUTE BEFORE and a LINE statement.
2. To further keep PROC REPORT from messing up the layout in ODS RTF output, consider re-enabling the ASIS style option:
define str / "..." style( column ) = { asis= on };
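Putting both points together, a minimal PROC REPORT sketch might look like the following. The rows in out are placeholder strings standing in for the formatted statistics built above, and the format name varhead and the column header 'Statistic' are hypothetical; only the ORDER variables, COMPUTE BEFORE with LINE, and the ASIS option come from the suggestions above.

```sas
/* Placeholder rows standing in for the formatted strings built above */
data out;
  length str $25;
  varnum = 1; group = 1; str = 'median';  output;
  varnum = 1; group = 2; str = 'q1 - q3'; output;
  varnum = 2; group = 1; str = 'median';  output;
  varnum = 2; group = 2; str = 'q1 - q3'; output;
run;

/* Hypothetical format that turns varnum into a heading */
proc format;
  value varhead 1 = 'Variable 1'
                2 = 'Fancy variable 2';
run;

proc report data=out nowd;
  column varnum group str;
  define varnum / order noprint;   /* controls block order, not shown */
  define group  / order noprint;   /* controls row order within a block */
  define str    / display 'Statistic' style(column) = {asis = on};
  /* Write the variable heading as its own row above each block */
  compute before varnum;
    line @1 varnum varhead.;
  endcomp;
run;
```

Because the headings are written by LINE rather than stored in a column, they appear as separate rows above each block instead of in the header row.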