SAS Proc GLM and Npar1way - sas

1. (a) Examine the significance of the differences between the mean worm counts for the four control groups using an ANOVA test. Alpha=0.05
(b) If part (a) is significant, find pairs of groups that are significantly different in mean worm counts. Alpha=0.05
2. (a) Examine the significance of the differences between the means for the four control groups using the Kruskal-Wallis test (nonparametric test). Alpha=0.05
(b) Rank your response (all worm counts) in ascending order (choose "mean" ties) and save the ranks in a dataset "rankworms".
(c) Examine the significance of the differences between the mean ranks of the four control groups using an ANOVA test. Compare your result with the Kruskal-Wallis test. Use the "rankworms" dataset. Alpha=0.05
My Code
data why;
input group $ worm @@;
datalines;
1 279 2 378 3 172 4 381
1 238 2 275 3 335 4 346
1 234 2 412 3 335 4 340
1 198 2 265 3 282 4 471
1 303 2 286 3 250 4 318
;
*Part 1A*;
Proc GLM data=why Alpha=0.05;
Class group;
Model worm = group;
Means group;
Run;
Quit;
*Part 2A*;
Proc Npar1way data=why Alpha=0.05 Wilcoxon;
Class group;
Var worm;
Run;
Quit;
*Part 2B*;
Proc Rank data=why Ties=Mean out=rankworms;
By worm;
Ranks newworm;
Var worm;
Run; Quit;
*Part 2C*;
Proc Npar1way data=why Alpha=0.05 Anova;
Class group;
Var worm;
Run;Quit;
For part 2B I keep getting the error "data may be incomplete". I tried using PROC SORT so that I could use a BY statement, but I kept getting a message that the data was not sorted in ascending order. I was under the impression that PROC SORT automatically sorts in ascending order. For everything else, I would just like to make sure that I am on the right track; the questions are kind of confusing to me. Thanks in advance!

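Remove the following from 2B:
BY WORM;
A BY statement makes PROC RANK rank separately within each BY group, so this ranks each worm value on its own rather than against the others. To rank within each group it would be
BY GROUP;
and you would need to sort by group first. Part 2(b) asks for ranks of all worm counts together, though, and that needs no BY statement at all.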
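As a rough sketch of how the remaining parts might fit together, reusing the question's dataset `why` and rank variable `newworm`: the Tukey option for part 1B is an assumption (the assignment does not name a pairwise method), and part 2C reads the ranks from `rankworms` rather than the raw counts.

```sas
/* Part 1B (assumption: Tukey's HSD is an acceptable pairwise method) */
proc glm data=why alpha=0.05;
  class group;
  model worm = group;
  means group / tukey;
run;
quit;

/* Part 2B: rank all worm counts together, so no BY statement */
proc rank data=why ties=mean out=rankworms;
  var worm;
  ranks newworm;
run;

/* Part 2C: one-way ANOVA on the ranks saved in rankworms */
proc glm data=rankworms alpha=0.05;
  class group;
  model newworm = group;
run;
quit;
```

The F test from part 2C can then be compared with the Kruskal-Wallis chi-square from part 2A.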

Related

Repeating values after page break in proc report

I created a report of the following form:
ID   VAR1
     VAR2
111  1
     2
     3
     4
     5
     6
222  1
     2
I need to follow a requirement that if a page break appears inside the ID block, then the ID value must be displayed on the next page. The following form is not acceptable:
ID   VAR1
     VAR2
111  1
     2
     3
     4
-----PAGE BREAK----
     5
     6
222  1
     2
The page break must not occur between VAR1 and VAR2, either:
     VAR2
111  1
     2
     3
-------PAGE BREAK--------
     4
     5
     6
222  1
     2
The report should look like this:
ID   VAR1
     VAR2
111  1
     2
     3
     4
-------PAGE BREAK-----
111  5
     6
222  1
     2
The question is - how to obtain the result? I don't want to present each ID on a separate page because unique ID blocks differ in length. So there is no simple solution like creating page break variable with different values for different IDs. I would like to avoid modifying any variables (except grouping/sorting variables) in the dataset I feed into proc report.
I would appreciate any input on this. Thanks.
You need to use the SPANROWS option, as shown in the PharmaSUG 2011 paper "Beyond the Basics: Advanced REPORT Procedure Tips and Tricks Updated for SAS® 9.2". You didn't share your code, but the option goes on the PROC REPORT statement. Here's the example from the paper:
data spanrows_example;
set sashelp.class
sashelp.class
sashelp.class;
run;
ods pdf file='c:\spanrows.pdf';
proc report nowd data= spanrows_example spanrows;
col sex age name height weight;
define sex / order;
run;
ods pdf close;
You can't necessarily get what you want as far as VAR1/VAR2 goes, though, without forcing a page break when you're close to the end of a page (which is quite challenging to calculate accurately).

Why is SAS replacing an observed value with an underscore in the ODS for proc glm

EDIT!!!! GO TO BOTTOM FOR BETTER REPRODUCIBLE CODE!
I have a data set with a quantitative variable that's missing 65 values that I need to impute. I used the ODS output and proc glm to simultaneously fit a model for this variable and predict values:
ODS output
predictedvalues=pred_val;
proc glm data=Six_min_miss;
class nyha_4_enroll;
model SIX_MIN_WALK_z= nyha_4_enroll kccq12sf_both_base /p solution;
run;
ODS output close;
However, I am missing 21 predicted values because 21 of my observations are missing either of the two independent predictors.
If SAS can't make a prediction because of this missingness, it leaves an underscore (not a period) to show that it didn't make a prediction.
For some reason, if it can't make a prediction, SAS also puts an underscore for the 'observed' value--even if an observed value is present (the value in the highlighted cell under 'observed' should be 181.0512):
The following code merges the ODS output data set with the observed and predicted values, and the original data. The second data step attempts to create a new 'imputed' version of the variable that will use the original observation if it's not missing, but uses the predicted value if it is missing:
data PT_INFO_6MIN_IMP_temp;
merge PT_INFO pred_val;
drop dependent observation biased residual;
run;
data PT_INFO_6MIN_IMP_temp2;
set PT_INFO_6MIN_IMP_temp;
if missing (SIX_MIN_WALK_z) then observed=predicted;
rename observed=SIX_MIN_WALK_z_IMPUTED;
run;
However, as you can see, SAS is putting an underscore in the imputed column, when there was an original value that should have been used:
In other words, because the original variable value is not missing (it's 181.0512), SAS should have taken that value and copied it to the imputed value column. Instead, it put an underscore.
I've also tried if SIX_MIN_WALK_z =. then observed=predicted
Please let me know what I'm doing wrong and/or how to fix. I hope this all makes sense.
Thanks
EDIT!!!!! EDIT!!!!! EDIT!!!!!
See below for a truncated data set so that one can reproduce what's in the pictures. I took only the first 30 rows of my data set. There are three missing observations for the dependent variable that I'm trying to impute (obs 8, 11, 26). There are one of each of the independent variables missing, such that it can't make a prediction (obs 8 & 24). You'll notice that the "_IMP" version of the dependent variable mirrors the original. When it gets to missing obs #8, it doesn't impute a value because it wasn't able to predict a value. When it gets to #11 and #26, it WAS able to predict a value, so it added the predicted value to "_IMP." HOWEVER, for obs #24, it was NOT able to predict a value, but I didn't need it to, because we already have an observed value in the original variable (181.0512). I expected SAS to put this value in the "_IMP" column, but instead, it put an underscore.
data test;
input Study_ID $ nyha_4_enroll kccq12sf_both_base SIX_MIN_WALK_z;
cards;
01-001 3 87.5 399.288
01-002 4 83.333333333 411.48
01-003 2 87.5 365.76
01-005 4 14.583333333 0
01-006 3 52.083333333 362.1024
01-008 3 52.083333333 160.3248
01-009 2 56.25 426.72
01-010 4 75 .
01-011 3 79.166666667 156.3624
01-012 3 27.083333333 0
01-013 4 45.833333333 0
01-014 4 54.166666667 .
01-015 2 68.75 317.2968
01-017 3 29.166666667 196.2912
01-019 4 100 141.732
01-020 4 33.333333333 0
01-021 2 83.333333333 222.504
01-022 4 20.833333333 389.8392
01-025 4 0 0
01-029 4 43.75 0
01-030 3 83.333333333 236.22
01-031 2 35.416666667 302.0568
01-032 4 64.583333333 0
01-033 4 33.333333333 0
01-034 . 100 181.0512
01-035 4 12.5 0
01-036 4 66.666666667 .
01-041 4 75 0
01-042 4 43.75 0
01-043 4 72.916666667 0
;
run;
data test2;
set test;
drop Study_ID;
run;
ODS output
predictedvalues=pred_val;
proc glm data=test2;
class nyha_4_enroll;
model SIX_MIN_WALK_z= nyha_4_enroll kccq12sf_both_base /p solution;
run;
ODS output close;
data combine;
merge test2 pred_val;
drop dependent observation biased residual;
run;
data combine_imp;
set combine;
if missing (SIX_MIN_WALK_z) then observed=predicted;
rename observed=SIX_MIN_WALK_z_IMPUTED;
run;
The special missing values (._) mark the observations excluded from the model because of missing values of the independent variables.
Try a simple example:
data class;
set sashelp.class(obs=10) ;
keep name sex age height;
if _n_=3 then age=.;
if _n_=4 then height=.;
run;
ods output predictedvalues=pred_val;
proc glm data=class;
class sex;
model height = sex age /p solution;
run; quit;
proc print data=pred_val; run;
Since the value of the independent variable AGE was missing for observation #3, the values of Observed, Predicted and Residual in the predicted-values dataset are set to ._ for that row.
Obs  Dependent  Observation  Biased     Observed    Predicted     Residual
  1  Height               1       0  69.00000000  64.77538462   4.22461538
  2  Height               2       0  56.50000000  58.76153846  -2.26153846
  3  Height               3       1            _            _            _
  4  Height               4       1            .  61.27692308            .
  5  Height               5       0  63.50000000  64.77538462  -1.27538462
  6  Height               6       0  57.30000000  59.74461538  -2.44461538
  7  Height               7       0  59.80000000  56.24615385   3.55384615
  8  Height               8       0  62.50000000  63.79230769  -1.29230769
  9  Height               9       0  62.50000000  62.26000000   0.24000000
 10  Height              10       0  59.00000000  59.74461538  -0.74461538
If you just want to replace the values of OBSERVED or PREDICTED in the output with the values of the original variable, that is easy to do: re-combine with the source dataset. You can use the ID statement of PROC GLM to include any variables you want in the output, like
id name sex age height;
Now you can use a data step to make any adjustments. For example, to make a new height variable that is either the original or the predicted value you could use:
data want ;
set pred_val ;
NEW_HEIGHT = coalesce(height,predicted);
run;
proc print data=want width=min;
var name height age predicted new_height ;
run;
Results:
                                            NEW_
Obs  Name     Height  Age    Predicted    HEIGHT
  1  Alfred     69.0   14  64.77538462   69.0000
  2  Alice      56.5   13  58.76153846   56.5000
  3  Barbara    65.3    .            _   65.3000
  4  Carol         .   14  61.27692308   61.2769
  5  Henry      63.5   14  64.77538462   63.5000
  6  James      57.3   12  59.74461538   57.3000
  7  Jane       59.8   12  56.24615385   59.8000
  8  Janet      62.5   15  63.79230769   62.5000
  9  Jeffrey    62.5   13  62.26000000   62.5000
 10  John       59.0   12  59.74461538   59.0000
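The same idea can be applied to the question's own datasets. The following is a minimal sketch, assuming `combine` is the one-to-one merge of `test2` and `pred_val` from the question, so that each row carries both the original `SIX_MIN_WALK_z` and `predicted`. It builds the imputed variable from the original column instead of from `observed`, which GLM sets to ._ on rows it excluded.

```sas
data combine_imp;
  set combine;
  /* Take the original value when present, otherwise the model prediction.  */
  /* COALESCE skips ordinary (.) and special (._) missing values alike, so  */
  /* a row with no original value and no prediction simply stays missing.   */
  SIX_MIN_WALK_z_IMPUTED = coalesce(SIX_MIN_WALK_z, predicted);
  drop observed;
run;
```

With this version, a row that has an original value of 181.0512 but no prediction keeps 181.0512 in the imputed column.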

Two Way Transpose SAS Table

I am trying to create a two way transposed table. The original table I have looks like
id cc
1 2
1 5
1 40
2 55
2 2
2 130
2 177
3 20
3 55
3 40
4 30
4 100
I am trying to create a table that looks like
CC CC1 CC2… …CC177
1 264 5 0
2 0 132 6
…
…
177 2 1 692
In other words, how many IDs that have CC1 also have CC2 ... CC177, etc.
The value under ID is not a count; an ID can be 3 to 5 digits long or an alphanumeric value such as 122345ab78.
Is it possible to display the percentage next to each count?
CC CC1 % CC2 %… …CC177
1 264 100% 5 1.9% 0
2 0 132 6
…
…
177 2 1 692
If I want to change the CC1 CC2 to characters, how do I modify the arrays?
Eventually, I would like my table to look like
CC Dell Lenovo HP Sony
Dell
Lenovo
HP
Sony
The order of the names must match the CC numbers I provided above: CC1=Dell, CC2=Lenovo, etc. I would also want to add percentages to the matrix. If Dell X Dell = 100 and Dell X Lenovo = 25, then Dell X Lenovo = 25%.
This changes your data structure to a wide format with a 0/1 indicator for each value of CC, and then uses PROC CORR (correlation) to create the summary table.
PROC CORR will generate the SSCP matrix, which is the uncorrected sums of squares and crossproducts. It's related to correlation, but the gist is that for 0/1 indicators it is exactly the co-occurrence table you're looking for. The table is shown in the SAS results window, and the ODS OUTPUT statement captures it in a dataset called coocs.
data temp;
set have;
by ID;
retain CC1-CC177;
array CC_List(177) CC1-CC177;
if first.ID then do i=1 to 177;
CC_LIST(i)=0;
end;
CC_List(CC)=1;
if last.ID then output;
run;
ods output sscp=coocs;
ods select sscp;
proc corr data=temp sscp;
var CC1-CC177;
run;
proc print data=coocs;
run;
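For the percentages asked about in the question, one possible sketch is to divide each row of the captured table by its own diagonal count. This assumes that `coocs` has one row per variable in the same order as the VAR statement and numeric columns named CC1-CC177. The dataset name `coocs_pct` and the columns PCT1-PCT177 are made up for illustration.

```sas
data coocs_pct;
  set coocs;
  array cnt(177) CC1-CC177;
  array pct(177) PCT1-PCT177;
  /* Assumption: row _n_ describes CC<_n_>, so cnt(_n_) is its diagonal count */
  if _n_ <= 177 then do j = 1 to 177;
    if cnt(_n_) > 0 then pct(j) = cnt(j) / cnt(_n_);
  end;
  format PCT1-PCT177 percent8.1;
  drop j;
run;
```

Under that assumption, a row whose diagonal is 100 and whose CC2 count is 25 gets PCT2 = 25.0%. Codes that never occur have a zero diagonal and keep missing percentages.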
Here's another answer, but it's inefficient and has its issues. For one, a value that appears nowhere in the list will not show up in the results, i.e. if there is no 20 in the dataset there will be no 20 in the final data. Also, the variables are out of order in the final dataset.
proc sql;
create table bigger as
select a.id, catt("CC", a.cc) as cc1, catt("CC", b.cc) as cc2
from have as a
cross join have as b
where a.id=b.id;
quit;
proc freq data=bigger noprint;
table cc1*cc2/ list out=bigger2;
run;
proc transpose data=bigger2 out=want2;
by cc1;
var count;
id cc2;
run;

SAS 9.3 Proc Rank Alternatives

I have an imported excel file, DATASET looks like:
Family Weight
1 150
1 210
1 99
2 230
2 100
2 172
I need to find the sum of ranks for each family.
I know that I can do this easily using PROC RANK but this is a HW problem and the only PROC statement I can use is PROC Means. I cannot even use Proc Sort.
The ranking would be as follows (lowest weight receives rank = 1, etc)
99 - Rank = 1
100 - Rank = 2
150 - Rank = 3
172 - Rank = 4
210 - Rank = 5
230 - Rank = 6
Resulting Dataset:
Family Sum_Ranking
1 9
2 12
Family 1 Sum_Ranking was calculated by (3+5+1)
Family 2 Sum_Ranking was calculated by (6+2+4)
Thank you for assistance.
I'm not going to give you code, but some tips.
Specifically, the most interesting part about the instructions is the explicit "not even PROC SORT".
PROC MEANS has a useful side effect, in that it sorts data by the class variables (in the class variable order). So,
PROC SORT data=blah out=blah_w;
by x y;
run;
and
PROC MEANS data=blah;
class x y;
var y;
output out=blah_w n=;
run;
have almost identical results. Both produce a dataset sorted by x y, even though PROC MEANS didn't require a sort.
So in this case, you can use PROC MEANS' class statement to produce a dataset that is sorted by weight and family (you must carry over family here even though you don't need it). Then you must use a data step to produce a RANK variable, which is the rank of the current line (use the _FREQ_ column to figure that out in case there are more than one with the same rank in the same family, and think about what to do in case of ties), then another PROC MEANS to summarize by family this time.
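For readers who want to see the shape of that approach, here is a rough sketch under simplifying assumptions: all weights are distinct (as in the sample data), so ties are not handled, and the helper variable `w` and the dataset names `dataset2`, `sorted`, `ranked` and `want` are made up for illustration. The copy `w` exists only so that PROC MEANS has an analysis variable that is not also a class variable.

```sas
data dataset2;
  set dataset;
  w = weight;   /* hypothetical helper: analysis variable for PROC MEANS */
run;

/* NWAY keeps only the weight*family cells; the output comes back */
/* ordered by the class variables, i.e. by weight, then family.   */
proc means data=dataset2 nway noprint;
  class weight family;
  var w;
  output out=sorted(drop=_type_) n=n_obs;
run;

/* With no ties, the rank is simply the row position in weight order. */
data ranked;
  set sorted;
  rank = _n_;
run;

/* Sum the ranks within each family. */
proc means data=ranked nway noprint;
  class family;
  var rank;
  output out=want(drop=_type_ _freq_) sum=Sum_Ranking;
run;
```

On the six sample rows this gives 1+3+5 = 9 for family 1 and 2+4+6 = 12 for family 2, matching the expected result in the question. Tied weights would need the _FREQ_ handling described above.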

determining variables that are constant within each id (stacked dataset)

I inherited a poorly documented person-month dataset that does not have a matching person-level dataset. I want to determine which of the variables in the person-month dataset are actually person-level variables (constant for all observations with a particular id), such as you would expect for date of birth. Simplistic example:
id month dob race tx weight
1 1 4058 1 1 105
1 2 4058 1 1 107
1 3 4058 1 2 108
2 1 1622 2 1 153
2 2 1622 2 3 153
2 3 1622 2 2 153
In this example, dob and race are fixed within an individual but tx and weight vary by month within an individual.
I have come up with a clumsy solution: use proc means to calculate the standard deviation of all numeric variables BY id, and then take the maximum of those standard deviations. If the maximum of the std of a variable is 0, there is no variance of that column within any individual, and I can flag that variable as being fixed (or person-level).
I feel like I'm missing a simpler statistical test to determine which of my hundreds of variables are fixed within each individuals and which vary within an individual's observations. Any suggestions?
I would use the NLEVELS option in PROC FREQ. This gives you the number of unique values for each variable, so with BY-group processing you're looking for variables whose number of levels (NLevels) is 1 within every id.
Here's the code, you'll need to sort the data by id beforehand if not done already.
data have;
input id month dob race tx weight;
cards;
1 1 4058 1 1 105
1 2 4058 1 1 107
1 3 4058 1 2 108
2 1 1622 2 1 153
2 2 1622 2 3 153
2 3 1622 2 2 153
;
run;
ods select nlevels;
ods output nlevels=want;
ods noresults;
proc freq data=have nlevels;
by id;
run;
ods results;
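The captured table has one row per id and variable, so a final step is still needed to collapse it. A minimal sketch, assuming the `want` dataset produced above with the NLevels table's usual columns `TableVar` and `NLevels`:

```sas
proc sql;
  /* A variable is person-level if no id ever shows more than one level */
  select TableVar, max(NLevels) as max_levels
  from want
  group by TableVar
  having max(NLevels) = 1;
quit;
```

For the sample data this should list dob and race. Note that NLevels counts a missing value as a level of its own, so a variable that is sometimes missing within an id would be reported as varying.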
I don't think there's a 'simple statistical test' beyond what you have worked out - standard deviation, or even MIN/MAX (which is about the same). I'd probably just do it in PROC SQL, unless there are a huge number of variables; this allows you to use character variables also.
%macro comparetype(var);
max(&var.) = min(&var.) as &var.
%mend comparetype;
proc sql;
select min(origin) as origin, min(type) as type, min(drivetrain) as drivetrain,
min(msrp) as msrp,min(invoice) as invoice,min(enginesize) as enginesize from (
select make,
%comparetype(origin),
%comparetype(type),
%comparetype(drivetrain),
%comparetype(msrp),
%comparetype(invoice),
%comparetype(enginesize)
from sashelp.cars
group by make
);
quit;