SAS 9.3 Proc Rank Alternatives - sas

I have an imported excel file, DATASET looks like:
Family Weight
1 150
1 210
1 99
2 230
2 100
2 172
I need to find the sum of ranks for each family.
I know that I can do this easily using PROC RANK but this is a HW problem and the only PROC statement I can use is PROC Means. I cannot even use Proc Sort.
The ranking would be as follows (lowest weight receives rank = 1, etc)
99 - Rank = 1
100 - Rank = 2
150 - Rank = 3
172 - Rank = 4
210 - Rank = 5
230 - Rank = 6
Resulting Dataset:
Family Sum_Ranking
1 9
2 12
Family 1 Sum_Ranking was calculated by (3+5+1)
Family 2 Sum_Ranking was calculated by (6+2+4)
Thank you for assistance.

I'm not going to give you code, but some tips.
Specifically, the most interesting part about the instructions is the explicit "not even PROC SORT".
PROC MEANS has a useful side effect, in that it sorts data by the class variables (in the class variable order). So,
PROC SORT data=blah out=blah_w;
by x y;
run;
and
PROC MEANS data=blah;
class x y;
var y;
output out=blah_w n=;
run;
Have almost the identical results. Both produce a dataset sorted by x y, even though PROC MEANS didn't require a sort.
So in this case, you can use PROC MEANS' class statement to produce a dataset that is sorted by weight and family (you must carry over family here even though you don't need it). Then you must use a data step to produce a RANK variable, which is the rank of the current line (use the _FREQ_ column to figure that out in case there are more than one with the same rank in the same family, and think about what to do in case of ties), then another PROC MEANS to summarize by family this time.

Related

Why is SAS replacing an observed value with an underscore in the ODS for proc glm

EDIT!!!! GO TO BOTTOM FOR BETTER REPRODUCABLE CODE!
I have a data set with a quantitative variable that's missing 65 values that I need to impute. I used the ODS output and proc glm to simultaneously fit a model for this variable and predict values:
ODS output
predictedvalues=pred_val;
proc glm data=Six_min_miss;
class nyha_4_enroll;
model SIX_MIN_WALK_z= nyha_4_enroll kccq12sf_both_base /p solution;
run;
ODS output close;
However, I am missing 21 predicted values because 21 of my observations are missing either of the two independent predictors.
If SAS can't make a prediction because of this missingness, it leaves an underscore (not a period) to show that it didn't make a prediction.
For some reason, if it can't make a prediction, SAS also puts an underscore for the 'observed' value--even if an observed value is present (the value in the highlighted cell under 'observed' should be 181.0512):
The following code merges the ODS output data set with the observed and predicted values, and the original data. The second data step attempts to create a new 'imputed' version of the variable that will use the original observation if it's not missing, but uses the predicted value if it is missing:
data PT_INFO_6MIN_IMP_temp;
merge PT_INFO pred_val;
drop dependent observation biased residual;
run;
data PT_INFO_6MIN_IMP_temp2;
set PT_INFO_6MIN_IMP_temp;
if missing (SIX_MIN_WALK_z) then observed=predicted;
rename observed=SIX_MIN_WALK_z_IMPUTED;
run;
However, as you can see, SAS is putting an underscore in the imputed column, when there was an original value that should have been used:
In other words, because the original variable values is not missing (it's 181.0512) SAS should have taken that value and copied it to the imputed value column. Instead, it put an underscore.
I've also tried if SIX_MIN_WALK_z =. then observed=predicted
Please let me know what I'm doing wrong and/or how to fix. I hope this all makes sense.
Thanks
EDIT!!!!! EDIT!!!!! EDIT!!!!!
See below for a truncated data set so that one can reproduce what's in the pictures. I took only the first 30 rows of my data set. There are three missing observations for the dependent variable that I'm trying to impute (obs 8, 11, 26). There are one of each of the independent variables missing, such that it can't make a prediction (obs 8 & 24). You'll notice that the "_IMP" version of the dependent variable mirrors the original. When it gets to missing obs #8, it doesn't impute a value because it wasn't able to predict a value. When it gets to #11 and #26, it WAS able to predict a value, so it added the predicted value to "_IMP." HOWEVER, for obs #24, it was NOT able to predict a value, but I didn't need it to, because we already have an observed value in the original variable (181.0512). I expected SAS to put this value in the "_IMP" column, but instead, it put an underscore.
data test;
input Study_ID nyha_4_enroll kccq12sf_both_base SIX_MIN_WALK_z;
cards;
01-001 3 87.5 399.288
01-002 4 83.333333333 411.48
01-003 2 87.5 365.76
01-005 4 14.583333333 0
01-006 3 52.083333333 362.1024
01-008 3 52.083333333 160.3248
01-009 2 56.25 426.72
01-010 4 75 .
01-011 3 79.166666667 156.3624
01-012 3 27.083333333 0
01-013 4 45.833333333 0
01-014 4 54.166666667 .
01-015 2 68.75 317.2968
01-017 3 29.166666667 196.2912
01-019 4 100 141.732
01-020 4 33.333333333 0
01-021 2 83.333333333 222.504
01-022 4 20.833333333 389.8392
01-025 4 0 0
01-029 4 43.75 0
01-030 3 83.333333333 236.22
01-031 2 35.416666667 302.0568
01-032 4 64.583333333 0
01-033 4 33.333333333 0
01-034 . 100 181.0512
01-035 4 12.5 0
01-036 4 66.666666667 .
01-041 4 75 0
01-042 4 43.75 0
01-043 4 72.916666667 0
;
run;
data test2;
set test;
drop Study_ID;
run;
ODS output
predictedvalues=pred_val;
proc glm data=test2;
class nyha_4_enroll;
model SIX_MIN_WALK_z= nyha_4_enroll kccq12sf_both_base /p solution;
run;
ODS output close;
data combine;
merge test2 pred_val;
drop dependent observation biased residual;
run;
data combine_imp;
set combine;
if missing (SIX_MIN_WALK_z) then observed=predicted;
rename observed=SIX_MIN_WALK_z_IMPUTED;
run;
The special missing values (._) mark the observations excluded from the model because of missing values of the independent variables.
Try a simple example:
data class;
set sashelp.class(obs=10) ;
keep name sex age height;
if _n_=3 then age=.;
if _n_=4 then height=.;
run;
ods output predictedvalues=pred_val;
proc glm data=class;
class sex;
model height = sex age /p solution;
run; quit;
proc print data=pred_val; run;
Since for observation #3 the value of the independent variable AGE was missing in the predicted result dataset the values of observed, predicted and residual are set to ._.
Obs Dependent Observation Biased Observed Predicted Residual
1 Height 1 0 69.00000000 64.77538462 4.22461538
2 Height 2 0 56.50000000 58.76153846 -2.26153846
3 Height 3 1 _ _ _
4 Height 4 1 . 61.27692308 .
5 Height 5 0 63.50000000 64.77538462 -1.27538462
6 Height 6 0 57.30000000 59.74461538 -2.44461538
7 Height 7 0 59.80000000 56.24615385 3.55384615
8 Height 8 0 62.50000000 63.79230769 -1.29230769
9 Height 9 0 62.50000000 62.26000000 0.24000000
10 Height 10 0 59.00000000 59.74461538 -0.74461538
If you really want to just replace the values of OBSERVED or PREDICTED in the output with the values of the original variable that is pretty easy to do. Just re-combine with the source dataset. You can use the ID statement of PROC GLM to have it include any variables you want into the output. Like
id name sex age height;
Now you can use a dataset step to make any adjustments. For example to make a new height variable that is either the original or predicted value you could use:
data want ;
set pred_val ;
NEW_HEIGHT = coalesce(height,predicted);
run;
proc print data=want width=min;
var name height age predicted new_height ;
run;
Results:
NEW_
Obs Name Height Age Predicted HEIGHT
1 Alfred 69.0 14 64.77538462 69.0000
2 Alice 56.5 13 58.76153846 56.5000
3 Barbara 65.3 . _ 65.3000
4 Carol . 14 61.27692308 61.2769
5 Henry 63.5 14 64.77538462 63.5000
6 James 57.3 12 59.74461538 57.3000
7 Jane 59.8 12 56.24615385 59.8000
8 Janet 62.5 15 63.79230769 62.5000
9 Jeffrey 62.5 13 62.26000000 62.5000
10 John 59.0 12 59.74461538 59.0000

Interpolate values in unbalanced panel data using SAS

Say we are confined to using SAS and have a panel/longitudinal dataset. We have indicators for cohort and time, as well as some measured variable y.
data in;
input cohort time y;
datalines;
1 1 100
1 2 101
1 3 102
1 4 103
1 5 104
1 6 105
2 2 .
2 3 .
2 4 .
2 5 .
2 6 .
3 3 .
3 4 .
3 5 .
3 6 .
4 4 108
4 5 110
4 6 112
run;
Note that units of cohort and time are the same so that if the dataset goes out to time unit 6, each successive panel unit will be one period shorter than the one before it in time.
We have a gap of two panel units between actual data. The goal is to linearly interpolate the two missing panel units (values for cohort 2 and 3) from the two that "sandwich" them. For cohort 2 at time 5 the interpolated value should be 0.67*104 + 0.33*110, while for cohort 3 at time 5 it would be 0.33*104 + 0.67*110. Basically you just weight 2/3 for the closer panel unit with actuals, and 1/3 for the further panel unit. You'll of course have missing values, but for this toy example that's not a problem.
I'm imagining the solution involves lagging and using the first. operator and loops but my SAS is so poor I hesitate to provide even my broken code example.
I've got a solution, it is however tortured. There must be a better way to do it, this takes one line in Stata.
First we use proc SQL to make a table of the two populated panel units, the "bread of the sandwich"
proc sql;
create table haveY as
select time, cohort, y
from startingData
where y is not missing
order by time, cohort;
quit;
Next we loop over the rows of this reduced dataset to produce interpolated values, I don't completely follow the operations here, I modified a related example I found.
data wantY;
set haveY(rename=(y=thisY cohort=thisCohort));
by time;
retain lastCohort lastY;
lastcohort = lag(thisCohort);
lastY = lag(thisY);
if not first.time then do;
do cohort = lastCohort +1 to thisCohort-1;
y = ((thisCohort-cohort)*lastY + (cohort-lastCohort)*thisY)/(thisCohort-lastCohort);
output;
end;
end;
cohort=thisCohort;
y=thisY;
drop this: last:;
run;
proc sort data=work.wantY;
by cohort time;
run;
This does produce what is needed, it can be joined using proc sql into the starting table: startingData. Not a completely satisfying solution due to the verbosity but it does work.

SAS/ Long to Wide with missing data using data step

I want converted my data from long to wide format using data step. The problem is that due to missing values the values are not placed in the correct cells. I think to solve the problem I have to include placeholder for missing values.
The problem is I don't know how to do. Can someone please give me tip on how to go about it.
data tic;
input id country$ month math;
datalines;
1 uk 1 10
1 uk 2 15
1 uk 3 24
2 us 2 15
2 us 4 12
3 fl 1 15
3 fl 2 16
3 fl 3 17
3 fl 4 15
;
run;
proc sort data=tic;
by id;
run;
data tot(drop=month math);
retain month1-month4 math1-math4;
array tat{4} month1-month4;
array kat{4} math1-math4;
set tic;
by id;
if first.id then do;
i=1;
do j=1 to 4;
tat{j}=.;
kat{j}=.;
end;
end;
tat(i)=month;
kat(i)=math;
if last.id then output;
i+1;
run;
Edit
I finally figured out what the problem is:
changed this lines of code
tat(i)=month;
kat(i)=math;
to:
tat(month)=month;
kat(month)=math;
and it fixed the problem.
Data transformations from tall and skinny to short and wide often mean that categorical data ends up as column names. This is a process of moving data to metadata, which can be a problem later on for dealing with BY or CLASS groups.
SAS has Proc TABULATE and Proc REPORT for creating pivoted output. Proc TRANSPOSE is also a good standard way of creating pivoted data.
I did notice that you are pivoting two columns at once. TRANSPOSE can't multi-pivot. The DATA Step approach you showed is a typical way for doing a transpose transform when the indices lie within known ranges. In your case the array declaration must be such that 'direct-addressing' via index can to handle the minimal and maximal month values that occur over all the data.

Weighted Rank in SAS

I have a table with different scores for R60,R90,R120,R150,R180 and how can I make one table with a weighted rank based on this five variables, and CODE_RAC where NORM_PCT has 40% weightage, RB_PCT has 30% weightage and RB_PCT has 40% weightage ][1]
Can you help me with this in SAS Enterprise Edition? Please find the sample attached from the dataset
This isn't done with enterprise edition, but I hope it would serve.
There should be a proc rank program, which does the ranking for you. Either that or you can just sort the data by calculated 'ranking variable (rank_calc in example). I'm quite sure you could do this in single step, but may this be more informative.
data Begin;
length code_rac $10 norm_R60 3 rb_R60 3 Reso_R60 3;
input code_rac norm_R60 rb_R60 Reso_R60;
datalines;
first 10 6 2
second 0 0 10
third 8 6 4
forth 0 10 7
fifth 0 0 8
;
ruN;
data begin; /*Calculate weighted value for ranking*/
set begin;
rank_calc= norm_R60*0.4 + rb_R60*0.3 + Reso_R60*0.4;
run;
proc rank data=begin out=sorted_by_rank;
var rank_calc;
ranks my_rank;
run;
For more on ranking see http://www.lexjansen.com/nesug/nesug09/ap/AP01.pdf

SAS proc rank is omitting a rank i.e. is generating 9 ranks when it is supposed to be generating 10

proc rank data=___ out=____ ties=low
groups=10;
var variable_name1;
ranks variable_name2;
run;
This code is generating ranks (variable_name2) from 0-9 with only 7 missing. Anyone know why that's the case?
variable_name1 is distributed as follows:
mean = 440
stddev 12
min = 217
max = 474
Most likely you have too many ties, repeat of a single value. This means that your 7/8 or 6/7th ranks are essentially the same.
Run a proc freq on your data and you'll find that you have a single number that accounts for more than 10% of your data.