I inherited a poorly documented person-month dataset that does not have a matching person-level dataset. I want to determine which of the variables in the person-month dataset are actually person-level variables (constant for all observations with a particular id), such as you would expect for date of birth. Simplistic example:
id month dob race tx weight
1 1 4058 1 1 105
1 2 4058 1 1 107
1 3 4058 1 2 108
2 1 1622 2 1 153
2 2 1622 2 3 153
2 3 1622 2 2 153
In this example, dob and race are fixed within an individual but tx and weight vary by month within an individual.
I have come up with a clumsy solution: use proc means to calculate the standard deviation of all numeric variables BY id, and then take the maximum of those standard deviations. If the maximum of the std of a variable is 0, there is no variance of that column within any individual, and I can flag that variable as being fixed (or person-level).
I feel like I'm missing a simpler statistical test to determine which of my hundreds of variables are fixed within each individuals and which vary within an individual's observations. Any suggestions?
pT
I would use the NLEVELS option in PROC FREQ. This gives you the number of unique values for each variable, so you're looking for variables with a unique value (nlevels) of 1.
Here's the code, you'll need to sort the data by id beforehand if not done already.
data have;
input id month dob race tx weight;
cards;
1 1 4058 1 1 105
1 2 4058 1 1 107
1 3 4058 1 2 108
2 1 1622 2 1 153
2 2 1622 2 3 153
2 3 1622 2 2 153
;
run;
ods select nlevels;
ods output nlevels=want;
ods noresults;
proc freq data=have nlevels;
by id;
run;
ods results;
I don't think there's a 'simple statistical test' beyond what you have worked out - standard deviation, or even MIN/MAX (which is about the same). I'd probably just do it in PROC SQL, unless there are a huge number of variables; this allows you to use character variables also.
%macro comparetype(var);
max(&var.) = min(&var.) as &var.
%mend comparetype;
proc sql;
select min(origin) as origin, min(type) as type, min(drivetrain) as drivetrain,
min(msrp) as msrp,min(invoice) as invoice,min(enginesize) as enginesize from (
select make,
%comparetype(origin),
%comparetype(type),
%comparetype(drivetrain),
%comparetype(msrp),
%comparetype(invoice),
%comparetype(enginesize)
from sashelp.cars
group by make
);
quit;
Related
EDIT!!!! GO TO BOTTOM FOR BETTER REPRODUCABLE CODE!
I have a data set with a quantitative variable that's missing 65 values that I need to impute. I used the ODS output and proc glm to simultaneously fit a model for this variable and predict values:
ODS output
predictedvalues=pred_val;
proc glm data=Six_min_miss;
class nyha_4_enroll;
model SIX_MIN_WALK_z= nyha_4_enroll kccq12sf_both_base /p solution;
run;
ODS output close;
However, I am missing 21 predicted values because 21 of my observations are missing either of the two independent predictors.
If SAS can't make a prediction because of this missingness, it leaves an underscore (not a period) to show that it didn't make a prediction.
For some reason, if it can't make a prediction, SAS also puts an underscore for the 'observed' value--even if an observed value is present (the value in the highlighted cell under 'observed' should be 181.0512):
The following code merges the ODS output data set with the observed and predicted values, and the original data. The second data step attempts to create a new 'imputed' version of the variable that will use the original observation if it's not missing, but uses the predicted value if it is missing:
data PT_INFO_6MIN_IMP_temp;
merge PT_INFO pred_val;
drop dependent observation biased residual;
run;
data PT_INFO_6MIN_IMP_temp2;
set PT_INFO_6MIN_IMP_temp;
if missing (SIX_MIN_WALK_z) then observed=predicted;
rename observed=SIX_MIN_WALK_z_IMPUTED;
run;
However, as you can see, SAS is putting an underscore in the imputed column, when there was an original value that should have been used:
In other words, because the original variable values is not missing (it's 181.0512) SAS should have taken that value and copied it to the imputed value column. Instead, it put an underscore.
I've also tried if SIX_MIN_WALK_z =. then observed=predicted
Please let me know what I'm doing wrong and/or how to fix. I hope this all makes sense.
Thanks
EDIT!!!!! EDIT!!!!! EDIT!!!!!
See below for a truncated data set so that one can reproduce what's in the pictures. I took only the first 30 rows of my data set. There are three missing observations for the dependent variable that I'm trying to impute (obs 8, 11, 26). There are one of each of the independent variables missing, such that it can't make a prediction (obs 8 & 24). You'll notice that the "_IMP" version of the dependent variable mirrors the original. When it gets to missing obs #8, it doesn't impute a value because it wasn't able to predict a value. When it gets to #11 and #26, it WAS able to predict a value, so it added the predicted value to "_IMP." HOWEVER, for obs #24, it was NOT able to predict a value, but I didn't need it to, because we already have an observed value in the original variable (181.0512). I expected SAS to put this value in the "_IMP" column, but instead, it put an underscore.
data test;
input Study_ID nyha_4_enroll kccq12sf_both_base SIX_MIN_WALK_z;
cards;
01-001 3 87.5 399.288
01-002 4 83.333333333 411.48
01-003 2 87.5 365.76
01-005 4 14.583333333 0
01-006 3 52.083333333 362.1024
01-008 3 52.083333333 160.3248
01-009 2 56.25 426.72
01-010 4 75 .
01-011 3 79.166666667 156.3624
01-012 3 27.083333333 0
01-013 4 45.833333333 0
01-014 4 54.166666667 .
01-015 2 68.75 317.2968
01-017 3 29.166666667 196.2912
01-019 4 100 141.732
01-020 4 33.333333333 0
01-021 2 83.333333333 222.504
01-022 4 20.833333333 389.8392
01-025 4 0 0
01-029 4 43.75 0
01-030 3 83.333333333 236.22
01-031 2 35.416666667 302.0568
01-032 4 64.583333333 0
01-033 4 33.333333333 0
01-034 . 100 181.0512
01-035 4 12.5 0
01-036 4 66.666666667 .
01-041 4 75 0
01-042 4 43.75 0
01-043 4 72.916666667 0
;
run;
data test2;
set test;
drop Study_ID;
run;
ODS output
predictedvalues=pred_val;
proc glm data=test2;
class nyha_4_enroll;
model SIX_MIN_WALK_z= nyha_4_enroll kccq12sf_both_base /p solution;
run;
ODS output close;
data combine;
merge test2 pred_val;
drop dependent observation biased residual;
run;
data combine_imp;
set combine;
if missing (SIX_MIN_WALK_z) then observed=predicted;
rename observed=SIX_MIN_WALK_z_IMPUTED;
run;
The special missing values (._) mark the observations excluded from the model because of missing values of the independent variables.
Try a simple example:
data class;
set sashelp.class(obs=10) ;
keep name sex age height;
if _n_=3 then age=.;
if _n_=4 then height=.;
run;
ods output predictedvalues=pred_val;
proc glm data=class;
class sex;
model height = sex age /p solution;
run; quit;
proc print data=pred_val; run;
Since for observation #3 the value of the independent variable AGE was missing in the predicted result dataset the values of observed, predicted and residual are set to ._.
Obs Dependent Observation Biased Observed Predicted Residual
1 Height 1 0 69.00000000 64.77538462 4.22461538
2 Height 2 0 56.50000000 58.76153846 -2.26153846
3 Height 3 1 _ _ _
4 Height 4 1 . 61.27692308 .
5 Height 5 0 63.50000000 64.77538462 -1.27538462
6 Height 6 0 57.30000000 59.74461538 -2.44461538
7 Height 7 0 59.80000000 56.24615385 3.55384615
8 Height 8 0 62.50000000 63.79230769 -1.29230769
9 Height 9 0 62.50000000 62.26000000 0.24000000
10 Height 10 0 59.00000000 59.74461538 -0.74461538
If you really want to just replace the values of OBSERVED or PREDICTED in the output with the values of the original variable that is pretty easy to do. Just re-combine with the source dataset. You can use the ID statement of PROC GLM to have it include any variables you want into the output. Like
id name sex age height;
Now you can use a dataset step to make any adjustments. For example to make a new height variable that is either the original or predicted value you could use:
data want ;
set pred_val ;
NEW_HEIGHT = coalesce(height,predicted);
run;
proc print data=want width=min;
var name height age predicted new_height ;
run;
Results:
NEW_
Obs Name Height Age Predicted HEIGHT
1 Alfred 69.0 14 64.77538462 69.0000
2 Alice 56.5 13 58.76153846 56.5000
3 Barbara 65.3 . _ 65.3000
4 Carol . 14 61.27692308 61.2769
5 Henry 63.5 14 64.77538462 63.5000
6 James 57.3 12 59.74461538 57.3000
7 Jane 59.8 12 56.24615385 59.8000
8 Janet 62.5 15 63.79230769 62.5000
9 Jeffrey 62.5 13 62.26000000 62.5000
10 John 59.0 12 59.74461538 59.0000
I have data that's tracking a certain eye phenomena. Some patients have it in both eyes, and some patients have it in a single eye. This is what some of the data looks like:
EyeID PatientID STATUS Gender
1 1 1 M
2 1 0 M
3 2 1 M
4 3 0 M
5 3 1 M
6 4 1 M
7 4 0 M
8 5 1 F
9 6 1 F
10 6 0 F
11 7 1 F
12 8 1 F
13 8 0 F
14 9 1 F
As you can see from the data above, there are 9 patients total and all of them have the particular phenomena in one eye.
I need the count the number of patients with this eye phenomena.
To get the number of total patients in the dataset, I used:
PROC FREQ data=new nlevels;
tables PatientID;
run;
To count the number of patients with this eye phenomena, I used:
PROC SORT data=new out=new1 nodupkey;
by Patientid Status;
run;
proc freq data=new1 nlevels;
tables Status;
run;
However, it gave the correct number of patients with the phenomena (9), but not the correct number without (0).
I now need to calculate the gender distribution of this phenomena. I used:
proc freq data=new1;
tables gender*Status/chisq;
run;
However, in the cross table, it has the correct number of patients who have the phenomena (9), but not the correct number without (0). Does anyone have any thoughts on how to do this chi-square, where if the has this phenomena in at least 1 eye, then they are positive for this phenomena?
Thanks!
PROC FREQ is doing what you told it to: counting the status=0 cases.
In general here you are using sort of blunt tools to accomplish what you're trying to accomplish, when you probably should use a more precise tool. PROC SORT NODUPKEY is sort of overkill for example, and it doesn't really do what you want anyway.
To set up a dataset of has/doesn't have, for example, let's do a few things. First I add one more row - someone who actually doesn't have - so we see that working.
data have;
input eyeID patientID status gender $;
datalines;
1 1 1 M
2 1 0 M
3 2 1 M
4 3 0 M
5 3 1 M
6 4 1 M
7 4 0 M
8 5 1 F
9 6 1 F
10 6 0 F
11 7 1 F
12 8 1 F
13 8 0 F
14 9 1 F
15 10 0 M
;;;;
run;
Now we use the data step. We want a patient-level dataset at the end, where we have eye-level now. So we create a new patient-level status.
data patient_level;
set have;
by patientID;
retain patient_status;
if first.patientID then patient_status =0;
patient_status = (patient_Status or status);
if last.patientID then output;
keep patientID patient_Status gender;
run;
Now, we can run your second proc freq. Also note you have a nice dataset of patients.
title "Patients with/without condition in any eye";
proc freq data=patient_level;
tables patient_status;
run;
title;
You also may be able to do your chi-square analysis, though I'm not a statistician and won't dip my toe into whether this is an appropriate analysis. It's likely better than your first, anyway - as it correctly identifies has/doesn't have status in at least one eye. You may need a different indicator, if you need to know number of eyes.
title "Crosstab of gender by patient having/not having condition";
proc freq data=patient_level;
tables gender*patient_Status/chisq;
run;
title;
If your actual data has every single patient having the condition, of course, it's unlikely a chi-square analysis is appropriate.
I am a new user to SAS so any feedback is greatly appreciated. I am trying to create a bar chart in SAS which shows the percent of patients that received a test by category (a stratification of risk) and then within that, show where the test was received (location). My dataset looks like this:
Category Test Test_location
High Risk 1 Site 1
Intermediate Risk 1 Site 2
Low Risk 0 .
Intermediate Risk 0 .
High Risk 1 Site 3
Where each patient is listed with the category they have been assigned to (variable 'Category'), an indicator variable that shows whether or not they received a test (variable 'test' where '1'=received test and '0'=did not receive test) and if they received a test, where that test took place (variable 'test_location').
I want to create a bar graph with the categories on the x axis and the yaxis showing the percentage of patients who got a test (test=1), and each bar showing how many tests occurred in Site 1, 2 and 3.
I have this code but it gives me the counts of patients who received test rather than percentages:
proc sgplot data=test;
vbar category / response=test
group=test_location groupdisplay=stack;
yaxis grid values=(0 to 100 by 10) label="Percentage of patients who received testing (%)";
label Category= "Risk Stratification";
keylegend /title="Testing Location" position=bottom;
quit;
I don't think proc sgplot has a percent stat, so I tried doing a proc freq but I can't figure out how to do that accurately for all of the variables I have.
Thanks for your help!
EDIT;
I added in percent stat like the poster below suggested, but it is not giving me the percentages I want (it gives me a pct_col output of test*category, and I want pct_row).
The below code gives me the percentages I want, but I also want to add test_location to show on each bar what percentages of patients were in each location.
proc tabulate data=test_util out=freq1;
class category test;
tables category,test*rowpctn;
run;
proc sgplot data=Freq1;
where test=1;
vbar category / response=pctn_10;
quit;
Example of what I want:
In the dummy dataset below, for high risk patients, for example, I want a bar that shows 75% (12 patients with tests out of the total 16 high risk patients) received tests, and then have the bar shaded to show 41.66% of those test were at Site 1, 33.34% at Site 2 and 25% at Site 3. And so on for the intermediate and low risk categories. If there is a way to label the subsections with the exact percentages, that would be great too.
Dummy data set:
data test;
infile datalines missover;
input ID Category $ Test Test_location $;
datalines;
1 High 1 Site_1
2 High 1 Site_1
3 High 1 Site_1
4 High 1 Site_1
5 High 1 Site_1
6 High 1 Site_2
7 High 1 Site_2
8 High 1 Site_2
9 High 1 Site_2
10 High 1 Site_3
11 High 1 Site_3
12 High 1 Site_3
13 High 0
14 High 0
15 High 0
16 High 0
17 Intermediate 1 Site_1
18 Intermediate 1 Site_1
19 Intermediate 1 Site_2
20 Intermediate 0
21 Intermediate 0
22 Intermediate 0
23 Intermediate 0
24 Intermediate 0
25 Intermediate 0
26 Low 1 Site_1
27 Low 1 Site_1
28 Low 1 Site_1
29 Low 1 Site_2
30 Low 1 Site_2
31 Low 1 Site_2
32 Low 1 Site_3
33 Low 0
34 Low 0
35 Low 0
36 Low 0
37 Low 0
38 Low 0
;
proc sgplot data=mydata pctlevel=graph;
vbar Category / response=Test stat=percent
group=Test_location;
run;
proc sgplot does have a percent stat, if you're in SAS 9.4+. That wasn't added until that point (According to the doc, anyway; you might try it and see in case it's undocumented.)
If not, then proc freq should allow you to create values that proc sgplot with vbarparm can use. You don't post your proc freq above so I don't know what you've tried yet, but look around; for example, my MWSUG paper shows an example of doing this (for a different reason, but the result is largely the same).
Here's a tabulate that does the work - probably easier than using freq.
proc tabulate data=have out=freq1;
class category test test_location/missing;
tables test_location,category,test*pctn;
run;
proc sgplot data=Freq1;
where test=1;
vbar category / response=pctn_000 group=test_location;
quit;
I am trying to create a two way transposed table. The original table I have looks like
id cc
1 2
1 5
1 40
2 55
2 2
2 130
2 177
3 20
3 55
3 40
4 30
4 100
I am trying to create a table that looks like
CC CC1 CC2… …CC177
1 264 5 0
2 0 132 6
…
…
177 2 1 692
In other words, how many id have cc1 also have cc2..cc177..etc
The number under ID is not count; an ID could range from 3 digits to 5 digits ID or with numbers such as 122345ab78
Is it possible to have percentage display next to each other?
CC CC1 % CC2 %… …CC177
1 264 100% 5 1.9% 0
2 0 132 6
…
…
177 2 1 692
If I want to change the CC1 CC2 to characters, how do I modify the arrays?
Eventually, I would like my table looks like
CC Dell Lenovo HP Sony
Dell
Lenovo
HP
Sony
The order of the names must match the CC number I provided above. CC1=Dell CC2=Lenovo, etc. I would also want to add percentage to the matrice. If Dell X Dell = 100 and Dell X Lenovo = 25, then Dell X Lenovo = 25%.
This changes your data structure to a wide format with an indicator for each value of CC and then uses proc corr (correlation) to create the summary table.
Proc Corr will generate the SCCP - which is the uncorrected sum of squares and crossproducts. It's something that's related to correlation, but the gist is it creates the table you're looking for. The table is output in the SAS results window and the ODS OUTPUT statement will capture the table in a dataset called coocs.
data temp;
set have;
by ID;
retain CC1-CC177;
array CC_List(177) CC1-CC177;
if first.ID then do i=1 to 177;
CC_LIST(i)=0;
end;
CC_List(CC)=1;
if last.ID then output;
run;
ods output sscp=coocs;
ods select sscp;
proc corr data=temp sscp;
var CC1-CC177;
run;
proc print data=coocs;
run;
Here's another answer, but it's inefficient and has it's issues. For one, if a value is not anywhere in the list it will not show up in the results, i.e. if there is no 20 in the dataset there will be no 20 in the final data. Also, the variables are out of order in the final dataset.
proc sql;
create table bigger as
select a.id, catt("CC", a.cc) as cc1, catt("CC", b.cc) as cc2
from have as a
cross join have as b
where a.id=b.id;
quit;
proc freq data=bigger noprint;
table cc1*cc2/ list out=bigger2;
run;
proc transpose data=bigger2 out=want2;
by cc1;
var count;
id cc2;
run;
Using PROC REPORT in SAS, if a certain ACROSS variable has 5 different value possibilities (for example, 1 2 3 4 5), but in my data set there are no observations where that variable is equal to, say, 5, how can I get the report to show the column for 5 and display 0 for the # of observations having that value?
Currently my PROC REPORT output is just not displaying those value columns that have no observations.
When push comes to shove, you can do some hacks like this. Notice that there are no missing on SEX variable of the SASHELP.CLASS:
proc format;
value $sex 'F' = 'female' 'M' = 'male' 'X' = 'other';
run;
options missing=0;
proc report data=sashelp.class nowd ;
column age sex;
define age/ group;
define sex/ across format=$sex. preloadfmt;
run;
options missing=.;
/*
Sex
Age female male other
11 1 1 0
12 2 3 0
13 2 1 0
14 2 2 0
15 2 2 0
16 0 1 0
*/