Applying cutoff to data set with IDs - sas

I am using SAS and managed to run proc logistic, which gives me a table like so.
Classification Table
              Correct       Incorrect                Percentages
 Prob            Non-            Non-            Sensi-   Speci-    FALSE    FALSE
Level   Event   Event   Event   Event  Correct   tivity   ficity      POS      NEG       J
    0      33       0     328       0      9.1      100        0     90.9        .      99
 0.02      33      62     266       0     26.3      100     18.9       89        0   117.9
 0.04      31     162     166       2     53.5     93.9     49.4     84.3      1.2   142.3
 0.06      26     209     119       7     65.1     78.8     63.7     82.1      3.2   141.5
How do I include IDs for the rows of data in lib.POST_201505_PRED below that have at least 0.6 probability?
proc logistic data=lib.POST_201503 outmodel=lib.POST_201503_MODEL descending;
model BUYER =
age
tenure
usage
payment
loyalty_card
/outroc=lib.POST_201503_ROC;
Score data=lib.POST_201505 out=lib.POST_201505_PRED outroc=lib.POST_201505_ROC;
run;
I've been reading the documentation and searching online but haven't found anything on it. I must be searching for the wrong keywords, as I presume this is a frequently used process.

You just need an ID statement to tell SAS that your ID variable identifies your observations:
proc logistic data=lib.POST_201503 outmodel=lib.POST_201503_MODEL descending;
id ID;
model BUYER = age tenure usage payment loyalty_card
/outroc=lib.POST_201503_ROC;
Score data=lib.POST_201505
out=lib.POST_201505_PRED
outroc=lib.POST_201505_ROC;
run;
Now your output contains all you need.
For instance, to print the IDs that were assigned a probability of at least 0.6 of being a BUYER:
proc print data=lib.POST_201505_PRED (where=(P_1 GE 0.6));
var ID P_1;
run;
You will find these "id yourKey;" statements throughout the statistical procedures in SAS, for instance:
proc univariate data=psydata.stroop;
id Subject;
var ReadTime;
run;
** will report the most extreme values of ReadTime as
;

Turns out I just had to include the ids in lib.POST_201505
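As a minimal sketch of that last point: the OUT= data set written by the SCORE statement carries over the variables of the data set being scored, so once the ID is in lib.POST_201505 the cutoff can be applied in a plain DATA step. This assumes the variable is called ID and that the event level of BUYER is 1 (so the predicted probability is in P_1); work.post_201505_hits is a made-up output name.

```sas
/* Keep only the scored rows at or above the 0.6 cutoff.              */
/* Assumptions: lib.POST_201505 contains a variable named ID, and the */
/* event level of BUYER is 1, so the probability is stored in P_1.    */
data work.post_201505_hits;      /* hypothetical output data set */
    set lib.POST_201505_PRED;
    where P_1 >= 0.6;
    keep ID P_1;
run;
```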

Related

PROC TABULATE with ALL and row percentage

I am not able to get a row with ALL using row percentages. I would like the first row to give sum and percentage for column totals. So the percent under borderline for ALL should display 1861 * 100/5049=36.8% and under Desirable to display 1399 * 100/5049=27.7%. Currently it is displaying 100% and I need to change that.
proc tabulate data=sashelp.heart;* format=8.2;
class chol_status smoking_status sex;
table (all smoking_status sex),
(all chol_status)*(n*f=8. colpctn) ;
run;
The output is
All Cholesterol Status
Borderline Desirable High
N ColPctN N ColPctN N ColPctN N ColPctN
All 5049 100.00 1861 100.00 1399 100.00 1789 100.00 <- change the cholesterol % to denominator 5049
Smoking Status
Heavy (16-25) 1029 20.38 383 20.58 285 20.37 361 20.18
Light (1-5) 563 11.15 192 10.32 174 12.44 197 11.01
Moderate (6-15) 563 11.15 217 11.66 170 12.15 176 9.84
Non-smoker 2436 48.25 886 47.61 655 46.82 895 50.03
Very Heavy (> 25) 458 9.07 183 9.83 115 8.22 160 8.94
Sex
Female 2770 54.86 959 51.53 803 57.40 1008 56.34
Male 2279 45.14 902 48.47 596 42.60 781 43.66
I think the closest you can get is this:
proc tabulate data=sashelp.heart;* format=8.2;
class chol_status smoking_status sex;
table all*rowpctn=' ' (smoking_status sex)*(n=' '*f=8. colpctn=' '),
(all) (chol_status) ;
run;
That's not what you want, though, and doesn't really look very good. It's the only option that comes out of proc tabulate, though, as Tabulate won't let you assign statistics to both the rows and the columns - you have to pick one.
PROC REPORT will do what you want, with some effort. However, you could also run this in a two step process - output the tabulate to a dataset, fix the row percentages, then re-print it, either in Report or Tabulate, not asking it to percentage things that time.
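As a rough sketch of where the numbers for the ALL row would come from in that two-step approach, a one-way PROC FREQ reports each cholesterol status as a percentage of the overall total. The WHERE clause is an assumption meant to mirror how PROC TABULATE drops observations with a missing class variable (which is how the table above ends up with 5049 rows).

```sas
/* Percent of each cholesterol status relative to the overall total.   */
/* Assumption: excluding rows with a missing chol_status or            */
/* smoking_status reproduces the observations PROC TABULATE kept.      */
proc freq data=sashelp.heart;
    where not missing(chol_status) and not missing(smoking_status);
    tables chol_status / nocum;
run;
```

The Percent column from this step can then be merged into the data set output by PROC TABULATE before re-printing it.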

Rolling Window Model for Unbalanced Dataset in SAS

I have an unbalanced panel dataset of the following form (simplified):
data have;
input ID YEAR EARN LAG_EARN;
datalines;
1 1960 450 .
1 1961 310 450
1 1962 529 310
2 1978 10 .
2 1979 15 10
2 1980 8 15
2 1981 10 8
2 1982 15 10
2 1983 8 15
2 1984 10 8
3 1972 1000 .
3 1973 1599 1000
3 1974 1599 1599
;
run;
I now want to estimate the following model for each ID:
proc reg data=have;
by ID;
model EARN = LAG_EARN;
run;
However, I want to do this for rolling windows of some size. Say for example for windows of size 2. The window should only contain non-empty observations. For example, in the case of firm A, the window is applicable from 1961 onwards and thus only one time (since only one year follows after 1961 and the window is supposed to be of size 2).
Finally, I want to get a table with year columns and firm rows. The table should indicate the following: The regression model (with window size 2) has been performed one time for firm A. The quantity of available years, has only allowed one estimation of this model. Put differently, in 1962 the coefficient of the regression model has a value of X based on the 2 year prior window. Applying the same logic to the other two firms, one can get the following table. "X" representing the respective estimated coefficient value in certain year for firm A/B/C based on the 2-year window and "n" indicating the non-existence of such a value:
data want;
input ID 1962 1974 1980 1981 1982 1983 1984;
datalines;
1 X n n n n n n
2 n n X X X X X
3 n X n n n n n
;
run;
I do not know how to execute this. Furthermore, I would like to create a macro that allows me to estimate different rolling window models while still creating analogous output dataframes. I would appreciate any help with it, since I have been struggling quite some time now.
Try this macro. It only outputs observations where the lag you specify is non-missing.
%macro lag(data=, out=, window=);
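data _want_;
set &data.;
by ID;
/* LAG must execute on every row, so compute it unconditionally */
LAG_EARN = lag&window.(earn);
/* count rows within each ID so a lag never reaches into the previous ID */
if first.ID then _n_in_id = 0;
_n_in_id + 1;
if _n_in_id <= &window. then call missing(lag_earn);
if not missing(lag_earn);
drop _n_in_id;
run;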
proc sort data=_want_;
by year id;
run;
proc transpose data=_want_
out=&out.(drop=_NAME_);
by ID notsorted;
id year;
var lag_earn;
run;
proc sort data=&out.;
by id;
run;
%mend;
%lag(data=have, out=want, window=1);
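The macro above reshapes the lagged values; it does not run the regressions themselves. A possible sketch of the rolling estimation, under stated assumptions, is to stack one copy of the data per window and let PROC REG process the windows as BY groups. The names windows, est, END_YEAR and the Y column prefix are made up for illustration, and the join assumes consecutive years within each ID.

```sas
%let window = 2;   /* window size in years (assumed consecutive) */

/* One block of rows per (ID, window end year); keep complete windows only. */
/* The GROUP BY with non-summary columns relies on PROC SQL remerging.      */
proc sql;
    create table windows as
    select a.ID, a.YEAR as END_YEAR, b.YEAR, b.EARN, b.LAG_EARN
    from have as a, have as b
    where a.ID = b.ID
      and b.YEAR between a.YEAR - &window. + 1 and a.YEAR
      and not missing(b.LAG_EARN)
    group by a.ID, a.YEAR
    having count(*) = &window.
    order by a.ID, a.YEAR, b.YEAR;
quit;

/* One regression per window; OUTEST= stores the slope in a variable */
/* named after the regressor (LAG_EARN).                             */
proc reg data=windows outest=est noprint;
    by ID END_YEAR;
    model EARN = LAG_EARN;
run;
quit;

/* Firms as rows, window end years as columns (Y1962, Y1974, ...). */
proc transpose data=est out=want(drop=_NAME_) prefix=Y;
    by ID;
    id END_YEAR;
    var LAG_EARN;
run;
```

With a window of 2 each regression fits two points exactly, so a larger window is needed before the coefficients mean much.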

How to do weighting in regression in SAS?

I've set up a table with age and average spending by age. Age is my dependent variable. In my dataset, I have a lot of members at age 21, so I need to put more weight on it when I run a regression in SAS. I'm new to SAS. I have used the regression button, but have not written any code. Is there another built-in button for weighting? Or how would you do this?
Age Ave Spending Total Members
20 $100 35
21 $80 85
22 $75 20
You didn't specify which SAS product you use, but if you use SAS Enterprise Guide, the "Tasks > Regression > Linear Regression" menu gives a "relative weight" option where you can specify Total Members.
If you want to do this programmatically, here is a short example:
DATA regdata;
/* list input: values are separated by blanks */
INPUT Age Ave_spending total_members;
DATALINES;
20 100 35
21 80 85
22 75 20
;
RUN;
PROC REG DATA=regdata;
WEIGHT total_members;
MODEL Age = Ave_spending;
RUN;
The "Relative Weight" option translates into the WEIGHT statement you see in the code above.

SAS: Count number of a particular type of disease with patient data on multiple lines

I have large dataset of a few million patient encounters that include a diagnosis, timestamp, patientID, and demographic information.
We have found that a particular type of disease is frequently comorbid with a common condition.
I would like to count the number of this type of disease that each patient has, and then create a histogram showing how many people have 1,2,3,4, etc. additional diseases.
This is the format of the data.
PatientID Diagnosis Date Gender Age
1 282.1 1/2/10 F 25
1 282.1 1/2/10 F 87
1 232.1 1/2/10 F 87
1 250.02 1/2/10 F 41
1 125.1 1/2/10 F 46
1 90.1 1/2/10 F 58
2 140 12/15/13 M 57
2 282.1 12/15/13 M 41
2 232.1 12/15/13 M 66
3 601.1 11/19/13 F 58
3 231.1 11/19/13 F 76
3 123.1 11/19/13 F 29
4 601.1 12/30/14 F 81
4 130.1 12/30/14 F 86
5 230.1 1/22/14 M 60
5 282.1 1/22/14 M 46
5 250.02 1/22/14 M 53
Generally, I was thinking of a DO loop, but I'm not sure where to start because there are duplicates in the dataset, like with patient 1 (282.1 is listed twice). I'm not sure how to account for that. Any thoughts?
Target diagnoses to count would be 282.1, 232.1, 250.02. In this example, patient 1 would have a count of 3, patient 2 would have 2, etc.
Edit:
This is what I have used, but each PatientID appears on multiple lines in the output.
PROC SQL;
create table want as
select age, gender, patientID,
count(distinct diagnosis_description) as count
from dz_prev
where diagnosis in (282.1, 232.1)
group by patientID;
quit;
This is what the output table looks like. Why is this patientID showing up so many times?
Obs AGE GENDER PATIENTID count
1 55 Male 107828695 1
2 54 Male 107828695 1
3 54 Male 107828695 1
4 54 Male 107828695 1
5 54 Male 107828695 1
If you include variables that are neither grouping variables nor summary statistics, then SAS will happily re-merge your summary statistics back with all of the source records. That is why you are getting multiple records. AGE can vary if your dataset covers many years, and GENDER can also vary if your data is messy. So for a quick analysis you might try something like this.
proc sql;
create table want as
select patientID
, min(age) as age_at_onset
, min(gender) as gender
, count(distinct diagnosis_description) as count
from dz_prev
where diagnosis in (282.1, 232.1)
group by patientID
;
quit;
I think you can get what you want with an SQL statement:
PROC SQL NOPRINT;
create table want as
select PatientID,
count(distinct Diagnosis) as count
from have
where Diagnosis in (282.1, 232.1, 250.02)
group by PatientID;
quit;
This filters to only the diagnoses you are interested in, counts the distinct diagnoses seen for each PatientID (so duplicates are counted once), and saves the results to a new table.
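For the histogram part of the question, one possible follow-up is a bar chart of how many patients fall at each count. This is only a sketch and assumes the table want and its column count were created by the query above.

```sas
/* One bar per distinct value of count; bar height = number of patients. */
/* Assumes WORK.WANT has one row per PatientID with a variable count.    */
proc sgplot data=want;
    vbar count;
run;
```

Patients with none of the target diagnoses are filtered out by the WHERE clause, so they will not appear as a zero bar.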

How to subset automatically in SAS?

I am new to SAS, so this might be a silly type of question.
Assume there are several datasets with similar structure but different column names. I want to get new datasets with the same number of rows but only a subset of columns.
In the following example, A_auto and B_auto are the original datasets and SubA and SubB are what I want. What is an efficient way of deriving SubA and SubB?
DATA A_auto;
LENGTH A_make $ 20;
INPUT A_make $ 1-17 A_price A_mpg A_rep78 A_hdroom A_trunk A_weight A_length A_turn A_displ A_gratio A_foreign;
CARDS;
AMC Concord       4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer         4749 17 3 3.0 11 3350 173 40 258 2.53 0
Audi Fox          6295 23 3 2.5 11 2070 174 36  97 3.70 1
;
RUN;
DATA B_auto;
LENGTH B_make $ 20;
INPUT B_make $ 1-17 B_price B_mpg B_rep78 B_hdroom B_trunk B_weight B_length B_turn B_displ B_gratio B_foreign;
CARDS;
Toyota Celica     5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla    3748 31 5 3.0  9 2200 165 35  97 3.21 1
VW Scirocco       6850 25 4 2.0 16 1990 156 36  97 3.78 1
;
RUN;
DATA SubA;
set A_auto;
keep A_make A_price;
RUN;
DATA SubB;
set B_auto;
keep B_make B_price;
RUN;
Here's my new answer. This introduces quite a few concepts, but all are necessary to complete this task.
First of all, I would store the required partial variable names (the suffixes that are common to all datasets) in a new dataset. This keeps them all in one place and makes them easier to change if required.
The next step is to create a regular expression (regex) search string that combines all the names, separated by a pipe (|), which is the regex symbol for "or". I've also added a $ symbol to the end of each name, which ensures only variables ending with the partial names will be selected.
select into :[macroname] is the method for creating macro variables within proc sql.
Then I set up a macro to extract the specific variable names for the current dataset and use those names to create a view (like my original answer)
The dictionary library referenced in the proc sql is a metadata library that contains information on all active libraries, tables, columns etc, so is a good source of identifying what the actual variable names are called (based on the regex search string created earlier).
You won't need the proc print in your code, I just put it in to show everything is working as expected.
Let me know if this works for you
/* create initial datasets */
DATA A_auto;
LENGTH A_make $ 20;
INPUT A_make $ 1-17 A_price A_mpg A_rep78 A_hdroom A_trunk A_weight A_length A_turn A_displ A_gratio A_foreign;
CARDS;
AMC Concord       4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer         4749 17 3 3.0 11 3350 173 40 258 2.53 0
Audi Fox          6295 23 3 2.5 11 2070 174 36  97 3.70 1
;
RUN;
DATA B_auto;
LENGTH B_make $ 20;
INPUT B_make $ 1-17 B_price B_mpg B_rep78 B_hdroom B_trunk B_weight B_length B_turn B_displ B_gratio B_foreign;
CARDS;
Toyota Celica     5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla    3748 31 5 3.0  9 2200 165 35  97 3.21 1
VW Scirocco       6850 25 4 2.0 16 1990 156 36  97 3.78 1
;
RUN;
/* create dataset containing partial name of variables to keep */
data keepvars;
input part_name :$20.;
datalines;
_make
_price
;
run;
/* create regular expression search string from partial names */
proc sql noprint;
select
cats(part_name,'$') /* '$' matches end of string */
into
:name_str separated by '|' /* '|' is an 'or' search operator in regular expressions */
from
keepvars;
quit;
%put &name_str.; /* print search string to log */
/* macro to create views from datasets */
%macro create_views (dsname, vwname); /* inputs are dataset name being read in and view name being created */
/* extract specific variable names to be kept, based on search string */
proc sql noprint;
select
name
into
:vars separated by ' '
from
dictionary.columns
where
libname = 'WORK'
and memname = upper("&dsname.")
and prxmatch("/&name_str./",strip(name))>0; /* prxmatch is regular expression search function */
quit;
%put &vars.; /* print variables to keep to log */
/* create views */
data &vwname. / view=&vwname.;
set &dsname. (keep=&vars.);
run;
/* test view by printing */
proc print data=&vwname.;
run;
%mend create_views;
/* run macro for each dataset */
%create_views(A_auto, SubA);
%create_views(B_auto, SubB);
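To see what the macro's lookup works from, the metadata can be queried directly. This is just an illustrative sketch that lists the columns SAS has recorded for A_auto, assuming the dataset lives in the WORK library as in the code above.

```sas
/* List the column metadata the macro searches with prxmatch.         */
/* LIBNAME and MEMNAME are stored in upper case; NAME keeps the case  */
/* the variable was created with (e.g. A_make).                       */
proc sql;
    select memname, name, type, length
    from dictionary.columns
    where libname = 'WORK'
      and memname = 'A_AUTO';
quit;
```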