Percentage calculation based on multiple columns SAS - sas

I have a data with patientID and the next column with illness, which has more than one category seperated by commas. I need to find the total number of patients per each illness category and the percentage of patients per category. I tried the normal way, it gives the frequency correct but not the percent.
The data looks like this.
ID Type_of_illness
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
where the empty spaces represent no illness. I first separated the illnesses into separate columns, but then got stuck there not knowing how to process to find out the percent.
The code I wrote is as below:
Data new;
set old;
array P(3) L1 L2 L3;
do i to dim(p);
p(i)=scan(type_of_illness,i,',');
end;
run;
Then I created a new column to copy all the illnesses to it so I thought it would give me the correct frequency, but it did not give me the correct percent.
data new;
set new;
L=L1;output;
L=L2;output;
L=L3;output;
run;
proc freq data=new;
tables L;run;
I have to create a table something like
*Total numer of patients Percent*
.......................................
lf5
lf7
lf6
lf11
lf12
lf13
Please help.

You're trying to output percentages on non-mutually exclusive groups (each illness). It isn't obvious in SAS how to do this.
The following takes Joe's input code but takes an alternative route in determining percentages from event data (a 'long' dataset, if you will). I prefer this to creating a binary variable for each illness at the patient level (a 'wide' dataset) as, for me, this soon gets unwieldy. That said, if you then go on to do some modelling then a 'wide' dataset is usually more useful.
The following code produces output as follows:-
| | Pats | Pats | | | Mean | | |
| | with 0 |with 1+ | % with | Num | events | | |
| |records | record | record | Events |per pat |Std Dev | Median |
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf11 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf12 | 19| 7| 27| 7| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf13 | 24| 2| 8| 2| 1.0| 0.00| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf16 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf17 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf5 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf6 | 25| 1| 4| 1| 1.0| .| 1|
|-----------------------|--------|--------|--------|--------|--------|--------|---------
|lf7 | 24| 2| 8| 2| 1.0| 0.00| 1|
---------------------------------------------------------------------------------------|
Note that patient 5 is repeated in your data for illness lf5. My code only counts this record once. This is fine if a chronic illness but not if acute. Also, my code includes patients in the denominator who do not have an event.
Finally, you can see another example of this code using dates - with test data - here at the mycodestock.com code sharing site => https://mycodestock.com/public/snippet/11251
Here's the code for the table above:-
options nodate nonumber nocenter pageno=1 obs=max nofmterr ps=52 ls=100 formchar="|----||---|-/\<>*";
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
proc sort;
by id;
run;
** Create patient level data;
proc sort data = have(keep = id) out = pat_data nodupkey;
by id;
run;
** Create event table (1 row per patient*event);
** NOTE: Patients without events are dropped (as is usual in events data);
data events(drop = i type_of_illness);
set have;
attrib grp length = $5 label = 'Illness';
do i = 1 to countc(type_of_illness, ',') + 1;
grp = scan(type_of_illness, i, ',');
if grp ne '' then output;
end;
run;
** Count the number of events each patient had for each grp;
** NOTE: The NODUPKEY in the PROC SORT remove duplicate records (within PAT & GRP);
** NOTE: The use of CLASSDATA and COMPLETETYPES ensures zero counts for all patients and grps;
proc sort in = events out = perc2_summ_grp_pat nodupkey;
by grp id;
proc summary data = perc2_summ_grp_pat nway missing classdata = pat_data completetypes;
by grp;
class id;
output out = perc2_summ_grp_pat(rename=(_freq_ = num_events) drop=_type_);
run;
** Add a denominator variable - value '1' for each row.;
** Ensure when num_events = 0 the value is set to missing;
** Create a flag variable - set to 1 - if a patient has a record (no matter how many);
data perc2_summ_grp_pat;
set perc2_summ_grp_pat;
denom = 1;
if num_events = 0 then num_events = .;
flg_scripts = ifn(num_events, 1, .);
run;
proc tabulate data = perc2_summ_grp_pat format=comma8.;
title1 bold "Table 1: N, % and basic statistics of events within non-mutually exclusive groups";
title2 "Units: Patients - within each group level";
title3 "The statistics summarises the number of events (not whether a patient had at least 1 event)";
title4 "This means, for the statistics, only patients with 1+ record are included in the denominator";
class grp;
var denom flg_scripts num_events;
table grp='', flg_scripts=''*(nmiss='Pats with 0 records' n='Pats with 1+ record' pctsum<denom>='% with record')
num_events=''*(sum='Num Events' mean='Mean events per pat'*f=8.1 stddev='Std Dev'*f=8.2 p50='Median');
run; title; footnote;

You're going about this right, but you need to pick percent differently. Normally percent is 'percent of whole dataset', which means that it is going to triplicate your base. You want the percent based to the illness. This means you need a 1/0 for each illness.
The one downside is you have the 0's in your automatic tables; you would have to output the table to a dataset and remove them, then proc print/report the resulting dataset to get the 1's only - or use PROC SQL to generate the table.
data have;
format type_of_illness $30.;
infile datalines truncover;
input ID Type_of_illness $;
datalines;
4 lf13
5 lf5,lf11
63
13 lf12
85
80
15
20
131 lf6,lf7,lf12
22
24
55 lf12
150 lf12
34 lf12
49 lf12
151 lf12
60
74
88
64
82 lf13
5 lf5,lf7
112
87 lf17
78
79 lf16
83 lf11
;;;;
run;
data want;
set have;
array L[8] lf5-lf7 lf11-lf13 lf16 lf17;
do _t = 1 to dim(L);
if find(type_of_illness,trim(vname(L[_t]))) then L[_t]=1;
else L[_t]=0;
end;
run;
proc tabulate data=want;
class lf:;
tables lf:,n pctn;
run;

The multilabel format solution is interesting, so I present it separately.
Using the same have, we create a format that takes every combination of illnesses and outputs a row for each illness in it, ie, if you have "1,2,3", it outputs rows
1,2,3 = 1
1,2,3 = 2
1,2,3 = 3
Enabling multilabel formats and using a class-enabled proc like proc tabulate, you can then use this to allow each respondent to count in each of the label values, but not be counted more than once against the total.
data for_procformat;
set have;
start=type_of_illness; *start is the input to the format;
hlo=' m'; *m means multilabel, adding a space
here to leave room for the o later;
type='c'; *character format - n is numeric;
fmtname='$ILLF'; *whatever name you like;
do _t = 1 to countw(type_of_illness,','); *for each 'word' do this once;
label=scan(type_of_illness,_t,','); *label is the 'result' of the format;
if not missing(label) then output;
end;
if _n_=1 then do; *this block adds a row to deal with values;
hlo='om'; *not defined (in this case, just missings);
label='No Illness'; *the o means 'other';
output;
end;
run;
proc sort data=for_procformat nodupkey; *remove duplicates (which there will be many);
by start label;
run;
proc format cntlin=for_procformat; *import the formats;
quit;
proc tabulate data=have;
class type_of_illness/mlf missing ; *mlf means multilabel formats;
format type_of_illness $ILLF.; *apply said format;
tables type_of_illness,n pctn; *and your table;
run;

Related

How to create 2 new columns with appropriate prefix based on values in columns with same prefix in SAS Enterprise Guide / PROC SQL?

I have table in SAS Enterprise Guide like below:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B
-----|-------------|-------------|-----------|------------
111 | 10 | 10 | 320 | 120
222 | 15 | 80 | 500 | 500
333 | 1 | 5 | 110 | 350
444 | 20 | 5 | 670 | 0
Requirements:
I need to create new column "TOP_COUNT" where will be name of column (COUNT_COL_A or COUNT_COL_B) with the highest value per each ID,
if some ID has same values in both "COUNT_" columns take to "TOP_COUNT" column name which has higher value in its counterpart with prefix SUM_ (SUM_COL_A or SUM_COL_B)
I need to create new column "TOP_SUM" where will be name of column (SUM_COL_A or SUM_COL_B) with the highest value per each ID,
if some ID has same values in both "SUM_" columns take to "TOP_SUM" column name which has higher value in its counterpart with prefix COUNT_ (COUNT_COL_A or COUNT_COL_B)
It is not possible to have only 0 in columns with prefix _COUNT or only 0 in columns with prefix _SUM
There is not null in table
Desire output:
ID | COUNT_COL_A | COUNT_COL_B | SUM_COL_A | SUM_COL_B | TOP_COUNT | TOP_SUM
-----|-------------|-------------|-----------|------------|-------------|---------
111 | 10 | 10 | 320 | 120 | COUNT_COL_A | SUM_COL_A
222 | 15 | 80 | 500 | 500 | COUNT_COL_B | SUM_COL_B
333 | 1 | 5 | 110 | 350 | COUNT_COL_B | SUM_COL_B
444 | 20 | 5 | 670 | 0 | COUNT_COL_A | SUM_COL_A
How can i do that in SAS Enterprise Guide or in PROC SQL ?
Use an array with loops methodology:
Declare an array of the count variables
Set the maximum value to 0
Loop through the array
Check if each value is more than current
maximum
If yes, assign value to current maximum value and store name
If no, keep looping
Non looping, function methodology:
Use MAX to find the maximum value of the array
Use WHICHN() to find the location of the array
Use VNAME to get the variable name based on the location
*for count - you can extend for max;
data want;
set have;
array _count(*) count_col_:;
*looping methodology;
top_count_value=0;
do i=1 to _count;
if _count(i) > top_count_value then do;
top_count = vname(_count(i));
top_count_value = _count(i);
end;
end;
/*or function methodology*/
top_count_max = max(of _count(*));
index_top_count = whichn(top_count_max, of _count(*));
top_count_name_2 = vname(_count(index_top_count);
run;
Just do the same thing as your other question. But because you want to transpose two sets of variable it is probably going to be easier to a data step and arrays to do the first transform.
data tall;
set have;
array counts count_col_a count_col_b;
array sums sum_col_a sum_col_b;
do index=1 to dim(sums);
length type $5 name $32 ;
type='COUNT';
name=vname(counts[index]);
value1=counts[index];
value2=sums[index];
output;
type='SUM';
name=vname(sums[index]);
value1=sums[index];
value2=counts[index];
output;
end;
run;
Now sort and take the last per ID/TYPE combination to find the largest.
proc sort;
by id type value1 value2 name;
run;
data top;
set tall;
by id type value1 value2;
if last.type;
run;
And then transpose and re-merge.
proc transpose data=top out=want(drop=_name_) prefix=TOP_;
by id;
id type;
var name;
run;
data want;
merge have want;
by id;
run;
Result:
COUNT_ COUNT_ SUM_ SUM_
Obs ID COL_A COL_B COL_A COL_B TOP_COUNT TOP_SUM
1 111 10 10 320 120 COUNT_COL_A SUM_COL_A
2 222 15 80 500 500 COUNT_COL_B SUM_COL_B
3 333 1 5 110 350 COUNT_COL_B SUM_COL_B
4 444 20 5 670 0 COUNT_COL_A SUM_COL_A

Group rows in PROC TABULATE

I have the following (fake) crime data of offenders:
/* Some fake-data */
DATA offenders;
INPUT id :$12. crime :4. offenderSex :$1. count :3.;
INFORMAT id $12.;
INFILE DATALINES DSD;
DATALINES;
1,110,f,3
2,32,f,1
3,31,m,1
4,113,m,1
5,110,m,1
6,31,m,1
7,31,m,1
8,110,f,2
9,113,m,1
10,31,m,1
11,113,m,1
12,110,f,1
13,32,m,1
14,31,m,1
15,31,m,1
16,31,m,1
17,110,f,2
18,113,m,2
19,31,m,1
20,31,m,1
21,110,m,4
22,32,f,1
23,31,m,1
24,31,m,1
25,110,f,4
26,110,m,1
27,110,m,1
28,110,m,2
29,32,m,1
30,113,f,1
31,32,m,1
32,31,f,1
33,110,m,1
34,32,f,1
35,113,m,2
36,31,m,1
37,113,m,1
38,110,f,1
39,113,u,2
;
RUN;
proc format;
value crimes 110 = 'Theft'
113 = 'Robbery'
32 = 'Assault'
31 = 'Minor assault';
run;
I want to create a cross table using PROC TABULATE:
proc tabulate;
format crime crimes.;
freq count;
class crime offenderSex;
table crime="Type of crime", offenderSex="Sex of the offender" /misstext="0";
run;
This gives me a table like this:
m f
------------------------------------
Minor assault |
Assault |
Theft |
Robbery |
Now, I'd like to group the different types of crimes:
'Assault' and 'minor assault' should be in a category "Violent crimes" and 'theft' and 'robbery' should be in a category "Crimes against property":
m f
------------------------------------
Minor assault |
Assault |
*Total violent crimes* |
Theft |
Robbery |
*Total property crimes* |
Can anyone explain me how to do this? I tried to use another format for the 'crime'-variable and use "category * crime" within PROC TABULATE, but then it turned out like this, which is not exactly what I want:
m f
-------------------------------------------------------
Violent crimes Minor assault |
Assault |
Property crimes Theft |
Robbery |
Use the all= option within a table dimension :
table group='Category' * (crime="Type of crime" All='Total'), offenderSex="Sex of the offender" /misstext="0";

Sgplot for multiple response variables

I have a dataset with over 50,000 records that looks like the following; ID, season(either high or low), bed_time and triage_time in minutes:
ID Season Bed_time Triage_time
1 high 34 68
2 low 44 20
3 high 90 14
4 low 71 88
5 low 27 54
I would like to create a GROUPED VERTICAL BAR CHART where both then Bed_time Triage_time are reflected as a median and grouped by season on the X-axis:
| | |
| | | |
| | | | |
|-------------------------------
BT TT BT TT
High Low
I reckon I have to transpose the data and then plug into an SGPLOT, but I'm not quite sure how to do that to ensure that data can then be graphed.
proc sgplot data=mysas.projects;
vbar season/ stat=median
group=[Bed_time Triage_time] /*NEED FROM TRANSPOSED DATA*/
groupdisplay=cluster;
run;
quit;
Indeed, you will need to reshape data from wide to long to use group call. Additionally, you will need to include a response call. Consider proc transpose for reshaping:
*** POSTED DATA
data time_data;
length Season $ 5;
input ID Season $ Bed_time Triage_time;
cards;
1 high 34 68
2 low 44 20
3 high 90 14
4 low 71 88
5 low 27 54
;
run;
*** RESHAPE LONG TO WIDE;
proc transpose
data = time_data
out = time_data_long
name = time_group;
by ID Season;
run;
*** CLEAN UP OUTPUT;
data time_data_long;
set time_data_long;
label time_group = "Time Group";
rename col1 = value;
run;
proc sgplot data=time_data_long;
vbar season / response=value stat=median
group = time_group
groupdisplay=cluster;
run;

Proc sql subquery based on nonexisitng column returns not null

Here is a sample code that was derived from actual application. There are two datasets - "aa" for a query and "bb" for subquery. Column "m" from datasets "aa" matches column "y" from datasets "bb". Also, there is "yy" column on "aa" table has a value of 30. Column "m" from datasets "aa" contains value "30" in one of its rows, and column "y" from datasets "bb" does not. First proc sql uses values from "y" column of "bb" table to subset table "aa" based on matching values in column "m". It is a correct query and produces results as expected. Second proc sql block has column "y" intentionally misspelled as "yy" in subquery in a row that stars with where statement. Otherwise the whole proc sql block is the same as the first one. Given that there is no column "yy" on dataset bb, I would expect an error message to appear and the whole query to fail. However, it does return one row without failing or error messages. Closer look would suggest that it actually uses "yy" column from table "aa" (see tree in the log output). I do not think this is a correct behavior. If you would have some comments or explanations, I would greatly appreciate it. Otherwise, I maybe should report it to SAS as a bug. Thank you!
Here is the code:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
yy=30;
output;
end;
run;
data bb;
do i=10 to 20;
y=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Correct sql command*/
proc sql _method
_tree
;
create table cc as
select *
from aa
where m in (select y from bb)
;quit;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select yy from bb)
;quit;
Here is log with sql tree:
119 options
120 msglevel = I
121 ;
122 data aa;
123 do i=1 to 20;
124 m=i*5;
125 yy=30;
126 output;
127 end;
128 run;
NOTE: The data set WORK.AA has 20 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
129
130 data bb;
131 do i=10 to 20;
132 y=i*5;
133 output;
134 end;
135 run;
NOTE: The data set WORK.BB has 11 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
136 option DEBUG=JUNK ;
137
138 /*Correct sql command*/
139 proc sql _method
140 _tree
141 ;
142 create table cc as
143 select *
144 from aa
145 where m in (select y from bb)
146 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-V-(bb.y:2 flag=0001)
| | /-OBJ----|
| | /-SRC----|
| | | \-TABL[WORK].bb opt=''
| \-SUBC---|
--SSEL---|
NOTE: Table WORK.CC created, with 11 rows and 3 columns.
146! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
147
148
149 /*Incorrect sql command - column "yy" in not on "bb" table"*/
150 proc sql _method
151 _tree;
152 create table dd as
153 select *
154 from aa
155 where m in (select yy from bb)
156 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=
0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
--SSEL---|
NOTE: Table WORK.DD created, with 1 rows and 3 columns.
156! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Here are datasets:
aa:
i m yy
1 5 30
2 10 30
3 15 30
4 20 30
5 25 30
6 30 30
7 35 30
8 40 30
9 45 30
10 50 30
11 55 30
12 60 30
13 65 30
14 70 30
15 75 30
16 80 30
17 85 30
18 90 30
19 95 30
20 100 30
bb:
i y
10 50
11 55
12 60
13 65
14 70
15 75
16 80
17 85
18 90
19 95
20 100
I agree, this looks pretty weird and may well be a bug. I was able to reproduce this from the code you provided in SAS 9.4 and in SAS 9.1.3, which would make it at least ~12 years old.
In particular, I'm interested in this bit of the output you got from the _method option when creating the DD table but not when creating the CC table:
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps <--- What is this doing?
sqxsrc( WORK.BB )
Similarly, the corresponding section from the _tree output is highly obscure:
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag= 0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
I have never seen sqxreps or reps in the respective bits of output before. Neither of them is listed in any of the papers I was able to find based on a brief bit of googling (in fact, this question is currently the only hit on Google for sas + sqxreps):
http://support.sas.com/resources/papers/proceedings10/139-2010.pdf
http://www2.sas.com/proceedings/sugi30/101-30.pdf
Quoting the first of these:
Codes Description
sqxcrta Create table as Select
Sqxslct Select
sqxjsl Step loop join (Cartesian)
sqxjm Merge join
sqxjndx Index join
sqxjhsh Hash join
sqxsort Sort
sqxsrc Source rows from table
sqxfil Filter rows
sqxsumg Summary stats with GROUP BY
sqxsumn Summary stats with no GROUP BY
Based on a bit of quick testing, this seems to happen regardless of the variable and tables names used, provided that the variable name from AA is repeated multiple times in the subquery referencing table BB. It also happens if you have a variable named e.g. YYY in AA but one named YY in BB, or more generally whenever you have a variable in BB whose name is initially the same as the name of the corresponding variable in AA but then continues for one or more characters.
From this, I'm guessing at some point in the SQL parser, someone used a like operator rather than checking for equality of variable names, and somehow as a result this syntax is triggering an undocumented or incomplete 'feature' in proc sql.
An example of the more general case:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
myvar_plus_suffix=30;
output;
end;
run;
data bb;
do i=10 to 20;
myvar=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select myvar_plus_suffix from bb)
;quit;
Here is a response from SAS support.
What you are seeing is related to column scoping in PROC SQL.
PROC SQL supports Corellated Subqueries. A Correlated Subquery references a column in the "outer" table which can then be compared to columns in the "inner" table. PROC SQL does not require that a fully qualified column name is used. As a result, if it sees a column in the subquery that does not exist in the inner table (the table referenced in the subquery), it looks for that column in the "outer" table and uses the value if it finds one.
If a fully qualified column name is used, the error you are expecting will occur such as the following:
proc sql;
create table dd as
select *
from aa as outer
where outer.m in (select inner.yyy from bb as inner);
quit;

How to output R square and RMSE in SAS

I regress Y on X. The regression is grouped by DATE.
proc reg data=A noprint;
model Y=X;
by DATE;
run;
In the regression result for each DATE, there is a R square and a RMSE value.
How do I output a table of R square and RMSE values as below. Thank you!
|------------|-----------|------------|
| Date | R Square | RMSE |
|------------|-----------|------------|
| 1/1/2009 | R1 | RMSE1 |
| 2/1/2009 | R2 | RMSE2 |
| 3/1/2009 | R3 | RMSE3 |
| .... | .... | .... |
| 10/1/2010 | R22 | RMSE22 |
| 11/1/2010 | R23 | RMSE23 |
| 12/1/2010 | R24 | RMSE24 |
|------------|-----------|------------|
If you look at one of the examples of the SAS PROC REG, this is pretty easy to do.
The example data:
data htwt;
input sex $ age :3.1 height weight ##;
datalines;
f 143 56.3 85.0 f 155 62.3 105.0 f 153 63.3 108.0 f 161 59.0 92.0
f 191 62.5 112.5 f 171 62.5 112.0 f 185 59.0 104.0 f 142 56.5 69.0
f 160 62.0 94.5 f 140 53.8 68.5 f 139 61.5 104.0 f 178 61.5 103.5
f 157 64.5 123.5 f 149 58.3 93.0 f 143 51.3 50.5 f 145 58.8 89.0
m 164 66.5 112.0 m 189 65.0 114.0 m 164 61.5 140.0 m 167 62.0 107.5
m 151 59.3 87.0
;
run;
Now, the PROC REG. Pay particular attention to the OUTEST statement; this is going to output a dataset est1 that will contain what you want.
proc reg outest=est1 outsscp=sscp1 rsquare;
by sex;
eq1: model weight=height;
eq2: model weight=height age;
proc print data=sscp1;
title2 'SSCP type data set';
proc print data=est1;
title2 'EST type data set';
run;
quit;
So now, the last PROC PRINT of est1 can be easily routed to contain only what you want:
proc print data=est1;
var sex _RMSE_ _RSQ_;
run;
You can use this code:
proc reg data=A noprint;
model Y=X;
by DATE;
ods output FitStatistics=final_table;
run;
data final_table; set final_table;
rename cValue1=RMSE;
rename cValue2=RSquared;
proc print data=final_table;
where Label1='Root MSE';
var DATE RSquared RMSE;
run;
I simply turned the trace on and took the name of the table I was interested in and saved it as the output of proc reg.
ODS Trace on;
proc reg data=A noprint;
model Y=X;
by DATE;
run;
ODS Trace off;
this prints out the name of all the output tables in the log file.