I have a table in SAS Enterprise Guide like the one below:
COL1 | COL2 | COL3
-----|-------|------
111 | A | C
111 | B | C
222 | A | D
333 | A | D
I need to aggregate the above table to count how many times each value occurs in each column, so as to get something like this:
COL2_A | COL2_B | COL3_C | COL3_D
--------|--------|--------|--------
3 | 1 | 2 | 2
Because:
COL2_A = 3, because in COL2 the value "A" occurs 3 times
and so on...
How can I do that in SAS Enterprise Guide or in PROC SQL?
I need the output as a SAS dataset.
Try this
data have;
input COL1 COL2 $ COL3 $;
datalines;
111 A C
111 B C
222 A D
333 A D
;
data long;
set have;
array col COL2 COL3;
do over col;
c = col;
n = cats(vname(col), '_', c);
output;
end;
run;
proc summary data = long nway;
class n;
output out = freq(drop = _TYPE_);
run;
proc transpose data = freq out = wide_freq(drop = _:);
id n;
run;
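Since the question also mentions PROC SQL, here is a minimal sketch of that route. It assumes the distinct values (A, B, C, D) are known in advance and hard-codes one output column per value; the output table name want_sql is only an illustration. In SAS a comparison evaluates to 1 or 0, so summing it counts the matching rows.

```sas
/* Sketch only: assumes the values A, B, C and D are known up front. */
proc sql;
  create table want_sql as
  select sum(COL2 = 'A') as COL2_A,  /* rows where COL2 is A */
         sum(COL2 = 'B') as COL2_B,  /* rows where COL2 is B */
         sum(COL3 = 'C') as COL3_C,  /* rows where COL3 is C */
         sum(COL3 = 'D') as COL3_D   /* rows where COL3 is D */
  from have;
quit;
```

If the set of values is not known beforehand, the array and PROC TRANSPOSE approach is the more general one, because it derives the column names from the data.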
I have a dataset with several variables like the one below:
Data have(drop=x);
call streaminit(1);
do x = 1 to 20 by 1;
if x < 11 then group = 'A';
else group = 'B';
var1 = rand('normal',0,1);
var2 = rand('uniform');
output;
end;
Run;
In my analysis I need to get some summary stats using PROC MEANS and output the results for each variable into one dataset. I tried doing it with the code below, but it only includes stats from the first variable in the dataset. How can I output the remaining variables into the same dataset?
Proc means data=have n sum mean;
By group;
Output out=want(drop=_freq_ _type_) n=n sum=sum mean=mean;
Run;
Output:
+-------+----+----------+----------+
| group | n | sum | mean |
+-------+----+----------+----------+
| A | 10 | 4.517081 | 0.451708 |
+-------+----+----------+----------+
| B | 10 | -0.77369 | -0.07737 |
+-------+----+----------+----------+
Desired output:
+----------+-------+----+----------+----------+
| variable | group | n | sum | mean |
+----------+-------+----+----------+----------+
| var1 | A | 10 | 4.517081 | 0.451708 |
+----------+-------+----+----------+----------+
| var1 | B | 10 | -0.77369 | -0.07737 |
+----------+-------+----+----------+----------+
| var2 | A | 10 | 7.947089 | 0.794709 |
+----------+-------+----+----------+----------+
| var2 | B | 10 | 5.003049 | 0.500305 |
+----------+-------+----+----------+----------+
You requested SAS to name the count n, the sum sum and the mean mean.
It can only do that for one variable.
This is the syntax to ask SAS to use different names for the statistics of each variable:
Output out=want(drop=_freq_ _type_)
n(var1 var2)=n1 n2
sum(var1 var2)=sum1 sum2
mean(var1 var2)=mean1 mean2;
To get that output you will need to transpose the data. Either transpose beforehand and add the _NAME_ variable to the BY or CLASS statement:
data have;
call streaminit(1);
do x = 1 to 20 by 1;
if x < 11 then group = 'A';
else group = 'B';
var1 = rand('normal',0,1);
var2 = rand('uniform');
output;
end;
run;
proc transpose data=have out=tall;
by group x;
run;
proc means data=tall nway n sum mean;
by group;
class _name_;
var col1; /* analyze only the transposed values, not the counter x */
output out=want(drop=_freq_ _type_) n=n sum=sum mean=mean;
run;
Or use /autoname and transpose the resulting dataset from one observation per GROUP to multiple observations.
proc means data=have(drop=x) nway n sum mean;
by group;
output out=wide(drop=_freq_ _type_) n= sum= mean= /autoname;
run;
proc transpose data=wide out=tall;
by group;
run;
data tall ;
set tall ;
stat=scan(_name_,-1,'_');
_name_=substrn(_name_,1,length(_name_)-length(stat) -1);
rename _name_=varname;
run;
proc sort data=tall;
by group varname;
run;
proc transpose data=tall out=want(drop=_name_);
by group varname ;
id stat;
var col1;
run;
proc print data=want;
run;
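A shorter route may be possible, assuming SAS 9.3 or later, where PROC MEANS accepts the STACKODSOUTPUT option. This sketch captures the printed summary table through ODS, which already has one row per variable and group. The output dataset name want_ods is only an illustration, and the data must be sorted by group (the sample data already is).

```sas
/* Sketch only: assumes SAS 9.3+ for STACKODSOUTPUT. */
proc means data=have(drop=x) stackodsoutput n sum mean;
  by group;                      /* have is already ordered by group */
  var var1 var2;
  ods output summary=want_ods;   /* one row per group and variable */
run;

proc print data=want_ods;
run;
```

The resulting dataset holds the group, a Variable column with the variable name, and one column per requested statistic, so no transposing is needed.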
I have the following (fake) crime data of offenders:
/* Some fake-data */
DATA offenders;
INFILE DATALINES DSD; /* placed before INPUT so the DSD option is in effect when the lines are read */
INFORMAT id $12.;
INPUT id :$12. crime :4. offenderSex :$1. count :3.;
DATALINES;
1,110,f,3
2,32,f,1
3,31,m,1
4,113,m,1
5,110,m,1
6,31,m,1
7,31,m,1
8,110,f,2
9,113,m,1
10,31,m,1
11,113,m,1
12,110,f,1
13,32,m,1
14,31,m,1
15,31,m,1
16,31,m,1
17,110,f,2
18,113,m,2
19,31,m,1
20,31,m,1
21,110,m,4
22,32,f,1
23,31,m,1
24,31,m,1
25,110,f,4
26,110,m,1
27,110,m,1
28,110,m,2
29,32,m,1
30,113,f,1
31,32,m,1
32,31,f,1
33,110,m,1
34,32,f,1
35,113,m,2
36,31,m,1
37,113,m,1
38,110,f,1
39,113,u,2
;
RUN;
proc format;
value crimes 110 = 'Theft'
113 = 'Robbery'
32 = 'Assault'
31 = 'Minor assault';
run;
I want to create a cross table using PROC TABULATE:
proc tabulate;
format crime crimes.;
freq count;
class crime offenderSex;
table crime="Type of crime", offenderSex="Sex of the offender" /misstext="0";
run;
This gives me a table like this:
m f
------------------------------------
Minor assault |
Assault |
Theft |
Robbery |
Now, I'd like to group the different types of crimes:
'Assault' and 'minor assault' should be in a category "Violent crimes" and 'theft' and 'robbery' should be in a category "Crimes against property":
m f
------------------------------------
Minor assault |
Assault |
*Total violent crimes* |
Theft |
Robbery |
*Total property crimes* |
Can anyone explain to me how to do this? I tried to use another format for the 'crime' variable and use "category * crime" within PROC TABULATE, but then it turned out like this, which is not exactly what I want:
m f
-------------------------------------------------------
Violent crimes Minor assault |
Assault |
Property crimes Theft |
Robbery |
Use the ALL keyword within a table dimension, crossed with a class variable that holds the crime category (called group here):
table group='Category' * (crime="Type of crime" All='Total'), offenderSex="Sex of the offender" /misstext="0";
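The TABLE statement above needs a class variable holding the category. A minimal sketch of the whole step follows, assuming the category is derived in a DATA step; the dataset name offenders2, the variable name group and its length are assumptions made for illustration, and the crimes format from the question is reused.

```sas
/* Sketch only: offenders2 and group are hypothetical names. */
data offenders2;
  set offenders;
  length group $25;
  if crime in (31, 32) then group = 'Violent crimes';
  else if crime in (110, 113) then group = 'Crimes against property';
run;

proc tabulate data=offenders2;
  format crime crimes.;
  freq count;
  class group crime offenderSex;
  /* ALL inside the parentheses adds a total row within each category */
  table group='Category' * (crime='Type of crime' all='Total'),
        offenderSex='Sex of the offender' / misstext='0';
run;
```

Because ALL sits inside the parentheses that are crossed with group, the total is computed per category rather than once for the whole table.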
I have a dataset in which 12,000 variables are named "A0122_40", "A0122_45", "A0122_50" and so on. I would like to rename them, keeping only characters 2 to 5 of the original name. I would then like to create variables that sum all the columns sharing the same name.
CASE 1
If you want to join two tables that have the same column names and keep all the data, rename the columns in one of them first:
data class;
set sashelp.class (obs=2);
rename name=A0122_40
Sex=A0122_45
Weight=A0122_50
Height=A0122_55
;run;
Prepared Table:
+----------+----------+-----+----------+----------+
| A0122_40 | A0122_45 | Age | A0122_55 | A0122_50 |
+----------+----------+-----+----------+----------+
| Alfred | M | 14 | 69 | 112.5 |
| Alice | F | 13 | 56.5 | 84 |
+----------+----------+-----+----------+----------+
Code to rename:
%macro renameCols(lb , ds);
%local i num_vars;
/* Collect the names of the columns to rename */
proc sql noprint;
select name
into :rn_vr1-
from dictionary.columns
where LIBNAME=upcase("&lb") AND MEMNAME=upcase("&ds")
AND NAME LIKE 'A0122%'
;
quit;
%let num_vars = &sqlobs.;
/* Rename in place, without rewriting the data */
proc datasets library=&lb nolist nodetails nowarn;
modify &ds;
rename
%do i=1 %to &num_vars;
&&rn_vr&i=new_&&rn_vr&i
%end;
;
quit;
%mend renameCols;
%renameCols(work,class);
Result Table:
+--------------+--------------+-----+--------------+--------------+
| new_A0122_40 | new_A0122_45 | Age | new_A0122_55 | new_A0122_50 |
+--------------+--------------+-----+--------------+--------------+
| Alfred | M | 14 | 69 | 112.5 |
| Alice | F | 13 | 56.5 | 84 |
+--------------+--------------+-----+--------------+--------------+
As you can see, all columns were renamed except Age.
CASE 2
If you want to "append" all A0122_... columns into one, I suggest the following code:
data class;
set sashelp.class (obs=2);
rename name=A0123_40
Sex=A0123_45
Weight=A0122_50
Height=A0121_55
;
run;
Prepared Table:
+----------+----------+-----+----------+----------+
| A0123_40 | A0123_45 | Age | A0121_55 | A0122_50 |
+----------+----------+-----+----------+----------+
| Alfred | M | 14 | 69 | 112.5 |
| Alice | F | 13 | 56.5 | 84 |
+----------+----------+-----+----------+----------+
Code to rename:
%macro renameCols(lb , ds);
%macro d;
%mend d;
proc sql noprint;
select name
into :rn_vr1-
from dictionary.columns
where LIBNAME=upcase("&lb") AND MEMNAME=upcase("&ds")
AND NAME LIKE 'A012%'
;
%let num_vars = &sqlobs.;
/*Create a lot of tables with one column from one input*/
data
%do i=1 %to &num_vars;
&&rn_vr&i (rename=(&&rn_vr&i=%scan(&&rn_vr&i,1,_)) keep = &&rn_vr&i )
%end;
;
set &lb..&ds.;
run;
/*Count variable patterns and write it to macro*/
proc sql noprint;
select distinct scan(name,1,'_')
into :aggr_vars1-
from dictionary.columns
where LIBNAME=upcase("&lb") AND MEMNAME=upcase("&ds")
AND NAME LIKE 'A012%'
;
%let num_aggr_vars=&sqlobs.;
/*Append all tables that contains same column name pattern*/
%do i=1 %to &num_aggr_vars;
data _&&aggr_vars&i;
set &&aggr_vars&i:;
n=_n_;
run;
%end;
/*Merge that tables into one*/
data res (drop= n);
merge
%do i=1 %to &num_aggr_vars;
_&&aggr_vars&i
%end;
;
by n;
run;
%mend renameCols;
%renameCols(work,class);
The res table:
+-------+-------+--------+
| A0121 | A0122 | A0123 |
+-------+-------+--------+
| 69 | 112.5 | Alfred |
| 56.5 | 84 | Alice |
| . | . | M |
| . | . | F |
+-------+-------+--------+
Is this what you are looking for?
proc contents data=have out=cols noprint;
run;
proc sql noprint;
select distinct substr(name,1,6) into :colgrps separated by " " from cols;
quit;
%macro process;
data want;
set have;
%let ii = 1;
%do %while (%scan(&colgrps, &ii, %str( )) ~= );
%let grp = %scan(&colgrps, &ii, %str( ));
&grp._sum = sum(of &grp.:);
%let ii = %eval(&ii + 1);
%end;
run;
%mend;
%process;
Here is sample code derived from an actual application. There are two datasets: "aa" for the outer query and "bb" for the subquery. Column "m" in "aa" is matched against column "y" in "bb". Table "aa" also has a column "yy" whose value is always 30. Column "m" in "aa" contains the value 30 in one of its rows; column "y" in "bb" does not.
The first PROC SQL step uses the values in column "y" of "bb" to subset "aa" on matching values of "m". It is a correct query and produces the expected results.
The second PROC SQL step is identical, except that column "y" is intentionally misspelled as "yy" in the subquery, on the line that starts with WHERE. Since there is no column "yy" in "bb", I would expect an error message and the whole query to fail. Instead, it returns one row with no error message. A closer look suggests that it actually uses the "yy" column from "aa" (see the tree in the log output).
I do not think this is correct behavior. I would greatly appreciate any comments or explanations; otherwise, I should perhaps report it to SAS as a bug. Thank you!
Here is the code:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
yy=30;
output;
end;
run;
data bb;
do i=10 to 20;
y=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Correct sql command*/
proc sql _method
_tree
;
create table cc as
select *
from aa
where m in (select y from bb)
;quit;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select yy from bb)
;quit;
Here is log with sql tree:
119 options
120 msglevel = I
121 ;
122 data aa;
123 do i=1 to 20;
124 m=i*5;
125 yy=30;
126 output;
127 end;
128 run;
NOTE: The data set WORK.AA has 20 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
129
130 data bb;
131 do i=10 to 20;
132 y=i*5;
133 output;
134 end;
135 run;
NOTE: The data set WORK.BB has 11 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
136 option DEBUG=JUNK ;
137
138 /*Correct sql command*/
139 proc sql _method
140 _tree
141 ;
142 create table cc as
143 select *
144 from aa
145 where m in (select y from bb)
146 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-V-(bb.y:2 flag=0001)
| | /-OBJ----|
| | /-SRC----|
| | | \-TABL[WORK].bb opt=''
| \-SUBC---|
--SSEL---|
NOTE: Table WORK.CC created, with 11 rows and 3 columns.
146! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
147
148
149 /*Incorrect sql command - column "yy" in not on "bb" table"*/
150 proc sql _method
151 _tree;
152 create table dd as
153 select *
154 from aa
155 where m in (select yy from bb)
156 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
--SSEL---|
NOTE: Table WORK.DD created, with 1 rows and 3 columns.
156! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Here are datasets:
aa:
i m yy
1 5 30
2 10 30
3 15 30
4 20 30
5 25 30
6 30 30
7 35 30
8 40 30
9 45 30
10 50 30
11 55 30
12 60 30
13 65 30
14 70 30
15 75 30
16 80 30
17 85 30
18 90 30
19 95 30
20 100 30
bb:
i y
10 50
11 55
12 60
13 65
14 70
15 75
16 80
17 85
18 90
19 95
20 100
I agree, this looks pretty weird and may well be a bug. I was able to reproduce this from the code you provided in SAS 9.4 and in SAS 9.1.3, which would make it at least ~12 years old.
In particular, I'm interested in this bit of the output you got from the _method option when creating the DD table but not when creating the CC table:
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps <--- What is this doing?
sqxsrc( WORK.BB )
Similarly, the corresponding section from the _tree output is highly obscure:
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
I have never seen sqxreps or reps in the respective bits of output before. Neither of them is listed in any of the papers I was able to find based on a brief bit of googling (in fact, this question is currently the only hit on Google for sas + sqxreps):
http://support.sas.com/resources/papers/proceedings10/139-2010.pdf
http://www2.sas.com/proceedings/sugi30/101-30.pdf
Quoting the first of these:
Code | Description
--------|------------------------------
sqxcrta | Create table as Select
sqxslct | Select
sqxjsl | Step loop join (Cartesian)
sqxjm | Merge join
sqxjndx | Index join
sqxjhsh | Hash join
sqxsort | Sort
sqxsrc | Source rows from table
sqxfil | Filter rows
sqxsumg | Summary stats with GROUP BY
sqxsumn | Summary stats with no GROUP BY
Based on a bit of quick testing, this seems to happen regardless of the variable and table names used, provided that the variable name from AA is the one used in the subquery referencing table BB. It also happens if you have a variable named e.g. YYY in AA but one named YY in BB, or more generally whenever you have a variable in AA whose name starts with the name of the corresponding variable in BB but then continues for one or more characters.
From this, I'm guessing at some point in the SQL parser, someone used a like operator rather than checking for equality of variable names, and somehow as a result this syntax is triggering an undocumented or incomplete 'feature' in proc sql.
An example of the more general case:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
myvar_plus_suffix=30;
output;
end;
run;
data bb;
do i=10 to 20;
myvar=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Incorrect sql command - column "myvar_plus_suffix" is not on "bb" table*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select myvar_plus_suffix from bb)
;quit;
Here is a response from SAS support.
What you are seeing is related to column scoping in PROC SQL.
PROC SQL supports correlated subqueries. A correlated subquery references a column in the "outer" table, which can then be compared to columns in the "inner" table. PROC SQL does not require that a fully qualified column name be used. As a result, if it sees a column in the subquery that does not exist in the inner table (the table referenced in the subquery), it looks for that column in the "outer" table and uses the value if it finds one.
If a fully qualified column name is used, the error you are expecting will occur, as in the following:
proc sql;
create table dd as
select *
from aa as a
where a.m in (select b.yy from bb as b); /* fails: bb has no column yy */
quit;
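To make the scoping rule concrete, here is a minimal sketch that spells out what the misspelled query appears to be doing, assuming the aa and bb datasets from the question. The table name dd_explicit and the aliases a and b are chosen for illustration only. The subquery selects the outer column a.yy once for every row of bb, so for each row of aa the test reduces to whether m equals that row's yy.

```sas
/* Sketch only: the unqualified yy is written out as the outer column a.yy. */
proc sql;
  create table dd_explicit as
  select a.*
  from aa as a
  where a.m in (select a.yy from bb as b);  /* correlated reference to aa */
quit;
```

With the sample data, yy is always 30, so only the row of aa where m is 30 qualifies, which matches the single row reported for WORK.DD in the log.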