Group rows in PROC TABULATE - sas

I have the following (fake) crime data of offenders:
/* Some fake-data */
DATA offenders;
INPUT id :$12. crime :4. offenderSex :$1. count :3.;
INFORMAT id $12.;
INFILE DATALINES DSD;
DATALINES;
1,110,f,3
2,32,f,1
3,31,m,1
4,113,m,1
5,110,m,1
6,31,m,1
7,31,m,1
8,110,f,2
9,113,m,1
10,31,m,1
11,113,m,1
12,110,f,1
13,32,m,1
14,31,m,1
15,31,m,1
16,31,m,1
17,110,f,2
18,113,m,2
19,31,m,1
20,31,m,1
21,110,m,4
22,32,f,1
23,31,m,1
24,31,m,1
25,110,f,4
26,110,m,1
27,110,m,1
28,110,m,2
29,32,m,1
30,113,f,1
31,32,m,1
32,31,f,1
33,110,m,1
34,32,f,1
35,113,m,2
36,31,m,1
37,113,m,1
38,110,f,1
39,113,u,2
;
RUN;
proc format;
value crimes 110 = 'Theft'
113 = 'Robbery'
32 = 'Assault'
31 = 'Minor assault';
run;
I want to create a cross table using PROC TABULATE:
proc tabulate;
format crime crimes.;
freq count;
class crime offenderSex;
table crime="Type of crime", offenderSex="Sex of the offender" /misstext="0";
run;
This gives me a table like this:
m f
------------------------------------
Minor assault |
Assault |
Theft |
Robbery |
Now, I'd like to group the different types of crimes:
'Assault' and 'minor assault' should be in a category "Violent crimes" and 'theft' and 'robbery' should be in a category "Crimes against property":
m f
------------------------------------
Minor assault |
Assault |
*Total violent crimes* |
Theft |
Robbery |
*Total property crimes* |
Can anyone explain me how to do this? I tried to use another format for the 'crime'-variable and use "category * crime" within PROC TABULATE, but then it turned out like this, which is not exactly what I want:
m f
-------------------------------------------------------
Violent crimes Minor assault |
Assault |
Property crimes Theft |
Robbery |

Use the all= option within a table dimension :
table group='Category' * (crime="Type of crime" All='Total'), offenderSex="Sex of the offender" /misstext="0";

Related

Proc means output statement

I have a dataset with several variables like the one below:
Data have(drop=x);
call streaminit(1);
do x = 1 to 20 by 1;
if x < 11 then group = 'A';
else group = 'B';
var1 = rand('normal',0,1);
var2 = rand('uniform');
output;
end;
Run;
In my analysis I need to get some summary stats using PROC MEANS and output the results for each variable into one dataset. I tried doing it with the code below, but it only includes stats from the first variable in the dataset. How can I output the remaining variables into the same dataset?
Proc means data=have n sum mean;
By group;
Output out=want(drop=_freq_ _type_) n=n sum=sum mean=mean;
Run;
Output:
+-------+----+----------+----------+
| group | n | sum | mean |
+-------+----+----------+----------+
| A | 10 | 4.517081 | 0.451708 |
+-------+----+----------+----------+
| B | 10 | -0.77369 | -0.07737 |
+-------+----+----------+----------+
Desired output:
+----------+-------+----+----------+----------+
| variable | group | n | sum | mean |
+----------+-------+----+----------+----------+
| var1 | A | 10 | 4.517081 | 0.451708 |
+----------+-------+----+----------+----------+
| var1 | B | 10 | -0.77369 | -0.07737 |
+----------+-------+----+----------+----------+
| var2 | A | 10 | 7.947089 | 0.794709 |
+----------+-------+----+----------+----------+
| var2 | B | 10 | 5.003049 | 0.500305 |
+----------+-------+----+----------+----------+
You requested SAS to name the count n, the sum sum and the mean mean.
It can only do that for one variable.
This is the syntax to ask SAS to use different names for the statistics of each variable:
Output out=want(drop=_freq_ _type_)
n(var1 var2)=n1 n2
sum(var1 var2)=sum1 sum2
mean(var1 var2)=mean1 mean2;
To get that output you will need to transpose the data. Either transpose before hand and add the _NAME_ variable to the BY or CLASS statement.
data have;
call streaminit(1);
do x = 1 to 20 by 1;
if x < 11 then group = 'A';
else group = 'B';
var1 = rand('normal',0,1);
var2 = rand('uniform');
output;
end;
run;
proc transpose data=have out=tall;
by group x;
run;
proc means data=tall nway n sum mean;
by group;
class _name_;
output out=want(drop=_freq_ _type_) n=n sum=sum mean=mean;
run;
Or use /autoname and transpose the resulting dataset from one observation per GROUP to multiple observations.
proc means data=have(drop=x) nway n sum mean;
by group;
output out=wide(drop=_freq_ _type_) n= sum= mean= /autoname;
run;
proc transpose data=wide out=tall;
by group;
run;
data tall ;
set tall ;
stat=scan(_name_,-1,'_');
_name_=substrn(_name_,1,length(_name_)-length(stat) -1);
rename _name_=varname;
run;
proc sort data=tall;
by group varname;
run;
proc transpose data=tall out=want(drop=_name_);
by group varname ;
id stat;
var col1;
run;
proc print data=want;
run;

How to split a combination of numeric and characters into multiple columns

I want to split some variable "15to16" into two columns where for that row I want the values 15 and 16 in each of the column entries. Hence, I want to get from this
+-------------+
| change |
+-------------+
| 15to16 |
| 9to8 |
| 6to5 |
| 10to16 |
+-------------+
this
+-------------+-----------+-----------+
| change | from | to |
+-------------+-----------+-----------+
| 15to16 | 15 | 16 |
| 9to8 | 9 | 8 |
| 6to5 | 6 | 5 |
| 10to16 | 10 | 16 |
+-------------+-----------+-----------+
Could someone help me out? Thanks in advance!
data have;
input change $;
cards;
15to16
9to8
6to5
10to16
;
run;
data want;
set have;
from = input(scan(change,1,'to'), 8.);
to = input(scan(change,2,'to'), 8.);
run;
N.B. in this case the scan function is using both t and o as separate delimiters, rather than looking for the word to. This approach still works because scan by default treats multiple consecutive delimiters as a single delimiter.
Regular expressions with the metacharacter () define groups whose contents can be retrieved from capture buffers with PRXPOSN. The capture buffers retrieved in this case would be one or more consecutive decimals (\d+) and converted to a numeric value with INPUT
data have;
input change $20.; datalines;
15to16
9to8
6to5
10to16
run;
data want;
set have;
rx = prxparse('/^\s*(\d+)\s*to\s*(\d+)\s*$/');
if prxmatch (rx, change) then do;
from = input(prxposn(rx,1,change), 12.);
to = input(prxposn(rx,2,change), 12.);
end;
drop rx;
run;
You can get the answer you want by declaring delimiter when you create the dataset. However you did not provide enough information regarding your other variables and how you import them
Data want;
INFILE datalines DELIMITER='to';
INPUT from to;
datalines;
15to16
9to8
6to5
10to16
;
Run;

How to plot a simple lineplot in SAS

My data are structured as follows (these are just sample data as the original data are secret)
id | crime | location | crimedate
------------------------------
1 | Theft | public | 2019-01-04
1 | Theft | public | 2019-02-06
1 | Theft | public | 2019-02-20
1 | Theft | private | 2019-03-10
1 | Theft | private | 2019-03-21
1 | Theft | public | 2019-03-01
1 | Theft | private | 2019-03-14
1 | Theft | public | 2019-06-15
1 | Murder | private | 2019-01-04
1 | Murder | private | 2019-10-20
1 | Murder | private | 2019-11-18
1 | Murder | private | 2019-01-01
1 | Assault | private | 2019-03-19
1 | Assault | private | 2019-01-21
1 | Assault | public | 2019-04-11
1 | Assault | public | 2019-01-10
… | … | … | …
My goal is to create a lineplot (time series plot) showing how the numbers of the three crimes have changed over the year. Therefore on the x-axis I would like to show the monthes (1-12) and on the y-axis the number of crimes in each month. There should be two lines (one for each location).
I started with this code:
DATA new;
SET old;
month=month(datepart(crimedate));
RUN;
PROC sgplot DATA=new;
series x=month y=no_of_crimes / group=location;
run;
But I have no idea, how I can aggregate the number of crimes per month. Could anyone please give me a hint? I have been looking in the internet for a solution, but usually the examples just use data that are already aggregated.
The SG routines will aggregate Y axis values for a VBAR or HBAR statement. The same aggregate information displayed in a SERIES statement would have to be from a apriori aggregate computation, easily done with Proc SUMMARY.
Additionally, to plot the counts for each crime in a separate visual, you would want a BY CRIME statement, or Proc SGPANEL with PANELBY crime.
The crime datetime value does not have to be down transformed to a date value, you can use the appropriate datetime format in the procedures and they will auto-aggregate based on the formatted value.
Example with some simulated crime data:
data have;
do precinct = 1 to 10;
do date = '01jan2018'd to '31dec2018'd;
do seq = 1 to 20*ranuni(123);
length crime $10 location $8;
crime = scan('theft,assault,robbery,dnd', ceil(4*ranuni(123)));
location = scan ('public,private', ceil(2*ranuni(123)));
crime_dt = dhms(date,0,0,floor('24:00't*ranuni(123)));
output;
end;
end;
end;
drop date;
format crime_dt datetime19.;
run;
* shorter graphs for SO answer;
ods graphics / height=300px;
proc sgplot data=have;
title "VBAR all crimes combined by location";
vbar crime_dt
/ group=location
groupdisplay=cluster
;
format crime_dt dtmonyy7.;
run;
proc sgpanel data=have;
title "VBAR crime * location";
panelby crime;
vbar crime_dt
/ group=location
groupdisplay=cluster
;
format crime_dt dtmonyy7.;
run;
proc summary data=have noprint;
class crime_dt crime location;
format crime_dt dtmonyy7.;
output out=freqs;
run;
proc sgplot data=freqs;
title "SERIES all crimes,summary _FREQ_ * location";
where _type_ = 5;
series x=crime_dt y=_freq_ / group=location;
xaxis type=discrete;
run;
proc sgpanel data=freqs;
title "SERIES all crimes,summary _FREQ_ * crime * location";
where _type_ = 7;
panelby crime;
series x=crime_dt y=_freq_ / group=location;
rowaxis min=0;
colaxis type=discrete;
run;
If you want to group by location without definition by type of crime:
proc sql noprint;
create table new as
select id,location
, month(crimedate) as month,count(crime) as crime_n
from old
group by id,location,CALCULATED month;
quit;
proc sgplot data=new;
series x=month y=crime_n /group=location;
run;
The result:
To show different series by type of crime you could use sgpanel:
proc sql noprint;
create table new as
select id,crime,location, month(crimedate) as month,count(crime) as crime_n
from old
group by id,crime,location,CALCULATED month;
quit;
proc sgpanel DATA=new;
panelby location;
series x=month y=crime_n /group=crime;
run;
The result is:
One more variant of perfoming this data:
proc sql noprint;
create table new as
select id,crime,location, month(crimedate) as month,count(crime) as crime_n
from old
group by id,crime,location,CALCULATED month;
quit;
proc sgpanel DATA=new;
panelby crime;
series x=month y=crime_n /group=location GROUPDISPLAY=cluster;
run;
The result is:
Of course, you can specify this plots how you want.
To perhaps answer the question more directly, the VLINE or HLINE plots will summarize the data for you, similar to running a proc freq and then proc sgplot with series.
Using Richard's test data, you'll see this is exactly identical to the plot his PROC FREQ -> SERIES gives:
data have;
do precinct = 1 to 10;
do date = '01jan2018'd to '31dec2018'd;
do seq = 1 to 20*ranuni(123);
length crime $10 location $8;
crime = scan('theft,assault,robbery,dnd', ceil(4*ranuni(123)));
location = scan ('public,private', ceil(2*ranuni(123)));
crime_dt = dhms(date,0,0,floor('24:00't*ranuni(123)));
output;
end;
end;
end;
drop date;
format crime_dt datetime19.;
run;
proc sgplot data=have;
vline crime_dt/group=location groupdisplay=cluster;
format crime_dt dtmonyy7.;
run;

Proc REPORT move group value (row header) closer to totals

I have some data that is structured as below. I need to create a table with subtotals, a total column that's TypeA + TypeB and a header that spans the columns as a table title. Also, it would be ideal to show different names in the column headings rather than the variable name from the dataset.
I cobbled together some preliminary code to get the subtotals and total, but not the rest.
data tabletest;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
;
run;
Code that I wrote:
proc report data=tabletest nofs headline headskip;
column referral_total referral_source TypeA TypeB;
define referral_total / group ;
define referral_source / group;
define TypeA / sum ' ';
define TypeB / sum ' ';
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
run;
This is the desired output:
The DEFINE statement has an option NOPRINT that causes the column to not be rendered, however, the variables for it are still available (in a left to right manner) for use in a compute block.
Stacking in the column statement allows you to customize the column headers and spans. In a compute block for non-group columns, the Proc REPORT data vector only allows access to the aggregate values at the detail or total line, so you need to specify .
This sample code shows how the _total column is hidden and the _source cells in the sub- and report- total lines are 'injected' with the hidden _total value. The _source variable has to be lengthened to accommodate the longer values that are in the _total variable.
data tabletest;
* ensure referral_source big enough to accommodate _total || ' TOTAL';
length referral_total $50 referral_source $60;
informat referral_total $50. referral_source $20.;
infile datalines delimiter='|';
input referral_total referral_source TypeA TypeB ;
datalines;
Long Org Name | SubA | 12 | 5
Long Org Name | SubB | 14 | 3
Longer Org Name | SubC | 0 | 1
Longer Org Name | SubD | 4 | 12
Very Long Org | SubE | 3 | 11
Very Long Org | SubF | 9 | 19
Very Long Org | SubG | 1 | 22
run;
proc report data=tabletest;
column
( 'Table 1 - Stacking gives you custom headers and hierarchies'
referral_total
referral_source
TypeA TypeB
TypeTotal
);
define referral_total / group noprint; * hide this column;
define referral_source / group;
define TypeA / sum 'Freq(A)'; * field labels are column headers;
define TypeB / sum 'Freq(B)';
define TypeTotal / computed 'Freq(ALL)'; * specify custom computation;
break after referral_total / summarize style={background=lightblue font_weight=bold };
rbreak after /summarize;
/*
* no thanks, doing this in the _source compute block instead;
compute referral_total;
if _break_ = 'referral_total' then
do;
referral_total = catx(' ', referral_total, 'Total');
end;
else if _break_ in ('_RBREAK_') then
do;
referral_total='Total';
end;
endcomp;
*/
compute referral_source;
* the referral_total value is available because it is left of me. It just happens to be invisible;
* at the break lines override the value that appears in the _source cell, effectively 'moving it over';
select (_break_);
when ('referral_total') referral_source = catx(' ', referral_total, 'Total');
when ('_RBREAK_') referral_source = 'Total';
otherwise;
end;
endcomp;
compute TypeTotal;
* .sum is needed because the left of me are groups and only aggregate values available here;
TypeTotal = Sum(TypeA.sum,TypeB.sum);
endcomp;
run;

SAS proc freq table

I need help in sas proc freq table.
I have two columns : TYPE and STATUS
TYPE has two values : TypeA and TypeB
STATUS has two values : ON and OFF
I need the sas proc freq table output as follows :
--------------------------------------------------
| TYPE | STATUS |
--------------------------------------------------
| | ON | OFF | ON | OFF |
--------------------------------------------------
| TypeA| 33 | 44 | 22 | 55 |
--------------------------------------------------
| TypeB| 11 | 22 | 33 | 44 |
--------------------------------------------------
How can I obtain this ? I have tried:
proc freq data = XYZ;
tables TYPE*STATUS/missing nocol norow nopercent nocum;
run;
PROC TABULATE is probably the easiest way to get that total column. The keyword all gives you a total across a class.
proc tabulate data=xyz;
class type status;
tables type,(status all)*n;
run;