SAS: how to do frequencies for only certain values

I have some survey data with possible responses, an example would be:
Q1
Person1 Yes
Person2 No
Person3 Missing
Person4 Multiple Marks
Person5 Yes
I need to calculate the frequencies by question, so that only the Yes/No responses (other questions have varied responses such as frequently, very frequently, etc.) are counted in the totals, not the ones marked Missing or Multiple Marks. Is there a way to exclude these using proc freq or another method?
Outcome:
Yes: 2
No: 1
Total: 3

Using proc freq, I'd do something like this:
proc freq data=have (where=(q1 in ("Yes", "No")));
tables q1 / out=want;
run;
Output:
Q1 Count Percent
No 1 33.333333333
Yes 2 66.666666667
Proc sql:
proc sql;
select
sum(case when q1 eq "Yes" then 1 else 0 end) as Yes
,sum(case when q1 eq "No" then 1 else 0 end) as No
,count(q1) as Total
from have
where q1 in ("Yes", "No");
quit;
Output:
Yes No Total
2 1 3

The best way to do this is using formats.
Rather than storing your data as character strings, you should be storing it as numeric variables. This allows you to use numeric missing values to code those values you don't consider proper responses; using formats allows you to have your cake and eat it too (i.e., allows you to still have those nice pretty response labels).
Here's an example. To understand this, you need to understand SAS special missings. Note the missing statement tells SAS to consider a single "M" in the input as .M (and similar for D and R). I then show two PROC FREQ results, one with the missings excluded, one with them included, to show the difference.
proc format;
value YNQF
1 = 'Yes'
2 = 'No'
. = 'Missing'
.M= 'Multiple Marks'
.D= "Don't Know"
.R= "Refused"
;
run;
missing M R D;
data have;
input Q1 Q2 Q3;
format q1 q2 q3 YNQF.;
datalines;
1 1 2
2 1 R
. . 1
M 1 1
1 . D
;;;;
run;
proc freq data=have;
tables (q1 q2 q3);
tables (q1 q2 q3)/missing;
run;

Related

SAS_Count Frequency

I want to ask a complicated (for me) question about SAS programming. I think I can explain better by using simple example. So, I have the following dataset:
Group Category
A 1
A 1
A 2
A 1
A 2
A 3
B 1
B 2
B 2
B 1
B 3
B 2
I want to count each category within each group. I can do it with PROC FREQ, but that does not seem like a good approach for my dataset: it is very large and has a huge number of groups, so it would be time consuming. As I understand it, with PROC FREQ I would first need to create a new dataset for each group and then run PROC FREQ on each one. In sum, I need to create the following dataset:
CATEGORIES
Group 1 (first category) 2 3
A 3 2 1
B 2 3 1
So, the count of the first category in group A is 3, the count of the first category in group B is 2, and so on. I hope that explains it. Thanks for your help.
There is more than one way to do this in SAS. My bias is proc sql, so (here t stands for your table and grp for its group column):
proc sql;
select grp,
sum(case when category = 1 then 1 else 0 end) as cat_1,
sum(case when category = 2 then 1 else 0 end) as cat_2,
sum(case when category = 3 then 1 else 0 end) as cat_3
from t
group by grp;
quit;
Either proc freq or proc summary will do the job of producing frequency counts:
data example;
length group category $1;
input group category;
cards;
A 1
A 1
A 2
A 1
A 2
A 3
B 1
B 2
B 2
B 1
B 3
B 2
;
run;
proc freq data=example;
table group*category;
run;
proc summary data=example nway;
class group category;
output out=example_frequency (drop=_type_);
run;
proc summary will produce a dataset in a 'long' format. If you need to transpose it (I'd suggest not doing so: you'll probably find working with the long format easier in most circumstances) you can use proc transpose:
proc transpose data=example_frequency out=example_matrix (drop=_name_);
by group;
id category;
var _freq_;
run;

Calculating percentages and scores

Let's say I have data which looks like:
ID A1Q A2Q B1Q B2Q Continued
23 1 2 2 3
24 1 2 3 3
To read the table: the person with ID 23 gave answers 1, 2, 2, 3 to questions A1Q, A2Q, B1Q, B2Q respectively. I want to find the percentage of students who answered 1, 2 or 3 across the entire dataset.
I have tried using
PROC FREQ data = test.one;
tables A1Q-A2Q;
tables B1Q-B2Q;
RUN;
But this does not get me what I want. It analyzes each question separately and the output is long. I just need one table that tells me what percentage answered 1, what percentage answered 2, and so on.
The output could be:
Question: 1 2 3
Percentage A1Q 40% 40% 20%
Percentage A2Q 60% 20% 20%
Total Percentage 30% 30% 40%
So it would translate such that 40% of people chose 1, 40% chose 2, and 20% chose 3 for question A1Q. The total percentage is out of all the people that gave answers: 30% chose 1, 30% chose 2 and 40% chose 3.
You'd still need to work on it a little bit and transpose the final results but this could be a start... also if you have lots of questions, consider wrapping this up in a macro program.
data quest;
input ID A1Q A2Q B1Q B2Q;
datalines;
21 2 3 1 2
22 3 2 2 3
23 1 2 2 3
24 1 2 3 3
25 2 1 3 3
;
run;
options missing = 0;
proc freq data=quest;
table A1Q / nocol nocum nofreq out = freq1(rename=(A1Q=Answer Count=A1Q));
table A2Q / nocol nocum nofreq out = freq2;
table B1Q / nocol nocum nofreq out = freq3;
table B2Q / nocol nocum nofreq out = freq4;
run;
proc sql;
create table results as
select freq1.Answer,
freq1.Percent as pctA1Q,
freq2.Percent as pctA2Q,
freq3.Percent as pctB1Q,
freq4.Percent as pctB2Q
from freq1
left join freq2
on freq1.Answer = freq2.A2Q
left join freq3
on freq1.Answer = freq3.B1Q
left join freq4
on freq1.Answer = freq4.B2Q;
quit;
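The macro idea mentioned above could be sketched roughly as follows. This is only an illustration: the macro name freq_pct, its data= and vars= parameters, and the freq_/pct_ naming are assumptions made up for this example, and it swaps the SQL left joins for a data step merge by Answer so that answers missing from the first question are not dropped.

```sas
/* Hypothetical macro: one PROC FREQ per question, then merge the percentages. */
%macro freq_pct(data=, vars=);
  %local i var;

  /* One output data set per question, holding Answer and pct_<question> */
  %do i = 1 %to %sysfunc(countw(&vars));
    %let var = %scan(&vars, &i);
    proc freq data=&data noprint;
      table &var / out=freq_&var(rename=(&var=Answer Percent=pct_&var) drop=Count);
    run;
  %end;

  /* PROC FREQ output is already ordered by the table variable, so a merge by Answer works */
  data results;
    merge
    %do i = 1 %to %sysfunc(countw(&vars));
      freq_%scan(&vars, &i)
    %end;
    ;
    by Answer;
  run;
%mend freq_pct;

%freq_pct(data=quest, vars=A1Q A2Q B1Q B2Q)
```

The results data set then has one row per answer value and one pct_ column per question, which you would still transpose if you want questions as rows.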
My suggestion would be to transpose your data and then do a proc freq or proc tabulate. I would recommend proc tabulate so you can format your output, since it looks like you have questions that are grouped.
data long;
set have;
array qs(*) a1q--b2q; *list first and last variable and everything in between will be included;
do i=1 to dim(qs);
question=vname(qs(i));
response=qs(i);
output;
end;
keep id question response;
run;
proc freq data=long;
table question*response/list;
run;
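For the PROC TABULATE route, a minimal sketch might look like the following. It assumes the long data set built above (variables question and response); the labels and the 6.1 format are arbitrary choices for illustration.

```sas
/* Assumes the LONG data set created above (id, question, response). */
proc tabulate data=long;
  class question response;
  /* Rows: each question plus an overall row; columns: row percentages by response */
  table question all='Total',
        response*rowpctn='Percent'*f=6.1;
run;
```

ROWPCTN gives each cell as a percentage of its row total, so every question's row sums to 100, and the ALL row gives the percentages over all questions combined.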

Rolling up data in SAS

Here is my data :
data example;
input id sports_name :$16.;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
This is just a sample. The variable sports_name is categorical with 56 types.
I am trying to transpose the data to wide form where each row would have a user_id and the names of sports as the variables with values being 1/0 indicating Presence or absence.
So far, I have used proc freq to get the cross-tabulated frequency table, written that to a separate data set, and then transposed it. Now I have missing values in some cases and counts of the sports in the rest.
Is there any better way to do this?
Thanks!!
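You need a way to create something from nothing: rows for the id/sport combinations that never occur in the data. The COMPLETETYPES option in PROC SUMMARY does this below; you could also have used the SPARSE option in PROC FREQ. Note that the sport names become variable names after the transpose, and SAS names cannot be longer than 32 characters.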
data example;
input id sports_name :$16.;
retain y 1;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;;;;
run;
proc print;
run;
proc summary data=example nway completetypes;
class id sports_name;
output out=freq(drop=_type_);
run;
proc print;
run;
proc transpose data=freq out=wide(drop=_name_);
by id;
var _freq_;
id sports_name;
run;
proc print;
run;
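For comparison, the SPARSE variant might be sketched like this. It assumes the example data set above; the output names freq_sparse and wide_sparse are made up for the illustration.

```sas
/* Assumes the EXAMPLE data set above (id, sports_name). */
proc freq data=example noprint;
  /* SPARSE writes zero-count id/sport combinations to the OUT= data set as well */
  tables id*sports_name / sparse out=freq_sparse(drop=percent);
run;

/* One row per id, one column per sport, holding the count */
proc transpose data=freq_sparse out=wide_sparse(drop=_name_ _label_);
  by id;
  id sports_name;
  var count;
run;
```

The values are counts rather than strict 1/0 flags; they coincide here only because each id lists a sport at most once.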
Same theory here: generate a list of all possible combinations using SQL instead of PROC SUMMARY, then transpose the results.
data example;
informat sports_name $20.;
input id sports_name $;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
proc sql;
create table complete as
select a.id, a_x.sports_name, case when not missing(e.sports_name) then 1 else 0 end as Present
from (select distinct ID from example) a
cross join (select distinct sports_name from example) a_x
full join example as e
on e.id=a.id
and e.sports_name=a_x.sports_name
order by a.id, a_x.sports_name;
quit;
proc transpose data=complete out=want;
by id;
id sports_name;
var Present;
run;

SAS Find Top Combinations in Dataset

Hello everyone,
I have some sales data which looks like this:
data have;
input order_id item $;
cards;
1 A
1 B
2 A
2 C
3 B
4 A
4 B
;
run;
What I'm trying to find out is what are the most popular combinations of items ordered. For example in the above case, there were 2 orders that contained items A&B, 1 order of A&C, and 1 order of B. What would be the best way to output the different combinations along with the numbers of orders placed?
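Since the order of items within an order does not seem to matter (A,B is the same combination as B,A), you can sort the items, concatenate them per order, and count the resulting strings: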
proc sort data=have;
by order_id item;
run;
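data temp;
set have;
by order_id;
length comb $4; * room for up to four one-character items per order;
retain comb; * keep the running combination across rows of the same order;
comb=cats(comb,item);
if last.order_id then do;
output; * one row per order, holding the full combination;
call missing(comb); * reset for the next order;
end;
run;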
proc freq data=temp;
table comb/norow nopercent nocol nocum;
run;
There are many possible approaches to this problem, and I would not presume to say which is the best. Here's a fairly simple method you could use:
Transpose your data so that you only have 1 row for each order, with an indicator variable for each product.
Feed the transposed dataset into proc corr to produce a correlation matrix for the indicator variables, and look for the strongest correlations.
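A rough sketch of those two steps, assuming the have data set from the question and that no item appears twice within the same order (PROC TRANSPOSE with an ID statement would reject duplicates). The names flagged, wide, flag and the item_ prefix are made up for this illustration.

```sas
/* Assumes the HAVE data set from the question (order_id, item). */
data flagged;
  set have;
  flag = 1;                       /* indicator: this item is in this order */
run;

proc sort data=flagged;
  by order_id;
run;

/* One row per order, one item_<name> column per product */
proc transpose data=flagged out=wide(drop=_name_) prefix=item_;
  by order_id;
  id item;
  var flag;
run;

/* Items absent from an order come out missing; recode them to 0 */
data wide;
  set wide;
  array items(*) item_:;
  do i = 1 to dim(items);
    if missing(items(i)) then items(i) = 0;
  end;
  drop i;
run;

proc corr data=wide;
  var item_:;
run;
```

Keep in mind that correlations point to pairs of items that tend to be bought together; they do not count exact combinations per order.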

How to sort by formatted values

proc sort data=sas.mincome;
by F3 F4;
run;
Proc sort doesn't sort the dataset by formatted values, only internal values. I need to sort by two variables prior to a merge. Is there any way to do this with proc sort?
I don't think you can sort by formatted values in proc sort, but you can use PROC SQL to do it, because its ORDER BY clause accepts expressions such as put(variable, format.).
The general syntax of proc sql for sorting by formatted values will be:
proc sql;
create table NewDataSet as
select variable(s)
from OriginalDataSet
order by put(variable1, format1.), put(variable2, format2.);
quit;
For example, we have a sample data set containing the names, sex and ages of some people and we want to sort them:
proc format;
value gender 1='Male'
2='Female';
value age 10-15='Young'
16-24='Old';
run;
data work.original;
input name $ sex age;
datalines;
John 1 12
Zack 1 15
Mary 2 18
Peter 1 11
Angela 2 24
Jack 1 16
Lucy 2 17
Sharon 2 12
Isaac 1 22
;
run;
proc sql;
create table work.new as
select name, sex format=gender., age format=age.
from work.original
order by put(sex, gender.), put(age, age.);
quit;
Output of work.new will be:
Obs name sex age
1 Mary Female Old
2 Angela Female Old
3 Lucy Female Old
4 Sharon Female Young
5 Jack Male Old
6 Isaac Male Old
7 John Male Young
8 Zack Male Young
9 Peter Male Young
If we had used proc sort by sex, then Males would have been ranked first because we had used 1 to represent Males and 2 to represent Females which is not what we want. So, we can clearly see that proc sql did in fact sort them according to the formatted values (Females first, Males second).
Hope this helps.
Because of the nature of formats, SAS only uses the underlying values for the sort. To my knowledge, you cannot change that (unless you want to build your own translation table via PROC TRANTAB).
What you can do is create a new column that contains the formatted value. Then you can sort on that column.
proc format library=work;
value $test 'z' = 'a'
'y' = 'b'
'x' = 'c';
run;
data test;
format val $test.;
informat val $1.;
input val $;
val_fmt = put(val,$test.);
datalines;
x
y
z
;
run;
proc print data=test(drop=val_fmt);
run;
proc sort data=test;
by val_fmt;
run;
proc print data=test(drop=val_fmt);
run;
Produces
Obs val
1 c
2 b
3 a
Obs val
1 a
2 b
3 c