Searching Google for this has been difficult. I have two categorical variables, age and month, with 7 levels each. For a few combinations, say age = 7 and month = 7, there is no value, and when I use PROC SQL the intersections that have no entries do not show, e.g.:
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
THIS LINE DOESNT SHOW
What I want:
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
7 7 0
This happens a few times in the data, where the last groups don't have a value so they don't show, but I'd like them to appear for later purposes.
You have a few options available. The other answers both work on the premise of creating the master data and then merging it in.
Another is to use a PRELOADFMT and FORMATs or CLASSDATA option.
And the last - but possibly the easiest, if you have all months in the data set and all ages, then use the SPARSE option within PROC FREQ. It creates all possible combinations.
proc freq data=have;
table age*month /out = want SPARSE;
weight value;
run;
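The CLASSDATA route mentioned above can be sketched as follows. This is only an illustration, assuming the input is called `have` with numeric `age`, `month` and `value`; the names `levels` and `want_classdata` are made up for the example. Combinations that are absent from the data come back with a missing sum, so a final step turns those into 0.

```sas
/* Build every age*month pair from the levels seen in the data */
proc sql;
  create table levels as
  select a.age, m.month
  from (select distinct age from have) as a,
       (select distinct month from have) as m;
quit;

/* CLASSDATA= adds the listed combinations even when HAVE has no rows for them */
proc summary data=have classdata=levels nway;
  class age month;
  var value;
  output out=want_classdata(drop=_type_ _freq_) sum(value)=value;
run;

/* Absent combinations have a missing sum; replace it with 0 */
data want_classdata;
  set want_classdata;
  if missing(value) then value = 0;
run;
```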
First some sample data:
data test;
do age=1 to 7;
do month=1 to 12;
value = ceil(10*ranuni(1));
if ranuni(1) < .9 then
output;
end;
end;
run;
This leaves a few holes, notably, (1,1).
I would use a series of SQL statements to get the levels, cross join those, and then left join the values on, doing a coalesce to put 0 when missing.
proc sql;
create table ages as
select distinct age from test;
create table months as
select distinct month from test;
create table want as
select a.age,
a.month,
coalesce(b.value,0) as value
from (
select age, month from ages, months
) as a
left join
test as b
on a.age = b.age
and a.month = b.month;
quit;
Crossing the classification variables independently of the groups present requires selecting the distinct levels of each variable and cross joining them; this forms a hull that can be left joined to the original data. When an age*month combination has more than one row, you need to decide whether you want:
- rows with repeated age and month and the original value, or
- rows with distinct age and month, with either
  - an aggregate function to summarize the values, or
  - an indication of too many values
data have;
input age month value;
datalines;
1 1 4
2 1 12
3 1 5
7 1 6
1 7 8
5 7 44
6 7 5
8 8 1
8 8 11
run;
proc sql;
create table want1(label="Original class combos including duplicates and zeros for absent cross joins")
as
select
allAges.age
, allMonths.month
, coalesce(have.value,0) as value
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
order by
allMonths.month, allAges.age
;
quit;
And a slight variation that marks duplicated class crossings
proc format;
value S_V_V .t = 'Too many source values'; /* single valued value */
run;
proc sql;
create table want2(label="Distinct class combos allowing only one contributor to value, or defaulting to zero when none")
as
select distinct
allAges.age
, allMonths.month
, case
when count(*) = 1 then coalesce(have.value,0)
else .t
end as value format=S_V_V.
, count(*) as dup_check
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
group by
allMonths.month, allAges.age
order by
allMonths.month, allAges.age
;
quit;
This type of processing can also be done in Proc TABULATE using the CLASSDATA= option.
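A minimal sketch of the CLASSDATA= approach, assuming the `have` dataset from this answer; the table name `allCombos` is hypothetical. The MISSTEXT= option only controls what is printed for empty cells, so this produces a report rather than a dataset with zeros.

```sas
/* All age*month combinations, built the same way as in the SQL above */
proc sql;
  create table allCombos as
  select allAges.age, allMonths.month
  from (select distinct age from have) as allAges
  cross join
       (select distinct month from have) as allMonths;
quit;

/* CLASSDATA= makes TABULATE show combinations that have no rows in HAVE */
proc tabulate data=have classdata=allCombos;
  class age month;
  var value;
  table month*age, value*sum / misstext='0';
run;
```

Note that duplicated crossings such as (8,8) are summed here, which corresponds to the "aggregate function" choice in the list of options.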
ID GET_DRUG HOSP DATE QTY
A H111 H111 2021/12/31 3
A H112 H112 2022/1/10 4
A H110 H110 2022/1/13 5
A D110 H110 2022/1/14 6
A D111 H110 2022/1/16 3
A H112 H112 2022/1/23 4
A D113 H110 2022/1/30 5
A D114 H110 2022/2/13 5
Step (1). For each row whose GET_DRUG value starts with "D", calculate the number of days between it and each row above it, keeping only the records where DATE_DIFFERENCE <= 15 days.
Step (2). Count the distinct HOSP values and sum QTY over the Step (1) result.
Step (3). Count how many Step (2) results have HOSP_NUM >= 2 and QTY_SUM >= 10.
Screenshot of the calculation: https://i.stack.imgur.com/029Xl.png
The final answer is "2", covering the two combinations "2021/12/31~2022/1/13" and "2022/1/10~2022/1/14".
How to use SAS to calculate like this?
Many thanks.
Here is a SQL method where you merge the data with itself, linking to the D record.
Filter for the date intervals and aggregate by the episode defined by the first four variables.
data have;
infile cards truncover; /* the rows shown here are space-delimited; add dlm='09'x if your data is tab-delimited */
input ID $ GET_DRUG $ HOSP $ DATE : yymmdd10. QTY;
format date date9.;
cards;
A H111 H111 2021/12/31 3
A H112 H112 2022/1/10 4
A H110 H110 2022/1/13 5
A D110 H110 2022/1/14 6
A D111 H110 2022/1/16 3
A H112 H112 2022/1/23 4
A D113 H110 2022/1/30 5
A D114 H110 2022/2/13 5
;
run;
proc sql;
create table merged as select a.id, a.get_drug, a.hosp, a.date,
/*count number of distinct hospitals*/
count(distinct b.hosp) as num_distinct_hospitals, /*sum quantity*/
sum(b.qty) as sum_qty
from have as a left join have as b
/*join on same id*/
on a.id=b.id
/*date <15 - note that boundaries are included*/
and b.date between a.date-14 and a.date
/*do not join on same drug, may need to tweak this*/
and a.get_drug ne b.get_drug
/*use drugs that start as D for the first table*/
where substr(a.get_drug, 1, 1)='D'
/*group results by episode - may be useful to create an episode ID instead to simplify merge*/
group by a.id, a.get_drug, a.hosp, a.date;
quit;
proc sql;
create table want as
select count(*) as result
from merged
where num_distinct_hospitals>=2 and sum_qty >= 10;
quit;
There are different solutions to get the result.
Here is one solution with the following steps:
1. Sort your data by ID and descending date.
2. Use a data step with first.ID and retain statements to find the first row of each ID group and to keep the values of the "D" row. Additionally, check that the first row of an ID group is a "D" row.
3. In the data step, go through the data, count the distinct HOSP values and calculate the difference between dates.
4. Count the number of cases that meet your final condition.
I hope I understood your task correctly.
Dataset a:-
cc dob enrolled
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
SAS Code:-
proc sql;
select distinct count(*) as cust_enrolled ,year(enrolled) as yr
from a
group by yr
order by cust_enrolled desc;
quit;
Result:-
cust_enrolled yr
5 2010
1 2013
1 2004
1 1999
1 2001
1 2007
My query is to get the first row from this result. How can I achieve this?
Typically I would use a HAVING clause that tests an aggregate, such as freq=max(freq). However, since freq is itself an aggregate, count(*), it has to be computed in a sub-select first.
Example:
data have;
input cc dob: mmddyy10. enrolled: mmddyy10.;
format dob enrolled mmddyy10.;
datalines;
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
;
proc sql;
create table most_popular_enrollment_year as
select * from
(select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
;
quit;
If several years tie for the maximum enrollment count, the query returns multiple rows. If you want the earliest of those years, you need another level of nesting.
proc sql;
create table earliest_most_popular as
select * from
(
select * from
(
select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
)
having yr_enroll=min(yr_enroll)
;
quit;
Another way is to sort by yr_enroll and use the PROC SQL option OUTOBS=1 to grab the first row:
proc sql outobs=1;
create table earliest_most_popular as
select * from
(
select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
order by yr_enroll
;
reset outobs=max;
quit;
You can use the OUTOBS option of PROC SQL to control how many observations the SELECT statement writes to the output destination(s).
First let's convert your listing into an actual dataset.
data have;
input cc dob :mmddyy. enrolled :mmddyy.;
format dob enrolled date9.;
datalines;
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
;
Now let's run your SELECT statement with OUTOBS set to 1. Make sure to give it some criteria for deciding which observation to take when there are ties for the largest count.
proc sql outobs=1;
select year(enrolled) as yr
, count(*) as cust_enrolled
from have
group by yr
order by cust_enrolled desc, yr
;
quit;
Results:
cust_
yr enrolled
----------------------
2010 5
You can use data set options anywhere. SQL doesn't guarantee an order, so you will often want logic that's more complicated than simply taking the first row, but if that's what you want, the OBS=1 data set option is a decent choice.
proc sql;
select * from sashelp.class(obs=1);
quit;
If you want something besides the first, use FIRSTOBS and OBS together.
proc sql;
select * from sashelp.class(firstobs=10 obs=10);
quit;
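Since the order is only dependable once it is stated explicitly, one way to combine the two ideas is to store the counts with an ORDER BY and then apply OBS=1 to that table. This is a sketch, assuming the `have` dataset from the question (with `enrolled` read as a SAS date); the names `counts` and `top_year` are made up for illustration.

```sas
/* Counts per enrollment year, largest first; ties are broken by the earliest year */
proc sql;
  create table counts as
  select year(enrolled) as yr,
         count(*) as cust_enrolled
  from have
  group by yr
  order by cust_enrolled desc, yr;
quit;

/* OBS=1 keeps only the first row of the already ordered table */
data top_year;
  set counts(obs=1);
run;
```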
I have data that's tracking a certain eye phenomenon. Some patients have it in both eyes, and some patients have it in a single eye. This is what some of the data looks like:
EyeID PatientID STATUS Gender
1 1 1 M
2 1 0 M
3 2 1 M
4 3 0 M
5 3 1 M
6 4 1 M
7 4 0 M
8 5 1 F
9 6 1 F
10 6 0 F
11 7 1 F
12 8 1 F
13 8 0 F
14 9 1 F
As you can see from the data above, there are 9 patients total and all of them have the particular phenomenon in one eye.
I need to count the number of patients with this eye phenomenon.
To get the number of total patients in the dataset, I used:
PROC FREQ data=new nlevels;
tables PatientID;
run;
To count the number of patients with this eye phenomenon, I used:
PROC SORT data=new out=new1 nodupkey;
by Patientid Status;
run;
proc freq data=new1 nlevels;
tables Status;
run;
This gave the correct number of patients with the phenomenon (9), but not the correct number without (0).
I now need to calculate the gender distribution of this phenomenon. I used:
proc freq data=new1;
tables gender*Status/chisq;
run;
However, the cross table has the correct number of patients who have the phenomenon (9), but not the correct number without (0). Does anyone have any thoughts on how to do this chi-square, where a patient who has the phenomenon in at least one eye counts as positive for it?
Thanks!
PROC FREQ is doing what you told it to: counting the status=0 cases.
In general here you are using sort of blunt tools to accomplish what you're trying to accomplish, when you probably should use a more precise tool. PROC SORT NODUPKEY is sort of overkill for example, and it doesn't really do what you want anyway.
To set up a dataset of has/doesn't have, for example, let's do a few things. First I add one more row - someone who actually doesn't have - so we see that working.
data have;
input eyeID patientID status gender $;
datalines;
1 1 1 M
2 1 0 M
3 2 1 M
4 3 0 M
5 3 1 M
6 4 1 M
7 4 0 M
8 5 1 F
9 6 1 F
10 6 0 F
11 7 1 F
12 8 1 F
13 8 0 F
14 9 1 F
15 10 0 M
;;;;
run;
Now we use the data step. We want a patient-level dataset at the end, where we have eye-level now. So we create a new patient-level status.
data patient_level;
set have;
by patientID;
retain patient_status;
if first.patientID then patient_status =0;
patient_status = (patient_Status or status);
if last.patientID then output;
keep patientID patient_Status gender;
run;
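The same patient-level rollup can also be written as a single PROC SQL query. This is an alternative sketch, not part of the original answer; it assumes `status` is coded 0/1 and that gender is constant within a patient, and the output name `patient_level_sql` is hypothetical.

```sas
proc sql;
  create table patient_level_sql as
  select patientID,
         max(gender) as gender,          /* gender is assumed identical on every row of a patient */
         max(status) as patient_status   /* 1 if any eye has status=1, otherwise 0 */
  from have
  group by patientID;
quit;
```

Unlike the data step with BY processing, this version does not need the input to be sorted by patientID.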
Now, we can run your second proc freq. Also note you have a nice dataset of patients.
title "Patients with/without condition in any eye";
proc freq data=patient_level;
tables patient_status;
run;
title;
You also may be able to do your chi-square analysis, though I'm not a statistician and won't dip my toe into whether this is an appropriate analysis. It's likely better than your first, anyway - as it correctly identifies has/doesn't have status in at least one eye. You may need a different indicator, if you need to know number of eyes.
title "Crosstab of gender by patient having/not having condition";
proc freq data=patient_level;
tables gender*patient_Status/chisq;
run;
title;
If your actual data has every single patient having the condition, of course, it's unlikely a chi-square analysis is appropriate.
Let's say I have data which looks like:
ID A1Q A2Q B1Q B2Q Continued
23 1 2 2 3
24 1 2 3 3
To read the table: the person with ID 23 gave answers 1, 2, 2, 3 to questions A1Q, A2Q, B1Q, B2Q respectively. I want to find the percentage of students who answered 1, 2 or 3 across the entire dataset.
I have tried using
PROC FREQ data = test.one;
tables A1Q-A2Q;
tables B1Q-B2Q;
RUN;
But this does not get me what I want. It separately analyzes each question and the output is long. I just need it into one table that tells me this percentage answered 1, this percentage answered 2 and etc.
The output could be:
Question: 1 2 3
Percentage A1Q 40% 40% 20%
Percentage A2Q 60% 20% 20%
Total Percentage 30% 30% 40%
So, for question A1Q, 40% of people chose 1, 40% chose 2, and 20% chose 3. The total percentage is out of all the answers given: 30% chose 1, 30% chose 2, and 40% chose 3.
You'd still need to work on it a little bit and transpose the final results but this could be a start... also if you have lots of questions, consider wrapping this up in a macro program.
data quest;
input ID A1Q A2Q B1Q B2Q;
datalines;
21 2 3 1 2
22 3 2 2 3
23 1 2 2 3
24 1 2 3 3
25 2 1 3 3
run;
options missing = 0;
proc freq data=quest;
table A1Q / nocol nocum nofreq out = freq1(rename=(A1Q=Answer Count=A1Q));
table A2Q / nocol nocum nofreq out = freq2;
table B1Q / nocol nocum nofreq out = freq3;
table B2Q / nocol nocum nofreq out = freq4;
run;
proc sql;
create table results as
select freq1.Answer,
freq1.Percent as pctA1Q,
freq2.Percent as pctA2Q,
freq3.Percent as pctB1Q,
freq4.Percent as pctB2Q
from freq1
left join freq2
on freq1.Answer = freq2.A2Q
left join freq3
on freq1.Answer = freq3.B1Q
left join freq4
on freq1.Answer = freq4.B2Q;
quit;
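The transpose step mentioned at the start of this answer could look roughly like this. It is a sketch that assumes the `results` table created above; the output name `results_wide` and the `answer_` prefix are chosen only for illustration.

```sas
/* One row per question, one column per answer value (answer_1, answer_2, ...) */
proc transpose data=results
               out=results_wide(rename=(_name_=question))
               prefix=answer_;
  id Answer;
  var pctA1Q pctA2Q pctB1Q pctB2Q;
run;
```

Because `results` is driven by `freq1`, an answer value that never occurs for A1Q would be missing from this layout as well.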
My suggestion would be to transpose your data and then do a proc freq or proc tabulate. I would recommend proc tabulate so you can format your output, since it looks like you have questions that are grouped.
data long;
set have;
array qs(*) a1q--b2q; *list first and last variable and everything in between will be included;
do i=1 to dim(qs);
question=vname(qs(i));
response=qs(i);
output;
end;
keep id question response;
run;
proc freq data=long;
table question*response/list;
run;
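For the PROC TABULATE variant recommended above, a minimal sketch on the same `long` dataset might be the following. The labels and the `f=6.1` format are illustrative choices, and ROWPCTN reports each response as a percentage of its row total.

```sas
proc tabulate data=long;
  class question response;
  /* one row per question plus an overall row; one column per response value */
  table question all='All questions',
        response*rowpctn='Percent'*f=6.1;
run;
```

The `all` row pools every question together, which is one way to read the "Total Percentage" line in the desired output.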
I want to count the number of times a certain value appears in a particular column in SAS. For example, in the following dataset the value 1 appears 3 times, value 2 appears twice, value 3 appears once, value 4 appears 4 times and value 5 appears 4 times.
Game_ball
1
1
1
2
2
3
4
4
4
4
5
5
5
5
I want the dataset to be represented like the following:
Game_ball Count
1 3
2 2
3 1
4 4
5 4
. .
. .
. .
Thanks in advance
As per @Dwal, proc freq is the easiest solution.
Using your sample data,
proc freq data=sample;
table game_ball/out=output;
run;
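Or do it in a one-pass data step (after sorting):
proc sort data = sample; by game_ball; run;
data output;
set sample;
by game_ball; /* BY enables first./last. processing on the sorted data */
if first.game_ball then count = 0;
count + 1; /* the sum statement retains count across rows */
if last.game_ball then output;
run;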
Or in SQL
proc sql;
create table output as
select game_ball, count(*) as count
from sample
group by game_ball;
quit;