SAS Count Frequency under some conditions - sas

ID  GET_DRUG  HOSP  DATE        QTY
A   H111      H111  2021/12/31  3
A   H112      H112  2022/1/10   4
A   H110      H110  2022/1/13   5
A   D110      H110  2022/1/14   6
A   D111      H110  2022/1/16   3
A   H112      H112  2022/1/23   4
A   D113      H110  2022/1/30   5
A   D114      H110  2022/2/13   5
Step 1. For each row whose GET_DRUG value starts with "D", calculate the difference in days between that row and each row above it, keeping only the records with DATE_DIFFERENCE <= 15 days.
Step 2. Count the distinct HOSP values and sum QTY over the Step 1 result.
Step 3. Count how many Step 2 results have HOSP_NUM >= 2 and QTY_SUM >= 10.
The final answer is 2, covering the two combinations "2021/12/31~2022/1/13" and "2022/1/10~2022/1/14".
How can I calculate this in SAS?
Many thanks.

Here is a SQL method where you merge the data with itself, linking to the D record.
Filter for the date intervals and aggregate by the episode defined by the first four variables.
data have;
/* the values below are space-delimited, so the default delimiter is used */
infile cards truncover;
input ID $ GET_DRUG $ HOSP $ DATE : yymmdd10. QTY;
format date date9.;
cards;
A H111 H111 2021/12/31 3
A H112 H112 2022/1/10 4
A H110 H110 2022/1/13 5
A D110 H110 2022/1/14 6
A D111 H110 2022/1/16 3
A H112 H112 2022/1/23 4
A D113 H110 2022/1/30 5
A D114 H110 2022/2/13 5
;
run;
proc sql;
create table merged as select a.id, a.get_drug, a.hosp, a.date,
/*count number of distinct hospitals*/
count(distinct b.hosp) as num_distinct_hospitals, /*sum quantity*/
sum(b.qty) as sum_qty
from have as a left join have as b
/*join on same id*/
on a.id=b.id
/*date <15 - note that boundaries are included*/
and b.date between a.date-14 and a.date
/*do not join on same drug, may need to tweak this*/
and a.get_drug ne b.get_drug
/*use drugs that start as D for the first table*/
where substr(a.get_drug, 1, 1)='D'
/*group results by episode - may be useful to create an episode ID instead to simplify merge*/
group by a.id, a.get_drug, a.hosp, a.date;
quit;
proc sql;
create table want as
select count(*) as result
from merged
where num_distinct_hospitals>=2 and sum_qty >= 10;
quit;

There are different ways to get the result.
Here is one solution with the following steps:
1. Sort your data by ID and descending date.
2. Use a data step with first.ID and RETAIN statements to find the first row of each ID group and to keep the values of the "D" row. Additionally, check that the first row of an ID group is a "D" row.
3. In the data step, go through the data, count the distinct HOSP values, and calculate the difference between dates.
4. Count the number of cases that meet your final condition.
I hope I understood your task correctly.
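One possible data step along these lines, as a sketch only. It departs from the steps above in that it sorts ascending and looks back over earlier rows with POINT= instead of sorting descending. It assumes the have data set from the first answer, the same 14-day window used there, and HOSP values that contain no embedded blanks. The names sorted, episodes, seen, hosp_num and qty_sum are made up for illustration.

```sas
proc sort data=have out=sorted;
  by id date;
run;

data episodes(keep=id get_drug date hosp_num qty_sum);
  set sorted;                          /* current row, read sequentially   */
  if substr(get_drug, 1, 1) = 'D';     /* only "D" rows start a look-back  */
  length seen $200;                    /* assumed long enough for the list */
  seen = ' ';
  hosp_num = 0;
  qty_sum = 0;
  /* walk backwards over the rows above the current one */
  do p = _n_ - 1 to 1 by -1;
    set sorted(keep=id hosp date qty
               rename=(id=b_id hosp=b_hosp date=b_date qty=b_qty)) point=p;
    /* stop at another ID or once the row is more than 14 days older */
    if b_id ne id or date - b_date > 14 then leave;
    qty_sum + b_qty;
    if not findw(seen, strip(b_hosp), ' ') then do;
      hosp_num + 1;                    /* first time this HOSP is seen */
      seen = catx(' ', seen, b_hosp);
    end;
  end;
run;

proc sql;
  select count(*) as result
  from episodes
  where hosp_num >= 2 and qty_sum >= 10;
quit;
```

Unlike the SQL join, this version only looks at rows above the current "D" row, so the two approaches can differ when several rows share the same date.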

Related

Use computed macro variable as a new column name in existing table

I am generating a report that does a look back over the past 45ish days. The requestor wants the dates as the column headers so I am trying to write a macro loop that just goes through the dates in reverse order to create the columns. I will then write logic to take that rows ID and that columns date to populate the values needed. However, I am having difficulty getting the date variable I have computed turned into the new column header.
The macro loop works and creates the correct date in the variable in each iteration, but how do I take that and make it a new column in the table?
Desired output is this:
ID  Name    08Nov2022  07Nov2022  06Nov2022
1   Cell 2  0          0          0
2   Cell 4  0          0          0
%LET iDayCount=45;
/* Create a new temp table by selecting the values from a permeanent table housing the category IDs, names and details
Call this temp.parent_table*/
%MACRO test;
DATA temp.parent_table;
SET temp.parenet_table;
%LET today=sysfunc(today));
%DO iCounter=0 %TO &iDayCount;
%LET colName=%sysfunc(intnx(day,-&iCounter),date9.);
/* THIS IS WHERE IT GOES OFF THE RAILS */
/* I want to use colName value as a new column in the temp.parent_table*/
&colName = 0;
%END
RUN;
%MEND;
%test;
The log has a note for each iteration:
NOTE: Line generated by the macro variable "COLNAME".
"08NOV2022
Each date in the note is underlined red with the error message:
Error 180-322: Statement is not valid or it is used out of proper order
As always your help is appreciated.
The easiest way to make a report that has date values as column headers is to use PROC REPORT. Store the date values in a variable and use it as an ACROSS variable in the report.
So if you have data like this:
Obs ID Name date value
1 1 Cell2 08NOV2022 1
2 1 Cell2 07NOV2022 2
3 1 Cell2 06NOV2022 3
4 2 Cell4 08NOV2022 4
5 2 Cell4 07NOV2022 5
6 2 Cell4 06NOV2022 6
You can make your report using code like this:
proc report ;
columns id name value,date ;
define id/group;
define name/group;
define value / sum ' ';
define date / across order=internal descending ' ';
run;
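If the columns really do have to exist as data set variables, the loop in the question fails mainly because a name such as 08NOV2022 starts with a digit, which is not a valid SAS variable name under the default naming rules. The question's code also lacks the % on sysfunc, the start date argument to INTNX, and the semicolon after %END. A minimal sketch of a corrected loop follows. It assumes a hypothetical input table work.parent_table and prefixes each name with an underscore; the output name work.parent_with_dates and the macro name add_date_cols are made up for illustration.

```sas
%let iDayCount = 45;

%macro add_date_cols;
  data work.parent_with_dates;
    set work.parent_table;   /* hypothetical input table */
    %do iCounter = 0 %to &iDayCount;
      /* today minus iCounter days, formatted like 08NOV2022 */
      %let colName = %sysfunc(intnx(day, %sysfunc(today()), -&iCounter), date9.);
      /* underscore prefix: a SAS name cannot start with a digit */
      _&colName = 0;
    %end;
  run;
%mend add_date_cols;

%add_date_cols
```

An alternative is to set options validvarname=any and use a name literal such as "&colName"n, at the cost of needing name literals everywhere the column is referenced later.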

Summarise and calculate the items specifically in the dataset using proc sql

My dataset and attempt
data mydata;
input Category $ Item $;
datalines;
A 1
A 1
A 2
B 3
B 1
;
proc sql;
create table mytable as
select *, count(Category) as Total_No_in_Category, count(Category)-count(item, "3") as No_of_not_3_in_the_same_category from mydata
group by Category;
run;
Result
Category No_of_not_3_in_the_same_category Total_No_in_Category
A 3 3
A 3 3
A 3 3
B 2 2
B 2 1
My expected result
Category No_of_not_3_in_the_same_category Total_No_in_Category
A 2 3
B 1 2
I wonder how to achieve the expected result using only proc SQL. Thank you so much.
The two-argument COUNT(item, "3") function call is not a summary function. That causes all rows from the original table to be automatically remerged with the aggregate computations (the count() calls). The remerge is a proprietary feature of SAS Proc SQL and not part of the ANSI standard for SQL.
You appear to want the number of unique non-3 item values, so you will need a
COUNT(DISTINCT ...expression...)
in the query. The ...expression... can be a case clause that transforms item="3" to a null value by not having an else part of the case clause.
Example:
proc sql;
create table want as
select
category
, count(*) as freq
/* CASE without ELSE yields a null for item="3", which COUNT ignores */
, count(distinct case when item ne "3" then item end) as n_unq_item_not_3
from mydata
group by category
;
quit;

How can I get the top row out of the following data in sas using proc sql

Dataset a:-
cc dob enrolled
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
SAS Code:-
proc sql;
select distinct count(*) as cust_enrolled ,year(enrolled) as yr
from a
group by yr
order by cust_enrolled desc;
quit;
Result:-
cust_enrolled yr
5 2010
1 2013
1 2004
1 1999
1 2001
1 2007
My query is to get the first row from this result. How can I achieve this?
Typically I would use a having clause testing an aggregate such as freq=max(freq). However, since freq is already an aggregate count(*) that has to be in a sub-select.
Example:
data have;
input cc dob: mmddyy10. enrolled: mmddyy10.;
format dob enrolled mmddyy10.;
datalines;
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
;
proc sql;
create table most_popular_enrollment_year as
select * from
(select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
;
quit;
If several years tie for the maximum enrollment count, the query will return multiple rows. If you want the earliest of those years, you need another level of nesting.
proc sql;
create table earliest_most_popular as
select * from
(
select * from
(
select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
)
having yr_enroll=min(yr_enroll)
;
quit;
Another way is to sort by yr_enroll and use the Proc SQL option OUTOBS=1 to grab the first row:
proc sql outobs=1;
create table earliest_most_popular as
select * from
(
select count(*) as freq, year(enrolled) as yr_enroll
from have
group by yr_enroll
)
having freq=max(freq)
order by yr_enroll
;
reset outobs=max;
quit;
You can use the OUTOBS option of PROC SQL to control how many observations the SELECT statement writes to the output destination(s).
First let's convert your listing into an actual dataset.
data have;
input cc dob :mmddyy. enrolled :mmddyy.;
format dob enrolled date9.;
datalines;
1 10-13-1981 10-13-2001
2 10-17-1984 12-15-2004
3 07-20-1957 12-20-2007
4 10-13-1989 12-24-2010
5 10-13-1996 12-28-2013
6 10-14-1996 12-11-1999
7 10-15-1996 12-24-2010
8 10-16-1996 12-24-2010
9 10-17-1996 12-24-2010
10 10-18-1996 12-24-2010
;
Now let's run your SELECT statement with OUTOBS set to 1. Make sure to give it some criteria for deciding which observation to take when there are ties for the largest count.
proc sql outobs=1;
select year(enrolled) as yr
, count(*) as cust_enrolled
from have
group by yr
order by cust_enrolled desc, yr
;
quit;
Results:
cust_
yr enrolled
----------------------
2010 5
You can use data set options anywhere. SQL doesn't guarantee an order, so you will often want logic that's more complicated than simply taking the first row, but if that is what you want, the OBS=1 data set option is a decent choice.
proc sql;
select * from sashelp.class(obs=1);
quit;
If you want something besides the first, use FIRSTOBS and OBS together.
proc sql;
select * from sashelp.class(firstobs=10 obs=10);
quit;

show all values in categorical variable

Searching Google for this has been difficult. I have two categorical variables, age and month, with 7 levels each. For a few combinations, say age = 7 and month = 7, there is no value, and when I use proc sql the intersections that have no entries do not show, e.g.:
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
THIS LINE DOESNT SHOW
What I want:
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
7 7 0
This happens a few times in the data, where the last groups don't have a value so they don't show, but I'd like them to appear for later purposes.
You have a few options available, both seem to work on the premise of creating the master data and then merging it in.
Another is to use a PRELOADFMT and FORMATs or CLASSDATA option.
And the last - but possibly the easiest, if you have all months in the data set and all ages, then use the SPARSE option within PROC FREQ. It creates all possible combinations.
proc freq data=have;
table age*month /out = want SPARSE;
weight value;
run;
First some sample data:
data test;
do age=1 to 7;
do month=1 to 12;
value = ceil(10*ranuni(1));
if ranuni(1) < .9 then
output;
end;
end;
run;
This leaves a few holes, notably, (1,1).
I would use a series of SQL statements to get the levels, cross join those, and then left join the values on, doing a coalesce to put 0 when missing.
proc sql;
create table ages as
select distinct age from test;
create table months as
select distinct month from test;
create table want as
select a.age,
a.month,
coalesce(b.value,0) as value
from (
select age, month from ages, months
) as a
left join
test as b
on a.age = b.age
and a.month = b.month;
quit;
The group-independent crossing of the classification variables requires that a distinct selection of each level variable be cross joined with the others; this forms a hull that can be left joined to the original data. For the case of an age*month combination having more than one item, you need to decide whether you want
- rows with repeated age and month and the original value, or
- rows with distinct age and month, with either
  - an aggregate function to summarize the values, or
  - an indication of too many values.
data have;
input age month value;
datalines;
1 1 4
2 1 12
3 1 5
7 1 6
1 7 8
5 7 44
6 7 5
8 8 1
8 8 11
;
run;
proc sql;
create table want1(label="Original class combos including duplicates and zeros for absent cross joins")
as
select
allAges.age
, allMonths.month
, coalesce(have.value,0) as value
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
order by
allMonths.month, allAges.age
;
quit;
And a slight variation that marks duplicated class crossings
proc format;
value S_V_V .t = 'Too many source values'; /* single valued value */
run;
proc sql;
create table want2(label="Distinct class combos allowing only one contributor to value, or defaulting to zero when none")
as
select distinct
allAges.age
, allMonths.month
, case
when count(*) = 1 then coalesce(have.value,0)
else .t
end as value format=S_V_V.
, count(*) as dup_check
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
group by
allMonths.month, allAges.age
order by
allMonths.month, allAges.age
;
quit;
This type of processing can also be done in Proc TABULATE using the CLASSDATA= option.
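A rough sketch of that CLASSDATA= approach, assuming the have data set defined above. The names classes and want_tab are made up for illustration. The CLASSDATA= data set lists every age and month combination that should appear in the output, and its class variables must match those in have by name and type.

```sas
/* every age x month combination that should appear in the output */
proc sql;
  create table classes as
  select a.age, m.month
  from (select distinct age from have) as a,
       (select distinct month from have) as m;
quit;

proc tabulate data=have classdata=classes out=want_tab;
  class age month;
  var value;
  /* one row per age*month crossing, showing the summed value */
  table age*month, value*sum;
run;
```

Crossings that have no rows in have should come out with a missing sum rather than 0, so a follow-up step (for example COALESCE on the output column, which PROC TABULATE typically names value_Sum) would still be needed to show zeros.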

SAS software: How to delete observations with more than five zeros for the dependent variable

I have a consumer panel data with weekly recorded spending at a retail store. The unique identifier is household ID. I would like to delete observations if there occurs more than five zeros in spending. That is, the household did not make any purchase for five weeks. Once identified, I will delete all observations associated with the household ID. Does anyone know how I can implement this procedure in SAS? Thanks.
I think proc SQL would be good here.
This could be done in a single step with a more complex subquery, but it is probably clearer to break it down into two steps:
1. Count how many zeroes each household ID has.
2. Filter to only include household IDs that have five or fewer zeroes.
proc sql;
create table zero_cnt as
select distinct household_id,
sum(case when spending = 0 then 1 else 0 end) as num_zeroes
from original_data
group by household_id;
create table wanted as
select *
from original_data
where household_id in (select distinct household_id from zero_cnt where num_zeroes <= 5);
quit;
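For comparison, the single-step version could look roughly like this sketch. It assumes the same original_data table with household_id and spending columns; the output name wanted_one_step is made up for illustration.

```sas
proc sql;
  create table wanted_one_step as
  select *
  from original_data
  where household_id in
        /* households with five or fewer zero-spending weeks */
        (select household_id
         from original_data
         group by household_id
         having sum(case when spending = 0 then 1 else 0 end) <= 5);
quit;
```

The HAVING clause does the work of the intermediate zero_cnt table, so the counts themselves are not kept for inspection.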
Edit:
If the zeroes have to be consecutive then the method of building the list of IDs to exclude is different.
* Sort by ID and date;
proc sort data = original_data out = sorted_data;
by household_id date;
run;
Use the LAG function to check the previous spending amounts.
More info on LAG here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm
data excluded;
set sorted_data;
by household_id;
array prev{*} _L1-_L4;
_L1 = lag(spending);
_L2 = lag2(spending);
_L3 = lag3(spending);
_L4 = lag4(spending);
* Create running count for the number of observations for each ID;
if first.household_id then spend_cnt = 0;
spend_cnt + 1;
* Check if current ID has at least 5 observations to check. If so, add up current spending and previous 4 (SUM ignores missing values) and output if the total is zero;
if spend_cnt >= 5 then do;
if sum(spending, of prev{*}) = 0 then output;
end;
keep household_id;
run;
Then just use a subquery or match merge to remove the IDs in the 'excluded' dataset.
proc sql;
create table wanted as
select *
from original_data
where household_id not in (select distinct household_id from excluded);
quit;