Count distinct id case when

I have a table tb with these variables: id, year (values from 2007 to 2014; each year can appear more than once), and age_category (values from 1 to 4).
I need to count the distinct ids for each year within each age category.
I tried the following, but I am not getting the result:
proc sql;
create table new as
select
COUNT( distinct id CASE WHEN year = 2007 and agecat=1 THEN id else 0 END) as yr_2007,
COUNT( distinct id CASE WHEN year = 2008 and agecat=1 THEN id else 0 END) as yr_2008,
COUNT( distinct id CASE WHEN year = 2009 and agecat=1 THEN id else 0 END) as yr_2009
from tb;

Your approach will need to use a construct such as
COUNT(distinct CASE when year=<year> and agecat=1 then id else . end) as year_<year>
Coding-wise this is probably the wrong approach, especially if you want the distinct counts for each agecat. You would need one expression for each year and agecat combination.
The transformation of the year values from being in a column to being in a column-name is a pivot (and data structure change) from data to metadata and leads to headaches. There are procedures such as PROC TABULATE and PROC REPORT for producing output that displays categorical data in columnar form. PROC TRANSPOSE can also be used to perform these pivots.
Sample code:
ods _all_ close;
ods html5 path='c:\temp' file='catcounts.html';
options nocenter nodate nonumber;
data have;
do id = 1 to 1000;
do year = 2007 to 2014;
if ranuni(123) < 0.75 then continue;
agecat = ceil(4*ranuni(123));
output;
end;
end;
run;
* ugly, expression pivot;
proc sql;
create table want(label="Unique ID count by year") as
select
count (distinct case when year=2007 and agecat=1 then id else . end) as yr_2007_age1_idcount,
count (distinct case when year=2008 and agecat=1 then id else . end) as yr_2008_age1_idcount,
count (distinct case when year=2009 and agecat=1 then id else . end) as yr_2009_age1_idcount
/* potentially many more */
from have;
quit;
* clean, categorical counts;
proc sql;
create table counts(label="Unique ID count by year and agecat") as
select
year, agecat, count(distinct id) as idCount
from have
group by year, agecat;
proc print noobs data=counts;
title "PRINT: Categorical counts results - output limited with where statement";
where year between 2010 and 2015 and agecat in (1,2);
run;
* pivot categorical, limited to agecat=1;
proc transpose data=counts out=counts_agecat1 prefix=year_;
id year;
var idCount;
where agecat=1;
run;
proc print noobs data=counts_agecat1;
title "PRINT: ID Counts for agecat=1";
title2 "Pivot made with TRANSPOSE and limited with where statement";
run;
proc transpose data=counts out=counts_wide prefix=count_ delimiter=_;
id year agecat;
var idCount;
where agecat in (1,2);
run;
proc print noobs data=counts_wide;
title "PRINT: ID counts for each year and agecat";
title2 "Pivot made with TRANSPOSE and ID year agecat and limited with where statement";
run;
proc tabulate data=counts;
title "Tabulate: Display the pre-computed categorical counts";
class year agecat;
var idCount;
table year='',agecat*idCount=''*sum=''*f=best12./nocellmerge;
run;
proc tabulate data=counts;
title "Tabulate: Display the pre-computed categorical counts";
class year agecat;
var idCount;
table agecat,year=''*idCount=''*sum=''*f=best12./nocellmerge;
run;
proc report data=counts;
title "Report: Display the pre-computed categorical counts";
column year agecat,idCount ;
define year / group;
define agecat / across;
run;
ods _all_ close;

Related

Deleting and adding specific row/ column in SAS output

I have the following data
DATA HAVE;
input year dz $8. area;
cards;
2000 stroke 08
2000 stroke 06
2000 stroke 06
;
run;
After using proc freq
proc freq data=have;
table area*dz/ list nocum ;
run;
I get the below output
In this output
I want to delete the 'dz', what can I do to delete this column?
I want a row in the end that gives 'total', what can I do to get a 'total' row?
Thank you!

There must be a better way of doing this, but the following code creates the desired table:
data have;
input year dz :$8. area; /* colon modifier stops reading dz at the blank instead of consuming 8 columns */
cards;
2000 stroke 08
2000 stroke 06
2000 stroke 06
;
run;
ods output List=list;
proc freq data=have;
table area*dz / list;
run;
data stage1;
set list(keep= area frequency percent CumFrequency CumPercent) end=eof;
area_char = put(area,best.-l); /* Convert it to char to add the Total row */
if eof then do;
call symputx("cumFreq", cumfrequency);
call symputx("cumPerc", cumpercent);
end;
drop area;
run;
data want;
retain area frequency percent; /* Put the variables in the desired order */
set stage1(rename=(area_char=area) drop=cumfrequency cumpercent) end=eof;
output;
if eof then do; /* Manually create the Total row */
area = "Total";
Frequency = &cumfreq.;
Percent = &cumperc.;
output;
end;
run;
Output (want table):

You should subset your data with a where clause and use a title statement if an important partitioning variable is to be removed from the output. If you didn't subset, how would your audience know whether a count contained, say, episodes of both stroke and ministroke, if ministroke was also in the data?
Compute the frequencies with PROC FREQ and use a reporting procedure (PRINT, REPORT, TABULATE) that summarizes to show a total line.
Example:
data have;
input year dz $ area;
cards;
2000 stroke 08
2000 stroke 06
2000 stroke 06
;
proc freq noprint data=have;
where dz = 'stroke';
table area / out=freqs;
run;
title 'Stroke dz';
title2 'print';
proc print data=freqs noobs label;
var area;
sum count percent;
run;
title2 'report';
proc report data=freqs;
columns area count percent;
define area / display;
define count / analysis;
rbreak after / summarize;
run;
title2 'tabulate';
proc tabulate data=freqs;
class area;
var count percent;
table area all, count percent;
run;

Thank you all for your valuable responses. The following code gives me the desired output in a concise way
proc freq data=HAVE;
tables area / list nocum out=a;
run;
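proc sql;
create table b as
select put(area, best12. -l) as area, count, percent from a /* area is numeric, so convert it to character to match 'Total' */
union all
select
'Total' as area,
sum(count) as count,
sum(percent) as percent
FROM a
;
quit;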
proc print data=b; run;

how to vertically sum a range of dynamic variables in sas?

I have a dataset in SAS in which the months would be dynamically updated each month. I need to calculate the sum vertically each month and paste the sum below, as shown in the image.
Proc means/ proc summary and proc print are not doing the trick for me.
I was given the following code before:
%let month = month name;
%put &month.;
data new_totals;
set Final_&month. end=end;
&month._sum + &month._final;
/*feb_sum + &month._final;*/
output;
if end then do;
measure = 'Total';
&month._final = &month._sum;
/*Feb_final = feb_sum;*/
output;
end;
drop &month._sum;
run;
The problem is that this has all the months hardcoded, which I don't want. I am not too familiar with loops or arrays, so I need a solution for this, please.

It may be better to use a reporting procedure such as PRINT or REPORT to produce the desired output.
data have;
length group $20;
do group = 'A', 'B', 'C';
array month_totals jan2020 jan2019 feb2020 feb2019 mar2019 apr2019 may2019 jun2019 jul2019 aug2019 sep2019 oct2019 nov2019 dec2019;
do over month_totals;
month_totals = 10 + floor(60 * rand('uniform')); /* random integer from 10 to 69 */
end;
output;
end;
run;
ods excel file='data_with_total_row.xlsx';
proc print noobs data=have;
var group ;
sum jan2020--dec2019;
run;
proc report data=have;
columns group jan2020--dec2019;
define group / width=20;
rbreak after / summarize;
compute after;
group = 'Total';
endcomp;
run;
ods excel close;
Data structure
The data sets you are working with are 'difficult' because the date aspect of the data is actually in the metadata, i.e. the column name. An even better approach, in SAS, is to keep the data in a categorical (long) layout with these columns:
group (categorical role)
month (categorical role)
total (continuous role)
Such data can be easily filtered with a where clause, and reporting procedures such as REPORT and TABULATE can use the month variable in a class statement.
Example:
data have;
length group $20;
do group = 'A', 'B', 'C';
do _n_ = 0 by 1 until (month >= '01feb2020'd);
month = intnx('month', '01jan2018'd, _n_);
total = 10 + floor(60 * rand('uniform')); /* random integer from 10 to 69 */
output;
end;
end;
format month monyy5.;
run;
proc tabulate data=have;
class group month;
var total;
table
group all='Total'
,
month='' * total='' * sum=''*f=comma9.
;
where intck('month', month, '01feb2020'd) between 0 and 13;
run;
proc report data=have;
column group (month,total);
define group / group;
define month / '' across order=data ;
define total / '' ;
where intck('month', month, '01feb2020'd) between 0 and 13;
run;

Here is a basic way. Borrowed sample data from Richard.
data have;
length group $20;
do group = 'A', 'B';
array months jan2020 jan2019 feb2020 feb2019 mar2019 apr2019 may2019 jun2019 jul2019 aug2019 sep2019 oct2019 nov2019 dec2019;
do over months;
months = 10 + floor(60 * rand('uniform')); /* random integer from 10 to 69 */
end;
output;
end;
run;
proc summary data=have;
var _numeric_;
output out=temp(drop=_:) sum=;
run;
data want;
set have temp (in=t);
if t then group='Total';
run;

SAS Demographic Table

I have been trying to create a demographic table like the one below, but I can't seem to append the different tables. Please advise on where I can make adjustments in the code.
            Group A                                    Group B
            cohort 1  cohort 2  cohort 3  subtotal     cohort 4  cohort 5  cohort 6  subtotal
Age
  n
  mean
  sd
  median
  min
Gender
  n
  female
  male
Race
  n
  white
  asian
  hispanic
  black
My Code:
PROC FORMAT;
value content
1=' '
2='Age'
3='Gender'
4='Race'
value sex
1=' n'
2=' female'
3=' male';
value race
1=' n'
2=' white'
3=' asian'
4=' hispanic'
5=' black';
value stat
1=' n'
2=' Mean'
3=' Std. Dev.'
4=' Median'
5=' Minimum';
RUN;
DATA testtest;
SET test.test(keep = id group cohort age gender race);
RUN;
data tottest;
set testtest;
output;
if prxmatch('m/COHORT 1|COHORT 2|COHORT 3/oi', cohort) then do;
cohort='Subtotal';
output;
end;
if prxmatch('m/COHORT 4|COHORT 5|COHORT 6/oi', cohort) then do;
cohort='Subtotal';
output;
end;
run;
data count;
if 0 then set testtest nobs=npats;
call symput('npats',put(npats,1.));
stop;
run;
proc freq data=tottest;
tables cohort /out=patk0 noprint;
tables cohort*sex /out=sex0 noprint;
tables cohort*race /out=race0 noprint;
run;
PROC MEANS DATA = testtest n mean std min median;
class cohort;
VAR age;
RUN;
I know that I would have to transpose it and put it in a report. But before I do that, how do I get the variables out of my proc means, proc freq, etc.?
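A minimal sketch of one way to capture those statistics as data sets, offered as an illustration rather than a complete solution. PROC FREQ already writes its counts to the data set named in OUT=, and PROC MEANS can do the same through an OUTPUT statement. The sample rows, the numeric coding of gender, and the data set names age_stats, gender_counts and age_long are assumptions made up for this example; only the variable names come from the keep= list above.

```sas
/* Hypothetical sample rows standing in for test.test */
data testtest;
  infile datalines dlm=',';
  length cohort $8;
  input id cohort $ age gender race;
  datalines;
1,COHORT 1,34,2,1
2,COHORT 1,41,1,2
3,COHORT 2,29,2,1
4,COHORT 2,55,1,4
5,COHORT 2,47,2,3
;
run;

/* OUTPUT OUT= stores the statistics in a data set instead of only printing them. */
/* NWAY keeps just the per-cohort rows and drops the overall row.                 */
proc means data=testtest noprint nway;
  class cohort;
  var age;
  output out=age_stats(drop=_type_ _freq_)
         n=n mean=mean std=sd median=median min=min;
run;

/* OUT= on the TABLES statement holds COUNT and PERCENT per cohort and gender */
proc freq data=testtest noprint;
  tables cohort*gender / out=gender_counts;
run;

/* One row per cohort and statistic, a convenient shape for later stacking */
proc transpose data=age_stats out=age_long(rename=(col1=value)) name=stat;
  by cohort;
  var n mean sd median min;
run;

proc print data=age_long noobs; run;
proc print data=gender_counts noobs; run;
```

Once every block (age, gender, race) is in the same long shape, the pieces can be stacked with a SET statement and passed to a reporting procedure.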

Calculating correlation and covariance for a event window in SAS

I have to calculate the correlation and covariance for my daily sales values for an event window. The event window is of 45 day period and my data looks like -
store_id date sales
5927 12-Jan-07 3,714.00
5927 12-Jan-07 3,259.00
5927 14-Jan-07 3,787.00
5927 14-Jan-07 3,480.00
5927 17-Jan-07 3,646.00
5927 17-Jan-07 3,316.00
4978 18-Jan-07 3,530.00
4978 18-Jan-07 3,103.00
4978 18-Jan-07 3,026.00
4978 21-Jan-07 3,448.00
Now, for every store_id, date combination, I need to go back 45 days (there is more data for each combination in my original data set) and calculate the correlation between sales and lag(sales), i.e. the first-order autocorrelation. As you can see, the date column is not continuous, so something like (date - 45) is not going to work.
I have gotten as far as this:
data ds1;
set ds;
by store_id;
LAG_SALE = lag(sales);
IF FIRST.store_id THEN DO;
LAG_SALE = .;
END;
run;
For calculating correlation and covariances -
proc corr data=ds1 outp=Corr cov; /** COV option includes covariances **/
by store_id date;
var sales lag_sale;
run;
But how do I insert the event window for each date, store_id combination? My final output should look something like this -
id date corr cov
5927 12-Jan-07 ... ...
5927 14-Jan-07 ... ...

Here is what I've come up with:
First I convert the date to a SAS date, which is the number of days since Jan. 1 1960:
data ds;
set ds (rename=(date=old_date));
date = input(old_date, date11.);
drop old_date;
run;
Then compute lag_sale (I am using the same calculation you used in the question, but make sure this is what you want to do. For some observations the lag sale is the previous recorded date, but for some it is the same store_id and date, just a different observation.):
proc sort data=ds; by store_id; run;
data ds;
set ds;
by store_id;
lag_sale = lag(sales);
if first.store_id then lag_sale = .;
run;
Then set up the final data set:
data final;
length store_id 8 date 8 cov 8 corr 8;
if _n_ = 0;
run;
Then create a macro which takes a store_id and date and runs proc corr. The first part of the macro selects only the data with that store_id and within the past 45 days of the date. Then it runs proc corr. Then it formats proc corr how you want it and appends the results to the "final" data set.
%macro corr(store_id, date);
data ds2;
set ds;
where store_id = &store_id and %eval(&date-45) <= date <=&date
and lag_sale ne .;
run;
proc corr noprint data=ds2 cov outp=corr;
by store_id;
var sales lag_sale;
run;
data corr2;
set corr;
where _type_ in ('CORR', 'COV') and _name_ = 'sales';
retain cov;
date = &date;
if _type_ = 'COV' then cov = lag_sale;
else do;
corr = lag_sale;
output;
end;
keep store_id date corr cov;
run;
proc append base=final data=corr2 force; run;
%mend corr;
Finally run the macro for each store_id/date combination.
proc sort data=ds out=ds3 nodupkey;
by store_id date;
run;
data _null_;
set ds3;
call execute(cats('%corr(', store_id, ',', date, ');')); /* cats() trims blanks and avoids numeric-to-character conversion notes */
run;
proc sort data=final;
by store_id date;
run;

Bar Graph by Month - SAS EG

I am trying to create a bar graph in SAS Enterprise Guide. The graph is Savings by Month.
The input Data is
Ref Date Savings
A 03JUN2013 1000
A 08JUN2013 2000
A 08JUL2013 1500
A 08AUG2013 300
A 08NOV2013 100
B 09DEC2012 500
B 09MAY2013 400
B 19MAY2013 5999
B 09OCT2013 511
C 15OCT2013 1200
C 01NOV2013 1500
The first step I do is to convert the date into a month. Then I use PROC MEANS to calculate total savings by month by Ref.
Then I create a bar graph. The issue is that the bars are not in sequential order: I get AUG13 JUL13 JUN13, etc., instead of JUN JUL AUG.
PROC SQL;
CREATE TABLE SAVINGS_11 AS
SELECT
PUT(DATE,monname3.) AS MONTH,
(DATE) FORMAT=MONNAME3. AS MONTH1,
MONTH(DATE) AS MONTH2,
PUT(DATE,MONYY5.) AS MONTH3,
(DATE) FORMAT=MONYY5. AS MONTH4,
DATE,
REF,
SAVINGS
FROM INPUT;
QUIT;
/* -------------------------------------------------------------------
Sort data set
------------------------------------------------------------------- */
PROC SORT
DATA=SAVINGS_11(KEEP=SAVINGS MONTH MONTH1 MONTH2 MONTH3 MONTH4 REF)
OUT=SORT1;
BY REF;
RUN;
/* -------------------------------------------------------------------
Run the Means Procedure
------------------------------------------------------------------- */
TITLE;
TITLE1 "Summary";
TITLE2 "Results";
FOOTNOTE;
PROC MEANS DATA=SORT1
NOPRINT
CHARTYPE
NOLABELS
NWAY
SUM NONOBS ;
VAR SAVINGS;
CLASS MONTH / ORDER=DATA ASCENDING;
BY REF;
ID MONTH1 MONTH2 MONTH3 MONTH4;
OUTPUT OUT=MEANSUMMARY
SUM()=
/ AUTONAME AUTOLABEL WAYS INHERIT
;
RUN;
/* -------------------------------------------------------------------
End of task code.
------------------------------------------------------------------- */
RUN; QUIT;
TITLE; FOOTNOTE;
PROC SORT
DATA=MEANSUMMARY(KEEP=MONTH MONTH2 "SAVINGS_Sum"n REF)
OUT=SORT2
;
BY REF MONTH2;
RUN;
Axis1
STYLE=1
WIDTH=1
MINOR=NONE
;
Axis2
STYLE=1
WIDTH=1
;
TITLE;
TITLE1 "Bar Chart";
FOOTNOTE;
PROC GCHART DATA=SORT2
;
VBAR
MONTH
/
SUMVAR="SAVINGS_Sum"n
CLIPREF
FRAME LEVELS=ALL
TYPE=SUM
INSIDE=SUM
COUTLINE=BLACK
RAXIS=AXIS1
MAXIS=AXIS2
;
BY REF;
/* -------------------------------------------------------------------
End of task code.
------------------------------------------------------------------- */
RUN; QUIT;
TITLE; FOOTNOTE;
Whatever format I use, the end result is not in a sequential order. Please help.

Your problem is that you're converting the date value to a character variable. MONTH, at least, should be a formatted date variable, not a character variable; so this line:
PUT(DATE,monname3.) AS MONTH,
should be
DATE AS MONTH FORMAT=monname3.,
Most procedures (like PROC MEANS and PROC GPLOT) will respect formats and group by same-formatted values. I don't completely understand why you have 5 month variables all containing different versions of the same thing, so perhaps there are better ways to do what you're doing here.
In particular, if you have SAS 9.2 or later, SGPLOT will probably do this entire process for you without any of the summarization steps.
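A minimal sketch of that SGPLOT route, assuming the input layout shown in the question. The data set names savings_raw and savings_sorted are made up for this example and stand in for INPUT. The idea is that VBAR groups the bars by the formatted value of the date and sums the response itself, so no separate summarization step is needed. As far as I know a discrete axis is ordered by the unformatted values by default, which should give chronological bars, but check the output on your SAS release.

```sas
/* Sample data copied from the question; savings_raw is a hypothetical name */
data savings_raw;
  input ref $ date :date9. savings;
  format date date9.;
  datalines;
A 03JUN2013 1000
A 08JUN2013 2000
A 08JUL2013 1500
A 08AUG2013 300
A 08NOV2013 100
B 09DEC2012 500
B 09MAY2013 400
B 19MAY2013 5999
B 09OCT2013 511
C 15OCT2013 1200
C 01NOV2013 1500
;
run;

/* BY-group processing needs the data sorted by the BY variable */
proc sort data=savings_raw out=savings_sorted;
  by ref date;
run;

title "Savings by Month";
proc sgplot data=savings_sorted;
  by ref;
  /* date stays numeric; the format only controls how it is grouped and labeled */
  vbar date / response=savings stat=sum datalabel;
  format date monyy7.;
  xaxis label="Month";
  yaxis label="Total savings";
run;
title;
```

Because the month is still a numeric date underneath, the same data set can be re-plotted by quarter or year just by changing the FORMAT statement.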

Apart from what Joe mentioned above, you also need to include the keyword DISCRETE in your VBAR statement if you want to see, on the x-axis, every month in which each reference made a saving (note: this will generate warning messages if some references do not have any savings in some months).
You could try the following code which I believe produces the output you are after:
PROC SQL;
CREATE TABLE DATA_TO_PLOT AS
SELECT
REF
,INPUT(PUT(date,YYMMN6.),YYMMN6.) FORMAT =DATE9. AS MONTH
,SUM(Savings) AS MONTHLY_SAVINGS
FROM INPUT
GROUP BY 1,2
ORDER BY 1,2 ;
QUIT;
Axis1 STYLE=1 WIDTH=1 MINOR=NONE;
Axis2 STYLE=1 WIDTH=1;
TITLE;
TITLE1 "Bar Chart";
PROC GCHART DATA=DATA_TO_PLOT;
VBAR MONTH
/ SUMVAR=MONTHLY_SAVINGS
CLIPREF
FRAME TYPE=SUM
COUTLINE=BLACK
RAXIS=AXIS1
MAXIS=AXIS2
INSIDE=SUM
DISCRETE
;
FORMAT MONTH MONYY7.;
BY Ref;
RUN; QUIT;
