Calculate Skewness in PROC REPORT - sas

I have the following sample data, which I'm creating a crosstab for:
data have1;
input username $ betdate : datetime. stake winnings;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90
player1 04NOV2008:09:03:44 100 40
player2 07NOV2008:14:03:33 120 -120
player1 05NOV2008:09:00:00 50 15
player1 05NOV2008:09:05:00 30 5
player1 05NOV2008:09:00:05 20 10
player2 09NOV2008:10:05:10 10 -10
player2 15NOV2008:15:05:33 35 -35
player1 15NOV2008:15:05:33 35 15
player1 15NOV2008:15:05:33 35 15
run;
PROC PRINT; RUN;
Proc rank data=have1 ties=mean out=ranksout groups=2;
var stake;
ranks stakeRank;
run;
PROC TABULATE DATA=ranksout NOSEPS;
VAR stake;
class stakerank;
TABLE stakerank, stake*N;
TABLE stakerank, stake*(N Mean Skewness);
RUN;
I want to replicate what I'm doing in PROC TABULATE in PROC REPORT as I need to add p-values for a Difference in Means test and a few other things. However, it seems that Skewness is not a built-in function in Proc Report. How can I calculate this?
PROC REPORT DATA=ranksout NOWINDOWS;
COLUMN stakerank stake, (n mean);
DEFINE stakerank / GROUP id 'Rank for Variable Stake' ORDER=INTERNAL;
DEFINE stake / ANALYSIS '';
define n/format=8. ;
RUN;
Thanks for any help at all on this

It can be done as follows.
Add an extra intermediate variable to the ranksout1 table:
proc sql;
create table withCubedDeviations as
select *,
((stake - (select avg(stake) from ranksout1 where stakeRank = main.stakeRank and winnerRank = main.winnerRank))/(select std(stake) from ranksout1 where stakeRank = main.stakeRank and winnerRank = main.winnerRank)) **3 format=8.2 as cubeddeviations
from ranksout1 main;
quit;
PROC REPORT DATA=withCubedDeviations NOWINDOWS out=report;
COLUMN stakerank winnerrank, ( N stake=avg cubeddeviations skewness);
DEFINE stakerank / GROUP ORDER=INTERNAL '';
DEFINE winnerrank / ACROSS ORDER=INTERNAL '';
DEFINE cubeddeviations / analysis 'SumCD' noprint;
DEFINE N / 'Bettors';
DEFINE avg / analysis mean 'Avg' format=8.2;
DEFINE skewness / computed format=8.2 'Skewness';
COMPUTE skewness;
_C5_ = _C4_ * (_C2_ / ((_C2_ -1) * (_C2_ - 2)));
_C9_ = _C8_ * (_C6_ / ((_C6_ -1) * (_C6_ - 2)));
ENDCOMP;
RUN;
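For reference, the COMPUTE block above appears to implement the adjusted sample skewness, which, as far as I know, is also what SAS reports as SKEWNESS under the default VARDEF=DF. Here n is the group count (_C2_ or _C6_), the sum is the summed cubeddeviations column (_C4_ or _C8_, since SUM is the default analysis statistic), and s is the sample standard deviation returned by std() in the SQL step:

```latex
G_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{3},
\qquad
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^{2}}
```

Note that the factor n / ((n-1)(n-2)) is undefined for n = 1 or n = 2, so cells with fewer than three observations will not produce a usable value.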
Why didn't they just add Skewness to the list of statistics that are allowed in a PROC REPORT?

Related

Isolate Patients with 2 diagnoses but diagnosis data is on different lines

I have a dataset of patient data with each diagnosis on a different line.
This is an example of what it looks like:
patientID diabetes cancer age gender
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
I need to isolate the patients who have a diagnosis of both diabetes and cancer; their unique patient identifier is patientID. Sometimes they are both on the same line, sometimes they aren't. I am not sure how to do this because the information is on multiple lines.
How would I go about doing this?
This is what I have so far:
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(DOB) as DOB
from diab_dx
group by patientID;
quit;
data final; set want;
if diabetes GE 1 AND cancer GE 1 THEN both = 1;
else both =0;
run;
proc freq data=final;
tables both;
run;
Is this correct?
If you want to learn about DATA steps, look up how this works (a DO UNTIL loop around the SET statement, often called a DOW loop):
data pat;
input patientID diabetes cancer age gender:$1.;
cards;
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
;;;;
run;
data both;
do until(last.patientid);
set pat; by patientid;
_diabetes = max(diabetes,_diabetes);
_cancer = max(cancer,_cancer);
end;
both = _diabetes and _cancer;
run;
proc print;
run;
Adding a HAVING clause at the end of the SQL query should do it:
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(age) as age
from PAT
group by patientID
having calculated diabetes ge 1 and calculated cancer ge 1;
quit;
You might find some coders, especially those coming from statistical backgrounds, are more likely to use Proc MEANS instead of SQL or DATA step to compute the diagnostic flag maximums.
proc means noprint data=have;
by patientID;
output out=want
max(diabetes) = diabetes
max(cancer) = cancer
min(age) = age
;
run;
or, when every variable uses the same aggregation function:
proc means noprint data=have;
by patientID;
var diabetes cancer;
output out=want max= ;
run;
or
proc means noprint data=have;
by patientID;
var diabetes cancer age;
output out=want max= / autoname;
run;
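To tie the PROC MEANS variants back to the original question, here is a minimal sketch that keeps only the patients with both diagnoses. It assumes the sample dataset pat from the DATA step answer above; the output name both_dx is made up for illustration. BY processing needs the input sorted by patientID, hence the PROC SORT.

```sas
/* BY-group processing requires sorted input */
proc sort data=pat;
  by patientID;
run;

/* One row per patient with the maximum of each flag;          */
/* the WHERE= data set option keeps only patients with both.   */
proc means noprint data=pat;
  by patientID;
  var diabetes cancer;
  output out=both_dx(where=(diabetes=1 and cancer=1) drop=_type_ _freq_) max=;
run;

proc print data=both_dx;
run;
```

With the six sample rows shown in the question, this should leave patients 1 and 2.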

SAS: Data Step. By Processing

How can I aggregate the following sample data to give customer-level calculations? I'm using a data step with 'by processing', but I'm not sure whether I should break this up into two data steps.
I need to extract the first type, first price, a count of types, a count of unique prices, a count for soccer bets and a count for baseball bets for each player.
I can't seem to combine both the type and price in the same data step.
data have;
input username $ betdate : datetime. stake type $ price sport $;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 SGL 5 SOCCER
player1 04NOV2008:09:03:44 30 SGL 4 SOCCER
player2 07NOV2008:14:03:33 120 SGL 5 SOCCER
player1 05NOV2008:09:00:00 50 SGL 4 SOCCER
player1 05NOV2008:09:05:00 30 DBL 3 BASEBALL
player1 05NOV2008:09:00:05 20 DBL 4 BASEBALL
player2 09NOV2008:10:05:10 10 DBL 5 BASEBALL
player2 15NOV2008:15:05:33 35 DBL 5 BASEBALL
player1 15NOV2008:15:05:33 35 TBL 5 BASEBALL
player1 15NOV2008:15:05:33 35 SGL 4 BASEBALL
run;
proc print;run;
proc sort data=have; by username dateonly betdate type price; run;
data want;
set have;
retain typecount pricecount firsttype firstprice soccercount baseballcount;
by username dateonly betdate;
if first.username then eventTime = 0;
if first.betdate then eventTime + 1;
if first.username then soccercount=0;
if first.username then baseballcount=0;
if index(upcase(sport),'SOCCER') and eventtime <=5 then soccercount+1;
else if eventtime <=5 then baseballcount+1;
if first.username and eventtime =1 then firsttype=type;
else if eventtime =1 then firsttype=type;
if first.username and eventtime =1 then firstprice=price;
else if eventtime =1 then firstprice=price;
if first.username then typecount=0;
if first.type then typecount+1;
if first.username then pricecount=0;
if first.price and eventtime <=5 then pricecount+1;
IF last.username THEN OUTPUT;
keep username soccercount baseballcount firsttype firstprice typecount pricecount;
run;
proc print;run;
This should do what you've requested within one data step:
proc sort data=have; by username dateonly betdate; run;
data want(drop= betdate dateonly stake type price sport TYPELIST PRICELIST) ;
set have;
LENGTH TYPELIST PRICELIST $200; *ARBITRARY LARGE LENGTH;
retain firsttype firstprice TYPELIST typecount PRICELIST pricecount soccercount baseballcount;
by username dateonly betdate;
if first.username then do ;
firsttype=type;
firstprice=PRICE;
typecount=0; pricecount=0; soccercount=0; baseballcount=0;
TYPELIST=""; PRICELIST="";
END;
if index(upcase(sport),'SOCCER') then soccercount+1;
if index(upcase(sport),'BASEBALL') then baseballcount+1;
IF find(TYPELIST,TYPE,'it')=0 THEN TYPELIST=CATX("|",TYPELIST,TYPE);
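IF findw(PRICELIST,strip(put(PRICE,best12.)),'|','it')=0 THEN PRICELIST=CATX("|",PRICELIST,PRICE); /* FINDW matches whole delimited values. FINDC would match any single character */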
IF last.username THEN DO;
typecount=LENGTH(TYPELIST)-LENGTH(COMPRESS(TYPELIST,"|"))+1;
pricecount=LENGTH(PRICELIST)-LENGTH(COMPRESS(PRICELIST,"|"))+1;
OUTPUT;
END;
run;
proc print data=want;run;
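As a cross-check on the counts (not the first type or first price, which depend on row order), the same per-player totals can be sketched in PROC SQL. This is only an illustration: the table name counts_check is made up, and it counts over all of a player's bets, like the data step above and unlike the eventtime <= 5 restriction in the question.

```sas
proc sql;
  create table counts_check as
  select username
       , count(distinct type)            as typecount
       , count(distinct price)           as pricecount
       /* a comparison evaluates to 1 or 0, so SUM counts matching rows */
       , sum(upcase(sport) = 'SOCCER')   as soccercount
       , sum(upcase(sport) = 'BASEBALL') as baseballcount
  from have
  group by username;
quit;

proc print data=counts_check;
run;
```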

SAS: Compute value of column under an ACROSS variable (Nested/Derived/Pseudo-Column)

I can't seem to include a computed variable in a PROC REPORT. It works fine when the computed variable is a headline column, but when it forms part of an ACROSS group, I can't get it to work. I've only got as far as referencing the columns directly, which only gives me the result for a single ACROSS group, not both.
data have1;
input username $ betdate : datetime. stake winnings winner;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90 0
player1 04NOV2008:09:03:44 100 40 1
player2 07NOV2008:14:03:33 120 -120 0
player1 05NOV2008:09:00:00 50 15 1
player1 05NOV2008:09:05:00 30 5 1
player1 05NOV2008:09:00:05 20 10 1
player2 09NOV2008:10:05:10 10 -10 0
player2 09NOV2008:10:05:40 15 -15 0
player2 09NOV2008:10:05:45 15 -15 0
player2 09NOV2008:10:05:45 15 45 1
player2 15NOV2008:15:05:33 35 -35 0
player1 15NOV2008:15:05:33 35 15 1
player1 15NOV2008:15:05:33 35 15 1
run;
PROC PRINT; RUN;
Proc rank data=have1 ties=mean out=ranksout1 groups=2;
var stake winner;
ranks stakeRank winnerRank;
run;
PROC REPORT DATA=ranksout1 NOWINDOWS out=report;
COLUMN stakerank winnerrank, (N stake=stakemean discountedstake);
DEFINE stakerank / GROUP '' ORDER=INTERNAL;
DEFINE winnerrank / ACROSS '' ORDER=INTERNAL;
DEFINE stake / analysis sum noprint;
DEFINE stakemean / analysis sum;
DEFINE discountedstake / computed format=8.2 'discountedstake';
COMPUTE discountedstake;
_C4_ = _C3_ -1;
ENDCOMP;
RUN;
I don't understand how a variable connected to an across group can be calculated. This only calculates the value of 'discountedstake' for column 'C4' and it doesn't make sense to do it again for column 7.
How can I include the value of that computed variable in each group?
PROC REPORT DATA=ranksout1 NOWINDOWS out=report;
COLUMN stakerank winnerrank, (N stake=stakemean discountedstake);
DEFINE stakerank / GROUP '' ORDER=INTERNAL;
DEFINE winnerrank / ACROSS '' ORDER=INTERNAL;
DEFINE stake / analysis sum noprint;
DEFINE stakemean / analysis sum;
DEFINE discountedstake / computed format=8.2 'discountedstake';
COMPUTE discountedstake;
_C4_ = _C3_ -1;
_C7_ = _C6_ -1;
ENDCOMP;
RUN;
You just need to mention each column you want calculated. You might be able to do this with an array if you have many of them, or do it in a data step/view ahead of time.

SAS: Add asterisk in Compute block (Traffic Lighting)

I have a PROC REPORT output and want to add an asterisk based on the value of the cell being less than 1.96. I don't want colours, just an asterisk after the number. Can this be done with a format, or do I need an 'IF/ELSE' clause in the COMPUTE block?
data have1;
input username $ betdate : datetime. stake winnings winner;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90 0
player1 04NOV2008:09:03:44 100 40 1
player2 07NOV2008:14:03:33 120 -120 0
player1 05NOV2008:09:00:00 50 15 1
player1 05NOV2008:09:05:00 30 5 1
player1 05NOV2008:09:00:05 20 10 1
player2 09NOV2008:10:05:10 10 -10 0
player2 09NOV2008:10:05:40 15 -15 0
player2 09NOV2008:10:05:45 15 -15 0
player2 09NOV2008:10:05:45 15 45 1
player2 15NOV2008:15:05:33 35 -35 0
player1 15NOV2008:15:05:33 35 15 1
player1 15NOV2008:15:05:33 35 15 1
run;
PROC PRINT; RUN;
Proc rank data=have1 ties=mean out=ranksout1 groups=2;
var stake winner;
ranks stakeRank winnerRank;
run;
proc sql;
create table withCubedDeviations as
select *,
((stake - (select avg(stake) from ranksout1 where stakeRank = main.stakeRank and winnerRank = main.winnerRank))/(select std(stake) from ranksout1 where stakeRank = main.stakeRank and winnerRank = main.winnerRank)) **3 format=8.2 as cubeddeviations
from ranksout1 main;
quit;
PROC REPORT DATA=withCubedDeviations NOWINDOWS out=report;
COLUMN stakerank winnerrank, ( N stake=avg cubeddeviations skewness);
DEFINE stakerank / GROUP ORDER=INTERNAL '';
DEFINE winnerrank / ACROSS ORDER=INTERNAL '';
DEFINE cubeddeviations / analysis 'SumCD' noprint;
DEFINE N / 'Bettors';
DEFINE avg / analysis mean 'Avg' format=8.2;
DEFINE skewness / computed format=8.2 'Skewness';
COMPUTE skewness;
_C5_ = _C4_ * (_C2_ / ((_C2_ -1) * (_C2_ - 2)));
_C9_ = _C8_ * (_C6_ / ((_C6_ -1) * (_C6_ - 2)));
ENDCOMP;
RUN;
This is just an example, so this won't make statistical sense, but if the value for SKEWNESS is greater than 1 I need to put a single asterisk, two asterisks if it's greater than 5 and three asterisks if the value is greater than 10. Also, if the asterisks could be in superscript that would be even better.
I've been testing the following, but to no avail:
PROC FORMAT;
picture onestar . = " " low - high = "9.9999^{super *}";*^{super***};
picture twostar . = " " low - high = "9.9999^{super **}";*^{super***};
picture threestar . = " " low - high = "9.9999^{super ***}";*^{super***};
run;
PROC REPORT DATA=withCubedDeviations NOWINDOWS out=report;
COLUMN stakerank winnerrank, ( N stake=avg cubeddeviations);
DEFINE stakerank / GROUP ORDER=INTERNAL '';
DEFINE winnerrank / ACROSS ORDER=INTERNAL '';
DEFINE cubeddeviations / analysis 'SumCD' noprint;
DEFINE N / 'Bettors';
DEFINE avg / mean 'Avg' format=8.2;
compute avg;
if _C3_ > 1.96 then call define('_C3_','format','onestar.');
endcomp;
RUN;
Thanks for any help.
I think this will do what you need:
proc format;
picture skewaskf
-1 <-<0 = '00009.99' (mult=100 prefix='-')
0-<1 = '00009.99' (mult=100)
1-<5 = '00009.99*'(mult=100)
5-<10= '00009.99**'(mult=100)
10-high='00009.99***'(mult=100);
quit;
Add further ranges if you need to cover values below -1.
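A minimal sketch of how the picture format could be attached to the computed column, assuming the withCubedDeviations table and the PROC REPORT step from the question. The only change is format=skewaskf. on the skewness DEFINE statement; the DATA _NULL_ step is just a quick way to see how a few arbitrary test values are rendered in the log.

```sas
/* Quick look at the format on a few arbitrary values (written to the log) */
data _null_;
  do x = -0.5, 0.5, 2, 7, 12;
    put x= x skewaskf.;
  end;
run;

/* Same report as in the question, with the picture format on the computed column */
proc report data=withCubedDeviations nowindows;
  column stakerank winnerrank, (N stake=avg cubeddeviations skewness);
  define stakerank       / group order=internal '';
  define winnerrank      / across order=internal '';
  define cubeddeviations / analysis 'SumCD' noprint;
  define N               / 'Bettors';
  define avg             / analysis mean 'Avg' format=8.2;
  define skewness        / computed format=skewaskf. 'Skewness';
  compute skewness;
    _C5_ = _C4_ * (_C2_ / ((_C2_ - 1) * (_C2_ - 2)));
    _C9_ = _C8_ * (_C6_ / ((_C6_ - 1) * (_C6_ - 2)));
  endcomp;
run;
```

This keeps the asterisks as plain text rather than superscript; values outside the ranges listed in the format fall back to default numeric formatting.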

SAS PROC TABULATE: Colour based on cell value

I have two cross-tabs being output in SAS: one for Time0 and one for Time1. I am interested in comparing the change in values in each of the cells in the first crosstab with those in the second.
Is there a clever way to change the background colour of a cell based on a comparison with an equivalent cell in another cross-tab? If not, and I create a variable with the change in the variable between Time0 and Time1, how can I change the cell colour of the crosstab depending on whether a value is positive or negative? Is it possible to put a colour gradient in increments of 5% if the cell contains a percentage change?
I have some sample data as follows:
data have;
input username $ betdate : datetime. stake;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90
player1 04NOV2008:09:03:44 30
player2 07NOV2008:14:03:33 120
player1 05NOV2008:09:00:00 50
player1 05NOV2008:09:05:00 30
player1 05NOV2008:09:00:05 20
player2 09NOV2008:10:05:10 10
player2 15NOV2008:15:05:33 35
player1 15NOV2008:15:05:33 35
player1 15NOV2008:15:05:33 35
run;
proc sort data=have; by username betdate; run;
data have;
set have;
by username dateOnly betdate;
retain eventTime;
if first.username then eventTime = 0;
if first.betdate then eventTime + 1;
run;
proc sql;
create table playerStats as
select
distinct username,
(select distinct avg(stake) from have where username = main.username and eventTime <= 1) format comma10.2 as bet1AvgStake,
(select distinct avg(stake) from have where username = main.username and eventTime <= 2) format comma10.2 as bet2AvgStake,
(select distinct avg(stake) from have where username = main.username and eventTime <= 3) format comma10.2 as bet3AvgStake
from have main;
quit;
Proc rank data=playerStats ties=mean out=customerStats groups=2;
var bet1AvgStake bet2AvgStake;
ranks bet1AvgStakeRank bet2AvgStakeRank;
run;
PROC TABULATE DATA=customerStats NOSEPS;
VAR bet1AvgStake bet2AvgStake;
class bet1AvgStakeRank;
TABLE bet1AvgStakeRank, bet1AvgStake*(N Mean);
TABLE bet1AvgStakeRank, bet2AvgStake*(N Mean);
RUN;
I would like to see a red cell when the value in each cell in the second crosstab is lower than the equivalent cell in the first and a green cell when the value is higher.
Thanks for any help on this.
I don't think you can do all that in a single proc, but you certainly can do part 2 if I understand properly. It's called "Traffic Lighting" more generally, to help with googling for more detailed information; for example, this paper has some examples of how to do so.
Generally, the concept is that you create a format, the label of which is a color:
proc format;
value betfmt
low - -5= 'red'
-5 <-< 0 = 'lightred'
0 - 5 ='lightgreen'
5 <- high = 'green'; *or hex values like 'cxFF0099';
quit;
Then use that format in the proc tabulate:
proc tabulate data=yourdata;
var bets;
tables bets*sum*[style=[background=betfmt.]];
run;
It does need to be based on the current cell, though; you can't calculate based on another cell without using PROC REPORT.