I have the following two tables. One has a couple of bet results and the other has a number of 'dummy' bets that need to be added. I want to get the mean of the original sample and the mean of the sample with the dummy bets added, and then perform both a Chi-Squared test of the difference between the columns and a Kruskal-Wallis test on the difference between the rows.
I'm having an issue with tabulating the data to produce the mean for both categories.
data A;
input username $ betdate : datetime. stake winnings node $;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90 X
player1 04NOV2008:09:03:44 100 40 L
player2 07NOV2008:14:03:33 120 -120 W
player1 05NOV2008:09:00:00 50 15 L
player1 05NOV2008:09:05:00 30 5 W
player1 05NOV2008:09:00:05 20 10 L
player2 09NOV2008:10:05:10 10 -10 W
player2 15NOV2008:15:05:33 35 -35 W
player1 15NOV2008:15:05:33 35 15 L
player1 15NOV2008:15:05:33 35 15 L
run;
proc sql; create table B(toAdd num,node char(100)); quit;
proc sql; insert into B (toAdd, node)
values(5, 'X')
values(3, 'L')
values(7, 'W') ;
quit;
I want to show the mean without dummy bets and the mean with the dummy bets included. I added the dummy bets as follows:
proc sort data=A out=A; by node; run;
data A;
modify A B;
by node;
do i = 1 to toAdd;
stake = 0;
stakediff = -1;
dummy = 1;
output;
end;
run;
The problem is that when I tabulate the data, because there aren't really two distinct categories, it doesn't show me what I want.
proc tabulate data=A;
class node dummy;
var stake winnings;
table node="",stake="" * (Mean="")*(dummy="" ALL);
run;
I'm using the dummy bets to create a mean that's based on a larger 'N'. I would just do this in PROC REPORT and calculate the mean manually with the larger 'N' as the denominator, but I need to perform a Kruskal-Wallis and a Chi-Squared test. It's easier to have the dummy bets with a stake of zero to keep things simple and maintain the correct counts in each category. Moreover, it's non-trivial to calculate the standard error on the fly (or back it out of the result created by PROC TABULATE) without having the dummy bets in each category.
How can I show the result of PROC TABULATE above without the 0, 1 and ALL categories, given that the entries where dummy is 1 are meaningless on their own? Ideally, I'd like to see 'WITHOUT DUMMIES' for 0 and 'WITH DUMMIES' for 1, and display the result of the ALL column as the 'WITH DUMMIES' = 1 category. I can then proceed to perform the Kruskal-Wallis test on the 'NODE' class variable and the Chi-Squared test on the dummy class variable; as it stands, I can't perform these tests with only the 0 category and the 1 category as classes.
If I could copy all the rows that are in category dummy = 0 into the category dummy = 1 it would solve the problem, I think.
Your 'if I could' is the right idea, largely. You need to fix your data to reflect the groupings you want; dummy=0 should be only nondummy bets, dummy=1 should be dummy AND nondummy bets, if I understand correctly. So you need to output the dummy=0 rows twice, once with dummy=1 and once with dummy=0.
Something like:
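/* MERGE is used here rather than MODIFY: MODIFY cannot add the new variable
dummy, and it matches only the first master row per BY value.
Assumes A is already sorted by node. */
proc sort data=B; by node; run;
data A;
merge A B;
by node;
/* Real bet, counted in the "without dummies" group. */
dummy = 0;
output;
/* The same real bet again, counted in the "with dummies" group. */
dummy = 1;
output;
/* After the last real bet of each node, append that node's dummy bets. */
if last.node then do i = 1 to toAdd;
stake = 0;
stakediff = -1;
output;
end;
drop i toAdd;
run;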
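Once the data has that shape, the labelling and the two tests from the question could be sketched roughly as below. This assumes A now carries dummy = 0 for real bets only and dummy = 1 for real plus dummy bets; the format name dummyfmt is made up for illustration. Whether the tests are statistically appropriate on rows that have been duplicated is a separate question that this sketch does not address.

```sas
/* Hypothetical format name; maps the 0/1 flag to readable column labels. */
proc format;
  value dummyfmt
    0 = 'WITHOUT DUMMIES'
    1 = 'WITH DUMMIES';
run;

/* Mean stake by node, one column per dummy group; no ALL column is needed
   because dummy = 1 already contains every bet. */
proc tabulate data=A;
  class node dummy;
  var stake;
  table node='', stake=''*mean=''*dummy='';
  format dummy dummyfmt.;
run;

/* The WILCOXON option reports the Kruskal-Wallis test when the class
   variable has more than two levels. */
proc npar1way data=A wilcoxon;
  class node;
  var stake;
run;

/* Chi-squared test on the node by dummy counts. */
proc freq data=A;
  tables node*dummy / chisq;
  format dummy dummyfmt.;
run;
```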
During some data cleaning process, there is a need to compare the data between different rows. For example, if the rows have the same countryID and subjectID then keep the largest temperature:
CountryID SubjectID Temperature
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
In a case like this, I would use the lag() function as follows.
proc sort data=table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
CountryID_lag = lag(CountryID);
SubjectID_lag = lag(SubjectID);
Temperature_lag = lag(Temperature);
if CountryID = CountryID_lag and SubjectID = SubjectID_lag then do;
if Temperature < Temperature_lag then delete;
end;
drop CountryID_lag SubjectID_lag Temperature_lag;
run;
The code above may work.
But I would still like to know whether there is a better way to solve this kind of problem.
I think you are overcomplicating the task. You can use PROC SQL and the max function:
proc sql noprint;
create table table_laged as
select CountryID,SubjectID,max(Temperature) as Temperature
from table
group by CountryID,SubjectID;
quit;
I don't know if you want it that way, but your code keeps every row that ties for the highest temperature.
So when you have 2 1 3 for one subject it will keep 3, but when you have 1 4 3 4 4 it will keep 4 4 4. It is simpler to keep just the first row for each subject, which is the highest because of the descending sort order.
proc sort data = table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
by CountryID SubjectID;
if first.SubjectID;
run;
You can use the double DOW technique to:
1. Compute a measure over a group,
2. Apply the measure to items in the group.
The benefit of DOW looping is a single pass over the data set when incoming data is already grouped.
In this question, 1. is to identify the row in the group with the first highest temperature, and 2. is to select the row for output.
data want;
do _n_ = 1 by 1 until (last.SubjectId);
set have;
by CountryId SubjectId;
if temperature > _max_temp then do;
_max_temp = temperature;
_max_at_n = _n_;
end;
end;
do _n_ = 1 to _n_;
set have;
if _n_ = _max_at_n then OUTPUT;
end;
drop _:;
run;
The traditional procedural technique is PROC MEANS:
data have;
input CountryID SubjectID Temperature;
datalines;
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
run;
proc means noprint data=have;
by countryid subjectid;
output out=want(drop=_:) max(temperature)=temperature;
run;
If the data is not sorted by CountryID and SubjectID going into the data step, a hash object can be used, or SQL per @Aurieli.
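A minimal sketch of the hash-object route, assuming the input is the have data set above and that one row per CountryID/SubjectID is wanted. The names h, _t, rc, eof and want_hash are made up for illustration.

```sas
data _null_;
  /* Put the variables of HAVE into the PDV without reading any rows. */
  if 0 then set have;

  /* Hash keyed on the group; the data portion holds the running maximum. */
  declare hash h(ordered: 'a');
  h.defineKey('CountryID', 'SubjectID');
  h.defineData('CountryID', 'SubjectID', 'Temperature');
  h.defineDone();

  do until (eof);
    /* Read the incoming temperature under a different name so that
       find() can load the stored maximum into Temperature. */
    set have(rename=(Temperature=_t)) end=eof;
    rc = h.find();
    /* New group, or a higher temperature than the stored one. */
    if rc ne 0 or _t > Temperature then do;
      Temperature = _t;
      h.replace();
    end;
  end;

  /* Write one row per group, ordered by the keys. */
  h.output(dataset: 'want_hash');
  stop;
run;
```

No prior sort is needed, at the cost of holding one entry per group in memory.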
Consider the following example SAS dataset.
Price Num_items
100 10
120 15
130 20
140 25
150 30
I want to group them into 4 categories by defining a new variable called cat such that the new dataset looks as follows:
Price Num_items Cat
100 10 1
120 15 1
130 20 2
140 25 3
150 30 4
I also want to group them so that the groups have about an equal number of items (for example, in the grouping above, Group 1 has 25, Group 2 has 20, Group 3 has 25 and Group 4 has 30 items). Note that the price column is sorted in ascending order (that is required).
I am struggling to start with SAS for above. So any help would be appreciated. I am not looking for a complete solution but pointers towards preparing a solution would help.
Cool problem, subtly complex. I agree with @J_Lard that a data step with some retained variables would likely be the quickest way to accomplish this. If I understand your problem correctly, I think the code below would give you some ideas as to how you want to solve it. Note that depending on num_items and group_target, your mileage will vary.
Generate similar, but larger data set.
data have;
do price=50 to 250 by 10;
/*Seed is `_N_` so we'll see the same random item count.*/
num_items = ceil(ranuni(_N_)*10)*5;
output;
end;
run;
Categorize.
/*Desired group size specification.*/
%let group_target = 50;
data want;
set have;
/*On the first record, initialize `cat` to 1 and `cat_num_items` to `num_items`; sum statements retain implicitly.*/
if _N_=1 then do;
cat + 1;
cat_num_items + num_items;
end;
else do;
/*If the item count for a new price puts the category count above the target, apply logic.*/
if cat_num_items + num_items > &group_target. then do;
/*If placing the item into a new category puts the current cat count closer to the `group_target` than would keeping it, then put into new category.*/
if abs(&group_target. - cat_num_items) < abs(&group_target. - (cat_num_items+num_items)) then do;
cat+1;
cat_num_items = num_items;
end;
/*Otherwise keep it in the current category and increment category count.*/
else cat_num_items + num_items;
end;
/*Otherwise keep the item count in the current category and increment category count.*/
else cat_num_items + num_items;
end;
drop cat_num_items;
run;
Check.
proc sql;
create table check_want as
select cat,
sum(num_items) as cat_count
from want
group by cat;
quit;
I can't seem to include a computed variable in a PROC REPORT. It works fine when the computed variable is a headline column, but when it forms part of an ACROSS group, I can't get it to work. I've only got as far as referencing the columns directly, which only gives me the result for a single ACROSS group, not both.
data have1;
input username $ betdate : datetime. stake winnings winner;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90 0
player1 04NOV2008:09:03:44 100 40 1
player2 07NOV2008:14:03:33 120 -120 0
player1 05NOV2008:09:00:00 50 15 1
player1 05NOV2008:09:05:00 30 5 1
player1 05NOV2008:09:00:05 20 10 1
player2 09NOV2008:10:05:10 10 -10 0
player2 09NOV2008:10:05:40 15 -15 0
player2 09NOV2008:10:05:45 15 -15 0
player2 09NOV2008:10:05:45 15 45 1
player2 15NOV2008:15:05:33 35 -35 0
player1 15NOV2008:15:05:33 35 15 1
player1 15NOV2008:15:05:33 35 15 1
run;
PROC PRINT; RUN;
Proc rank data=have1 ties=mean out=ranksout1 groups=2;
var stake winner;
ranks stakeRank winnerRank;
run;
PROC REPORT DATA=ranksout1 NOWINDOWS out=report;
COLUMN stakerank winnerrank, (N stake=stakemean discountedstake);
DEFINE stakerank / GROUP '' ORDER=INTERNAL;
DEFINE winnerrank / ACROSS '' ORDER=INTERNAL;
DEFINE stake / analysis sum noprint;
DEFINE stakemean / analysis sum;
DEFINE discountedstake / computed format=8.2 'discountedstake';
COMPUTE discountedstake;
_C4_ = _C3_ -1;
ENDCOMP;
RUN;
I don't understand how a variable connected to an across group can be calculated. This only calculates the value of 'discountedstake' for column 'C4' and it doesn't make sense to do it again for column 7.
How can I include the value of that computed variable in each group?
PROC REPORT DATA=ranksout1 NOWINDOWS out=report;
COLUMN stakerank winnerrank, (N stake=stakemean discountedstake);
DEFINE stakerank / GROUP '' ORDER=INTERNAL;
DEFINE winnerrank / ACROSS '' ORDER=INTERNAL;
DEFINE stake / analysis sum noprint;
DEFINE stakemean / analysis sum;
DEFINE discountedstake / computed format=8.2 'discountedstake';
COMPUTE discountedstake;
_C4_ = _C3_ -1;
_C7_ = _C6_ -1;
ENDCOMP;
RUN;
You just need to mention each column you want calculated. You might be able to do this with an array if you have many of them, or do it in a data step/view ahead of time.
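As a rough sketch of the "ahead of time" route, the cell values could be pre-computed with PROC SQL so that PROC REPORT only has to lay them out. The table name cellstats and the column name nBets are made up for illustration, and this assumes discountedstake is meant to be the cell's summed stake minus 1, as in the compute block above.

```sas
/* One row per stakeRank/winnerRank cell, with the derived value
   already calculated. */
proc sql;
  create table cellstats as
  select stakeRank,
         winnerRank,
         count(*)   as nBets,
         sum(stake) as stakemean,
         calculated stakemean - 1 as discountedstake
  from ranksout1
  group by stakeRank, winnerRank;
quit;

/* Each cell has exactly one input row, so SUM simply passes the
   pre-computed values through under every ACROSS level. */
proc report data=cellstats nowindows;
  column stakeRank winnerRank, (nBets stakemean discountedstake);
  define stakeRank       / group '' order=internal;
  define winnerRank      / across '' order=internal;
  define nBets           / analysis sum 'N';
  define stakemean       / analysis sum;
  define discountedstake / analysis sum format=8.2 'discountedstake';
run;
```

This avoids referencing _C4_, _C7_ and so on by position, so it does not need to change when the number of ACROSS levels changes.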
I have two cross-tabs being output in SAS: one for Time0 and one for Time1. I am interested in comparing the values in each of the cells in the first crosstab with those in the second.
Is there a clever way to change the background colour of a cell based on a comparison with an equivalent cell in another cross-tab? If not, and I create a variable with the change in the variable between Time0 and Time1, how can I change the cell colour of the crosstab depending on whether a value is positive or negative? Is it possible to put a colour gradient in increments of 5% if the cell contains a percentage change?
I have some sample data as follows:
data have;
input username $ betdate : datetime. stake;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90
player1 04NOV2008:09:03:44 30
player2 07NOV2008:14:03:33 120
player1 05NOV2008:09:00:00 50
player1 05NOV2008:09:05:00 30
player1 05NOV2008:09:00:05 20
player2 09NOV2008:10:05:10 10
player2 15NOV2008:15:05:33 35
player1 15NOV2008:15:05:33 35
player1 15NOV2008:15:05:33 35
run;
proc sort data=have; by username betdate; run;
data have;
set have;
by username dateOnly betdate;
retain eventTime;
if first.username then eventTime = 0;
if first.betdate then eventTime + 1;
run;
proc sql;
create table playerStats as
select
distinct username,
(select distinct avg(stake) from have where username = main.username and eventTime <= 1) format comma10.2 as bet1AvgStake,
(select distinct avg(stake) from have where username = main.username and eventTime <= 2) format comma10.2 as bet2AvgStake,
(select distinct avg(stake) from have where username = main.username and eventTime <= 3) format comma10.2 as bet3AvgStake
from have main;
quit;
Proc rank data=playerStats ties=mean out=customerStats groups=2;
var bet1AvgStake bet2AvgStake;
ranks bet1AvgStakeRank bet2AvgStakeRank;
run;
PROC TABULATE DATA=customerStats NOSEPS;
VAR bet1AvgStake bet2AvgStake;
class bet1AvgStakeRank;
TABLE bet1AvgStakeRank, bet1AvgStake*(N Mean);
TABLE bet1AvgStakeRank, bet2AvgStake*(N Mean);
RUN;
I would like to see a red cell when the value in each cell in the second crosstab is lower than the equivalent cell in the first and a green cell when the value is higher.
Thanks for any help on this.
I don't think you can do all that in a single proc, but you certainly can do part 2 if I understand properly. It's called "Traffic Lighting" more generally, to help with googling for more detailed information; for example, this paper has some examples of how to do so.
Generally, the concept is that you create a format, the label of which is a color:
proc format;
value betfmt
low - -5 = 'red'
-5 <-< 0 = 'lightred'
0 - 5 = 'lightgreen'
5 <- high = 'green'; *or hex values like 'cxFF0099';
run;
Then use that format in the proc tabulate:
proc tabulate data=yourdata;
var bets;
tables bets*{style={background=betfmt.}};
run;
It does need to be based on the current cell, though; you can't calculate based on another cell without using PROC REPORT.
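For the cross-cell comparison, a PROC REPORT sketch might look like the following. It works at the row level of the customerStats data set from the question rather than on a true crosstab, and the choice of red and green simply follows the question's wording; treat it as a starting point under those assumptions.

```sas
proc report data=customerStats nowindows;
  column username bet1AvgStake bet2AvgStake;
  define username     / display;
  define bet1AvgStake / display;
  define bet2AvgStake / display;

  /* A compute block can see columns to its left, so the colour of
     bet2AvgStake can depend on bet1AvgStake in the same row. */
  compute bet2AvgStake;
    if not missing(bet1AvgStake) and not missing(bet2AvgStake) then do;
      if bet2AvgStake < bet1AvgStake then
        call define(_col_, 'style', 'style=[background=red]');
      else if bet2AvgStake > bet1AvgStake then
        call define(_col_, 'style', 'style=[background=green]');
    end;
  endcomp;
run;
```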
I have some sample data as follows, and want to calculate the number of winning or losing bets in a row.
data have;
input username $ betdate : datetime. stake winnings;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90
player1 04NOV2008:09:03:44 100 40
player2 07NOV2008:14:03:33 120 -120
player1 05NOV2008:09:00:00 50 15
player1 05NOV2008:09:05:00 30 5
player1 05NOV2008:09:00:05 20 10
player2 09NOV2008:10:05:10 10 -10
player2 15NOV2008:15:05:33 35 -35
player1 15NOV2008:15:05:33 35 15
player1 15NOV2008:15:05:33 35 15
run;
PROC PRINT; RUN;
proc sort data=have;
by username betdate;
run;
DM "log; clear;";
data want;
set have;
by username dateOnly betdate;
retain calendarTime eventTime cumulativeDailyProfit profitableFlag;
if first.username then calendarTime = 0;
if first.dateOnly then calendarTime + 1;
if first.username then eventTime = 0;
if first.betdate then eventTime + 1;
if first.username then cumulativeDailyProfit = 0;
if first.dateOnly then cumulativeDailyProfit = 0;
if first.betdate then cumulativeDailyProfit + stake;
if winnings > 0 then winner = 1;
if winnings <= 0 then winner = 0;
PROC PRINT; RUN;
For example, the first four bets for player1 are winners, so the first four rows in this column should show 1, 2, 3, 4 (at this point, four wins in a row). The fifth is a loser, so it should show -1, followed by 1, 2. The following three rows (for player2) should show -1, -2, -3, as the customer has had three losing bets in a row. How can I calculate the value of this column in the data step? How can I also have a column for the largest number of winning bets in a row (to date) and the maximum number of losing bets in a row the customer has had to date in each row?
Thanks for any help.
To do a running total like this, you can use BY with NOTSORTED and still leverage the first.<var> functionality. For example:
data have;
input winlose $;
datalines;
win
win
win
win
lose
lose
win
lose
win
win
lose
;;;;
run;
data want;
set have;
by winlose notsorted;
if first.winlose and winlose='win' then counter=1;
else if first.winlose then counter=-1;
else if winlose='win' then counter+1;
else counter+(-1);
run;
Each time 'win' changes to 'lose' or the reverse, it resets the first.winlose variable to 1.
Once you have done this, you can either use a double DoW loop to append maximums, or perhaps more easily just get this value in a dataset and then add it on via a second datastep (or proc sql) to append your desired variables.
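If the maximums are wanted "to date" on every row, a second data step over the want data set above is one way to sketch it. The names want2, maxWinStreak and maxLoseStreak are made up for illustration, and the example has no player variable, so there is no per-player reset here.

```sas
data want2;
  set want;
  /* Retained across rows and started at 0, so each row carries the
     longest streaks seen so far. */
  retain maxWinStreak maxLoseStreak 0;
  /* counter is positive during a winning run and negative during a
     losing run, so negating it gives the losing run length. */
  maxWinStreak  = max(maxWinStreak, counter);
  maxLoseStreak = max(maxLoseStreak, -counter);
run;
```

With real data, adding a BY statement on the player variable and resetting both maximums to 0 on first.username would keep the streaks separate per player.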