Stata: Concatenate string variable on by condition

I'm creating a variable called result in the following sample data which shows a 'W' if a bet was a win and 'L' if it was a loss.
How can I concatenate this variable with itself on a row by row basis in strict order by timestamp for each username?
clear
input str16 username str40 betdate winnings
player1 "12NOV2008:19:04:01" -10
player1 "12NOV2008:12:03:44" 50
player2 "07NOV2008:14:03:33" -50
player2 "05NOV2008:09:00:00" -100
end
generate double timestamp=clock(betdate,"DMY hms")
format timestamp %tc
cap drop result
generate result = "L"
replace result = "W" if (winnings >0)
cap drop resulthistory
generate resulthistory = ""
replace resulthistory = concat(resulthisory + result), by(USERNAME timestamp)

Readers should note that the last line of the question is fantasy syntax; the rest would work.
This may be what you seek. Note that because you read in the data afresh, the variables you capture drop cannot exist yet, so those lines are unnecessary.
clear
input str16 username str40 betdate winnings
player1 "12NOV2008:19:04:01" -10
player1 "12NOV2008:12:03:44" 50
player2 "07NOV2008:14:03:33" -50
player2 "05NOV2008:09:00:00" -100
end
gen double timestamp=clock(betdate,"DMY hms")
format timestamp %tc
gen result = cond(winnings > 0, "W", "L")
bysort username (timestamp): gen resulthistory = result[1]
by username : replace resulthistory = resulthistory[_n-1] + result if _n > 1
by username : replace resulthistory = resulthistory[_N]
list
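For readers who want to check the logic outside Stata, here is a minimal sketch of the same running concatenation in plain Python. It is not Stata code; the function name result_histories and the ISO-style timestamp strings (chosen so that string order equals time order) are assumptions made for illustration. It keeps the running history on each row, which corresponds to the state after the second-to-last Stata command above, before every row is overwritten with the final history.

```python
from itertools import groupby
from operator import itemgetter

# Sample bets as (username, timestamp, winnings); timestamps are ISO strings
# (an assumption for this sketch) so that sorting strings sorts by time.
sample = [
    ("player1", "2008-11-12T19:04:01", -10),
    ("player1", "2008-11-12T12:03:44", 50),
    ("player2", "2008-11-07T14:03:33", -50),
    ("player2", "2008-11-05T09:00:00", -100),
]

def result_histories(bets):
    """Return (username, timestamp, history so far) in username/time order."""
    ordered = sorted(bets, key=itemgetter(0, 1))
    out = []
    for user, rows in groupby(ordered, key=itemgetter(0)):
        history = ""  # reset at the start of each username
        for _, timestamp, winnings in rows:
            history += "W" if winnings > 0 else "L"
            out.append((user, timestamp, history))
    return out

for row in result_histories(sample):
    print(row)
```

The last history within each username is the value that the final Stata command copies to every row of that group.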

Related

Stata: First occurrences, sum of unique occurrences with a by variable

The following sample data has variables describing bets by a number of players.
How can I calculate each player's first bettype, first betprice, the number of soccer bets, the number of baseball bets, the number of unique prices per customer and the number of unique bet types per username?
clear
input str16 username str40 betdate stake str16 bettype betprice str16 sport
player1 "12NOV2008 12:04:33" 90 SGL 5 SOCCER
player1 "04NOV2008:09:03:44" 30 SGL 4 SOCCER
player2 "07NOV2008:14:03:33" 120 SGL 5 SOCCER
player1 "05NOV2008:09:00:00" 50 SGL 4 SOCCER
player1 "05NOV2008:09:05:00" 30 DBL 3 BASEBALL
player1 "05NOV2008:09:00:05" 20 DBL 4 BASEBALL
player2 "09NOV2008:10:05:10" 10 DBL 5 BASEBALL
player2 "15NOV2008:15:05:33" 35 DBL 5 BASEBALL
player1 "15NOV2008:15:05:33" 35 TBL 5 BASEBALL
player1 "15NOV2008:15:05:33" 35 SGL 4 BASEBALL
end
generate double timestamp=clock(betdate,"DMY hms")
format timestamp %tc
generate dateonly=dofc(timestamp)
format dateonly %td
generate firsttype
generate firstprice
generate soccercount
generate baseballcount
generate uniquebettypecount
generate uniquebetpricecount
This is a bit close to the margin, as a "please give me the code" question, with no attempt at your own solutions.
The first type and price are
bysort username (timestamp) : gen firsttype = bettype[1]
bysort username (timestamp) : gen firstprice = betprice[1]
The number of soccer and baseball bets is
egen soccercount = total(sport == "SOCCER"), by(username)
egen baseballcount = total(sport == "BASEBALL"), by(username)
The number of distinct [not unique!] bet types is
bysort username bettype : gen work = _n == 1
egen uniquebettypecount = total(work), by(username)
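The number of distinct bet prices is found in just the same way, using betprice in place of bettype (but replace work, or drop it first). Another way to count the distinct bet types is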
egen work = tag(username bettype)
egen uniquebettypecount = total(work), by(username)
What is characteristic of all these variables is that the same value is repeated for all observations within each group. For example, firsttype has the same value for each occurrence of each distinct username. Often you will want to use each value just once. A key to that is the egen function tag() just used, for example
egen usertag = tag(username)
followed by uses of if usertag when needed. (if usertag is a useful idiom for if usertag == 1.)
Some reading suggestions:
On by: http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
On egen: http://www.stata.com/help.cgi?egen
On distinct observations (and why the word "unique" is misleading): http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
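
As a cross-check of the logic, here is a minimal sketch of the same per-user summaries in plain Python rather than Stata. The function name per_user_summary, the tuple layout and the ISO-style timestamp strings are assumptions made for this illustration only.

```python
def per_user_summary(bets):
    """bets: iterable of (username, timestamp, bettype, betprice, sport)."""
    summary = {}
    # Sort by user, then time, so the first row seen per user is the earliest bet.
    for user, _, bettype, betprice, sport in sorted(bets, key=lambda b: (b[0], b[1])):
        s = summary.setdefault(user, {
            "firsttype": bettype, "firstprice": betprice,
            "soccercount": 0, "baseballcount": 0,
            "types": set(), "prices": set(),
        })
        s["soccercount"] += sport == "SOCCER"      # True counts as 1
        s["baseballcount"] += sport == "BASEBALL"
        s["types"].add(bettype)                    # sets keep distinct values only
        s["prices"].add(betprice)
    return {
        user: {
            "firsttype": s["firsttype"],
            "firstprice": s["firstprice"],
            "soccercount": s["soccercount"],
            "baseballcount": s["baseballcount"],
            "distincttypes": len(s["types"]),
            "distinctprices": len(s["prices"]),
        }
        for user, s in summary.items()
    }

# A small illustrative subset of the bets, deliberately out of time order.
demo = [
    ("player1", "2008-11-05T09:05:00", "DBL", 3, "BASEBALL"),
    ("player1", "2008-11-04T09:03:44", "SGL", 4, "SOCCER"),
    ("player2", "2008-11-09T10:05:10", "DBL", 5, "BASEBALL"),
    ("player2", "2008-11-07T14:03:33", "SGL", 5, "SOCCER"),
]
print(per_user_summary(demo))
```

This returns one record per username, which is the same information you get in Stata by keeping only the observations flagged by usertag.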

Class variable in PROC TABULATE. Need alternative to ALL command

I have the following two tables. One has a couple of bet results and the other has a number of 'dummy' bets that need to be added. I want to get the mean of the original sample and the mean of the sample with the dummy bets added, and then perform both a Chi-Squared test of the difference between the columns and a Kruskal-Wallis test of the difference between the rows.
I'm having an issue with tabulating the data to produce the mean for both categories.
data A;
input username $ betdate : datetime. stake winnings node $;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90 X
player1 04NOV2008:09:03:44 100 40 L
player2 07NOV2008:14:03:33 120 -120 W
player1 05NOV2008:09:00:00 50 15 L
player1 05NOV2008:09:05:00 30 5 W
player1 05NOV2008:09:00:05 20 10 L
player2 09NOV2008:10:05:10 10 -10 W
player2 15NOV2008:15:05:33 35 -35 W
player1 15NOV2008:15:05:33 35 15 L
player1 15NOV2008:15:05:33 35 15 L
run;
proc sql; create table B(toAdd num,node char(100)); quit;
proc sql; insert into B (toAdd, node)
values(5, 'X')
values(3, 'L')
values(7, 'W') ;
quit;
I want to show the mean without dummy bets and the mean with the dummy bets included. I'm adding the dummy bets as follows:
proc sort data=A out=A; by node; run;
data A;
modify A B;
by node;
do i = 1 to toAdd;
stake = 0;
stakediff = -1;
dummy = 1;
output;
end;
run;
The problem is that when I tabulate the data, because there aren't really two distinct categories, it doesn't show me what I want.
proc tabulate data=A;
class node dummy;
var stake winnings;
table node="",stake="" * (Mean="")*(dummy="" ALL);
run;
I'm using the dummy bets to create a mean that's based on a larger 'N'. I would just do this in PROC REPORT and calculate the mean manually with the larger 'N' as the denominator, but I need to perform a Kruskal-Wallis and a Chi-Squared test. It's easier to have the dummy bets with a stake of zero to keep things simple and maintain the correct counts in each category. Moreover, it's non-trivial to calculate the standard error on the fly (or back it out of the result created by PROC TABULATE) without having the dummy bets in each category.
How can I show the result of PROC TABULATE above, but without the 0, 1 and ALL categories, since the entries where dummy is 1 are meaningless on their own? Ideally, I'd like to see 'WITHOUT DUMMIES' as 0 and 'WITH DUMMIES' as 1, and display the result of the ALL column as the 'WITH DUMMIES' = 1 category. I can then proceed to perform the Kruskal-Wallis test on the 'NODE' class variable and the Chi-Squared test on the dummy class variable; as it stands, I can't perform these tests with only the 0 category and the 1 category as classes.
If I could copy all the rows that are in category dummy = 0 into the category dummy = 1 it would solve the problem, I think.
Your 'if I could' is the right idea, largely. You need to fix your data to reflect the groupings you want; dummy=0 should be only nondummy bets, dummy=1 should be dummy AND nondummy bets, if I understand correctly. So you need to output the dummy=0 rows twice, once with dummy=1 and once with dummy=0.
Something like:
data A;
modify A B;
by node;
output;
dummy=1;
output;
do i = 1 to toAdd;
stake = 0;
stakediff = -1;
dummy = 1;
output;
end;
run;

SAS PROC TABULATE: Colour based on cell value

I have two cross-tabs being output in SAS: one for Time0 and one for Time1. I am interested in comparing the values in each of the cells in the first crosstab with those in the second.
Is there a clever way to change the background colour of a cell based on a comparison with an equivalent cell in another cross-tab? If not, and I create a variable with the change in the variable between Time0 and Time1, how can I change the cell colour of the crosstab depending on whether a value is positive or negative? Is it possible to put a colour gradient in increments of 5% if the cell contains a percentage change?
I have some sample data as follows:
data have;
input username $ betdate : datetime. stake;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90
player1 04NOV2008:09:03:44 30
player2 07NOV2008:14:03:33 120
player1 05NOV2008:09:00:00 50
player1 05NOV2008:09:05:00 30
player1 05NOV2008:09:00:05 20
player2 09NOV2008:10:05:10 10
player2 15NOV2008:15:05:33 35
player1 15NOV2008:15:05:33 35
player1 15NOV2008:15:05:33 35
run;
proc sort data=have; by username betdate; run;
data have;
set have;
by username dateOnly betdate;
retain eventTime;
if first.username then eventTime = 0;
if first.betdate then eventTime + 1;
run;
proc sql;
create table playerStats as
select
distinct username,
(select distinct avg(stake) from have where username = main.username and eventTime <= 1) format comma10.2 as bet1AvgStake,
(select distinct avg(stake) from have where username = main.username and eventTime <= 2) format comma10.2 as bet2AvgStake,
(select distinct avg(stake) from have where username = main.username and eventTime <= 3) format comma10.2 as bet3AvgStake
from have main;
quit;
Proc rank data=playerStats ties=mean out=customerStats groups=2;
var bet1AvgStake bet2AvgStake;
ranks bet1AvgStakeRank bet2AvgStakeRank;
run;
PROC TABULATE DATA=customerStats NOSEPS;
VAR bet1AvgStake bet2AvgStake;
class bet1AvgStakeRank;
TABLE bet1AvgStakeRank, bet1AvgStake*(N Mean);
TABLE bet1AvgStakeRank, bet2AvgStake*(N Mean);
RUN;
I would like to see a red cell when the value in each cell in the second crosstab is lower than the equivalent cell in the first and a green cell when the value is higher.
Thanks for any help on this.
I don't think you can do all that in a single proc, but you certainly can do part 2 if I understand properly. It's called "Traffic Lighting" more generally, to help with googling for more detailed information; for example, this paper has some examples of how to do so.
Generally, the concept is that you create a format, the label of which is a color:
proc format;
value betfmt
low - -5 = 'red'
-5 <-< 0 = 'lightred'
0 - 5 = 'lightgreen'
5 <- high = 'green'; *or hex values like 'cxFF0099';
run;
Then use that format in the proc tabulate:
proc tabulate data=yourdata;
var bets;
* attach the style to the cells in the table dimension, not after the slash;
tables bets*[style=[background=betfmt.]];
run;
It does need to be based on the current cell, though; you can't calculate based on another cell without using PROC REPORT.
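To make the binning explicit, here is a minimal sketch in plain Python (not SAS) of the value-to-colour mapping such a format performs. The function name traffic_light is hypothetical, and the bin edges are an assumption about the intended ranges: -5 and below is red, between -5 and 0 is light red, 0 to 5 inclusive is light green, and above 5 is green.

```python
def traffic_light(change):
    """Map a percentage change to a background colour name.

    Bin edges are assumptions for this sketch, mirroring the betfmt idea:
    (-inf, -5] red, (-5, 0) lightred, [0, 5] lightgreen, (5, inf) green.
    """
    if change <= -5:
        return "red"
    if change < 0:
        return "lightred"
    if change <= 5:
        return "lightgreen"
    return "green"

for value in (-7.5, -2, 0, 5, 12):
    print(value, traffic_light(value))
```

A finer gradient in 5% steps is the same idea with more bins, each mapped to its own colour value.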

SAS running total

I have some sample data as follows, and want to calculate the number of winning or losing bets in a row.
data have;
input username $ betdate : datetime. stake winnings;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90
player1 04NOV2008:09:03:44 100 40
player2 07NOV2008:14:03:33 120 -120
player1 05NOV2008:09:00:00 50 15
player1 05NOV2008:09:05:00 30 5
player1 05NOV2008:09:00:05 20 10
player2 09NOV2008:10:05:10 10 -10
player2 15NOV2008:15:05:33 35 -35
player1 15NOV2008:15:05:33 35 15
player1 15NOV2008:15:05:33 35 15
run;
PROC PRINT; RUN;
proc sort data=have;
by username betdate;
run;
DM "log; clear;";
data want;
set have;
by username dateOnly betdate;
retain calendarTime eventTime cumulativeDailyProfit profitableFlag;
if first.username then calendarTime = 0;
if first.dateOnly then calendarTime + 1;
if first.username then eventTime = 0;
if first.betdate then eventTime + 1;
if first.username then cumulativeDailyProfit = 0;
if first.dateOnly then cumulativeDailyProfit = 0;
if first.betdate then cumulativeDailyProfit + stake;
if winnings > 0 then winner = 1;
if winnings <= 0 then winner = 0;
PROC PRINT; RUN;
For example, the first four bets for player1 are winners, so the first four rows in this column should show 1,2,3,4 (at this point, four wins in a row). The fifth is a loser, so it should show -1, followed by 1,2. The following three rows (for player2) should show -1, -2, -3, as the customer has had three losing bets in a row. How can I calculate the value of this column in the data step? How can I also have a column for the largest number of winning bets in a row (to date) and the maximum number of losing bets in a row the customer has had to date in each row?
Thanks for any help.
To do a running total like this, you can use BY with NOTSORTED and still leverage the first.<var> functionality. For example:
data have;
input winlose $;
datalines;
win
win
win
win
lose
lose
win
lose
win
win
lose
;;;;
run;
data want;
set have;
by winlose notsorted;
if first.winlose and winlose='win' then counter=1;
else if first.winlose then counter=-1;
else if winlose='win' then counter+1;
else counter+(-1);
run;
Each time 'win' changes to 'lose' or the reverse, it resets the first.winlose variable to 1.
Once you have done this, you can either use a double DoW loop to append maximums, or perhaps more easily just get this value in a dataset and then add it on via a second datastep (or proc sql) to append your desired variables.
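As an illustration of the second step, here is a minimal sketch in plain Python (not SAS) that computes the signed streak counter together with the longest winning and losing streaks to date in a single pass. The function name streaks and the tuple layout are assumptions for this sketch, and it handles one player's bets, already in time order; with several players the three values would be reset at the start of each username.

```python
def streaks(outcomes):
    """outcomes: iterable of booleans (True = win), in time order for one player.

    Returns a list of (counter, max_wins_to_date, max_losses_to_date).
    """
    rows = []
    counter = 0               # positive while winning, negative while losing
    max_wins = max_losses = 0
    for won in outcomes:
        if won:
            # continue a winning run, or start a new one at 1
            counter = counter + 1 if counter > 0 else 1
            max_wins = max(max_wins, counter)
        else:
            # continue a losing run, or start a new one at -1
            counter = counter - 1 if counter < 0 else -1
            max_losses = max(max_losses, -counter)
        rows.append((counter, max_wins, max_losses))
    return rows

# Same win/lose sequence as the example data above.
sequence = ["win", "win", "win", "win", "lose", "lose",
            "win", "lose", "win", "win", "lose"]
for row in streaks(s == "win" for s in sequence):
    print(row)
```

In a SAS data step the same effect comes from retaining two extra variables and updating them with max() after the counter is set.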

SAS Data Step: Concatenate string to variable on-the-fly

I have the following sample data:
data have;
input username $ betdate : datetime. winnings;
retain username dateonly bedate result;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 -10
player1 12NOV2008:19:03:44 50
player2 07NOV2008:14:03:33 -50
player2 05NOV2008:09:00:00 -100
run;
PROC PRINT; RUN;
proc sort data=have;
by username betdate;
run;
data want;
set have;
by username dateOnly betdate;
retain username dateonly bedate winnings winner resulthistory;
if winnings > 0 then winner = 'W';
if winnings <= 0 then winner = 'L';
if first.winlose then resulthistory=winner;
else if first.betdate then resulthistory=resulthistory||winner;
PROC PRINT; RUN;
I want a cumulative result history in the last column. With the sample data above, this will be 'LW' for player1 and 'LL' for player2. I've declared the resulthistory variable in the second data step, but can't seem to concatenate the new result onto the resulthistory variable if it's the same username. Is the problem that I'm working with a string variable or that I'm trying to reference something from a previous row?
Thanks for any help on this.
A few issues. First, the concatenation (resulthistory=resulthistory||winner) keeps the trailing blanks that pad resulthistory to its full length, so winner is appended past the end of the variable and is truncated.
There was also a non-existent variable (winlose), a typo (bedate), and an unnecessary retain statement in first data step. See updated code below:
data have;
input username $ betdate : datetime. winnings;
dateOnly = datepart(betdate);
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 -10
player1 12NOV2008:19:03:44 50
player2 07NOV2008:14:03:33 -50
player2 05NOV2008:09:00:00 -100
run;
proc sort data=have;
by username dateonly betdate;
run;
data want;
set have;
format resulthistory $5.;
by username dateOnly betdate;
retain resulthistory;
if winnings > 0 then winner = 'W';
else if winnings <= 0 then winner = 'L';
/* start a new history for each username, then append in time order */
if first.username then resulthistory=winner;
else resulthistory=cats(resulthistory,winner);
run;
PROC PRINT; RUN;