Stata: First occurrences, sum of unique occurrences with a by variable - stata

The following sample data has variables describing bets by a number of players.
How can I calculate each player's first bettype, first betprice, the number of soccer bets, the number of baseball bets, the number of unique prices per customer and the number of unique bet types per username?
clear
input str16 username str40 betdate stake str16 bettype betprice str16 sport
player1 "12NOV2008 12:04:33" 90 SGL 5 SOCCER
player1 "04NOV2008:09:03:44" 30 SGL 4 SOCCER
player2 "07NOV2008:14:03:33" 120 SGL 5 SOCCER
player1 "05NOV2008:09:00:00" 50 SGL 4 SOCCER
player1 "05NOV2008:09:05:00" 30 DBL 3 BASEBALL
player1 "05NOV2008:09:00:05" 20 DBL 4 BASEBALL
player2 "09NOV2008:10:05:10" 10 DBL 5 BASEBALL
player2 "15NOV2008:15:05:33" 35 DBL 5 BASEBALL
player1 "15NOV2008:15:05:33" 35 TBL 5 BASEBALL
player1 "15NOV2008:15:05:33" 35 SGL 4 BASEBALL
end
generate double timestamp=clock(betdate,"DMY hms")
format timestamp %tc
generate dateonly=dofc(timestamp)
format dateonly %td
* variables wanted, one value per username:
* firsttype firstprice soccercount baseballcount
* uniquebettypecount uniquebetpricecount

This is a bit close to the margin, as a "please give me the code" question, with no attempt at your own solutions.
The first type and price are
bysort username (timestamp) : gen firsttype = bettype[1]
bysort username (timestamp) : gen firstprice = betprice[1]
The number of soccer and baseball bets is
egen soccercount = total(sport == "SOCCER"), by(username)
egen baseballcount = total(sport == "BASEBALL"), by(username)
The number of distinct [not unique!] bet types is
bysort username bettype : gen work = _n == 1
egen uniquebettypecount = total(work), by(username)
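and the other problem, the number of distinct bet prices, is handled in just the same way with betprice in place of bettype (but drop work first, or use replace on it). Another way to create work is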
egen work = tag(username bettype)
egen uniquebettypecount = total(work), by(username)
What is characteristic of all these variables is that the same value is repeated for all observations within each group. For example, firsttype has the same value for each occurrence of each distinct username. Often you will want to use each value just once. A key to that is the egen function tag() just used, for example
egen usertag = tag(username)
followed by uses of if usertag when needed. (if usertag is a useful idiom for if usertag == 1.)
Some reading suggestions:
On by: http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
On egen: http://www.stata.com/help.cgi?egen
On distinct observations (and why the word "unique" is misleading): http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
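
For readers who want to check the logic outside Stata, the per-player summaries can be sketched in plain Python. This is only an illustration under assumptions, not part of the Stata solution: the `bets` list is a hand-copied subset of the sample data with timestamps rewritten in sortable ISO form, and `summarize` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-copied subset of the sample data (an assumption for illustration):
# (username, timestamp, bettype, betprice, sport).
# Timestamps use ISO form so that string order equals time order.
bets = [
    ("player1", "2008-11-12T12:04:33", "SGL", 5, "SOCCER"),
    ("player1", "2008-11-04T09:03:44", "SGL", 4, "SOCCER"),
    ("player2", "2008-11-07T14:03:33", "SGL", 5, "SOCCER"),
    ("player1", "2008-11-05T09:05:00", "DBL", 3, "BASEBALL"),
    ("player2", "2008-11-09T10:05:10", "DBL", 5, "BASEBALL"),
]


def summarize(rows):
    """Return one summary dict per username (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    out = {}
    for user, grp in groups.items():
        # Same idea as: bysort username (timestamp)
        grp.sort(key=lambda r: r[1])
        out[user] = {
            "firsttype": grp[0][2],
            "firstprice": grp[0][3],
            "soccercount": sum(r[4] == "SOCCER" for r in grp),
            "baseballcount": sum(r[4] == "BASEBALL" for r in grp),
            # Distinct values, counted once each, like tag() plus total()
            "distinct_types": len({r[2] for r in grp}),
            "distinct_prices": len({r[3] for r in grp}),
        }
    return out


summary = summarize(bets)
```

Each username gets exactly one summary here, which corresponds to using the Stata variables only where usertag is 1.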

Related

SAS formatting datalines

OK, my last question: I am having a hard time formatting this.
data practice;
input
Datalines;
employee_id Name gender years dept salary Birthday
1 Mitchell, Jane A f 6 shoe 22,450 12/30/1960
2 Miller, Frances T f 8 appliance . 11/27/1965
3 Evans, Richard A m 9 appliance 42,900 02/15/1973
4 Fair, Suzanne K f 3 clothing 29,700 03/09/1958
5 Meyers, Thomas D m 5 appliance 33,700 10/22/1961
6 Rogers, Steven F m 3 shoe 27,000 09/12/1960
7 Anderson, Frank F m 5 clothing 33,000 03/09/1958
10 Baxter, David T m 2 shoe 23,900 11/25/1966
11 Wood, Brenda L f 3 clothing 33,000 01/14/1962
12 Wheeler, Vickie M f 7 appliance 31,500 12/23/1975
13 Hancock, Sharon T f 1 clothing 21,000 01/17/1972
14 Looney, Roger M m 10 appliance 31,500 06/09/1973
15 Fry, Marie E f 6 clothing 29,700 05/25/1967
;
run;quit;
Proc print data=practice;
run;quit;
OK, my question: is there a way to do this without having to count each individual space? Even when I do count, the data still does not print out properly. What am I doing wrong? Thanks in advance. This should be my last question; afterwards I should be ready for this final.
If you don't assign a character length, SAS gives a character variable read with list input a default length of 8 bytes, so longer values are truncated. You can use the statement length var $w; before your input statement to set your own length. Using the option dsd on the infile statement tells SAS to use comma as your variable delimiter, read strings enclosed in quotation marks as a single variable, and to strip the quotation marks off before saving the variable. If using blank spaces as your delimiter, make sure there are no blank spaces in front of each row below the datalines statement.
data practice;
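infile datalines dsd; /* dsd: comma-delimited, quoted values read as one field */
length Name $50 dept $9; /* set lengths before the input statement */
/* the colon modifier applies the informat within list input */
input employee_id Name $ gender $ years dept $ salary $ Birthday : mmddyy10.;
format Birthday MMDDYY10.;
Datalines;
1,"Mitchell, Jane A",f,6,shoe,"22,450",12/30/1960
2,"Miller, Frances T",f,8,appliance,,11/27/1965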
;
run;
Proc print data=practice;
run;quit;
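
The way dsd treats quoted values and empty fields is the usual comma-separated convention, and it can be illustrated outside SAS. The following Python sketch uses the standard csv module on the same two data lines; it is an analogy under assumptions, not a description of how SAS works internally, and `parse_employee` is a hypothetical helper name.

```python
import csv
from datetime import datetime

# The two data lines from the answer, with no blanks after the commas.
lines = [
    '1,"Mitchell, Jane A",f,6,shoe,"22,450",12/30/1960',
    '2,"Miller, Frances T",f,8,appliance,,11/27/1965',
]


def parse_employee(fields):
    """Convert one list of seven text fields to typed values (hypothetical helper)."""
    emp_id, name, gender, years, dept, salary, birthday = fields
    return {
        "employee_id": int(emp_id),
        "name": name,  # the comma inside the quotes stays part of the name
        "gender": gender,
        "years": int(years),
        "dept": dept,
        # An empty field between two commas is treated as missing
        "salary": int(salary.replace(",", "")) if salary else None,
        "birthday": datetime.strptime(birthday, "%m/%d/%Y").date(),
    }


# csv.reader splits on commas but keeps quoted values whole and drops the quotes
employees = [parse_employee(fields) for fields in csv.reader(lines)]
```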

Stata: Concatenate string variable on by condition

I'm creating a variable called result in the following sample data which shows a 'W' if a bet was a win and 'L' if it was a loss.
How can I concatenate this variable with itself on a row by row basis in strict order by timestamp for each username?
clear
input str16 username str40 betdate winnings
player1 "12NOV2008:19:04:01" -10
player1 "12NOV2008:12:03:44" 50
player2 "07NOV2008:14:03:33" -50
player2 "05NOV2008:09:00:00" -100
end
generate double timestamp=clock(betdate,"DMY hms")
format timestamp %tc
cap drop result
generate result = "L"
replace result = "W" if (winnings >0)
cap drop resulthistory
generate resulthistory = ""
replace resulthistory = concat(resulthisory + result), by(USERNAME timestamp)
Readers should note that the last line of the question is fantasy syntax; the rest would work.
This may be what you seek. Note that because you read in the data afresh after clear, the variables you capture drop cannot exist, so those commands can be omitted.
clear
input str16 username str40 betdate winnings
player1 "12NOV2008:19:04:01" -10
player1 "12NOV2008:12:03:44" 50
player2 "07NOV2008:14:03:33" -50
player2 "05NOV2008:09:00:00" -100
end
gen double timestamp=clock(betdate,"DMY hms")
format timestamp %tc
gen result = cond(winnings > 0, "W", "L")
bysort username (timestamp): gen resulthistory = result[1]
by username : replace resulthistory = resulthistory[_n-1] + result if _n > 1
by username : replace resulthistory = resulthistory[_N]
list
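
The running concatenation can also be sketched in plain Python for readers who want to trace it by hand. This is an illustration under assumptions: the rows are the four sample observations with timestamps rewritten in sortable ISO form, and `result_histories` is a hypothetical helper name. It returns the running history on each row, which is what the Stata code holds before its last replace copies the final value to every row of the player.

```python
from itertools import groupby

# Sample observations: (username, timestamp, winnings).
# ISO timestamps are an assumption so that string order equals time order.
rows = [
    ("player1", "2008-11-12T19:04:01", -10),
    ("player1", "2008-11-12T12:03:44", 50),
    ("player2", "2008-11-07T14:03:33", -50),
    ("player2", "2008-11-05T09:00:00", -100),
]


def result_histories(rows):
    """Return (username, timestamp, history so far) in time order per user."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    out = []
    for user, grp in groupby(ordered, key=lambda r: r[0]):
        history = ""  # restarts for each username
        for _, ts, winnings in grp:
            history += "W" if winnings > 0 else "L"
            out.append((user, ts, history))
    return out


histories = result_histories(rows)
```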

Class variable in PROC TABULATE. Need alternative to ALL command

I have the following two tables. One has a couple of bet results and the other has a number of 'dummy' bets that need to be added. I want to get the mean of the original sample, the mean of the sample with the dummy bets added, and then perform both a chi-squared test of the difference between the columns and a Kruskal-Wallis test of the difference between the rows.
I'm having an issue with tabulating the data to produce the mean for both categories.
data A;
input username $ betdate : datetime. stake winnings node $;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90 X
player1 04NOV2008:09:03:44 100 40 L
player2 07NOV2008:14:03:33 120 -120 W
player1 05NOV2008:09:00:00 50 15 L
player1 05NOV2008:09:05:00 30 5 W
player1 05NOV2008:09:00:05 20 10 L
player2 09NOV2008:10:05:10 10 -10 W
player2 15NOV2008:15:05:33 35 -35 W
player1 15NOV2008:15:05:33 35 15 L
player1 15NOV2008:15:05:33 35 15 L
run;
proc sql; create table B(toAdd num,node char(100)); quit;
proc sql; insert into B (toAdd, node)
values(5, 'X')
values(3, 'L')
values(7, 'W') ;
quit;
I want to show the mean without dummy bets and the mean with the dummy bets included. I'm adding the dummy bets as follows:
proc sort data=A out=A; by node; run;
data A;
modify A B;
by node;
do i = 1 to toAdd;
stake = 0;
stakediff = -1;
dummy = 1;
output;
end;
run;
The problem is that when I tabulate the data, because there aren't really two distinct categories, it's not showing me what I want.
proc tabulate data=A;
class node dummy;
var stake winnings;
table node="",stake="" * (Mean="")*(dummy="" ALL);
run;
I'm using the dummy bets to create a mean that's based on a large 'N'. I would just do this in PROC REPORT and calculate the mean manually with the larger 'N' as the denominator, but I need to perform a Kruskal-Wallis and a chi-squared test. It's easier to have the dummy bets with a stake of zero to keep things simple and maintain the correct counts in each category. Moreover, it's non-trivial to calculate the standard error on the fly (or back it out of the result created by PROC TABULATE) without having the dummy bets in each category.
How can I show just the result of PROC TABULATE above, but without the 0, 1 and ALL categories, since the entries where dummy is 1 are meaningless on their own? Ideally, I'd like to see 'WITHOUT DUMMIES' for 0 and 'WITH DUMMIES' for 1, and display the result of the ALL column as the 'WITH DUMMIES' = 1 category. I can then proceed to performing the Kruskal-Wallis test on the 'NODE' class variable and the chi-squared test on the dummy class variable; as it stands, I can't perform these tests with only the 0 category and the 1 category as classes.
If I could copy all the rows that are in category dummy = 0 into the category dummy = 1 it would solve the problem, I think.
Your 'if I could' is the right idea, largely. You need to fix your data to reflect the groupings you want; dummy=0 should be only nondummy bets, dummy=1 should be dummy AND nondummy bets, if I understand correctly. So you need to output the dummy=0 rows twice, once with dummy=1 and once with dummy=0.
Something like:
data A;
modify A B;
by node;
output;
dummy=1;
output;
do i = 1 to toAdd;
stake = 0;
stakediff = -1;
dummy = 1;
output;
end;
run;
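
To see what the two groupings should produce, the means can be worked out by hand. The Python sketch below does that arithmetic for the stake values in table A and the toAdd counts in table B; it is an illustration only, and `means_with_and_without_dummies` is a hypothetical helper name, not anything SAS provides.

```python
# Stakes per node, copied from table A, and dummy counts from table B.
stakes_by_node = {
    "X": [90],
    "L": [100, 50, 20, 35, 35],
    "W": [120, 30, 10, 35],
}
to_add = {"X": 5, "L": 3, "W": 7}


def means_with_and_without_dummies(stakes_by_node, to_add):
    """Return {node: (mean of real bets, mean of real bets plus zero-stake dummies)}."""
    out = {}
    for node, stakes in stakes_by_node.items():
        without = sum(stakes) / len(stakes)
        # The 'with dummies' group holds the real bets AND the dummy bets
        padded = stakes + [0] * to_add.get(node, 0)
        out[node] = (without, sum(padded) / len(padded))
    return out


means = means_with_and_without_dummies(stakes_by_node, to_add)
```

For node L, for example, the five real stakes sum to 240, so the mean is 48 without dummies and 240 / 8 = 30 once the three zero-stake dummies are included.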

properties of households from individual data

I want to create new variable HHage which is the age of head of household reported by HID. In the dataset, the head of household is coded by P1. The dataset looks like this:
Personid HID Age
P1 100 12
P2 100 45
P1 101 16
P1 102 35
P2 102 24
P3 102 26
I tried the egen command but I get an error pertaining to numlist. The command I used was:
egen hhage = anyvalue(age), values(integer 1,2 to 26)
// create the example data
clear
input ///
str2 Personid HID Age
P1 100 12
P2 100 45
P1 101 16
P1 102 35
P2 102 24
P3 102 26
end
// check whether there is only 1 household head per household
bys HID : gen byte flag = -(Personid == "P1")
bys HID (flag): replace flag = sum(flag)
assert flag == -1
drop flag
// create hhage
gen hhage = Age if Personid == "P1"
bys HID (hhage): replace hhage = sum(hhage)
list , sepby(HID)
The excellent answer from @Maarten Buis explains that you can do this without egen. This answer focuses on using egen for this kind of problem.
What is allowed as a numlist is a minor issue here; the major issue is that the egen function anyvalue() is of little help. Its documentation explains that
anyvalue(varname), values(integer numlist) may not be combined with by. It takes the value of varname if varname is equal to any integer value in a supplied numlist and is missing otherwise.
This would be legal syntax
egen hhage = anyvalue(Age), values(1/26)
but Stata would copy ages 1 to 26 to the new variable and ignore the others, observation by observation, regardless of household and who is head of household. That is not what you want.
One egen solution for this might be
egen hhage = total(Age * (Personid == "P1")), by(HID)
The expression Personid == "P1" evaluates to 1 when true and 0 when false. So the age of the household head appears in the total, and other values of Age are ignored in so far as they contribute 0 to the total.
The by() option is undocumented but will work. Stata encourages you to do this instead:
bysort HID : egen hhage = total(Age * (Personid == "P1"))
This solution makes two assumptions.
First, Personid is a string variable. If it is a numeric variable, the expression Personid == "P1" should be replaced by something like Personid == 1, using 1 or whatever other integer code is appropriate.
Second, there is exactly one head of household per household. That can be checked directly by something like
egen hhcount = total(Personid == "P1"), by(HID)
See also http://www.stata-journal.com/article.html?article=dm0055 for a review of technique in this territory.
Note that in principle you could do something like
egen work = anyvalue(Age) if Personid == "P1", values(0/200)
allowing any age imaginable so long as the person is head of household. Then you could fix that by
egen hhage = total(work), by(HID)
However, I can see no point in that solution.
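
The idea of multiplying age by a true-or-false indicator and totalling within household can be traced in a few lines of Python. This is a sketch under assumptions, using the six example observations; `head_age_by_household` is a hypothetical helper name, and the check for exactly one head per household mirrors the hhcount check.

```python
# Example data: (Personid, HID, Age)
people = [
    ("P1", 100, 12),
    ("P2", 100, 45),
    ("P1", 101, 16),
    ("P1", 102, 35),
    ("P2", 102, 24),
    ("P3", 102, 26),
]


def head_age_by_household(people):
    """Return {HID: age of the household head}, assuming exactly one head each."""
    totals = {}
    heads = {}
    for pid, hid, age in people:
        is_head = pid == "P1"  # True counts as 1, False as 0
        totals[hid] = totals.get(hid, 0) + age * is_head
        heads[hid] = heads.get(hid, 0) + is_head
    bad = [hid for hid, n in heads.items() if n != 1]
    if bad:
        raise ValueError(f"households without exactly one head: {bad}")
    return totals


head_age = head_age_by_household(people)
# Spread the household value back to every person, as egen does
hhage = [head_age[hid] for _, hid, _ in people]
```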

SAS running total

I have some sample data as follows, and want to calculate the number of winning or losing bets in a row.
data have;
input username $ betdate : datetime. stake winnings;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90
player1 04NOV2008:09:03:44 100 40
player2 07NOV2008:14:03:33 120 -120
player1 05NOV2008:09:00:00 50 15
player1 05NOV2008:09:05:00 30 5
player1 05NOV2008:09:00:05 20 10
player2 09NOV2008:10:05:10 10 -10
player2 15NOV2008:15:05:33 35 -35
player1 15NOV2008:15:05:33 35 15
player1 15NOV2008:15:05:33 35 15
run;
PROC PRINT; RUN;
proc sort data=have;
by username betdate;
run;
DM "log; clear;";
data want;
set have;
by username dateOnly betdate;
retain calendarTime eventTime cumulativeDailyProfit profitableFlag;
if first.username then calendarTime = 0;
if first.dateOnly then calendarTime + 1;
if first.username then eventTime = 0;
if first.betdate then eventTime + 1;
if first.username then cumulativeDailyProfit = 0;
if first.dateOnly then cumulativeDailyProfit = 0;
if first.betdate then cumulativeDailyProfit + stake;
if winnings > 0 then winner = 1;
if winnings <= 0 then winner = 0;
PROC PRINT; RUN;
For example, the first four bets for player1 are winners, so the first four rows in this column should show 1,2,3,4 (at this point, four wins in a row). The fifth is a loser, so should show -1, followed by 1,2. The following three rows (for player2) should show -1, -2, -3, as the customer has had three losing bets in a row. How can I calculate the value of this column in the data step? How can I also have a column for the largest number of winning bets in a row (to date) and the maximum number of losing bets in a row the customer has had to date in each row?
Thanks for any help.
To do a running total like this, you can use BY with NOTSORTED and still leverage the first.<var> functionality. For example:
data have;
input winlose $;
datalines;
win
win
win
win
lose
lose
win
lose
win
win
lose
;;;;
run;
data want;
set have;
by winlose notsorted;
if first.winlose and winlose='win' then counter=1;
else if first.winlose then counter=-1;
else if winlose='win' then counter+1;
else counter+(-1);
run;
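Each time 'win' changes to 'lose' or the reverse, a new BY group starts and SAS sets first.winlose to 1, which is what restarts the counter. With the original data you would use by username winner notsorted; and test first.winner, so that the counter also restarts for each player.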
Once you have done this, you can either use a double DoW loop to append maximums, or perhaps more easily just get this value in a dataset and then add it on via a second datastep (or proc sql) to append your desired variables.
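
The signed streak counter, together with the longest winning and losing streaks to date, can be sketched in plain Python to check expected values by hand. This is an illustration under assumptions, not the SAS solution itself: `signed_streaks` is a hypothetical helper name and the input is the win/lose sequence from the example above.

```python
def signed_streaks(outcomes):
    """For each outcome (True = win) return (streak, max_win_to_date, max_loss_to_date).

    The streak is positive for consecutive wins and negative for consecutive losses.
    """
    out = []
    streak = 0
    max_win = 0
    max_loss = 0
    for won in outcomes:
        if won:
            # Continue a winning run, or restart at 1 after a loss
            streak = streak + 1 if streak > 0 else 1
            max_win = max(max_win, streak)
        else:
            # Continue a losing run, or restart at -1 after a win
            streak = streak - 1 if streak < 0 else -1
            max_loss = max(max_loss, -streak)
        out.append((streak, max_win, max_loss))
    return out


sequence = ["win", "win", "win", "win", "lose", "lose",
            "win", "lose", "win", "win", "lose"]
rows = signed_streaks([s == "win" for s in sequence])
streaks = [r[0] for r in rows]
```

To handle several players, the function would be called once per player on that player's bets in time order, which plays the same role as putting username first in the BY statement.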