I have the following data of events by players in a game. Using this data, I would like to know how many times each player played for their team across the data set. When I group the data and count the rows for each player, the count can be too high: a player with 3 events in one game should only count as having played once. I can't find a way, when grouping, to count whether they played in each match rather than counting all of their events.
Desired result would be:
Team  Player  No of Times Played
AAA   P1      3
AAA   P2      2
(P1 played on 1/1, 1/2 and 1/4.)
(P2 only played on 1/1 and 1/4, not on 1/2.)
Here is the source data:
Team  Date    Player  Event
AAA   1/1/23  P1      Shoot
AAA   1/1/23  P2      Miss
AAA   1/1/23  P1      Pass
AAA   1/1/23  P3      Score
AAA   1/1/23  P5      Miss
AAA   1/1/23  P1      Shoot
AAA   1/2/23  P6      Shoot
AAA   1/2/23  P1      Miss
AAA   1/2/23  P3      Pass
AAA   1/2/23  P4      Miss
AAA   1/2/23  P7      Miss
AAA   1/2/23  P1      Shoot
AAA   1/4/23  P1      Score
AAA   1/4/23  P2      Shoot
AAA   1/4/23  P4      Miss
BBB   1/1/23  P1      Miss
BBB   1/1/23  P3      Miss
BBB   1/1/23  P1      Pass
BBB   1/1/23  P6      Score
BBB   1/3/23  P5      Miss
BBB   1/3/23  P3      Shoot
BBB   1/3/23  P2      Shoot
BBB   1/4/23  P1      Score
BBB   1/4/23  P3      Pass
I tried Group By, but it counts the number of rows, not unique instances.
Select the Team, Date and Player columns, right-click, and choose Remove Duplicates.
Then select the Team and Player columns, right-click, choose Group By, and accept the defaults.
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Removed Duplicates" = Table.Distinct(Source, {"Team", "Player", "Date"}),
    #"Grouped Rows" = Table.Group(#"Removed Duplicates", {"Team", "Player"}, {{"Count", each Table.RowCount(_), Int64.Type}})
in
    #"Grouped Rows"
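Alternatively, do a Group By and replace the generated code with the following, which counts the distinct dates within each Team and Player group: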
= Table.Group(#"Changed Type", {"Team", "Player"}, {{"Count", each List.Count( List.Distinct( _[Date])) , Int64.Type}})
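The idea behind both answers (count distinct match dates per team and player, not rows) can be cross-checked outside Power Query. The following is only a minimal Python sketch of that idea, not a translation of the M code; the names `RAW`, `events` and `times_played` are made up for illustration, and the rows are copied from the question.

```python
# Rows copied from the question: Team, Date, Player, Event.
RAW = """
AAA 1/1/23 P1 Shoot
AAA 1/1/23 P2 Miss
AAA 1/1/23 P1 Pass
AAA 1/1/23 P3 Score
AAA 1/1/23 P5 Miss
AAA 1/1/23 P1 Shoot
AAA 1/2/23 P6 Shoot
AAA 1/2/23 P1 Miss
AAA 1/2/23 P3 Pass
AAA 1/2/23 P4 Miss
AAA 1/2/23 P7 Miss
AAA 1/2/23 P1 Shoot
AAA 1/4/23 P1 Score
AAA 1/4/23 P2 Shoot
AAA 1/4/23 P4 Miss
BBB 1/1/23 P1 Miss
BBB 1/1/23 P3 Miss
BBB 1/1/23 P1 Pass
BBB 1/1/23 P6 Score
BBB 1/3/23 P5 Miss
BBB 1/3/23 P3 Shoot
BBB 1/3/23 P2 Shoot
BBB 1/4/23 P1 Score
BBB 1/4/23 P3 Pass
"""
events = [tuple(line.split()) for line in RAW.strip().splitlines()]


def times_played(rows):
    """Count distinct match dates per (team, player), ignoring repeat events."""
    dates = {}
    for team, date, player, _event in rows:
        # A set keeps each date once, however many events the player had.
        dates.setdefault((team, player), set()).add(date)
    return {key: len(played) for key, played in dates.items()}


counts = times_played(events)
print(counts[("AAA", "P1")], counts[("AAA", "P2")])
```

Collecting dates in a set plays the same role as Remove Duplicates or `List.Distinct` in the answers above.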
Is there a way I could replace a row value with the value from another row within each group?
Below are the before and after data sets. For each customer and LINK_ID, the Product on the type C row needs to be changed to the Product of the type L row that has the highest amount.
Before
Obs Cust LINK_ID Type Product Amount
1 1 12432 L A 23
2 1 12432 C B 0
3 2 23213 L C 234
4 2 23145 L D 25
5 2 23145 C E 0
6 3 21311 L F 34
7 3 21324 L G 45
8 3 21324 L H 35
9 3 21324 C I 0
After
Cust LINK_ID Type Product Amount
1 12432 L A 234
1 12432 C A -
2 23213 L C 23,212
2 23145 L D 335
2 23145 C D -
3 21311 L F 323
3 21324 L G 2,344
3 21324 L H 34
3 21324 C G -
Thank you!
If I understand correctly, you want the Product value on each type C row to be the product associated with the highest amount among the type L rows. If so, one possible approach is the following. First, find the product with the highest amount among the type L rows within each group of customer and LINK_ID.
Note that the original dataset is assumed to be named "example".
proc sql;
/* max() is a summary function, so the comparison goes in HAVING, not WHERE */
create table L_Type as
select cust, LINK_ID, product, amount
from example
where type = 'L'
group by cust, LINK_ID
having amount = max(amount)
;
quit;
Then the product calculated above is assigned to the type C rows in the original example:
proc sql;
select
e.cust
, e.LINK_ID
, e.type
, case when e.type = 'C' then b.product else e.product end as product
, e.amount
from example e left join L_Type b
on e.cust = b.cust and e.LINK_ID = b.LINK_ID
;
quit;
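To check the expected result on the "Before" data, here is a minimal Python sketch of the same two steps (find the top type L product per customer and LINK_ID, then copy it onto the type C rows). It is an illustration under stated assumptions, not a port of the SAS code: the names `rows` and `relabel_c_rows` are hypothetical, and when several L rows tie on the maximum it simply keeps the first one seen.

```python
# (cust, link_id, type, product, amount), copied from the "Before" table.
rows = [
    (1, 12432, "L", "A", 23),
    (1, 12432, "C", "B", 0),
    (2, 23213, "L", "C", 234),
    (2, 23145, "L", "D", 25),
    (2, 23145, "C", "E", 0),
    (3, 21311, "L", "F", 34),
    (3, 21324, "L", "G", 45),
    (3, 21324, "L", "H", 35),
    (3, 21324, "C", "I", 0),
]


def relabel_c_rows(data):
    """Give each type C row the product of the top type L row in its group."""
    best = {}  # (cust, link_id) -> (amount, product) of the top L row so far
    for cust, link_id, typ, product, amount in data:
        key = (cust, link_id)
        # Strict ">" means the first L row wins a tie (an assumption).
        if typ == "L" and (key not in best or amount > best[key][0]):
            best[key] = (amount, product)

    result = []
    for cust, link_id, typ, product, amount in data:
        if typ == "C" and (cust, link_id) in best:
            product = best[(cust, link_id)][1]
        result.append((cust, link_id, typ, product, amount))
    return result


after = relabel_c_rows(rows)
for row in after:
    print(row)
```

Type L rows pass through unchanged, and a type C row with no matching L row keeps its original product.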
So you have a couple processing tasks to do:
Have you considered all the edge cases ?
For a customer find the row(s) with the maximum amount.
  Is one of them type L ?
    No, do nothing
    Yes, track the Product and LinkId as follows
      Is there more than one 'maximal' row ?
        No, track the Product & LinkId from the one row
        Yes, Is there more than one Product in the rows ?
          No, track the Product value
            Is there more than one LinkId ?
              No, track the LinkId
              Yes, Which LinkIds?
                Track all the different LinkIds
                Track one of these: first, lowest, highest, last LinkId
          Yes, now what ?
            Log an error ?
            Track one of the Product values because only one can be used, which one ?
              first occurring ?
              lowest value ?
              highest value ?
              last occurring ?
For the tracked LinkIds (there might not be any) apply the tracked Product to the rows that are type C (or perhaps type not L)
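One of the edge cases above, more than one Product tied at the maximum amount, can be detected before deciding how to resolve it. The following Python sketch is only an illustration: it groups by customer and LINK_ID (an assumption; the checklist above speaks of the customer alone), the function name `max_l_products` is made up, and the tied rows in `sample` are hypothetical data, not taken from the question.

```python
from collections import defaultdict


def max_l_products(data):
    """Map (cust, link_id) to the sorted products tied at the top L amount."""
    top = {}                  # (cust, link_id) -> highest L amount seen so far
    tied = defaultdict(list)  # (cust, link_id) -> products at that amount
    for cust, link_id, typ, product, amount in data:
        if typ != "L":
            continue
        key = (cust, link_id)
        if key not in top or amount > top[key]:
            top[key] = amount
            tied[key] = [product]   # new maximum: start the list again
        elif amount == top[key] and product not in tied[key]:
            tied[key].append(product)
    return {key: sorted(products) for key, products in tied.items()}


# Hypothetical rows in which products G and H tie at amount 45.
sample = [
    (3, 21324, "L", "G", 45),
    (3, 21324, "L", "H", 45),
    (3, 21324, "C", "I", 0),
    (1, 12432, "L", "A", 23),
]
ambiguous = {k: v for k, v in max_l_products(sample).items() if len(v) > 1}
print(ambiguous)
```

Groups that show up in `ambiguous` are the ones that need an explicit rule (log an error, or pick the first, lowest, highest or last product).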
I need to concatenate the values of column B in a measure.
Table1:
A B
1 RED
2 GREEN
3 BLUE
4 RED
5 BLACK
The measure should return: RED GREEN BLUE RED BLACK
How can I do this?
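You can use the CONCATENATEX() function.
Here is a simple example. It iterates over the table rows rather than VALUES('Table'[B]), because VALUES returns only the distinct values and would drop the second RED:
B values =
CONCATENATEX(
    'Table',
    'Table'[B],
    " ",
    'Table'[A], ASC
)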
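For comparison outside DAX, the difference between concatenating every row and concatenating only the distinct values (which is what `VALUES` returns) can be sketched in Python. This is just an illustration of the two behaviours; the variable names are made up.

```python
# Column B from the question, in row order.
values_b = ["RED", "GREEN", "BLUE", "RED", "BLACK"]

# Every row, duplicates included: this is what the question asks for.
all_values = " ".join(values_b)

# Distinct values only (first occurrence kept), analogous to iterating VALUES.
distinct_values = " ".join(dict.fromkeys(values_b))

print(all_values)
print(distinct_values)
```

The first string keeps both RED entries, while the second contains RED only once.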
The following sample data has variables describing bets by a number of players.
How can I calculate each player's first bettype, first betprice, the number of soccer bets, the number of baseball bets, the number of unique prices per customer and the number of unique bet types per username?
clear
input str16 username str40 betdate stake str16 bettype betprice str16 sport
player1 "12NOV2008 12:04:33" 90 SGL 5 SOCCER
player1 "04NOV2008:09:03:44" 30 SGL 4 SOCCER
player2 "07NOV2008:14:03:33" 120 SGL 5 SOCCER
player1 "05NOV2008:09:00:00" 50 SGL 4 SOCCER
player1 "05NOV2008:09:05:00" 30 DBL 3 BASEBALL
player1 "05NOV2008:09:00:05" 20 DBL 4 BASEBALL
player2 "09NOV2008:10:05:10" 10 DBL 5 BASEBALL
player2 "15NOV2008:15:05:33" 35 DBL 5 BASEBALL
player1 "15NOV2008:15:05:33" 35 TBL 5 BASEBALL
player1 "15NOV2008:15:05:33" 35 SGL 4 BASEBALL
end
generate double timestamp=clock(betdate,"DMY hms")
format timestamp %tc
generate double dateonly=date(betdate,"DMY hms")
format dateonly %td
generate firsttype
generate firstprice
generate soccercount
generate baseballcount
generate uniquebettypecount
generate uniquebetpricecount
This is a bit close to the margin, as a "please give me the code" question, with no attempt at your own solutions.
The first type and price are
bysort username (timestamp) : gen firsttype = bettype[1]
bysort username (timestamp) : gen firstprice = betprice[1]
The number of soccer and baseball bets is
egen soccercount = total(sport == "SOCCER"), by(username)
egen baseballcount = total(sport == "BASEBALL"), by(username)
The number of distinct [not unique!] bet types is
bysort username bettype : gen work = _n == 1
egen uniquebettypecount = total(work), by(username)
The number of distinct bet prices is found in just the same way, using betprice instead of bettype (but replace work rather than generating it again). Another way to do that is
egen work = tag(username bettype)
egen uniquebettypecount = total(work), by(username)
What is characteristic of all these variables is that the same value is repeated for all observations within each group. For example, firsttype has the same value for each occurrence of each distinct username. Often you will want to use each value just once. A key to that is the egen function tag() just used, for example
egen usertag = tag(username)
followed by uses of if usertag when needed. (if usertag is a useful idiom for if usertag == 1.)
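The expected values of these variables can be cross-checked by hand or with a short script. The following Python sketch is an independent illustration, not a translation of the Stata commands; the names `bets`, `parse` and `summarise` are made up, and the parser assumes every betdate has the fixed layout used in the sample data.

```python
from datetime import datetime

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# (username, betdate, stake, bettype, betprice, sport), copied from the question.
bets = [
    ("player1", "12NOV2008 12:04:33", 90, "SGL", 5, "SOCCER"),
    ("player1", "04NOV2008:09:03:44", 30, "SGL", 4, "SOCCER"),
    ("player2", "07NOV2008:14:03:33", 120, "SGL", 5, "SOCCER"),
    ("player1", "05NOV2008:09:00:00", 50, "SGL", 4, "SOCCER"),
    ("player1", "05NOV2008:09:05:00", 30, "DBL", 3, "BASEBALL"),
    ("player1", "05NOV2008:09:00:05", 20, "DBL", 4, "BASEBALL"),
    ("player2", "09NOV2008:10:05:10", 10, "DBL", 5, "BASEBALL"),
    ("player2", "15NOV2008:15:05:33", 35, "DBL", 5, "BASEBALL"),
    ("player1", "15NOV2008:15:05:33", 35, "TBL", 5, "BASEBALL"),
    ("player1", "15NOV2008:15:05:33", 35, "SGL", 4, "BASEBALL"),
]


def parse(ts):
    """Parse 'DDMONYYYY?HH:MM:SS', where ? is a space or a colon."""
    day, month, year = int(ts[:2]), MONTHS.index(ts[2:5]) + 1, int(ts[5:9])
    hour, minute, second = (int(part) for part in ts[10:].split(":"))
    return datetime(year, month, day, hour, minute, second)


def summarise(rows):
    """Per-user first bet, per-sport counts and distinct counts."""
    out = {}
    for user in sorted({r[0] for r in rows}):
        mine = sorted((r for r in rows if r[0] == user),
                      key=lambda r: parse(r[1]))
        out[user] = {
            "firsttype": mine[0][3],
            "firstprice": mine[0][4],
            "soccercount": sum(r[5] == "SOCCER" for r in mine),
            "baseballcount": sum(r[5] == "BASEBALL" for r in mine),
            "distinct_bettypes": len({r[3] for r in mine}),
            "distinct_betprices": len({r[4] for r in mine}),
        }
    return out


summary = summarise(bets)
for user, stats in summary.items():
    print(user, stats)
```

Unlike the Stata variables, which repeat the group value on every observation, this produces one record per user, which is roughly what selecting with `if usertag` gives.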
Some reading suggestions:
On by: http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
On egen: http://www.stata.com/help.cgi?egen
On distinct observations (and why the word "unique" is misleading): http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
I want to create a new variable HHage, which is the age of the head of household, for each household identified by HID. In the dataset, the head of household is coded as P1. The dataset looks like this:
Personid HID Age
P1 100 12
P2 100 45
P1 101 16
P1 102 35
P2 102 24
P3 102 26
I tried the egen command but I get an error pertaining to numlist. The command I used was:
egen hhage = anyvalue(age), values(integer 1,2 to 26)
// create the example data
clear
input ///
str2 Personid HID Age
P1 100 12
P2 100 45
P1 101 16
P1 102 35
P2 102 24
P3 102 26
end
// check whether there is only 1 household head per household
bys HID : gen byte flag = -(Personid == "P1")
bys HID (flag): replace flag = sum(flag)
assert flag == -1
drop flag
// create hhage
gen hhage = Age if Personid == "P1"
bys HID (hhage): replace hhage = sum(hhage)
list , sepby(HID)
The excellent answer from @Maarten Buis explains that you can do this without egen. This answer focuses on using egen for this kind of problem.
What is allowed as a numlist is a minor issue here; the major issue is that the egen function anyvalue() is of little help. Its documentation explains that
anyvalue(varname), values(integer numlist) may not be combined with by. It takes the value of varname if varname is equal to any integer value in a supplied numlist and is missing otherwise.
This would be legal syntax
egen hhage = anyvalue(age), values(1/26)
but Stata would copy ages 1 to 26 to the new variable and ignore the others, observation by observation, regardless of household and who is head of household. That is not what you want.
One egen solution for this might be
egen hhage = total(age * (Personid == "P1")), by(HID)
The expression Personid == "P1" evaluates to 1 when true and 0 when false. So the age of the household head appears in the total and other values of age are ignored in so far as they contribute 0 to the total.
The by() option is undocumented but will work. Stata encourages you to do this instead:
bysort HID : egen hhage = total(age * (Personid == "P1"))
This solution assumes that
Personid is a string variable. If it is a numeric variable, the expression Personid == "P1" should be replaced by something like Personid == 1 using 1 or whatever other integer code is appropriate.
There is one head of household per household. That can be checked directly by something like
egen hhcount = total(Personid == "P1"), by(HID)
See also http://www.stata-journal.com/article.html?article=dm0055 for a review of technique in this territory.
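The logic of totalling age times a 0/1 indicator, together with the check that each household has exactly one head, can be illustrated outside Stata. This is a minimal Python sketch under the same two assumptions listed above; the names `people` and `head_age_by_household` are made up for illustration.

```python
# (Personid, HID, Age), copied from the question.
people = [
    ("P1", 100, 12),
    ("P2", 100, 45),
    ("P1", 101, 16),
    ("P1", 102, 35),
    ("P2", 102, 24),
    ("P3", 102, 26),
]


def head_age_by_household(rows, head_code="P1"):
    """Sum age * indicator per household; non-heads contribute 0."""
    totals = {}  # HID -> age of the head
    heads = {}   # HID -> number of heads found
    for personid, hid, age in rows:
        is_head = personid == head_code  # True/False acts as 1/0
        totals[hid] = totals.get(hid, 0) + age * is_head
        heads[hid] = heads.get(hid, 0) + is_head
    bad = sorted(hid for hid, count in heads.items() if count != 1)
    if bad:
        raise ValueError(f"households without exactly one head: {bad}")
    return totals


hhage = head_age_by_household(people)
print(hhage)
```

If a household had two heads the total would silently add their ages, which is why the count of heads is checked before the totals are returned.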
Note that in principle you could do something like
egen work = anyvalue(age) if Personid == "P1", values(0/200)
allowing any age imaginable so long as the person is head of household. Then you could fix that by
egen hhage = total(work), by(HID)
However, I can see no point in that solution.
I have this raw data from source:
Category Product Price
C1 P1 1
C1 P2 1
C1 P3 4
C2 P4 2
C2 P5 10
C2 P6 12
I want to visualise a Power BI table that shows the Category average within the same structure:
Category Product Price
C1 P1 1
C1 P2 1
C1 Avg_C1 3
C1 P3 4
C2 P4 2
C2 Avg_C2 8
C2 P5 10
C2 P6 12
Many thanks if you show me a solution.
You can create a Matrix visual in which you place the Category and Product columns on Rows and for Values you use a Measure that would be like this:
PriceWithAverage =
VAR CurrentCategory =
    MAX ( ProductTable[Category] )
RETURN
    IF (
        ISFILTERED ( ProductTable[Product] ),
        MAX ( ProductTable[Price] ),
        CALCULATE (
            AVERAGE ( ProductTable[Price] ),
            FILTER ( ProductTable, ProductTable[Category] = CurrentCategory )
        )
    )
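For reference, the layout the question asks for (an average row placed among the products of its category, ordered by price) can be produced by a different technique: appending one average row per category and sorting. The following Python sketch shows that alternative only to make the target output concrete; it is not what the measure above does, and the names `products` and `with_category_averages` are made up.

```python
# (Category, Product, Price), copied from the question.
products = [
    ("C1", "P1", 1),
    ("C1", "P2", 1),
    ("C1", "P3", 4),
    ("C2", "P4", 2),
    ("C2", "P5", 10),
    ("C2", "P6", 12),
]


def with_category_averages(rows):
    """Add an Avg_<category> row per category, ordered by price within it."""
    out = []
    for category in sorted({r[0] for r in rows}):
        group = [r for r in rows if r[0] == category]
        average = sum(r[2] for r in group) / len(group)
        group.append((category, f"Avg_{category}", average))
        # sorted() is stable, so products with equal prices keep their order.
        out.extend(sorted(group, key=lambda r: r[2]))
    return out


table = with_category_averages(products)
for row in table:
    print(*row)
```

Note that the mean of 1, 1 and 4 is 2, so the Avg_C1 row carries 2 rather than the 3 shown in the question's mock-up; the Avg_C2 value of 8 matches.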
Let us know if that works for you
Best
David