Generate sum variable for subgroups - stata

I have here a Steam Dataset which includes individual steam user and their playtimes(overall) and the games they played. I further divided the player in hardcore(=1) and casual player (=0). Overall I want to test how various factors have influence on the overall playtime of the players, but now I want to build 2 regressions, one for hardcore players and one for casual players(because I think that the effect of every factor can differ between those two). But in order to do that, I need the sum of the overall playtime from the 2 subgroups. I tried egen playtime_type = sum(playtime_sum), by (hightype), but the outcome just doesn't make sense. How can I aggregate the sum of playtime only for each subgroup?
Here is a example from the dataset
steamid playtime_sum hightype
76561197960265729 0 0
76561197960265730 45 0
76561197960265730 45 0
76561197960265730 45 0
76561197960265733 1710 0
76561197960265733 1710 0
76561197960265733 1710 0
76561197960265733 1710 0
76561197960265733 1710 0
76561197960265738 11 0
76561197960265738 11 0
76561197960265738 11 0

What makes sense and what does not make sense is highly subjective and it is always better if you explain what the output was and what you had expected instead.
My guess is that you want to use total() instead of sum().
egen playtime_type = total(playtime_sum), by(hightype)

Related

KDB moving percentile using Swin function

I am trying to create a list of the 99th and 1st percentiles. Rather than a single percentile for today. I wanted percentiles for 500 days each using the prior 500 days. The functions I was using for this are the following
swin:{[f;w;s] f each { 1_x,y }\[w#0;s]}
percentile:{[x;y] y (100 xrank y:asc y) bin x}
swin[percentile[99;];500;List].
The issue I come across is that the 99th percentile calculates perfectly, but the 1st percentile makes the entire list = 0. a bit lost as to why it would do that. suggestions appreciated!
What's causing the zeros is two-fold:
What behaviour do you want for the earliest 500 days when there isn't 500 days of history to work with? On day 1 there's only 1 datapoint, on day 2 only 2 etc. Only on the 500th day is there 500 days of actual data to work with. By default that swin function fills the gaps with some seed value
You're using zero as that seed value, aka w#0
For example a 5 day lookback on each date looks something like:
q)swin[::;5;1 2 3 4 5]
0 0 0 0 1
0 0 0 1 2
0 0 1 2 3
0 1 2 3 4
1 2 3 4 5
You have zeros until you have data, so naturally the 1st percentile will pick up the zeros for the first roughly 500 dates.
So then you can decide to seed with a different value, or else possibly exclude zeros from your percentile function:
q)List:1000?1000
q)percentile:{[x;y] y (100 xrank y:asc y except 0) bin x}
q)swin[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90...
If zeros are a legitimate value in your list and can't be excluded then maybe seed the swin with some other value that you know won't be in the list (negatives? infinity? null?) and then exclude that seed from the percentile function.
EDIT: A final alternative is to use a different sliding window function which doesn't fill gaps with a seed value, e.g.
q)swin2:{[f;w;s] f each(),/:{neg[x]sublist y,z}[w]\[s]}
q)swin2[::;5;1 2 3 4 5]
,1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
q)percentile:{[x;y] y (100 xrank y:asc y) bin x}
q)swin2[percentile[99;];500;List]
908 908 908 908 908 908 908 908 908 908 908 959 959..
q)swin2[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90 90..

Two sided test: proc ttest?

I need to calculate the two-sided ttest in SAS.
I generally use the proc ttest adding side=2 but I am not sure if this test works fine or if another way should be preferred to it.
An example of data is the following:
Score Segment Obs Class_obs
1 0 500 15
1 1 500 34
2 0 234 23
2 1 766 65
Where the p-value is calculated per each score. Segment means that a condition is met (e.g. score higher than 60. 0 means ‘lower than 60’ while 1 means ‘higher than 60’).
Obs is the number of observations in each segment by score. Class obs is the number of obs that satisfy a specific condition on the overall population.
Happy to share more info if it needs.

Adding column based on ID in another data

data1 is data from 1990 and it looks like
Panelkey Region income
1 9 30
2 1 20
4 2 40
data2 is data from 2000 and it looks like
Panelkey Region income
3 2 40
2 1 30
1 1 20
I want to add a column of where each person lived in 1990.
Panelkey Region income Region1990
3 2 40 .
2 1 30 1
1 1 20 9
How can I do this on Stata?
The following code will deal with panels that live in multiple regions in the same year by choosing the region with larger income. This would make sense if income was proportional to fraction of the year spent in a region. Same income ties will be broken arbitrarily using the highest region's value. Other types of aggregation might make sense (take a look at the -collapse- command).
Note that I tweaked your data by inserting second rows for the last observation in each year:
clear
input Panelkey Region income
1 9 30
2 1 20
4 2 40
4 10 80
end
rename (Region income) =1990
bysort Panelkey (income Region): keep if _n==_N
isid Panelkey
save "data1990.dta", replace
clear
input Panelkey Region income
3 2 40
2 1 30
1 1 20
1 9 20
end
bysort Panelkey (income Region): keep if _n==_N
isid Panelkey
merge 1:1 Panelkey using "data1990.dta", keep(match master) nogen
list, clean noobs

how to calculate market value weighted price in stata [duplicate]

I'm using Stata, and I'm trying to compute the average price of firms' rivals in a market. I have data that looks like:
Market Firm Price
----------------------
1 1 100
1 2 150
1 3 125
2 1 50
2 2 100
2 3 75
3 1 100
3 2 200
3 3 200
And I'm trying to compute the average price of each firm's rivals, so I want to generate a new field that is the average values of the other firms in a market. It would look like:
Market Firm Price AvRivalPrice
------------------------------------
1 1 100 137.2
1 2 150 112.5
1 3 125 125
2 1 50 87.5
2 2 100 62.5
2 3 75 75
3 1 100 200
3 2 200 150
3 3 200 150
To do the average by group, I could use the egen command:
egen AvPrice = mean(price), by(Market)
But that wouldn't exclude the firm's own price in the average, and to the best of my knowledge, using the if qualifier would only change the observations it operated on, not the groups it averaged over. Is there a simple way to do this, or do I need to create loops and generate each average manually?
This is an old thread still of interest, so materials and techniques overlooked first time round still apply.
The more general technique is to work with totals. At its simplest, total of others = total of all - this value. In a egen framework that is going to look like
egen total = total(price), by(market)
egen n = total(!missing(price)), by(market)
gen avprice = (total - cond(missing(price), 0, price)) / cond(missing(price), n, n - 1)
The total() function of egen ignores missing values in its argument. If there are missing values, we don't want to include them in the count, but we can use !missing() which yields 1 if not missing and 0 if missing. egen's count() is another way to do this.
Code given earlier gives the wrong answer if missings are present as they are included in the count _N.
Even if a value is missing, the average of the other values still makes sense.
If no value is missing, the last line above simplifies to
gen avprice = (total - price) / (n - 1)
So far, this possibly looks like no more than a small variant on previous code, but it does extend easily to using weights. Presumably we want a weighted average of others' prices given some weight. We can exploit the fact that total() works on expressions, which can be more complicated than just variable names. Indeed the code above did that already, but it is often overlooked.
egen wttotal = total(weight * price), by(market)
egen sumwt = total(weight), by(market)
gen avprice = (wttotal - price * weight) / (sumwt - weight)
As before, if price or weight is ever missing, you need more complicated code, or just to ensure that you exclude such observations from the calculations.
See also the Stata FAQ
How do I create variables summarizing for each individual properties of the other members of a group?
http://www.stata.com/support/faqs/data-management/creating-variables-recording-properties/
for a wider-ranging discussion.
(If the numbers get big, work with doubles.)
EDIT 2 March 2018 That was a newer post in an old thread, which in turn needs updating. rangestat (SSC) can be used here and gives one-line solutions. Not surprisingly, the option excludeself was explicitly added for these kinds of problem. But while the solution for means is easy using an identity
mean for others = (total - value for self) / (count - 1)
many other summary measures don't yield to a similar, simple trick and in that sense rangestat includes much more general coding.
clear
input Market Firm Price
1 1 100
1 2 150
1 3 125
2 1 50
2 2 100
2 3 75
3 1 100
3 2 200
3 3 200
end
rangestat (mean) Price, interval(Firm . .) by(Market) excludeself
list, sepby(Market)
+----------------------------------+
| Market Firm Price Price_~n |
|----------------------------------|
1. | 1 1 100 137.5 |
2. | 1 2 150 112.5 |
3. | 1 3 125 125 |
|----------------------------------|
4. | 2 1 50 87.5 |
5. | 2 2 100 62.5 |
6. | 2 3 75 75 |
|----------------------------------|
7. | 3 1 100 200 |
8. | 3 2 200 150 |
9. | 3 3 200 150 |
+----------------------------------+
This is a way that avoids explicit loops, though it takes several lines of code:
by Market: egen Total = total(Price)
replace Total = Total - Price
by Market: gen AvRivalPrice = Total / (_N-1)
drop Total
Here's a shorter solution with fewer lines that kind of combines your original thought and #onestop's solution:
egen AvPrice = mean(price), by(Market)
bysort Market: replace AvPrice = (AvPrice*_N - price)/(_N-1)
This is all good for a census of firms. If you have a sample of the firms, and you need to apply the weights, I am not sure what a good solution would be. We can brainstorm it if needed.

PROC RANK by score: minimum number of a counts of target variable

I have used SAS PROC RANK to rank a population based on score and create groups of equal size. I would like to create groups such that there is a minimum number of target variable (Goods and Bads) in each bin. Is there a way to do that using PROC RANK? I understand that the size of each bin would be different.
For example in the table below, I have created 10 groups based on a certain score. As you can see the Non cures in the lower deciles are sparse. I would like to create groups such there there are at least 10 Non cures in each group.
Cures and Non cures are based on same variable: Cure = 1 and Cure = 0.
Decile cures non cures
0 262 94
1 314 44
2 340 19
3 340 13
4 353 10
5 373 5
6 308 3
7 342 3
8 440 4
9 305 3