Merging observations which occur at the same time into one observation

I have a dataset which looks like this:
loan   client   time   interest   loan amount   country
1      1        w      0.1        500.000       USA
2      1        x      0.2        250.000       Germany
3      2        y      0.1        300.000       France
4      2        y      0.15       400.000       France
5      2        y      0.2        100.000       France
6      3        z      0.1        50.000        England
.      .        .      .          .             .
.      .        .      .          .             .
I observe different clients at different points in time. For some observations I observe the same client at the same time receiving several different loans (in this example, client 2 is observed at time y three times). I want to bundle these observations together: I want to replace loans "3", "4" and "5" with one observation "3", summing the loan amount, averaging the interest rate, and using the entries for all other variables from observation "3". How can I perform these operations in Stata?

This ended up working for me:
collapse (mean) interest (sum) loanamount (firstnm) country, by (client time)
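For readers who want to check what that collapse call should return, here is a minimal sketch in plain Python (not Stata) of the same grouping logic on the example rows. The function name collapse_loans and the reading of the loan amounts as plain numbers are assumptions made for illustration; "first country" stands in for (firstnm) and does not skip missing values.

```python
# Rows mirror the question's example:
# (loan, client, time, interest, loan amount, country).
# Amounts are typed as the plain numbers shown in the question (an assumption).
rows = [
    (1, 1, "w", 0.10, 500.0, "USA"),
    (2, 1, "x", 0.20, 250.0, "Germany"),
    (3, 2, "y", 0.10, 300.0, "France"),
    (4, 2, "y", 0.15, 400.0, "France"),
    (5, 2, "y", 0.20, 100.0, "France"),
    (6, 3, "z", 0.10, 50.0, "England"),
]


def collapse_loans(rows):
    """Group by (client, time): mean interest, summed amount, first country."""
    groups = {}
    for _loan, client, time, interest, amount, country in rows:
        groups.setdefault((client, time), []).append((interest, amount, country))

    collapsed = {}
    for key, members in groups.items():
        interests = [m[0] for m in members]
        collapsed[key] = {
            "interest": sum(interests) / len(interests),  # like (mean) interest
            "loanamount": sum(m[1] for m in members),     # like (sum) loanamount
            "country": members[0][2],                     # first value in the group
        }
    return collapsed


collapsed = collapse_loans(rows)
```

Client 2 at time y should come out as a single record with the three loan amounts summed and the three interest rates averaged.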

Related

Average of percent of column totals in DAX

I have a fact table named meetings containing the following:
- staff
- minutes
- type
I then created a summarized table with the following:
TableA =
SUMMARIZECOLUMNS (
'meetings'[staff]
, 'meetings'[type]
, "SumMinutesByStaffAndType", SUM( 'meetings'[minutes] )
)
This makes a pivot table with staff as rows and columns as types.
For this pivottable I need to calculate each cell as a percent of the column total. For each staff I need the average of their percents. There are only 5 meeting types so I need the sum of these percents divided by 5.
I don't know how to divide one number grouped by two columns by another number grouped by one column. I'm coming from the SQL world so my DAX is terrible and I'm desperate for advice.
I tried creating another summarized table to get the sum of minutes for each type.
TableB =
SUMMARIZECOLUMNS (
'meetings'[type]
, "SumMinutesByType", SUM( 'meetings'[minutes] )
)
From there I want 'TableA'[SumMinutesByStaffAndType] / 'TableB'[SumMinutesByType].
TableC =
SUMMARIZECOLUMNS (
'TableA'[staff],
'TableB'[type],
DIVIDE ( 'TableA'[SumMinutesByType], 'TableB'[SumMinutesByType]
)
"A single value for column 'Minutes' in table 'Min by Staff-Contact' cannot be determined. This can happen when a measure formula refers to a column that contains many values without specifying an aggregation such as min, max, count, or sum to get a single result."
I keep arriving at this error which leads me to believe I'm not going about this the "Power BI way".
I have tried making measures and creating matrices on the reports view. I've tried using the group by feature in the Query Editor. I even tried both measures and aggregate tables. I'm likely overcomplicating it and way off the mark so any help is greatly appreciated.
Here's an example of what I'm trying to do.
## Input/First table
staff minutes type
--------- --------- -----------
Bill 5 TELEPHONE
Bill 10 FACE2FACE
Bill 5 INDIRECT
Bill 5 EMAIL
Bill 10 OTHER
Gary 10 TELEPHONE
Gary 5 EMAIL
Gary 5 OTHER
Madison 20 FACE2FACE
Madison 5 INDIRECT
Madison 15 EMAIL
Rob 5 FACE2FACE
Rob 5 INDIRECT
Rob 20 TELEPHONE
Rob 45 FACE2FACE
## Second table with SUM of minutes, Grand Total is column total.
Row Labels     EMAIL   FACE2FACE   INDIRECT   OTHER   TELEPHONE
-------------  ------- ----------- ---------- ------- -----------
Bill           5       10          5          10      5
Gary           5                              5       10
Madison        15      20          5
Rob                    50          5                  20
Grand Total    25      80          15         15      35
## Third table where each of the above cells is divided by its column total.
Row Labels EMAIL FACE2FACE INDIRECT OTHER TELEPHONE
------------- ------- ----------- ------------- ------------- -------------
Bill 0.2 0.125 0.333333333 0.666666667 0.142857143
Gary 0.2 0 0 0.333333333 0.285714286
Madison 0.6 0.25 0.333333333 0 0
Rob 0 0.625 0.333333333 0 0.571428571
Grand Total 25 80 15 15 35
## Final table with the sum of the rows in the third table divided by 5 (shown as percentages).
staff AVERAGE
--------- -------------
Bill 29.35714286
Gary 16.38095238
Madison 23.66666667
Rob 30.5952381
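To make the target arithmetic explicit, here is a minimal sketch in plain Python (not DAX) that reproduces the final table from the input table. The function name average_share_by_staff and the n_types parameter are hypothetical; the divisor of 5 comes from the question's statement that there are only 5 meeting types.

```python
from collections import defaultdict

# Input table from the question: (staff, minutes, type).
meetings = [
    ("Bill", 5, "TELEPHONE"), ("Bill", 10, "FACE2FACE"), ("Bill", 5, "INDIRECT"),
    ("Bill", 5, "EMAIL"), ("Bill", 10, "OTHER"),
    ("Gary", 10, "TELEPHONE"), ("Gary", 5, "EMAIL"), ("Gary", 5, "OTHER"),
    ("Madison", 20, "FACE2FACE"), ("Madison", 5, "INDIRECT"), ("Madison", 15, "EMAIL"),
    ("Rob", 5, "FACE2FACE"), ("Rob", 5, "INDIRECT"), ("Rob", 20, "TELEPHONE"),
    ("Rob", 45, "FACE2FACE"),
]


def average_share_by_staff(rows, n_types=5):
    """Average, over n_types meeting types, of each staff member's share of the type total."""
    cell = defaultdict(float)       # minutes per (staff, type)
    col_total = defaultdict(float)  # minutes per type (the column totals)
    for staff, minutes, mtype in rows:
        cell[(staff, mtype)] += minutes
        col_total[mtype] += minutes

    share_sum = defaultdict(float)  # sum of column shares per staff member
    for (staff, mtype), minutes in cell.items():
        share_sum[staff] += minutes / col_total[mtype]

    # Types a staff member never used contribute a share of 0,
    # so dividing by n_types still gives the intended average.
    return {staff: total / n_types for staff, total in share_sum.items()}


averages = average_share_by_staff(meetings)
```

Multiplying the results by 100 gives the percentages in the final table, so any DAX measure can be checked against these numbers.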
Please let me know if I can clarify an aspect.

You can make use of built-in functions such as %Row total in Power BI (the snapshot from the original answer is not reproduced here).
If this is not what you are looking for, kindly let me know. (I have used your input table.)

Generating a correlation table between observations in Stata

This is a problem that I have never encountered before, hence, I don't even know where to start.
I have an unbalanced panel data set (different products sold at different stores across weeks) and would like to run correlations on sales between each product combination. The requirement, however, is that a correlation is to be calculated using only the sales values of two products appearing together in the same store and week. That is to say, some weeks or some stores may sell only one of the two given products, so we just want to disregard those instances.
The number of observations in my data set is 400,000, but only 50 products are sold among them, so the final correlation matrix would have 50*50=2500 entries with 50*49/2=1225 unique off-diagonal correlation values. Does that make sense?
clear
input str2 product sales store week
A 10 1 1
B 20 1 1
C 23 1 1
A 10 2 1
B 30 2 1
C 30 2 1
F 43 2 1
end
The correlation table should be something like this [fyi, instead of the correlation values I put square brackets to illustrate the values to be used]. Please note that I cannot run a correlation for AF because there is only one store/week combination.
    A   B                C
A   1   [10,20; 10,30]   [10,23; 10,30]
B       1                [20,23; 30,30]
C                        1

You calculate correlations between pairs of variables; but what you regard as pairs of variables are not so in the present data layout. So, you need a reshape. The principle is shown by
clear
input str2 product sales store week
A 10 1 1
B 20 1 1
C 23 1 1
A 10 2 1
B 30 2 1
C 30 2 1
F 43 2 1
end
reshape wide sales , i(store week) j(product) string
rename sales* *
list
+----------------------------------+
| store week A B C F |
|----------------------------------|
1. | 1 1 10 20 23 . |
2. | 2 1 10 30 30 43 |
+----------------------------------+
pwcorr A-F
             |        A        B        C        F
-------------+------------------------------------
           A |        .
           B |        .   1.0000
           C |        .   1.0000   1.0000
           F |        .        .        .        .
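The results look odd only because your toy example won't allow otherwise. A doesn't vary in your example, so its correlations aren't defined. F has only one observation, so its correlations aren't defined either. The correlation between B and C is perfect because there are only two data points, which differ in both B and C, and two distinct points always lie exactly on a line.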
A different problem is that a 50 x 50 correlation matrix is unwieldy. How to get friendlier output depends on what you want to use it for.
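As a cross-check of what pairwise correlation does with this layout, here is a minimal sketch in plain Python (not Stata). It matches each pair of products on shared store/week keys and returns None where the correlation is undefined. The function names pearson and pairwise_correlations are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Toy data from the question: (product, sales, store, week).
sales = [
    ("A", 10, 1, 1), ("B", 20, 1, 1), ("C", 23, 1, 1),
    ("A", 10, 2, 1), ("B", 30, 2, 1), ("C", 30, 2, 1), ("F", 43, 2, 1),
]


def pearson(xs, ys):
    """Pearson correlation, or None if fewer than 2 points or no variation."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(rows):
    """Correlate each product pair using only store/week cells where both appear."""
    by_product = {}
    for product, value, store, week in rows:
        by_product.setdefault(product, {})[(store, week)] = value

    result = {}
    for p, q in combinations(sorted(by_product), 2):
        shared = sorted(set(by_product[p]) & set(by_product[q]))
        xs = [by_product[p][key] for key in shared]
        ys = [by_product[q][key] for key in shared]
        result[(p, q)] = pearson(xs, ys)
    return result


correlations = pairwise_correlations(sales)
```

On the toy data only the B and C pair has a defined correlation, which agrees with the pwcorr output above.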

how to calculate market value weighted price in stata [duplicate]

I'm using Stata, and I'm trying to compute the average price of firms' rivals in a market. I have data that looks like:
Market Firm Price
----------------------
1 1 100
1 2 150
1 3 125
2 1 50
2 2 100
2 3 75
3 1 100
3 2 200
3 3 200
And I'm trying to compute the average price of each firm's rivals, so I want to generate a new field that is the average values of the other firms in a market. It would look like:
Market Firm Price AvRivalPrice
------------------------------------
1 1 100 137.5
1 2 150 112.5
1 3 125 125
2 1 50 87.5
2 2 100 62.5
2 3 75 75
3 1 100 200
3 2 200 150
3 3 200 150
To do the average by group, I could use the egen command:
egen AvPrice = mean(Price), by(Market)
But that wouldn't exclude the firm's own price in the average, and to the best of my knowledge, using the if qualifier would only change the observations it operated on, not the groups it averaged over. Is there a simple way to do this, or do I need to create loops and generate each average manually?

This is an old thread still of interest, so materials and techniques overlooked first time round still apply.
The more general technique is to work with totals. At its simplest, total of others = total of all - this value. In an egen framework that is going to look like
egen total = total(price), by(market)
egen n = total(!missing(price)), by(market)
gen avprice = (total - cond(missing(price), 0, price)) / cond(missing(price), n, n - 1)
The total() function of egen ignores missing values in its argument. If there are missing values, we don't want to include them in the count, but we can use !missing() which yields 1 if not missing and 0 if missing. egen's count() is another way to do this.
Code given earlier gives the wrong answer if missings are present as they are included in the count _N.
Even if a value is missing, the average of the other values still makes sense.
If no value is missing, the last line above simplifies to
gen avprice = (total - price) / (n - 1)
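The same identity can be checked outside Stata. Below is a minimal sketch in plain Python of the totals approach, including the treatment of missing prices described above (None plays the role of Stata's missing value). The function name rival_means is hypothetical.

```python
# Example data from the question: (market, firm, price).
prices = [
    (1, 1, 100.0), (1, 2, 150.0), (1, 3, 125.0),
    (2, 1, 50.0), (2, 2, 100.0), (2, 3, 75.0),
    (3, 1, 100.0), (3, 2, 200.0), (3, 3, 200.0),
]


def rival_means(rows):
    """Mean price of the other firms in the same market, via group totals."""
    total = {}  # sum of non-missing prices per market
    count = {}  # number of non-missing prices per market
    for market, _firm, price in rows:
        if price is not None:
            total[market] = total.get(market, 0.0) + price
            count[market] = count.get(market, 0) + 1

    out = {}
    for market, firm, price in rows:
        t = total.get(market, 0.0)
        n = count.get(market, 0)
        own = 0.0 if price is None else price      # subtract own price only if present
        others = n if price is None else n - 1     # count of other non-missing prices
        out[(market, firm)] = (t - own) / others if others > 0 else None
    return out


rival = rival_means(prices)
```

A firm with a missing price still gets the average of the others, as the answer notes, because its own value is neither subtracted from the total nor removed from the count.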
So far, this possibly looks like no more than a small variant on previous code, but it does extend easily to using weights. Presumably we want a weighted average of others' prices given some weight. We can exploit the fact that total() works on expressions, which can be more complicated than just variable names. Indeed the code above did that already, but it is often overlooked.
egen wttotal = total(weight * price), by(market)
egen sumwt = total(weight), by(market)
gen avprice = (wttotal - price * weight) / (sumwt - weight)
As before, if price or weight is ever missing, you need more complicated code, or just to ensure that you exclude such observations from the calculations.
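A minimal sketch of the weighted version in plain Python, assuming no missing prices or weights. The weights used here are made up purely for illustration (the thread gives none), and the function name weighted_rival_means is hypothetical.

```python
# (market, firm, price, weight); the weights are hypothetical illustration values.
weighted_prices = [
    (1, 1, 100.0, 1.0),
    (1, 2, 150.0, 2.0),
    (1, 3, 125.0, 1.0),
]


def weighted_rival_means(rows):
    """Weighted mean of the other firms' prices in the same market."""
    wttotal = {}  # sum of weight * price per market
    sumwt = {}    # sum of weights per market
    for market, _firm, price, weight in rows:
        wttotal[market] = wttotal.get(market, 0.0) + weight * price
        sumwt[market] = sumwt.get(market, 0.0) + weight

    out = {}
    for market, firm, price, weight in rows:
        denom = sumwt[market] - weight  # total weight of the other firms
        out[(market, firm)] = (
            (wttotal[market] - weight * price) / denom if denom != 0 else None
        )
    return out


weighted_rival = weighted_rival_means(weighted_prices)
```

With all weights equal, this reduces to the unweighted mean of the others, which is a quick sanity check on any implementation.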
See also the Stata FAQ
How do I create variables summarizing for each individual properties of the other members of a group?
http://www.stata.com/support/faqs/data-management/creating-variables-recording-properties/
for a wider-ranging discussion.
(If the numbers get big, work with doubles.)
EDIT 2 March 2018 That was a newer post in an old thread, which in turn needs updating. rangestat (SSC) can be used here and gives one-line solutions. Not surprisingly, the option excludeself was explicitly added for these kinds of problem. But while the solution for means is easy using an identity
mean for others = (total - value for self) / (count - 1)
many other summary measures don't yield to a similar, simple trick and in that sense rangestat includes much more general coding.
clear
input Market Firm Price
1 1 100
1 2 150
1 3 125
2 1 50
2 2 100
2 3 75
3 1 100
3 2 200
3 3 200
end
rangestat (mean) Price, interval(Firm . .) by(Market) excludeself
list, sepby(Market)
+----------------------------------+
| Market Firm Price Price_~n |
|----------------------------------|
1. | 1 1 100 137.5 |
2. | 1 2 150 112.5 |
3. | 1 3 125 125 |
|----------------------------------|
4. | 2 1 50 87.5 |
5. | 2 2 100 62.5 |
6. | 2 3 75 75 |
|----------------------------------|
7. | 3 1 100 200 |
8. | 3 2 200 150 |
9. | 3 3 200 150 |
+----------------------------------+

This is a way that avoids explicit loops, though it takes several lines of code:
bysort Market: egen Total = total(Price)
replace Total = Total - Price
by Market: gen AvRivalPrice = Total / (_N-1)
drop Total

Here's a shorter solution that combines your original idea with @onestop's solution:
egen AvPrice = mean(Price), by(Market)
bysort Market: replace AvPrice = (AvPrice*_N - Price)/(_N-1)
This is all good for a census of firms. If you have a sample of the firms, and you need to apply the weights, I am not sure what a good solution would be. We can brainstorm it if needed.

Convert Daily Returns to Monthly Returns in Stata

I am using Stata and I have 6 years of daily returns for stocks that individuals hold in their portfolios. I would like to aggregate the daily returns to monthly portfolio returns. In some instances, the individual may hold more than one stock in the portfolio. I am struggling with writing the code to do this.
For a visual, my data looks like this: [image not reproduced]
I would like the results to look like this: [image not reproduced]
Where individual 2's portfolio return for the month of December 1996 is calculated as: 0.3 * 0.0031 + 0.7 * 0.0076 = 0.00625.
I have tried the collapse command such as
collapse Return, by (ID Year Month)
but this does not provide the same return that I calculated out in Excel.
I am able to make a weighted portfolio return for all the days using
bysort ID year month: gen wt_return = stock_weight * monthly_return
But this gives me daily returns. My trouble is then aggregating them into one return for the corresponding month.
As for the specifics, I would like to calculate the monthly portfolio return as the product of 1 + the weighted daily returns. As a last resort, the mean return for the month could work.
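To pin down the calculation being asked for, here is a minimal sketch in plain Python (not Stata). It first sums weight times return across stocks within each day, then compounds (1 + daily portfolio return) within each month and subtracts 1. The rows below are hypothetical illustration data, apart from the 0.3 * 0.0031 + 0.7 * 0.0076 case quoted above, and the function name monthly_portfolio_returns is an assumption.

```python
# (person, year, month, day, stock, daily return, weight); hypothetical rows.
daily_rows = [
    (1, 1991, 1, 1, "ABC", 0.01, 1.0),
    (1, 1991, 1, 2, "ABC", 0.02, 1.0),
    (2, 1996, 12, 31, "ABC", 0.0031, 0.3),
    (2, 1996, 12, 31, "XYZ", 0.0076, 0.7),
]


def monthly_portfolio_returns(rows):
    """Compound weighted daily portfolio returns into one return per person-month."""
    # Step 1: weighted portfolio return per person and day.
    daily = {}
    for person, year, month, day, _stock, ret, weight in rows:
        key = (person, year, month, day)
        daily[key] = daily.get(key, 0.0) + weight * ret

    # Step 2: product of (1 + daily return) within each person-month.
    growth = {}
    for (person, year, month, _day), r in sorted(daily.items()):
        key = (person, year, month)
        growth[key] = growth.get(key, 1.0) * (1.0 + r)

    # Step 3: convert the growth factor back to a return.
    return {key: g - 1.0 for key, g in growth.items()}


monthly = monthly_portfolio_returns(daily_rows)
```

For a month with a single trading day the result is just that day's weighted return; with several days the daily returns compound rather than add.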

You don't show the monthly portfolio return for person 2 in 1991. Your initial example data doesn't show stock weights, but the desired example data does. Your variable Monthly Return is not reproducible. You should take time to verify that your question is clear when posting. It's supposed to be clear to the public who will read it, not only to you.
I didn't check whether your computations are correct, but below is what I understand you want. The procedure is simply to compute weighted returns and then add them up by person-year-month groups. (I assume the stock weights apply to stocks on a daily basis, which is what your example data implies.)
clear all
set more off
input ///
perid year month day str3 stockid return stockw
1 1991 1 1 "ABC" .01 1
1 1991 1 2 "ABC" .02 1
1 1991 1 3 "ABC" -.01 1
1 1991 1 31 "ABC" .004 1
1 1996 12 31 "ABC" .002 1
2 1991 1 1 "ABC" .01 .3
2 1991 1 2 "ABC" .02 .3
2 1996 12 31 "ABC" .004 .3
2 1991 1 1 "XYZ" .001 .7
2 1991 1 2 "XYZ" .004 .7
2 1996 12 31 "XYZ" .021 .7
end
* create weighted return
gen returnw = return * stockw
sort perid year month day
list, sepby(perid year month day)
* sum weighted returns by person, year, month
collapse (sum) returnw, by (perid year month)
list, sepby(perid)
If you want collapse to sum, you must say so with (sum) (although I'm not clear whether this is what you want). By default, it computes the mean. Read help collapse thoroughly.
