data1 is data from 1990 and looks like this:
Panelkey Region income
1 9 30
2 1 20
4 2 40
data2 is data from 2000 and looks like this:
Panelkey Region income
3 2 40
2 1 30
1 1 20
I want to add a column showing where each person lived in 1990.
Panelkey Region income Region1990
3 2 40 .
2 1 30 1
1 1 20 9
How can I do this in Stata?
The following code deals with panels that live in multiple regions in the same year by choosing the region with the larger income. This would make sense if income were proportional to the fraction of the year spent in a region. Income ties are broken arbitrarily using the highest region value. Other types of aggregation might also make sense (take a look at the -collapse- command).
Note that I tweaked your data by inserting second rows for the last observation in each year:
clear
input Panelkey Region income
1 9 30
2 1 20
4 2 40
4 10 80
end
rename (Region income) =1990
* within each Panelkey, keep the row with the largest 1990 income
* (ties broken by the highest region value)
bysort Panelkey (income1990 Region1990): keep if _n==_N
isid Panelkey
save "data1990.dta", replace
clear
input Panelkey Region income
3 2 40
2 1 30
1 1 20
1 9 20
end
* same rule for the 2000 data: keep the row with the largest income per Panelkey
bysort Panelkey (income Region): keep if _n==_N
isid Panelkey
merge 1:1 Panelkey using "data1990.dta", keep(match master) nogen
list, clean noobs
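For the alternative aggregations mentioned above, here is a minimal sketch with -collapse- (an illustration only, not part of the original answer): it sums income across regions within each Panelkey and keeps the region that contributed the most income, with ties again broken by the higher region number.
clear
input Panelkey Region income
1 9 30
2 1 20
4 2 40
4 10 80
end
gsort Panelkey -income -Region   // highest-income region first within each Panelkey
collapse (sum) income1990=income (first) Region1990=Region, by(Panelkey)
list, clean noobs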
I have data for different companies. The data stops at day 10 for one of the companies (Company 1) and at day 6 for the others. If Company 1 is selected together with other companies, I want the average to run until day 10, using the day 7, 8, 9, and 10 values for Company 1 and the day 6 value for the others.
I'd want to just fill down days 7-10 for the other companies with the day 6 value, but that would look misleading on the graph. So I need a DAX equation with some magic in it.
As an example, I have companies:
Company 1
Company 2
Company 3
etc. as a filter
And a table like:
Company     Date         Day of Month   Count
Company 1   1.11.2022     1               10
Company 1   2.11.2022     2               20
Company 1   3.11.2022     3               21
Company 1   4.11.2022     4               30
Company 1   5.11.2022     5               40
Company 1   6.11.2022     6               50
Company 1   7.11.2022     7               55
Company 1   8.11.2022     8               60
Company 1   9.11.2022     9               62
Company 1   10.11.2022   10               70
Company 1   11.11.2022   11             NULL
Company 2   1.11.2022     1               15
Company 2   2.11.2022     2               25
Company 2   3.11.2022     3               30
Company 2   4.11.2022     4               34
Company 2   5.11.2022     5               45
Company 2   6.11.2022     6              100
Company 2   7.11.2022     7             NULL
Every date has a row, but beyond day 10 for Company 1 and day 6 for Company 2 the count is NULL. If Company 1 or Company 2 is chosen separately, I'd like to show the count as is. If they are chosen together, I'd like the average of the two, so that:
Day 5: AVG(40,45)
Day 6: AVG(50,100)
Day 7: AVG(55,100)
Day 8: AVG(60,100)
Day 9: AVG(62,100)
Day 10: AVG(70,100)
Any ideas?
You want something like this?
Create a cross-join matrix from your:
company_table_dim (M rows)
calendar_Days_Table (N rows)
so you end up with a new table of M x N rows.
Then, in Power Query, sort the data and fill down your QTY column:
= Table.FillDown(#"Se expandió Fact_Table", {"QTY"})
That way the last known QTY is carried forward to the end of the time table for any company filter.
Cons: the new M x N matrix could run to millions of rows to calculate.
I have a panel data set with an id, date, and multiple variables. I'm trying to get the skewness and standard deviation of "var1" listed by id for a certain date range. I know those items are in the summary detail for "var1", but I can't seem to find a way to get them listed by id for my specified date range.
Any help would be greatly appreciated!
Here is an example that may start you on your path.
. webuse pig
(Longitudinal analysis of pig weights)
. xtset id week
panel variable: id (strongly balanced)
time variable: week, 1 to 9
delta: 1 unit
. bysort id: egen sk = skew(weight) if inrange(week,3,8)
(144 missing values generated)
. list if id==1, clean
id week weight sk
1. 1 1 24 .
2. 1 2 32 .
3. 1 3 39 .0709604
4. 1 4 42.5 .0709604
5. 1 5 48 .0709604
6. 1 6 54.5 .0709604
7. 1 7 61 .0709604
8. 1 8 65 .0709604
9. 1 9 72 .
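The question also asks for the standard deviation; a minimal sketch along the same lines (using egen's sd() function, which is not shown in the log above) is:
bysort id: egen sdev = sd(weight) if inrange(week,3,8)
As with skew(), the value is repeated on every in-range observation within each id, so you can then list or collapse by id to get one row per id.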
My data has a problem. The survey is conducted on housing units, so two rows with the same person ID might not actually refer to the same person.
I want to assign a different ID to each actually different person.
Let's say I have this data.
id yearmonth age
1 200001 12
1 200002 12
1 200003 14
1 200004 14
1 200005 14
The 3rd row is definitely a different person: the age increases by 2 from one month to the next.
So I want to change the IDs like this:
id yearmonth age
1 200001 12
1 200002 12
10 200003 14
10 200004 14
10 200005 14
How can I do this? I think I can change the ID of the 3rd row by writing
bysort id (yearmonth): replace id=id*10 if age[_n-1]>age+1 | age[_n-1]+1<age
(I multiply by 10 because all IDs have the same number of digits, so multiplying by 10 cannot create a duplicate.)
But how can I change all the subsequent rows as well?
Building on what you have, something like this might do what you want.
* flag age jumps of more than 1 (skip row 1: a missing age[0] compares as larger than any number)
bysort id (yearmonth): generate idchange = _n > 1 & (age[_n-1] > age+1 | age[_n-1]+1 < age)
bysort id (yearmonth): generate numchange = sum(idchange)   // running count of changes per original id
replace id = 10*id + (numchange-1) if numchange > 0         // new id from each change onward
Note that this will also handle the case where one original id has two or more changes detected (up to 10 changes, anyhow, since numchange-1 becomes the appended digit). For example, this id would be split into three ids, because the age jumps twice:
id yearmonth age
2 200001 12
2 200002 14
2 200003 15
2 200004 18
2 200005 18
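As a quick check, here is a self-contained sketch (same variable names as above, offered only as an illustration) that builds that example and runs the code on it:
clear
input id yearmonth age
2 200001 12
2 200002 14
2 200003 15
2 200004 18
2 200005 18
end
bysort id (yearmonth): generate idchange = _n > 1 & (age[_n-1] > age+1 | age[_n-1]+1 < age)
bysort id (yearmonth): generate numchange = sum(idchange)
replace id = 10*id + (numchange-1) if numchange > 0
list, clean noobs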
I'm using Stata, and I'm trying to compute the average price of firms' rivals in a market. I have data that looks like:
Market Firm Price
----------------------
1      1    100
1      2    150
1      3    125
2      1     50
2      2    100
2      3     75
3      1    100
3      2    200
3      3    200
And I'm trying to compute the average price of each firm's rivals, so I want to generate a new field that is the average values of the other firms in a market. It would look like:
Market Firm Price AvRivalPrice
------------------------------------
1      1    100      137.5
1      2    150      112.5
1      3    125      125
2      1     50       87.5
2      2    100       62.5
2      3     75       75
3      1    100      200
3      2    200      150
3      3    200      150
To do the average by group, I could use the egen command:
egen AvPrice = mean(price), by(Market)
But that wouldn't exclude the firm's own price from the average, and to the best of my knowledge, using the if qualifier would only change the observations it operates on, not the groups it averages over. Is there a simple way to do this, or do I need to write loops and generate each average manually?
This is an old thread still of interest, so materials and techniques overlooked first time round still apply.
The more general technique is to work with totals. At its simplest, total of others = total of all - this value. In an egen framework that is going to look like this:
egen total = total(price), by(market)
egen n = total(!missing(price)), by(market)
gen avprice = (total - cond(missing(price), 0, price)) / cond(missing(price), n, n - 1)
The total() function of egen ignores missing values in its argument. If there are missing values, we don't want to include them in the count, but we can use !missing() which yields 1 if not missing and 0 if missing. egen's count() is another way to do this.
Code given earlier gives the wrong answer if missings are present as they are included in the count _N.
Even if a value is missing, the average of the other values still makes sense.
If no value is missing, the last line above simplifies to
gen avprice = (total - price) / (n - 1)
So far, this possibly looks like no more than a small variant on previous code, but it does extend easily to using weights. Presumably we want a weighted average of others' prices given some weight. We can exploit the fact that total() works on expressions, which can be more complicated than just variable names. Indeed the code above did that already, but it is often overlooked.
egen wttotal = total(weight * price), by(market)
egen sumwt = total(weight), by(market)
gen avprice = (wttotal - price * weight) / (sumwt - weight)
As before, if price or weight is ever missing, you need more complicated code, or just to ensure that you exclude such observations from the calculations.
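For concreteness, here is one hedged sketch of that more complicated code (an illustration added here, not part of the original answer), keeping the same price, weight, and market names and letting an observation count only when both its price and weight are nonmissing:
egen wttotal = total(weight * price), by(market)
egen sumwt = total(cond(missing(price), 0, weight)), by(market)
gen avprice = (wttotal - cond(missing(price, weight), 0, weight*price)) / (sumwt - cond(missing(price, weight), 0, weight))
total() already skips products in which price or weight is missing; the cond() terms simply avoid subtracting an observation's own price and weight when they never entered the totals in the first place.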
See also the Stata FAQ
How do I create variables summarizing for each individual properties of the other members of a group?
http://www.stata.com/support/faqs/data-management/creating-variables-recording-properties/
for a wider-ranging discussion.
(If the numbers get big, work with doubles.)
EDIT 2 March 2018: That was a newer post in an old thread, which in turn needs updating. rangestat (from SSC) can be used here and gives one-line solutions. Not surprisingly, its excludeself option was added explicitly for this kind of problem. But while the solution for means is easy using the identity
mean for others = (total - value for self) / (count - 1)
many other summary measures don't yield to a similarly simple trick, and in that sense rangestat involves much more general code.
clear
input Market Firm Price
1 1 100
1 2 150
1 3 125
2 1 50
2 2 100
2 3 75
3 1 100
3 2 200
3 3 200
end
rangestat (mean) Price, interval(Firm . .) by(Market) excludeself
list, sepby(Market)
+----------------------------------+
| Market Firm Price Price_~n |
|----------------------------------|
1. | 1 1 100 137.5 |
2. | 1 2 150 112.5 |
3. | 1 3 125 125 |
|----------------------------------|
4. | 2 1 50 87.5 |
5. | 2 2 100 62.5 |
6. | 2 3 75 75 |
|----------------------------------|
7. | 3 1 100 200 |
8. | 3 2 200 150 |
9. | 3 3 200 150 |
+----------------------------------+
This is a way that avoids explicit loops, though it takes several lines of code:
bysort Market: egen Total = total(Price)
replace Total = Total - Price
by Market: gen AvRivalPrice = Total / (_N-1)
drop Total
Here's a shorter solution that combines your original thought with onestop's solution:
egen AvPrice = mean(price), by(Market)
bysort Market: replace AvPrice = (AvPrice*_N - price)/(_N-1)
This is all good for a census of firms. If you have a sample of the firms, and you need to apply the weights, I am not sure what a good solution would be. We can brainstorm it if needed.
My data set is in the format mentioned below:
NEWID
Age
H_PERS
Income
OCCU
FAMTYPE
REGION
Metro(Yes/No)
Exp_alcohol
population sample (this is the weighted population each NEWID represents), etc.
I would like to generate a summarized view like the one below:
average expenditure value (this should be the sum of exp_alcohol/population sample)
% of the population sample across Region, Metro, and each demographic variable
Please help me with your ideas.
Since I can't see your data set and your description was not very clear, I'm going to guess that you have data that looks something like this and you would like to add some new variables that summarize your data...
data alcohol;
input NEWID Age H_PERS Income OCCU $ FAMTYPE $ REGION $ Metro $
Exp_alcohol population_sample;
datalines;
1234 32 4 65000 abc m CA Yes 2 4
5678 23 5 35000 xyz s WA Yes 3 6
9923 34 3 49000 def d OR No 3 9
8844 26 4 54000 gdp m CA No 1 5
;
run;
data summar;
set alcohol;
retain TotalAvg_expend metro_count total_pop; /* keep running totals across rows */
Divide = exp_alcohol/population_sample;
TotalAvg_expend + Divide; /* cumulative sum of exp_alcohol/population_sample */
total_pop + population_sample; /* cumulative weighted population */
if metro = 'Yes' then metro_count + population_sample; /* cumulative population in metro areas */
percent_metro = (metro_count/total_pop)*100;
drop NEWID Age H_PERS Income OCCU FAMTYPE REGION Divide;
run;
Output:
Exp_ population_ TotalAvg_ metro_ total_ percent_
Metro alcohol sample expend count pop metro
Yes 2 4 0.50000 4 4 100.000
Yes 3 6 1.00000 10 10 100.000
No 3 9 1.33333 10 19 52.632
No 1 5 1.53333 10 24 41.667