Calculating the investment rate from an unbalanced firm-level panel dataset - Stata

I am doing a project using a firm-level dataset (unbalanced panel data). I have around 200,000 firms, each observed for roughly 10 years, but the start and end of each firm's period differ: some firms start in 1990 and finish in 2000, while others start in 2005 and finish in 2015. I would like to calculate the investment rate from tangible fixed assets (TFA), which is basically (TFA(t) - TFA(t-1))/TFA(t-1), for each firm in Stata. Could you help me with this?
* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID int dec31year double TFA
1 18992 1638309000
1 19358 1430424000
1 19723 2618977000
1 20088 2.799e+09
1 20453 3507431000
1 20819 4219361000
1 21184 4347613000
1 21549 3.9619e+09
1 21914 5100955000
1 22280 5404411000
2 19358 1.5479e+10
2 19723 1.3219e+10
2 20088 1.3387e+10
2 20453 1.4867e+10
2 20819 1.636e+10
2 21184 1.6547e+10
2 21549 1.6146e+10
2 21914 1.4011e+10
2 22280 1.3141e+10
2 22645 1.3311e+10
3 19358 3.201e+09
3 19723 2.945e+09
3 20088 2.955e+09
3 20453 2.630e+09
3 20819 2.375e+09
3 21184 2.233e+09
3 21549 2.166e+09
3 21914 2.177e+09
3 22280 2.015e+09
3 22645 2.122e+09
4 18992 1425000
4 19358 395837000
4 19723 385710000
4 20088 98745000
4 20453 20387000
4 20819 1636000
4 21184 1499000
4 21549 1365000
4 21914 1439000
4 22280 92866000
5 18992 4.5909e+10
5 19358 4.6606e+10
5 19723 4.5531e+10
5 20088 4.5645e+10
5 20453 4.627e+10
5 20819 4.6155e+10
5 21184 4.5847e+10
5 21549 4.5774e+10
5 21914 4.7443e+10
5 22280 4.7853e+10
6 19358 232641000
6 19723 231892000
6 20088 190669000
6 20453 227862000
6 20819 288878000
6 21184 302291000
6 21549 694925000
6 21914 8.190e+08
6 22280 7.730e+08
6 22645 6.480e+08
7 19358 1288758000
7 19723 1217425000
7 20088 1121128000
7 20453 1033546000
7 20819 964263000
7 21184 1020210000
7 21549 1087107000
7 21914 1272572000
7 22280 1310794000
7 22645 1227395000
8 19358 2463088000
8 19723 2630901000
8 20088 2811077000
8 20453 3041447000
8 20819 3257302000
8 21184 4388377000
8 21549 4427479000
8 21914 4741731000
8 22280 4845817000
8 22645 5005846000
9 19083 609320000
9 19448 619372000
9 19813 618904000
9 20178 853070000
9 20544 838932000
9 20909 785931000
9 21274 773765000
9 21639 760809000
9 22005 760693000
9 22370 860146000
10 18992 1617674000
10 19358 1590728000
10 19723 1554051000
10 20088 1445113000
10 20453 1351322000
10 20819 1224924000
10 21184 1081895000
10 21549 133179000
10 21914 114626000
10 22280 110914000
end
format %td dec31year
* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID int dec31year double TFA
44 19389 857299000
44 19754 1230192000
44 20119 1474218000
44 20484 1517779000
44 20850 1542684000
44 21184 1522782000
44 21549 1577352000
44 21914 1642480000
44 22280 1506011000
44 22645 1564853000
end
format %td dec31year

Thanks for the data example.
. gen year = year(dec31year)

. tsset ID year

Panel variable: ID (weakly balanced)
 Time variable: year, 2011 to 2021
         Delta: 1 unit

. gen wanted = D.TFA/L.TFA
(10 missing values generated)

. su wanted

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      wanted |         90    3.778748    29.86207   -.9197528   276.7804
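
For anyone who needs the same calculation outside Stata, here is a minimal pandas sketch; the column names ID, year, and TFA follow the example above, and contiguous firm spells are an assumption:

import pandas as pd

# Within each firm, compute (TFA(t) - TFA(t-1)) / TFA(t-1).
# The first year of every firm comes out missing, matching D.TFA/L.TFA.
df = df.sort_values(['ID', 'year'])
df['wanted'] = df.groupby('ID')['TFA'].pct_change(fill_method=None)
# Caveat: unlike Stata's lag operator under tsset, pct_change uses the
# previous observation even across gaps in years, so flag or drop gapped
# spells first if they occur.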


Subtract Value at Aggregate by Quarter

Values are recorded for two groups by quarter.
In DAX, I need to summarize all the data but also subtract 3 from each quarter in 2021 for Group 1, without allowing the value to go below 0.
This only impacts:
Group 1 only
2021 only
However, I also need to retain the unadjusted data details, so I can't do this in Power Query. My data detail is actually in months, but I'm only listing one date per quarter for brevity.
Data:

Group  Date        Value
1      01/01/2020  10
1      04/01/2020  8
1      07/01/2020  18
1      10/01/2020  2
1      01/01/2021  12
1      04/01/2021  3
1      07/01/2021  7
1      10/01/2021  2
2      01/01/2020  10
2      04/01/2020  8
2      07/01/2020  18
2      10/01/2020  2
2      01/01/2021  12
2      04/01/2021  3
2      07/01/2021  7
2      10/01/2021  2
Result:

Group  Qtr/Year  Value
1      Q1-2020   10
1      Q2-2020   8
1      Q3-2020   18
1      Q4-2020   2
1      2020      38
1      Q1-2021   9
1      Q2-2021   0
1      Q3-2021   4
1      Q4-2021   0
1      2021      13
2      Q1-2020   10
2      Q2-2020   8
2      Q3-2020   18
2      Q4-2020   2
2      2020      38
2      Q1-2021   12
2      Q2-2021   3
2      Q3-2021   7
2      Q4-2021   2
2      2021      24
Your issue can be solved with a matrix visual, together with a new calculated column that adjusts the value before the table is built.
First, add a new column using the following formula (note the Sheet1[Group] = 1 condition, so that Group 2 is left untouched):
Revised value =
var newValue = IF(YEAR(Sheet1[Date]) = 2021 && Sheet1[Group] = 1, Sheet1[Value] - 3, Sheet1[Value])
return
IF(newValue < 0, 0, newValue)
Second, create the matrix visual for the desired outcome.
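
If you want to sanity-check the expected numbers outside Power BI, a small pandas sketch of the same rule follows; the column names Group, Date, Value and a datetime-typed Date column are assumptions based on the table above:

import pandas as pd

# Subtract 3 from Group 1's 2021 rows only, clipping at zero,
# mirroring the DAX calculated column above.
mask = (df['Group'] == 1) & (df['Date'].dt.year == 2021)
df['Revised value'] = df['Value'].where(~mask, (df['Value'] - 3).clip(lower=0))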

Cumulative average by month in Power BI

I have below dataset.
Math Literature Biology date student
4 2 5 2019-08-25 A
4 5 4 2019-08-08 A
5 4 5 2019-08-23 A
5 5 5 2019-08-15 A
5 5 5 2019-07-19 A
5 5 5 2019-07-15 A
5 5 5 2019-07-03 A
5 5 5 2019-06-26 A
1 1 2 2019-06-18 A
2 3 3 2019-06-14 A
5 5 5 2019-05-01 A
2 1 3 2019-04-26 A
I need to develop a solution in Power BI so that the output shows the cumulative average per subject per month.
For example
           | April  May  June  July   August
Math       | 2      3.5  3     3.75   4
Literature | 1      3    3     3.75   3.83
Biology    | 3      4    3.6   4.125  4.33
Can you help?
You can use a matrix visualization for this.
Create a month-year variable and use it in the columns.
Use the average of Math, Literature and Biology in Values.
Under the Format pane --> Values --> Show on rows --> select this.
This should give the view you are looking for. You can edit the value headers to your requirements.
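
For reference, the expected cumulative averages can be reproduced in pandas. This is only a sketch for checking the numbers, assuming a DataFrame df with the columns shown in the question and a datetime-typed date column:

import pandas as pd

# Per-month sums and observation counts per subject, then cumulative-sum
# both so each month holds the average of all observations up to that month.
subjects = ['Math', 'Literature', 'Biology']
monthly = df.groupby(df['date'].dt.to_period('M'))[subjects].agg(['sum', 'count'])
cum_avg = pd.DataFrame({
    s: monthly[(s, 'sum')].cumsum() / monthly[(s, 'count')].cumsum()
    for s in subjects
}).T  # subjects as rows, months as columns, matching the expected layout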

How to reshape a variable to wide in my dataset?

I am trying to reshape a variable to wide but have not found the proper way to do so.
I have a day-wise count dataset for each SSUID, and I would like to reshape day to wide so that each SSUID's counts appear on a single row.
Dataset:
ssuid day count
1226 1 3
1226 2 7
1226 3 5
1226 4 7
1226 5 7
1226 6 6
1227 1 3
1227 2 6
1227 3 7
1227 4 4
1228 1 4
1228 2 4
1228 3 6
1228 4 7
1228 5 5
1229 1 3
1229 2 6
1229 3 6
1229 4 6
1229 5 5
I tried some code but I get the error:
count variable not constant within SSUID variable
My code:
reshape wide day, i(ssuid) j(count)
I would like to get the following result:
ssuid day1 day2 day3 day4 day5 day6
1226 3 7 5 7 7 6
1227 3 6 7 4 . .
1228 4 4 6 7 5 .
1229 3 6 6 6 5 .
The following works for me. The error arises because the i() and j() variables are swapped in your call: day should be the j() suffix variable and count the variable being reshaped.
clear
input ssuid day count
1226 1 3
1226 2 7
1226 3 5
1226 4 7
1226 5 7
1226 6 6
1227 1 3
1227 2 6
1227 3 7
1227 4 4
1228 1 4
1228 2 4
1228 3 6
1228 4 7
1228 5 5
1229 1 3
1229 2 6
1229 3 6
1229 4 6
1229 5 5
end
reshape wide count, i(ssuid) j(day)
rename count# day#
list
+-------------------------------------------------+
| ssuid day1 day2 day3 day4 day5 day6 |
|-------------------------------------------------|
1. | 1226 3 7 5 7 7 6 |
2. | 1227 3 6 7 4 . . |
3. | 1228 4 4 6 7 5 . |
4. | 1229 3 6 6 6 5 . |
+-------------------------------------------------+
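
For comparison, the equivalent wide reshape in pandas is short; a sketch, assuming a DataFrame df with columns ssuid, day, and count:

import pandas as pd

# Pivot day values into columns; days missing for an ssuid become NaN,
# the analogue of Stata's '.' in the listing above.
wide = (df.pivot(index='ssuid', columns='day', values='count')
          .add_prefix('day')
          .reset_index())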

Add an 'id' column within groups in a pandas dataframe

I have a dataframe in which DOCUMENT_ID is an id that can span multiple words from the WORD column. I need to add an id for each word within its document.
My data:
DOCUMENT_ID WORD COUNT
0 262056708396949504 4
1 262056708396949504 DVD 1
2 262056708396949504 Girls 1
3 262056708396949504 Gone 1
4 262056708396949504 Gras 1
5 262056708396949504 Hurricane 1
6 262056708396949504 Katrina 1
7 262056708396949504 Mardi 1
8 262056708396949504 Wild 1
10 262056708396949504 donated 1
11 262056708396949504 generated 1
13 262056708396949504 revenues 1
15 262056708396949504 themed 1
17 262056708396949504 torwhore 1
18 262056708396949504 victims 1
20 262167541718319104 18
21 262167541718319104 CCUFoodMan 1
22 262167541718319104 CCUinvolved 1
23 262167541718319104 Congrats 1
24 262167541718319104 Having 1
25 262167541718319104 K 1
29 262167541718319104 blast 1
30 262167541718319104 blasty 1
31 262167541718319104 carebrighton 1
32 262167541718319104 hurricane 1
34 262167541718319104 started 1
37 262197573421502464 21
My expected outcome:
DOCUMENT_ID WORD COUNT WORD_ID
0 262056708396949504 4 1
1 262056708396949504 DVD 1 2
2 262056708396949504 Girls 1 3
3 262056708396949504 Gone 1
4 262056708396949504 Gras 1
.........
20 262167541718319104 18 1
21 262167541718319104 CCUFoodMan 1 2
22 262167541718319104 CCUinvolved 1 3
I have also numbered the rows where WORD is empty, but those can be ignored.
Answer
df['WORD_ID'] = df.groupby(['DOCUMENT_ID']).cumcount()+1
Explanation
Let's build a DataFrame.
import pandas as pd
df = pd.DataFrame({'DOCUMENT_ID' : [262056708396949504, 262056708396949504, 262056708396949504, 262056708396949504, 262167541718319104, 262167541718319104, 262167541718319104], 'WORD' : ['DVD', 'Girls', 'Gras', 'Gone', 'DVD', 'Girls', "Gone"]})
df
DOCUMENT_ID WORD
0 262056708396949504 DVD
1 262056708396949504 Girls
2 262056708396949504 Gras
3 262056708396949504 Gone
4 262167541718319104 DVD
5 262167541718319104 Girls
6 262167541718319104 Gone
Given that your words are nested within each unique DOCUMENT_ID, we need a groupby operation.
df['WORD_ID'] = df.groupby(['DOCUMENT_ID']).cumcount()+1
Output:
DOCUMENT_ID WORD WORD_ID
0 262056708396949504 DVD 1
1 262056708396949504 Girls 2
2 262056708396949504 Gras 3
3 262056708396949504 Gone 4
4 262167541718319104 DVD 1
5 262167541718319104 Girls 2
6 262167541718319104 Gone 3
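
One caveat: cumcount numbers rows in their current order. If WORD_ID should follow the alphabetical order of WORD regardless of how the frame happens to be sorted, sort first, as in this sketch:

# Sort words within each document before numbering them.
df = df.sort_values(['DOCUMENT_ID', 'WORD'])
df['WORD_ID'] = df.groupby('DOCUMENT_ID').cumcount() + 1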

How to resample dates with Pandas item by item?

My objective is to add rows in pandas in order to replace missing data with previous data, and to resample dates at the same time. Example:
This is what I have :
date wins losses
2015-12-19 11 5
2015-12-20 17 8
2015-12-20 10 6
2015-12-21 15 1
2015-12-25 11 5
2015-12-26 6 10
2015-12-27 10 6
2015-12-28 4 12
2015-12-29 8 11
And this is what I want :
wins losses
date
2015-12-19 11.0 5.0
2015-12-20 10.0 6.0
2015-12-21 15.0 1.0
2015-12-22 15.0 1.0
2015-12-23 15.0 1.0
2015-12-24 15.0 1.0
2015-12-25 11.0 5.0
2015-12-26 6.0 10.0
2015-12-27 10.0 6.0
2015-12-28 4.0 12.0
2015-12-29 8.0 11.0
And this is my code:
resamp = df.set_index('date').resample('D').last().ffill()
It works!
But I want to do the same thing with 22 million rows, with different dates and different IDs.
The dataframe below contains two productIds (1 and 2). I want to do the same exercise as before while keeping the time series data of every productId.
createdAt productId popularity
2015-12-01 1 5
2015-12-02 1 8
2015-12-04 1 6
2015-12-07 1 9
2015-12-01 2 5
2015-12-03 2 10
2015-12-04 2 6
2015-12-07 2 12
2015-12-09 2 11
This is my code:
df['date'] = pd.to_datetime(df['createdAt'])
df.set_index('date').resample('D').last().ffill()
This is what I get if I use the same code! The resample keeps only the last row per date across both products, so productId 1 disappears; I don't want the products collapsed onto one daily series like this.
createdAt productId popularity
date
2015-12-01 2015-12-01 2 5
2015-12-02 2015-12-02 2 5
2015-12-03 2015-12-03 2 10
2015-12-04 2015-12-04 2 6
2015-12-05 2015-12-05 2 6
2015-12-06 2015-12-06 2 6
2015-12-07 2015-12-07 2 12
2015-12-08 2015-12-08 2 12
2015-12-09 2015-12-09 2 11
This is what I want!
createdAt productId popularity
2015-12-01 1 5
2015-12-02 1 8
2015-12-03 1 8
2015-12-04 1 6
2015-12-05 1 6
2015-12-06 1 6
2015-12-07 1 9
2015-12-01 2 5
2015-12-02 2 5
2015-12-03 2 10
2015-12-04 2 6
2015-12-05 2 6
2015-12-06 2 6
2015-12-07 2 12
2015-12-08 2 12
2015-12-09 2 11
What should I do?
Thank you
Try this; it should work :)
print(df.set_index('date')
        .groupby('productId', group_keys=False)
        .apply(lambda d: d.resample('D').ffill())
        .reset_index())
This produces what you said you wanted:
createdAt productId popularity
0 2015-12-01 1 5
1 2015-12-02 1 8
2 2015-12-03 1 8
3 2015-12-04 1 6
4 2015-12-05 1 6
5 2015-12-06 1 6
6 2015-12-07 1 9
7 2015-12-01 2 5
8 2015-12-02 2 5
9 2015-12-03 2 10
10 2015-12-04 2 6
11 2015-12-05 2 6
12 2015-12-06 2 6
13 2015-12-07 2 12
14 2015-12-08 2 12
15 2015-12-09 2 11
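
With 22 million rows, the Python-level lambda inside apply can become the bottleneck. Here is a sketch of an alternative that keeps the work inside pandas' grouped resampling, assuming no duplicate dates within a productId:

import pandas as pd

# Grouped resample: upsample each product's series to daily frequency and
# forward-fill within the group, with no per-group Python lambda.
out = (df.set_index('date')
         .groupby('productId')['popularity']
         .resample('D')
         .ffill()
         .reset_index())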