I'm a new user of Stata and I'm trying to understand how it executes commands. I'm facing trouble in restructuring data from its present format to a panel data format.
I'm using firm level micro-data which, for example, contain firm id, last avail year (latest year for which data was collected from that firm) and turnover (REV_LAY-0 = turnover from last avail year - 0, REV_LAY-1 = turnover from last avail year - 1 and so on).
The present data format is the following:
The required panel format looks like this:
In SAS, I would do the following in a loop:
if last_avail_yr=2016 then do;
rev_2016=rev_lay-0;
rev_2015=rev_lay-1;
rev_2014=rev_lay-2;
rev_2013=rev_lay-3;
end;
But I'm not quite sure how to do it in Stata. I tried using an if statement with a forvalues loop to achieve a similar result, but it didn't work out well.
Example data can be found below:
MARK BvD_ID LAST_AVAIL_YR REV_LAY0 REV_LAY1 REV_LAY2 REV_LAY3 REV_LAY4
437 ESA22001721 2016 27689 32097 28992 35868 36493
438 ESF23212103 2015 26786 52095 33023 29493 40368
439 ESB45426806 2012 22072 14864 12877 15330 6403
440 ESA45039294 2015 26700 23387 21104 21272 20002
441 ESB76638790 2016 27480 24303 10699 . .
Can anyone help me with the Stata code for this problem?
rev_lay-0 and so on are not valid names in Stata, so I assume they would be named rev_lay_0 and so on. Given that, the following should do the trick:
reshape long rev_lay_, i(firm_id last_avail_yr) j(id)
by firm_id last_avail_yr: gen yr = last_avail_yr - _n + 1
keep firm_id last_avail_yr rev_lay_ yr
reshape wide rev_lay_, i(firm_id last_avail_yr) j(yr)
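For what it's worth, here is a sketch of the same idea applied to the variable names in the posted example data (BvD_ID, LAST_AVAIL_YR, REV_LAY0 to REV_LAY4). The names lag and yr are my own choices, and the year is computed directly from the lag suffix rather than from _n, which assumes the suffix really is the number of years before LAST_AVAIL_YR.

```stata
clear
input MARK str11 BvD_ID LAST_AVAIL_YR REV_LAY0 REV_LAY1 REV_LAY2 REV_LAY3 REV_LAY4
437 ESA22001721 2016 27689 32097 28992 35868 36493
438 ESF23212103 2015 26786 52095 33023 29493 40368
439 ESB45426806 2012 22072 14864 12877 15330 6403
440 ESA45039294 2015 26700 23387 21104 21272 20002
441 ESB76638790 2016 27480 24303 10699 . .
end

* long layout: one row per firm and lag (lag = 0, 1, ..., 4)
reshape long REV_LAY, i(BvD_ID) j(lag)

* calendar year = last available year minus the lag
gen yr = LAST_AVAIL_YR - lag
drop lag

* back to wide, now with the calendar year as the suffix
reshape wide REV_LAY, i(BvD_ID) j(yr)
list BvD_ID REV_LAY2016 REV_LAY2015 REV_LAY2014
```

The result has one variable per calendar year (REV_LAY2008 to REV_LAY2016 for these data), with missing values wherever a firm has no turnover recorded for that year.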
Although the accepted answer gives the OP what was asked for, the desired data layout is not very useful in Stata. A reshape long alone would produce a simple layout which is much, much better for most data management, all graphics and all statistical modelling undertaken with panel data in Stata:
clear
input MARK str11 BvD_ID LAST_AVAIL_YR REV_LAY0 REV_LAY1 REV_LAY2 REV_LAY3 REV_LAY4
437 ESA22001721 2016 27689 32097 28992 35868 36493
438 ESF23212103 2015 26786 52095 33023 29493 40368
439 ESB45426806 2012 22072 14864 12877 15330 6403
440 ESA45039294 2015 26700 23387 21104 21272 20002
441 ESB76638790 2016 27480 24303 10699 . .
end
reshape long REV_LAY , i(BvD_ID)
gen YEAR = LAST_AVAIL_YR - _j
drop if missing(REV_LAY)
drop _j LAST
list, sepby(BvD_ID)
+-------------------------------------+
| BvD_ID MARK REV_LAY YEAR |
|-------------------------------------|
1. | ESA22001721 437 27689 2016 |
2. | ESA22001721 437 32097 2015 |
3. | ESA22001721 437 28992 2014 |
4. | ESA22001721 437 35868 2013 |
5. | ESA22001721 437 36493 2012 |
|-------------------------------------|
6. | ESA45039294 440 26700 2015 |
7. | ESA45039294 440 23387 2014 |
8. | ESA45039294 440 21104 2013 |
9. | ESA45039294 440 21272 2012 |
10. | ESA45039294 440 20002 2011 |
|-------------------------------------|
11. | ESB45426806 439 22072 2012 |
12. | ESB45426806 439 14864 2011 |
13. | ESB45426806 439 12877 2010 |
14. | ESB45426806 439 15330 2009 |
15. | ESB45426806 439 6403 2008 |
|-------------------------------------|
16. | ESB76638790 441 27480 2016 |
17. | ESB76638790 441 24303 2015 |
18. | ESB76638790 441 10699 2014 |
|-------------------------------------|
19. | ESF23212103 438 26786 2015 |
20. | ESF23212103 438 52095 2014 |
21. | ESF23212103 438 33023 2013 |
22. | ESF23212103 438 29493 2012 |
23. | ESF23212103 438 40368 2011 |
+-------------------------------------+
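As a small illustration of why the long layout pays off, here is a sketch that declares the panel structure and computes year-on-year turnover growth. It uses a subset of the rows listed above; the names firm and growth are mine, and the numeric identifier is needed only because xtset does not accept a string panel variable.

```stata
clear
input str11 BvD_ID REV_LAY YEAR
ESA22001721 27689 2016
ESA22001721 32097 2015
ESA22001721 28992 2014
ESB76638790 27480 2016
ESB76638790 24303 2015
ESB76638790 10699 2014
end

* xtset needs a numeric panel identifier
egen firm = group(BvD_ID)
xtset firm YEAR

* L. is the previous year within the same firm
gen growth = REV_LAY / L.REV_LAY - 1
list, sepby(BvD_ID)
```

The first year of each firm gets a missing growth value, because there is no earlier observation within that panel.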
My dataset looks like the following:
identification number   year   indicator   Data
1112000                 2000   JKL_ADS     511
1112001                 2001   JKL_ADS     517
1112002                 2002   JKL_ADS     721
1112003                 2003   JKL_ADS     925
1112004                 2004   JKL_ADS     1092
1112000                 2000   KLS_DSAK    351
1112001                 2001   KLS_DSAK    631
1112002                 2002   KLS_DSAK    732
1112003                 2003   KLS_DSAK    823
1112004                 2004   KLS_DSAK    1092
I want to reshape wide so it looks like this instead:
identification number   year   JKL_ADS   KLS_DSAK
1112000                 2000   511       351
1112001                 2001   517       631
1112002                 2002   721       732
1112003                 2003   925       823
1112004                 2004   1092      1092
This is a fairly standard application. You didn't give example data in recommended form, so the details here may need modification by you.
Contrary to the question, indicator serves as an argument to j().
* Example generated by -dataex-. For more info, type help dataex
clear
input long identificationnumber int year str8 indicator int data
1112000 2000 "JKL_ADS" 511
1112001 2001 "JKL_ADS" 517
1112002 2002 "JKL_ADS" 721
1112003 2003 "JKL_ADS" 925
1112004 2004 "JKL_ADS" 1092
1112000 2000 "KLS_DSAK" 351
1112001 2001 "KLS_DSAK" 631
1112002 2002 "KLS_DSAK" 732
1112003 2003 "KLS_DSAK" 823
1112004 2004 "KLS_DSAK" 1092
end
. reshape wide data , i(id year) j(indicator) string
(j = JKL_ADS KLS_DSAK)
Data Long -> Wide
-----------------------------------------------------------------------------
Number of observations 10 -> 5
Number of variables 4 -> 4
j variable (2 values) indicator -> (dropped)
xij variables:
data -> dataJKL_ADS dataKLS_DSAK
-----------------------------------------------------------------------------
. rename (data*) (*)
. l
+--------------------------------------+
| identi~r year JKL_ADS KLS_DSAK |
|--------------------------------------|
1. | 1112000 2000 511 351 |
2. | 1112001 2001 517 631 |
3. | 1112002 2002 721 732 |
4. | 1112003 2003 925 823 |
5. | 1112004 2004 1092 1092 |
+--------------------------------------+
I'm trying to scale one variable by another lagged variable.
(IB) scaled by the lagged total assets(AT) = ROA
I've tried these two methods below from here.
xtset companyid fyear, year
gen at1 = l.at
gen roa=ib/at1
and
xtset gvkey year
gen roa=(ib)/(at[_n-1])
The first one came back with all zeros for 1.ta
The second one seems to generate values on the previous entry, even if it's a different company. I think this is true because only the first row has a missing value. I would assume there should be a missing value for the first year of each company.
Additionally I've tried this code below but it said invalid syntax.
xtset gvkey year
foreach gvkey {
gen roa = (ib)/(at[_n-1]) }
I'm using compustat so it's similar to below:
gvkey|Year |Ticker | at | ib |
-------|-----|--------|------|------|
001111| 2006| abc |1000 |50 |
001111| 2007| abc |1100 |60 |
001111| 2008| abc |1200 |70 |
001111| 2009| abc |1300 |80 |
001112| 2008| www |28777 |1300 |
001112| 2009| www |26123 |870 |
001113| 2009| ttt |550 |-1000 |
001114| 2010| vvv |551 |-990 |
This is hard to follow. 1.ta may, or may not, be a typo for L.at.
Is gvkey string? At the Stata tag, there is really detailed advice about how to give Stata data examples, which you are not following.
In principle, your first approach is correct, so it is hard to know what went wrong, except that
The second one seems to generate values on the previous entry, even if
it's a different company.
That's exactly correct. The previous observation is the previous observation, and nothing in that command refers or alludes to the panel structure or xtset or tsset information.
Your foreach statement is just wild guessing and nothing to do with any form supported by foreach. foreach isn't needed here at all: the lag operator implies working within panels automatically.
I did this, which may help.
clear
input str6 gvkey Year str3 Ticker at ib
001111 2006 abc 1000 50
001111 2007 abc 1100 60
001111 2008 abc 1200 70
001111 2009 abc 1300 80
001112 2008 www 28777 1300
001112 2009 www 26123 870
001113 2009 ttt 550 -1000
001114 2010 vvv 551 -990
end
egen id = group(gvkey), label
xtset id Year
gen wanted = at/L.ib
list, sepby(gvkey)
+------------------------------------------------------------+
| gvkey Year Ticker at ib id wanted |
|------------------------------------------------------------|
1. | 001111 2006 abc 1000 50 001111 . |
2. | 001111 2007 abc 1100 60 001111 22 |
3. | 001111 2008 abc 1200 70 001111 20 |
4. | 001111 2009 abc 1300 80 001111 18.57143 |
|------------------------------------------------------------|
5. | 001112 2008 www 28777 1300 001112 . |
6. | 001112 2009 www 26123 870 001112 20.09462 |
|------------------------------------------------------------|
7. | 001113 2009 ttt 550 -1000 001113 . |
|------------------------------------------------------------|
8. | 001114 2010 vvv 551 -990 001114 . |
+------------------------------------------------------------+
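Note that the example above divides at by lagged ib. If ROA is defined as in the question, IB scaled by lagged AT, the same setup would be used with the ratio the other way round. A sketch, with roa as the new variable name and the Ticker column left out for brevity:

```stata
clear
input str6 gvkey Year at ib
001111 2006 1000 50
001111 2007 1100 60
001111 2008 1200 70
001111 2009 1300 80
001112 2008 28777 1300
001112 2009 26123 870
001113 2009 550 -1000
001114 2010 551 -990
end

* xtset needs a numeric panel identifier
egen id = group(gvkey)
xtset id Year

* income (ib) scaled by the previous year's total assets (at), within company
gen roa = ib / L.at
list, sepby(gvkey)
```

The first year of each company is missing, and so is any year whose immediately preceding year is absent from the data, because L. looks back exactly one time unit.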
I am working with a Stata dataset that tracks a company's contract year.
However, systematically I am missing a year:
Is there code I could run quickly to replace the missing year with the year from the previous observation?
The following works for me:
clear
input var year
564 2029
597 2029
653 .
342 2041
456 2041
end
replace year = year[_n-1] if missing(year)
list
+------------+
| var year |
|------------|
1. | 564 2029 |
2. | 597 2029 |
3. | 653 2029 |
4. | 342 2041 |
5. | 456 2041 |
+------------+
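If the data hold several companies, it may be safer to carry the year forward within each company only. A sketch, assuming a hypothetical company identifier (company) and a hypothetical variable giving the order of observations within company (order); neither appears in the question.

```stata
clear
input company order year
1 1 2029
1 2 .
1 3 .
2 1 .
2 2 2041
end

* within each company, copy the previous year into missing years;
* replace works through the observations in order, so runs of missings are filled too
bysort company (order): replace year = year[_n-1] if missing(year)
list, sepby(company)
```

Here the first observation of company 2 stays missing, because under by: there is no previous observation within that company to copy from.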
I have problems in reshaping data from wide to long format:
I have no identifier variable for the wide variables.
My dataset is quite wide. I do have about 7000 variables.
The number of variables per ID is not constant, so for some IDs I have 5 and for others I have 10 variables.
I was hoping that this Stata FAQ could help me, but unfortunately this does not work properly (see following code snippets).
So I do have data that looks like the following example:
clear
input str45 Year
"2010"
"2011"
"2012"
"2014"
end
input str45 A101Meas0010
"1.50"
"1.70"
"1.71"
"1.71"
input str45 A101Meas0020
"50"
"60"
"65"
"64"
input str45 A101Meas0020A
"51"
"62"
"64"
"68"
input str45 FE123Meas0010
"1.60"
"1.75"
"1.92"
"1.94"
input str45 FE123Meas0020
"60"
"72"
"88"
"92"
list
+-------------------------------------------------------------+
| Year A10~0010 A10~0020 A1~0020A FE1~0010 FE1~0020 |
|-------------------------------------------------------------|
1. | 2010 1.50 50 51 1.60 60 |
2. | 2011 1.70 60 62 1.75 72 |
3. | 2012 1.71 65 64 1.92 88 |
4. | 2014 1.71 64 68 1.94 92 |
+-------------------------------------------------------------+
The final table I want to achieve would look something like this:
+--------------------------------------------------+
| Year ID Meas0010 Meas0020 Meas0020A |
|--------------------------------------------------|
1. | 2010 A101 1.50 50 . |
2. | 2010 FE123 1.60 51 60 |
3. | 2011 A101 1.70 60 . |
4. | 2011 FE123 1.75 62 72 |
5. | 2012 A101 1.71 65 . |
6. | 2012 FE123 1.92 64 88 |
7. | 2014 A101 1.71 64 . |
8. | 2014 FE123 1.94 68 92 |
+--------------------------------------------------+
I tried the following code snippet, close to the example from the Stata FAQ, but this throws an error:
unab vars : *Meas*
local stubs : subinstr local vars "Meas0010" "", all
local stubs : subinstr local stubs "Meas0020" "", all
local stubs : subinstr local stubs "Meas0020A" "", all
reshape long "`stubs'", i(Year) j(Measurement) string
(note: j = Meas0010 Meas0020 Meas0020A)
(note: A101AMeas0010 not found)
variable A101Meas0010 not found
r(111);
Any ideas how to reshape this? I never had to reshape such an odd structure before.
Additional Question: In the example above I did have to specify the Measurement-Names Meas0010, Meas0020 and Meas0020A. Is it possible to automate this as well? All measurement names start with the keyword Meas, so the variable names are always of the structure _ID+MeasName, e.g. A101Meas0020A stands for ID A101 and Measurement Meas0020A.
The annoying thing is: I do know how to do this in MATLAB, but I am forced to use Stata here.
Your variable name structure is a little awkward, but there is a syntax to match. It's better covered in the help for reshape, and is only barely mentioned in the FAQ you cite (which I wrote, so I can be emphatic that it's intended as a supplement to the help, not the first line of documentation).
Your example yields to
clear
input str4 (Year A101Meas0010 A101Meas0020 A101Meas0020A FE123Meas0010 FE123Meas0020)
"2010" "1.50" "50" "51" "1.60" "50"
"2011" "1.70" "60" "62" "1.75" "60"
"2012" "1.71" "65" "64" "1.92" "65"
"2014" "1.71" "64" "68" "1.94" "64"
end
reshape long @Meas0010 @Meas0020 @Meas0020A, i(Year) j(ID) string
destring, replace
sort Year ID
list, sepby(Year)
+-----------------------------------------------+
| Year ID Meas0010 Meas0020 Me~0020A |
|-----------------------------------------------|
1. | 2010 A101 1.5 50 51 |
2. | 2010 FE123 1.6 50 . |
|-----------------------------------------------|
3. | 2011 A101 1.7 60 62 |
4. | 2011 FE123 1.75 60 . |
|-----------------------------------------------|
5. | 2012 A101 1.71 65 64 |
6. | 2012 FE123 1.92 65 . |
|-----------------------------------------------|
7. | 2014 A101 1.71 64 68 |
8. | 2014 FE123 1.94 64 . |
+-----------------------------------------------+
It seems bizarre that your example enters everything as string: note the destring in my code.
Without access to your dataset, I'd say that you should be able to find the more general syntax without automation. You know that there are at most about 10 measurements in the fullest case. In any event you are already showing the syntax tricks needed to remove strings you don't need.
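That said, if you do want to build the list of measurement names automatically, something along these lines may work. It is only a sketch: it assumes every variable of interest contains Meas exactly as described in the question and that no ID itself contains the text Meas. The local macro names (vars, meas, piece, spec) are mine.

```stata
clear
input str4 (Year A101Meas0010 A101Meas0020 A101Meas0020A FE123Meas0010 FE123Meas0020)
"2010" "1.50" "50" "51" "1.60" "50"
"2011" "1.70" "60" "62" "1.75" "60"
"2012" "1.71" "65" "64" "1.92" "65"
"2014" "1.71" "64" "68" "1.94" "64"
end

* all variables whose names contain Meas
unab vars : *Meas*

* collect the distinct measurement names, each prefixed by @ for reshape
local spec
foreach v of local vars {
    * keep everything from "Meas" onwards, e.g. Meas0020A
    local meas = substr("`v'", strpos("`v'", "Meas"), .)
    local piece @`meas'
    * union of macro lists: adds piece only if it is not already in spec
    local spec : list spec | piece
}
display "`spec'"

reshape long `spec', i(Year) j(ID) string
destring, replace
sort Year ID
list, sepby(Year)
```

The @ marks where the ID sits in each variable name, so the part of the name before Meas ends up in the string variable ID.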
Say I have a data set of country GDPs formatted like this:
---------------------------------
| Year | Country A | Country B |
| 1990 | 128 | 243 |
| 1991 | 130 | 212 |
| 1992 | 187 | 207 |
How would I use Stata's reshape command to change this into a long table with country-year rows, like the following?
----------------------
| Country| Year | GDP |
| A | 1990 | 128 |
| A | 1991 | 130 |
| A | 1992 | 187 |
| B | 1990 | 243 |
| B | 1991 | 212 |
| B | 1992 | 207 |
It is recommended that you try to solve the problem on your own first. Although you might have tried, you show no sign that you did. For future questions, please post the code you attempted, and why it didn't work for you.
The following gives what you ask for:
clear all
set more off
input ///
Year CountryA CountryB
1990 128 243
1991 130 212
1992 187 207
end
list
reshape long Country, i(Year) j(country) string
rename Country GDP
order country Year GDP
sort country Year
list, sep(0)
Note: you need the string option here because your stub suffixes are strings (i.e. "A" and "B"). See help reshape for the details.