Reshape from wide to long without Identifier - stata

I have problems in reshaping data from wide to long format:
I have no identifier variable for the wide variables.
My dataset is quite wide. I do have about 7000 variables.
The number of variables per ID is not constant, so for some IDs I have 5 and for others I have 10 variables.
I was hoping that this Stata FAQ could help me, but unfortunately this does not work properly (see following code snippets).
So I do have data that looks like the following example:
clear
input str45 Year
"2010"
"2011"
"2012"
"2014"
end
input str45 A101Meas0010
"1.50"
"1.70"
"1.71"
"1.71"
input str45 A101Meas0020
"50"
"60"
"65"
"64"
input str45 A101Meas0020A
"51"
"62"
"64"
"68"
input str45 FE123Meas0010
"1.60"
"1.75"
"1.92"
"1.94"
input str45 FE123Meas0020
"60"
"72"
"88"
"92"
list
+-------------------------------------------------------------+
| Year A10~0010 A10~0020 A1~0020A FE1~0010 FE1~0020 |
|-------------------------------------------------------------|
1. | 2010 1.50 50 51 1.60 60 |
2. | 2011 1.70 60 62 1.75 72 |
3. | 2012 1.71 65 64 1.92 88 |
4. | 2014 1.71 64 68 1.94 92 |
+-------------------------------------------------------------+
The final table I want to achieve would look something like this:
+--------------------------------------------------+
| Year ID Meas0010 Meas0020 Meas0020A |
|--------------------------------------------------|
1. | 2010 A101 1.50 50 . |
2. | 2010 FE123 1.60 51 60 |
3. | 2011 A101 1.70 60 . |
4. | 2011 FE123 1.75 62 72 |
5. | 2012 A101 1.71 65 . |
6. | 2012 FE123 1.92 64 88 |
7. | 2014 A101 1.71 64 . |
8. | 2014 FE123 1.94 68 92 |
+--------------------------------------------------+
I tried following code snippet close to the example from the Stata FAQ, but this throws an error:
unab vars : *Meas*
local stubs : subinstr local vars "Meas0010" "", all
local stubs : subinstr local stubs "Meas0020" "", all
local stubs : subinstr local stubs "Meas0020A" "", all
reshape long "`stubs'", i(Year) j(Measurement) string
(note: j = Meas0010 Meas0020 Meas0020A)
(note: A101AMeas0010 not found)
variable A101Meas0010 not found
r(111);
Any ideas how to reshape this? I never had to reshape such an odd structure before.
Additional Question: In the example above I did have to specify the Measurement-Names Meas0010, Meas0020 and Meas0020A. Is it possible to automate this as well? All measurement names start with the keyword Meas, so the variable names are always of the structure _ID+MeasName, e.g. A101Meas0020A stands for ID A101 and Measurement Meas0020A.
The annoying thing is: I do know how to do this in MATLAB, but I am forced to use Stata here.

Your variable name structure is a little awkward, but there is a syntax to match. It's better covered in the help for reshape, and is only barely mentioned in the FAQ you cite (which I wrote, so I can be emphatic that it's intended as a supplement to the help, not the first line of documentation).
Your example yields to
clear
input str4 (Year A101Meas0010 A101Meas0020 A101Meas0020A FE123Meas0010 FE123Meas0020)
"2010" "1.50" "50" "51" "1.60" "50"
"2011" "1.70" "60" "62" "1.75" "60"
"2012" "1.71" "65" "64" "1.92" "65"
"2014" "1.71" "64" "68" "1.94" "64"
end
reshape long #Meas0010 #Meas0020 #Meas0020A, i(Year) j(ID) string
destring, replace
sort Year ID
list, sepby(Year)
+-----------------------------------------------+
| Year ID Meas0010 Meas0020 Me~0020A |
|-----------------------------------------------|
1. | 2010 A101 1.5 50 51 |
2. | 2010 FE123 1.6 50 . |
|-----------------------------------------------|
3. | 2011 A101 1.7 60 62 |
4. | 2011 FE123 1.75 60 . |
|-----------------------------------------------|
5. | 2012 A101 1.71 65 64 |
6. | 2012 FE123 1.92 65 . |
|-----------------------------------------------|
7. | 2014 A101 1.71 64 68 |
8. | 2014 FE123 1.94 64 . |
+-----------------------------------------------+
It seems bizarre that your example enters everything as string: note the destring in my code.
Without access to your dataset, I'd say that you should be able to find the more general syntax without automation. You know that there are at most about 10 measurements in the fullest case. In any event you are already showing the syntax tricks needed to remove strings you don't need.

Related

Stata: Scale x variable by lagged y variable

I'm trying to scale one variable by another lagged variable.
(IB) scaled by the lagged total assets(AT) = ROA
I've tried this two methods below from here.
xtset companyid fyear, year
gen at1 = l.at
gen roa=ib/at1
and
xtset gvkey year
gen roa=(ib)/(at[_n-1])
The first one came back with all zeros for 1.ta
The second one seems to generate values on the previous entry, even if it's a different company. I think this is true because only the first row has a missing value. I would assume there should be a missing value for the first year of each company.
Additionally I've tried this code below but it said invalid syntax.
xtset gvkey year
foreach gvkey {
gen roa = (ib)/(at[_n-1]) }
I'm using compustat so it's similar to below:
gvkey|Year |Ticker | at | ib |
-------|-----|--------|------|------|
001111| 2006| abc |1000 |50 |
001111| 2007| abc |1100 |60 |
001111| 2008| abc |1200 |70 |
001111| 2009| abc |1300 |80 |
001112| 2008| www |28777 |1300 |
001112| 2009| www |26123 |870 |
001113| 2009| ttt |550 |-1000 |
001114| 2010| vvv |551 |-990 |
This is hard to follow. 1.ta may, or may not, be a typo for L.at.
Is gvkey string? At the Stata tag, there is really detailed advice about how to give Stata data examples, which you are not following.
In principle, your first approach is correct, so it is hard to know what went wrong, except that
The second one seems to generate values on the previous entry, even if
it's a different company.
That's exactly correct. The previous observation is the previous observation, and nothing in that command refers or alludes to the panel structure or xtset or tsset information.
Your foreach statement is just wild guessing and nothing to do with any form supported by foreach. foreach isn't needed here at all: the lag operator implies working within panels automatically.
I did this, which may help.
clear
input str6 gvkey Year str3 Ticker at ib
001111 2006 abc 1000 50
001111 2007 abc 1100 60
001111 2008 abc 1200 70
001111 2009 abc 1300 80
001112 2008 www 28777 1300
001112 2009 www 26123 870
001113 2009 ttt 550 -1000
001114 2010 vvv 551 -990
end
egen id = group(gvkey), label
xtset id Year
gen wanted = at/L.ib
list, sepby(gvkey)
+------------------------------------------------------------+
| gvkey Year Ticker at ib id wanted |
|------------------------------------------------------------|
1. | 001111 2006 abc 1000 50 001111 . |
2. | 001111 2007 abc 1100 60 001111 22 |
3. | 001111 2008 abc 1200 70 001111 20 |
4. | 001111 2009 abc 1300 80 001111 18.57143 |
|------------------------------------------------------------|
5. | 001112 2008 www 28777 1300 001112 . |
6. | 001112 2009 www 26123 870 001112 20.09462 |
|------------------------------------------------------------|
7. | 001113 2009 ttt 550 -1000 001113 . |
|------------------------------------------------------------|
8. | 001114 2010 vvv 551 -990 001114 . |
+------------------------------------------------------------+

Restructuring variables in Stata

I'm a new user of Stata and I'm trying to understand how it executes commands. I'm facing trouble in restructuring data from its present format to a panel data format.
I'm using firm level micro-data which, for example, contain firm id, last avail year (latest year for which data was collected from that firm) and turnover (REV_LAY-0 = turnover from last avail year - 0, REV_LAY-1 = turnover from last avail year - 1 and so on).
The present data format is the following:
The required panel format looks like this:
In SAS, I would do the following in a loop:
if last_avail_yr=2016 then do;
rev_2016=rev_lay-0;
rev_2015=rev_lay-1;
rev_2014=rev_lay-2;
rev_2013=rev_lay-3;
end;
But I'm not quite sure how to do it Stata. I tried using an if statement with a forvalues loop to achieve a similar result, but it didn't work out well.
Example data can be found below:
MARK BvD_ID LAST_AVAIL_YR REV_LAY0 REV_LAY1 REV_LAY2 REV_LAY3 REV_LAY4
437 ESA22001721 2016 27689 32097 28992 35868 36493
438 ESF23212103 2015 26786 52095 33023 29493 40368
439 ESB45426806 2012 22072 14864 12877 15330 6403
440 ESA45039294 2015 26700 23387 21104 21272 20002
441 ESB76638790 2016 27480 24303 10699 . .
Can anyone help me with the Stata code for this problem?
rev_lay-0 and so on are not valid names in Stata, so I assume they would be named rev_lay_0 and so on. Given that, the following should do the trick:a
reshape long rev_lay_, i(firm_id last_avail_yr) j(id)
by firm_id last_avail_yr: gen yr = last_avail_yr - _n + 1
keep firm_id last_avail_yr rev_lay_ yr
reshape wide rev_lay_, i(firm_id last_avail_yr) j(yr)
Although the accepted answer gives the OP what was asked for, the desired data layout is not very useful in Stata. A reshape long alone would produce a simple layout which is much, much better for most data management, all graphics and all statistical modelling undertaken with panel data in Stata:
clear
input MARK str11 BvD_ID LAST_AVAIL_YR REV_LAY0 REV_LAY1 REV_LAY2 REV_LAY3 REV_LAY4
437 ESA22001721 2016 27689 32097 28992 35868 36493
438 ESF23212103 2015 26786 52095 33023 29493 40368
439 ESB45426806 2012 22072 14864 12877 15330 6403
440 ESA45039294 2015 26700 23387 21104 21272 20002
441 ESB76638790 2016 27480 24303 10699 . .
end
reshape long REV_LAY , i(BvD_ID)
gen YEAR = LAST_AVAIL_YR - _j
drop if missing(REV_LAY)
drop _j LAST
list, sepby(BvD_ID)
+-------------------------------------+
| BvD_ID MARK REV_LAY YEAR |
|-------------------------------------|
1. | ESA22001721 437 27689 2016 |
2. | ESA22001721 437 32097 2015 |
3. | ESA22001721 437 28992 2014 |
4. | ESA22001721 437 35868 2013 |
5. | ESA22001721 437 36493 2012 |
|-------------------------------------|
6. | ESA45039294 440 26700 2015 |
7. | ESA45039294 440 23387 2014 |
8. | ESA45039294 440 21104 2013 |
9. | ESA45039294 440 21272 2012 |
10. | ESA45039294 440 20002 2011 |
|-------------------------------------|
11. | ESB45426806 439 22072 2012 |
12. | ESB45426806 439 14864 2011 |
13. | ESB45426806 439 12877 2010 |
14. | ESB45426806 439 15330 2009 |
15. | ESB45426806 439 6403 2008 |
|-------------------------------------|
16. | ESB76638790 441 27480 2016 |
17. | ESB76638790 441 24303 2015 |
18. | ESB76638790 441 10699 2014 |
|-------------------------------------|
19. | ESF23212103 438 26786 2015 |
20. | ESF23212103 438 52095 2014 |
21. | ESF23212103 438 33023 2013 |
22. | ESF23212103 438 29493 2012 |
23. | ESF23212103 438 40368 2011 |
+-------------------------------------+

Quintiles with different quantity of observations

I am using Stata and investigating the variable household net wealth NetWealth).
I want to construct the quintiles of this variable and use the following command--as you can see I use survey data and thus apply survey weights:
xtile Quintile = NetWealth [pw=surveyweight], nq(5)
Then I give the following command to check what I have obtained:
tab Quintile, sum(NetWealth)
This is the result:
Means, Standard Deviations and Frequencies of DN3001 Net wealth
5 |
quantiles |
of dn3001 |
-----------+-----------+
1 |1519.4221
|43114.959
| 154
-----------+-----------+
2 | 135506.67
| 74360.816
| 179
-----------+-----------+
3 | 396712.16
| 69715.49
| 161
-----------+-----------+
4 | 669065.69
| 111102.02
| 182
-----------+-----------+
5 | 2552620.5
| 3872350.9
| 274
-----------+-----------+
Total | 957419.29
| 2323329.8
| 950
Why do I get a different number of households in each quintile? In particular in the last quintile?
The only explanation that I can come up with is that when Stata constructs quintiles with xtile, it excludes from the computation those observations that present a replicate value of NetWealth. I have had this impression also while consulting the Stata material.
What do you think?
Your problem is not fully reproducible in so far as you don't give a self-contained example, but in general there is no puzzle here.
Often people seeking such binnings have a small problem in that their number of observations is not a multiple (meaning, exact multiple) of the number of quantile-based bins they want, but in your case that does not bite as calculation
. di 154 + 179 + 161 + 182 + 274
950
shows that you have 950 observations, which is 5 x 190.
The bigger deal -- here and almost always -- arises from Stata's rule that identical values in different observations must be assigned to the same bin. So, ties are likely to be the problem here.
You have perhaps three possible solutions. Only one involves direct coding.
Live with it.
Do something else. For example, why you are doing this any way? Why not use the original data?
Try a different boundary condition. To do that, just negate the variable and bin that version. Then values on the boundary will jump differently.
Adding random noise to separate ties is utterly indefensible in my view. It's not reproducible (except trivially using the same program and the same settings) and it will have different implications in terms of the same observations' values on other variables.
Here's an example where #3 doesn't help, but it sometimes does:
. sysuse auto, clear
(1978 Automobile Data)
. xtile bin5 = mpg, nq(5)
. gen negmpg = -mpg
. xtile bin5_2 = negmpg, nq(5)
. tab bin5
5 quantiles |
of mpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 18 24.32 24.32
2 | 17 22.97 47.30
3 | 13 17.57 64.86
4 | 12 16.22 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
. tab bin5_2
5 quantiles |
of negmpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 19 25.68 25.68
2 | 12 16.22 41.89
3 | 16 21.62 63.51
4 | 13 17.57 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
See also some discussion within Section 4 of this paper
I see no hint whatsoever in the documentation that xtile would omit observations in the way that you imply. You give no precise quotation supporting that. It would be perverse to exclude any non-missing values unless so instructed.
I don't comment directly here on use of pweights except that using pweights might be a complicating factor here.

How can I export a two-way table?

I have created a two-way summary table in Stata, but I am struggling to output my results.
Using the auto.dta sample dataset as an example, I am trying to build a table that displays the means and standard deviations of mpg, by two other variables (expensive and foreign).
My code currently looks as follows:
sysuse auto.dta, replace
gen expensive = (price > 5000)
The table that I would like to display can be created by either of the two commands below:
tabulate expensive foreign, sum(mpg)
Means, Standard Deviations and Frequencies of Mileage (mpg)
| Car type
expensive | Domestic Foreign | Total
-----------+----------------------+----------
0 | 22.137931 28.875 | 23.594595
| 4.3648281 4.8825491 | 5.2305696
| 29 8 | 37
-----------+----------------------+----------
1 | 16.913043 22.428571 | 19
| 3.4629604 6.4416229 | 5.4467115
| 23 14 | 37
-----------+----------------------+----------
Total | 19.826923 24.772727 | 21.297297
| 4.7432972 6.6111869 | 5.7855032
| 52 22 | 74
table expensive foreign, c(mean mpg sd mpg) row col
----------------------------------------
| Car type
expensive | Domestic Foreign Total
----------+-----------------------------
0 | 22.1379 28.875 23.5946
| 4.364828 4.882549 5.23057
| 29 8 37
|
1 | 16.913 22.4286 19
| 3.46296 6.441623 5.446712
| 23 14 37
|
Total | 19.8269 24.7727 21.2973
| 4.743297 6.611187 5.785503
| 52 22 74
----------------------------------------
I can also closely approximate the same results using collapse, but this does not calculate row and column totals.
My issue is that neither the tabulate (with the sum option) command nor the table command seem friendly to output. I have tried converting to matrices, but tabulate with the sum option does not allow the matcell option and table seems similarly uncooperative.
I'm familiar with tabstat, esttab etc., but was not able to create the two-way table that I need with any of those packages. Any help would be really appreciated.
The community-contributed command asdoc does exactly that:
. asdoc table expensive foreign, c(mean mpg sd mpg count mpg) row col
----------------------------------------
| Car type
expensive | Domestic Foreign Total
----------+-----------------------------
0 | 22.1379 28.875 23.5946
| 4.364828 4.882549 5.23057
| 29 8 37
|
1 | 16.913 22.4286 19
| 3.46296 6.441623 5.446712
| 23 14 37
|
Total | 19.8269 24.7727 21.2973
| 4.743297 6.611187 5.785503
| 52 22 74
----------------------------------------
Click to Open File: Myfile.doc
Alternatively, one could use the community-contributed command tabout:
. tabout expensive foreign using table1.txt, c(mean mpg) sum replace
Table output written to: table1.txt
Car type
Domestic Foreign Total
Mean mpg Mean mpg Mean mpg
expensive
0 22.1 28.9 23.6
1 16.9 22.4 19.0
Total 19.8 24.8 21.3
. tabout expensive foreign using table2.txt, c(sd mpg) sum replace
Table output written to: table2.txt
Car type
Domestic Foreign Total
Sd mpg Sd mpg Sd mpg
expensive
0 4.4 4.9 5.2
1 3.5 6.4 5.4
Total 4.7 6.6 5.8
. tabout expensive foreign using table3.txt, c(count mpg) sum replace
Table output written to: table3.txt
Car type
Domestic Foreign Total
Count mpg Count mpg Count mpg
expensive
0 29.0 8.0 37.0
1 23.0 14.0 37.0
Total 52.0 22.0 74.0
an easy solution is to use collapse to get a dataset that reproduces your desired table, and then export the dataset as a csv
example
collapse (sum) mpg, by(expensive foreign)
and then
export delimited using mydata.csv

Automatic labelling of newly created indicator variables in Stata

I have a variable called region, which has 22 elements. Here is the output of tabulate region:
region of place of work | Freq. Percent Cum.
---------------------------+-----------------------------------
tyne & wear | 6 1.20 1.20
rest of northern region | 12 2.40 3.60
south yorkshire | 9 1.80 5.40
west yorkshire | 23 4.60 10.00
rest of yorks & humberside | 9 1.80 11.80
east midlands | 42 8.40 20.20
east anglia | 12 2.40 22.60
central london | 41 8.20 30.80
inner london (not central) | 23 4.60 35.40
outer london | 19 3.80 39.20
rest of south east | 97 19.40 58.60
south west | 46 9.20 67.80
west midlands metropolitan | 29 5.80 73.60
rest of west midlands | 14 2.80 76.40
greater manchester | 15 3.00 79.40
merseyside | 2 0.40 79.80
rest of north west | 31 6.20 86.00
wales | 12 2.40 88.40
strathclyde | 23 4.60 93.00
rest of scotland | 27 5.40 98.40
northern ireland | 8 1.60 100.00
---------------------------+-----------------------------------
Total | 500 100.00
I have created indicator variables from it using tab region, gen(region_). This creates 22 new variables, from region_1 to region_22. I do want the indicator variables to have simple names like region_1, etc (e.g. they are more easy to call using region_*). The problem is the label of the variable, which is something like region==south west. I want it to be south west.
I have looked at dummieslab (SSC) but it focuses on adding labels to the new variable names. None of these solutions work either. Do you know an automatic way to obtain this? Even a simple function like eliminating specific words from labels (getting rid of the region== bit) would work. I can't find anything like it.
Loop over the variables and zap unwanted text from the label each time. This basic functionality is documented at help macro and in the corresponding manual entry.
foreach v of var region_* {
local lbl : var label `v'
local lbl : subinstr local lbl "region==" "", all
local lbl = trim("`lbl'")
label var `v' "`lbl'"
}
For a canned solution, see labvarch from labutil on SSC, which you can install with
ssc inst labutil