Automatic labelling of newly created indicator variables in Stata

I have a variable called region, which has 22 elements. Here is the output of tabulate region:
   region of place of work |      Freq.     Percent        Cum.
---------------------------+-----------------------------------
               tyne & wear |          6        1.20        1.20
   rest of northern region |         12        2.40        3.60
           south yorkshire |          9        1.80        5.40
            west yorkshire |         23        4.60       10.00
rest of yorks & humberside |          9        1.80       11.80
             east midlands |         42        8.40       20.20
               east anglia |         12        2.40       22.60
            central london |         41        8.20       30.80
inner london (not central) |         23        4.60       35.40
              outer london |         19        3.80       39.20
        rest of south east |         97       19.40       58.60
                south west |         46        9.20       67.80
west midlands metropolitan |         29        5.80       73.60
     rest of west midlands |         14        2.80       76.40
        greater manchester |         15        3.00       79.40
                merseyside |          2        0.40       79.80
        rest of north west |         31        6.20       86.00
                     wales |         12        2.40       88.40
               strathclyde |         23        4.60       93.00
          rest of scotland |         27        5.40       98.40
          northern ireland |          8        1.60      100.00
---------------------------+-----------------------------------
                     Total |        500      100.00
I have created indicator variables from it using tab region, gen(region_). This creates 22 new variables, from region_1 to region_22. I do want the indicator variables to keep simple names like region_1, etc. (e.g. they are easier to refer to using region_*). The problem is the variable label, which is something like region==south west. I want it to be south west.
I have looked at dummieslab (SSC), but it focuses on putting the labels into the new variable names. None of the other solutions I found works either. Do you know an automatic way to do this? Even a simple function that removes specific words from labels (getting rid of the region== bit) would work. I can't find anything like it.

Loop over the variables and zap unwanted text from the label each time. This basic functionality is documented at help macro and in the corresponding manual entry.
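foreach v of var region_* {
    * read the current label, e.g. region==south west
    local lbl : var label `v'
    * delete the region== prefix
    local lbl : subinstr local lbl "region==" "", all
    * trim any leading or trailing spaces
    local lbl = trim("`lbl'")
    label var `v' "`lbl'"
}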
For a canned solution, see labvarch from labutil on SSC, which you can install with
ssc inst labutil
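The loop above hard-codes the text region==. As a minimal sketch of the same idea, assuming that labels created by tab, gen() always have the form varname==value, the version below cuts at the first == instead, so it does not depend on the variable name. The auto dataset and the stub origin_ are stand-ins chosen only for illustration.

```stata
* Illustrative data: auto.dta ships with Stata; origin_ is a made-up stub
sysuse auto, clear
tabulate foreign, generate(origin_)

foreach v of varlist origin_* {
    local lbl : variable label `v'
    * position of the first "==" (0 if the label has none)
    local pos = strpos(`"`lbl'"', "==")
    if `pos' > 0 {
        * keep only the text after "==", trimmed of stray spaces
        local lbl = trim(substr(`"`lbl'"', `pos' + 2, .))
        label variable `v' `"`lbl'"'
    }
}

describe origin_*
```

Labels without == are left untouched, so the loop can be run safely over a wider varlist.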

Related

Stata: Scale x variable by lagged y variable

I'm trying to scale one variable by the lag of another variable:
ROA = IB scaled by lagged total assets (AT)
I've tried the two methods below from here.
xtset companyid fyear, year
gen at1 = l.at
gen roa=ib/at1
and
xtset gvkey year
gen roa=(ib)/(at[_n-1])
The first one came back with all zeros for 1.ta
The second one seems to generate values on the previous entry, even if it's a different company. I think this is true because only the first row has a missing value. I would assume there should be a missing value for the first year of each company.
Additionally I've tried this code below but it said invalid syntax.
xtset gvkey year
foreach gvkey {
gen roa = (ib)/(at[_n-1]) }
I'm using Compustat, so the data look similar to this:
gvkey|Year |Ticker | at | ib |
-------|-----|--------|------|------|
001111| 2006| abc |1000 |50 |
001111| 2007| abc |1100 |60 |
001111| 2008| abc |1200 |70 |
001111| 2009| abc |1300 |80 |
001112| 2008| www |28777 |1300 |
001112| 2009| www |26123 |870 |
001113| 2009| ttt |550 |-1000 |
001114| 2010| vvv |551 |-990 |
This is hard to follow. 1.ta may, or may not, be a typo for L.at.
Is gvkey string? At the Stata tag, there is really detailed advice about how to give Stata data examples, which you are not following.
In principle, your first approach is correct, so it is hard to know what went wrong, except that
The second one seems to generate values on the previous entry, even if
it's a different company.
That's exactly correct. The previous observation is the previous observation, and nothing in that command refers or alludes to the panel structure or xtset or tsset information.
Your foreach statement is just wild guessing and nothing to do with any form supported by foreach. foreach isn't needed here at all: the lag operator implies working within panels automatically.
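I did this, which may help. Note that wanted below is at/L.ib, which shows the lag operator working within panels; for ROA as you define it, the expression would be ib/L.at.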
clear
input str6 gvkey Year str3 Ticker at ib
001111 2006 abc 1000 50
001111 2007 abc 1100 60
001111 2008 abc 1200 70
001111 2009 abc 1300 80
001112 2008 www 28777 1300
001112 2009 www 26123 870
001113 2009 ttt 550 -1000
001114 2010 vvv 551 -990
end
egen id = group(gvkey), label
xtset id Year
gen wanted = at/L.ib
list, sepby(gvkey)
+------------------------------------------------------------+
| gvkey Year Ticker at ib id wanted |
|------------------------------------------------------------|
1. | 001111 2006 abc 1000 50 001111 . |
2. | 001111 2007 abc 1100 60 001111 22 |
3. | 001111 2008 abc 1200 70 001111 20 |
4. | 001111 2009 abc 1300 80 001111 18.57143 |
|------------------------------------------------------------|
5. | 001112 2008 www 28777 1300 001112 . |
6. | 001112 2009 www 26123 870 001112 20.09462 |
|------------------------------------------------------------|
7. | 001113 2009 ttt 550 -1000 001113 . |
|------------------------------------------------------------|
8. | 001114 2010 vvv 551 -990 001114 . |
+------------------------------------------------------------+
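For the ratio actually asked about, a minimal sketch on the same example data follows. It computes ib divided by lagged at in two ways: with the lag operator after xtset, and with by-group subscripting, which is what the second attempt in the question was missing. The names roa and roa_by are illustrative. The two agree here only because there are no gaps in the years within a company; with gaps, L.at returns missing while at[_n-1] silently picks up the previous available year.

```stata
clear
input str6 gvkey Year at ib
001111 2006  1000    50
001111 2007  1100    60
001111 2008  1200    70
001111 2009  1300    80
001112 2008 28777  1300
001112 2009 26123   870
001113 2009   550 -1000
001114 2010   551  -990
end

* xtset needs a numeric panel identifier
egen id = group(gvkey)
xtset id Year

* lag operator: value from the previous year within the same company
gen roa = ib / L.at

* by-group subscripting: previous observation within the same company
bysort gvkey (Year): gen roa_by = ib / at[_n-1]

list gvkey Year at ib roa roa_by, sepby(gvkey)
```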

Quintiles with different quantity of observations

I am using Stata and investigating the variable household net wealth (NetWealth).
I want to construct the quintiles of this variable and use the following command--as you can see I use survey data and thus apply survey weights:
xtile Quintile = NetWealth [pw=surveyweight], nq(5)
Then I give the following command to check what I have obtained:
tab Quintile, sum(NetWealth)
This is the result:
Means, Standard Deviations and Frequencies of DN3001 Net wealth
         5 |
 quantiles |
 of dn3001 |
-----------+-----------
         1 | 1519.4221
           | 43114.959
           |       154
-----------+-----------
         2 | 135506.67
           | 74360.816
           |       179
-----------+-----------
         3 | 396712.16
           |  69715.49
           |       161
-----------+-----------
         4 | 669065.69
           | 111102.02
           |       182
-----------+-----------
         5 | 2552620.5
           | 3872350.9
           |       274
-----------+-----------
     Total | 957419.29
           | 2323329.8
           |       950
Why do I get a different number of households in each quintile? In particular in the last quintile?
The only explanation that I can come up with is that when Stata constructs quintiles with xtile, it excludes from the computation those observations that have a duplicated value of NetWealth. I also got this impression while consulting the Stata documentation.
What do you think?
Your problem is not fully reproducible in so far as you don't give a self-contained example, but in general there is no puzzle here.
Often people seeking such binnings have a small problem in that their number of observations is not a multiple (meaning, exact multiple) of the number of quantile-based bins they want, but in your case that does not bite, as the calculation
. di 154 + 179 + 161 + 182 + 274
950
shows that you have 950 observations, which is 5 x 190.
The bigger deal -- here and almost always -- arises from Stata's rule that identical values in different observations must be assigned to the same bin. So, ties are likely to be the problem here.
You have perhaps three possible solutions. Only one involves direct coding.
1. Live with it.
2. Do something else. For example, why are you doing this anyway? Why not use the original data?
3. Try a different boundary condition. To do that, just negate the variable and bin that version. Then values on the boundary will jump differently.
Adding random noise to separate ties is utterly indefensible in my view. It's not reproducible (except trivially using the same program and the same settings) and it will have different implications in terms of the same observations' values on other variables.
Here's an example where #3 doesn't help, but it sometimes does:
. sysuse auto, clear
(1978 Automobile Data)
. xtile bin5 = mpg, nq(5)
. gen negmpg = -mpg
. xtile bin5_2 = negmpg, nq(5)
. tab bin5
5 quantiles |
of mpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 18 24.32 24.32
2 | 17 22.97 47.30
3 | 13 17.57 64.86
4 | 12 16.22 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
. tab bin5_2
5 quantiles |
of negmpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 19 25.68 25.68
2 | 12 16.22 41.89
3 | 16 21.62 63.51
4 | 13 17.57 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
See also some discussion within Section 4 of this paper
I see no hint whatsoever in the documentation that xtile would omit observations in the way that you imply. You give no precise quotation supporting that. It would be perverse to exclude any non-missing values unless so instructed.
I don't comment directly here on use of pweights except that using pweights might be a complicating factor here.
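The tie rule described above can be checked directly. The following is a minimal sketch on the auto data, without weights; ntied is an illustrative name. It counts how many observations share each value of mpg and confirms that no group of identical values is split across bins, which is why the bin frequencies cannot all equal 74/5.

```stata
sysuse auto, clear
xtile bin5 = mpg, nq(5)

* number of observations sharing each distinct value of mpg
bysort mpg: gen ntied = _N

* within each group of identical mpg values, the bin never changes
bysort mpg (bin5): assert bin5[1] == bin5[_N]

* range of mpg and number of observations in each bin
tabstat mpg, by(bin5) statistics(min max n)

* how heavily tied the variable is
tabulate ntied
```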

Reshape from wide to long without Identifier

I have problems in reshaping data from wide to long format:
I have no identifier variable for the wide variables.
My dataset is quite wide. I do have about 7000 variables.
The number of variables per ID is not constant, so for some IDs I have 5 and for others I have 10 variables.
I was hoping that this Stata FAQ could help me, but unfortunately this does not work properly (see following code snippets).
So I do have data that looks like the following example:
clear
input str45 Year
"2010"
"2011"
"2012"
"2014"
end
input str45 A101Meas0010
"1.50"
"1.70"
"1.71"
"1.71"
input str45 A101Meas0020
"50"
"60"
"65"
"64"
input str45 A101Meas0020A
"51"
"62"
"64"
"68"
input str45 FE123Meas0010
"1.60"
"1.75"
"1.92"
"1.94"
input str45 FE123Meas0020
"60"
"72"
"88"
"92"
list
+-------------------------------------------------------------+
| Year A10~0010 A10~0020 A1~0020A FE1~0010 FE1~0020 |
|-------------------------------------------------------------|
1. | 2010 1.50 50 51 1.60 60 |
2. | 2011 1.70 60 62 1.75 72 |
3. | 2012 1.71 65 64 1.92 88 |
4. | 2014 1.71 64 68 1.94 92 |
+-------------------------------------------------------------+
The final table I want to achieve would look something like this:
+--------------------------------------------------+
| Year ID Meas0010 Meas0020 Meas0020A |
|--------------------------------------------------|
1. | 2010 A101 1.50 50 . |
2. | 2010 FE123 1.60 51 60 |
3. | 2011 A101 1.70 60 . |
4. | 2011 FE123 1.75 62 72 |
5. | 2012 A101 1.71 65 . |
6. | 2012 FE123 1.92 64 88 |
7. | 2014 A101 1.71 64 . |
8. | 2014 FE123 1.94 68 92 |
+--------------------------------------------------+
I tried the following code snippet, which is close to the example from the Stata FAQ, but it throws an error:
unab vars : *Meas*
local stubs : subinstr local vars "Meas0010" "", all
local stubs : subinstr local stubs "Meas0020" "", all
local stubs : subinstr local stubs "Meas0020A" "", all
reshape long "`stubs'", i(Year) j(Measurement) string
(note: j = Meas0010 Meas0020 Meas0020A)
(note: A101AMeas0010 not found)
variable A101Meas0010 not found
r(111);
Any ideas how to reshape this? I never had to reshape such an odd structure before.
Additional Question: In the example above I did have to specify the Measurement-Names Meas0010, Meas0020 and Meas0020A. Is it possible to automate this as well? All measurement names start with the keyword Meas, so the variable names are always of the structure _ID+MeasName, e.g. A101Meas0020A stands for ID A101 and Measurement Meas0020A.
The annoying thing is: I do know how to do this in MATLAB, but I am forced to use Stata here.
Your variable name structure is a little awkward, but there is a syntax to match. It's better covered in the help for reshape, and is only barely mentioned in the FAQ you cite (which I wrote, so I can be emphatic that it's intended as a supplement to the help, not the first line of documentation).
Your example yields to
clear
input str4 (Year A101Meas0010 A101Meas0020 A101Meas0020A FE123Meas0010 FE123Meas0020)
"2010" "1.50" "50" "51" "1.60" "50"
"2011" "1.70" "60" "62" "1.75" "60"
"2012" "1.71" "65" "64" "1.92" "65"
"2014" "1.71" "64" "68" "1.94" "64"
end
reshape long @Meas0010 @Meas0020 @Meas0020A, i(Year) j(ID) string
destring, replace
sort Year ID
list, sepby(Year)
+-----------------------------------------------+
| Year ID Meas0010 Meas0020 Me~0020A |
|-----------------------------------------------|
1. | 2010 A101 1.5 50 51 |
2. | 2010 FE123 1.6 50 . |
|-----------------------------------------------|
3. | 2011 A101 1.7 60 62 |
4. | 2011 FE123 1.75 60 . |
|-----------------------------------------------|
5. | 2012 A101 1.71 65 64 |
6. | 2012 FE123 1.92 65 . |
|-----------------------------------------------|
7. | 2014 A101 1.71 64 68 |
8. | 2014 FE123 1.94 64 . |
+-----------------------------------------------+
It seems bizarre that your example enters everything as string: note the destring in my code.
Without access to your dataset, I'd say that you should be able to find the more general syntax without automation. You know that there are at most about 10 measurements in the fullest case. In any event you are already showing the syntax tricks needed to remove strings you don't need.
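On the additional question, here is a minimal sketch of one way to build the reshape specification from the variable names rather than typing it. It assumes that every name is ID followed by a measurement name and that the text Meas never occurs inside an ID; if it did, strpos() would cut in the wrong place. The macro names stubs, meas and spec are illustrative.

```stata
clear
input str4 (Year A101Meas0010 A101Meas0020 A101Meas0020A FE123Meas0010 FE123Meas0020)
"2010" "1.50" "50" "51" "1.60" "50"
"2011" "1.70" "60" "62" "1.75" "60"
"2012" "1.71" "65" "64" "1.92" "65"
"2014" "1.71" "64" "68" "1.94" "64"
end

unab vars : *Meas*
local stubs
foreach v of local vars {
    * everything from the first "Meas" onwards is the measurement name
    local meas = substr("`v'", strpos("`v'", "Meas"), .)
    * add it to the list only if it is not there already
    local stubs : list stubs | meas
}

* prefix each name with @ to mark where the ID sits in the variable name
local spec
foreach s of local stubs {
    local spec `spec' @`s'
}

reshape long `spec', i(Year) j(ID) string
destring, replace
sort Year ID
list, sepby(Year)
```

With about 7000 variables it may be worth displaying the contents of spec before calling reshape, to check that the list of measurement names is what you expect.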

Reshaping when year and countries are both columns

I am trying to reshape some data. The issue is that usually data is either long or wide but this seems to be set up in a way that I cannot figure out how to reshape. The data looks as follows:
year australia canada denmark ...
1999 10 15 20
2000 12 16 25
2001 14 18 40
And I would like to get it into a panel format like the following
year country gdppc
1999 australia 10
2000 australia 12
2001 australia 14
1999 canada 15
2000 canada 16
The problem is just in the variable names. See e.g. this FAQ for the advice that you may need to rename first before you can reshape.
For more complicated variants of this problem with similar data, see e.g. this paper.
clear
input year australia canada denmark
1999 10 15 20
2000 12 16 25
2001 14 18 40
end
rename (australia-denmark) gdppc=
reshape long gdppc , i(year) string j(country)
sort country year
list, sepby(country)
+--------------------------+
| year country gdppc |
|--------------------------|
1. | 1999 australia 10 |
2. | 2000 australia 12 |
3. | 2001 australia 14 |
|--------------------------|
4. | 1999 canada 15 |
5. | 2000 canada 16 |
6. | 2001 canada 18 |
|--------------------------|
7. | 1999 denmark 20 |
8. | 2000 denmark 25 |
9. | 2001 denmark 40 |
+--------------------------+
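If the goal is a panel in Stata's sense, one further step is usually needed, because xtset wants a numeric identifier. The sketch below repeats the reshape and then uses encode; cid and dgdppc are illustrative names, and the first difference is there only to show that time-series operators now work within countries.

```stata
clear
input year australia canada denmark
1999 10 15 20
2000 12 16 25
2001 14 18 40
end

rename (australia-denmark) gdppc=
reshape long gdppc, i(year) j(country) string

* encode maps the country names to integers and keeps them as value labels
encode country, generate(cid)
xtset cid year

* change in gdppc from the previous year, within each country
gen dgdppc = D.gdppc
list, sepby(country)
```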

Stata table: how to compute difference column without adding a new variable?

In a panel data set, I'm using
table Region TIME if TIME==2014 | TIME==2020 | TIME==2030 | TIME==2040, contents(sum BF ) row
to create the following table:
------------------------------------------
| TIME
Region | 2014 2020 2030 2040
----------+-------------------------------
701 | 26751 27941 29944 31477
702 | 10456 11354 12723 13788
704 | 41550 44481 49340 53273
706 | 44976 47535 51940 55573
709 | 43258 44398 46612 48191
711 | 6580 7011 7539 7856
713 | 9036 10139 11776 13194
714 | 3091 3284 3563 3750
716 | 9144 9730 10724 11543
719 | 5719 6292 7258 8036
720 | 11509 12161 13188 13919
722 | 21403 22344 23839 25006
723 | 4927 5094 5345 5447
728 | 2460 2576 2761 2906
|
Total | 240860 254340 276552 293959
------------------------------------------
I'd like to add a fifth column, which displays the difference between the year 2014 and 2040 in %.
Question: is this possible WITHOUT adding a new variable to the dataset? For instance by letting the fifth column being derived from a formula?
If not, how do I easily compute a new variable, taking account of the long format of the panel data set?
This isn't possible within table.
Your variable could be something like
egen total2014 = total(BF / (TIME == 2014)), by(Region)
egen total2040 = total(BF / (TIME == 2040)), by(Region)
gen pcdiff = 100 * (total2040 - total2014)/total2014
after which you can tabulate its (mean) value for each region. See Section 10 in http://www.stata-journal.com/sjpdf.html?articlenum=dm0055 for the first trick here.
You may need to go outside table for the tabulation, but if all else fails, collapse to a new dataset of totals and means.
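A minimal sketch on made-up data shows how the division trick works: BF / (TIME == 2014) is BF when the condition is true and missing otherwise (division by zero), and total() ignores missings, so each region gets the sum over 2014 only, repeated in all of its observations. The numbers below are invented purely for illustration.

```stata
clear
input Region TIME BF
701 2014 100
701 2014  50
701 2040 180
702 2014 200
702 2040 250
end

* sums restricted to one year, spread to every observation of the region
egen total2014 = total(BF / (TIME == 2014)), by(Region)
egen total2040 = total(BF / (TIME == 2040)), by(Region)
gen pcdiff = 100 * (total2040 - total2014) / total2014

* pcdiff is constant within region, so its mean is just that value
tabstat pcdiff, by(Region) statistics(mean)
```

Here region 701 goes from 150 to 180 and region 702 from 200 to 250, so pcdiff is 20 and 25 respectively.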