Contracting dataset to only hold unique values for a variable

Assuming I have the following dataset as a toy example:
clear
input str32 Country Population_1 Population_2
"United States of America" 3999 .
"United States of America" . 3447
"Afghanistan" 544 .
"Afghanistan" . 727
"Belgium" 7546 .
"Belgium" . 992
"China" 10000 .
"China" . 12000
end
I want to shrink the dataset so that there is just one observation per country.
My final dataset should look as follows:
Country Population_1 Population_2
United States of America 3999 3447
Afghanistan 544 727
Belgium 7546 992
China 10000 12000
I tried to use the collapse command but did not get the expected outcome. The command duplicates drop does not work either, as it does not retain the observations from Population_2.

This works for me:
collapse Pop*, by(Country)
list, abbreviate(12)
     +--------------------------------------------------------+
     |                  Country   Population_1   Population_2 |
     |--------------------------------------------------------|
  1. |              Afghanistan            544            727 |
  2. |                  Belgium           7546            992 |
  3. |                    China          10000          12000 |
  4. | United States of America           3999           3447 |
     +--------------------------------------------------------+
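Why this works: collapse defaults to the mean and ignores missing values, so with exactly one nonmissing value per country and variable the mean is simply that value. A minimal sketch that makes the intent explicit with the firstnm (first nonmissing) statistic, assuming a subset of the toy data from the question:

```stata
clear
input str32 Country Population_1 Population_2
"United States of America" 3999 .
"United States of America" . 3447
"Afghanistan" 544 .
"Afghanistan" . 727
end

* keep the first nonmissing value of each variable within each country
collapse (firstnm) Pop*, by(Country)
list, abbreviate(12)
```

Unlike the mean, firstnm would not silently average two conflicting nonmissing values for the same country; it would keep the first one.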

The following works for me:
generate Population_ = .
by Country, sort: replace Population_ = Population_2 if Population_1 == .
by Country, sort: replace Population_ = Population_1 if Population_2 == .
by Country: generate time = _n
drop Population_1 Population_2
reshape wide Population_, i(Country) j(time)

The community-contributed command gcollapse (part of the gtools package) can also preserve wanted variables in the dataset:
gcollapse (sum) Pop*, merge replace by(Country)
duplicates drop Country, force
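One caveat with (sum): in official collapse, the sum over a group whose values are all missing is 0 rather than missing, and gcollapse (sum) may behave the same way unless told otherwise, so its help file is worth checking. A sketch with official collapse, where the country "Nowhere" is a hypothetical extra row added only to show the effect:

```stata
clear
input str32 Country Population_1 Population_2
"Belgium" 7546 .
"Belgium" . 992
"Nowhere" 100 .
end

* (sum) treats missing as zero, so Nowhere gets Population_2 = 0, not missing
collapse (sum) Pop*, by(Country)
list
```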

Related

Stata: Scale x variable by lagged y variable

I'm trying to scale one variable by another lagged variable.
(IB) scaled by the lagged total assets(AT) = ROA
I've tried the two methods below, taken from here.
xtset companyid fyear, year
gen at1 = l.at
gen roa=ib/at1
and
xtset gvkey year
gen roa=(ib)/(at[_n-1])
The first one came back with all zeros for 1.ta
The second one seems to generate values on the previous entry, even if it's a different company. I think this is true because only the first row has a missing value. I would assume there should be a missing value for the first year of each company.
Additionally I've tried this code below but it said invalid syntax.
xtset gvkey year
foreach gvkey {
gen roa = (ib)/(at[_n-1]) }
I'm using compustat so it's similar to below:
gvkey|Year |Ticker | at | ib |
-------|-----|--------|------|------|
001111| 2006| abc |1000 |50 |
001111| 2007| abc |1100 |60 |
001111| 2008| abc |1200 |70 |
001111| 2009| abc |1300 |80 |
001112| 2008| www |28777 |1300 |
001112| 2009| www |26123 |870 |
001113| 2009| ttt |550 |-1000 |
001114| 2010| vvv |551 |-990 |

This is hard to follow. 1.ta may, or may not, be a typo for L.at.
Is gvkey string? At the Stata tag, there is really detailed advice about how to give Stata data examples, which you are not following.
In principle, your first approach is correct, so it is hard to know what went wrong. As for your remark that
"The second one seems to generate values on the previous entry, even if it's a different company."
that's exactly correct. The previous observation is the previous observation, and nothing in that command refers or alludes to the panel structure or xtset or tsset information.
Your foreach statement is just wild guessing and has nothing to do with any form supported by foreach. foreach isn't needed here at all: the lag operator works within panels automatically once the data are xtset.
I did this, which may help.
clear
input str6 gvkey Year str3 Ticker at ib
001111 2006 abc 1000 50
001111 2007 abc 1100 60
001111 2008 abc 1200 70
001111 2009 abc 1300 80
001112 2008 www 28777 1300
001112 2009 www 26123 870
001113 2009 ttt 550 -1000
001114 2010 vvv 551 -990
end
egen id = group(gvkey), label
xtset id Year
gen wanted = at/L.ib
list, sepby(gvkey)
     +------------------------------------------------------------+
     |  gvkey   Year   Ticker      at      ib       id     wanted |
     |------------------------------------------------------------|
  1. | 001111   2006      abc    1000      50   001111          . |
  2. | 001111   2007      abc    1100      60   001111         22 |
  3. | 001111   2008      abc    1200      70   001111         20 |
  4. | 001111   2009      abc    1300      80   001111   18.57143 |
     |------------------------------------------------------------|
  5. | 001112   2008      www   28777    1300   001112          . |
  6. | 001112   2009      www   26123     870   001112   20.09462 |
     |------------------------------------------------------------|
  7. | 001113   2009      ttt     550   -1000   001113          . |
     |------------------------------------------------------------|
  8. | 001114   2010      vvv     551    -990   001114          . |
     +------------------------------------------------------------+
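Note that the listing above divides at by lagged ib. For ROA as defined in the question (ib scaled by lagged total assets), the operands go the other way round. A minimal sketch on the first rows of two firms from the example; the variable names roa and roa_by are my own. The by: version shows how the explicit-subscript attempt can be confined to each firm, although unlike L. it would take the previous observation even across a gap in years:

```stata
clear
input str6 gvkey Year at ib
001111 2006 1000 50
001111 2007 1100 60
001112 2008 28777 1300
001112 2009 26123 870
end

egen id = group(gvkey)
xtset id Year

* lag operator: respects the panel and time structure declared by xtset
gen roa = ib / L.at

* explicit subscripting restarts within each firm only under by:
bysort id (Year): gen roa_by = ib / at[_n-1]

list, sepby(gvkey)
```

Both variables are missing in the first year of each firm, which is the behaviour the question expected.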

Use of count command

I am using a dataset, which among other variables includes the following:
. describe year country co brand
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------------------
year            int     %9.0g                 year (=first dimension of panel)
country         byte    %9.0g      market     market (=second dimension of panel)
co              int     %9.0g                 model code (=third dimension of panel)
brand           byte    %21.0g     brand      brand code
After I load the dataset, I generate a new variable and declare my data to be panel:
egen yearcountry = group(year country), label
xtset co yearcountry
I would like to estimate the market share of each brand in each country.
For example:
count if brand=="AlfaRomeo" & country=="Italy"
However, I get the following error:
type mismatch
r(109);
The entire dataset consisting of 11,483 observations can be downloaded from here.

The following works for me:
. count if brand == 1 & country == 4
111
The variables brand and country are not string but numeric with value labels:
. tabulate country
     market |
   (=second |
  dimension |
  of panel) |      Freq.     Percent        Cum.
------------+-----------------------------------
    Belgium |      2,641       23.00       23.00
     France |      2,252       19.61       42.61
    Germany |      2,281       19.86       62.47
      Italy |      2,020       17.59       80.07
         UK |      2,289       19.93      100.00
------------+-----------------------------------
      Total |     11,483      100.00
. tabulate country, nolabel
     market |
   (=second |
  dimension |
  of panel) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      2,641       23.00       23.00
          2 |      2,252       19.61       42.61
          3 |      2,281       19.86       62.47
          4 |      2,020       17.59       80.07
          5 |      2,289       19.93      100.00
------------+-----------------------------------
      Total |     11,483      100.00
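If typing the labels is more convenient than looking up the codes, Stata can translate a label back into its numeric value with the "text":labelname syntax. A sketch on made-up data: the label names market and brand come from the describe output in the question, while the rows and the code for Audi are purely illustrative:

```stata
clear
input byte country byte brand
4 1
4 1
4 2
1 1
end
label define market 1 "Belgium" 4 "Italy"
label define brand 1 "AlfaRomeo" 2 "Audi"
label values country market
label values brand brand

* "text":labelname evaluates to the numeric code carrying that label
count if brand == "AlfaRomeo":brand & country == "Italy":market
```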
However, note that what you are calculating here is not the market share, but the number of observations of a particular brand in a certain country. The percentage market share is usually defined as the ratio of unit sales to total market unit sales.
The following code snippet will thus produce what you want:
forvalues i = 1 / 47 {
    bysort year (country): egen a_`i' = total(brand == `i')
    bysort year (country): gen b_`i' = (a_`i' / _N) * 100
}
collapse b_*, by(country year)
You can also check that the results add up as follows:
egen all = rowtotal(b_*)
You could then see the market share for AlfaRomeo & Audi, for years 1970 & 1976 and for Belgium & France as follows:
format b_* all %4.2f
list year country b_1 b_2 all if inlist(year,1970, 1976) & inlist(country, 1, 2), noobs
  +---------------------------------------+
  | year   country    b_1    b_2      all |
  |---------------------------------------|
  | 1970   Belgium   3.31   7.35   100.00 |
  | 1976   Belgium   5.01   4.13   100.00 |
  | 1970    France   3.31   7.35   100.00 |
  | 1976    France   5.01   4.13   100.00 |
  +---------------------------------------+
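In the listing above Belgium and France show identical shares, because the by-group in the loop is year alone (country in parentheses is used only for sorting), so each b_ variable is a brand's share of all observations in that year. If shares within each country and year are wanted instead, a sketch along these lines may help. The toy data and the variable names n_brand, n_market and share are assumptions of mine, and, as noted above, this counts observations rather than unit sales:

```stata
clear
input int year byte country byte brand
1970 1 1
1970 1 2
1970 1 2
1970 1 2
1970 2 1
1970 2 2
end

* observations per brand within each country-year
bysort year country brand: gen n_brand = _N
* observations per country-year
bysort year country: gen n_market = _N
gen share = 100 * n_brand / n_market

* one row per year, country and brand
collapse (mean) share, by(year country brand)
list, sepby(year country)
```

This keeps the data in long form (one row per brand) instead of creating 47 wide variables, which also avoids hard-coding the number of brands.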

Reshaping when year and countries are both columns

I am trying to reshape some data. The issue is that usually data is either long or wide but this seems to be set up in a way that I cannot figure out how to reshape. The data looks as follows:
year australia canada denmark ...
1999 10 15 20
2000 12 16 25
2001 14 18 40
And I would like to get it into a panel format like the following
year country gdppc
1999 australia 10
2000 australia 12
2001 australia 14
1999 canada 15
2000 canada 16

The problem is just in the variable names. See e.g. this FAQ for the advice that you may need to rename first before you can reshape.
For more complicated variants of this problem with similar data, see e.g. this paper.
clear
input year australia canada denmark
1999 10 15 20
2000 12 16 25
2001 14 18 40
end
rename (australia-denmark) gdppc=
reshape long gdppc , i(year) string j(country)
sort country year
list, sepby(country)
     +--------------------------+
     | year     country   gdppc |
     |--------------------------|
  1. | 1999   australia      10 |
  2. | 2000   australia      12 |
  3. | 2001   australia      14 |
     |--------------------------|
  4. | 1999      canada      15 |
  5. | 2000      canada      16 |
  6. | 2001      canada      18 |
     |--------------------------|
  7. | 1999     denmark      20 |
  8. | 2000     denmark      25 |
  9. | 2001     denmark      40 |
     +--------------------------+

Automatic labelling of newly created indicator variables in Stata

I have a variable called region, which has 22 elements. Here is the output of tabulate region:
region of place of work | Freq. Percent Cum.
---------------------------+-----------------------------------
tyne & wear | 6 1.20 1.20
rest of northern region | 12 2.40 3.60
south yorkshire | 9 1.80 5.40
west yorkshire | 23 4.60 10.00
rest of yorks & humberside | 9 1.80 11.80
east midlands | 42 8.40 20.20
east anglia | 12 2.40 22.60
central london | 41 8.20 30.80
inner london (not central) | 23 4.60 35.40
outer london | 19 3.80 39.20
rest of south east | 97 19.40 58.60
south west | 46 9.20 67.80
west midlands metropolitan | 29 5.80 73.60
rest of west midlands | 14 2.80 76.40
greater manchester | 15 3.00 79.40
merseyside | 2 0.40 79.80
rest of north west | 31 6.20 86.00
wales | 12 2.40 88.40
strathclyde | 23 4.60 93.00
rest of scotland | 27 5.40 98.40
northern ireland | 8 1.60 100.00
---------------------------+-----------------------------------
Total | 500 100.00
I have created indicator variables from it using tab region, gen(region_). This creates 22 new variables, from region_1 to region_22. I do want the indicator variables to have simple names like region_1 etc. (e.g. they are easier to refer to using region_*). The problem is the variable label, which is something like region==south west. I want it to be south west.
I have looked at dummieslab (SSC) but it focuses on adding labels to the new variable names. None of these solutions work either. Do you know an automatic way to obtain this? Even a simple function like eliminating specific words from labels (getting rid of the region== bit) would work. I can't find anything like it.

Loop over the variables and zap unwanted text from the label each time. This basic functionality is documented at help macro and in the corresponding manual entry.
foreach v of var region_* {
    local lbl : var label `v'
    local lbl : subinstr local lbl "region==" "", all
    local lbl = trim("`lbl'")
    label var `v' "`lbl'"
}
For a canned solution, see labvarch from labutil on SSC, which you can install with
ssc inst labutil

Convert wide-like data to a long one in Stata?

I have a dataset like
year CNMubiBeijing CNMubiTianjing CNMubiShanghai ··· ··· Wulumuqi
1998 . . . .
1999 . . . .
····
2013 . . . .
As you can see, the first row is a list of city names in China, like Beijing, Shanghai and so on, combined with the prefix "CNMubi" (which is redundant). The first column holds the year, and the observations are values of another variable (such as local government tax revenue). It is similar to "wide" data, and I want to convert it to long panel data like
city year tax_rev
Beijing 1998
···
Beijing 2013
Shanghai 1998
···
Shanghai 2013
Two immediate solutions come to mind. One is to use the reshape command directly, like reshape long CNMubi, i(year) j(city_eng), but it turns out to give me a column of missing values (the column city_eng).
The second possible solution is to use a loop, like
foreach var of varlist _all {
replace city_eng="`var'"
}
It also doesn't work (in fact, the newly generated city_eng equals the last variable in the varlist). I need to "expand" the data from a mn to a mnm matrix. So how can I achieve my goal? Thank you.

This works:
clear
set more off
*----- example data -----
input year CNMubiBeijing CNMubiTianjing
1998 . .
1999 . .
end
set seed 259376
replace CNMubiBeijing = runiform()
replace CNMubiTianjing = runiform()
*----- what you want -----
reshape long CNMubi, i(year) j(city) string
sort city year
list, sepby(city)
Notice the string option, since j() contains string values.
The result is:
. sort city year
. list, sepby(city)
     +----------------------------+
     | year       city     CNMubi |
     |----------------------------|
  1. | 1998    Beijing    .658855 |
  2. | 1999    Beijing    .494634 |
     |----------------------------|
  3. | 1998   Tianjing   .0204465 |
  4. | 1999   Tianjing   .0454614 |
     +----------------------------+
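To end up with the variable names asked for in the question (city, year, tax_rev), the redundant prefix can be swapped for the desired stub before reshaping. A sketch, assuming every city variable starts with CNMubi; the numbers are made up:

```stata
clear
input year CNMubiBeijing CNMubiTianjing
1998 10 30
1999 20 40
end

* replace the CNMubi prefix with the stub that reshape should use
rename CNMubi* tax_rev*

* what remains after the stub is a city name, i.e. a string, hence the string option
reshape long tax_rev, i(year) j(city) string
sort city year
list, sepby(city)
```

Any variable that lacks the prefix would have to be renamed separately before the reshape, since reshape long only picks up variables that begin with the stub.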

We Keep Coding
