Use of count command - stata

I am using a dataset, which among other variables includes the following:
. describe year country co brand

              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------
year            int     %9.0g                 year (=first dimension of panel)
country         byte    %9.0g      market     market (=second dimension of panel)
co              int     %9.0g                 model code (=third dimension of panel)
brand           byte    %21.0g     brand      brand code
After I load the dataset, I generate a new variable and declare my data to be panel:
egen yearcountry = group(year country), label
xtset co yearcountry
I would like to estimate the market share of each brand in each country.
For example:
count if brand=="AlfaRomeo" & country=="Italy"
However, I get the following error:
type mismatch
r(109);
The entire dataset consisting of 11,483 observations can be downloaded from here.

The following works for me:
. count if brand == 1 & country == 4
111
The variables brand and country are not string but numeric with value labels:
. tabulate country
market |
(=second |
dimension |
of panel) | Freq. Percent Cum.
------------+-----------------------------------
Belgium | 2,641 23.00 23.00
France | 2,252 19.61 42.61
Germany | 2,281 19.86 62.47
Italy | 2,020 17.59 80.07
UK | 2,289 19.93 100.00
------------+-----------------------------------
Total | 11,483 100.00
. tabulate country, nolabel
market |
(=second |
dimension |
of panel) | Freq. Percent Cum.
------------+-----------------------------------
1 | 2,641 23.00 23.00
2 | 2,252 19.61 42.61
3 | 2,281 19.86 62.47
4 | 2,020 17.59 80.07
5 | 2,289 19.93 100.00
------------+-----------------------------------
Total | 11,483 100.00
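If you would rather refer to the labels than to the underlying codes, Stata can look up the code attached to a label inside an expression with the "label":labelname syntax. Here is a minimal sketch using the auto dataset shipped with Stata (chosen only for illustration; there the variable foreign carries the value label origin). In the question's data, describe reports the value labels brand and market, so the analogous condition would presumably be brand == "AlfaRomeo":brand & country == "Italy":market.

```stata
sysuse auto, clear
* foreign is numeric (0/1) with the value label "origin" attached
describe foreign
* Comparing a numeric variable with a string literal fails with r(109)
capture noisily count if foreign == "Foreign"
* Look up the numeric code behind the label instead
count if foreign == "Foreign":origin
```

This keeps the condition readable without converting the variables to strings.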
However, note that what you are calculating here is not the market share but the number of observations (car models) of a particular brand in a given country. Percentage market share is usually defined as the ratio of a brand's unit sales to total market unit sales, multiplied by 100.
The following code snippet computes, for each year, each brand's percentage share of the observations:
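* Loop over brand codes 1 to 47
forvalues i = 1/47 {
    * a_i: number of observations of brand i within each year
    bysort year (country): egen a_`i' = total(brand == `i')
    * b_i: brand i's percentage share of that year's observations
    bysort year (country): gen b_`i' = (a_`i' / _N) * 100
}
* Reduce to one row per country-year (mean of the constant shares)
collapse b_*, by(country year)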
You can also check that the results add up as follows:
egen all = rowtotal(b_*)
You could then see the market share for AlfaRomeo & Audi, for years 1970 & 1976 and for Belgium & France as follows:
format b_* all %4.2f
list year country b_1 b_2 all if inlist(year,1970, 1976) & inlist(country, 1, 2), noobs
+---------------------------------------+
| year country b_1 b_2 all |
|---------------------------------------|
| 1970 Belgium 3.31 7.35 100.00 |
| 1976 Belgium 5.01 4.13 100.00 |
| 1970 France 3.31 7.35 100.00 |
| 1976 France 5.01 4.13 100.00 |
+---------------------------------------+
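Belgium and France show identical shares in the listing above because the by-groups in the loop are defined by year alone, so each share is taken over all countries in that year. If shares within each country are wanted, one possible variant is to group by year and country. The sketch below uses a small made-up dataset, and the names n_market, n_brand and share are hypothetical. It still measures shares of observations, not of unit sales, because no sales variable appears in the question.

```stata
clear
* Toy data (hypothetical): one row per model, with numeric codes
input year country brand
1970 1 1
1970 1 1
1970 1 2
1970 2 1
1970 2 2
1970 2 2
1970 2 2
end
* Number of observations in each year-country market
bysort year country: gen n_market = _N
* Number of observations of each brand within that market
bysort year country brand: gen n_brand = _N
* Percentage share of observations, by brand, within year and country
gen share = 100 * n_brand / n_market
* Keep one row per year-country-brand cell
collapse (first) share, by(year country brand)
list, sepby(country) noobs
```

This long layout (one row per brand) avoids creating 47 pairs of variables. With a unit sales variable, the same idea would apply using totals of sales in place of _N.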

Related

Trimming my panel dataset - filtering out observations meeting criterion if preceding ID meets the complementary criterion

I am working with a dataset that includes 118,979 observations over 9 wide variables in Stata 16.0. The most prominent variable is whether a company-observation over multiple dates reports either "GPS" or "EPS". These companies can report both a "GPS" observation in a datapoint, as well as an "EPS" observation in the following datapoint. Please refer to the data overview below for further visualisation.
Data sample:
clear
input str8 cusip8 str16 cname str4 measure double actual long anndats_act float(fyear tanalyst meanforcast UE)
"87482X10" "TALMER BANCORP" "EPS" 1.21 20118 2014 29 .8686207 .3930131
"87482X10" "TALMER BANCORP" "GPS" 1.02 20479 2015 34 .8576471 .1893004
end
I need to drop the GPS observations (over multiple dates) once an identifier (being cusip8 in the table above) has reported an EPS over multiple dates. That is, if a company has reported GPS as well as EPS in e.g. January 1st, 2010, I want to drop the GPS observation such that the EPS is kept.
If a company only reports a GPS, and does not report an EPS during a given date, I want to keep the GPS observation in my dataset.
The following works for me (adjust your variable names as required):
. clear
. input str10(company_id measure) month day year
company_id measure month day year
1. "Company A" "EPS" 1 1 2010
2. "Company A" "GPS" 1 1 2010
3. "Company A" "GPS" 1 1 2010
4. "Company A" "GPS" 1 2 2010
5. "Company B" "EPS" 1 2 2010
6. "Company B" "GPS" 1 1 2010
7. "Company C" "GPS" 1 4 2010
8. "Company C" "EPS" 1 4 2010
9. end
.
. gen date = mdy(month,day,year)
. format date %d
. drop month day year
.
. sort company_id date measure
.
. gen both = 0
. by company_id date: replace both = 1 if measure[1] == "EPS" & measure[2] == "GPS"
(5 real changes made)
.
. list, sepby(company_id)
+----------------------------------------+
| company~d measure date both |
|----------------------------------------|
1. | Company A EPS 01jan2010 1 |
2. | Company A GPS 01jan2010 1 |
3. | Company A GPS 01jan2010 1 |
4. | Company A GPS 02jan2010 0 |
|----------------------------------------|
5. | Company B GPS 01jan2010 0 |
6. | Company B EPS 02jan2010 0 |
|----------------------------------------|
7. | Company C EPS 04jan2010 1 |
8. | Company C GPS 04jan2010 1 |
+----------------------------------------+
.
. drop if measure == "GPS" & both == 1
(3 observations deleted)
.
. list, sepby(company_id)
+----------------------------------------+
| company~d measure date both |
|----------------------------------------|
1. | Company A EPS 01jan2010 1 |
2. | Company A GPS 02jan2010 0 |
|----------------------------------------|
3. | Company B GPS 01jan2010 0 |
4. | Company B EPS 02jan2010 0 |
|----------------------------------------|
5. | Company C EPS 04jan2010 1 |
+----------------------------------------+
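The approach above relies on the sort order placing "EPS" before "GPS" within each company and date. A variant that does not depend on the position of observations is to flag every company-date group that contains at least one EPS row. This is only a sketch on the same toy data, and has_eps is a hypothetical variable name.

```stata
clear
input str10(company_id measure) month day year
"Company A" "EPS" 1 1 2010
"Company A" "GPS" 1 1 2010
"Company A" "GPS" 1 1 2010
"Company A" "GPS" 1 2 2010
"Company B" "EPS" 1 2 2010
"Company B" "GPS" 1 1 2010
"Company C" "GPS" 1 4 2010
"Company C" "EPS" 1 4 2010
end
gen date = mdy(month, day, year)
format date %td
* has_eps is 1 if any row for this company and date reports EPS
bysort company_id date: egen has_eps = max(measure == "EPS")
* Drop GPS rows only where an EPS row exists for the same company and date
drop if measure == "GPS" & has_eps == 1
list company_id measure date, sepby(company_id)
```

As in the log above, three GPS observations are dropped and five observations remain.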

How can I transpose multiple columns at once?

I am trying to transpose three columns by two variables.
My current dataset looks like:
Person Date Company Industry Number
John 2017 Apple Tech 5
John 2017 Starbucks Beverages 3
Kim 2014 Hilton Hotels 9
I would like my output data set to look like:
Person | Date | Company1 | Industry1 | Number1 | Company2  | Industry2 | Number2
John   | 2017 | Apple    | Tech      | 5       | Starbucks | Beverages | 3
Kim    | 2014 | Hilton   | Hotels    | 9       | -         | -         | -
As you can see, I would like each observation to be unique by person and date.
Any suggestions?
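One possible approach, offered as a sketch and not as a definitive answer, is to number the rows within each Person and Date and then reshape wide on that counter. The counter name j is hypothetical, and ordering the rows alphabetically by Company within each group is an assumption made only so that the numbering is reproducible.

```stata
clear
input str10 Person Date str12 Company str12 Industry Number
"John" 2017 "Apple" "Tech" 5
"John" 2017 "Starbucks" "Beverages" 3
"Kim" 2014 "Hilton" "Hotels" 9
end
* Number the rows within each Person-Date group, ordered by Company
bysort Person Date (Company): gen j = _n
* Spread the three variables across columns indexed by j
reshape wide Company Industry Number, i(Person Date) j(j)
list, noobs
```

Groups with fewer rows than the maximum get missing values (empty strings or .) in the unused columns, which corresponds to the dashes in the desired output.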

Count the number of distinct strings and their occurrence in a variable

I have a variable called Category that specifies the category of observations. The problem is that some observations have multiple categories. For example:
id Category
1 Economics
2 Biology
3 Psychology; Economics
4 Economics; Psychology
There is no meaning in the order of categories. They are always separated by ";". There are 250 categories, so creating dummy variables might be tricky. I have the complete list of categories in a separate Excel file if this might help.
What I want is simply to summarize my dataset by unique categories, such as Economics (3), Psychology (2), Biology (1), so the sum of all counts can exceed the number of observations.
tabsplit from the tab_chi package on SSC will do this for you.
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
capture ssc install tab_chi
tabsplit Category, p(;)
Category | Freq. Percent Cum.
------------+-----------------------------------
Biology | 1 16.67 16.67
Economics | 3 50.00 66.67
Psychology | 2 33.33 100.00
------------+-----------------------------------
Total | 6 100.00
Note: You can count semi-colons and thus phrases like this.
gen count = 1 + length(Category) - length(subinstr(Category, ";", "", .))
The logic is to compare the length of the string with the length it would have if every semi-colon ; were replaced by an empty string (namely, removed). The difference is the number of semi-colons, to which you add 1.
EDIT: How to get to a different data structure, starting with the data example above.
. split Category, p(;)
variables created as string:
Category1 Category2
. drop Category
. reshape long Category, i(id) j(mention)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 3 -> 3
j variable (2 values) -> mention
xij variables:
Category1 Category2 -> Category
-----------------------------------------------------------------------------
. drop if missing(Category)
(2 observations deleted)
. list, sepby(id)
+----------------------------+
| id mention Category |
|----------------------------|
1. | 1 1 Economics |
|----------------------------|
2. | 2 1 Biology |
|----------------------------|
3. | 3 1 Psychology |
4. | 3 2 Economics |
|----------------------------|
5. | 4 1 Economics |
6. | 4 2 Psychology |
+----------------------------+
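Once the data are in this long layout, the counts asked for in the question can be obtained with tabulate. One caveat in this sketch: split on ";" leaves the blank that followed the semi-colon at the start of the second piece, so the pieces are trimmed before tabulating. The steps repeat the example above, with the trim added.

```stata
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
split Category, p(;)
drop Category
reshape long Category, i(id) j(mention)
drop if missing(Category)
* Remove the leading blank left behind by "; "
replace Category = trim(Category)
* One row per mention, so this counts occurrences of each category
tabulate Category
```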

How can I export a two-way table?

I have created a two-way summary table in Stata, but I am struggling to output my results.
Using the auto.dta sample dataset as an example, I am trying to build a table that displays the means and standard deviations of mpg, by two other variables (expensive and foreign).
My code currently looks as follows:
sysuse auto.dta, clear
gen expensive = (price > 5000)
The table that I would like to display can be created by either of the two commands below:
tabulate expensive foreign, sum(mpg)
Means, Standard Deviations and Frequencies of Mileage (mpg)
| Car type
expensive | Domestic Foreign | Total
-----------+----------------------+----------
0 | 22.137931 28.875 | 23.594595
| 4.3648281 4.8825491 | 5.2305696
| 29 8 | 37
-----------+----------------------+----------
1 | 16.913043 22.428571 | 19
| 3.4629604 6.4416229 | 5.4467115
| 23 14 | 37
-----------+----------------------+----------
Total | 19.826923 24.772727 | 21.297297
| 4.7432972 6.6111869 | 5.7855032
| 52 22 | 74
table expensive foreign, c(mean mpg sd mpg count mpg) row col
----------------------------------------
| Car type
expensive | Domestic Foreign Total
----------+-----------------------------
0 | 22.1379 28.875 23.5946
| 4.364828 4.882549 5.23057
| 29 8 37
|
1 | 16.913 22.4286 19
| 3.46296 6.441623 5.446712
| 23 14 37
|
Total | 19.8269 24.7727 21.2973
| 4.743297 6.611187 5.785503
| 52 22 74
----------------------------------------
I can also closely approximate the same results using collapse, but this does not calculate row and column totals.
My issue is that neither the tabulate (with the sum option) command nor the table command seem friendly to output. I have tried converting to matrices, but tabulate with the sum option does not allow the matcell option and table seems similarly uncooperative.
I'm familiar with tabstat, esttab etc., but was not able to create the two-way table that I need with any of those packages. Any help would be really appreciated.
The community-contributed command asdoc does exactly that:
. asdoc table expensive foreign, c(mean mpg sd mpg count mpg) row col
----------------------------------------
| Car type
expensive | Domestic Foreign Total
----------+-----------------------------
0 | 22.1379 28.875 23.5946
| 4.364828 4.882549 5.23057
| 29 8 37
|
1 | 16.913 22.4286 19
| 3.46296 6.441623 5.446712
| 23 14 37
|
Total | 19.8269 24.7727 21.2973
| 4.743297 6.611187 5.785503
| 52 22 74
----------------------------------------
Click to Open File: Myfile.doc
Alternatively, one could use the community-contributed command tabout:
. tabout expensive foreign using table1.txt, c(mean mpg) sum replace
Table output written to: table1.txt
Car type
Domestic Foreign Total
Mean mpg Mean mpg Mean mpg
expensive
0 22.1 28.9 23.6
1 16.9 22.4 19.0
Total 19.8 24.8 21.3
. tabout expensive foreign using table2.txt, c(sd mpg) sum replace
Table output written to: table2.txt
Car type
Domestic Foreign Total
Sd mpg Sd mpg Sd mpg
expensive
0 4.4 4.9 5.2
1 3.5 6.4 5.4
Total 4.7 6.6 5.8
. tabout expensive foreign using table3.txt, c(count mpg) sum replace
Table output written to: table3.txt
Car type
Domestic Foreign Total
Count mpg Count mpg Count mpg
expensive
0 29.0 8.0 37.0
1 23.0 14.0 37.0
Total 52.0 22.0 74.0
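An easy solution is to use collapse to build a dataset that reproduces the cells of your desired table, and then export that dataset as a CSV file. Note that this gives the cell statistics only, not the row and column totals.
For example:
collapse (mean) mean_mpg = mpg (sd) sd_mpg = mpg (count) n_mpg = mpg, by(expensive foreign)
and then
export delimited using mydata.csv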

Display all levels of a variable while tabulating in Stata

I am cross-tabulating two variables: variable1, with 5 levels, and variable2, with 2 levels. In the resulting table, levels 1 and 2 of variable1 are not displayed because their frequency is zero, as follows:
sysuse auto
levelsof rep78
1 2 3 4 5
tab rep78 foreign if foreign, col nofreq
Repair |
Record | Car type
1978 | Foreign | Total
-----------+-----------+----------
3 | 14.29 | 14.29
4 | 42.86 | 42.86
5 | 42.86 | 42.86
-----------+-----------+----------
Total | 100.00 | 100.00
I would like to have the tabulation with all the levels displayed as follows:
tab rep78 foreign if foreign, col nofreq
Repair |
Record | Car type
1978 | Foreign | Total
-----------+-----------+----------
1 | 0.00 | 0.00
2 | 0.00 | 0.00
3 | 14.29 | 14.29
4 | 42.86 | 42.86
5 | 42.86 | 42.86
-----------+-----------+----------
Total | 100.00 | 100.00
How can I do that?
The reason I need this is that I have created a program that tabulates a given variable and posts the results into an Excel report template using the putexcel functionality of Stata. In some cases some levels are not displayed in the tabulation, and this results in some values getting posted to the wrong row of the Excel report.
No decent example as yet from the OP, but here is some technique.
In general, it's tricky. Stata's no metaphysician and is reluctant to display anything without empirical evidence to hand that it exists. I here create a dataset with all the cross-combinations needed and also create a variable with explicit zeros to show. For many problems, also see help fillin.
. clear
. sysuse auto
(1978 Automobile Data)
. contract foreign rep78, zero
. egen pc = pc(_freq), by(foreign)
. tabdisp rep78 foreign if !foreign, c(pc) format(%2.1f)
--------------------
Repair |
Record | Car type
1978 | Domestic
----------+---------
1 | 3.8
2 | 15.4
3 | 51.9
4 | 17.3
5 | 3.8
. | 7.7
--------------------
. tabdisp rep78 foreign if foreign, c(pc) format(%2.1f)
-------------------
Repair |
Record |Car type
1978 | Foreign
----------+--------
1 | 0.0
2 | 0.0
3 | 13.6
4 | 40.9
5 | 40.9
. | 4.5
-------------------
Commands that create tables echoing what you give them (notably tabdisp) are here more helpful than commands that create summaries and then create tables that show the summaries (e.g. tabulate, table).
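The percentages in the question (14.29, 42.86, 42.86) leave out the car with missing rep78, whereas the tables above include it as a separate row. As a sketch of one way to match the question's figures, the nomiss option of contract drops observations with missing values before the frequencies are computed. Everything else follows the technique above.

```stata
sysuse auto, clear
* zero keeps combinations with zero frequency; nomiss drops missing rep78
contract foreign rep78, zero nomiss
* Percentage of each repair record within car type
egen pc = pc(_freq), by(foreign)
tabdisp rep78 foreign if foreign, c(pc) format(%5.2f)
```

Because the contracted dataset always holds one row per level, a program posting results with putexcel can rely on a fixed row order.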