I have the following variable indicating whether an observation is working or unemployed, where 0 indicates working and 1 refers to unemployed.
dataex unemp
input float unemp
0
0
0
0
1
.
1
When I tabulate the variable:
Unemployment |      Freq.
-------------+-----------
    Employed |         80
  Unemployed |         20
-------------+-----------
    Total LF |        100
I essentially want to divide 20 by 100 to obtain an overall unemployment rate of 20%. I have done this manually for now, but I would rather automate it, as I also want to compute unemployment by education group and geographic region.
gen unemployment_broad = .
replace unemployment_broad = (20/100)*100
The education variable is coded as follows:
1 "Less than basic"
2 "Basic"
3 "Secondary"
4 "Higher education"
Is there a way to compute unemployment rate by each education group?
input float educ
2
4
4
4
2
4
1
3
3
3
Using Cybernike's solution, I tried to create a variable showing unemployment by education as follows, but I got an error:
gen unemp_educ = .
replace unemp_educ = bysort educ: summarize unemp
I essentially want to visualize unemployment by education. With something like this:
graph hbar (mean) Unemployment, over(education)
This is because I also intend to repeat the same calculation by demographic group, gender, etc.
Your unemployment variable is coded as 0/1. Therefore, you can obtain the proportion unemployed by taking the mean value. You could do this using the summarize command, or using the collapse command. Both of these can be performed by education group.
clear
input unemp educ
0 2
0 4
0 4
0 4
1 2
0 3
1 3
1 1
1 3
end
bysort educ: summarize unemp
collapse (mean) unemp, by(educ)
list
+-----------------+
| educ unemp |
|-----------------|
1. | 1 1 |
2. | 2 .5 |
3. | 3 .6666667 |
4. | 4 0 |
+-----------------+
In response to your edit, you can also save the mean values to the original dataset using:
bysort educ: egen unemp_mean = mean(unemp)
Your code for plotting the data seems to work fine.
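As a cross-check outside Stata, here is a minimal Python sketch of the same idea: the mean of a 0/1 indicator within a group is the proportion unemployed in that group. It is not part of the original answer. The function name `rate_by_group` is made up for illustration, missing values (Stata's `.`) are represented by `None`, and the group variable is assumed to have no missing values.

```python
from collections import defaultdict

def rate_by_group(unemp, educ):
    """Mean of a 0/1 indicator within each group, skipping missing indicators (None)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for u, g in zip(unemp, educ):
        if u is None:
            continue  # a missing indicator does not enter the mean
        sums[g] += u
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sorted(counts)}

# Same nine observations as the input block in the answer above
unemp = [0, 0, 0, 0, 1, 0, 1, 1, 1]
educ = [2, 4, 4, 4, 2, 3, 3, 1, 3]
rates = rate_by_group(unemp, educ)
print(rates)
```

Multiplying each value by 100 gives the rate as a percentage, which is what the manual 20/100 calculation in the question was doing.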
I'm using a sample survey of persons in a country. Every person has an ID that identifies the household to which he/she belongs. I'm fitting a probit model to analyze the effect of the household head's education on poverty, but I need to replicate the level of education of the head of household to all the members of the household.
How can I create a variable in Stata that replicates the level of education of the head of household to all the members of the household, if they share the same household ID?
I need a "schooling of the head of household" variable, as in the image.
Your data example is helpful, but still ambiguous as the column headers are not all legal Stata variable names and it is not clear whether variables are string or numeric with value labels or numeric. See the Stata tag wiki for detailed advice on data examples.
This example works in terms of numeric variables.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end
bysort id : egen wanted = mean(cond(relationship == 1, schooling, .))
list, sepby(id)
+-----------------------------------+
| id relati~p school~g wanted |
|-----------------------------------|
1. | 1 1 4 4 |
2. | 1 2 4 4 |
3. | 1 3 2 4 |
|-----------------------------------|
4. | 2 1 5 5 |
5. | 2 2 4 5 |
|-----------------------------------|
6. | 3 1 5 5 |
7. | 3 3 1 5 |
+-----------------------------------+
If there is at most one person who is head of household, some other functions of the egen command would work to give the same result, including min(), max() and total(). If two or more people were recorded as head of household, then the mean would indeed be recorded and it might not be an integer.
For explanation and discussion, see Section 9 of this paper.
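The logic of the egen line can be sketched in Python as a rough check: within each household, average schooling over the members flagged as head, then copy that value to every member. This is an illustration only. The name `head_schooling` is hypothetical, and `None` stands in for a Stata missing value when a household has no recorded head.

```python
def head_schooling(rows):
    """rows: list of (id, relationship, schooling); relationship == 1 marks the head."""
    totals, counts = {}, {}
    for hh, rel, school in rows:
        if rel == 1 and school is not None:
            totals[hh] = totals.get(hh, 0) + school
            counts[hh] = counts.get(hh, 0) + 1
    # Spread the household-level mean to every member of that household
    return [totals[hh] / counts[hh] if hh in counts else None for hh, _, _ in rows]

# Same data as the dataex example above
rows = [(1, 1, 4), (1, 2, 4), (1, 3, 2), (2, 1, 5), (2, 2, 4), (3, 1, 5), (3, 3, 1)]
wanted = head_schooling(rows)
print(wanted)
```

With two heads recorded in one household, say schooling 4 and 5, the result would be 4.5 for everyone in that household, which is the non-integer case mentioned above.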
I have a variable called Category that specifies the category of observations. The problem is that some observations have multiple categories. For example:
id Category
1 Economics
2 Biology
3 Psychology; Economics
4 Economics; Psychology
There is no meaning in the order of categories. They are always separated by ";". There are 250 categories, so creating dummy variables might be tricky. I have the complete list of categories in a separate Excel file if this might help.
What I want is simply to summarize my dataset by unique categories such as Economics (3), Psychology (2), Biology (1) (so the sum of all counts can be greater than the number of observations).
tabsplit from the tab_chi package on SSC will do this for you.
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
capture ssc install tab_chi
tabsplit Category, p(;)
Category | Freq. Percent Cum.
------------+-----------------------------------
Biology | 1 16.67 16.67
Economics | 3 50.00 66.67
Psychology | 2 33.33 100.00
------------+-----------------------------------
Total | 6 100.00
Note: You can count semi-colons and thus phrases like this.
gen count = 1 + length(Category) - length(subinstr(Category, ";", "", .))
The logic is that you compare the length of the string with the length it would have if the semi-colons ; were replaced by empty strings (namely, removed). The difference is the number of semi-colons, to which you add 1.
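The same length-difference identity can be tried out in Python. This is only a sketch to show the arithmetic; `n_categories` is a made-up name, not anything from Stata or tab_chi.

```python
def n_categories(text, sep=";"):
    """Number of separated phrases = 1 + number of separators."""
    # Removing the separators shortens the string by one character per separator
    return 1 + len(text) - len(text.replace(sep, ""))

single = n_categories("Economics")
double = n_categories("Psychology; Economics")
print(single, double)
```

Note that an empty string would also yield 1 under this rule, so empty values may need to be handled separately.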
EDIT: How to get to a different data structure, starting with the data example above.
. split Category, p(;)
variables created as string:
Category1 Category2
. drop Category
. reshape long Category, i(id) j(mention)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 3 -> 3
j variable (2 values) -> mention
xij variables:
Category1 Category2 -> Category
-----------------------------------------------------------------------------
. drop if missing(Category)
(2 observations deleted)
. list, sepby(id)
+----------------------------+
| id mention Category |
|----------------------------|
1. | 1 1 Economics |
|----------------------------|
2. | 2 1 Biology |
|----------------------------|
3. | 3 1 Psychology |
4. | 3 2 Economics |
|----------------------------|
5. | 4 1 Economics |
6. | 4 2 Psychology |
+----------------------------+
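Once each mention sits in its own row, frequencies are a plain tabulation. As a rough cross-check of the counts reported by tabsplit above, here is a short Python sketch that splits on the separator, trims spaces, and tallies. The function name `tally_categories` is hypothetical, and trimming the spaces around each phrase is an assumption about how the data should be cleaned.

```python
from collections import Counter

def tally_categories(values, sep=";"):
    """Count each distinct phrase across all separator-delimited strings."""
    counts = Counter()
    for value in values:
        for part in value.split(sep):
            part = part.strip()  # "Psychology; Economics" leaves a leading space
            if part:
                counts[part] += 1
    return counts

# Same four observations as the data example
values = ["Economics", "Biology", "Psychology; Economics", "Economics; Psychology"]
freq = tally_categories(values)
print(freq.most_common())
```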
I'm collapsing my data using weight, but I only want the weight to apply to my median and sum, not my count. I want my count to only be the sample size, not the population size.
Example:
. input outcome group weight
outcome group weight
1. 1 1 3
2. 1 2 3
3. 1 3 3
4. end
Running collapse (sum) outcome (count) n = outcome [pweight = weight], by(group) gives
. list
+---------------------+
| group outcome n |
|---------------------|
1. | 1 3 3 |
2. | 2 3 3 |
3. | 3 3 3 |
+---------------------+
Both the sum and count are using the weight. I want the count to be the sample size, i.e. 1 for each group.
Unfortunately it is not possible to have different weights when using collapse.
The few solutions I have in mind:
create the weights yourself in the data, and compute your weighted statistics yourself
have a look at the user-written version of collapse, which might include this feature. For instance, collapse2 or xcollapse
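To make the first suggestion concrete, the following Python sketch computes a weighted sum alongside an unweighted count for each group. It only illustrates the arithmetic and is not a replacement for collapse. The function name `collapse_by_group` is made up, and the weighted median is left out because it needs more care.

```python
def collapse_by_group(rows):
    """rows: (outcome, group, weight). Returns {group: (weighted_sum, n)}."""
    out = {}
    for outcome, group, weight in rows:
        wsum, n = out.get(group, (0, 0))
        # The weight enters the sum only; the count tracks sample size
        out[group] = (wsum + weight * outcome, n + 1)
    return out

# Same three observations as the example in the question
rows = [(1, 1, 3), (1, 2, 3), (1, 3, 3)]
result = collapse_by_group(rows)
print(result)
```

Here each group has a weighted sum of 3 and a sample size of 1, which is the result the question asks for.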
I'm using Stata, and I'm trying to compute the average price of firms' rivals in a market. I have data that looks like:
Market Firm Price
----------------------
1 1 100
1 2 150
1 3 125
2 1 50
2 2 100
2 3 75
3 1 100
3 2 200
3 3 200
And I'm trying to compute the average price of each firm's rivals, so I want to generate a new field that is the average values of the other firms in a market. It would look like:
Market Firm Price AvRivalPrice
------------------------------------
1 1 100 137.5
1 2 150 112.5
1 3 125 125
2 1 50 87.5
2 2 100 62.5
2 3 75 75
3 1 100 200
3 2 200 150
3 3 200 150
To do the average by group, I could use the egen command:
egen AvPrice = mean(Price), by(Market)
But that wouldn't exclude the firm's own price in the average, and to the best of my knowledge, using the if qualifier would only change the observations it operated on, not the groups it averaged over. Is there a simple way to do this, or do I need to create loops and generate each average manually?
This is an old thread still of interest, so materials and techniques overlooked first time round still apply.
The more general technique is to work with totals. At its simplest, total of others = total of all - this value. In an egen framework that is going to look like
egen total = total(price), by(market)
egen n = total(!missing(price)), by(market)
gen avprice = (total - cond(missing(price), 0, price)) / cond(missing(price), n, n - 1)
The total() function of egen ignores missing values in its argument. If there are missing values, we don't want to include them in the count, but we can use !missing() which yields 1 if not missing and 0 if missing. egen's count() is another way to do this.
Code given earlier gives the wrong answer if missings are present as they are included in the count _N.
Even if a value is missing, the average of the other values still makes sense.
If no value is missing, the last line above simplifies to
gen avprice = (total - price) / (n - 1)
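For readers who want to check the arithmetic, here is a minimal Python sketch of the totals approach for a single market, including the handling of missing prices. It is an illustration under stated assumptions: `mean_of_others` is a hypothetical name, `None` stands in for Stata's missing value, and a result of `None` is returned when there are no other non-missing prices.

```python
def mean_of_others(prices):
    """Leave-one-out mean within one group; None marks a missing price."""
    present = [p for p in prices if p is not None]
    total, n = sum(present), len(present)
    result = []
    for p in prices:
        own = 0 if p is None else p           # a missing price contributes nothing to the total
        denom = n if p is None else n - 1     # and was never part of the count
        result.append((total - own) / denom if denom > 0 else None)
    return result

market1 = mean_of_others([100, 150, 125])     # market 1 from the question
with_gap = mean_of_others([100, None, 125])   # hypothetical data with one missing price
print(market1, with_gap)
```

The firm with the missing price still gets the average of the two observed prices, as described above.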
So far, this possibly looks like no more than a small variant on previous code, but it does extend easily to using weights. Presumably we want a weighted average of others' prices given some weight. We can exploit the fact that total() works on expressions, which can be more complicated than just variable names. Indeed the code above did that already, but it is often overlooked.
egen wttotal = total(weight * price), by(market)
egen sumwt = total(weight), by(market)
gen avprice = (wttotal - price * weight) / (sumwt - weight)
As before, if price or weight is ever missing, you need more complicated code, or just to ensure that you exclude such observations from the calculations.
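The weighted version can be sketched the same way in Python. The weights below are invented purely for illustration, the name `weighted_mean_of_others` is hypothetical, and the sketch assumes no missing values and at least two firms with positive weight per market.

```python
def weighted_mean_of_others(prices, weights):
    """Weighted leave-one-out mean within one group (no missing values assumed)."""
    wttotal = sum(w * p for p, w in zip(prices, weights))
    sumwt = sum(weights)
    # Remove each firm's own weighted price and its own weight
    return [(wttotal - p * w) / (sumwt - w) for p, w in zip(prices, weights)]

prices = [100, 150, 125]                                  # market 1 from the question
equal = weighted_mean_of_others(prices, [1, 1, 1])        # reduces to the unweighted case
unequal = weighted_mean_of_others(prices, [1, 2, 1])      # hypothetical weights
print(equal, unequal)
```

With equal weights the result matches the unweighted averages, which is a useful sanity check before applying real weights.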
See also the Stata FAQ
How do I create variables summarizing for each individual properties of the other members of a group?
http://www.stata.com/support/faqs/data-management/creating-variables-recording-properties/
for a wider-ranging discussion.
(If the numbers get big, work with doubles.)
EDIT 2 March 2018 That was a newer post in an old thread, which in turn needs updating. rangestat (SSC) can be used here and gives one-line solutions. Not surprisingly, the option excludeself was explicitly added for these kinds of problem. But while the solution for means is easy using an identity
mean for others = (total - value for self) / (count - 1)
many other summary measures don't yield to a similar, simple trick and in that sense rangestat includes much more general coding.
clear
input Market Firm Price
1 1 100
1 2 150
1 3 125
2 1 50
2 2 100
2 3 75
3 1 100
3 2 200
3 3 200
end
rangestat (mean) Price, interval(Firm . .) by(Market) excludeself
list, sepby(Market)
+----------------------------------+
| Market Firm Price Price_~n |
|----------------------------------|
1. | 1 1 100 137.5 |
2. | 1 2 150 112.5 |
3. | 1 3 125 125 |
|----------------------------------|
4. | 2 1 50 87.5 |
5. | 2 2 100 62.5 |
6. | 2 3 75 75 |
|----------------------------------|
7. | 3 1 100 200 |
8. | 3 2 200 150 |
9. | 3 3 200 150 |
+----------------------------------+
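The remark that other summary measures do not yield to the totals trick can be illustrated with a brute-force Python sketch: drop each observation in turn and apply any statistic to the rest. This says nothing about how rangestat is implemented; the function name `stat_of_others` and the four-firm market are made up for illustration.

```python
from statistics import mean, median

def stat_of_others(values, stat):
    """Apply stat to all values in the group except the current one."""
    return [stat(values[:i] + values[i + 1:]) for i in range(len(values))]

# Market 1 from the example: the mean of others matches the listing above
others_mean = stat_of_others([100, 150, 125], mean)

# Hypothetical four-firm market, where the median has no total-minus-self shortcut
others_median = stat_of_others([50, 100, 75, 400], median)
print(others_mean, others_median)
```

Brute force costs more as groups grow, but it works for any statistic, which is the generality referred to above.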
This is a way that avoids explicit loops, though it takes several lines of code:
bysort Market: egen Total = total(Price)
replace Total = Total - Price
by Market: gen AvRivalPrice = Total / (_N-1)
drop Total
Here's a shorter solution that combines your original idea with @onestop's solution:
egen AvPrice = mean(Price), by(Market)
bysort Market: replace AvPrice = (AvPrice*_N - Price)/(_N-1)
This is all good for a census of firms. If you have a sample of the firms, and you need to apply the weights, I am not sure what a good solution would be. We can brainstorm it if needed.