Stata: Combine and Sort categories

Stata: Combine and Sort categories - stata

Is that possible to ask stata to combine variables and sort them out in order?
My data file is a list of inventories, look something similar to the picture posted below. I have in total of 7 categories that I assign to a specific characteristic. However, these categories are not in order. For example, one would have satin and damask and the next would be damask and satin. Is that possible to ask stata to combine variables and sort them out in order?
I want to have a final column that contains all 7 categories and in order. For instance, no matter if the previous column's order is satin and damask or damask and satin, it will all become satin and damask at the end. No matter if the previous columns write fox wool satin in whatever order, it became the same order at the last column. There are about 100s of different words in the first category and then less and less in the following.
Then I can convert this from long-form to short-form to form a person list instead of a list of inventories for further graphing and calculations.enter image description here

* Example generated by -dataex-. To install: ssc install dataex
clear
input str6(cat1 cat2) str5 cat3
"satin" "damask" ""
"damask" "satin" ""
"wool" "fox" "satin"
"satin" "fox" "wool"
end
Part of what you want may just be a combined table. Install tab_chi from SSC using ssc install tab_chi and then you have tabm installed: see its help for more.
. tabm cat?
| values
variable | damask fox satin wool | Total
-----------+--------------------------------------------+----------
cat1 | 1 0 2 1 | 4
cat2 | 1 2 1 0 | 4
cat3 | 0 0 1 1 | 2
-----------+--------------------------------------------+----------
Total | 2 2 4 2 | 10
. tabm cat?, transpose
| variable
values | cat1 cat2 cat3 | Total
-----------+---------------------------------+----------
damask | 1 1 0 | 2
fox | 0 2 0 | 2
satin | 2 1 1 | 4
wool | 1 0 1 | 2
-----------+---------------------------------+----------
Total | 4 4 2 | 10
Note. What's with the foxes? Did foxes have to die so people could wear them?
Note. You may have to bite the bullet and reshape long.

Related

Computing Unemployment rates by education group from an indicator variable (Stata)

I have the following variable indicating whether an observation is working or unemployed, where 0 indicates working and 1 refers to unemployed.
dataex unemp
input float unemp
0
0
0
0
1
.
1
When I tabulate the variable:
Unemploymen |
t | Freq.
------------+--------------
Employed | 80
Unemployed | 20
Total LF 100
I essentially want to divide 20/100, to obtain a total unemployment variable of 20%. I have done this manually now, but think it is better to automate this as I also want to compute unemployment by different education groups and geographic regions.
gen unemployment_broad = .
replace unemployment_broad = (20/100)*100
The education variable is as follows, where 1 "Less than basic",
2 "Basic",
3 "Secondary",
4 "Higher education",
Is there a way to compute unemployment rate by each education group?
input float educ
2
4
4
4
2
4
1
3
3
3
Using Cybernike's solution, I tried to create a variable showing unemployment by education as follows, but I got an error:
gen unemp_educ = .
replace unemp_educ = bysort educ: summarize unemp
I essentially want to visualize unemployment by education. With something like this:
graph hbar (mean) Unemployment, over(education)
This is because I also intend to replicate the same equation by demographic group, gender, etc.

Your unemployment variable is coded as 0/1. Therefore, you can obtain the proportion unemployed by taking the mean value. You could do this using the summarize command, or using the collapse command. Both of these can be performed by education group.
clear
input unemp educ
0 2
0 4
0 4
0 4
1 2
0 3
1 3
1 1
1 3
end
bysort educ: summarize unemp
collapse (mean) unemp, by(educ)
list
+-----------------+
| educ unemp |
|-----------------|
1. | 1 1 |
2. | 2 .5 |
3. | 3 .6666667 |
4. | 4 0 |
+-----------------+
In response to your edit, you can also save the mean values to the original dataset using:
bysort educ: egen unemp_mean = mean(unemp)
Your code for plotting the data seems to work fine.

Which function I can use in Stata to replicate a quantitative variable?

I'm using a sample survey by persons of a country. Every person has an ID that represents the home whom he/she belongs. I'm doing a probit model to analyze the effect of household head's education on poverty, but I need to replicate the level of education of the head of household to all the members of the household.
How can I create a variable in Stata that replicates the level of education of the head of householdenter image description here to all the members of the household, if they share the same household ID?
I need to do something like the image. I need "schooling of the head of household" variable.

Your data example is helpful, but still ambiguous as the column headers are not all legal Stata variable names and it is not clear whether variables are string or numeric with value labels or numeric. See the Stata tag wiki for detailed advice on data examples.
This example works in terms of numeric variables.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end
bysort id : egen wanted = mean(cond(relationship == 1, schooling, .))
list, sepby(id)
+-----------------------------------+
| id relati~p school~g wanted |
|-----------------------------------|
1. | 1 1 4 4 |
2. | 1 2 4 4 |
3. | 1 3 2 4 |
|-----------------------------------|
4. | 2 1 5 5 |
5. | 2 2 4 5 |
|-----------------------------------|
6. | 3 1 5 5 |
7. | 3 3 1 5 |
+-----------------------------------+
If there is at most one person who is head of household, some other functions of the egen command would work to give the same result, including min(), max() and total(). If two or more people were recorded as head of household, then the mean would indeed be recorded and it might not be an integer.
For explanation and discussion, see Section 9 of this paper.

Count the number of distinct strings and their occurrence in a variable

I have a variable called Category that specifies the category of observations. The problem is that some observation have multiple categories. For example:
id Category
1 Economics
2 Biology
3 Psychology; Economics
4 Economics; Psychology
There is no meaning in the order of categories. They are always separated by ";". There are 250 categories, so creating dummy variables might be tricky. I have the complete list of categories in a separate Excel file if this might help.
What I want is simply to summarize my dataset by unique categories such as Economics (3), Psychology (2), Biology (1) (so the sum of all can be superior to the number of observations).

tabsplit from the tab_chi package on SSC will do this for you.
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
capture ssc install tab_chi
tabsplit Category, p(;)
Category | Freq. Percent Cum.
------------+-----------------------------------
Biology | 1 16.67 16.67
Economics | 3 50.00 66.67
Psychology | 2 33.33 100.00
------------+-----------------------------------
Total | 6 100.00
Note: You can count semi-colons and thus phrases like this.
gen count = 1 + length(category) - length(subinstr(category, ";", "", .))
The logic is that you measure the length of the string and its length should semi-colons ; be replaced by empty strings (namely, removed). The difference is the number of semi-colons, to which you add 1.
EDIT: How to get to a different data structure, starting with the data example above.
. split Category, p(;)
variables created as string:
Category1 Category2
. drop Category
. reshape long Category, i(id) j(mention)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 3 -> 3
j variable (2 values) -> mention
xij variables:
Category1 Category2 -> Category
-----------------------------------------------------------------------------
. drop if missing(Category)
(2 observations deleted)
. list, sepby(id)
+----------------------------+
| id mention Category |
|----------------------------|
1. | 1 1 Economics |
|----------------------------|
2. | 2 1 Biology |
|----------------------------|
3. | 3 1 Psychology |
4. | 3 2 Economics |
|----------------------------|
5. | 4 1 Economics |
6. | 4 2 Psychology |
+----------------------------+

Conditioning Stata dataset on past values of variables

I have a problem in conditioning the dataset I have on Stata. Basically I want to condition the presence in the dataset -within a certain group- of an observation for which a certain action is performed (as indicated by a variable) on the past values of another variable. So let's suppose I have the following
obs | id | action1 | action2 | year
1 | 1 | 1 | 0 | 2000
2 | 1 | 0 | 1 | 2001
3 | 1 | 0 | 1 | 2002
4 | 1 | 0 | 1 | 2002
5 | 1 | 0 | 1 | 2003
6 | 2 | 1 | 0 | 2000
7 | 2 | 1 | 0 | 2001
8 | 2 | 0 | 1 | 2002
9 | 2 | 0 | 1 | 2002
10 | 2 | 0 | 1 | 2003
And for each group identified by 'id' I want to keep the observation only if action 1 is performed or if action1 has been performed no earlier than 2 years before action2 has been performed. In this simplified example only observation 4 should be deleted. Please note that the 2 actions are not mutually exclusive and they can be performed more than once within the same year therefore looking at 2 observations in the past does not necessarily means to look at 2 years in the past.
A solution which I am not able to implement by code would be:
gen act1year= action1 * year
then by(id) store the value of act1year when they're different from 0 somewhere (I am not able to implement this)
and then by(id) keep if action1=1 or if action2[_n]=1 and the range year[_n] to year[_n]-2 contains at least one of the values in the previously stored variable.
I know probably my suggestion is not the easiest way to go and still I am not able to implement it, unfortunately I cannot manage to find a code that help me doing this. Hope you can help me. Thanks
Francesco

The following assumes certain things.
clear
set more off
input ///
obs id action1 action2 year
1 1 1 0 2000
2 1 0 1 2001
3 1 0 1 2002
4 1 0 1 2003
5 2 1 0 2000
6 2 0 1 2001
7 2 1 0 2002
8 2 0 1 2003
end
list, sepby(id)
*-----
bysort id (year) : keep if action1 | (action1[_n-1] + action1[_n-2] > 0)
list, sepby(id)
What is between parenthesis evaluates to one or zero depending on whether the inequality is true or false, respectively. This fragment indicates if action 1 was taken in either of the previous two observations.
You need to decide what to do with the first two observations, as they can't be compared with exactly two previous observations (they don't exist). In the following example they are always kept, because comparing with a non-existant observation in this case implies adding missing values, which results in missing. A missing is considered a very large number in Stata.
You can also work with time-series operators (help tsvarlist, help xtset) and really respect the time variable. Here, I work with the previous two observations. That may or may not coincide with the previous two time points.
I think your two actions are mutually exclusive, but you are not explicit about it.

How to obtain the order of multiple reponses?

I'm working on a survey dataset which contains a question with multiple responses. The data is not well cleaned for the order of responses depends on the order in which an interviewee chose the multiple options. So it's a so-called "many-to-many" multiple response (I borrow the term from N.J. Cox and U. Kohler's tutorial on this topic). There are also several following complementary questions (like the year a certain event happened) which share the order of the first question. The basic data structure is like
q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1 3 . 1998 1999 .
2 . . 2000 . .
3 2 . 2001 1997 .
I can use code provided in the tutorial cited to detect whether a certain value appears in q1_* and set a new dummy to 1 in this case. But how can I retain the order in which I encounter the certain value and use it in my analysis regarding q2_* in the loop?
forvalues i = 1/3 {
egen Q1_`i' = anymatch(q1_*), val(`i')
}
UPDATE
The current answer is brilliant, but it gives the general order, not the particular order in which a certain value occurs.
I may not have expressed my question clearly enough.
What I desire is to detect if a certain event (a option of the multiple responses represented by certain value like 3) happens. If it does happen, then set a new-created dummy, say eventhappens, to 1: so in my example, we shall set eventhappens to 1 for the first and third id.
If that's all my desire, then anymatch() suffices.
However, I also need to retain the order in which the particular value 3 occurs, like 2 for first observation, to ease the analysis of the following questions. So for the first id, 1999 is the year when the certain event happened, not 1998. Then what should I do?
Update
Appologize for my former unclear description. The real data is like (I don't have the authority to post a picture of the real data in Stata browse window)
id ce101_s_1 ce101_s_2 ... ce101_s_13 ce102_s_1 ...... ce102_s_13
1 1 2 13 1999 1998 2005
2 13 . . 1999 2007 .
the ce101_s_* is a list of variable,they represent the options interviewee choose with regarding to question ce101 and their orders are the orders in which interviewee make the choice.Certain value(in the real data is chinese character with value labels)represents certain event had occured, for example 1 represents a villiage build its own hospital,13 represent a villiage has mobile signal and so on.Take id_1 for example, this village build a hospital (represented by 1) in 1999, build a preliminary school(represented by 2) in 1998 and so on, in fact , all event listed actually happened in id_1 village,but for id_2 only 2 and 13 event happens. The difficulty for me is to retain the order certain event happened in each villiage, take 13(mobile signal for instance),it occured in 2005 for id_1 village, because interviwee choose it at 13th order when answering question ce101, and the value of ce102_s_13 is 2005.But for id_2, interviewee choose it at the second order and the correponding value in ce102 is 2007.So if a want to create a dummy to represent if household live in certain villiage before certain event occur in this village, I need the order in ce102_s_*
.

I am not especially clear what you want, but I suspect the one-word answer is reshape. This structure may make it easier for you to cross-relate responses.
. input id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1. 1 1 3 . 1998 1999 .
2. 2 2 . . 2000 . .
3. 3 3 2 . 2001 1997 .
4. end
. reshape long q , i(id) j(Q) string
(note: j = 1_1 1_2 1_3 2_1 2_2 2_3)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 18
Number of variables 7 -> 3
j variable (6 values) -> Q
xij variables:
q1_1 q1_2 ... q2_3 -> q
-----------------------------------------------------------------------------
. rename q answer
. split Q, parse(_) destring
variables born as string:
Q1 Q2
Q1 has all characters numeric; replaced as byte
Q2 has all characters numeric; replaced as byte
. rename Q1 question
. rename Q2 order
. list, sepby(id)
+--------------------------------------+
| id Q answer question order |
|--------------------------------------|
1. | 1 1_1 1 1 1 |
2. | 1 1_2 3 1 2 |
3. | 1 1_3 . 1 3 |
4. | 1 2_1 1998 2 1 |
5. | 1 2_2 1999 2 2 |
6. | 1 2_3 . 2 3 |
|--------------------------------------|
7. | 2 1_1 2 1 1 |
8. | 2 1_2 . 1 2 |
9. | 2 1_3 . 1 3 |
10. | 2 2_1 2000 2 1 |
11. | 2 2_2 . 2 2 |
12. | 2 2_3 . 2 3 |
|--------------------------------------|
13. | 3 1_1 3 1 1 |
14. | 3 1_2 2 1 2 |
15. | 3 1_3 . 1 3 |
16. | 3 2_1 2001 2 1 |
17. | 3 2_2 1997 2 2 |
18. | 3 2_3 . 2 3 |
+--------------------------------------+

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js