Count the number of distinct strings and their occurrence in a variable


I have a variable called Category that specifies the category of each observation. The problem is that some observations have multiple categories. For example:
id Category
1 Economics
2 Biology
3 Psychology; Economics
4 Economics; Psychology
There is no meaning in the order of categories. They are always separated by ";". There are 250 categories, so creating dummy variables might be tricky. I have the complete list of categories in a separate Excel file if this might help.
What I want is simply to summarize my dataset by unique categories, such as Economics (3), Psychology (2), Biology (1) (so the sum of the counts can exceed the number of observations).

tabsplit from the tab_chi package on SSC will do this for you.
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
capture ssc install tab_chi
tabsplit Category, p(;)
Category | Freq. Percent Cum.
------------+-----------------------------------
Biology | 1 16.67 16.67
Economics | 3 50.00 66.67
Psychology | 2 33.33 100.00
------------+-----------------------------------
Total | 6 100.00
Note: You can count semi-colons and thus phrases like this.
gen count = 1 + length(Category) - length(subinstr(Category, ";", "", .))
The logic is that you compare the length of the string with the length it would have if every semi-colon ; were replaced by an empty string (namely, removed). The difference is the number of semi-colons, to which you add 1 to get the number of phrases.
EDIT: How to get to a different data structure, starting with the data example above.
. split Category, p(;)
variables created as string:
Category1 Category2
. drop Category
. reshape long Category, i(id) j(mention)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 3 -> 3
j variable (2 values) -> mention
xij variables:
Category1 Category2 -> Category
-----------------------------------------------------------------------------
. drop if missing(Category)
(2 observations deleted)
. list, sepby(id)
+----------------------------+
| id mention Category |
|----------------------------|
1. | 1 1 Economics |
|----------------------------|
2. | 2 1 Biology |
|----------------------------|
3. | 3 1 Psychology |
4. | 3 2 Economics |
|----------------------------|
5. | 4 1 Economics |
6. | 4 2 Psychology |
+----------------------------+
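With the data in this long layout, the counts asked for in the question can also be obtained with official commands only. What follows is a minimal sketch, assuming the example data above. The trim() step is an added precaution, not part of the original answer: splitting on ";" alone can leave a leading space on the second and later pieces, in which case " Economics" and "Economics" would be counted as different values.

```stata
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end

* one new string variable per mention: Category1, Category2
split Category, p(;)
drop Category

* one observation per id and mention
reshape long Category, i(id) j(mention)

* added precaution: remove spaces left over from splitting
replace Category = trim(Category)
drop if missing(Category)

* frequency of each distinct category, most frequent first
tabulate Category, sort
```

The frequencies should agree with the tabsplit table, and the total is the number of mentions rather than the number of ids.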

Related

Computing Unemployment rates by education group from an indicator variable (Stata)

I have the following variable indicating whether an observation is working or unemployed, where 0 indicates working and 1 refers to unemployed.
dataex unemp
input float unemp
0
0
0
0
1
.
1
When I tabulate the variable:
Unemployment |      Freq.
-------------+-----------
    Employed |         80
  Unemployed |         20
-------------+-----------
    Total LF |        100
I essentially want to divide 20 by 100 to obtain an overall unemployment rate of 20%. I have done this manually for now, but I think it is better to automate it, as I also want to compute unemployment by education group and geographic region.
gen unemployment_broad = .
replace unemployment_broad = (20/100)*100
The education variable is coded as follows: 1 "Less than basic", 2 "Basic", 3 "Secondary", 4 "Higher education".
Is there a way to compute unemployment rate by each education group?
input float educ
2
4
4
4
2
4
1
3
3
3
Using Cybernike's solution, I tried to create a variable showing unemployment by education as follows, but I got an error:
gen unemp_educ = .
replace unemp_educ = bysort educ: summarize unemp
I essentially want to visualize unemployment by education. With something like this:
graph hbar (mean) Unemployment, over(education)
This is because I also intend to replicate the same equation by demographic group, gender, etc.

Your unemployment variable is coded as 0/1. Therefore, you can obtain the proportion unemployed by taking the mean value. You could do this using the summarize command, or using the collapse command. Both of these can be performed by education group.
clear
input unemp educ
0 2
0 4
0 4
0 4
1 2
0 3
1 3
1 1
1 3
end
bysort educ: summarize unemp
collapse (mean) unemp, by(educ)
list
+-----------------+
| educ unemp |
|-----------------|
1. | 1 1 |
2. | 2 .5 |
3. | 3 .6666667 |
4. | 4 0 |
+-----------------+
In response to your edit, you can also save the mean values to the original dataset using:
bysort educ: egen unemp_mean = mean(unemp)
Your code for plotting the data seems to work fine.
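As a sketch of the plot applied to the example data in this answer: the variable names unemp and educ come from the example above, while unemp_pct and the label name educ_lbl are hypothetical names introduced here. The value labels are the ones listed in the question. Observations with missing unemp are ignored when the mean is calculated.

```stata
clear
input unemp educ
0 2
0 4
0 4
0 4
1 2
0 3
1 3
1 1
1 3
end

* attach the education labels given in the question
label define educ_lbl 1 "Less than basic" 2 "Basic" 3 "Secondary" 4 "Higher education"
label values educ educ_lbl

* rescale the 0/1 indicator so that group means read as percentages
gen unemp_pct = 100 * unemp

* one horizontal bar per education group, bar length = mean of unemp_pct
graph hbar (mean) unemp_pct, over(educ) ytitle("Unemployment rate (%)")
```

The same pattern should carry over to other groupings by changing the variable named in over().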

Stata: Combine and Sort categories

Is it possible to ask Stata to combine variables and sort their contents into a consistent order?
My data file is a list of inventories and looks something like the picture posted below. I have a total of 7 categories that I assign to a specific characteristic. However, these categories are not in order. For example, one observation would have satin and damask and the next would have damask and satin.
I want a final column that contains all 7 categories in a fixed order. For instance, whether the previous columns read satin and damask or damask and satin, the result should always be satin and damask. Likewise, whatever order fox, wool and satin appear in, they should end up in the same order in the last column. There are hundreds of different words in the first category, and fewer and fewer in the following ones.
Then I can convert this from long form to short form to get a list of persons instead of a list of inventories, for further graphing and calculations.

* Example generated by -dataex-. To install: ssc install dataex
clear
input str6(cat1 cat2) str5 cat3
"satin" "damask" ""
"damask" "satin" ""
"wool" "fox" "satin"
"satin" "fox" "wool"
end
Part of what you want may just be a combined table. Install tab_chi from SSC using ssc install tab_chi and then you have tabm installed: see its help for more.
. tabm cat?
| values
variable | damask fox satin wool | Total
-----------+--------------------------------------------+----------
cat1 | 1 0 2 1 | 4
cat2 | 1 2 1 0 | 4
cat3 | 0 0 1 1 | 2
-----------+--------------------------------------------+----------
Total | 2 2 4 2 | 10
. tabm cat?, transpose
| variable
values | cat1 cat2 cat3 | Total
-----------+---------------------------------+----------
damask | 1 1 0 | 2
fox | 0 2 0 | 2
satin | 2 1 1 | 4
wool | 1 0 1 | 2
-----------+---------------------------------+----------
Total | 4 4 2 | 10
Note. What's with the foxes? Did foxes have to die so people could wear them?
Note. You may have to bite the bullet and reshape long.
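A minimal sketch of what the reshape route might look like with the example data follows. It assumes that alphabetical order is an acceptable fixed order, which the question does not state. The names obsno, which and combined are hypothetical, and str80 is an arbitrary width chosen to be wide enough for the example.

```stata
clear
input str6(cat1 cat2) str5 cat3
"satin" "damask" ""
"damask" "satin" ""
"wool" "fox" "satin"
"satin" "fox" "wool"
end

* identifier for each original row
gen long obsno = _n

* one observation per row and word
reshape long cat, i(obsno) j(which)
drop if missing(cat)

* within each row, sort words alphabetically and accumulate them
bysort obsno (cat) : gen str80 combined = cat if _n == 1
by obsno : replace combined = combined[_n-1] + " " + cat if _n > 1

* copy the complete string to every observation of the row
by obsno : replace combined = combined[_N]

* back to one observation per original row
reshape wide cat, i(obsno) j(which)
list, sep(0)
```

Rows holding the same words in any order then share the same value of combined, so the variable can be used for grouping or tabulation.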

Which function can I use in Stata to replicate a quantitative variable?

I'm using a sample survey of persons in a country. Every person has an ID that identifies the household to which he/she belongs. I'm fitting a probit model to analyze the effect of the household head's education on poverty, but I need to replicate the level of education of the head of household for all the members of the household.
How can I create a variable in Stata that replicates the level of education of the head of household for all the members of the household, given that they share the same household ID?
I need something like the image: a "schooling of the head of household" variable.

Your data example is helpful, but still ambiguous as the column headers are not all legal Stata variable names and it is not clear whether variables are string or numeric with value labels or numeric. See the Stata tag wiki for detailed advice on data examples.
This example works in terms of numeric variables.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end
bysort id : egen wanted = mean(cond(relationship == 1, schooling, .))
list, sepby(id)
+-----------------------------------+
| id relati~p school~g wanted |
|-----------------------------------|
1. | 1 1 4 4 |
2. | 1 2 4 4 |
3. | 1 3 2 4 |
|-----------------------------------|
4. | 2 1 5 5 |
5. | 2 2 4 5 |
|-----------------------------------|
6. | 3 1 5 5 |
7. | 3 3 1 5 |
+-----------------------------------+
If there is at most one person who is head of household, some other functions of the egen command would work to give the same result, including min(), max() and total(). If two or more people were recorded as head of household, then the mean would indeed be recorded and it might not be an integer.
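The following sketch checks that claim on the example data. It assumes, as the example does, that relationship == 1 identifies the head of household. The variables wanted_min, wanted_max and n_heads are hypothetical names added for the check.

```stata
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end

bysort id : egen wanted = mean(cond(relationship == 1, schooling, .))

* the same idea with min() and max()
bysort id : egen wanted_min = min(cond(relationship == 1, schooling, .))
bysort id : egen wanted_max = max(cond(relationship == 1, schooling, .))

* number of people recorded as head in each household
bysort id : egen n_heads = total(relationship == 1)

* with exactly one head, all three versions must agree
assert wanted == wanted_min if n_heads == 1
assert wanted == wanted_max if n_heads == 1

* households needing attention: no head, or more than one
list id relationship schooling if n_heads != 1, sepby(id)
```

In real data, counting heads first is a cheap way to find households where the result would be missing or an average over several people.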
For explanation and discussion, see Section 9 of this paper.

How to obtain the order of multiple responses?

I'm working on a survey dataset which contains a question with multiple responses. The data are not well cleaned, because the order of responses depends on the order in which an interviewee chose the options. So it's a so-called "many-to-many" multiple response (I borrow the term from N.J. Cox and U. Kohler's tutorial on this topic). There are also several follow-up questions (like the year in which a certain event happened) which share the order of the first question. The basic data structure is like
q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1 3 . 1998 1999 .
2 . . 2000 . .
3 2 . 2001 1997 .
I can use code provided in the tutorial cited to detect whether a certain value appears in q1_* and set a new dummy to 1 in this case. But how can I retain the order in which I encounter the certain value and use it in my analysis regarding q2_* in the loop?
forvalues i = 1/3 {
egen Q1_`i' = anymatch(q1_*), val(`i')
}
UPDATE
The current answer is brilliant, but it gives the general order, not the particular order in which a certain value occurs.
I may not have expressed my question clearly enough.
What I want is to detect whether a certain event (an option of the multiple responses, represented by a certain value such as 3) happens. If it does happen, then a newly created dummy, say eventhappens, should be set to 1: so in my example, eventhappens should be 1 for the first and third id.
If that were all I wanted, anymatch() would suffice.
However, I also need to retain the position in which the particular value 3 occurs, such as 2 for the first observation, to ease the analysis of the follow-up questions. So for the first id, 1999 is the year when the event happened, not 1998. What should I do?
Update
Apologies for my earlier unclear description. The real data look like this (I don't have the authority to post a picture of the real data in the Stata browse window):
id ce101_s_1 ce101_s_2 ... ce101_s_13 ce102_s_1 ...... ce102_s_13
1 1 2 13 1999 1998 2005
2 13 . . 1999 2007 .
The ce101_s_* variables record the options an interviewee chose in answer to question ce101, in the order in which the interviewee chose them. Each value (in the real data, Chinese characters with value labels) indicates that a certain event occurred: for example, 1 means that a village built its own hospital, 13 means that a village has mobile signal, and so on. Take id 1 as an example: this village built a hospital (represented by 1) in 1999, built a preliminary school (represented by 2) in 1998, and so on. In fact, all the listed events happened in village 1, but in village 2 only events 2 and 13 happened. The difficulty for me is to retain the position in which a certain event was chosen in each village. Take 13 (mobile signal), for instance: it occurred in 2005 in village 1, because the interviewee chose it in 13th position when answering question ce101, and the value of ce102_s_13 is 2005. But for village 2, the interviewee chose it in second position, and the corresponding value in ce102 is 2007. So if I want to create a dummy indicating whether a household lived in a certain village before a certain event occurred there, I need the order in ce102_s_*.

I am not especially clear what you want, but I suspect the one-word answer is reshape. This structure may make it easier for you to cross-relate responses.
. input id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1. 1 1 3 . 1998 1999 .
2. 2 2 . . 2000 . .
3. 3 3 2 . 2001 1997 .
4. end
. reshape long q , i(id) j(Q) string
(note: j = 1_1 1_2 1_3 2_1 2_2 2_3)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 18
Number of variables 7 -> 3
j variable (6 values) -> Q
xij variables:
q1_1 q1_2 ... q2_3 -> q
-----------------------------------------------------------------------------
. rename q answer
. split Q, parse(_) destring
variables born as string:
Q1 Q2
Q1 has all characters numeric; replaced as byte
Q2 has all characters numeric; replaced as byte
. rename Q1 question
. rename Q2 order
. list, sepby(id)
+--------------------------------------+
| id Q answer question order |
|--------------------------------------|
1. | 1 1_1 1 1 1 |
2. | 1 1_2 3 1 2 |
3. | 1 1_3 . 1 3 |
4. | 1 2_1 1998 2 1 |
5. | 1 2_2 1999 2 2 |
6. | 1 2_3 . 2 3 |
|--------------------------------------|
7. | 2 1_1 2 1 1 |
8. | 2 1_2 . 1 2 |
9. | 2 1_3 . 1 3 |
10. | 2 2_1 2000 2 1 |
11. | 2 2_2 . 2 2 |
12. | 2 2_3 . 2 3 |
|--------------------------------------|
13. | 3 1_1 3 1 1 |
14. | 3 1_2 2 1 2 |
15. | 3 1_3 . 1 3 |
16. | 3 2_1 2001 2 1 |
17. | 3 2_2 1997 2 2 |
18. | 3 2_3 . 2 3 |
+--------------------------------------+
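If the aim is the year attached to one particular response, a variation on the reshape may be more direct: reshape so that each response sits beside its year. This is only a sketch under the assumption that q1_j and q2_j always refer to the same position j. The names response, year, eventhappens and year_event3 are hypothetical, and 3 stands for whichever event is of interest.

```stata
clear
input id q1_1 q1_2 q1_3 q2_1 q2_2 q2_3
1 1 3 . 1998 1999 .
2 2 . . 2000 . .
3 3 2 . 2001 1997 .
end

* one observation per id and position, response and year side by side
reshape long q1_ q2_, i(id) j(order)
rename q1_ response
rename q2_ year

* 1 if event 3 was mentioned at all by this id, 0 otherwise
bysort id : egen eventhappens = max(response == 3)

* year reported in the same position as event 3, missing if never mentioned
bysort id : egen year_event3 = max(cond(response == 3, year, .))

list, sepby(id)
```

For the example data this picks 1999 rather than 1998 for the first id, which is the behaviour asked for in the update to the question.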

Generating a new variable by selection from multiple variables

I have some data on diseases and age of diagnosis. Each participant was asked what diseases they have had and at what age that disease was diagnosed.
There is a set of variables disease1-28 holding a numeric code for each disease, and another set age1-28 holding the age at diagnosis in years. The diseases are placed in successive variables in the order recalled; the age at diagnosis is placed in the corresponding age variable.
I would like to generate a new variable for each of several diseases giving the age of diagnosis of that disease: e.g. asthma_age_at_diagnosis
Can I do this without having 28 replace statements?
Example of the data:
+-------------+----------+----------+----------+------+------+------+
| Participant | Disease1 | Disease2 | Disease3 | Age1 | Age2 | Age3 |
+-------------+----------+----------+----------+------+------+------+
| 1 | 123 | 3 | . | 30 | 2 | . |
| 2 | 122 | 123 | 5 | 23 | 51 | 44 |
| 3 | 5 | . | . | 50 | . | . |
+-------------+----------+----------+----------+------+------+------+

I give a general heads-up that a question of this form without any code of your own is often considered off-topic for Stack Overflow. Still, the Stata users around here are the people answering Stata questions (surprise) and we usually indulge questions like this if interesting and well-posed.
I'd advise a different data structure, period. With your example data
clear
input Patient Disease1 Disease2 Disease3 Age1 Age2 Age3
1 123 3 . 30 2 .
2 122 123 5 23 51 44
3 5 . . 50 . .
end
You can reshape
reshape long Disease Age, i(Patient) j(Order)
drop if missing(Disease)
list, sep(0)
+--------------------------------+
| Patient Order Disease Age |
|--------------------------------|
1. | 1 1 123 30 |
2. | 1 2 3 2 |
3. | 2 1 122 23 |
4. | 2 2 123 51 |
5. | 2 3 5 44 |
6. | 3 1 5 50 |
+--------------------------------+
With the data in this form you can now answer lots of questions easily. I don't see that a whole bunch of new variables would make many analyses easier. Another way to see this is that you have hinted that the order in which diseases are coded is arbitrary; that being so, wiring that into the data structure is ill-advised. Even if order is important, it is still accessible as part of the dataset (variable Order).
Hint: If you still want separate variables for some purposes, look at separate.
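If a variable such as asthma_age_at_diagnosis is still wanted, the long layout makes it a one-line calculation for each disease. The sketch below is only illustrative: the codes 123 and 5 are taken from the example data, and which code stands for asthma is not stated in the question. The age_dx_ prefix is a hypothetical naming choice. Using min() assumes that the earliest recorded age is wanted if a disease is listed more than once.

```stata
clear
input Patient Disease1 Disease2 Disease3 Age1 Age2 Age3
1 123 3 . 30 2 .
2 122 123 5 23 51 44
3 5 . . 50 . .
end

reshape long Disease Age, i(Patient) j(Order)
drop if missing(Disease)

* age at diagnosis for each code of interest, repeated on every
* observation of the same patient, missing if the code never occurs
foreach code of numlist 123 5 {
    bysort Patient : egen age_dx_`code' = min(cond(Disease == `code', Age, .))
}

* show one line per patient
egen first = tag(Patient)
list Patient age_dx_123 age_dx_5 if first, sep(0)
```

The loop replaces the 28 replace statements: adding another disease means adding its code to the numlist.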
