I have a dataset in which the variables look like those in the image.
I want to produce a table of serogroup against all antibiotics [penicillin-tetracycline]. The antibiotics have value labels ("Sensitive", "Resistant").
Here I only consider the "Resistant" value.
I have tried the following code:
gen All_antibiotic =1 if penicillin=="Resistant"
replace All_antibiotic =2 if ampicillin=="Resistant"
.
.
tab All_antibiotic serogroup
But it did not give a complete table.
There are various difficulties here:
You don't provide a reproducible example, in that you don't provide a data example we can use. See this page on minimal examples.
You don't make clear what would be the rows, columns and cells of the table.
You are confusing string values and value labels. "Resistant" is a string value, not a value label.
The question title doesn't really indicate the problem.
This may help. In your case you would need rename before you could use reshape.
clear
input id group str4(y1 y2 y3)
1 1 frog frog toad
2 1 frog toad toad
3 1 toad toad toad
4 2 frog frog frog
5 2 frog frog toad
6 2 frog toad toad
end
preserve
reshape long y, i(id) j(which)
describe
tab group y
           |           y
     group |      frog       toad |     Total
-----------+----------------------+----------
         1 |         3          6 |         9
         2 |         6          3 |         9
-----------+----------------------+----------
     Total |         9          9 |        18
restore
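Applied to the layout in the question, the same idea might look like the sketch below. The variable names id, serogroup, penicillin, ampicillin and tetracycline, the prefix ab_ and the data values are all invented for illustration, and the antibiotics are assumed to be string variables holding "Resistant" or "Sensitive".

```stata
clear
* hypothetical data: one row per isolate, one string variable per antibiotic
input id serogroup str9(penicillin ampicillin tetracycline)
1 1 "Resistant" "Sensitive" "Resistant"
2 1 "Sensitive" "Sensitive" "Resistant"
3 2 "Resistant" "Resistant" "Sensitive"
4 2 "Sensitive" "Resistant" "Sensitive"
end

* give the antibiotic variables a common prefix so that reshape can find them
foreach v of varlist penicillin ampicillin tetracycline {
    rename `v' ab_`v'
}

* one row per isolate and antibiotic; the antibiotic name goes into a string variable
reshape long ab_, i(id) j(antibiotic) string

* rows are antibiotics, columns are serogroups, cells count resistant isolates
tab antibiotic serogroup if ab_ == "Resistant"
```

If the antibiotics are really numeric variables with value labels, the if condition would need to test the numeric code instead of the string.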
I have the following variable indicating whether an observation is working or unemployed, where 0 indicates working and 1 refers to unemployed.
dataex unemp
input float unemp
0
0
0
0
1
.
1
When I tabulate the variable:
Unemploymen |
          t |      Freq.
------------+--------------
   Employed |         80
 Unemployed |         20
   Total LF          100
I essentially want to divide 20 by 100 to obtain an overall unemployment rate of 20%. I have done this manually for now, but I think it is better to automate it, as I also want to compute unemployment by education group and geographic region.
gen unemployment_broad = .
replace unemployment_broad = (20/100)*100
The education variable is coded as follows: 1 "Less than basic", 2 "Basic", 3 "Secondary", 4 "Higher education".
Is there a way to compute the unemployment rate for each education group?
input float educ
2
4
4
4
2
4
1
3
3
3
Using Cybernike's solution, I tried to create a variable showing unemployment by education as follows, but I got an error:
gen unemp_educ = .
replace unemp_educ = bysort educ: summarize unemp
I essentially want to visualize unemployment by education. With something like this:
graph hbar (mean) Unemployment, over(education)
This is because I also intend to repeat the same calculation by demographic group, gender, etc.
Your unemployment variable is coded as 0/1. Therefore, you can obtain the proportion unemployed by taking the mean value. You could do this using the summarize command, or using the collapse command. Both of these can be performed by education group.
clear
input unemp educ
0 2
0 4
0 4
0 4
1 2
0 3
1 3
1 1
1 3
end
bysort educ: summarize unemp
collapse (mean) unemp, by(educ)
list
     +-----------------+
     | educ      unemp |
     |-----------------|
  1. |    1          1 |
  2. |    2         .5 |
  3. |    3   .6666667 |
  4. |    4          0 |
     +-----------------+
In response to your edit, you can also save the mean values to the original dataset using:
bysort educ: egen unemp_mean = mean(unemp)
Your code for plotting the data seems to work fine.
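As a sketch of how the pieces might fit together, the code below attaches the education labels from the question, stores the group rate as a percentage, and plots it. The names educlbl and unemp_pct and the axis title are my own choices, not anything from the question.

```stata
clear
input unemp educ
0 2
0 4
0 4
0 4
1 2
0 3
1 3
1 1
1 3
end

* value labels as described in the question; the label name educlbl is arbitrary
label define educlbl 1 "Less than basic" 2 "Basic" 3 "Secondary" 4 "Higher education"
label values educ educlbl

* the mean of a 0/1 indicator is a proportion; multiplying by 100 gives a percentage
bysort educ: egen unemp_pct = mean(100 * unemp)

* compact table of the proportion unemployed and the group size
tabstat unemp, by(educ) statistics(mean n)

* unemp_pct is constant within each group, so its mean is just that group's rate
graph hbar (mean) unemp_pct, over(educ) ytitle("Unemployment rate (%)")
```

The same pattern should carry over to gender or region by swapping educ for the relevant grouping variable.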
Is it possible to ask Stata to combine variables and sort them in order?
My data file is a list of inventories that looks something like the picture posted below. I have a total of 7 categories that I assign to a specific characteristic. However, these categories are not in order. For example, one row would have satin and damask and the next would have damask and satin. Is it possible to ask Stata to combine the variables and sort them in order?
I want a final column that contains all 7 categories in a fixed order. For instance, whether the earlier columns read satin and damask or damask and satin, the final column should always read satin and damask. Whatever order fox, wool and satin appear in across the earlier columns, they should appear in the same order in the final column. There are hundreds of different words in the first category and progressively fewer in the following ones.
Then I can convert this from long form to short form, to get a list of persons instead of a list of inventories for further graphing and calculations.
* Example generated by -dataex-. To install: ssc install dataex
clear
input str6(cat1 cat2) str5 cat3
"satin" "damask" ""
"damask" "satin" ""
"wool" "fox" "satin"
"satin" "fox" "wool"
end
Part of what you want may just be a combined table. Install tab_chi from SSC using ssc install tab_chi and then you have tabm installed: see its help for more.
. tabm cat?

           |                   values
  variable |    damask        fox      satin       wool |     Total
-----------+--------------------------------------------+----------
      cat1 |         1          0          2          1 |         4
      cat2 |         1          2          1          0 |         4
      cat3 |         0          0          1          1 |         2
-----------+--------------------------------------------+----------
     Total |         2          2          4          2 |        10

. tabm cat?, transpose

           |             variable
    values |      cat1       cat2       cat3 |     Total
-----------+---------------------------------+----------
    damask |         1          1          0 |         2
       fox |         0          2          0 |         2
     satin |         2          1          1 |         4
      wool |         1          0          1 |         2
-----------+---------------------------------+----------
     Total |         4          4          2 |        10
Note. What's with the foxes? Did foxes have to die so people could wear them?
Note. You may have to bite the bullet and reshape long.
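For the sorting part of the question, one possible route through reshape long is sketched here. It assumes that alphabetical order within each row is an acceptable fixed order; the names id, which and combined are made up for the example.

```stata
clear
input str6(cat1 cat2) str5 cat3
"satin" "damask" ""
"damask" "satin" ""
"wool" "fox" "satin"
"satin" "fox" "wool"
end

* an identifier for each original row, needed by reshape
gen long id = _n

* one observation per row and category slot
reshape long cat, i(id) j(which)
drop if missing(cat)

* sort the words alphabetically within each row and renumber the slots
bysort id (cat): replace which = _n

* back to one observation per row, now with the words in sorted order
reshape wide cat, i(id) j(which)

* join the sorted words; empty slots add nothing once trimmed
gen combined = ""
foreach v of varlist cat* {
    replace combined = trim(combined + " " + `v')
}

list, sepby(id)
```

With this approach rows 1 and 2 should end up with the same value of combined, as should rows 3 and 4, whatever order the words were typed in.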
I'm using a sample survey of persons in a country. Every person has an ID that identifies the household to which he/she belongs. I'm fitting a probit model to analyze the effect of the household head's education on poverty, but I need to copy the head of household's level of education to all the members of the household.
How can I create a variable in Stata that copies the level of education of the head of household to all the members of the household, if they share the same household ID?
I need to do something like the image. I need the "schooling of the head of household" variable.
Your data example is helpful, but still ambiguous as the column headers are not all legal Stata variable names and it is not clear whether variables are string or numeric with value labels or numeric. See the Stata tag wiki for detailed advice on data examples.
This example works in terms of numeric variables.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end
bysort id : egen wanted = mean(cond(relationship == 1, schooling, .))
list, sepby(id)
     +-----------------------------------+
     | id   relati~p   school~g   wanted |
     |-----------------------------------|
  1. |  1          1          4        4 |
  2. |  1          2          4        4 |
  3. |  1          3          2        4 |
     |-----------------------------------|
  4. |  2          1          5        5 |
  5. |  2          2          4        5 |
     |-----------------------------------|
  6. |  3          1          5        5 |
  7. |  3          3          1        5 |
     +-----------------------------------+
If there is at most one person who is head of household, some other functions of the egen command would work to give the same result, including min(), max() and total(). If two or more people were recorded as head of household, then the mean would indeed be recorded and it might not be an integer.
For explanation and discussion, see Section 9 of this paper.
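A quick way to check the "at most one head" assumption, and to see that max() agrees with mean() when it holds, might be as follows. The names n_heads, wanted_mean and wanted_max are invented here, and relationship == 1 is assumed to mark the head, as in the example above.

```stata
clear
input byte id float(relationship schooling)
1 1 4
1 2 4
1 3 2
2 1 5
2 2 4
3 1 5
3 3 1
end

* count the people recorded as head in each household
bysort id : egen n_heads = total(relationship == 1)

* stop with an error if any household has two or more heads
assert n_heads <= 1

* with at most one head, these two give the same answer
bysort id : egen wanted_mean = mean(cond(relationship == 1, schooling, .))
bysort id : egen wanted_max = max(cond(relationship == 1, schooling, .))
assert wanted_mean == wanted_max

list, sepby(id)
```

If the first assert fails on real data, listing the households with n_heads > 1 would show where the head needs to be resolved before copying anything.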
I am working with a data set covering multiple countries, variables, and years. It is currently organized wide like so (actually ~30 years and 5 different variables for each country):
country   measure   yr1995   yr1996   yr1997
USA       A         5        4        1
USA       B         1        2        1
USA       C         0        4        2
UK        A         2        4        9
UK        B         2        8        4
UK        C         2        4        1
What I would like is for the data to be rearranged long like so:
country   year   A   B   C
USA       1995   5   1   0
USA       1996   4   2   4
USA       1997   1   1   2
UK        1995   2   2   2
UK        1996   4   8   4
UK        1997   9   4   1
I tried using reshape long yr, i(country) j(year) but get the following error message:
variable id does not uniquely identify the observations
Your data are currently wide. You are performing a reshape long. You specified i(country) and j(year). In
the current wide form, variable country should uniquely identify the observations.
I think this is because country is not the only long variable? (measure also is?)
Besides fixing that issue and arranging the years long instead of wide, I don't think this command will accomplish the other task of moving the different variables (A, B, C) into the wide format as column headers.
Will I need to use a separate reshape wide command for that? Or is there some way to expand the command to do both at once?
It's a double reshape. At least it can be done that way; and, further, that seems essential because years need to be long, not wide, and the measure(s) need to be wide, not long, so there are flavours of both problems.
Economic development data often arrive like this. Indeed the problem has given rise to at least one dedicated short paper in the Stata Journal, which is visible to all.
Your data example is helpful, and almost immediately useful, but please read the Stata tag and help dataex (if necessary, install dataex first using ssc install dataex).
See also this FAQ, which includes some hints beyond the Stata help and manual entry.
A search reshape in Stata would have pointed to these resources.
clear
input str3 country str1 measure yr1995 yr1996 yr1997
USA A 5 4 1
USA B 1 2 1
USA C 0 4 2
UK A 2 4 9
UK B 2 8 4
UK C 2 4 1
end
reshape long yr, i(country measure) j(year)
reshape wide yr, i(country year) j(measure) string
rename (yr*) *
list, sepby(country)
     +----------------------------+
     | country   year   A   B   C |
     |----------------------------|
  1. |      UK   1995   2   2   2 |
  2. |      UK   1996   4   8   4 |
  3. |      UK   1997   9   4   1 |
     |----------------------------|
  4. |     USA   1995   5   1   0 |
  5. |     USA   1996   4   2   4 |
  6. |     USA   1997   1   1   2 |
     +----------------------------+
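The error message in the question can be anticipated with isid, which tests whether a set of variables uniquely identifies the observations. This is only a small sketch on the example data above.

```stata
clear
input str3 country str1 measure yr1995 yr1996 yr1997
USA A 5 4 1
USA B 1 2 1
USA C 0 4 2
UK  A 2 4 9
UK  B 2 8 4
UK  C 2 4 1
end

* country alone repeats across measures, so isid fails;
* capture traps the error and leaves a nonzero return code in _rc
capture isid country
display "isid country returned code " _rc

* country and measure together identify each row, so this passes silently
isid country measure
```

Whatever set of variables passes isid in the wide layout is a candidate for the i() option of reshape long.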
I have a variable called Category that specifies the category of each observation. The problem is that some observations have multiple categories. For example:
id Category
1 Economics
2 Biology
3 Psychology; Economics
4 Economics; Psychology
The order of the categories has no meaning. They are always separated by ";". There are 250 categories, so creating dummy variables might be tricky. I have the complete list of categories in a separate Excel file, if that helps.
What I want is simply to summarize my dataset by unique categories, such as Economics (3), Psychology (2), Biology (1), so the sum of the counts can exceed the number of observations.
tabsplit from the tab_chi package on SSC will do this for you.
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
capture ssc install tab_chi
tabsplit Category, p(;)
   Category |      Freq.     Percent        Cum.
------------+-----------------------------------
    Biology |          1       16.67       16.67
  Economics |          3       50.00       66.67
 Psychology |          2       33.33      100.00
------------+-----------------------------------
      Total |          6      100.00
Note: You can count semi-colons, and thus phrases, like this.
gen count = 1 + length(Category) - length(subinstr(Category, ";", "", .))
The logic is that you measure the length of the string and then its length once every semi-colon has been replaced by an empty string (namely, removed). The difference is the number of semi-colons, to which you add 1 to get the number of phrases.
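On the example data the count works out as in this sketch, with one phrase for ids 1 and 2 and two phrases for ids 3 and 4. The total of 6 matches the total reported by tabsplit.

```stata
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end

* number of phrases = number of semi-colons + 1
gen count = 1 + length(Category) - length(subinstr(Category, ";", "", .))

list id Category count

* checks against the example: single entries first, then the pairs
assert count == 1 in 1/2
assert count == 2 in 3/4

* the sum over observations is the total number of mentions
quietly summarize count
assert r(sum) == 6
```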
EDIT: How to get to a different data structure, starting with the data example above.
. split Category, p(;)
variables created as string:
Category1 Category2
. drop Category
. reshape long Category, i(id) j(mention)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 3 -> 3
j variable (2 values) -> mention
xij variables:
Category1 Category2 -> Category
-----------------------------------------------------------------------------
. drop if missing(Category)
(2 observations deleted)
. list, sepby(id)
     +----------------------------+
     | id   mention      Category |
     |----------------------------|
  1. |  1         1     Economics |
     |----------------------------|
  2. |  2         1       Biology |
     |----------------------------|
  3. |  3         1    Psychology |
  4. |  3         2     Economics |
     |----------------------------|
  5. |  4         1     Economics |
  6. |  4         2    Psychology |
     +----------------------------+
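One point to watch with this route, as far as I can tell: split with p(;) cuts at the semi-colon only, so the second piece of "Psychology; Economics" would keep its leading space, which the right-aligned listing does not show. A sketch that trims the pieces before tabulating:

```stata
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end

split Category, p(;)
drop Category
reshape long Category, i(id) j(mention)

* remove leading and trailing spaces left over from splitting on ";" alone
replace Category = trim(Category)
drop if missing(Category)

* frequencies by category in the long layout
tab Category

* check against the tabsplit result above: Economics is mentioned 3 times
count if Category == "Economics"
assert r(N) == 3
```

Trimming is harmless if the pieces turn out to be clean already, so it seems a cheap safeguard before any tabulation or merge on Category.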