Generating a new variable by selection from multiple variables - stata

I have some data on diseases and age of diagnosis. Each participant was asked what diseases they have had and at what age that disease was diagnosed.
There is a set of variables disease1-28 with a numeric code for each disease and another set age1-28 with the age at diagnosis in years. The diseases are placed in successive variables in the order recalled; the age of diagnosis is placed in the corresponding age variable.
I would like to generate a new variable for each of several diseases giving the age of diagnosis of that disease, e.g. asthma_age_at_diagnosis.
Can I do this without having 28 replace statements?
Example of the data:
+-------------+----------+----------+----------+------+------+------+
| Participant | Disease1 | Disease2 | Disease3 | Age1 | Age2 | Age3 |
+-------------+----------+----------+----------+------+------+------+
| 1 | 123 | 3 | . | 30 | 2 | . |
| 2 | 122 | 123 | 5 | 23 | 51 | 44 |
| 3 | 5 | . | . | 50 | . | . |
+-------------+----------+----------+----------+------+------+------+

I give a general heads-up that a question of this form without any code of your own is often considered off-topic for Stack Overflow. Still, the Stata users around here are the people answering Stata questions (surprise) and we usually indulge questions like this if interesting and well-posed.
I'd advise a different data structure, period. With your example data
clear
input Patient Disease1 Disease2 Disease3 Age1 Age2 Age3
1 123 3 . 30 2 .
2 122 123 5 23 51 44
3 5 . . 50 . .
end
You can reshape
reshape long Disease Age, i(Patient) j(Order)
drop if missing(Disease)
list, sep(0)
+--------------------------------+
| Patient Order Disease Age |
|--------------------------------|
1. | 1 1 123 30 |
2. | 1 2 3 2 |
3. | 2 1 122 23 |
4. | 2 2 123 51 |
5. | 2 3 5 44 |
6. | 3 1 5 50 |
+--------------------------------+
With the data in this form you can now answer lots of questions easily. I don't see that a whole bunch of new variables would make many analyses easier. Another way to see this is that you have hinted that the order in which diseases are coded is arbitrary; that being so, wiring that into the data structure is ill-advised. Even if order is important, it is still accessible as part of the dataset (variable Order).
Hint: If you still want separate variables for some purposes, look at separate.
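As a minimal sketch of how the long layout can give an age-at-diagnosis variable directly, the following assumes (hypothetically) that code 123 denotes asthma; the variable name asthma_age is likewise only illustrative. It rebuilds the example data so that it runs on its own.

```stata
clear
input Patient Disease1 Disease2 Disease3 Age1 Age2 Age3
1 123 3 . 30 2 .
2 122 123 5 23 51 44
3 5 . . 50 . .
end
reshape long Disease Age, i(Patient) j(Order)
drop if missing(Disease)

* 123 is an assumed code for asthma; substitute the real one.
* cond() yields Age only on asthma rows, so min() returns the age at
* diagnosis for each patient, or missing if asthma was never reported.
egen asthma_age = min(cond(Disease == 123, Age, .)), by(Patient)
list, sepby(Patient)
```

If a patient reported the same disease more than once, min() keeps the earliest age. No loop over the 28 variable pairs is needed.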

Related

Count the number of distinct strings and their occurrence in a variable

I have a variable called Category that specifies the category of observations. The problem is that some observations have multiple categories. For example:
id Category
1 Economics
2 Biology
3 Psychology; Economics
4 Economics; Psychology
There is no meaning in the order of categories. They are always separated by ";". There are 250 categories, so creating dummy variables might be tricky. I have the complete list of categories in a separate Excel file if this might help.
What I want is simply to summarize my dataset by unique categories such as Economics (3), Psychology (2), Biology (1) (so the sum of all counts can exceed the number of observations).
tabsplit from the tab_chi package on SSC will do this for you.
clear
input id str42 Category
1 "Economics"
2 "Biology"
3 "Psychology; Economics"
4 "Economics; Psychology"
end
capture ssc install tab_chi
tabsplit Category, p(;)
Category | Freq. Percent Cum.
------------+-----------------------------------
Biology | 1 16.67 16.67
Economics | 3 50.00 66.67
Psychology | 2 33.33 100.00
------------+-----------------------------------
Total | 6 100.00
Note: You can count semi-colons, and thus phrases, like this.
gen count = 1 + length(Category) - length(subinstr(Category, ";", "", .))
The logic is that you compare the length of the string with the length it would have if the semi-colons ; were replaced by empty strings (namely, removed). The difference is the number of semi-colons, to which you add 1.
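To make the arithmetic concrete, here is a small worked check on one of the example strings. The local macro s is just a convenience for this sketch; run it from a do-file.

```stata
local s "Psychology; Economics"
* full length: 10 + 1 (semi-colon) + 1 (space) + 9 = 21
display length("`s'")
* length with the semi-colon removed: 20
display length(subinstr("`s'", ";", "", .))
* one semi-colon, hence two categories: 1 + 21 - 20 = 2
display 1 + length("`s'") - length(subinstr("`s'", ";", "", .))
```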
EDIT: How to get to a different data structure, starting with the data example above.
. split Category, p(;)
variables created as string:
Category1 Category2
. drop Category
. reshape long Category, i(id) j(mention)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 3 -> 3
j variable (2 values) -> mention
xij variables:
Category1 Category2 -> Category
-----------------------------------------------------------------------------
. drop if missing(Category)
(2 observations deleted)
. list, sepby(id)
+----------------------------+
| id mention Category |
|----------------------------|
1. | 1 1 Economics |
|----------------------------|
2. | 2 1 Biology |
|----------------------------|
3. | 3 1 Psychology |
4. | 3 2 Economics |
|----------------------------|
5. | 4 1 Economics |
6. | 4 2 Psychology |
+----------------------------+

Update results in a column from multiple columns with different names

Based on the image, I would like to loop through the columns to find where the text mo occurs. A new variable mo should then hold the corresponding result, not the text mo itself. The challenge has been selecting the result, which sits in the column next to the one containing mo.
Your answer to my comment above suggests to me that the question you ask reflects the wrong approach to the larger problem. Your description suggests that you have observations with a varying number of testname/testvalue pairs, such as
+----------------------------------------+
| id day test1 val1 test2 val2 |
|----------------------------------------|
| A 1 mo 11 . |
| A 2 mo 12 df 98.2 |
|----------------------------------------|
| B 1 df 98.3 mo 23 |
| B 2 mo 14 . |
+----------------------------------------+
and your objective is to produce observations that look like this
+----------------------+
| id day df mo |
|----------------------|
| A 1 . 11 |
| A 2 98.2 12 |
|----------------------|
| B 1 98.3 23 |
| B 2 . 14 |
+----------------------+
If that is the case, here is a reproducible example that you can copy and paste into Stata's Do-file Editor window and execute. Examining the output shows how the technique avoids all the complexity you introduce by trying to use loops to accomplish the task. The reshape command is one of Stata's most powerful data management tools, and it will benefit you to learn how to use it.
clear
input str8 id int day str8 test1 float val1 str8 test2 float val2
A 1 "mo" 11 "" .
A 2 "mo" 12 "df" 98.2
B 1 "df" 98.3 "mo" 23
B 2 "mo" 14 "" .
end
list, sepby(id) noobs
reshape long test val, i(id day) j(num)
drop if missing(test)
drop num
list, sepby(id) noobs
reshape wide val, i(id day) j(test) str
rename val* *
list, sepby(id) noobs

How to delete variables which occur in column x but not in column y?

How can I delete duplicates which occur in column x but not in column y?
My dataset is as follows:
+-------+---+---+
| year | x | y |
+-------+---+---+
| 2001 | 1 | 2 |
| 2001 | 2 | 3 |
| 2001 | 2 | 3 |
| 2001 | 4 | 6 |
| 2001 | 5 | 9 |
| 2001 | 4 | 2 |
| 2001 | 4 | 9 |
+-------+---+---+
What I want is to remove the entries which occur in column y from the ones in column x.
My result would be: 1,4,5
I am currently learning Stata and I would love to know a good source for all possible commands, if one exists, so I can learn better on my own. Currently I have trouble finding good sources.
In Stata what you call columns are always called variables.
See http://www.statalist.org/forums/help#stata for general advice on how to present data examples in Stata questions. (The comments on CODE delimiters don't apply here.)
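This may help. The idea is to stack x and y into a single variable with reshape long and then drop every value that occurs at least once in y. I didn't understand the role of year in your problem.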
clear
input year x y
2001 1 2
2001 2 3
2001 2 3
2001 4 6
2001 5 9
2001 4 2
2001 4 9
end
rename x Datax
rename y Datay
gen long obs = _n
reshape long Data, i(obs) j(which) string
bysort Data (which) : drop if which[_N] == "y"
list
+---------------------------+
| obs which year Data |
|---------------------------|
1. | 1 x 2001 1 |
2. | 4 x 2001 4 |
3. | 7 x 2001 4 |
4. | 6 x 2001 4 |
5. | 5 x 2001 5 |
+---------------------------+
All possible commands aren't documented in a single place. Someone could write new commands all the time and they would not be documented anywhere except their help files. Did you mean that? Nor are all existing commands documented in one place: many are user-written and most of those are just documented by their help files.
Most of the official commands in Stata as supplied by StataCorp are documented in the manuals. Literally, there are also undocumented commands (I am not inventing this: see help undocumented) and there are also nondocumented commands that exist, known about because StataCorp mention them in talks or emails. To be as positive as possible: start with the manuals, bundled with your copy of Stata as .pdf files.

Quintiles with different quantity of observations

I am using Stata and investigating the variable household net wealth (NetWealth).
I want to construct the quintiles of this variable and use the following command--as you can see I use survey data and thus apply survey weights:
xtile Quintile = NetWealth [pw=surveyweight], nq(5)
Then I give the following command to check what I have obtained:
tab Quintile, sum(NetWealth)
This is the result:
Means, Standard Deviations and Frequencies of DN3001 Net wealth
5 |
quantiles |
of dn3001 |
-----------+-----------+
1 | 1519.4221
|43114.959
| 154
-----------+-----------+
2 | 135506.67
| 74360.816
| 179
-----------+-----------+
3 | 396712.16
| 69715.49
| 161
-----------+-----------+
4 | 669065.69
| 111102.02
| 182
-----------+-----------+
5 | 2552620.5
| 3872350.9
| 274
-----------+-----------+
Total | 957419.29
| 2323329.8
| 950
Why do I get a different number of households in each quintile? In particular in the last quintile?
The only explanation that I can come up with is that when Stata constructs quintiles with xtile, it excludes from the computation those observations that present a replicate value of NetWealth. I have had this impression also while consulting the Stata material.
What do you think?
Your problem is not fully reproducible in so far as you don't give a self-contained example, but in general there is no puzzle here.
Often people seeking such binnings have a small problem in that their number of observations is not an exact multiple of the number of quantile-based bins they want, but in your case that does not bite, as the calculation
. di 154 + 179 + 161 + 182 + 274
950
shows that you have 950 observations, which is 5 x 190.
The bigger deal -- here and almost always -- arises from Stata's rule that identical values in different observations must be assigned to the same bin. So, ties are likely to be the problem here.
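A quick way to see how many ties are present is sketched below. It uses the auto dataset shipped with Stata so that it runs as is; with the data in the question you would put NetWealth in place of mpg. The variable name ntied is only illustrative.

```stata
sysuse auto, clear
* number of observations sharing each distinct value of mpg
bysort mpg : gen ntied = _N
* how many observations are tied with at least one other observation
count if ntied > 1
```

Any block of tied values that straddles a quantile boundary has to go wholly into one bin, which is what makes the bin frequencies unequal.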
You have perhaps three possible solutions. Only one involves direct coding.
1. Live with it.
2. Do something else. For example, why are you doing this anyway? Why not use the original data?
3. Try a different boundary condition. To do that, just negate the variable and bin that version. Then values on the boundary will jump differently.
Adding random noise to separate ties is utterly indefensible in my view. It's not reproducible (except trivially using the same program and the same settings) and it will have different implications in terms of the same observations' values on other variables.
Here's an example where #3 doesn't help, but it sometimes does:
. sysuse auto, clear
(1978 Automobile Data)
. xtile bin5 = mpg, nq(5)
. gen negmpg = -mpg
. xtile bin5_2 = negmpg, nq(5)
. tab bin5
5 quantiles |
of mpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 18 24.32 24.32
2 | 17 22.97 47.30
3 | 13 17.57 64.86
4 | 12 16.22 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
. tab bin5_2
5 quantiles |
of negmpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 19 25.68 25.68
2 | 12 16.22 41.89
3 | 16 21.62 63.51
4 | 13 17.57 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
See also some discussion within Section 4 of this paper
I see no hint whatsoever in the documentation that xtile would omit observations in the way that you imply. You give no precise quotation supporting that. It would be perverse to exclude any non-missing values unless so instructed.
I don't comment directly here on use of pweights except that using pweights might be a complicating factor here.

How can I match two columns of data by name?

I have two sets of data that look something like this:
Bill | 7
Sam | 13
Chuck | 9
and
Bill | 6
Sam | 3
Beth | 6
and I want:
Beth | 0 | 6
Bill | 7 | 6
Chuck| 9 | 0
Sam | 13 | 3
I don't even care if the data ends up looking like this:
Bill | 7 | Bill | 6
| | Beth | 6
Sam | 13 | Sam | 3
Chuck| 9 | Chuck| 0
I just would like to match up the names.
I've never seen an ordering like your desired outcome in "real life practice".
To use the data, I would go with an operating system tool to combine the source files
(for example: copy file1 + file2 newfile.csv; the .csv extension lets OOo Calc recognize the file easily).
In Calc you can then sort or filter to show a person's data together, or sum and calculate with it.
If you want standard operations, such as SUM per person, check out the pivot table feature.
HTH