How to recode separate variables from a multiple response survey question into one variable - stata

I am trying to recode a variable that indicates total number of responses to a multiple response survey question. Question 4 has options 1, 2, 3, 4, 5, 6, and participants may choose one or more options when submitting a response. The data is currently coded as binary outputs for each option: var Q4___1 = yes or no (1/0), var Q4___2 = yes or no (1/0), and so forth.
This is the tabstat of all yes (1) responses to the 6 Q4___* variables
Variable | Sum
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
I would like to create a new variable that encapsulates these values.
Can someone help me figure out how to create this variable, and if coding a variable in this manner for a multiple option survey question is valid?
When I used the replace command the total number of responses were not adding up, as shown below
gen q4=.
replace q4 =1 if q4___1 == 1
replace q4 =2 if q4___2 == 1
replace q4 =3 if q4___3 == 1
replace q4 =4 if q4___4 == 1
replace q4 =5 if q4___5 == 1
replace q4 =6 if q4___6 == 1
label values q4 primarysource
q4 | Freq. Percent Cum.
------------+-----------------------------------
1 | 46 48.94 48.94
2 | 31 32.98 81.91
3 | 6 6.38 88.30
4 | 1 1.06 89.36
5 | 3 3.19 92.55
6 | 7 7.45 100.00
------------+-----------------------------------
Total | 94 100.00
UPDATE
To clarify: I am trying to create a new variable that captures the column sum of each question, not the row total across all questions. I know that 63 participants responded yes to question 4 a) and 33 to question 4 b), so I want my new variable to reflect that.
This is what I want my new variable's values to look like.
q4
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115

The fallacy here is ignoring the possibility of multiple 1s as answers to the various Q4???? variables. For example if someone answers 1 1 1 1 1 1 to all questions, they appear in your final variable only in respect of their answer to the 6th question. Otherwise put, your code overwrites and so ignores all positive answers before the last positive answer.
What is likely to be more useful are
(1) the total across all 6 questions which is just
egen Q4_total = rowtotal(Q4????)
where the 4 instances of ? mean that by eye I count 3 underscores and 1 numeral.
(2) a concatenation of responses that is just
egen Q4_concat = concat(Q4????)
(3) a variable that is a concatenation of questions with positive responses, so 246 if those questions were answered 1 and the others were answered 0.
gen Q4_pos = ""
forval j = 1/6 {
replace Q4_pos = Q4_pos + "`j'" if Q4___`j' == 1
}
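For what it is worth, here is a minimal sketch that runs the three suggestions above on made-up data. The four observations are invented for illustration, and the variable names assume upper-case Q4___1 to Q4___6; Stata names are case-sensitive, so use q4___? instead if your variables are lower-case, as the tabstat output in the question suggests.

```stata
* Hypothetical data: four respondents, six yes/no options
clear
input Q4___1 Q4___2 Q4___3 Q4___4 Q4___5 Q4___6
1 0 0 0 0 0
0 1 0 1 0 1
1 1 1 1 1 1
0 0 0 0 0 0
end

* (1) number of options chosen by each respondent
egen Q4_total = rowtotal(Q4___?)

* (2) all six answers as one string, e.g. "010101"
egen Q4_concat = concat(Q4___?)

* (3) only the option numbers answered 1, e.g. "246"
gen Q4_pos = ""
forval j = 1/6 {
    replace Q4_pos = Q4_pos + "`j'" if Q4___`j' == 1
}

list Q4_total Q4_concat Q4_pos
```

The second respondent chose options 2, 4 and 6, so none of the three new variables loses information in the way the chain of replace statements does.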
EDIT
Here is a test script giving concrete examples.
clear
set obs 6
forval j = 1/6 {
gen Q`j' = _n <= `j'
}
list
egen rowtotal = rowtotal(Q?)
su rowtotal, meanonly
di r(sum)
* install from tab_chi on SSC
tabm Q?
Results:
. list
+-----------------------------+
| Q1 Q2 Q3 Q4 Q5 Q6 |
|-----------------------------|
1. | 1 1 1 1 1 1 |
2. | 0 1 1 1 1 1 |
3. | 0 0 1 1 1 1 |
4. | 0 0 0 1 1 1 |
5. | 0 0 0 0 1 1 |
|-----------------------------|
6. | 0 0 0 0 0 1 |
+-----------------------------+
. egen rowtotal = rowtotal(Q?)
. su rowtotal, meanonly
. di r(sum)
21
. tabm Q?
| values
variable | 0 1 | Total
-----------+----------------------+----------
Q1 | 5 1 | 6
Q2 | 4 2 | 6
Q3 | 3 3 | 6
Q4 | 2 4 | 6
Q5 | 1 5 | 6
Q6 | 0 6 | 6
-----------+----------------------+----------
Total | 15 21 | 36
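If what is wanted is the column totals held as data rather than just displayed, one possible route is to reshape to long layout and collapse. This is only a sketch on the same toy data as the test script; the names id, option and count are my own choices, not anything from the question.

```stata
* Same toy data as the test script above
clear
set obs 6
forval j = 1/6 {
    gen Q`j' = _n <= `j'
}

* reshape needs an identifier for each respondent
gen id = _n

* one observation per respondent and option
reshape long Q, i(id) j(option)

* one observation per option, holding the number of 1s
collapse (sum) count = Q, by(option)

list, noobs
```

Note that this replaces the data in memory with a dataset of totals, so wrap it in preserve and restore if the original data are still needed.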

Related

Using summation to create a new variable

I have data that look like this:
| Country | Year | Firm | Profit |
|---------|------|------|--------|
| A | 1 | 1 | 10 |
| A | 1 | 2 | 20 |
| A | 1 | 3 | 30 |
| A | 1 | 4 | 40 |
I want to create a new variable that, for each firm i, sums max(Profit_j - Profit_i, 0) over the other firms j in the same country and year.
For example, the value of the variable for firm 1 would be:
max(20 - 10, 0) + max(30 - 10, 0) + max(40 - 10, 0)
How can I do this in Stata by country and year?
Below is a direct solution to your problem (note the use of dataex for providing example data):
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Country float(Year Firm Profit)
"A" 1 1 10
"A" 1 2 20
"A" 1 3 30
"A" 1 4 40
end
generate Wanted = -Profit
bysort Country Year (Wanted): replace Wanted = sum(Profit) - _n * Profit
list
+-----------------------------------------+
| Country Year Firm Profit Wanted |
|-----------------------------------------|
1. | A 1 4 40 0 |
2. | A 1 3 30 10 |
3. | A 1 2 20 30 |
4. | A 1 1 10 60 |
+-----------------------------------------+
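The logic behind it is the following: sorting on Wanted = -Profit puts each country-year group in order from highest to lowest profit. For the observation in position _n of its group, only the firms at or above it in that order have profits at least as large, so the sum of the positive differences is the running sum of Profit up to that position minus _n times the observation's own Profit.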
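As a sanity check, the sorted running-sum result can be compared with a slow brute-force calculation that takes the maxima literally. This is a sketch only: the second group (Country "B") and the variable names Check and diff are made up for the test.

```stata
clear
input str1 Country float(Year Firm Profit)
"A" 1 1 10
"A" 1 2 20
"A" 1 3 30
"A" 1 4 40
"B" 1 1 5
"B" 1 2 25
end

* fast version: running sum within groups sorted by descending profit
generate Wanted = -Profit
bysort Country Year (Wanted): replace Wanted = sum(Profit) - _n * Profit

* brute force: for each observation, sum max(difference, 0) over its group
generate Check = .
quietly forvalues i = 1/`=_N' {
    generate double diff = max(Profit - Profit[`i'], 0) ///
        if Country == Country[`i'] & Year == Year[`i']
    summarize diff, meanonly
    replace Check = r(sum) in `i'
    drop diff
}

* stops with an error if the two calculations ever disagree
assert Check == Wanted
list, sepby(Country Year)
```

The brute-force loop is quadratic in the number of observations, so it is only sensible as a check on small examples.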
Note: This was the first answer posted. It didn't avoid the pitfall of taking the OP's algebra literally and wanting to implement the calculation in terms of maxima within groups. But I realised after posting that there must be a much simpler way of doing it and @Romalpa Akzo got there, which is excellent. I undeleted this on request because it does show some machinery for looping over groups and implementing a calculation for each group with a customised Mata function.
Here I write a Mata function to return the wanted result for a group and then loop over the groups to populate a pre-defined variable.
To test the code for a dataset with more than one group, I use mpg from Stata's auto toy dataset.
mata :
void wanted (string scalar varname, string scalar usename, string scalar resultname) {
real scalar i
real colvector x, result, zero
result = x = st_data(., varname, usename)
zero = J(rows(x), 1, 0)
for(i = 1; i <= rows(x); i++) {
result[i] = sum(rowmax((x :- x[i], zero)))
}
st_store(., resultname, usename, result)
}
end
sysuse auto, clear
sort foreign rep78 mpg
egen group = group(foreign rep78), label
summarize group, meanonly
local G = r(max)
generate wanted = .
generate touse = 0
quietly forvalues g = 1 / `G' {
replace touse = group == `g'
mata : wanted("mpg", "touse", "wanted")
}
How did that work out? Here are some results:
. list mpg wanted group if foreign, sepby(group)
+--------------------------+
| mpg wanted group |
|--------------------------|
53. | 21 7 Foreign 3 |
54. | 23 3 Foreign 3 |
55. | 26 0 Foreign 3 |
|--------------------------|
56. | 21 35 Foreign 4 |
57. | 23 19 Foreign 4 |
58. | 23 19 Foreign 4 |
59. | 24 13 Foreign 4 |
60. | 25 8 Foreign 4 |
61. | 25 8 Foreign 4 |
62. | 25 8 Foreign 4 |
63. | 28 2 Foreign 4 |
64. | 30 0 Foreign 4 |
|--------------------------|
65. | 17 84 Foreign 5 |
66. | 17 84 Foreign 5 |
67. | 18 77 Foreign 5 |
68. | 18 77 Foreign 5 |
69. | 25 42 Foreign 5 |
70. | 31 18 Foreign 5 |
71. | 35 6 Foreign 5 |
72. | 35 6 Foreign 5 |
73. | 41 0 Foreign 5 |
|--------------------------|
74. | 14 . . |
+--------------------------+
So, how would that be applied to your data?
clear
input str1 Country Year Firm Profit
A 1 1 10
A 1 2 20
A 1 3 30
A 1 4 40
end
egen group = group(Country Year), label
summarize group, meanonly
local G = r(max)
generate wanted = .
generate touse = 0
quietly forvalues g = 1/`G' {
replace touse = group == `g'
mata: wanted("Profit", "touse", "wanted")
}
Results:
. list Firm Profit wanted, sepby(group)
+------------------------+
| Firm Profit wanted |
|------------------------|
1. | 1 10 60 |
2. | 2 20 30 |
3. | 3 30 10 |
4. | 4 40 0 |
+------------------------+

Convert one to many with 2 digits

I am currently handling a data set in Stata generated through ODK, the open data kit.
There is an option to answer questions with multiple answers. E.g. in my questionnaire "Which of these assets do you own?" and the interviewer tagged all the answers out of 20 options.
This generated for me a string variable with contents such as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
As this is difficult to analyse for several hundred participants, I wanted to generate new variables creating a 1 or 0 for each of the answer options.
For the variable hou_as I tried to generate the variables hou_as_1, hou_as_2 etc. with the following code:
foreach p in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 {
local P : subinstr local p "-" ""
gen byte hou_as_`P' = strpos(hou_as, "`p'") > 0
}
For the single digits this brings the problem that the variable hou_as_1 is also filled with a 1 if any of the 10 11 12 ... 19 is filled even if the option 1 was not chosen. Similarly hou_as_2 is filled when the option 2, 12 or 20 is checked.
How can I avoid this issue?
You want 20 indicator or dummy variables. Note first that it's much easier to use forval to loop 1(1)20, e.g.
forval j = 1/20 {
gen hou_as_`j' = 0
}
initialises 20 such variables as 0.
I think it's easier to loop over the words of your answer variables, words being here just whatever is separated by spaces. There are at most 20 words, and it is a little crude but likely to be fast enough to go
forval j = 1/20 {
forval k = 1/20 {
replace hou_as_`j' = 1 if word(hou_as, `k') == "`j'"
}
}
Let's put that together and try it out on your example:
clear
input str42 hou_as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
end
forval j = 1/20 {
gen hou_as_`j' = 0
forval k = 1/20 {
replace hou_as_`j' = 1 if word(hou_as, `k') == "`j'"
}
}
Just to show that it worked:
. list in 3
+----------------------------------------------------------------------------+
3. | hou_as | hou_as_1 | hou_as_2 | hou_as_3 | hou_as_4 | hou_as_5 | hou_as_6 |
| 1 3 9 11 | 1 | 0 | 1 | 0 | 0 | 0 |
|----------+----------+----------+----------+----------+----------+----------|
| hou_as_7 | hou_as_8 | hou_as_9 | hou_a~10 | hou_a~11 | hou_a~12 | hou_a~13 |
| 0 | 0 | 1 | 0 | 1 | 0 | 0 |
|----------+----------+----------+----------+----------+----------+----------|
| hou_a~14 | hou_a~15 | hou_a~16 | hou_a~17 | hou_a~18 | hou_a~19 | hou_a~20 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----------------------------------------------------------------------------+
Incidentally, your line
local P : subinstr local p "-" ""
does nothing useful. The local macro p only ever has contents which are integer digits, so there is no punctuation at all to remove.
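A possible alternative to the double loop, shown here only as a sketch, is to keep strpos() but pad both the answer string and the search target with spaces, so that "1" can no longer match inside "11" or "17". It assumes the codes are separated by single spaces, as in the example; the variable names alt_1 to alt_20 are made up so as not to clash with the variables created above.

```stata
clear
input str42 hou_as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
end

forval j = 1/20 {
    * " 1 " matches the whole word 1 but not 10, 11, ..., 19
    gen byte alt_`j' = strpos(" " + hou_as + " ", " `j' ") > 0
}

list hou_as alt_1 alt_2 alt_11 alt_20
```

This makes one pass per option rather than one pass per option and word position, which may matter for larger datasets.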
See also this explanation and
. search multiple responses, sj
Search of official help files, FAQs, Examples, SJs, and STBs
SJ-5-1 st0082 . . . . . . . . . . . . . . . Tabulation of multiple responses
(help _mrsvmat, mrgraph, mrtab if installed) . . . . . . . . B. Jann
Q1/05 SJ 5(1):92--122
introduces new commands for the computation of one- and
two-way tables of multiple responses
SJ-3-1 pr0008 Speaking Stata: On structure & shape: the case of mult. resp.
. . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox & U. Kohler
Q1/03 SJ 3(1):81--99 (no commands)
discussion of data manipulations for multiple response data

Fill with values from an earlier time point - Stata

I am trying to generate a variable that is filled using a sequence of values starting at time==1.
The sequence changes every time the variable rest1w changes from 0 to 1 or vice versa.
First, I think I need to generate x, which marks where the sequence restarts (see the example dataset below). In my example the blocks are uniform, but in my full dataset their length varies (i.e. rest1w does not change at every 5th observation).
list time restload trainload rest1w x in 1/15
+-----------------------------------------+
| time restload trainload rest1w x |
|-----------------------------------------|
1. | 1 .1994715 .4780615 0 1 |
2. | 2 .2077734 .471063 0 2 |
3. | 3 .2157595 .4641159 0 3 |
4. | 4 .2234298 .4572202 0 4 |
5. | 5 .2307843 .4503757 0 5 |
|-----------------------------------------|
6. | 6 .2378229 .4435827 1 1 |
7. | 7 .2445457 .436841 1 2 |
8. | 8 .2509527 .4301506 1 3 |
9. | 9 .2570438 .4235116 1 4 |
10. | 10 .2628191 .4169239 1 5 |
|-----------------------------------------|
11. | 11 .2682785 .4103876 0 1 |
12. | 12 .2734221 .4039026 0 2 |
13. | 13 .2782499 .397469 0 3 |
14. | 14 .2827618 .3910867 0 4 |
15. | 15 .2869579 .3847558 0 5 |
+-----------------------------------------+
Secondly, I need to generate a variable load which, as shown below, restarts from time==1 every time the sequence restarts. That is, in the second sequence where rest1w==0, load!=trainload.
The rule is that for each new sequence of 0's the value for load again goes back to the start of time (where time==1). This is demonstrated by the load values in the second sequence of 0's being exactly the same as the first sequence. In other words, where time==1, trainload==.478 then load==.478; BUT where time==11, then load==.478 (the clock essentially restarts for load so time==1) and in sequence where time==15, load==.450 (the same load as for where time==5). This is why I wanted to generate x, as I think I could just use that as my new time variable.
+-----------------------------------------+
| time restload trainload rest1w x load
|-----------------------------------------
1. | 1 .1994715 .4780615 0 1 .4780615
2. | 2 .2077734 .471063 0 2 .471063
3. | 3 .2157595 .4641159 0 3 .4641159
4. | 4 .2234298 .4572202 0 4 .4572202
5. | 5 .2307843 .4503757 0 5 .4503757
|-----------------------------------------
6. | 6 .2378229 .4435827 1 1 .1994715
7. | 7 .2445457 .436841 1 2 .2077734
8. | 8 .2509527 .4301506 1 3 .2157595
9. | 9 .2570438 .4235116 1 4 .2234298
10. | 10 .2628191 .4169239 1 5 .2307843
|-----------------------------------------
11. | 11 .2682785 .4103876 0 1 .4780615
12. | 12 .2734221 .4039026 0 2 .471063
13. | 13 .2782499 .397469 0 3 .4641159
14. | 14 .2827618 .3910867 0 4 .4572202
15. | 15 .2869579 .3847558 0 5 .4503757
+-----------------------------------------+
The below code only gives me an entry for where _n==1:
gen load = .
replace load = restload[_n==1] if rest1w==1
And I like the use of levelsof but haven't been able to get it to work (although it might work once I have generated x, but when using time it doesn't restart the sequence obviously).
gen load=.
levelsof x, local(levels)
foreach l of local levels {
replace load=trainload if rest1w==0
replace load=restload if rest1w==1
}
Thanks for any help!
I ended up cross-posting this on statalist.org and got two workable answers.
http://www.statalist.org/forums/forum/general-stata-discussion/general/1355917-fill-with-values-from-an-earlier-time-point
These were:
gen newtime = 1 if rest1w[_n - 1] != rest1w
replace newtime = newtime[_n - 1] + 1 if newtime == .
gen newload = cond(rest1w == 0, trainload[newtime], restload[newtime])
and...
gen newtime = 1
replace newtime = newtime[_n-1] + 1 if rest1w == rest1w[_n-1]
gen newload = .
replace newload = restload[newtime] if rest1w == 1
replace newload = trainload[newtime] if rest1w == 0
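For anyone wanting to try this out, here is a self-contained sketch of the same idea on invented data. The values of trainload and restload are made up, the blocks are all of length 5 purely for convenience, and the code assumes the data are already sorted by time and form a single series.

```stata
* Invented data: 15 time points, rest1w switching every 5 observations
clear
set obs 15
gen time = _n
gen rest1w = mod(floor((_n - 1) / 5), 2)
gen trainload = 100 - _n
gen restload = _n

* counter that restarts at 1 whenever rest1w changes
gen newtime = 1
replace newtime = newtime[_n - 1] + 1 if _n > 1 & rest1w == rest1w[_n - 1]

* look up the value at observation number newtime, i.e. restart the clock
gen newload = cond(rest1w == 0, trainload[newtime], restload[newtime])

list, sepby(rest1w)
```

The subscript [newtime] refers to an observation number, so this relies on time==1 being the first observation in the dataset.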

Row-wise count/sum of values in Stata

I have a dataset where each person (row) has values 0, 1 or . in a number of variables (columns).
I would like to create two variables. One that includes the count of all the 0 and one that has the count of all the 1 for each person (row).
In my case, there is no pattern in the variable names. For this reason I create a varlist of all the existing variables, excluding the ones that should not be counted.
+--------+--------+------+------+------+------+------+----------+--------+
| ID | region | Qa | Qb | C3 | C4 | Wa | count 0 | count 1|
+--------+--------+------+------+------+------+------+----------+--------+
| 1 | A | 1 | 1 | 1 | 1 | . | 0 | 4 |
| 2 | B | 0 | 0 | 0 | 1 | 1 | 3 | 2 |
| 3 | C | 0 | 0 | . | 0 | 0 | 4 | 0 |
| 4 | D | 1 | 1 | 1 | 1 | 0 | 1 | 4 |
+--------+--------+------+------+------+------+------+----------+--------+
The following works, however, I cannot add an if statement
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
If I change the last line with the one below, I get an error of invalid syntax.
egen count_of_1s = rowtotal(`varlist') if `v' == 1
I turned from count to summing because I thought this is a sneaky way out of the problem. I could change the values from 0,1 to 1, 2, then sum all the two values separately in two different variables and then divide accordingly in order to get the actual count of 1 or 2 per row.
I found this question, "Stata: Using egen, anycount() when values vary for each observation", but Stata freezes because my dataset is quite large (100,000 rows and 3,000 columns).
Any help will be very appreciated :-)
Solution based on the response of William
* number of total valid responses (0s and 1s, excluding . )
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_nonmiss = rownonmiss(`varlist') // this counts all the 0s and 1s (namely, the non missing values)
* total numbers of 1s per row
ds ID region count_of_nonmiss, not // CAUTION: count_of_nonmiss must be excluded here!
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist') // rowtotal() is an egen function, so egen is needed, not generate
How about
egen count_of_nonmiss = rownonmiss(`varlist')
generate count_of_0s = count_of_nonmiss - count_of_1s
When the value of the macro varlist is substituted into your if clause, the command expands to
egen count_of_1s = rowtotal(`varlist') if Qa Qb C3 C4 Wa == 1
Clearly a syntax error.
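Putting the pieces together, a minimal sketch on the example data from the question might look like this. The local macro name vars is my own choice; the approach assumes every counted variable contains only 0, 1 or missing, so that the row total equals the number of 1s.

```stata
clear
input ID str1 region Qa Qb C3 C4 Wa
1 "A" 1 1 1 1 .
2 "B" 0 0 0 1 1
3 "C" 0 0 . 0 0
4 "D" 1 1 1 1 0
end

* list of variables to count over, stored once before any new variables exist
ds ID region, not
local vars `r(varlist)'

* with 0/1 data the row total is the number of 1s; missings count as 0
egen count_of_1s = rowtotal(`vars')

* number of nonmissing answers, then the 0s by subtraction
egen count_of_nonmiss = rownonmiss(`vars')
generate count_of_0s = count_of_nonmiss - count_of_1s

list
```

Because the macro is filled before count_of_1s and count_of_nonmiss are created, there is no need to run ds a second time or to exclude the new variables by hand.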
I had the same problem of counting the occurrences of specified values in each observation across a set of variables.
I resolved it as follows. If you want to count the occurrences of 0 in the values across x1-x3:
clear
input id x1 x2 x3
1 1 0 2
2 2 0 2
3 2 0 3
end
egen count2 = anycount(x1-x3), values(0)

Stata: How to count the number of 'active' cases in a group when new case is opened?

I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting
invalid syntax
For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to the NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less than the new case's OpenDate and (2) its CloseDate is greater than the new case's OpenDate.
The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!
http://i.stack.imgur.com/z4iyR.jpg
EDIT
Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.
by EmpID: generate CaseNum = _n
egen group = group(EmpID)
su group, meanonly
gen NumActiveWhenOpen = 0
forvalues i = 1/ 'r(max)' {
forvalues x = 1/CaseNum if group == `i'{
count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n]
}
}
This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.
I'll steal part of Roberto Ferrer's sandbox.
clear
set more off
input ///
caseid str15(open close) empid
1 "1/1/2010" "3/1/2010" 1
2 "2/5/2010" "" 1
3 "2/15/2010" "4/7/2010" 1
4 "3/5/2010" "" 1
5 "3/15/2010" "6/15/2010" 1
6 "3/24/2010" "3/24/2010" 1
1 "1/1/2010" "3/1/2010" 2
2 "2/5/2010" "" 2
3 "2/15/2010" "4/7/2010" 2
4 "3/5/2010" "" 2
5 "3/15/2010" "6/15/2010" 2
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]
l
The idea is to reshape so that each pair of dates becomes two observations. Then if we code each opening by 1 and each closing by -1 the total number of active cases is their cumulative sum. That's all. Here are the results:
. l, sepby(empid)
+---------------------------------------------+
| empid caseid status d nopen |
|---------------------------------------------|
1. | 1 1 1 01jan2010 1 |
2. | 1 2 1 05feb2010 2 |
3. | 1 3 1 15feb2010 3 |
4. | 1 1 -1 01mar2010 2 |
5. | 1 4 1 05mar2010 3 |
6. | 1 5 1 15mar2010 4 |
7. | 1 6 1 24mar2010 4 |
8. | 1 6 -1 24mar2010 4 |
9. | 1 3 -1 07apr2010 3 |
10. | 1 5 -1 15jun2010 2 |
11. | 1 2 . . 2 |
12. | 1 4 . . 2 |
|---------------------------------------------|
13. | 2 1 1 01jan2010 1 |
14. | 2 2 1 05feb2010 2 |
15. | 2 3 1 15feb2010 3 |
16. | 2 1 -1 01mar2010 2 |
17. | 2 4 1 05mar2010 3 |
18. | 2 5 1 15mar2010 4 |
19. | 2 3 -1 07apr2010 3 |
20. | 2 5 -1 15jun2010 2 |
21. | 2 4 . . 2 |
22. | 2 2 . . 2 |
+---------------------------------------------+
The bottom line is no loops needed, but by: helps mightily. A detail useful here is that the cumulative sum function sum() ignores missings.
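To see the +1/-1 device in isolation, here is a stripped-down sketch with a single employee and invented event days. The variable day stands in for the date and the last event has a missing status, like a case that has not yet been closed.

```stata
* Invented events: +1 marks an opening, -1 a closing, . an unknown closing
clear
input day status
1   1
5   1
9  -1
12  1
20 -1
30  .
end

sort day

* running total of openings minus closings = cases open after each event
gen nopen = sum(status)

list, noobs
```

The missing status on the last line leaves the running total unchanged, which is why never-closed cases do not break the calculation.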
Try something along the lines of
clear
set more off
*----- example data -----
input ///
caseid str15(open close) empid numact
1 "1/1/2010" "3/1/2010" 1 0
2 "2/5/2010" "" 1 1
3 "2/15/2010" "4/7/2010" 1 2
4 "3/5/2010" "" 1 2
5 "3/15/2010" "6/15/2010" 1 3
6 "3/24/2010" "3/24/2010" 1 .
1 "1/1/2010" "3/1/2010" 2 0
2 "2/5/2010" "" 2 1
3 "2/15/2010" "4/7/2010" 2 2
4 "3/5/2010" "" 2 2
5 "3/15/2010" "6/15/2010" 2 3
end
gen opend = date(open, "MDY")
gen closed = date(close, "MDY")
format %td opend closed
drop open close
order empid
list, sepby(empid)
*----- what you want -----
gen numact2 = .
sort empid caseid
forvalues i = 1/`=_N' {
count if empid[`i'] == empid & /// a different count for each employee
opend[`i'] <= closed /// the date condition
in 1/`i' // no need to look at cases that have not yet occurred
replace numact2 = r(N) - 1 in `i'
}
list, sepby(empid)
This is resource intensive so if you have a large data set, it will take some time. The reason is it loops over observations checking conditions. See help stored results and help return for an explanation of r(N).
A good read is
Stata tip 51: Events in intervals, The Stata Journal, by Nicholas J. Cox.
Note how I provided an example data set within the code (see help input). That is how I recommend you do it for future questions. This will save other people's time and increase your chances of getting an answer.