Hamming distance for grouped results - Stata

I'm working with a dataset that contains 40 different participants, with 30 observations each.
As I am observing search behavior, I want to calculate the search distance for each subject per round (from 1 to 30).
In order to compare my data with the current literature, I need to use the Hamming distance to describe search distances.
The variable is called Input and is a string variable of length 10 containing binary digits (0 or 1), e.g.:
Input Type 1 Subject 1 Round 1: 0000011111
Input Type 1 Subject 1 Round 2: 0000011110
Using the Levenshtein distance, my approach was simple:
sort type_num Subject round_num
gen input_prev=Input[_n-1]
replace input_prev="0000000000" if round_num==1 // default starting position of 0000000000, to get a search distance for the first input in round 1
// Levenshtein distance & clearing data (Levenshtein instead of Hamming distance)
ustrdist Input input_prev
rename strdist input_change
I am now struggling with getting the right Stata commands for the Hamming distance. Can someone help?

Does this help? As I understand it, Hamming distance is the count of characters (bits) that differ at corresponding positions of strings of equal length. So, given two variables and wanting comparisons within each observation, it is just a loop over the characters.
clear
set obs 10
set seed 2803
* sandbox
quietly forval j = 1/2 {
    gen test`j' = ""
    forval k = 1/10 {
        replace test`j' = test`j' + strofreal(runiform() > (`j' * 0.3))
    }
}
set obs 12
replace test1 = 10 * "1" in 11
replace test2 = test1 in 11
replace test1 = test1[11] in 12
replace test2 = 10 * "0" in 12
* calculation
gen wanted = 0
quietly forval k = 1/10 {
    replace wanted = wanted + (substr(test1, `k', 1) != substr(test2, `k', 1))
}
list
+----------------------------------+
| test1 test2 wanted |
|----------------------------------|
1. | 1110001111 1001101000 7 |
2. | 1111011011 1101011111 2 |
3. | 1011001111 1110110111 5 |
4. | 0000111011 1011010100 8 |
5. | 1011011011 1111100110 6 |
|----------------------------------|
6. | 0011111100 0100011110 5 |
7. | 0011011011 0011111010 2 |
8. | 1010100011 1011000100 5 |
9. | 1110011011 1010010100 5 |
10. | 1001011111 0100111001 6 |
|----------------------------------|
11. | 1111111111 1111111111 0 |
12. | 1111111111 0000000000 10 |
+----------------------------------+
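Applied to the variables in the question (Input and input_prev, both length-10 strings of 0s and 1s), the same loop might look like this; the result variable hamming_dist is just an illustrative name:
gen hamming_dist = 0
quietly forval k = 1/10 {
    replace hamming_dist = hamming_dist + (substr(Input, `k', 1) != substr(input_prev, `k', 1))
}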


Capturing non-missing values row-wise and storing them in new variables

My dataset contains multiple variables called avar_1 to bvar_10, referring to the history of an individual. For some reason, the history is not always complete and there are some "gaps" (e.g. avar_1 and avar_4 are non-missing, but avar_2 and avar_3 are missing). For each individual, I want to store the first non-missing value in a new variable called var1, the second non-missing value in var2, etc., so that I have a history without missing values.
I've tried the following code
local x=1
foreach wave in a b {
    forval i=1/10 {
        capture drop var`x'
        generate var`x'=.
        capture replace var`x'=`wave'var`i' if !mi(`wave'var`i')
        if (!mi(var`x')) {
            local x=1+`x'
        }
    }
}
var1 is generated properly, but var2 only contains missings and the following variables are not generated. However, I set trace on and saw that var2 is actually replaced for all variables from avar_1 to bvar_10.
My guess is that the local x is not correctly updated, as its value changes for the whole dataset but should differ across observations.
Is that the problem and if so, how can I avoid it?
A concise concrete data example is worth more than a long explanation. Your description seems consistent with an example like this:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 id float(avar_1 avar_2 avar_3 bvar_1 bvar_2)
"A" 1 . 6 8 10
"B" 2 4 . 9 .
"C" 3 5 7 . 11
end
* 4 is specific to this example.
rename (bvar_*) (avar_#), renumber(4)
reshape long avar_, i(id) j(which)
(note: j = 1 2 3 4 5)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 15
Number of variables 6 -> 3
j variable (5 values) -> which
xij variables:
avar_1 avar_2 ... avar_5 -> avar_
-----------------------------------------------------------------------------
drop if missing(avar_)
bysort id (which) : replace which = _n
list, sepby(id)
+--------------------+
| id which avar_ |
|--------------------|
1. | A 1 1 |
2. | A 2 6 |
3. | A 3 8 |
4. | A 4 10 |
|--------------------|
5. | B 1 2 |
6. | B 2 4 |
7. | B 3 9 |
|--------------------|
8. | C 1 3 |
9. | C 2 5 |
10. | C 3 7 |
11. | C 4 11 |
+--------------------+
Positive points:
Your data layout cries out for some structure, given by a rename and especially by a reshape long. I don't give code here for a reshape wide because, for the great majority of Stata purposes, you'd be better off with this long layout.
Negative points:
!mi(var`x')
used as an if command returns whether the first value of the variable is not missing: if foo were a variable in the dataset, if !mi(foo) is evaluated as if !mi(foo[1]). That is not what you want here. See https://www.stata.com/support/faqs/programming/if-command-versus-if-qualifier/ for the full story.
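A minimal sketch of the difference, using a hypothetical variable foo:
clear
set obs 3
gen foo = cond(_n == 2, ., _n)   // foo is 1, ., 3
* if command: the expression is evaluated once, effectively on the first observation
if !mi(foo) display "this tested only !mi(foo[1])"
* if qualifier: the expression is evaluated observation by observation
count if !mi(foo)                // counts 2 observations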
I'd recommend more evocative variable names.

Retrieve row number according to value of another variable in index

This problem is very simple in R, but I can't seem to get it to work in Stata.
I want to use a square-bracket subscript, but with an expression in it that involves another variable; i.e. for a variable cumul with unique values I want:
replace country = country[cumul==20] in 12
cumul == 20 corresponds to row number 638 in the dataset, so the above should replace the country variable in line 12 with the value of that same variable in line 638. The above expression is clearly not the right way to do it: it just replaces the country variable in line 12 with a missing value.
Stata's row indexing does not work in that way. What you can do, however, is a simple two-line solution:
levelsof country if cumul==20
replace country = "`r(levels)'" in 12
If you want to be sure that cumul==20 uniquely identifies just a single value of country, add:
assert `:word count `r(levels)''==1
between the two lines.
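Put together, the three lines read:
levelsof country if cumul==20
assert `:word count `r(levels)''==1
replace country = "`r(levels)'" in 12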
It's probably worth explaining why the construct in the question doesn't work as you wish, beyond "Stata is not R!".
Given a variable x, in a reference like x[1] the [1] is referred to as a subscript, despite nothing being written below the line. The subscript is the observation number, always counted in the dataset as currently held in memory.
Stata allows expressions within subscripts; they are evaluated observation by observation and the result is then used to look up values in variables. Consider this sandbox:
clear
input float y
1
2
3
4
5
end
. gen foo = y[mod(_n, 2)]
(2 missing values generated)
. gen x = 3
. gen bar = y[y == x]
(4 missing values generated)
. list
+-------------------+
| y foo x bar |
|-------------------|
1. | 1 1 3 . |
2. | 2 . 3 . |
3. | 3 1 3 1 |
4. | 4 . 3 . |
5. | 5 1 3 . |
+-------------------+
mod(_n, 2) is the remainder on dividing the observation number _n by 2: that is 1 for odd observation numbers and 0 for even numbers. Observation 0 is not in the dataset (Stata starts indexing at 1). It's not an error to refer to values in that observation, but the result is returned as missing (numeric missing here, and the empty string "" if the variable is a string). Hence foo is y[1], or 1, for odd observation numbers and missing for even numbers.
True-or-false expressions are evaluated as 1 if true and 0 if false. Thus y == x is true only in observation 3, and so bar is the value of y[1] there and missing everywhere else. Stata doesn't have the special (and useful) twist in R whereby the subscripts for which a true-or-false expression is true are used to select zero or more values.
There are ways of using subscripts to get special effects. This example shows one. (It's much easier to get the same kind of result in Mata.)
. gen random = runiform()
. sort random
. gen obs = _n
. sort y
. gen randomsorted = random[obs]
. l
+-----------------------------------------------+
| y foo x bar random obs random~d |
|-----------------------------------------------|
1. | 1 1 3 . .3488717 4 .0285569 |
2. | 2 . 3 . .2668857 3 .1366463 |
3. | 3 1 3 1 .1366463 2 .2668857 |
4. | 4 . 3 . .0285569 1 .3488717 |
5. | 5 1 3 . .8689333 5 .8689333 |
+-----------------------------------------------+
This answer doesn't cover matrices in Stata or Mata.

Row-wise count/sum of values in Stata

I have a dataset where each person (row) has values 0, 1 or . in a number of variables (columns).
I would like to create two variables: one holding the count of all the 0s and one holding the count of all the 1s for each person (row).
In my case, there is no pattern in the variable names. For this reason, I create a varlist of all the existing variables, excluding the ones that should not be counted.
+--------+--------+------+------+------+------+------+----------+--------+
| ID | region | Qa | Qb | C3 | C4 | Wa | count 0 | count 1|
+--------+--------+------+------+------+------+------+----------+--------+
| 1 | A | 1 | 1 | 1 | 1 | . | 0 | 4 |
| 2 | B | 0 | 0 | 0 | 1 | 1 | 3 | 2 |
| 3 | C | 0 | 0 | . | 0 | 0 | 4 | 0 |
| 4 | D | 1 | 1 | 1 | 1 | 0 | 0 | 4 |
+--------+--------+------+------+------+------+------+----------+--------+
The following works; however, I cannot add an if statement:
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
If I change the last line to the one below, I get an invalid syntax error.
egen count_of_1s = rowtotal(`varlist') if `v' == 1
I switched from counting to summing because I thought it was a sneaky way around the problem: recode the values from 0/1 to 1/2, sum each of the two values separately into two different variables, and then divide accordingly to get the actual count of 1s or 2s per row.
I found this: Stata: Using egen, anycount() when values vary for each observation. However, Stata freezes as my dataset is quite large (100,000 rows and 3,000 columns).
Any help will be very appreciated :-)
Solution based on the response of William
* number of total valid responses (0s and 1s, excluding . )
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_nonmiss = rownonmiss(`varlist') // this counts all the 0s and 1s (namely, the non missing values)
* total numbers of 1s per row
ds ID region count_of_nonmiss, not // CAUTION: count_of_nonmiss must not be included in this varlist!
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
How about
egen count_of_nonmiss = rownonmiss(`varlist')
generate count_of_0s = count_of_nonmiss - count_of_1s
When the value of the macro varlist is substituted into your if clause, the command expands to
egen count_of_1s = rowtotal(`varlist') if Qa Qb C3 C4 Wa == 1
Clearly a syntax error.
I had the same problem, counting the occurrences of specific values in each observation across a set of variables.
I resolved it as follows. If you want to count the occurrences of 0 across x1-x3:
clear
input id x1 x2 x3
1 1 0 2
2 2 0 2
3 2 0 3
end
egen count2 = anycount(x1-x3), value(0)
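For the dataset in the question, the same idea could be combined with the ds approach above; a sketch (as the question notes, anycount may be slow on a very large dataset, in which case the rownonmiss/rowtotal route is preferable):
ds ID region, not                  // all variables apart from ID and region
local varlist `r(varlist)'
egen count_of_0s = anycount(`varlist'), values(0)
egen count_of_1s = anycount(`varlist'), values(1)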

Stata: How to count the number of 'active' cases in a group when new case is opened?

I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting
invalid syntax
For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to the NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less than the new case's OpenDate and (2) its CloseDate is greater than the new case's OpenDate.
The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!
http://i.stack.imgur.com/z4iyR.jpg
EDIT
Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.
by EmpID: generate CaseNum = _n
egen group = group(EmpID)
su group, meanonly
gen NumActiveWhenOpen = 0
forvalues i = 1/ 'r(max)' {
forvalues x = 1/CaseNum if group == `i'{
count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n]
}
}
This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.
I'll steal part of Roberto Ferrer's sandbox.
clear
set more off
input ///
caseid str15(open close) empid
1 "1/1/2010" "3/1/2010" 1
2 "2/5/2010" "" 1
3 "2/15/2010" "4/7/2010" 1
4 "3/5/2010" "" 1
5 "3/15/2010" "6/15/2010" 1
6 "3/24/2010" "3/24/2010" 1
1 "1/1/2010" "3/1/2010" 2
2 "2/5/2010" "" 2
3 "2/15/2010" "4/7/2010" 2
4 "3/5/2010" "" 2
5 "3/15/2010" "6/15/2010" 2
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]
l
The idea is to reshape so that each pair of dates becomes two observations. Then, if we code each opening as 1 and each closing as -1, the total number of active cases is their cumulative sum. That's all. Here are the results:
. l, sepby(empid)
+---------------------------------------------+
| empid caseid status d nopen |
|---------------------------------------------|
1. | 1 1 1 01jan2010 1 |
2. | 1 2 1 05feb2010 2 |
3. | 1 3 1 15feb2010 3 |
4. | 1 1 -1 01mar2010 2 |
5. | 1 4 1 05mar2010 3 |
6. | 1 5 1 15mar2010 4 |
7. | 1 6 1 24mar2010 4 |
8. | 1 6 -1 24mar2010 4 |
9. | 1 3 -1 07apr2010 3 |
10. | 1 5 -1 15jun2010 2 |
11. | 1 2 . . 2 |
12. | 1 4 . . 2 |
|---------------------------------------------|
13. | 2 1 1 01jan2010 1 |
14. | 2 2 1 05feb2010 2 |
15. | 2 3 1 15feb2010 3 |
16. | 2 1 -1 01mar2010 2 |
17. | 2 4 1 05mar2010 3 |
18. | 2 5 1 15mar2010 4 |
19. | 2 3 -1 07apr2010 3 |
20. | 2 5 -1 15jun2010 2 |
21. | 2 4 . . 2 |
22. | 2 2 . . 2 |
+---------------------------------------------+
The bottom line is that no loops are needed, but by: helps mightily. A useful detail here is that the cumulative sum function sum() ignores missings.
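A quick sandbox confirming that sum() treats missings as zero contributions (the variable z is just for illustration):
clear
input z
1
.
-1
end
gen runsum = sum(z)   // 1, 1, 0
list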
Try something along the lines of
clear
set more off
*----- example data -----
input ///
caseid str15(open close) empid numact
1 "1/1/2010" "3/1/2010" 1 0
2 "2/5/2010" "" 1 1
3 "2/15/2010" "4/7/2010" 1 2
4 "3/5/2010" "" 1 2
5 "3/15/2010" "6/15/2010" 1 3
6 "3/24/2010" "3/24/2010" 1 .
1 "1/1/2010" "3/1/2010" 2 0
2 "2/5/2010" "" 2 1
3 "2/15/2010" "4/7/2010" 2 2
4 "3/5/2010" "" 2 2
5 "3/15/2010" "6/15/2010" 2 3
end
gen opend = date(open, "MDY")
gen closed = date(close, "MDY")
format %td opend closed
drop open close
order empid
list, sepby(empid)
*----- what you want -----
gen numact2 = .
sort empid caseid
forvalues i = 1/`=_N' {
    count if empid[`i'] == empid & /// a different count for each employee
        opend[`i'] <= closed       /// the date condition
        in 1/`i'                   // no need to look at cases that have not yet occurred
    replace numact2 = r(N) - 1 in `i'
}
list, sepby(empid)
This is resource intensive, so if you have a large dataset, it will take some time. The reason is that it loops over observations checking conditions. See help stored results and help return for an explanation of r(N).
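For reference, count leaves its result behind in r(N), which is what the loop reads back on the next line; for example:
count if empid == 1   // using the example data above
display r(N)          // number of observations just counted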
A good read is
Stata tip 51: Events in intervals, The Stata Journal, by Nicholas J. Cox.
Note how I provided an example dataset within the code (see help input). That is how I recommend you do it for future questions. This will save other people's time and increase the probability of your getting an answer.

How can I create a trailing count for binary data in Stata?

In Stata, I currently have a data set that looks like:
I am trying to create a "trailing counter" in column B so that it looks like:
Here, the counter starts at 1 and, every time a "1" appears in A, B increases by 1.
This seems to be very simple, but I am not sure how to do this exactly. Here is what I have done so far:
Assuming the column A is called "A" in Stata,
I use:
gen B = A + A[_n - 1]
But this gives me something that is off. I am not sure how to proceed; would anyone have any tips?
Here's one way:
clear all
set more off
*----- example data -----
input ///
var1
0
0
0
0
1
0
0
1
0
0
0
end
list, sep(0)
*----- what you want -----
gen counter = sum(var1) + 1
list, sep(0)
The sum() function will give you a cumulative sum. See help sum(). This is a very basic Stata function. A search sum would have gotten you there quickly.
Your approach fails because you are only adding up, for each observation, the "current" value of A with the previous value of itself. That might sound like a cumulative sum, but think about it and you will see that it isn't.
With your code and my data, the result would be:
+----------------+
| var1 counter |
|----------------|
1. | 0 . |
2. | 0 0 |
3. | 0 0 |
4. | 0 0 |
5. | 1 1 |
6. | 0 1 |
7. | 0 0 |
8. | 1 1 |
9. | 0 1 |
10. | 0 0 |
11. | 0 0 |
+----------------+
The first observation for counter is missing (.). That is because there's no previous value for the first observation of var1, so Stata does something like var1[1] + var1[0] = 0 + . = ..
The second observation for counter is var1[2] + var1[1] = 0 + 0 = 0.
The fifth observation for counter is var1[5] + var1[4] = 1 + 0 = 1.
The seventh observation for counter is var1[7] + var1[6] = 0 + 0 = 0. And so on.
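For contrast, both calculations can be run in the same sandbox on the example data above; the names cumcounter and naive are just for illustration:
gen cumcounter = sum(var1) + 1    // the running count suggested above
gen naive = var1 + var1[_n-1]     // the question's formula
list var1 cumcounter naive, sep(0)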