Stata: Using egen, anycount() when values vary for each observation - stata

Each observation in my data presents a player who follows some random pattern. Variables move1 up represent on which moves each player was active. I need to count the number of times each player was active:
The data look as follows (with _count representing a variable that I would like to generate). The number of moves can also be different depending on simulation.
+------------+------------+-------+-------+-------+-------+-------+-------+--------+
| simulation | playerlist | move1 | move2 | move3 | move4 | move5 | move6 | _count |
+------------+------------+-------+-------+-------+-------+-------+-------+--------+
| 1 | 1 | 1 | 1 | 1 | 2 | . | . | 3 |
| 1 | 2 | 2 | 2 | 4 | 4 | . | . | 2 |
| 2 | 3 | 1 | 2 | 3 | 3 | 3 | 3 | 4 |
| 2 | 4 | 4 | 1 | 2 | 3 | 3 | 3 | 1 |
+------------+------------+-------+-------+-------+-------+-------+-------+--------+
egen combined with anycount() is not applicable in this case because the argument for the value() option is not a constant integer.
I have made an attempt to cycle through each observation and use egen rowwise (see below) but it keeps count as missing (as initialised) and is not very efficient (I have 50,000 observations). Is there a way to do this in Stata?
gen _count =.
quietly forval i = 1/`=_N' {
egen temp = anycount(move*), values( `=`playerlist'[`i']')
replace _count = temp
drop temp
}

You can easily cut out the loop over observations. In addition, egen is only to be used for convenience, never speed.
gen _count = 0
quietly forval j = 1/6 {
replace _count = _count + (move`j' == playerlist)
}
or
gen _count = move1 == playerlist
quietly forval j = 2/6 {
replace _count = _count + (move`j' == playerlist)
}
Even if you had been determined to use egen, the loop need only be over the distinct values of playerlist, not all the observations. Say the maximum is 42
gen _count = 0
quietly forval k = 1/42 {
egen temp = anycount(move*), value(`k')
replace _count = _count + temp
drop temp
}
But that's still a lousy method for your problem. (I wrote the original of anycount() so I can say why it was written.)
See also http://www.stata-journal.com/sjpdf.html?articlenum=pr0046 for a review of working rowwise.
P.S. Your code contains bugs.
You replace your count variable in all observations by the last value calculated for the count in the last observation.
Values are compared with a local macro playerlist. You presumably have no local macro of that name, so the macro is evaluated as empty. The result is that you end by comparing each value of your move* variables with the observation numbers. You meant to use the variable name playerlist, but the single quotation marks force the macro interpretation.
For the record, this fixes both bugs:
gen _count = .
quietly forval i = 1/`=_N' {
egen temp = anycount(move*), values(`= playerlist[`i']')
replace _count = temp in `i'
drop temp
}

Related

Hamming distance for grouped results

I'm working with a data set that contains 40 different participants, with each 30 observations.
As I am observing search behavior, I want to calculate the search distance for each subject per round (from 1 to30).
In order to compare my data with current literature, I need to use the Hamming distance to describe search distances.
The variable is called Inputs and is a string variable with binary inputs 0 or 1 with a length of 10. E.g:
Input Type 1 Subject 1 Round 1: 0000011111
Input Type 1 Subject 1 Round 2: 0000011110
Using the Levensthein distance, my approach was simple:
sort type_num Subject round_num
gen input_prev=Input[_n-1]
replace input_prev="0000000000" if round_num==1 //default starting position with 0000000000 to get search distance for first input in round 1
//Levensthein distance & clearing data (Levensthein instead of hamming distance)
ustrdist Input input_prev
rename strdist input_change
I am now struggling with getting the right Stata commands for the Hamming distance. Can someone help?
Does this help? As I understand it, Hamming distance is the count of characters (bits) that differ at corresponding positions of strings of equal length. So, given two variables and wanting comparisons within each observation, it is just a loop over the characters.
clear
set obs 10
set seed 2803
* sandbox
quietly forval j = 1/2 {
gen test`j' = ""
forval k = 1/10 {
replace test`j' = test`j' + strofreal(runiform() > (`j' * 0.3))
}
}
set obs 12
replace test1 = 10 * "1" in 11
replace test2 = test1 in 11
replace test1 = test1[11] in 12
replace test2 = 10 * "0" in 12
* calculation
gen wanted = 0
quietly forval k = 1/10 {
replace wanted = wanted + (substr(test1, `k', 1) != substr(test2, `k', 1))
}
list
+----------------------------------+
| test1 test2 wanted |
|----------------------------------|
1. | 1110001111 1001101000 7 |
2. | 1111011011 1101011111 2 |
3. | 1011001111 1110110111 5 |
4. | 0000111011 1011010100 8 |
5. | 1011011011 1111100110 6 |
|----------------------------------|
6. | 0011111100 0100011110 5 |
7. | 0011011011 0011111010 2 |
8. | 1010100011 1011000100 5 |
9. | 1110011011 1010010100 5 |
10. | 1001011111 0100111001 6 |
|----------------------------------|
11. | 1111111111 1111111111 0 |
12. | 1111111111 0000000000 10 |
+----------------------------------+

How to detect specific subwords in text

I have a column as a string with no spaces:
clear
input str100 var
"ihaveanewspaper"
"watchingthenewsonthetv"
"watchthenewsandreadthenewspaper"
end
I am using the following command:
gen = regex,(var, "(news)")
This outputs 1 1 1 because it finds that the 3 rows in the column var contain the word news.
I'm trying to alter the regular expression "(news)" to create two columns. One for news and one for newspaper. regexm(var, "(newspaper)") makes sure that the row contains a newspaper, but I need a command to make sure characters after news are not "paper" as I'm trying to quantify the two.
EDIT:
Is there a way to count the third entry as 1, because it has a news occurrence without however being a newspaper?
You can quantify as follows without a regular expression:
clear
input str100 var
"ihaveanewspaper"
"watchingthenewsonthetv"
"watchthenewsandreadthenewspaper"
"fdgdnews"
"fgogodigjhoigjnewspaper"
"fgeogeionnewsfgdgfpaper"
"45pap9358newsfjfgni"
end
generate news = strmatch(var, "*news*") & !strmatch(var, "*newspaper*")
list, separator(0)
+----------------------------------------+
| var news |
|----------------------------------------|
1. | ihaveanewspaper 0 |
2. | watchingthenewsonthetv 1 |
3. | watchthenewsandreadthenewspaper 0 |
4. | fdgdnews 1 |
5. | fgogodigjhoigjnewspaper 0 |
6. | fgeogeionnewsfgdgfpaper 1 |
7. | 45pap9358newsfjfgni 1 |
+----------------------------------------+
count if news
4
count if !news
3
EDIT:
One way to do this is to eliminate all instances of the word newspaper and repeat the process:
generate var2 = subinstr(var, "newspaper", "", .)
replace news = 1 if strmatch(var2, "*news*")
list, separator(0)
+------------------------------------------------------------------+
| var news var2 |
|------------------------------------------------------------------|
1. | ihaveanewspaper 0 ihavea |
2. | watchingthenewsonthetv 1 watchingthenewsonthetv |
3. | watchthenewsandreadthenewspaper 1 watchthenewsandreadthe |
4. | fdgdnews 1 fdgdnews |
5. | fgogodigjhoigjnewspaper 0 fgogodigjhoigj |
6. | fgeogeionnewsfgdgfpaper 1 fgeogeionnewsfgdgfpaper |
7. | 45pap9358newsfjfgni 1 45pap9358newsfjfgni |
+------------------------------------------------------------------+
count if news
5
count if !news
2

Browse all the rows and columns that contain a zero

Suppose I have 100 variables named ID, var1, var2, ..., var99. I have 1000 rows. I want to browse all the rows and columns that contain a 0.
I wanted to just do this:
browse ID, var* if var* == 0
but it doesn't work. I don't want to hardcode all 99 variables obviously.
I wanted to essentially write an if like this:
gen has0 = 0
forvalues n = 1/99 {
if var`n' does not contain 0 {
drop v
} // pseudocode I know doesn't work
has0 = has0 | var`n' == 0
}
browse if has0 == 1
but obviously that doesn't work.
Do I just need to reshape the data so it has 2 columns ID, var with 100,000 rows total?
My dear colleague #NickCox forces me to reply to this (duplicate) question because he is claiming that downloading, installing and running a new command is better than using built-in ones when you "need to select from 99 variables".
Consider the following toy example:
clear
input var1 var2 var3 var4 var5
1 4 9 5 0
1 8 6 3 7
0 6 5 6 8
4 5 1 8 3
2 1 0 2 1
4 6 7 1 9
end
list
+----------------------------------+
| var1 var2 var3 var4 var5 |
|----------------------------------|
1. | 1 4 9 5 0 |
2. | 1 8 6 3 7 |
3. | 0 6 5 6 8 |
4. | 4 5 1 8 3 |
5. | 2 1 0 2 1 |
6. | 4 6 7 1 9 |
+----------------------------------+
Actually you don't have to download anything:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list var obsno value if value == 0, noobs
+----------------------+
| var obsno value |
|----------------------|
| var5 1 0 |
| var1 3 0 |
| var3 5 0 |
+----------------------+
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
var1 var3 var5
restore
This is the approach i recommended in the linked question for identifying negative values. Using levelsof one can do the same thing with findname using a built-in command.
This solution can also be adapted for browse:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
browse var obsno value if value == 0
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
pause
restore
Although i do not see why one would want to browse the results when can simply list them.
EDIT:
Here's an example more closely resembling the OP's dataset:
clear
set seed 12345
set obs 1000
generate id = int((_n - 1) / 300) + 1
forvalues i = 1 / 100 {
generate var`i' = rnormal(0, 150)
}
ds var*
foreach var in `r(varlist)' {
generate rr = runiform()
replace `var' = 0 if rr < 0.0001
drop rr
}
Applying the above solution yields:
display "`selectedvars'"
var13 var19 var35 var36 var42 var86 var88 var90
list id var obsno value if value == 0, noobs sepby(id)
+----------------------------+
| id var obsno value |
|----------------------------|
| 1 var86 18 0 |
| 1 var19 167 0 |
| 1 var13 226 0 |
|----------------------------|
| 2 var88 351 0 |
| 2 var36 361 0 |
| 2 var35 401 0 |
|----------------------------|
| 3 var42 628 0 |
| 3 var90 643 0 |
+----------------------------+
Short answer: wildcards for bunches of variables can't be inserted in if qualifiers. (The if command is different from the if qualifier.)
Your question is contradictory on what you want. At one point your pseudocode has you dropping variables! drop has a clear, destructive meaning to Stata programmers: it doesn't mean "ignore".
But let's stick to the emphasis on browse.
findname, any(# == 0)
finds variables for which any value is 0. search findname, sj to find the latest downloadable version.
Note also that
findname, type(numeric)
will return the numeric variables in r(varlist) (and also a local macro if you so specify).
Then several egen functions compete for finding 0s in each observation for a specified varlist: the command findname evidently helps you identify which varlist.
Let's create a small sandbox to show technique:
clear
set obs 5
gen ID = _n
forval j = 1/5 {
gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3
list
findname var*, any(# == 0) local(which)
egen zero = anymatch(`which'), value(0)
list `which' if zero
+-------------+
| var2 var3 |
|-------------|
2. | 0 1 |
3. | 1 0 |
+-------------+
So, the problem is split into two: finding the observations with any zeros and finding the observations with any zeros, and then putting the information together.
Naturally, the use of findname is dispensable as you can just write your own loop to identify the variables of interest:
local wanted
quietly foreach v of var var* {
count if `v' == 0
if r(N) > 0 local wanted `wanted' `v'
}
Equally naturally, you can browse as well as list: the difference is just in the command name.

Row-wise count/sum of values in Stata

I have a dataset where each person (row) has values 0, 1 or . in a number of variables (columns).
I would like to create two variables. One that includes the count of all the 0 and one that has the count of all the 1 for each person (row).
In my case, there is no pattern in the variable names. For this reason I create a varlist of all the existing variables excluding the ones that need not to be counted.
+--------+--------+------+------+------+------+------+----------+--------+
| ID | region | Qa | Qb | C3 | C4 | Wa | count 0 | count 1|
+--------+--------+------+------+------+------+------+----------+--------+
| 1 | A | 1 | 1 | 1 | 1 | . | 0 | 4 |
| 2 | B | 0 | 0 | 0 | 1 | 1 | 3 | 2 |
| 3 | C | 0 | 0 | . | 0 | 0 | 4 | 0 |
| 4 | D | 1 | 1 | 1 | 1 | 0 | 0 | 4 |
+--------+--------+------+------+------+------+------+----------+--------+
The following works, however, I cannot add an if statement
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
If I change the last line with the one below, I get an error of invalid syntax.
egen count_of_1s = rowtotal(`varlist') if `v' == 1
I turned from count to summing because I thought this is a sneaky way out of the problem. I could change the values from 0,1 to 1, 2, then sum all the two values separately in two different variables and then divide accordingly in order to get the actual count of 1 or 2 per row.
I found this Stata: Using egen, anycount() when values vary for each observation however Stata freezes as my dataset is quite large (100.000 rows and 3000 columns).
Any help will be very appreciated :-)
Solution based on the response of William
* number of total valid responses (0s and 1s, excluding . )
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_nonmiss = rownonmiss(`varlist') // this counts all the 0s and 1s (namely, the non missing values)
* total numbers of 1s per row
ds ID region count_of_nonmiss, not // CAUTION: count_of_nonmiss needs not to be taken into account for this!
return list
local varlist = r(varlist)
generate count_of_1s = rowtotal(`varlist')
How about
egen count_of_nonmiss = rownonmiss(`varlist')
generate count_of_0s = count_of_nonmiss - count_of_1s
When the value of the macro varlist is substituted into your if clause, the command expands to
egen count_of_1s = rowtotal(`varlist') if Qa Qb C3 C4 Wa == 1
Clearly a syntax error.
I had the same problem to count the occurrences of specifies values in each observation across a set of variables.
I could resolve that problem in the following ways: If you want to count the occurrences of 0 in the values across x1-x2, so
clear
input id x1 x2 x3
id x1 x2 x3
1. 1 1 0 2
2. 2 2 0 2
3. 3 2 0 3
4. end
egen count2 = anycount(x1-x3), value(0)

Stata: Comparing string variables

I have two string variables that differ on one character for each observation. I need to get the position of that different character.
I have tried to use indexnot() function but it yields false results as the characters in both strings are the same.
Here is an illustrative example, and variable position is the one I am trying to get to:
+--------------+--------------+-----------+
| String 1 | String 2 | Position |
+--------------+--------------+-----------+
| 000002002000 | 000000002000 | 6 |
| 000002102000 | 000002002000 | 7 |
| 000002112000 | 000002102000 | 8 |
| 000002112020 | 000002112000 | 11 |
| 000002112120 | 000002112020 | 10 |
+--------------+--------------+-----------+
gen Position = .
quietly forval j = 1/12 {
replace Position = `j' if substr(String1, `j', 1) != substr(String2, `j', 1) & missing(Position)
}
Commentary is perhaps redundant here, but will harm no-one.
In the absence of a built-in function to do this, you need to write some code using existing commands and functions. Initialise a Position to missing (zero would do fine as an alternative). Then loop over the characters, here 1 to 12 because the example shows 12 character strings. We record the position of the first difference in characters. Note how the condition missing(Position) (Position == . if you like) restricts changes to the first difference met.
Stata loops automatically over all the observations here, so the only loop needed is over string positions.