Stata: Maximum number of consecutive occurrences of the same value across variables - stata

Observations in my dataset are players, and binary variables temp1 up are equal to 1 if the player made a move, and equal to zero otherwise.
I would like to to calculate the maximum number of consecutive moves per player.
+------------+------------+-------+-------+-------+-------+-------+-------+
| simulation | playerlist | temp1 | temp2 | temp3 | temp4 | temp5 | temp6 |
+------------+------------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 1 |
+------------+------------+-------+-------+-------+-------+-------+-------+
My idea was to generate auxiliary variables in a loop, which would count consecutive duplicates and then apply egen, rowmax():
+------------+------------+------+------+------+------+------+------+------+
| simulation | playerlist | aux1 | aux2 | aux3 | aux4 | aux5 | aux6 | _max |
+------------+------------+------+------+------+------+------+------+------+
| 1 | 1 | 0 | 1 | 2 | 3 | 0 | 0 | 3 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 2 | 2 |
+------------+------------+------+------+------+------+------+------+------+
I am struggling with introducing a local counter variable that would be incrementally increased by 1 if consecutive move is made, and would be reset to zero otherwise (the code below keeps auxiliary variables fixed..):
quietly forval i = 1/42 { /*42 is max number of variables temp*/
local j = 1
gen aux`i'=.
local j = `j'+1
replace aux`i'= `j' if temp`i'!=0
}

Tactical answer
You can concatenate your move* variables into a single string and look for the longest substring of 1s.
egen history = concat(move*)
gen max = 0
quietly forval j = 1/6 {
replace max = `j' if strpos(history, substr("111111", 1, `j'))
}
If the number is much more than 6, use something like
 local lookfor : di _dup(42) "1" 
quietly forval j = 1/42 {
replace max = `j' if strpos(history, substr("`lookfor'", 1, `j'))
}
Compare also http://www.stata-journal.com/article.html?article=dm0056
Strategic answer
Storing a sequence rowwise is working against the grain so far as Stata is concerned. Much more flexibility is available if you reshape long and tsset your data as panel data. Note that the code here uses tsspell which must be installed from SSC using ssc inst tsspell.
tsspell is dedicated to identifying spells or runs in which some condition remains true. Here the condition is that a variable is 1 and since the only other allowed value is 0 that is equivalent to a variable being positive. tsspell creates three variables, giving spell identifier, sequence within spell and whether the spell is ending. Here the maximum length of spell is just the maximum sequence number for each game.
. input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
simulat~n playerl~t temp1 temp2 temp3 temp4 temp5 temp6
1. 1 1 0 1 1 1 0 0
2. 1 2 1 0 0 0 1 1
3. end
. reshape long temp , i(sim playerlist) j(seq)
(note: j = 1 2 3 4 5 6)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 2 -> 12
Number of variables 8 -> 4
j variable (6 values) -> seq
xij variables:
temp1 temp2 ... temp6 -> temp
-----------------------------------------------------------------------------
. egen id = group(sim playerlist)
. tsset id seq
panel variable: id (strongly balanced)
time variable: seq, 1 to 6
delta: 1 unit
. tsspell, p(temp)
. egen max = max(_seq), by(id)
. l
+--------------------------------------------------------------------+
| simula~n player~t seq temp id _seq _spell _end max |
|--------------------------------------------------------------------|
1. | 1 1 1 0 1 0 0 0 3 |
2. | 1 1 2 1 1 1 1 0 3 |
3. | 1 1 3 1 1 2 1 0 3 |
4. | 1 1 4 1 1 3 1 1 3 |
5. | 1 1 5 0 1 0 0 0 3 |
|--------------------------------------------------------------------|
6. | 1 1 6 0 1 0 0 0 3 |
7. | 1 2 1 1 2 1 1 1 2 |
8. | 1 2 2 0 2 0 0 0 2 |
9. | 1 2 3 0 2 0 0 0 2 |
10. | 1 2 4 0 2 0 0 0 2 |
|--------------------------------------------------------------------|
11. | 1 2 5 1 2 1 2 0 2 |
12. | 1 2 6 1 2 2 2 1 2 |
+--------------------------------------------------------------------+

Related

Connect IDs based on values in rows, ignoring connection between identical IDs

This is a follow-up to my previous question: Connect IDs based on values in rows.
I would now like to consider the case, where connections between identical idb's should be classified as 0.
The output is similar to the matrix in my previous post but with diagonal elements equal to 0:
62014 62015 62016 62017 62018
62014 0 1 0 1 1
62015 1 0 0 0 0
62016 0 0 0 0 1
62017 1 0 0 0 1
62018 1 0 1 1 0
How can I do this in Stata?
You can easily change the values in the diagonal of a matrix as follows:
: B
[symmetric]
1 2 3 4 5
+---------------------+
1 | 1 |
2 | 1 1 |
3 | 0 0 1 |
4 | 1 0 0 1 |
5 | 1 0 1 1 1 |
+---------------------+
: _diag(B, 0)
: B
[symmetric]
1 2 3 4 5
+---------------------+
1 | 0 |
2 | 1 0 |
3 | 0 0 0 |
4 | 1 0 0 0 |
5 | 1 0 1 1 0 |
+---------------------+
In the context of your question, you can simply do the following:
mata: B = foo1(A)
mata: _diag(B, 0)
getmata (idb*) = B
list
+------------------------------------------------------------------------+
| idb idd1 idd2 idd3 idb1 idb2 idb3 idb4 idb5 |
|------------------------------------------------------------------------|
1. | 62014 370490 879271 1112878 0 1 0 1 1 |
2. | 62015 457013 1112878 370490 1 0 0 0 0 |
3. | 62016 341863 1366174 533773 0 0 0 0 1 |
4. | 62017 879271 327069 341596 1 0 0 0 1 |
5. | 62018 1391443 1366174 879271 1 0 1 1 0 |
+------------------------------------------------------------------------+

Check whether a range contains specific values

Var1 is given. Var2 should take value 1 if the Observation or one of the previous 5 observations is a missing value or 0. What is the Syntax for Var2?
I know how to do it with a lot of if Statements. But when I need to do it for the previous 50 observations that gets too inconvenient.
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
The question is similar to your previous --Finding the second smallest value -- which you should quote. So is this answer. rangestat is from SSC.
clear
input float(Var1 Var2)
5 0
. 1
2 1
5 1
7 1
9 1
5 1
9 0
0 1
2 1
7 1
5 1
3 1
2 1
5 0
end
gen long id = _n
gen Bad = inlist(Var1, 0, .)
rangestat (sum) Bad, int(id -5 0)
list, sepby(Bad_sum)
+----------------------------------+
| Var1 Var2 id Bad Bad_sum |
|----------------------------------|
1. | 5 0 1 0 0 |
|----------------------------------|
2. | . 1 2 1 1 |
3. | 2 1 3 0 1 |
4. | 5 1 4 0 1 |
5. | 7 1 5 0 1 |
6. | 9 1 6 0 1 |
7. | 5 1 7 0 1 |
|----------------------------------|
8. | 9 0 8 0 0 |
|----------------------------------|
9. | 0 1 9 1 1 |
10. | 2 1 10 0 1 |
11. | 7 1 11 0 1 |
12. | 5 1 12 0 1 |
13. | 3 1 13 0 1 |
14. | 2 1 14 0 1 |
|----------------------------------|
15. | 5 0 15 0 0 |
+----------------------------------+

Creating complete data in stata

I have the following purchasing data
clear
input id productid purchase
1 1 1
2 1 1
3 2 1
1 3 1
end
I want to add a row for every id-productid combo to create the following dataset
id productid purchase
1 1 1
2 1 1
3 1 0
1 2 0
2 2 0
3 2 1
1 3 1
2 3 0
3 3 0
end
I have tried a lot that has not work. This is my latest.
qui sum id, d
local obs = r(N)
expand = `obs'
levelsof productid, local(id)
local j = 1
foreach i of local id {
replace productid = `i' if `j' == id
local j = `j' + 1
}
The fillin command (see help fillin) is the tool for this task.
Starting with your sample data in memory:
fillin id productid
replace purchase = 0 if _fillin
drop _fillin
sort productid id
list, sepby(productid) abbreviate(12)
produces
+---------------------------+
| id productid purchase |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 1 |
3. | 3 1 0 |
|---------------------------|
4. | 1 2 0 |
5. | 2 2 0 |
6. | 3 2 1 |
|---------------------------|
7. | 1 3 1 |
8. | 2 3 0 |
9. | 3 3 0 |
+---------------------------+

Fill with values from an earlier time point - Stata

I am trying to generate a variable that is filled using a sequence of values starting at time==1.
The sequence changes everytime the variable rest1w changes from 0 to 1 or vice versa.
Firstly, I think I need to generate x, that is where the sequence restarts (see below example dataset). In my example, this is uniform, but in my full dataset the change varies (i.e. it does not change at every 5th observation).
list time restload trainload rest1w x in 1/15
+-----------------------------------------+
| time restload trainload rest1w x |
|-----------------------------------------|
1. | 1 .1994715 .4780615 0 1 |
2. | 2 .2077734 .471063 0 2 |
3. | 3 .2157595 .4641159 0 3 |
4. | 4 .2234298 .4572202 0 4 |
5. | 5 .2307843 .4503757 0 5 |
|-----------------------------------------|
6. | 6 .2378229 .4435827 1 1 |
7. | 7 .2445457 .436841 1 2 |
8. | 8 .2509527 .4301506 1 3 |
9. | 9 .2570438 .4235116 1 4 |
10. | 10 .2628191 .4169239 1 5 |
|-----------------------------------------|
11. | 11 .2682785 .4103876 0 1 |
12. | 12 .2734221 .4039026 0 2 |
13. | 13 .2782499 .397469 0 3 |
14. | 14 .2827618 .3910867 0 4 |
15. | 15 .2869579 .3847558 0 5 |
+-----------------------------------------+
Secondly, I need to generate a variable load. Which as per below shows how I would like to restart from time==1 everytime the sequence restarts. That is, at the second sequence where rest1w==0, load!=trainload.
The rule is that for each new sequence of 0's the value for load again goes back to the start of time (where time==1). This is demonstrated by the load values in the second sequence of 0's being exactly the same as the first sequence. In other words, where time==1, trainload==.478 then load==.478; BUT where time==11, then load==.478 (the clock essentially restarts for load so time==1) and in sequence where time==15, load==.450 (the same load as for where time==5). This is why I wanted to generate x, as I think I could just use that as my new time variable.
+-----------------------------------------+
| time restload trainload rest1w x load
|-----------------------------------------
1. | 1 .1994715 .4780615 0 1 .4780615
2. | 2 .2077734 .471063 0 2 .471063
3. | 3 .2157595 .4641159 0 3 .4641159
4. | 4 .2234298 .4572202 0 4 .4572202
5. | 5 .2307843 .4503757 0 5 .4503757
|-----------------------------------------
6. | 6 .2378229 .4435827 1 1 .1994715
7. | 7 .2445457 .436841 1 2 .2077734
8. | 8 .2509527 .4301506 1 3 .2157595
9. | 9 .2570438 .4235116 1 4 .2234298
10. | 10 .2628191 .4169239 1 5 .2307843
|-----------------------------------------
11. | 11 .2682785 .4103876 0 1 .4780615
12. | 12 .2734221 .4039026 0 2 .471063
13. | 13 .2782499 .397469 0 3 .4641159
14. | 14 .2827618 .3910867 0 4 .4572202
15. | 15 .2869579 .3847558 0 5 .4503757
+-----------------------------------------+
The below code only gives me an entry for where _n==1:
gen load==.
replace load = restload[_n==1] if rest1w==1
And I like the use of levelsof but haven't been able to get it to work (although it might work once I have generated x, but when using time it doesn't restart the sequence obviously).
gen load=.
levelsof x, local(levels)
foreach l of local levels {
replace load=trainload if rest1w==0
replace load=restload if rest1w==1
}
Thanks for any help!
I ended up cross-posting this on statalist.org and got two workable answers.
http://www.statalist.org/forums/forum/general-stata-discussion/general/1355917-fill-with-values-from-an-earlier-time-point
These were:
gen newtime = 1 if rest1w[_n - 1] != rest1w
replace newtime = newtime[_n - 1] + 1 if newtime == .
gen newload = cond(rest1w == 0, trainload[newtime], restload[newtime])
and...
gen newtime = 1
replace newtime = newtime[_n-1] + 1 if rest1w == rest1w[_n-1]
gen newload = .
replace newload = restload[newtime] if rest1w == 1
replace newload = trainload[newtime] if rest1w == 0

How can I create a trailing count for binary data in Stata?

In Stata, I currently have a data set that looks like:
I am trying to create a "trailing counter" in column B so that it looks like:
Here, the counter starts at 1 and for every time a "1" appears in A, B adds on a value.
This seems to be very simple, but I am not sure how to do this exactly. Here is what I have done so far:
Assuming the column A is called "A" in Stata,
I use:
gen B = A + A[_n - 1]
But, this gives me something off. I am not sure how to proceed, would anyone have any tips?
Here's one way:
clear all
set more off
*----- example data -----
input ///
var1
0
0
0
0
1
0
0
1
0
0
0
end
list, sep(0)
*----- what you want -----
gen counter = sum(var1) + 1
list, sep(0)
The sum() function will give you a cumulative sum. See help sum(). This is a very basic Stata function. A search sum would have gotten you there quickly.
Your approach fails because you are only adding up, for each observation, the "current" value of A with the previous value of itself. That might sound like a cumulative sum, but think about it and you will see that it isn't.
With your code and my data, the result would be:
+----------------+
| var1 counter |
|----------------|
1. | 0 . |
2. | 0 0 |
3. | 0 0 |
4. | 0 0 |
5. | 1 1 |
6. | 0 1 |
7. | 0 0 |
8. | 1 1 |
9. | 0 1 |
10. | 0 0 |
11. | 0 0 |
+----------------+
The first observation for counter is missing (.). That is because there's no previous value for the first observation of var1, so Stata does something like var1[1] + var1[0] = 0 + . = ..
The second observation for counter is var1[2] + var1[1] = 0 + 0 = 0.
The fifth observation for counter is var1[5] + var1[4] = 1 + 0 = 1.
The seventh observation for counter is var1[7] + var1[6] = 0 + 0 = 0. And so on.