How can I create a trailing count for binary data in Stata? - stata

In Stata, I currently have a data set that looks like:
I am trying to create a "trailing counter" in column B so that it looks like:
Here, the counter starts at 1, and every time a "1" appears in A, B increases by 1.
This seems to be very simple, but I am not sure how to do this exactly. Here is what I have done so far:
Assuming the column A is called "A" in Stata,
I use:
gen B = A + A[_n - 1]
But this gives me the wrong result. I am not sure how to proceed; would anyone have any tips?

Here's one way:
clear all
set more off
*----- example data -----
input ///
var1
0
0
0
0
1
0
0
1
0
0
0
end
list, sep(0)
*----- what you want -----
gen counter = sum(var1) + 1
list, sep(0)
The sum() function will give you a cumulative sum. See help sum(). This is a very basic Stata function. A search sum would have gotten you there quickly.
Your approach fails because you are only adding up, for each observation, the "current" value of A with the previous value of itself. That might sound like a cumulative sum, but think about it and you will see that it isn't.
With your code and my data, the result would be:
+----------------+
| var1 counter |
|----------------|
1. | 0 . |
2. | 0 0 |
3. | 0 0 |
4. | 0 0 |
5. | 1 1 |
6. | 0 1 |
7. | 0 0 |
8. | 1 1 |
9. | 0 1 |
10. | 0 0 |
11. | 0 0 |
+----------------+
The first observation for counter is missing (.). That is because there's no previous value for the first observation of var1: a reference to observation 0 returns missing, so Stata evaluates var1[1] + var1[0] = 0 + . = . (missing).
The second observation for counter is var1[2] + var1[1] = 0 + 0 = 0.
The fifth observation for counter is var1[5] + var1[4] = 1 + 0 = 1.
The seventh observation for counter is var1[7] + var1[6] = 0 + 0 = 0. And so on.
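To see the difference concretely, a running total can be written out by hand: each value has to build on the previous value of the counter itself, not on the previous value of var1. A minimal sketch, assuming a shortened version of the example data; the variable name byhand is made up for illustration:

```stata
clear
input var1
0
0
1
0
1
end

* cumulative sum via the built-in function
gen counter = sum(var1) + 1

* the same thing by hand: start at var1[1] + 1, then add var1 to the
* previous value of byhand (replace works through observations in order)
gen byhand = var1 + 1 in 1
replace byhand = byhand[_n-1] + var1 in 2/l

* both constructions should agree in every observation
assert byhand == counter
list, sep(0)
```

In practice sum() is the shorter and clearer choice; the hand-written version is only there to show what "cumulative" means.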

Related

Using date range to populate year variables

I have three variables containing a designation date, a last update date, and a geographic identifier (see below)
designation designationupdate fipscode
11/2/12 9/10/21 10001
5/15/02 6/29/12 10005
11/6/12 7/7/21 10005
12/15/20 9/22/22 10005
10/4/22 10/4/22 1001
7/14/97 2/4/10 1001
I am trying to create separate year variables that take on a value of 1 if the date range includes that year and 0 if not (see below).
designation designationupdate fipscode yr2000 yr2001 yr2002 yr2003
01/01/2000 01/01/2002 3004 1 1 1 0
Is there a way to do this?
Your data example is helpful but needs surgery to be useful. See the stata tag wiki for detailed advice about asking Stata questions.
This layout (called "wide") is likely to prove awkward for many Stata needs, but here is what you ask for.
* Example generated by -dataex-.
clear
input float(date1 date2)
19299 22533
15475 19173
19303 22468
22264 22910
22922 22922
13709 18297
end
format %td date1
format %td date2
forval y = 2000/2003 {
gen yr`y' = inrange(`y', year(date1), year(date2))
}
list , sep(0)
+-----------------------------------------------------------+
| date1 date2 yr2000 yr2001 yr2002 yr2003 |
|-----------------------------------------------------------|
1. | 02nov2012 10sep2021 0 0 0 0 |
2. | 15may2002 29jun2012 0 0 1 1 |
3. | 06nov2012 07jul2021 0 0 0 0 |
4. | 15dec2020 22sep2022 0 0 0 0 |
5. | 04oct2022 04oct2022 0 0 0 0 |
6. | 14jul1997 04feb2010 1 1 1 1 |
+-----------------------------------------------------------+
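If the wide layout does prove awkward, one possible alternative is a long layout with one observation per row and year covered. This is a sketch only, assuming date2 is never earlier than date1; the names id, nyears and year are hypothetical and not part of the original data:

```stata
clear
input float(date1 date2)
19299 22533
15475 19173
end
format %td date1 date2

gen long id = _n                               // hypothetical row identifier
gen nyears = year(date2) - year(date1) + 1     // number of years the range covers
expand nyears                                  // one copy of each row per year
bysort id : gen year = year(date1) + _n - 1    // label each copy with its year

list id date1 date2 year, sepby(id)
```

With this layout, a question such as "which ranges include 2002?" becomes a simple condition on year rather than a lookup across many indicator variables.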

Fill with values from an earlier time point - Stata

I am trying to generate a variable that is filled using a sequence of values starting at time==1.
The sequence restarts every time the variable rest1w changes from 0 to 1 or vice versa.
First, I think I need to generate x, which marks where the sequence restarts (see the example dataset below). In my example the runs are uniform, but in my full dataset their length varies (i.e. the change does not happen at every 5th observation).
list time restload trainload rest1w x in 1/15
+-----------------------------------------+
| time restload trainload rest1w x |
|-----------------------------------------|
1. | 1 .1994715 .4780615 0 1 |
2. | 2 .2077734 .471063 0 2 |
3. | 3 .2157595 .4641159 0 3 |
4. | 4 .2234298 .4572202 0 4 |
5. | 5 .2307843 .4503757 0 5 |
|-----------------------------------------|
6. | 6 .2378229 .4435827 1 1 |
7. | 7 .2445457 .436841 1 2 |
8. | 8 .2509527 .4301506 1 3 |
9. | 9 .2570438 .4235116 1 4 |
10. | 10 .2628191 .4169239 1 5 |
|-----------------------------------------|
11. | 11 .2682785 .4103876 0 1 |
12. | 12 .2734221 .4039026 0 2 |
13. | 13 .2782499 .397469 0 3 |
14. | 14 .2827618 .3910867 0 4 |
15. | 15 .2869579 .3847558 0 5 |
+-----------------------------------------+
Second, I need to generate a variable load, which restarts from time==1 every time the sequence restarts, as shown below. That is, in the second sequence where rest1w==0, load!=trainload.
The rule is that for each new sequence of 0s the value of load goes back to the start of time (where time==1). This is demonstrated by the load values in the second sequence of 0s being exactly the same as in the first sequence. In other words, where time==1, trainload==.478 and load==.478; but where time==11, load==.478 as well (the clock essentially restarts for load, so time==1), and where time==15, load==.450 (the same load as where time==5). This is why I wanted to generate x, as I think I could just use that as my new time variable.
+----------------------------------------------------+
| time restload trainload rest1w x load |
|----------------------------------------------------|
1. | 1 .1994715 .4780615 0 1 .4780615 |
2. | 2 .2077734 .471063 0 2 .471063 |
3. | 3 .2157595 .4641159 0 3 .4641159 |
4. | 4 .2234298 .4572202 0 4 .4572202 |
5. | 5 .2307843 .4503757 0 5 .4503757 |
|----------------------------------------------------|
6. | 6 .2378229 .4435827 1 1 .1994715 |
7. | 7 .2445457 .436841 1 2 .2077734 |
8. | 8 .2509527 .4301506 1 3 .2157595 |
9. | 9 .2570438 .4235116 1 4 .2234298 |
10. | 10 .2628191 .4169239 1 5 .2307843 |
|----------------------------------------------------|
11. | 11 .2682785 .4103876 0 1 .4780615 |
12. | 12 .2734221 .4039026 0 2 .471063 |
13. | 13 .2782499 .397469 0 3 .4641159 |
14. | 14 .2827618 .3910867 0 4 .4572202 |
15. | 15 .2869579 .3847558 0 5 .4503757 |
+----------------------------------------------------+
The below code only gives me an entry for where _n==1:
gen load = .
replace load = restload[_n==1] if rest1w==1
And I like the use of levelsof but haven't been able to get it to work (although it might work once I have generated x, but when using time it doesn't restart the sequence obviously).
gen load=.
levelsof x, local(levels)
foreach l of local levels {
replace load=trainload if rest1w==0
replace load=restload if rest1w==1
}
Thanks for any help!
I ended up cross-posting this on statalist.org and got two workable answers.
http://www.statalist.org/forums/forum/general-stata-discussion/general/1355917-fill-with-values-from-an-earlier-time-point
These were:
gen newtime = 1 if rest1w[_n - 1] != rest1w
replace newtime = newtime[_n - 1] + 1 if newtime == .
gen newload = cond(rest1w == 0, trainload[newtime], restload[newtime])
and...
gen newtime = 1
replace newtime = newtime[_n-1] + 1 if rest1w == rest1w[_n-1]
gen newload = .
replace newload = restload[newtime] if rest1w == 1
replace newload = trainload[newtime] if rest1w == 0
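Both answers build the same restarted clock and then use it as an explicit subscript into trainload or restload. A small sketch to check that the two constructions agree when runs have unequal length, as described in the question; the rest1w values and the names newtime1 and newtime2 are made up for illustration:

```stata
clear
input byte rest1w
0
0
0
1
1
0
0
0
0
end

* first answer: mark the start of each run, then count upward from it
gen newtime1 = 1 if rest1w[_n-1] != rest1w
replace newtime1 = newtime1[_n-1] + 1 if newtime1 == .

* second answer: start everyone at 1, then count upward within a run
gen newtime2 = 1
replace newtime2 = newtime2[_n-1] + 1 if rest1w == rest1w[_n-1]

* the two clocks should be identical
assert newtime1 == newtime2
list, sepby(rest1w)
```

In the first observation rest1w[_n-1] refers to observation 0 and is missing, which is why both versions correctly start the clock at 1 there.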

Row-wise count/sum of values in Stata

I have a dataset where each person (row) has values 0, 1 or . in a number of variables (columns).
I would like to create two variables: one that holds the count of all the 0s and one that holds the count of all the 1s for each person (row).
In my case, there is no pattern in the variable names. For this reason I create a varlist of all the existing variables, excluding the ones that should not be counted.
+--------+--------+------+------+------+------+------+----------+--------+
| ID | region | Qa | Qb | C3 | C4 | Wa | count 0 | count 1|
+--------+--------+------+------+------+------+------+----------+--------+
| 1 | A | 1 | 1 | 1 | 1 | . | 0 | 4 |
| 2 | B | 0 | 0 | 0 | 1 | 1 | 3 | 2 |
| 3 | C | 0 | 0 | . | 0 | 0 | 4 | 0 |
| 4 | D | 1 | 1 | 1 | 1 | 0 | 1 | 4 |
+--------+--------+------+------+------+------+------+----------+--------+
The following works; however, I cannot add an if qualifier:
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
If I replace the last line with the one below, I get an "invalid syntax" error.
egen count_of_1s = rowtotal(`varlist') if `v' == 1
I turned from counting to summing because I thought it would be a sneaky way out of the problem. I could change the values from 0, 1 to 1, 2, then sum the two values separately in two different variables and divide accordingly to get the actual count of 1s or 2s per row.
I found the question "Stata: Using egen, anycount() when values vary for each observation"; however, Stata freezes as my dataset is quite large (100,000 rows and 3,000 columns).
Any help will be very appreciated :-)
Solution based on the response of William
* number of total valid responses (0s and 1s, excluding . )
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_nonmiss = rownonmiss(`varlist') // this counts all the 0s and 1s (namely, the non missing values)
* total numbers of 1s per row
ds ID region count_of_nonmiss, not // CAUTION: count_of_nonmiss must be excluded here, or it would be added into the total
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
How about
egen count_of_nonmiss = rownonmiss(`varlist')
generate count_of_0s = count_of_nonmiss - count_of_1s
When the value of the macro varlist is substituted into your if clause, the command expands to
egen count_of_1s = rowtotal(`varlist') if Qa Qb C3 C4 Wa == 1
Clearly a syntax error.
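An if qualifier takes a single expression, so a per-variable test such as "equals 1" has to be applied to one variable at a time. A minimal sketch of that idea with an explicit loop, assuming the first three rows of the example data; the macro name vars is made up for illustration, and this is not the egen route used elsewhere in this thread:

```stata
clear
input ID str1 region Qa Qb C3 C4 Wa
1 "A" 1 1 1 1 .
2 "B" 0 0 0 1 1
3 "C" 0 0 . 0 0
end

* collect the variables to be counted before any new ones are created
ds ID region, not
local vars `r(varlist)'

gen count_of_1s = 0
gen count_of_0s = 0
foreach v of local vars {
    * a true comparison evaluates to 1, a false one to 0;
    * missing values match neither test and so add nothing
    replace count_of_1s = count_of_1s + (`v' == 1)
    replace count_of_0s = count_of_0s + (`v' == 0)
}
list, sep(0)
```

This makes one pass over the data per variable, so on a very wide dataset it may still take a while.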
I had the same problem: counting the occurrences of specified values in each observation across a set of variables.
I resolved it as follows. If you want to count the occurrences of 0 across x1-x3:
clear
input id x1 x2 x3
1 1 0 2
2 2 0 2
3 2 0 3
end
egen count2 = anycount(x1-x3), values(0)

Stata: How to count the number of 'active' cases in a group when new case is opened?

I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting
invalid syntax
For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to the NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less than the new case's OpenDate and (2) its CloseDate is greater than the new case's OpenDate.
The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!
http://i.stack.imgur.com/z4iyR.jpg
EDIT
Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.
by EmpID: generate CaseNum = _n
egen group = group(EmpID)
su group, meanonly
gen NumActiveWhenOpen = 0
forvalues i = 1/ 'r(max)' {
forvalues x = 1/CaseNum if group == `i'{
count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n]
}
}
This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.
I'll steal part of Roberto Ferrer's sandbox.
clear
set more off
input ///
caseid str15(open close) empid
1 "1/1/2010" "3/1/2010" 1
2 "2/5/2010" "" 1
3 "2/15/2010" "4/7/2010" 1
4 "3/5/2010" "" 1
5 "3/15/2010" "6/15/2010" 1
6 "3/24/2010" "3/24/2010" 1
1 "1/1/2010" "3/1/2010" 2
2 "2/5/2010" "" 2
3 "2/15/2010" "4/7/2010" 2
4 "3/5/2010" "" 2
5 "3/15/2010" "6/15/2010" 2
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]
l
The idea is to reshape so that each pair of dates becomes two observations. Then if we code each opening by 1 and each closing by -1 the total number of active cases is their cumulative sum. That's all. Here are the results:
. l, sepby(empid)
+---------------------------------------------+
| empid caseid status d nopen |
|---------------------------------------------|
1. | 1 1 1 01jan2010 1 |
2. | 1 2 1 05feb2010 2 |
3. | 1 3 1 15feb2010 3 |
4. | 1 1 -1 01mar2010 2 |
5. | 1 4 1 05mar2010 3 |
6. | 1 5 1 15mar2010 4 |
7. | 1 6 1 24mar2010 4 |
8. | 1 6 -1 24mar2010 4 |
9. | 1 3 -1 07apr2010 3 |
10. | 1 5 -1 15jun2010 2 |
11. | 1 2 . . 2 |
12. | 1 4 . . 2 |
|---------------------------------------------|
13. | 2 1 1 01jan2010 1 |
14. | 2 2 1 05feb2010 2 |
15. | 2 3 1 15feb2010 3 |
16. | 2 1 -1 01mar2010 2 |
17. | 2 4 1 05mar2010 3 |
18. | 2 5 1 15mar2010 4 |
19. | 2 3 -1 07apr2010 3 |
20. | 2 5 -1 15jun2010 2 |
21. | 2 4 . . 2 |
22. | 2 2 . . 2 |
+---------------------------------------------+
The bottom line is no loops needed, but by: helps mightily. A detail useful here is that the cumulative sum function sum() ignores missings.
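The point about missings can be checked on a tiny made-up example: sum() treats a missing value as if it were zero, so the running total simply carries over. The variable names x and running are hypothetical:

```stata
clear
input x
1
.
2
end

gen running = sum(x)

* the missing value in observation 2 leaves the total unchanged
assert running == 1 in 2
assert running == 3 in 3
list, sep(0)
```

This is why the observations with a missing close date above keep the last value of nopen instead of turning missing.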
Try something along the lines of
clear
set more off
*----- example data -----
input ///
caseid str15(open close) empid numact
1 "1/1/2010" "3/1/2010" 1 0
2 "2/5/2010" "" 1 1
3 "2/15/2010" "4/7/2010" 1 2
4 "3/5/2010" "" 1 2
5 "3/15/2010" "6/15/2010" 1 3
6 "3/24/2010" "3/24/2010" 1 .
1 "1/1/2010" "3/1/2010" 2 0
2 "2/5/2010" "" 2 1
3 "2/15/2010" "4/7/2010" 2 2
4 "3/5/2010" "" 2 2
5 "3/15/2010" "6/15/2010" 2 3
end
gen opend = date(open, "MDY")
gen closed = date(close, "MDY")
format %td opend closed
drop open close
order empid
list, sepby(empid)
*----- what you want -----
gen numact2 = .
sort empid caseid
forvalues i = 1/`=_N' {
count if empid[`i'] == empid & /// a different count for each employee
opend[`i'] <= closed /// the date condition
in 1/`i' // no need to look at cases that have not yet occurred
replace numact2 = r(N) - 1 in `i'
}
list, sepby(empid)
This is resource intensive so if you have a large data set, it will take some time. The reason is it loops over observations checking conditions. See help stored results and help return for an explanation of r(N).
A good read is
Stata tip 51: Events in intervals, The Stata Journal, by Nicholas J. Cox.
Note how I provided an example data set within the code (see help input). That is how I recommend you do it for future questions. This will save other people's time and increase your chances of getting an answer.

Stata: Maximum number of consecutive occurrences of the same value across variables

Observations in my dataset are players, and the binary variables temp1, temp2, and so on are equal to 1 if the player made a move, and equal to zero otherwise.
I would like to calculate the maximum number of consecutive moves per player.
+------------+------------+-------+-------+-------+-------+-------+-------+
| simulation | playerlist | temp1 | temp2 | temp3 | temp4 | temp5 | temp6 |
+------------+------------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 1 |
+------------+------------+-------+-------+-------+-------+-------+-------+
My idea was to generate auxiliary variables in a loop, which would count consecutive duplicates and then apply egen, rowmax():
+------------+------------+------+------+------+------+------+------+------+
| simulation | playerlist | aux1 | aux2 | aux3 | aux4 | aux5 | aux6 | _max |
+------------+------------+------+------+------+------+------+------+------+
| 1 | 1 | 0 | 1 | 2 | 3 | 0 | 0 | 3 |
| 1 | 2 | 1 | 0 | 0 | 0 | 1 | 2 | 2 |
+------------+------------+------+------+------+------+------+------+------+
I am struggling to introduce a local counter that is incremented by 1 when a consecutive move is made and reset to zero otherwise (the code below leaves the auxiliary variables constant):
quietly forval i = 1/42 { /*42 is max number of variables temp*/
local j = 1
gen aux`i'=.
local j = `j'+1
replace aux`i'= `j' if temp`i'!=0
}
Tactical answer
You can concatenate your temp* variables into a single string and look for the longest substring of 1s.
egen history = concat(temp*)
gen max = 0
quietly forval j = 1/6 {
replace max = `j' if strpos(history, substr("111111", 1, `j'))
}
If the number of variables is much larger than 6, use something like
local lookfor : di _dup(42) "1"
quietly forval j = 1/42 {
replace max = `j' if strpos(history, substr("`lookfor'", 1, `j'))
}
Compare also http://www.stata-journal.com/article.html?article=dm0056
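The counter the question was aiming for can also be written directly. A local macro holds one value for the whole dataset, so it cannot track a different count for each player; keeping the counter in a variable gives every observation its own. A minimal sketch, assuming the two example rows from the question; the variable names run and maxrun are made up for illustration:

```stata
clear
input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
1 1 0 1 1 1 0 0
1 2 1 0 0 0 1 1
end

gen run = 0        // length of the current run of 1s, per player
gen maxrun = 0     // longest run seen so far, per player
quietly forval i = 1/6 {
    * extend the run on a move, reset it to zero otherwise
    replace run = cond(temp`i' == 1, run + 1, 0)
    replace maxrun = max(maxrun, run)
}

* the results should match the _max column in the question
assert maxrun == 3 in 1
assert maxrun == 2 in 2
list simulation playerlist maxrun
```

With 42 variables, only the upper limit of the loop needs to change.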
Strategic answer
Storing a sequence rowwise is working against the grain so far as Stata is concerned. Much more flexibility is available if you reshape long and tsset your data as panel data. Note that the code here uses tsspell which must be installed from SSC using ssc inst tsspell.
tsspell is dedicated to identifying spells or runs in which some condition remains true. Here the condition is that a variable is 1 and since the only other allowed value is 0 that is equivalent to a variable being positive. tsspell creates three variables, giving spell identifier, sequence within spell and whether the spell is ending. Here the maximum length of spell is just the maximum sequence number for each game.
. input simulation playerlist temp1 temp2 temp3 temp4 temp5 temp6
simulat~n playerl~t temp1 temp2 temp3 temp4 temp5 temp6
1. 1 1 0 1 1 1 0 0
2. 1 2 1 0 0 0 1 1
3. end
. reshape long temp , i(sim playerlist) j(seq)
(note: j = 1 2 3 4 5 6)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 2 -> 12
Number of variables 8 -> 4
j variable (6 values) -> seq
xij variables:
temp1 temp2 ... temp6 -> temp
-----------------------------------------------------------------------------
. egen id = group(sim playerlist)
. tsset id seq
panel variable: id (strongly balanced)
time variable: seq, 1 to 6
delta: 1 unit
. tsspell, p(temp)
. egen max = max(_seq), by(id)
. l
+--------------------------------------------------------------------+
| simula~n player~t seq temp id _seq _spell _end max |
|--------------------------------------------------------------------|
1. | 1 1 1 0 1 0 0 0 3 |
2. | 1 1 2 1 1 1 1 0 3 |
3. | 1 1 3 1 1 2 1 0 3 |
4. | 1 1 4 1 1 3 1 1 3 |
5. | 1 1 5 0 1 0 0 0 3 |
|--------------------------------------------------------------------|
6. | 1 1 6 0 1 0 0 0 3 |
7. | 1 2 1 1 2 1 1 1 2 |
8. | 1 2 2 0 2 0 0 0 2 |
9. | 1 2 3 0 2 0 0 0 2 |
10. | 1 2 4 0 2 0 0 0 2 |
|--------------------------------------------------------------------|
11. | 1 2 5 1 2 1 2 0 2 |
12. | 1 2 6 1 2 2 2 1 2 |
+--------------------------------------------------------------------+