Dropping entire subject if a single observation meets a criterion in panel data - Stata

I have some panel data of the form...
id | amount
---+-------
 1 |     10
 1 |     10
 1 |    100
 2 |     10
 2 |     15
 2 |     10
 3 |    100
What I'm looking to do seems like it should be fairly simple, but my experience with Stata is limited and I'm used to programming in languages similar to C/Java. Essentially, I want to drop an entire person (id) if any of their individual observations ever exceed a certain amount. So let's say I set this amount to 50, I want to drop all the observations from id 1 and id 3 such that the data will then only contain observations from id 2.
The pseudo-code in Java would be fairly straightforward...
for (int i = 0; i < dataset_length; i++) {
    if (dataset[i].amount > 50) {
        int drop_id = dataset[i].id;
        for (int j = 0; j < dataset_length; j++) {
            if (dataset[j].id == drop_id) {
                // delete observation j
            }
        }
    }
}
What would the Stata equivalent of something akin to this be? I'm surely missing something and making it more complicated than it ought to be, but I cannot figure it out.

If there are no missings on amount this is just
bysort id (amount) : drop if amount[_N] > 50
If there are missings, then
gen ismissing = -missing(amount)
bysort id (ismissing amount): drop if amount[_N] > 50 & amount[_N] < .
would be one kind of check, although it's hard to see how the missings could be interesting or useful.
The machinery here in effect builds in a loop over identifiers, and over the observations for each identifier. A literal translation using models from mainstream programming languages could only result in lengthier and less efficient code.
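For readers coming from general-purpose languages, the same by-group logic can be sketched in Python with pandas (an illustration only, not Stata; the data are the toy example from the question):

```python
import pandas as pd

# toy panel matching the example above
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 3],
                   "amount": [10, 10, 100, 10, 15, 10, 100]})

# keep only ids whose amounts never exceed 50; like the bysort
# solution, the threshold test is applied once per group
kept = df.groupby("id").filter(lambda g: g["amount"].max() <= 50)

print(sorted(kept["id"].unique()))  # [2] -- only id 2 survives
```

The point is the same as in the Stata answer: the grouping machinery supplies the loop over identifiers, so no explicit nested loop is needed.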

Related

manipulating the format of date on X-axis

I have a weekly dataset. I use this code to plot the causality between variables. Stata shows the number of weeks of each year on the X-axis. Is it possible to show only year or year-month instead of year-week on the X-axis?
generate Date =wofd(D)
format Date %tw
tsset Date
tvgc Momentum supply, p(3) d(3) trend window(25) prefix(_) graph
The fact that you have weekly data is only a distraction here.
You should only use Stata's weekly date functions if your weeks satisfy Stata's rules:
Week 1 starts on 1 January, always.
Later weeks start 7 days later in turn, except that week 52 is always 8 or 9 days long.
Hence there is no week 53.
These are documented rules, and they do not match your data. You are lucky that you have no 53 week years in your data; otherwise you would get some bizarre results.
See much detailed discussion at references turned up by search week, sj.
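The rules above pin down Stata's week-of-year exactly, which makes them easy to check in any language; a small Python sketch (hypothetical helper name):

```python
from datetime import date

def stata_week(d: date) -> int:
    # Week 1 always starts 1 January; weeks are 7 days long,
    # except week 52 absorbs the last 8 or 9 days of the year,
    # so there is no week 53.
    doy = d.timetuple().tm_yday
    return min((doy - 1) // 7 + 1, 52)

print(stata_week(date(2011, 1, 1)))    # 1
print(stata_week(date(2011, 12, 31)))  # 52, never 53
```

A calendar that numbers weeks any other way (ISO weeks, weeks starting on Mondays, and so on) will not match this function, which is exactly why the weekly date functions are the wrong tool for such data.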
The good news is that you need just to build on what you have and put labels and ticks on your x axis. It's a little bit of work but no more than use of standard and documented label and tick options. The main ideas are blindingly obvious once spelled out:
Labels Put informative labels in the middle of time intervals. Suppress the associated ticks. You can suppress a tick by setting its length to zero or its colour to invisible.
Ticks Put ticks as the ends (equivalently beginnings) of time intervals. Lengthen ticks as needed.
Grid lines Lines demarcating years could be worth adding. None are shown here, but the syntax is just an extension of that given.
Axis titles If the time (usually x) axis is adequately explained, that axis title is redundant and even dopey if it is some arbitrary variable name.
See especially https://www.stata-journal.com/article.html?article=gr0030 and https://www.stata-journal.com/article.html?article=gr0079
With your data, showing years is sensible but showing months too is likely to produce crowded detail that is hard to read and not much use. I compromised on quarters.
* Example generated by -dataex-. For more info, type help dataex
clear
input str10 D float(Momentum Supply)
"12/2/2010" -1.235124 4.760894
"12/9/2010" -1.537671 3.002344
"12/16/2010" -.679893 1.5665628
"12/23/2010" 1.964229 .5875537
"12/30/2010" -1.1872853 -1.1315695
"1/6/2011" .028031677 .065580264
"1/13/2011" .4438451 1.2316793
"1/20/2011" -.3865465 1.7899017
"1/27/2011" -.4547117 1.539866
"2/3/2011" 1.6675532 1.352376
"2/10/2011" -.016190516 3.72986
"2/17/2011" .5471755 2.0804555
"2/24/2011" .2695233 2.1094923
"3/3/2011" .5136591 -1.0686383
"3/10/2011" .606721 3.786967
"3/17/2011" .004175631 .4544936
"3/24/2011" 1.198901 -.3316304
"3/31/2011" .1973385 .5846249
"4/7/2011" 2.2470737 1.0026894
"4/14/2011" .3980386 -2.6676855
"4/21/2011" -1.530687 -7.214682
"4/28/2011" -.9735931 3.246654
"5/5/2011" .13312873 .9581707
"5/12/2011" -.8017629 -.468076
"5/19/2011" -.11491735 -4.354526
"5/26/2011" .3627179 -2.233418
"6/2/2011" .13805833 2.2697728
"6/9/2011" .27832976 .58203816
"6/16/2011" -1.9467738 -.2834298
"6/23/2011" -.9579238 -1.0356172
"6/30/2011" 1.1799787 1.1011268
"7/7/2011" -2.0982232 .5292908
"7/14/2011" -.2992591 -.4004747
"7/21/2011" .5904395 -2.5159726
"7/28/2011" -.21626104 1.936029
"8/4/2011" -.02421602 -.8160484
"8/11/2011" 1.5797064 -.6868965
"8/18/2011" 1.495294 -1.8621664
"8/25/2011" -1.2188485 -.8388996
"9/1/2011" .4991612 -1.6689343
"9/8/2011" 2.1691883 1.3244398
"9/15/2011" -1.2074957 .9707839
"9/22/2011" -.3399567 .6742781
"9/29/2011" 1.9860272 -3.331345
"10/6/2011" 1.935733 -.3882593
"10/13/2011" -1.278119 .6796986
"10/20/2011" -1.3209987 .2258049
"10/27/2011" 4.315368 .7879103
"11/3/2011" .58669937 -.5040554
"11/10/2011" 1.460597 -2.0426705
"11/17/2011" -1.338189 -.24199644
"11/24/2011" -1.6870773 -1.1143018
"12/1/2011" -.19232976 -1.2156726
"12/8/2011" -2.655519 -2.054406
"12/15/2011" 1.7161795 -.15301673
"12/22/2011" -1.43026 -3.138013
"12/29/2011" .03427247 -.28446484
"1/5/2012" -.15930523 -3.362428
"1/12/2012" .4222094 4.0962815
"1/19/2012" -.2413332 3.8277814
"1/26/2012" -2.850591 .067359865
"2/2/2012" -1.1785052 -.3558361
"2/9/2012" -1.0380571 .05134211
"2/16/2012" .8539951 -4.421839
"2/23/2012" .2636529 1.3424703
"3/1/2012" .022639304 2.734022
"3/8/2012" .1370547 .8043283
"3/15/2012" .1787796 -.56465846
"3/22/2012" -2.0645525 -2.9066684
"3/29/2012" 1.562931 -.4505192
"4/5/2012" 1.2587242 -.6908772
"4/12/2012" -1.5202224 .7883849
"4/19/2012" 1.0128288 -1.6764873
"4/26/2012" -.29182148 1.920932
"5/3/2012" -1.228097 -3.7068026
"5/10/2012" -.3124508 -3.034149
"5/17/2012" .7570716 -2.3398724
"5/24/2012" -1.0697783 -2.438565
"5/31/2012" 1.2796624 1.299344
"6/7/2012" -1.5482885 -1.228557
"6/14/2012" 1.396692 3.2158935
"6/21/2012" .3116726 8.035475
"6/28/2012" -.22332123 .7450229
"7/5/2012" .4655248 .04986914
"7/12/2012" .4769497 4.045938
"7/19/2012" .08743203 .25987592
"7/26/2012" -.402533 .3213503
"8/2/2012" -.1564897 1.5290447
"8/9/2012" -.0919008 .13955575
"8/16/2012" -1.3851573 1.0860283
"8/23/2012" .020250637 -.8858514
"8/30/2012" -.29458764 -1.6602173
"9/6/2012" -.39921495 -.8043483
"9/13/2012" 1.76396 4.2867813
"9/20/2012" -1.2335806 2.476225
"9/27/2012" .176066 -.5992883
"10/4/2012" .1075483 1.7167135
"10/11/2012" .06365488 1.1636261
"10/18/2012" -.2305842 -1.506699
"10/25/2012" -.1526354 -2.669866
"11/1/2012" -.06311637 -2.0813057
"11/8/2012" .55959195 .8805096
"11/15/2012" 1.5306772 -2.708766
"11/22/2012" -.5585792 .26319882
"11/29/2012" -.035690214 -1.6176193
"12/6/2012" -.7885767 1.1719254
"12/13/2012" .9131169 -1.1135346
"12/20/2012" -.6910864 -.4893669
"12/27/2012" .9836168 .4052487
"1/3/2013" -.8828759 .7161615
"1/10/2013" 1.505474 -.1768004
"1/17/2013" -1.3013282 -1.333739
"1/24/2013" -1.3670077 1.0568022
"1/31/2013" .05846912 -.7845241
"2/7/2013" .4923012 -1.202816
"2/14/2013" -.06551787 -.9198701
"2/21/2013" -1.8149366 -.1746187
"2/28/2013" .3370621 1.0104061
"3/7/2013" 1.2698976 1.273357
"3/14/2013" -.3884514 .7927139
"3/21/2013" -.1437847 1.7798674
"3/28/2013" -.2325031 .9336611
"4/4/2013" .03971701 .6680117
"4/11/2013" -.25990707 -3.0261614
"4/18/2013" .7046488 -.458615
"4/25/2013" -2.1198323 -.14664523
"5/2/2013" 1.591287 -.3687443
"5/9/2013" -1.1266721 -2.0973356
"5/16/2013" -.7595757 -1.1238302
"5/23/2013" 2.2590933 2.124479
"5/30/2013" -.7447268 .7387985
"6/6/2013" 1.3409324 -1.3744274
"6/13/2013" -.3844476 -.8341842
"6/20/2013" -.8135379 -1.7971268
"6/27/2013" -2.506065 -.4194731
"7/4/2013" -.4755843 -5.216218
"7/11/2013" -1.256806 1.8539237
"7/18/2013" -.13328764 -1.0578626
"7/25/2013" 1.2412375 1.7703875
"8/1/2013" 1.5033063 -2.2505422
"8/8/2013" -1.291876 -1.5896243
"8/15/2013" 1.0093634 -2.8861396
"8/22/2013" -.6952878 -.23103845
"8/29/2013" -.05459245 1.53916
"9/5/2013" 1.2413216 .749662
"9/12/2013" .19232245 2.81967
"9/19/2013" -2.6861706 -4.520664
"9/26/2013" .3105677 -5.274343
"10/3/2013" -.2184027 -3.251637
"10/10/2013" -1.233326 -5.031735
"10/17/2013" 1.9415965 -1.250861
"10/24/2013" -1.2008202 -1.5703772
"10/31/2013" -.6394427 -1.1347327
"11/7/2013" 2.715824 2.0324607
"11/14/2013" -1.5833142 2.5080755
"11/21/2013" .9940037 4.117931
"11/28/2013" -.8226601 3.752914
"12/5/2013" .09966203 1.865995
"12/12/2013" -.18744355 2.5426314
end
gen ddate = daily(D, "MDY")
gen year = year(ddate)
gen dow = dow(ddate)
tab year
tab dow
forval y = 2010/2013 {
    local Y = `y' + 1
    local yend `yend' `=mdy(1, 1, `Y')'
    if `y' > 2010 local ymid `ymid' `=mdy(7, 1, `y')' "`y'"
    forval q = 1/4 {
        // the data start in December 2010, so for 2010 keep only Q4
        if `q' == 4 | `y' > 2010 {
            local qmid : word `q' of 2 5 8 11
            local qmids `qmids' `=mdy(`qmid', 15, `y')' "Q`q'"
            if `q' < 4 {
                local qend : word `q' of 4 7 10
                local qends `qends' `=mdy(`qend', 1, `y')'
            }
            else local qends `qends' `=mdy(1, 1, `Y')'
        }
    }
}
line M S ddate, xla(`ymid', tlength(*3) tlc(none)) xtic(`yend', tlength(*5)) xmla(`qmids', tlc(none) labsize(small) tlength(*.5)) xmti(`qends', tlength(*5)) xtitle("") scheme(s1color)
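The arithmetic behind the labels and ticks (labels at interval midpoints, ticks at interval boundaries) is easy to verify outside Stata; a minimal Python sketch of the same idea, dates only, no plotting, using the year range from the example:

```python
from datetime import date

years = range(2011, 2014)

# ticks go at interval boundaries: 1 January of each year
ticks = [date(y, 1, 1) for y in range(2011, 2015)]

# labels go at interval midpoints: 1 July of each labelled year
labels = {date(y, 7, 1): str(y) for y in years}

print(ticks[0], labels[date(2011, 7, 1)])
```

The same recipe extends to quarters: midpoints on the 15th of the quarter's middle month for labels, first day of the following quarter for tick positions.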

Extract intercepts from multiple regressions in stata

I am attempting to reproduce the following in Stata. This is a scatter plot of average portfolio returns (y axis) and predicted returns (x axis).
To do so, I need your help on how I can extract the intercepts from 25 regressions into one variable? I am currently running the 25 portfolio regressions as follows. I have seen that parmest can potentially do this but can't get it to work with the forval. Many thanks
forval s = 1 / 5 {
    forval h = 1 / 5 {
        reg S`s'H`h' Mkt_Rf SMB HML
    }
}
I don't know what your data look like, but maybe something like this will work:
gen intercepts = .
local i = 1
forval s = 1 / 5 {
    forval h = 1 / 5 {
        reg S`s'H`h' Mkt_Rf SMB HML
        // assign the ith observation of intercepts
        // equal to the regression constant
        replace intercepts = _b[_cons] if _n == `i'
        // increment i
        local ++i
    }
}
The postfile series of commands can be very helpful in a situation like this. These commands let you store results in a separate dataset without losing the data in memory.
You can start with this as a simple example. This code will produce a Stata data set called "results.dta" with the variables s h and constant with a record of each regression.
cap postclose results
postfile results s h constant using results.dta, replace
forval s = 1 / 5 {
    forval h = 1 / 5 {
        reg S`s'H`h' Mkt_Rf SMB HML
        loc c = _b[_cons]
        post results (`s') (`h') (`c')
    }
}
postclose results
use results, clear
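The collect-one-result-per-regression pattern is not Stata-specific. As a sketch of the same idea in Python with NumPy (simulated data and made-up coefficients, standing in for the S`s'H`h' portfolios and the Mkt_Rf/SMB/HML factors):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))            # stand-ins for Mkt_Rf, SMB, HML
Xc = np.column_stack([np.ones(n), X])  # prepend a constant column

intercepts = []
for s in range(1, 6):
    for h in range(1, 6):
        # fake portfolio return with a known intercept 0.5*s + 0.1*h
        y = 0.5 * s + 0.1 * h + X @ np.array([1.0, 0.5, -0.5]) \
            + rng.normal(scale=0.01, size=n)
        beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        intercepts.append(beta[0])     # beta[0] is the constant term

print(len(intercepts))  # 25 intercepts, one per regression
```

As with postfile, the key design choice is appending each regression's constant to a results container as you go, rather than trying to recover it afterwards.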

Coding dichotomous variables in Stata

I have a set of dichotomous variables for firm size:
emp1_2 (i.e. firm with 1 or 2 employed people, including the owner), emp3_9, emp10_19, emp20_49, emp50_99, emp100_249, emp250_499, emp500, plus I do not have information on 27 firms size but I have an educated guess that they are large firms.
I want to create a dichotomous variable for a firm being a "small firm"; this variable should equal 1 when emp1_2==1 | emp3_9==1 | emp10_19==1, and 0 otherwise.
To my understanding of Stata, of which I am a bare user, the two following methods to construct dichotomous variables should be equivalent.
Method 1)
gen lar_firm = 0
replace lar_firm = 1 if emp1_2==1 | emp3_9==1 | emp10_19==1
Method 2)
gen lar_firm = (emp1_2 | emp3_9 | emp10_19)
Instead I have found that with method 2) lar_firm equals 1 both for firms where emp1_2 | emp3_9 | emp10_19 holds and for firms that do not fall into any of the categories (i.e. emp1_2, emp3_9, emp10_19, emp20_49, emp50_99, emp100_249, emp250_499, emp500) but which I guess to be large firms.
I am wondering whether there is some subtle difference between the two methods; I thought they should lead to equal outcomes.
When you do
gen lar_firm = emp1_2 | emp3_9 | emp10_19
you're testing if
(emp1_2 != 0) | (emp3_9 != 0) |(emp10_19 != 0)
In particular, missing values (.) are different from 0: in fact, Stata treats them as larger than any number, so a missing value counts as true in a logical expression.
For more information:
http://www.stata.com/support/faqs/data-management/logical-expressions-and-missing-values/
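For comparison, here is the pitfall replicated in Python with pandas (an analogue, not Stata: NaN plays the role of Stata's missing, though note NaN compares unequal to everything rather than "large"):

```python
import numpy as np
import pandas as pd

# third firm has missing size dummies, like the 27 unclassified firms
df = pd.DataFrame({"emp1_2": [1, 0, np.nan],
                   "emp3_9": [0, 0, np.nan]})

# explicit equality tests are the safe idiom: NaN == 1 is False,
# so rows with missing dummies get 0, matching method 1's behaviour
small = ((df["emp1_2"] == 1) | (df["emp3_9"] == 1)).astype(int)

print(small.tolist())  # [1, 0, 0]
```

The moral is the same in both languages: test explicitly for the value you mean (var==1) rather than relying on truthiness when missings are possible.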

Stata - assign different variables depending on the value within a variable

Sorry that title is confusing. Hopefully it's clear below.
I'm using Stata and I'd like to assign the value 1 to a variable that depends on the value within a different variable. I have 20 order variables and also 20 corresponding variables. For example if order1 = 3, I'd like to assign variable3 = 1. Below is a snippet of what the final dataset would look like if I had only 3 of each variable.
Right now I'm doing this with two loops, but I have another loop around this that goes through it 9 more times, plus I'm doing this for a couple hundred data files. I'd like to make it more efficient.
forvalues i = 1/20 {
    forvalues j = 1/20 {
        replace variable`j' = 1 if order`i'==`j'
    }
}
Is it possible to use the value of order`i' to assign to variable[order`i' value] directly? Then I can get rid of the j loop above. Something like this:
forvalues i = 1/20 {
    replace variable[`order`i'value] = 1
}
Thanks for your help!
CLARIFICATION ADDED Feb 2nd
I simplified my problem and the dataset too much, because the suggested solutions work for what I presented but are not getting at what I'm really attempting to do. Thank you three for your solutions; I was not clear enough in my post.
To clarify, my data doesn't have a one-to-one correspondence where each order# assigns variable# a 1 if it's not missing. For example, if the first observation has order1=3, variable1 isn't supposed to get a 1; variable3 should get a 1. What I didn't include in my original post is that I'm actually checking for other conditions to set it equal to 1.
For more background, I'm counting up births of women by birth order (1st child, 2nd child, etc.) that occurred at different ages of mothers. So in the data, each row is a woman, and each order# is the birth number (order1=3 means her third child). The corresponding variable#s are the counts (variable# means the woman has a child of birth order #). I mentioned in the post that I do this 9 times because I'm doing it for 5-year age groups (15-19; 20-24; etc.). So the first set of variable# would be counts of births by order when women were ages 15-19; the second set of variable# would be counts of births by order when women were 20-24, and so on. After this, I sum up the counts in different ways (by woman's education, geography, etc.).
So with the additional loop what I do is something more like this
forvalues k = 1/9 {
    forvalues i = 1/20 {
        forvalues j = 1/20 {
            replace variable`k'_`j' = 1 if order`i'==`j' & age`i'==`k' & birth_age`i'<36
        }
    }
}
Not sure if it's possible, but I wanted to simplify so I only need to cycle through each child once, without cycling through the birth orders and directly use the value in order# to assign a 1 to the correct variable. So if order1=3 and the woman had the child at the specific age group, assign variable[agegroup][3]=1; if order1=2, then variable[agegroup][2] should get a 1.
forvalues k = 1/9 {
    forvalues i = 1/20 {
        replace variable`k'_[`order`i'value] = 1 if age`i'==`k' & birth_age`i'<36
    }
}
I would reshape twice. First reshape to long, then condition variable on !missing(order), then reshape back to wide.
* generate your data
clear
set obs 3
forvalues i = 1/3 {
    generate order`i' = .
    local k = (3 - `i' + 1)
    forvalues j = 1/`k' {
        replace order`i' = (`k' - `j' + 1) if (_n == `j')
    }
}
list
*. list
*
* +--------------------------+
* | order1 order2 order3 |
* |--------------------------|
* 1. | 3 2 1 |
* 2. | 2 1 . |
* 3. | 1 . . |
* +--------------------------+
* I would reshape to long, then back to wide
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
order order* variable*
drop id
list
*. list
*
* +-----------------------------------------------------------+
* | order1 order2 order3 variab~1 variab~2 variab~3 |
* |-----------------------------------------------------------|
* 1. | 3 2 1 1 1 1 |
* 2. | 2 1 . 1 1 0 |
* 3. | 1 . . 1 0 0 |
* +-----------------------------------------------------------+
Using a simple forvalues loop with generate and missing() is orders of magnitude faster than the other solutions proposed so far. For this problem you need only one loop to traverse the complete list of variables, not two, as in the original post. Below is some code that shows both points.
*----------------- generate some data ----------------------
clear all
set more off
local numobs 60
set obs `numobs'
quietly {
forvalues i = 1/`numobs' {
generate order`i' = .
local k = (`numobs' - `i' + 1)
forvalues j = 1/`k' {
replace order`i' = (`k' - `j' + 1) if (_n == `j')
}
}
}
timer clear
*------------- method 1 (gen + missing()) ------------------
timer on 1
quietly {
forvalues i = 1/`numobs' {
generate variable`i' = !missing(order`i')
}
}
timer off 1
* ----------- method 2 (reshape + missing()) ---------------
drop variable*
timer on 2
quietly {
generate id = _n
reshape long order, i(id)
generate variable = !missing(order)
reshape wide order variable, i(id) j(_j)
}
timer off 2
*--------------- method 3 (egen, rowmax()) -----------------
drop variable*
timer on 3
quietly {
// loop over the order variables creating dummies
forvalues v=1/`numobs' {
tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables
// (may need to change)
forvalues l=1/`numobs' {
egen variable`l' = rmax(var*_`l')
drop var*_`l'
}
}
timer off 3
*----------------- method 4 (original post) ----------------
drop variable*
timer on 4
quietly {
forvalues i = 1/`numobs' {
gen variable`i' = 0
forvalues j = 1/`numobs' {
replace variable`i' = 1 if order`i'==`j'
}
}
}
timer off 4
*-----------------------------------------------------------
timer list
The timed procedures give
. timer list
1: 0.00 / 1 = 0.0010
2: 0.30 / 1 = 0.3000
3: 0.34 / 1 = 0.3390
4: 0.07 / 1 = 0.0700
where timer 1 is the simple gen, timer 2 the reshape, timer 3 the egen, rowmax(), and timer 4 the original post.
The reason you need only one loop is that Stata's approach is to execute the command for all observations in the database, from top (first observation) to bottom (last observation). For example, variable1 is generated according to whether order1 is missing or not; this is done for all observations of both variables, without an explicit loop.
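The same implicit-loop idea is what vectorized languages call broadcasting over a column; a rough NumPy analogue (NaN standing in for Stata's missing):

```python
import numpy as np

# one order variable with a missing value in the last observation
order1 = np.array([3.0, 2.0, np.nan])

# one vectorized expression covers every observation at once,
# just as Stata's -generate- runs down the whole column
variable1 = (~np.isnan(order1)).astype(int)

print(variable1.tolist())  # [1, 1, 0]
```

The explicit inner loop over observations in the original post is doing by hand what the column-wise operation already does.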
I wonder if you actually need to do this. For future questions, if you have a further goal in mind, I think a good strategy is to mention it in your post.
Note: I've reused code from other posters' answers.
Here's a simpler way to do it (that still requires 2 loops):
// loop over the order variables creating dummies
forvalues v=1/20 {
    tab order`v', gen(var`v'_)
}
// loop over the domain of the order variables (may need to change)
forvalues l=1/3 {
    egen variable`l' = rmax(var*_`l')
    drop var*_`l'
}

Stata: Counting number of consecutive occurrences of a pre-defined length

Observations in my data set contain the history of moves for each player. I would like to count the number of consecutive series of moves of some pre-defined length (2, 3 and more than 3 moves) in the first and the second halves of the game. The sequences cannot overlap, i.e. the sequence 1111 should be considered as a sequence of the length 4, not 2 sequences of length 2. That is, for an observation like this:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 1 | 1 | . | . | 1 | 1 |
+-------+-------+-------+-------+-------+-------+-------+-------+
…the following variables should be generated:
Number of sequences of 2 in the first half =0
Number of sequences of 2 in the second half =1
Number of sequences of 3 in the first half =0
Number of sequences of 3 in the second half =0
Number of sequences of >3 in the first half =1
Number of sequences of >3 in the second half = 0
I have two potential options of how to proceed with this task but neither of those leads to the final solution:
Option 1: Elaborating on Nick’s tactical suggestion to use strings (Stata: Maximum number of consecutive occurrences of the same value across variables), I have concatenated all “move*” variables and tried to identify the starting position of a substring:
egen test1 = concat(move*)
gen test2 = subinstr(test1,"11","X",.) // find all consecutive series of length 2
There are several problems with Option 1:
(1) it does not account for cases with overlapping sequences (“1111” is recognized as 2 sequences of 2)
(2) it shortens the resulting string test2 so that positions of X no longer correspond to the starting positions in test1
(3) it does not account for variable length of substring if I need to check for sequences of the length greater than 3.
Option 2: Create an auxiliary set of variables to identify the starting positions of the consecutive set (sets) of the 1s of some fixed predefined length. Building on the earlier example, in order to count sequences of length 2, what I am trying to get is an auxiliary set of variables that will be equal to 1 if the sequence of started at a given move, and zero otherwise:
+-------+-------+-------+-------+-------+-------+-------+-------+
| Move1 | Move2 | Move3 | Move4 | Move5 | Move6 | Move7 | Move8 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
+-------+-------+-------+-------+-------+-------+-------+-------+
My code looks as follows but it breaks when I am trying to restart counting consecutive occurrences:
quietly forval i = 1/42 {
    gen temprow`i' = .
    egen rowsum = rownonmiss(seq1-seq`i') // count number of occurrences
    replace temprow`i' = rowsum
    mvdecode seq1-seq`i', mv(1) if rowsum==2
    drop rowsum
}
Does anyone know a way of solving the task?
Assume a string variable all concatenating all moves (the name test1 is hardly evocative).
FIRST TRY: TAKING YOUR EXAMPLE LITERALLY
From your example with 8 moves, the first half of the game is moves 1-4 and the second half moves 5-8. Thus there is for each half only one way to have >3 moves, namely that there are 4 moves. In that case each substring will be "1111" and counting reduces to testing for the one possibility:
gen count_1_4 = substr(all, 1, 4) == "1111"
gen count_2_4 = substr(all, 5, 4) == "1111"
Extending this approach, there are only two ways to have 3 moves in sequence:
gen count_1_3 = inlist(substr(all, 1, 4), "111.", ".111")
gen count_2_3 = inlist(substr(all, 5, 4), "111.", ".111")
In similar style, there can't be two instances of 2 moves in sequence in each half of the game, as that would qualify as 4 moves. So, at most there is one instance of 2 moves in sequence in each half. That instance must match one of two patterns, "11." or ".11" (".11." matches both, which is fine). We must also exclude any false match with a sequence of 3 moves, as just mentioned.
gen count_1_2 = (strpos(substr(all, 1, 4), "11.") | strpos(substr(all, 1, 4), ".11") ) & !count_1_3
gen count_2_2 = (strpos(substr(all, 5, 4), "11.") | strpos(substr(all, 5, 4), ".11") ) & !count_2_3
The result of each strpos() evaluation will be positive if a match is found and (arg1 | arg2) will be true (1) if either argument is positive. (For Stata, non-zero is true in logical evaluations.)
That's very much tailored to your particular problem, but not much worse for that.
P.S. I didn't try hard to understand your code. You seem to be confusing subinstr() with strpos(). If you want to know positions, subinstr() cannot help.
SECOND TRY
Your last code segment implies that your example is quite misleading: if there can be 42 moves, the approach above can not be extended without pain. You need a different approach.
Let's suppose that the string variable all can be 42 characters long. I will set aside the distinction between first and second halves, which can be tackled by modifying this approach. At its simplest, just split the history into two variables, one for the first half and one for the second and repeat the approach twice.
You can clone the history by
clonevar work = all
gen length1 = .
gen length2 = .
and set up your count variables. Here count_4 will hold counts of 4 or more.
gen count_4 = 0
gen count_3 = 0
gen count_2 = 0
First we look for move sequences of length 42, ..., 2. Every time we find one, we blank it out and bump up the count.
qui forval j = 42(-1)2 {
    replace length1 = length(work)
    local pattern : di _dup(`j') "1"
    replace work = subinstr(work, "`pattern'", "", .)
    replace length2 = length(work)
    if `j' >= 4 {
        replace count_4 = count_4 + (length1 - length2) / `j'
    }
    else if `j' == 3 {
        replace count_3 = count_3 + (length1 - length2) / 3
    }
    else if `j' == 2 {
        replace count_2 = count_2 + (length1 - length2) / 2
    }
}
The important details here are
If we delete (repeated instances of) a pattern and measure the change in length, we have just deleted (change in length) / (length of pattern) instances of that pattern. So, if I look for "11" and find that the length decreased by 4, I have just found two instances.
Working downwards and deleting what we found ensures that we don't find false positives, e.g. if "1111111" is deleted, we don't find later "111111", "11111", ..., "11" which are included within it.
Deletion implies that we should work on a clone in order not to destroy what is of interest.
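The delete-and-measure trick carries over directly to any language with string replacement. A Python sketch of the same algorithm (hypothetical helper name, working on a copy of the history exactly as the Stata code works on the clone), run on the 8-move example "1111..11":

```python
def count_runs(history: str, max_len: int = 42):
    """Count maximal runs of '1' of length 2, 3, and 4-or-more,
    scanning from longest to shortest so a long run is never
    double-counted as several shorter ones."""
    counts = {"4+": 0, "3": 0, "2": 0}
    work = history                     # clone, so the original survives
    for j in range(max_len, 1, -1):
        pattern = "1" * j
        before = len(work)
        work = work.replace(pattern, "")
        removed = (before - len(work)) // j   # instances just deleted
        if j >= 4:
            counts["4+"] += removed
        elif j == 3:
            counts["3"] += removed
        else:
            counts["2"] += removed
    return counts

print(count_runs("1111..11"))  # {'4+': 1, '3': 0, '2': 1}
```

All three details from the list above appear here: the change-in-length division, the longest-first order, and the clone.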