How to perform rolling window calculations without SSC packages - stata

Goal: perform rolling window calculations on panel data in Stata with variables PanelVar, TimeVar, and Var1, where the window can change within a loop over different window sizes.
Problem: no access to SSC for the packages that would take care of this (like rangestat)
I know that
by PanelVar: gen Var1_1 = Var1[_n]
produces a copy of Var1 in Var1_1. So I thought it would make sense to try
by PanelVar: gen Var1SumLag = sum(Var1[(_n-3)/_n])
to produce a rolling window calculation for _n-3 to _n for the whole variable. But it fails to produce the results I want, it just produces zeros.
You could use sum(Var1) - sum(Var1[_n-3]), but I also want to be able to make the rolling window left justified (summing future observations) as well as right justified (summing past observations).
Essentially I would like to replicate Python's ".rolling().agg()" functionality.

In Stata _n is the index of the current observation. The expression (_n - 3) / _n yields -2 when _n is 1 and increases slowly with _n but is always less than 1. As a subscript applied to extract values from observations of a variable it always yields missing values given an extra rule that Stata rounds down expressions so supplied. Hence it reduces to -2, -1 or 0: in each case it yields missing values when given as a subscript. Experiment will show you that given any numeric variable say numvar references to numvar[-2] or numvar[-1] or numvar[0] all yield missing values. Otherwise put, you seem to be hoping that the / yields a set of subscripts that return a sequence you can sum over, but that is a long way from what Stata will do in that context: the / is just interpreted as division. (The running sum of missings is always returned as 0, which is an expression of missings being ignored in that calculation: just as 2 + 3 + . + 4 is returned as 9 so also . + . + . + . is returned as 0.)
A fairly general way to do what you want is to use time series operators, and this is strongly preferable to subscripts as (1) doing the right thing with gaps (2) automatically working for panels too. Thus after a tsset or xtset
L0.numvar + L1.numvar + L2.numvar + L3.numvar
yields the sum of the current value and the three previous and
L0.numvar + F1.numvar + F2.numvar + F3.numvar
yields the sum of the current value and the three next. If any of these terms is missing, the sum will be too; a work-around for that is to return say
cond(missing(L3.numvar), 0, L3.numvar)
More general code will require some kind of loop.
Given a desire to loop over lags (negative) and leads (positive) some code might look like this, given a range of subscripts as local macros i <= j
* example i and j
local i = -3
local j = 0
gen double wanted = 0
forval k = `i'/`j' {
if `k' < 0 {
local k1 = -(`k')
replace wanted = wanted + L`k1'.numvar
}
else replace wanted = wanted + F`k'.numvar
}
Alternatively, use Mata.
EDIT There's a simpler method, to use tssmooth ma to get moving averages and then multiply up by the number of terms.
tssmooth ma wanted1=numvar, w(3 1)
tssmooth ma wanted2=numvar, w(0 1 3)
replace wanted1 = 4 * wanted1
replace wanted2 = 4 * wanted2
Note that in contrast to the method above tssmooth ma uses whatever is available at the beginning and end of each panel. So, the first moving average, the average of the first value and the three previous, is returned as just the first value at the beginning of each panel (when the three previous values are unknown).

Related

Understanding & Converting ThinkScripts CompoundValue Function

I'm currently converting a ThinkScript indicator to C#, however, I've run into this CompoundValue function and I'm unsure how to covert it.
The documents reads :
Calculates a compound value according to following rule: if a bar
number is greater than length then the visible data value is returned,
otherwise the historical data value is returned. This function is used
to initialize studies with recursion.
Example Use:
declare lower;
def x = CompoundValue(2, x[1] + x[2], 1);
plot FibonacciNumbers = x;
My interpretation:
Based on description and example. It appears we are passing a calculation in x[1] + x[2] and it performing this calculation on the current bar and the previous bar (based on first param of 2). I'm unsure what the parameter 1 is for.
My Question:
Please explain what this function is actually doing. If possible, please illustrate how this method works using pseudo-code.
For the TLDR; crowd, some simple code that hopefully explains what the CompoundValue() function is trying to do, and which might help in converting it's functionality:
# from: Chapter 12. Past/Future Offset and Prefetch
# https://tlc.thinkorswim.com/center/reference/thinkScript/tutorials/Advanced/Chapter-12---Past-Offset-and-Prefetch
# According to this tutorial, thinkScript uses the highest offset, overriding
# all lower offsets in the script - WOW
declare lower;
# recursive addition using x[1] is overridden by 11 in the plot for
# Average(close, 11) below; SO `x = x[1] + 1` becomes `x = x[11] + 1`
def x = x[1] + 1;
# using CompoundValue, though, we can force the use of the *desired* value
# arguments are:
# - length: the number of bars for this variable's offset (`1` here)
# - "visible data": value to use IF VALUES EXIST for a bar (a calculation here)
# - "historical data": value to use IF NO VALUE EXISTS for a bar (`1` here)
def y = CompoundValue(1, y[1] + 1, 1);
# *plotting* this Average statement will change ALL offsets to 11!
plot Average11 = Average(close, 11);
# `def`ing the offset DOES NOT change other offsets, so no issue here
# (if the `def` setup DID change the offsets, then `x[1]` would
# become `x[14]`, as 14 is higher than 11. However, `x[1]` doesn't change.
def Average14 = Average(close, 14);
plot myline = x;
plot myline2 = y;
# add some labels to tell us what thinkScript calculated
def numBars = HighestAll(BarNumber());
AddLabel(yes, "# Bars on Chart: " + numBars, Color.YELLOW);
AddLabel(yes, "x # bar 1: " + GetValue(x, numBars), Color.ORANGE);
AddLabel(yes, "x # bar " + numBars + ": " + x, Color.ORANGE);
AddLabel(yes, "y # bar 1: " + GetValue(y, numBars), Color.LIGHT_ORANGE);
AddLabel(yes, "y # bar " + numBars + ": " + y, Color.ORANGE);
Now, some, er, lots of details...
First, a quick note on "offset" values:
thinkScript, like other trading-related languages, uses an internal looping system. This is like a for loop, iterating through all the "periods" or "bars" on a chart (eg, 1 bar = 1 day on a daily chart; 1 bar = 1 minute on a 1 minute intraday chart, etc). Every line of code in thinkScript is run for each and every bar in the chart or length of time specified in the script.
As noted by the OP, x[1] represents an offset of one bar before the current bar the loop is processing. x[2] represents two bars before the current bar, and so on. Additionally, it's possible to offset into the future by using negative numbers: x[-1] means one bar ahead of the current bar, for example.
These offsets work similarly to the for loop in C#, except they're backwards: x[0] in C# would represent the current x value, as it would in thinkScript; however, moving forward in the loop, x[1] would be the next value, and x[-1] wouldn't exist because, well, there is no past value before 0. (In general, of course! One can definitely loop with negative numbers in C#. The point is that positive offset indices in thinkScript represent past bars, while negative offset indices in thinkScript represent future bars - not the case in C#.)
Also important here is the concept of "length": in thinkScript, length parameters represent the distance you want to go - like the offset, but a range instead of one specific bar. In my example code above, I used the statement plot Average11 = Average(close, 11); In this case, the 11 parameter represents plotting the close for a period of 11 bars, ie, offsets x[0] through x[10].
Now, to explain the CompoundValue() function's purpose:
The Chapter 12. Past/Future Offset and Prefetch thinkScript tutorial explains that thinkScript actually overrides smaller offset or length values with the highest value in a script. What that means is that if you have two items defined as follows:
def x = x[1] + 1;
plot Average11 = Average(close, 11);
thinkScript will actually override the x[1] offset with the higher length used in the Average statement - therefore causing x[1] to become x[11]!
Yike! That means that the specified offsets, except the highest offset, mean nothing to thinkScript! So, wait a minute - does one have to use all the same offsets for everything, then? No! This is where CompoundValue() comes in...
That same chapter explains that CompoundValue() allows one to specify an offset for a variable that won't be changed, even if a higher offset exists.
The CompoundValue() function, with parameter labels, looks like this:
CompoundValue(length, "visible data", "historical data")
As the OP noted, this isn't really particularly clear. Here's what the parameters represent:
length: the offset number of bars for this variable.
In our example, def x = x[1] + 1, there is a 1 bar offset, so our statement starts as CompoundValue(length=1, ...). If instead, it was a larger offset, say 14 bars, we'd put CompoundValue(length=14, ...)
"visible data": the value or calculation thinkScript should perform if DATA IS AVAILABLE for the current bar.
Again, in our example, we're using a calculation of x[1] + 1, so CompoundValue(length=1, "visible data"=(x[1] + 1), ...). (Parentheses around the equation aren't necessary, but may help with clarity.)
"historical data": the value to use if NO DATA IS AVAILABLE for the current bar.
In our example, if no data is available, we'll use a value of 1.
Now, in thinkScript, parameter labels aren't required if the arguments are in order and/or defaults are supplied. So, we could write this CompoundValue statement like this without the labels:
def y = CompoundValue(1, y[1] + 1, 1);
or like this with the labels:
def y = CompoundValue(length=1, "visible data"=(y[1] + 1), "historical data"=1);
(Note that parameter names containing spaces have to be surrounded by double quotes. Single-word parameter names don't need the quotes. Also, I've placed parens around the equation just for the sake of clarity; this is not required.)
In summary: CompoundValue(...) is needed to ensure a variable uses the actual desired offset/number of bars in a system (thinkScript) that otherwise overrides the specified offsets with a higher number if present.
If all the offsets in a script are the same, or if one is using a different programming system, then CompoundValue() can simply be broken down into its appropriate calculations or values, eg def x = x[1] + 1 or, alternatively, an if/else statement that fills in the values desired at whatever bars or conditions are needed.
Please let me provide two equivalent working versions of the code in thinkscript itself. We use this approach to prove equivalence by subtracting the equivalent outputs from each other - the result should be 0.
# The original Fibonacci code with a parameter "length" added.
# That parameter is the first parameter of the CompoundValue function.
declare lower;
def length = 2;
def x = CompoundValue(length, x[1] + x[2], 1);
# plot FibonacciNumbers = x;
# Equivalent code using the `if` statement:
def y;
if(BarNumber() > length){
# Visible data. This is within the guarded branch of the if statement.
# Historical data y[1] (1 bar back) and y[2] (2 bars back) is available
y = y[1] + y[2];
}else{
# Not enough historical data so we use the special case satisfying the
# original rule.
y = 1;
}
plot FibonacciNumbersDiff = y - x;
Thinkscript "recursion" is a somewhat inflated term. The function name CompoundValue is not very helpful so it may create confusion.
The version using the if statement is more useful in general because when walking through the time series of bars, we often need a program structure with multiple nested if statements - this cannot be done with the CompoundValue function. Please see my other articles which make use of this in the context of scanning.
In Java, using the same structure, it looks like this:
int size = 100;
int length = 2;
int[] values = new int[size];
for(int index = 1; index < size; index++){
if(index > length){
values[index] = values[index - 1] + values[index - 2];
}else{
values[index] = 1;
}
}
The fundamental difference is the for loop which is not present in the thinkscript code. thinkscript provides the loop in a kind of inversion of control where it executes user code multiple times, once for each bar.

Having an issue with stopping a loop in Stata

So, I have a bunch of variables in my data set which are binary and contain information on whether an individual was married or not. So, for example, marr79, is whether a person was married in 1979 or not.
I'm trying to find how many years a person was married (the first time) from the child's birth. So, if the child was born in 1980, and the person was married in 1980, it would add to child_marr, and it would do the same for the following 18 years of their life. I want it to stop, though, if it encounters a 0. So if there are 1's for 1980, 1981, and 1982, and a 0 for 1983, I want it to stop at 1983, even if there is a 1 in 1984.
My code below (and it is one of many iterations I've tried) either has it run through all the years without stopping, or never run at all, leaving values of all 0.
Any help is appreciated.
gen child_marr=0;
forvalues y=79(1)99 {;
gen temp_yr=1900+`y';
if (ch_yob<=temp_yr & marr`y'==1 & temp_yr<(ch_yob+18))==1 {;
replace child_marr = child_marr + 1;
};
else if (marr`y'==0 & ch_yob<=temp_yr) {;
continue, break;
};
drop temp_yr;
};
A few comments:
Your condition if (test1 & test2 & test3) == 1 does not need the == 1 portion -- Stata infers that if (condition) means if condition == 1 (caveat: for cases where the logical test is {0,1}).
There is no need to generate a temporary variable, since you can compare the value of a variable to a local macro directly.
To the issue at hand, your loop is comparing observation-level criteria (e.g., the value of the variable temp_yr to the value of the variable ch_yob). This can seem correct, but is often problematic -- see Stata FAQ: if command versus if qualifier.
A first pass at a solution would be to recode your forvalues loop to use the if qualifier rather than the if command:
gen child_marr = 0
forvalues y = 79/99 {
local yr = 1900 + `y'
replace child_marr = child_marr + 1 if (ch_yob <= `yr') & (marr`y' == 1) & (`yr' < (ch_yob + 18))
}
But as mentioned, a concrete solution would be easier with a reproducible example.

Nearest Neighbor Matching in Stata

I need to program a nearest neighbor algorithm in stata from scratch because my dataset does not allow me to use any of the available solutions (as far as I am concerned).
To be pecise. I have a dataset that is of similar structure to that of the following (original has around 14k observations)
input id value treatment match
1 0.14 0 .
2 0.32 0 .
3 0.465 1 2
4 0.878 1 2
5 0.912 1 2
6 0.001 1 1
end
I want to generate a variable called match (already included in the example above). For each observation with treatment == 1 the variable match should store the id of another observation from within treatment == 0 whose value is closest to value of the considered observation (treatment == 1).
I am new to stata programming, so I am not yet familiar with the syntax. My first shot is the following however it does not produce any changes to the match variable. I am sure this is a novice question but I am hoping for some advice on how to make the code running.
EDIT: I have changed the code slightly and now it seems to work. Do you see any problems that may arise if I run it on a bigger dataset?
set more off
clear all
input id pscore treatment
1 0.14 0
2 0.32 0
3 0.465 1
4 0.878 1
5 0.912 1
6 0.001 1
end
gen match = .
forval i = 1/`= _N' {
if treatment[`i'] == 1 {
local dist 1
forvalues j = 1/`= _N' {
if (treatment[`j'] == 0) {
local current_dist (pscore[`i'] - pscore[`j'])^2
if `dist' > `current_dist' {
local dist `current_dist' // update smallest distance
replace match = id[`j'] in `i' // write match
}
}
}
}
}
Consider some simulated data: 1,000 observations, 200 of them untreated (treat == 0) and the rest treated (treat == 1). Then the code included below will be much more efficient than the originally posted. (Ties, like in your code, are not explicitly handled.)
clear
set more off
*----- example data -----
set obs 1000
set seed 32956
gen id = _n
gen pscore = runiform()
gen treat = cond(_n <= 200, 0, 1)
*----- new method -----
timer clear
timer on 1
// get id of last non-treated and first treated
// (data is sorted by treat and ids are consecutive)
bysort treat (id): gen firsttreat = id[1]
local firstt = first[_N]
local lastnt = `firstt' - 1
// start loop
gen match = .
gen dif = .
quietly forvalues i = `firstt'/`=_N' {
// compute distances
replace dif = (pscore[`i'] - pscore)^2
summarize dif in 1/`lastnt', meanonly
// identify id of minimum-distance observation
replace match = . in 1/`lastnt'
replace match = id in 1/`lastnt' if dif == r(min)
summarize match in 1/`lastnt', meanonly
// save the minimum-distance id
replace match = r(max) in `i'
}
// clean variable and drop
replace match = . in 1/`lastnt'
drop dif firsttreat
timer off 1
tempfile first
save `first'
*----- your method -----
drop match
timer on 2
gen match = .
quietly forval i = 1/`= _N' {
if treat[`i'] == 1 {
local dist 1
forvalues j = 1/`= _N' {
if (treat[`j'] == 0) {
local current_dist (pscore[`i'] - pscore[`j'])^2
if `dist' > `current_dist' {
local dist `current_dist' // update smallest distance
replace match = id[`j'] in `i' // write match
}
}
}
}
}
timer off 2
tempfile second
save `second'
// check for equality of results
cf _all using `first'
// check times
timer list
The results in seconds to finish execution:
. timer list
1: 0.19 / 1 = 0.1930
2: 10.79 / 1 = 10.7900
The difference is huge, specially considering this data set has only 1,000 observations.
An interesting thing to notice is that as the number of non-treated cases increases relative to the number of treated, then the original method improves, but never reaches the levels of efficiency of the new method. As an example, invert the number of cases, so there is now 800 untreated and 200 treated (change data setup to gen treat = cond(_n <= 800, 0, 1)). The result is
. timer list
1: 0.07 / 1 = 0.0720
2: 4.45 / 1 = 4.4470
You can see that the new method also improves and is still much faster. In fact, the relative difference is still the same.
Another way to do this is using joinby or cross. The problem is they temporarily expand (a lot) the size of your data base. In many cases, they are not feasible due to the hard limit Stata has on the number of possible observations (see help limits). You can find an example of joinby here: https://stackoverflow.com/a/19784222/2077064.
Edit
If there's a large number of treated relative to untreated, your code suffers
because you go through the whole first loop many more times (due to the first if).
Furthermore, going through
that whole loop once, implies going through another loop that
has itself two if conditions, _N more times.
The opposite case in which there are few treated observations means that you go through the whole
first loop only in a small number of occasions, speeding up your code substantially.
The reason my code can maintain its efficiency is due to the use of in. This always
offers speed gains over if. Stata will go directly to those observations with no
logical checking needed. Your problem provides an opportunity for that replacement
and it's wise to seize it.
If my code used if where in is in place, the results would be different.
Your code would be faster for the
case in which there's a large number of untreated relative to treated, and again, that
is because in your code there would not be the need to go through the complete loop,
requiring very little work;
the first loop is short-circuited with the first if. For the opposite case,
my code would still dominate.
The key is to "separate" treated from untreated and work on each group using in.

Not Equals Constraint in PROC OPTMODEL

I have an optimization problem that I need to solve. It's a binary linear programming problem, so all of the decision variables are equal to 0 or 1. I need certain combinations of these decision variables to add up to either 0 or 2+, they cannot sum to 1. I'm struggling with how to accomplish this in PROC OPTMODEL.
Something like this is what I need:
con sum_con: x+y+z~=1;
Unfortunately, this just throws a syntax error... Is there any way to accomplish this?
See below for a linear reformulation. However, you may not need it. In SAS 9.4m2 (SAS/OR 13.2), your expression works as written. You just need to invoke the (experimental) CLP solver:
proc optmodel;
/* In SAS/OR 13.2 you can use your code directly.
Just invoke the experimental CLP solver */
var x binary, y binary, z binary;
con sum_con: x+y+z~=1;
solve with clp / findall;
print {i in 1 .. _NSOL_} x.sol[i]
{i in 1 .. _NSOL_} y.sol[i]
{i in 1 .. _NSOL_} z.sol[i];
produces immediately:
[1] x.SOL y.SOL z.SOL
1 0 0 0
2 0 1 1
3 1 0 1
4 1 1 0
5 1 1 1
In older versions of SAS/OR, you can still call PROC CLP directly,
which is not experimental.
The syntax for your example will be very similar to PROC OPTMODEL's.
I am sure, however, that your model has other variables and constraints.
In that case, remember that no matter how you formulate this,
it is still a search space with a hole in the middle.
So it potentially can make the solver perform poorly.
How poorly is hard to predict. It depends on other features of your model.
If MILP is a better fit for the rest of your model,
you can reformulate your constraint as a valid MILP in two steps.
First, add a binary variable that is zero only when the expression is zero:
/* If solve with CLP is not available, you can linearize the disjunction: */
var IsGTZero binary; /* 1 if any variable in the expression is 1 */
con IsGTZeroBoundsExpression: 3 * IsGTZero >= x + y + z;
Then add another constraint that forces the expression to be
at least the constant you want (in this case 2) when it is nonzero.
num atLeast init 2;
con ZeroOrAtLeast: x + y + z >= atLeast * IsGTZero;
min f=0; /* Explicit objectives are unnecessary in 13.2 */
solve;
The following equation should work:
(x+y-z)*z + (y+z-x)*x + (x+z-y)*y > -1
It can be generalized to more than three variables and if you have some large number you should be able to use index expansions to make it easier.

Stata: compare coefficients of factor variables using foreach (or forvalues)

I am using an ordinal independent variable in an OLS regression as a categorical variable using the factor variable technique in Stata (i.e, i.ordinal). The variable can take on values of the integers from 0 to 9, with 0 being the base category. I am interested in testing if the coefficient of each variable is greater (or less) than that which succeeds it (i.e. _b[1.ordinal] >= _b[2.ordinal], _b[2.ordinal] >= _b[3.ordinal], etc.). I've started with the following pseudocode based on FAQ: One-sided t-tests for coefficients:
foreach i in 1 2 3 5 6 7 8 {
test _b[`i'.ordinal] - _b[`i+'.ordinal] = 0
gen sign_`i'`i+' = sign(_b[`i'.ordinal] - _b[`i+'.ordinal])
display "Ho: i <= i+ p-value = " ttail(r(df_r), sign_`i'`i+'*sqrt(r(F)))
display "Ho: i >= i+ p-value = " 1-ttail(r(df_r), sign_`i'`i+'*sqrt(r(F)))
}
where I want the ```i+' to mean the next value of i in the sequence (so if i is 3 then ``i+' is 5). Is this even possible to do? Of course, if you have any cleaner suggestions to test the coefficients in this manner, please advise.
Note: The model only uses a sub-sample of my dataset for which there are no observations for 4.ordinal, which is why I use foreach instead of forvalues. If you have suggestions for developing a general code that can be used regardless of missing variables, please advise.
There are various ways to do this. Note that there is little obvious point to creating a new variable just to hold one constant. Code not tested.
forval i = 1/8 {
local j = `i' + 1
capture test _b[`i'.ordinal] - _b[`j'.ordinal] = 0
if _rc == 0 {
local sign = sign(_b[`i'.ordinal] - _b[`j'.ordinal])
display "Ho: `i' <= `j' p-value = " ttail(r(df_r), `sign' * sqrt(r(F)))
display "Ho: `i' >= `j' p-value = " 1-ttail(r(df_r), `sign' * sqrt(r(F)))
}
}
The capture should eat errors.