Formula that uses previous value - stata

In Stata I want to have a variable calculated by a formula, which includes multiplying by the previous value, within blocks defined by a variable ID. I tried using a lag but that did not work for me.
In the formula below the Y-1 is intended to signify the value above (the lag).
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y-1 if count != 1
X Y count ID
. 1 1 1
2 3 2 1
1 6 3 1
3 24 4 1
2 72 5 1
. 1 1 2
1 2 2 2
7 16 3 2

Your code can be made a little more concise. Here's how:
input X count ID
. 1 1
2 2 1
1 3 1
3 4 1
2 5 1
. 1 2
1 2 2
7 3 2
end
gen Y = count == 1
bysort ID (count) : replace Y = (1 + X) * Y[_n-1] if count > 1
The creation of a dummy (indicator) variable can exploit the fact that true or false expressions are evaluated as 1 or 0.
Sorting before by and the subsequent by command can be condensed into one. Note that I spelled out that within blocks of ID, count should remain sorted.
This is really a comment, not another answer, but it would be less clear if presented as such.

The lag, written as Y-1 in your formula, translates to Y[_n-1] in Stata, as shown below.
gen Y = 0
replace Y = 1 if count == 1
sort ID count
by ID: replace Y = (1+X)*Y[_n-1] if count != 1
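If you want to check the arithmetic outside Stata, the recurrence can be replayed in a few lines of Python. This is only a sketch: running_y is a made-up helper name, the rows are copied from the question, and it assumes every block of ID starts at count == 1.

```python
# Hypothetical helper (not from the thread). It mirrors
#   bysort ID (count): replace Y = (1 + X) * Y[_n-1] if count > 1
# and assumes each block of ID starts with count == 1.
def running_y(rows):
    """rows: (X, count, ID) tuples; X is None where Stata shows '.'."""
    ys = []
    # Sort by ID, then by count within ID, like bysort ID (count).
    for x, count, _id in sorted(rows, key=lambda r: (r[2], r[1])):
        if count == 1:
            ys.append(1)                  # first observation of a block
        else:
            ys.append((1 + x) * ys[-1])   # multiply by the previous row's Y
    return ys

rows = [(None, 1, 1), (2, 2, 1), (1, 3, 1), (3, 4, 1), (2, 5, 1),
        (None, 1, 2), (1, 2, 2), (7, 3, 2)]
print(running_y(rows))  # [1, 3, 6, 24, 72, 1, 2, 16]
```

The printed values agree with the Y column in the question.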

Related

Identify and delete observations that do not meet conditions in Stata

I need help identifying and removing observations that do not meet certain conditions. My data looks like this:
ID caseID set Var1 Var2
1 1 1 1 0
1 2 1 2 0
1 3 1 3 1
1 4 2 1 0
1 5 2 2 0
1 6 2 3 1
2 7 3 1 0
2 8 3 2 0
2 9 3 3 1
2 10 4 1 0
2 11 4 2 0
2 12 4 3 0
For every set, I want to have one observation in which Var2=1 and two observations in which Var2=0. If they do not meet this condition, I want to delete all observations from the set. For example, I would delete set=4 because Var2=0 for all observations. How can I do this in Stata?
Consider the following new variables:
egen count1 = total(Var2 == 1), by(set)
egen count0 = total(Var2 == 0), by(set)
egen total = total(Var2), by(set)
A literal reading of your question implies that you want to
keep if count1 == 1 & count0 == 2
But if sets are always of size 3 and no values other than 0 or 1 are possible, then any one of count1 == 1, count0 == 2, or total == 1 suffices as the condition.
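For anyone who wants to sanity-check which sets survive, here is a rough Python equivalent of the two egen counts followed by the keep. It is a sketch only: the rows are copied from the question, and the names count1, count0 and kept are just chosen to echo the Stata variables.

```python
from collections import defaultdict

# (ID, caseID, set, Var1, Var2) rows copied from the question.
data = [
    (1, 1, 1, 1, 0), (1, 2, 1, 2, 0), (1, 3, 1, 3, 1),
    (1, 4, 2, 1, 0), (1, 5, 2, 2, 0), (1, 6, 2, 3, 1),
    (2, 7, 3, 1, 0), (2, 8, 3, 2, 0), (2, 9, 3, 3, 1),
    (2, 10, 4, 1, 0), (2, 11, 4, 2, 0), (2, 12, 4, 3, 0),
]

# Per-set counts, analogous to egen ... = total(Var2 == 1), by(set).
count1 = defaultdict(int)
count0 = defaultdict(int)
for _id, _case, set_id, _var1, var2 in data:
    count1[set_id] += (var2 == 1)
    count0[set_id] += (var2 == 0)

# Analogous to: keep if count1 == 1 & count0 == 2
kept = [row for row in data if count1[row[2]] == 1 and count0[row[2]] == 2]
kept_sets = sorted({row[2] for row in kept})
print(kept_sets)  # [1, 2, 3]
```

Set 4 is dropped because it has no observation with Var2 equal to 1.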

Shifting with resampling in time-series data

Assume that I have this time-series data:
A B
timestamp
1 1 2
2 1 2
3 1 1
4 0 1
5 1 0
6 0 1
7 1 0
8 1 1
I am looking for a resampling window that gives me at least a specific count of occurrences.
If I resample the data from timestamps 1 to 8 with a 2S window, I get a different maximum than if I start from 2 and go to 8 with the same window size (2S).
ds = series.resample( str(tries) +'S').sum()
import pandas as pd

for shift in range(1, 100):
    tries = 1
    series = pd.read_csv("file.csv", index_col='timestamp')[shift:]
    ds = series.resample(str(tries) + 'S').sum()
    # max is a method, so it has to be called
    while (ds.A.max() + ds.B.max() < 4) and (tries < len(ds)):
        ds = series.resample(str(tries) + 'S').sum()
        tries = tries + 1
    # other lines
I am looking for a performance improvement, as it takes prohibitively long to finish for large data.

Create a dummy variable for the last rows based on another variable

I would like to create a dummy variable that looks at the variable "count" and flags that many rows as 1, counting back from the last row of each id. For example, ID 1 has a count of 3, so its last three rows are flagged, giving the pattern 0,0,1,1,1. Similarly, ID 4, which has a count of 1, gets 0,0,0,1. The IDs have different numbers of rows. The variable "wish" shows what I want to obtain as the final output.
input byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
For future questions, you should show your failed attempts. This shows that you have done your part, namely, researched your problem.
One way is:
clear
set more off
*----- example data -----
input ///
byte id count wish str9 date
1 3 0 22sep2006
1 3 0 23sep2006
1 3 1 24sep2006
1 3 1 25sep2006
1 3 1 26sep2006
2 4 1 22mar2004
2 4 1 23mar2004
2 4 1 24mar2004
2 4 1 25mar2004
3 2 0 28jan2003
3 2 0 29jan2003
3 2 1 30jan2003
3 2 1 31jan2003
4 1 0 02dec1993
4 1 0 03dec1993
4 1 0 04dec1993
4 1 1 05dec1993
5 1 0 08feb2005
5 1 0 09feb2005
5 1 0 10feb2005
5 1 1 11feb2005
6 3 0 15jan1999
6 3 0 16jan1999
6 3 1 17jan1999
6 3 1 18jan1999
6 3 1 19jan1999
end
list, sepby(id)
*----- what you want -----
bysort id: gen wish2 = _n > (_N - count)
list, sepby(id)
I assume you already sorted your date variable within ids.
One way to accomplish this would be to use within-group row numbers using 'bysort'-type logic:
***Create variable of within-group row numbers.
bysort id: gen obsnum = _n
***Calculate total number of rows within each group.
by id: egen max_obsnum = max(obsnum)
***Subtract the count variable from the group row count.
***This is the number of rows where we want the dummy to equal zero.
gen max_obsnum_less_count = max_obsnum - count
***Create the dummy to equal one when the row number is
***greater than this last variable.
gen dummy = (obsnum > max_obsnum_less_count)
***Clean up.
drop obsnum max_obsnum max_obsnum_less_count
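Both answers boil down to the same comparison, _n > _N - count, within each id. If it helps to see that logic outside Stata, here is a small Python sketch; last_rows_dummy is a made-up name, the (id, count) pairs are the first four ids from the example data, and the rows are assumed to be already sorted by id and date.

```python
from itertools import groupby

# Hypothetical helper (not from the thread): flags the last `count` rows per id.
def last_rows_dummy(rows):
    """rows: (id, count) pairs, assumed already sorted by id and date."""
    flags = []
    for _id, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        n_total = len(group)                        # plays the role of _N
        for n, (_, count) in enumerate(group, 1):   # n plays the role of _n
            flags.append(int(n > n_total - count))
    return flags

# ids 1 to 4 from the example: 5, 4, 4 and 4 rows with counts 3, 4, 2 and 1.
rows = [(1, 3)] * 5 + [(2, 4)] * 4 + [(3, 2)] * 4 + [(4, 1)] * 4
print(last_rows_dummy(rows))
# [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
```

The flags match the "wish" column for those ids.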

Creating a non-right Pascal's triangle (centered) in python

I need to write code that prints a non-right (centered) Pascal's triangle, given the nth level as input, where the first row is the 0th level. In addition, the level must be indicated at the end of each row.
Here's what I've made so far:
level = input('Please input nth level: ')
x = -1
y = 1
while x < level:
    x = x+1
    d = str(11**x)
    while y < level:
        y = y+1
        print " ",
    for m,n in enumerate(d):
        print str(n) + " ",
    while y < level:
        y = y+1
        print " ",
    print x
When I input 3, it outputs:
1 0
1 1 1
1 2 1 2
1 3 3 1 3
My desired output is:
     1      0
    1 1     1
   1 2 1    2
  1 3 3 1   3
You could use str.format to center the string for you:
level = int(raw_input('Please input nth level: '))
N = level*2 + 5
for x in range(level+1):
    d = ' '.join(str(11**x))
    print('{d:^{N}} {x:>}'.format(N=N, d=d, x=x))
Please input nth level: 4
      1       0
     1 1      1
    1 2 1     2
   1 3 3 1    3
  1 4 6 4 1   4
Note that if d = '1331', then you can add a space between each digit using ' '.join(d):
In [29]: d = '1331'
In [30]: ' '.join(d)
Out[30]: '1 3 3 1'
Note that using d = str(11**x) is a problematic way of computing the numbers in Pascal's triangle since it does not give you the correct digits for x >= 5. For example,
Please input nth level: 5
       1        0
      1 1       1
     1 2 1      2
    1 3 3 1     3
   1 4 6 4 1    4
  1 6 1 0 5 1   5 <-- Should be 1 5 10 10 5 1 !
You'll probably want to compute the digits in Pascal's triangle a different way.
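One possible way to do that, sketched here in Python 3 rather than the Python 2 used above, is to build each row from the previous one by adding adjacent entries. The function names pascal_rows and centered_triangle are made up for this example, and centering on the width of the last row is just one layout choice.

```python
def pascal_rows(level):
    """Yield rows 0..level of Pascal's triangle as lists of ints."""
    row = [1]
    for _ in range(level + 1):
        yield row
        # Each inner entry is the sum of the two entries above it.
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]

def centered_triangle(level):
    """Return the printable lines: centered row, then the level number."""
    rows = [' '.join(map(str, r)) for r in pascal_rows(level)]
    width = len(rows[-1])   # the widest (last) row sets the centering width
    return [f'{r:^{width}} {i}' for i, r in enumerate(rows)]

for line in centered_triangle(5):
    print(line)
```

With this approach, level 5 comes out as 1 5 10 10 5 1 instead of the digits of 161051.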

Stata moving products

Using Stata I want a formula (line of code) that takes all of the previous entries for a given group G at a given cell and returns the product for all of the values at that cell and above. For example:
G X Y
1 1 1
1 2 2
1 6 12
1 3 36
2 2 2
2 4 8
3 2 2
4 2 2
4 11 22
4 7 154
G = Group ID, X = Value, Y = Moving Product
The way I have been doing this is pretty long and involves creating a good number of variables. There must be a way in Stata to just have it do a moving product by group ID (G).
Any insight is helpful
Here is the solution:
sort G
by G: gen moving_product = exp(sum(ln(X)))
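This should make moving_product equal to Y. Note that it relies on every X being strictly positive, because ln(X) is missing for zero or negative values.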
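To see why the log trick works, here is a small Python sketch that computes the running product per group twice: once by direct multiplication and once as exp of a running sum of logs. The function names are made up for this example, the (G, X) rows are copied from the question, and the rows are assumed to be already sorted by G.

```python
import math
from itertools import accumulate, groupby
from operator import mul

# (G, X) rows copied from the question, already sorted by G.
rows = [(1, 1), (1, 2), (1, 6), (1, 3), (2, 2), (2, 4),
        (3, 2), (4, 2), (4, 11), (4, 7)]

def moving_product(rows):
    """Running product of X within each group, by direct multiplication."""
    out = []
    for _g, grp in groupby(rows, key=lambda r: r[0]):
        out.extend(accumulate((x for _, x in grp), mul))
    return out

def moving_product_logs(rows):
    """Same quantity as exp(running sum of ln X); needs X > 0."""
    out = []
    for _g, grp in groupby(rows, key=lambda r: r[0]):
        running = accumulate(math.log(x) for _, x in grp)
        out.extend(math.exp(s) for s in running)
    return out

print(moving_product(rows))  # [1, 2, 12, 36, 2, 8, 2, 2, 22, 154]
```

The log version returns floats that agree with the direct products up to rounding error.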