Assume that I have this time-series data:

           A  B
timestamp
1          1  2
2          1  2
3          1  1
4          0  1
5          1  0
6          0  1
7          1  0
8          1  1
I am looking for a resample window size that yields at least a specific count of occurrences for some frequency. If I resample the data from timestamps 1 to 8 with a 2S window, I get a different maximum than if I start from 2 to 8 with the same window size (2S). My current approach:
import pandas as pd

for shift in range(1, 100):
    tries = 1
    series = pd.read_csv("file.csv", index_col='timestamp')[shift:]
    ds = series.resample(str(tries) + 'S').sum()
    while (ds.A.max() + ds.B.max() < 4) and (tries < len(ds)):
        tries = tries + 1
        ds = series.resample(str(tries) + 'S').sum()
    # other lines
I am looking for a performance improvement, as this takes prohibitively long to finish for large data.
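One likely win, as a minimal sketch: the CSV is re-read on every shift, which dominates the runtime for large data, so read it once and slice. parse_dates=True is an assumption here, so that the timestamp index is datetime-like, which resample('...S') requires.

import pandas as pd

# Read the file once; re-reading it on every shift is the main cost.
# parse_dates=True is assumed so the index is datetime-like.
series = pd.read_csv("file.csv", index_col='timestamp', parse_dates=True)

for shift in range(1, 100):
    sub = series[shift:]                    # same row-shift as before
    for tries in range(1, len(sub) + 1):
        ds = sub.resample(str(tries) + 'S').sum()
        if ds.A.max() + ds.B.max() >= 4:    # stop at the first window that works
            break
    # other lines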
I need help identifying and removing observations that meet certain conditions. My data looks like this:
ID caseID set Var1 Var2
1 1 1 1 0
1 2 1 2 0
1 3 1 3 1
1 4 2 1 0
1 5 2 2 0
1 6 2 3 1
2 7 3 1 0
2 8 3 2 0
2 9 3 3 1
2 10 4 1 0
2 11 4 2 0
2 12 4 3 0
For every set, I want to have one observation in which Var2=1 and two observations in which Var2=0. If they do not meet this condition, I want to delete all observations from the set. For example, I would delete set=4 because Var2=0 for all observations. How can I do this in Stata?
Consider the following new variables:
egen count1 = total(Var2 == 1), by(set)
egen count0 = total(Var2 == 0), by(set)
egen total = total(Var2), by(set)
A literal reading of your question implies that you want to
keep if count1 == 1 & count0 == 2
But if sets are always of size 3 and no values other than 0 or 1 are possible, then any one of count1 == 1, count0 == 2, or total == 1 suffices as the condition, since each implies the other two.
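As an aside for pandas users (the question is about Stata, so this is only an illustrative analogue), the same filter might look like this, assuming a hypothetical DataFrame df with the columns shown:

import pandas as pd

# Keep a set only if it has exactly one Var2 == 1 and two Var2 == 0.
count1 = df['Var2'].eq(1).groupby(df['set']).transform('sum')
count0 = df['Var2'].eq(0).groupby(df['set']).transform('sum')
df = df[(count1 == 1) & (count0 == 2)]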
For each row of data in a DataFrame I would like to compute the number of unique values in columns A and B for that particular row and a reference row within the group identified by another column ID. Here is a toy dataset:
import pandas as pd

d = {'ID': pd.Series([1,1,1,2,2,2,2,3,3]),
     'A': pd.Series([1,2,3,4,5,6,7,8,9]),
     'B': pd.Series([1,2,3,4,11,12,13,14,15]),
     'REFERENCE': pd.Series([1,0,0,0,0,1,0,1,0])}
data = pd.DataFrame(d)
The data looks like this:
In [3]: data
Out[3]:
A B ID REFERENCE
0 1 1 1 1
1 2 2 1 0
2 3 3 1 0
3 4 4 2 0
4 5 11 2 0
5 6 12 2 1
6 7 13 2 0
7 8 14 3 1
8 9 15 3 0
Now, within each group defined using ID I want to compare each record with the reference record and I want to compute the number of unique A and B values for the combination. For instance, I can compute the value for data record 3 by taking len(set([4,4,6,12])) which gives 3. The result should look like this:
A B ID REFERENCE CARDINALITY
0 1 1 1 1 1
1 2 2 1 0 2
2 3 3 1 0 2
3 4 4 2 0 3
4 5 11 2 0 4
5 6 12 2 1 2
6 7 13 2 0 4
7 8 14 3 1 2
8 9 15 3 0 3
The only way I can think of implementing this is using for loops that loop over each grouped object and then each record within the grouped object and computes it against the reference record. This is non-pythonic and very slow. Can anyone please suggest a vectorized approach to achieve the same?
I would create a new column combining A and B into a tuple, then group by ID, build groups = dict(list(grouped)), and get the length of each frame using len().
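That only counts rows per group, though, not the per-row cardinality against the reference record. A minimal sketch of a faster route, reusing the data frame built above and assuming (as in the toy data) exactly one REFERENCE == 1 row per ID: broadcast each group's reference A and B onto its rows, then count unique values per row.

# Reference row per ID (assumes exactly one REFERENCE == 1 row per group).
ref = data.loc[data['REFERENCE'] == 1].set_index('ID')[['A', 'B']]

# Broadcast the reference values onto every row with the same ID.
ref_a = data['ID'].map(ref['A'])
ref_b = data['ID'].map(ref['B'])

# Row-wise cardinality of {A, B, reference A, reference B}.
data['CARDINALITY'] = [len({a, b, ra, rb}) for a, b, ra, rb
                       in zip(data['A'], data['B'], ref_a, ref_b)]

This trades the nested group/record loops for two map lookups and a single pass over the rows.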
I need to write code that prints a non-right (centered) Pascal's triangle, given the nth level as input, where the first row is the 0th level. In addition, the level must be indicated at the end of each row.
Here's what I've made so far:
level = input('Please input nth level: ')
x = -1
y = 1
while x < level:
    x = x+1
    d = str(11**x)
    while y < level:
        y = y+1
        print " ",
    for m,n in enumerate(d):
        print str(n) + " ",
    while y < level:
        y = y+1
        print " ",
    print x
When I input 3, it outputs:
    1  0
1  1  1
1  2  1  2
1  3  3  1  3
My desired output is:
   1    0
  1 1   1
 1 2 1  2
1 3 3 1 3
You could use str.format to center the string for you:
level = int(raw_input('Please input nth level: '))
N = level*2 + 5
for x in range(level+1):
    d = ' '.join(str(11**x))
    print('{d:^{N}} {x:>}'.format(N=N, d=d, x=x))
Please input nth level: 4
      1       0
     1 1      1
    1 2 1     2
   1 3 3 1    3
  1 4 6 4 1   4
Note that if d = '1331', then you can add a space between each digit using ' '.join(d):
In [29]: d = '1331'
In [30]: ' '.join(d)
Out[30]: '1 3 3 1'
Note that using d = str(11**x) is a problematic way of computing the numbers in Pascal's triangle since it does not give you the correct digits for x >= 5. For example,
Please input nth level: 5
       1        0
      1 1       1
     1 2 1      2
    1 3 3 1     3
   1 4 6 4 1    4
  1 6 1 0 5 1   5   <-- Should be 1 5 10 10 5 1 !
You'll probably want to compute the digits in Pascal's triangle a different way.
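A minimal sketch of one such way, kept in Python 2 to match the answer above: build each row from the previous one via Pascal's rule, and size the centering field from the widest (last) row.

level = int(raw_input('Please input nth level: '))

# Each new row is 1, the pairwise sums of the previous row, then 1.
rows = [[1]]
for _ in range(level):
    prev = rows[-1]
    rows.append([1] + [prev[i] + prev[i+1] for i in range(len(prev)-1)] + [1])

N = len(' '.join(map(str, rows[-1])))   # widest row sets the field width
for x, row in enumerate(rows):
    d = ' '.join(map(str, row))
    print('{d:^{N}} {x}'.format(d=d, N=N, x=x))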
In Stata I want to have a variable calculated by a formula, which includes multiplying by the previous value, within blocks defined by a variable ID. I tried using a lag but that did not work for me.
In the formula below the Y-1 is intended to signify the value above (the lag).
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y-1 if count != 1
X Y count ID
. 1 1 1
2 3 2 1
1 6 3 1
3 24 4 1
2 72 5 1
. 1 1 2
1 2 2 2
7 16 3 2
Your code can be made a little more concise. Here's how:
input X count ID
. 1 1
2 2 1
1 3 1
3 4 1
2 5 1
. 1 2
1 2 2
7 3 2
end
gen Y = count == 1
bysort ID (count) : replace Y = (1 + X) * Y[_n-1] if count > 1
The creation of a dummy (indicator) variable can exploit the fact that true or false expressions are evaluated as 1 or 0.
The separate sort and the subsequent by command can be condensed into a single bysort. Note that the syntax ID (count) spells out that, within blocks of ID, observations remain sorted on count.
This is really a comment, not another answer, but it would be less clear if presented as such.
Y-1, the lag in the formula, would be translated as shown below.
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y[_n-1] if count != 1
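As an aside for pandas users (separate from the Stata answers above): since Y = 1 on each ID's first row and each later Y multiplies the previous one by (1 + X), Y is just the cumulative product of (1 + X) within ID. A minimal sketch, assuming a hypothetical DataFrame df with columns ID and X, sorted by ID and count, with the missing leading X treated as 0:

# Cumulative product of (1 + X) within each ID reproduces Y above:
# ID 1 gives 1, 3, 6, 24, 72 and ID 2 gives 1, 2, 16.
df['Y'] = (1 + df['X'].fillna(0)).groupby(df['ID']).cumprod()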
It has been posted that slicing on the second index can be done on a multi-indexed pandas Series:
import numpy as np
import pandas as pd
buckets = np.repeat(range(3), [3,5,7])
sequence = np.hstack(map(range,[3,5,7]))
s = pd.Series(np.random.randn(len(sequence)),
              index=pd.MultiIndex.from_tuples(zip(buckets, sequence)))
print s
0 0 0.021362
1 0.917947
2 -0.956313
1 0 -0.242659
1 0.398657
2 0.455909
3 0.200061
4 -1.273537
2 0 0.747849
1 -0.012899
2 1.026659
3 -0.256648
4 0.799381
5 0.064147
6 0.491336
Then to get the first three rows for the first index=1, you simply say:
s[1].ix[range(3)]
0 -0.242659
1 0.398657
2 0.455909
This works fine for 1-dimensional Series, but not for DataFrames:
buckets = np.repeat(range(3), [3,5,7])
sequence = np.hstack(map(range,[3,5,7]))
d = pd.DataFrame(np.random.randn(len(sequence),2),
                 index=pd.MultiIndex.from_tuples(zip(buckets, sequence)))
print d
0 1
0 0 1.217659 0.312286
1 0.559782 0.686448
2 -0.143116 1.146196
1 0 -0.195582 0.298426
1 1.504944 -0.205834
2 0.018644 -0.979848
3 -0.387756 0.739513
4 0.719952 -0.996502
2 0 0.065863 0.481190
1 -1.309163 0.881319
2 0.545382 2.048734
3 0.506498 0.451335
4 0.872743 -0.070985
5 -1.160473 1.082550
6 0.331796 -0.366597
d[1].ix[range(3)]
0 0 0.312286
1 0.686448
2 1.146196
Name: 1
It gives you the "1th" column of data, and the first three rows, irrespective of the first index level. How can you get the first three rows for the first index=1 for a multi-indexed DataFrame?
d.xs(1)[0:3]
0 1
0 -0.716206 0.119265
1 -0.782315 0.097844
2 2.042751 -1.116453
.loc with pd.IndexSlice is more efficient because the whole selection is evaluated in one step. Note that .loc slicing is label-based and inclusive of both endpoints, so s.loc[pd.IndexSlice[1, :2]] returns the entries with first level 1 and second-level labels 0 through 2, i.e. the first three rows.
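A runnable sketch of that IndexSlice route on a DataFrame like d above (random values, so the numbers differ from the earlier printout):

import numpy as np
import pandas as pd

buckets = np.repeat(range(3), [3, 5, 7])
sequence = np.hstack([list(range(n)) for n in [3, 5, 7]])
d = pd.DataFrame(np.random.randn(len(sequence), 2),
                 index=pd.MultiIndex.from_tuples(list(zip(buckets, sequence))))

# .loc slicing is label-based and inclusive: :2 selects second-level
# labels 0, 1 and 2, i.e. the first three rows under first level 1.
print(d.loc[pd.IndexSlice[1, :2], :])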