I am trying to count how often each value occurs within consecutive time windows. For example:
time letter
1 A
4 B
5 C
9 C
18 B
30 A
30 B
I divide the time range into windows of size (1 + 30) / 30, i.e. roughly 1,
and then I want to know how many A, B and C values fall in each time window:
timeseries A B C
1 1 0 0
2 0 0 0
...
30 1 1 0
This should give me a table of 30 rows and 3 columns (A, B, C) of occurrences.
The problem is that the data takes too long to break down, because the code iterates through the whole master table for every window in order to slice the data, even though the data is already sorted.
master = mytable
minimum = master.timestamp.min()
maximum = master.timestamp.max()
window = (minimum + maximum) / maximum
wstart = minimum
wend = minimum + window
concurrent_tasks = []
while ( wstart <= maximum ):
    As = 0
    Bs = 0
    Cs = 0
    for d, row in master.iterrows():
        ttime = row.timestamp
        if ((ttime >= wstart) & (ttime < wend)):
            #print (row.channel)
            if (row.channel == 'A'):
                As = As + 1
            elif (row.channel == 'B'):
                Bs = Bs + 1
            elif (row.channel == 'C'):
                Cs = Cs + 1
    concurrent_tasks.append([m_id, As, Bs, Cs])
    wstart = wstart + window
    wend = wend + window
Could you help me make this perform better? I would like to use a map function, and I want to prevent Python from looping through the whole table for every window.
This is part of a big data set and it is taking days to finish.
Thank you.
There is a faster approach - pd.get_dummies():
In [116]: pd.get_dummies(df.set_index('time')['letter'])
Out[116]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 0 0
30 0 1 0
If you want to "compress" (group) it by time:
In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
Out[146]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
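or using sklearn.feature_extraction.text.CountVectorizer (note that pd.SparseDataFrame was removed in pandas 1.0; on newer versions pd.DataFrame.sparse.from_spmatrix accepts the same matrix, index and columns):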
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)
r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),
index=df['time'].unique(),
columns=df['letter'].unique(),
default_fill_value=0)
Result:
In [143]: r
Out[143]:
A B C
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
If we want to list all times from 1 to 30:
In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
Out[153]:
A B C
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 1 0
19 0 0 0
20 0 0 0
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
or using Pandas approach:
In [159]: pd.get_dummies(df.set_index('time')['letter']) \
...: .groupby(level=0) \
...: .sum() \
...: .reindex(np.arange(r.index.min(), r.index.max()+1), fill_value=0)
...:
Out[159]:
A B C
time
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
... .. .. ..
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
[30 rows x 3 columns]
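If the window is wider than one time unit, the same idea still works once each row is mapped to a window number. The following is a minimal sketch rather than part of the original answer: the helper name count_per_window and its width parameter are made up for illustration, and it assumes the columns are called time and letter as in the sample data.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"time": [1, 4, 5, 9, 18, 30, 30],
                   "letter": list("ABCCBAB")})


def count_per_window(frame, width=1):
    """Count letters per fixed-width time window (hypothetical helper)."""
    start = frame["time"].min()
    # Integer number of the window each row falls into, counted from the first time
    window_id = ((frame["time"] - start) // width).astype(int)
    counts = pd.crosstab(window_id, frame["letter"])
    # Add the empty windows back so that every window gets a row
    all_windows = np.arange(window_id.max() + 1)
    counts = counts.reindex(all_windows, fill_value=0)
    # Label each row with the start time of its window
    counts.index = start + all_windows * width
    counts.index.name = "window_start"
    return counts


table = count_per_window(df, width=1)
```

Each row is visited once, so the cost no longer grows with the number of windows times the number of rows.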
UPDATE:
Timing:
In [163]: df = pd.concat([df] * 10**4, ignore_index=True)
In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop
In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop
I am trying to print out a std::array as seen below. The output is supposed to consist only of booleans, but there seem to be other numbers in the output as well (also below). I've tried printing the elements that show up as numbers on their own, but then I get their actual boolean value, which is weird.
My main function:
float f(float x, float y)
{
    return x * x + y * y - 1;
}
int main()
{
    std::array<std::array<bool, ARRAY_SIZE_X>, ARRAY_SIZE_Y> temp = ConvertToBinaryImage(&f);
    for(int i = 0; i < (int)temp.size(); ++i)
    {
        for(int j = 0; j < (int)temp[0].size(); ++j)
        {
            std::cout << temp[i][j] << " ";
        }
        std::cout << std::endl;
    }
}
The function that sets the array:
std::array<std::array<bool, ARRAY_SIZE_X>, ARRAY_SIZE_Y> ConvertToBinaryImage(float(*func)(float, float))
{
    std::array<std::array<bool, ARRAY_SIZE_X>, ARRAY_SIZE_Y> result;
    for(float x = X_MIN; x <= X_MAX; x += STEP_SIZE)
    {
        for(float y = Y_MIN; y <= Y_MAX; y += STEP_SIZE)
        {
            int indx = ARRAY_SIZE_X - (x - X_MIN) * STEP_SIZE_INV;
            int indy = ARRAY_SIZE_Y - (y - Y_MIN) * STEP_SIZE_INV;
            result[indx][indy] = func(x, y) < 0;
        }
    }
    return result;
}
The constants
#define X_MIN -1
#define Y_MIN -1
#define X_MAX 1
#define Y_MAX 1
#define STEP_SIZE_INV 10
#define STEP_SIZE (float)1 / STEP_SIZE_INV
#define ARRAY_SIZE_X (X_MAX - X_MIN) * STEP_SIZE_INV
#define ARRAY_SIZE_Y (Y_MAX - Y_MIN) * STEP_SIZE_INV
My output:
184 225 213 111 0 0 0 0 230 40 212 111 0 0 0 0 64 253 98 0
0 0 0 0 1 0 1 0 1 1 1 1 6 1 0 0 168 0 0 0
0 183 213 111 0 0 0 0 0 0 0 0 0 0 0 0 9 242 236 108
0 0 0 1 64 1 1 0 1 1 1 1 240 1 1 1 249 1 0 0
0 21 255 0 0 0 0 0 98 242 236 108 0 0 0 0 0 0 0 0
0 0 0 1 128 1 1 0 1 1 1 1 128 1 1 1 0 1 1 0
0 1 255 1 0 1 1 0 1 1 1 1 0 1 1 1 31 1 1 1
0 0 0 0 184 225 213 111 0 0 0 0 2 0 0 0 0 0 0 0
9 1 0 1 0 1 1 0 1 1 1 1 0 1 1 1 64 1 1 1
0 1 0 1 64 1 1 0 1 1 1 1 96 1 1 1 249 1 1 1
0 1 213 1 0 1 1 0 1 1 1 1 0 1 1 1 32 1 1 1
0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1
0 21 255 0 0 0 0 0 80 59 117 0 0 0 0 0 32 112 64 0
0 1 0 1 17 1 1 16 1 1 1 1 104 1 1 1 0 1 1 1
0 0 144 1 249 1 1 0 1 1 1 1 0 1 1 1 0 1 1 0
0 0 0 1 80 1 1 0 1 1 1 1 24 1 1 1 0 1 1 0
0 0 0 0 0 0 0 0 17 0 1 16 0 0 0 0 112 7 255 0
0 0 0 1 134 1 1 30 1 1 1 1 8 1 1 1 0 1 0 0
0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 32 0 0 0
0 0 0 0 0 0 1 0 1 1 1 1 0 1 0 0 0 0 0 0
Floating point maths will often not produce exact results; see the question "Is floating point math broken?".
If we print out the values of indx and indy:
20, 20
20, 19
20, 18
20, 17
20, 15
20, 14
20, 13
20, 13
20, 11
20, 10
20, 9
20, 9
20, 8
20, 6
20, 5
20, 5
20, 3
20, 3
20, 1
20, 1
19, 20
19, 19
19, 18
19, 17
...
You can see that you are writing to indexes with the value 20, which is out of bounds of the array, and that you aren't writing to every index, which leaves some of the array elements uninitialised. Although a bool is normally only true or false, it is usually stored as a byte, which can hold values between 0 and 255. Reading the uninitialised values is undefined behaviour, which is why arbitrary numbers show up in the output.
We can fix your code in this particular instance by calculating the indexes a little more carefully (std::clamp needs C++17 and <algorithm>, and round needs <cmath>):
int indx = std::clamp(int(round(ARRAY_SIZE_X - (x - X_MIN) * STEP_SIZE_INV)), 1, ARRAY_SIZE_X)-1;
int indy = std::clamp(int(round(ARRAY_SIZE_Y - (y - Y_MIN) * STEP_SIZE_INV)), 1, ARRAY_SIZE_Y)-1;
There are two fixes here. You were generating values between 1 and 20; the -1 reduces this to 0 to 19. The round solves the issue of not using all the indexes (assigning to an int simply truncated the value). In addition, the clamp ensures the values are always in range (though in this case the calculations already work out to be in range).
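The difference between truncating and rounding can be seen in isolation. This is a small Python sketch, not the C++ fix itself: to_index is a hypothetical helper that rounds and then clamps into 0..size-1, and the value (1 - 0.9) * 10 is just a convenient example of a result that is mathematically 1 but is stored as slightly less.

```python
def to_index(value, size):
    """Round to the nearest cell, then clamp into 0..size-1 (hypothetical helper)."""
    # Note: Python's round() rounds exact halves to even, unlike C's round();
    # that does not matter for values that are merely a hair below an integer.
    nearest = round(value)
    return max(0, min(size - 1, nearest))


almost_one = (1 - 0.9) * 10     # mathematically 1, stored as slightly less than 1

truncated = int(almost_one)     # conversion to int drops the fraction
rounded = round(almost_one)     # rounding recovers the intended cell

first_cell = to_index(-3.2, 20)  # below the range, clamped to the first cell
last_cell = to_index(20.0, 20)   # above the range, clamped to the last cell
```

Truncation lands one cell too low whenever the accumulated error is negative, which is how some cells end up never being written.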
As you always want to write to every pixel, a better solution would be to iterate over the values of indx and indy and calculate the values of x and y from the indices:
for (int indx = 0; indx < ARRAY_SIZE_X; indx++)
{
    float x = X_MIN - (indx - ARRAY_SIZE_X) * STEP_SIZE;
    for (int indy = 0; indy < ARRAY_SIZE_Y; indy++)
    {
        float y = Y_MIN - (indy - ARRAY_SIZE_Y) * STEP_SIZE;
        result[indx][indy] = func(x, y) < 0;
    }
}
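The same index-driven idea, sketched in Python to make it easy to check that no cell is skipped. The constant names mirror the question's macros but are otherwise illustrative, and None is used here only as a marker for "never written".

```python
X_MIN, X_MAX = -1.0, 1.0
Y_MIN, Y_MAX = -1.0, 1.0
STEP_SIZE_INV = 10
SIZE_X = int((X_MAX - X_MIN) * STEP_SIZE_INV)   # 20 cells
SIZE_Y = int((Y_MAX - Y_MIN) * STEP_SIZE_INV)   # 20 cells


def f(x, y):
    # Negative inside the unit circle
    return x * x + y * y - 1


def convert_to_binary_image(func):
    # None marks a cell that was never assigned
    image = [[None] * SIZE_Y for _ in range(SIZE_X)]
    for ix in range(SIZE_X):
        # Coordinate derived from the index, so every index is visited exactly once
        x = X_MIN - (ix - SIZE_X) / STEP_SIZE_INV
        for iy in range(SIZE_Y):
            y = Y_MIN - (iy - SIZE_Y) / STEP_SIZE_INV
            image[ix][iy] = func(x, y) < 0
    return image


image = convert_to_binary_image(f)
```

Because the loop counters are integers, there is no accumulated floating point error in the indices and no possibility of an out-of-range write.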
I am trying to categorize a variable Y into a new variable X. Y has 82 values equal to 0, 118 values between 1 and 6, 0 values between 7 and 12, 0 values between 13 and 18, and 0 values between 19 and 24.
I tried the following code:
gen X = .
replace X = 1 if Y >= 1 & Y <= 6
replace X = 2 if Y >= 7 & Y <= 12
replace X = 3 if Y >= 13 & Y <= 18
replace X = 4 if Y >= 19 & Y <= 24
I expected to see X categorized as 0, 1-6, 7-12, 13-18, 19-24, instead of just 0 and 1.
Current Results:
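tab X

          X |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         82       41.00       41.00
          1 |        118       59.00      100.00
------------+-----------------------------------
      Total |        200      100.00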
* Example generated by -dataex-. To install: ssc install dataex
clear
input int FID byte Y float X
150 0 0
17 0 0
95 1 1
0 0 0
18 0 0
1 0 0
96 0 0
54 0 0
172 3 1
97 0 0
57 1 1
19 0 0
98 1 1
151 0 0
99 1 1
2 3 1
197 1 1
55 2 1
58 1 1
100 0 0
end
Your code does serve your purpose: variable X is indeed the right set of categories for variable Y, as you intended.
The fact that you only see X in the range 0, 1 simply means that the data have no observations with Y falling in the other categories. If the data had any Y belonging to other categories, then the corresponding values of X would show up.
A more direct way to get this output is shown below. Just give it a try.
egen YCat = cut(Y), at(0,1,7,13,19,25)
Your code looks fine, except, crucially, that nothing in your code yields 0 as a result.
However, I disagree with @Romalpa Akzo on recommending egen, cut(). Even an experienced Stata user is unlikely to remember the exact rules used by that function of that command.
Are lower bounds >= or >, in particular? What happens above and below the extreme values mentioned? What if you don't want results numbered from 1 up?
I prefer explicit code.
Here's another way to do it. With a programmer's understanding that cond(A, B, C) yields B if A is true (non-zero) and C if A is false (zero), then we can go
clear
set obs 26
generate Y = _n - 1
generate X = cond(Y > 24, ., ///
cond(Y >= 19, 4, ///
cond(Y >= 13, 3, ///
cond(Y >= 7, 2, ///
cond(Y >= 1, 1, 0 )))))
tabulate Y X , missing
| X
Y | 0 1 2 3 4 . | Total
-----------+------------------------------------------------------------------+----------
0 | 1 0 0 0 0 0 | 1
1 | 0 1 0 0 0 0 | 1
2 | 0 1 0 0 0 0 | 1
3 | 0 1 0 0 0 0 | 1
4 | 0 1 0 0 0 0 | 1
5 | 0 1 0 0 0 0 | 1
6 | 0 1 0 0 0 0 | 1
7 | 0 0 1 0 0 0 | 1
8 | 0 0 1 0 0 0 | 1
9 | 0 0 1 0 0 0 | 1
10 | 0 0 1 0 0 0 | 1
11 | 0 0 1 0 0 0 | 1
12 | 0 0 1 0 0 0 | 1
13 | 0 0 0 1 0 0 | 1
14 | 0 0 0 1 0 0 | 1
15 | 0 0 0 1 0 0 | 1
16 | 0 0 0 1 0 0 | 1
17 | 0 0 0 1 0 0 | 1
18 | 0 0 0 1 0 0 | 1
19 | 0 0 0 0 1 0 | 1
20 | 0 0 0 0 1 0 | 1
21 | 0 0 0 0 1 0 | 1
22 | 0 0 0 0 1 0 | 1
23 | 0 0 0 0 1 0 | 1
24 | 0 0 0 0 1 0 | 1
25 | 0 0 0 0 0 1 | 1
-----------+------------------------------------------------------------------+----------
Total | 1 6 6 6 6 1 | 26
Naturally, you could write the whole command on one line, but many will find the multiline layout easier to understand and to debug. With nested function calls, each new condition implies a promise to close all the parentheses at the end.
Multiple commands like those in the question are preferred by many Stata users too, so taste is behind many choices.
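For readers who want to cross-check the category boundaries outside Stata, the nested cond() chain corresponds to a plain if-chain. This is only a Python sketch of the same logic: the function name categorize is made up, None stands in for Stata's missing value, and the input is assumed to be a number.

```python
from collections import Counter


def categorize(y):
    """Map Y to 0 (below 1), 1 (1-6), 2 (7-12), 3 (13-18), 4 (19-24), else None."""
    # Conditions are tested from the top down, exactly like the nested cond() calls
    if y > 24:
        return None
    if y >= 19:
        return 4
    if y >= 13:
        return 3
    if y >= 7:
        return 2
    if y >= 1:
        return 1
    return 0


# Same check as the tabulate above: Y runs from 0 to 25
counts = Counter(categorize(y) for y in range(26))
```

Testing from the top down means each branch only needs a lower bound, which keeps the boundaries visible at a glance.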
I have a dataset consisting of the variables ObservationNumber, MeasurementNumber, SubjectID, and many dummy variables.
I would like to consolidate all non-zero values into one row per SubjectID and MeasurementNumber.
Have:
ObsNum MeasurementNum SubjectID Dummy0 Dummy1 ... Dummy999
----------------------------------------------------...---------------
01 1 1 0 1 ... 0
02 2 1 0 1 ... 0
03 3 1 0 1 ... 0
04 4 1 0 0 ... 0
05 5 1 - - ... -
06 6 1 0 0 ... 0
07 1 2 1 0 ... 0
08 2 2 0 0 ... 0
09 3 2 0 1 ... 0
10 4 2 1 0 ... 0
11 4 2 0 1 ... 0
12 5 2 0 0 ... 1
13 6 2 0 0 ... 0
14 6 2 0 0 ... 1
15 6 2 0 0 ... 0
16 6 2 0 0 ... 0
17 6 2 0 1 ... 0
18 6 2 0 0 ... 0
19 6 2 0 0 ... 0
20 6 2 0 0 ... 0
21 6 2 1 0 ... 0
22 1 3 1 0 ... 0
23 2 3 0 1 ... 0
24 3 3 0 0 ... 1
25 4 3 - - ... -
26 5 3 0 0 ... 0
27 6 3 0 0 ... 0
28 1 4 - - ... -
29 2 4 0 0 ... 0
30 3 4 0 1 ... 0
31 4 4 1 0 ... 0
32 4 4 0 1 ... 0
33 4 4 0 0 ... 1
34 5 4 0 0 ... 1
35 6 4 0 1 ... 0
36 6 4 0 0 ... 1
Want:
MeasurementNum SubjectID Dummy0 Dummy1 ... Dummy999
----------------------------------------------------...---------------
1 1 0 1 ... 0
2 1 0 1 ... 0
3 1 0 1 ... 0
4 1 0 0 ... 0
5 1 - - ... -
6 1 0 0 ... 0
1 2 1 0 ... 0
2 2 0 0 ... 0
3 2 0 1 ... 0
4 2 1 1 ... 0
5 2 0 0 ... 1
6 2 1 1 ... 1
1 3 1 0 ... 0
2 3 0 1 ... 0
3 3 0 0 ... 1
4 3 - - ... -
5 3 0 0 ... 0
6 3 0 0 ... 0
1 4 - - ... -
2 4 0 0 ... 0
3 4 0 1 ... 0
4 4 1 1 ... 1
5 4 0 0 ... 1
6 4 0 1 ... 1
Each SubjectID has six measurements, in each of which a set of dummy variables is measured with outcome 0, 1 or missing. If a missing value occurs, all dummy variables for the respective observation are missing, and only one observation will be present in the dataset for that MeasurementNumber.
I have tried to use the UPDATE statement, but it does not seem to be able to deal with '0' and '-'.
Is there a direct way of condensing all dummy variables in this dataset for each SubjectID, grouped by MeasurementNumber?
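Use Proc MEANS with BY and OUTPUT statements. The MAX of each flag within a BY group is 1 if any row has a 1, 0 if every row is 0, and missing if every row is missing. The BY statement expects the data to be sorted by subjectid and measurenum, which the generated sample data below already is.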
data have;
  rownum = 0;
  do rowid = 1 to 1000;
    subjectid + 1;
    do measurenum = 1 to 6;
      do repeat = 1 to ceil(4 * ranuni(123));
        array flags flag1-flag999;
        do _n_ = 1 to dim(flags);
          flags(_n_) = ranuni(123) < 0.10;
          if subjectid < 7 and measurenum = subjectid then flags(_n_) = .;
        end;
        rownum + 1;
        output;
      end;
    end;
  end;
  keep rownum measurenum subjectid flag:;
run;
proc means noprint data=have;
  by subjectid measurenum;
  var flag:;
  output max=;
run;
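The "maximum per group" idea is not specific to SAS. As a rough cross-check, here is the same aggregation sketched with pandas; the frame below is a hand-made excerpt of a few rows from the question (subject 2, measurements 4 and 5; subject 3, measurements 3 and 4), and the lower-case column names are illustrative.

```python
import numpy as np
import pandas as pd

have = pd.DataFrame({
    "subjectid":  [2, 2, 2, 3, 3],
    "measurenum": [4, 4, 5, 3, 4],
    "dummy0":     [1, 0, 0, 0, np.nan],
    "dummy1":     [0, 1, 0, 0, np.nan],
    "dummy999":   [0, 0, 1, 1, np.nan],
})

# One row per (subjectid, measurenum): max() ignores missing values,
# so a group stays missing only if every row in it is missing
want = have.groupby(["subjectid", "measurenum"], as_index=False).max()
```

The two rows for subject 2, measurement 4 collapse into a single row with dummy0 and dummy1 both set, while the all-missing measurement stays missing.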
I have a DataFrame (df_m) with a large number of rows, shown below. I want to plot the number of occurrences of each month, per year, in the date_m column. The years in date_m range from 2010 to 2017.
db num date_a date_m date_c zip_b zip_a
0 old HKK10032 2010-07-14 2010-07-26 NaT NaN NaN
1 old HKK10109 2011-07-14 2011-09-15 NaT NaN NaN
2 old HNN10167 2012-07-15 2012-08-09 NaT 177-003 NaN
3 old HKK10190 2013-07-15 2013-09-02 NaT NaN NaN
4 old HKK10251 2014-07-16 2014-05-02 NaT NaN NaN
5 old HKK10253 2015-07-16 2015-05-01 NaT NaN NaN
6 old HNN10275 2017-07-16 2017-07-18 2010-07-18 1070062 NaN
7 old HKK10282 2017-07-16 2017-08-16 NaT NaN NaN
............................................................
First, I extract the number of occurrences of each month (1-12) for every year (2010-2017), but there is an error in my code:
lst_all = []
for i in range(2010, 2018):
    lst_num = [sum(df_m.date_move.dt.month == j & df_m.date_move.dt.year == i) for j in range(1, 13)]
    lst_all.append(lst_num)
print lst_all
You need to wrap each condition in parentheses, because & binds more tightly than ==:
lst_all = []
for i in range(2010, 2018):
    lst_num = [((df_m.date_m.dt.month == j) & (df_m.date_m.dt.year == i)).sum() for j in range(1, 13)]
    lst_all.append(lst_num)
Then get:
df1 = pd.DataFrame(lst_all, index=range(2010, 2018), columns=range(1, 13))
print (df1)
1 2 3 4 5 6 7 8 9 10 11 12
2010 0 0 0 0 0 0 1 0 0 0 0 0
2011 0 0 0 0 0 0 0 0 1 0 0 0
2012 0 0 0 0 0 0 0 1 0 0 0 0
2013 0 0 0 0 0 0 0 0 1 0 0 0
2014 0 0 0 0 1 0 0 0 0 0 0 0
2015 0 0 0 0 1 0 0 0 0 0 0 0
2016 0 0 0 0 0 0 0 0 0 0 0 0
2017 0 0 0 0 0 0 1 1 0 0 0 0
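The double loop can also be avoided. As a sketch, assuming date_m is already a datetime64 column, pd.crosstab counts every (year, month) pair in one pass, and reindex adds back the years and months that never occur. The sample frame below only reproduces the eight date_m values shown in the question.

```python
import pandas as pd

df_m = pd.DataFrame({"date_m": pd.to_datetime([
    "2010-07-26", "2011-09-15", "2012-08-09", "2013-09-02",
    "2014-05-02", "2015-05-01", "2017-07-18", "2017-08-16",
])})

# Give the two key series distinct names so the table axes are unambiguous
years = df_m["date_m"].dt.year.rename("year")
months = df_m["date_m"].dt.month.rename("month")

# Count each (year, month) pair, then add the empty years and months back
df1 = pd.crosstab(years, months).reindex(
    index=range(2010, 2018), columns=range(1, 13), fill_value=0
)
```

The result has the same layout as the table above, so something like df1.T.plot() could then draw one line per year.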
I'm trying to write code that takes six components (a-f), builds combinations (combos) one, two, three, or four units long from them, and then produces non-duplicating combinations of those combos (combo.combos) which contain all of the components (i.e., [ab + cdef and ac + bde + f] but not [ae + bc + df and aef + bc + d]).
It would be nice if this code could allow me to 1) input the number of components, 2) input the min and max unit length per combo, 3) input the min and max number of combos per combo.combo, and 4) randomize the output list of combo.combos.
Maybe start with some kind of iteration loop to generate each version of the 720 possible component combinations (a-f) and then start pruning that list based on the set limiting parameters? I've got some working knowledge of python and will get started, but any tips or suggestions are most welcome.
combo.combo a b c d e f
a.bcdef 1 1 1 1 1 1
ab.cdef 1 1 1 1 1 1
abc.def 1 1 1 1 1 1
abcd.ef 1 1 1 1 1 1
abcde.f 1 1 1 1 1 1
a.b.cdef 1 1 1 1 1 1
a.bc.def 1 1 1 1 1 1
a.bcd.ef 1 1 1 1 1 1
a.bcde.f 1 1 1 1 1 1
ab.c.def 1 1 1 1 1 1
I've found a lot of code which will generate combinations and permutations, but not combinations of combinations. I've included a binary matrix for the combination components, but I am stuck on where to proceed from here, or whether this matrix is a false start (although a helpful visual aid).
combo a b c d e f
a 1 0 0 0 0 0
b 0 1 0 0 0 0
c 0 0 1 0 0 0
d 0 0 0 1 0 0
e 0 0 0 0 1 0
f 0 0 0 0 0 1
ab 1 1 0 0 0 0
ac 1 0 1 0 0 0
ad 1 0 0 1 0 0
ae 1 0 0 0 1 0
af 1 0 0 0 0 1
bc 0 1 1 0 0 0
bd 0 1 0 1 0 0
be 0 1 0 0 1 0
bf 0 1 0 0 0 1
cd 0 0 1 1 0 0
ce 0 0 1 0 1 0
cf 0 0 1 0 0 1
de 0 0 0 1 1 0
df 0 0 0 1 0 1
ef 0 0 0 0 1 1
abc 1 1 1 0 0 0
abd 1 1 0 1 0 0
abe 1 1 0 0 1 0
abf 1 1 0 0 0 1
acd 1 0 1 1 0 0
ace 1 0 1 0 1 0
acf 1 0 1 0 0 1
ade 1 0 0 1 1 0
adf 1 0 0 1 0 1
aef 1 0 0 0 1 1
bcd 0 1 1 1 0 0
bce 0 1 1 0 1 0
bcf 0 1 1 0 0 1
bde 0 1 0 1 1 0
bdf 0 1 0 1 0 1
bef 0 1 0 0 1 1
cde 0 0 1 1 1 0
cdf 0 0 1 1 0 1
cef 0 0 1 0 1 1
def 0 0 0 1 1 1
abcd 1 1 1 1 0 0
abce 1 1 1 0 1 0
abcf 1 1 1 0 0 1
abde 1 1 0 1 1 0
abdf 1 1 0 1 0 1
abef 1 1 0 0 1 1
acde 1 0 1 1 1 0
acdf 1 0 1 1 0 1
acef 1 0 1 0 1 1
adef 1 0 0 1 1 1
bcde 0 1 1 1 1 0
bcdf 0 1 1 1 0 1
bcef 0 1 1 0 1 1
bdef 0 1 0 1 1 1
cdef 0 0 1 1 1 1
The approach which first comes to mind is this:
1. Generate all the combinations using the given components (which you already did :) ).
2. Treat the resulting combinations as a new set of components (so instead of a, b, ..., f your set will contain a, ab, abc, ...).
3. Generate all the combinations from the second set.
4. From the new set of combinations, keep only those which satisfy your condition (it's not very clear from your example what the constraint is).
This, of course, has sky-high exponential complexity, since you'll have to backtrack twice and step 3 has way more possibilities.
It's very possible that there's a more efficient algorithm, starting from the constraint ("non duplicating combinations of combinations which contain all of the components").
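One possible reading of the constraint is that every component must appear exactly once across the combos, i.e. that each combo.combo is a set partition of the components. Under that assumption (which the question does not confirm), the partitions can be built directly instead of generated and pruned. The function name partitions and its parameters are hypothetical.

```python
import random


def partitions(items, min_len=1, max_len=4, min_blocks=1, max_blocks=None):
    """Yield every split of items into disjoint blocks that together cover all items."""
    items = list(items)
    if max_blocks is None:
        max_blocks = len(items)

    def place(i, blocks):
        if i == len(items):
            # All items placed: keep the split only if it meets the lower limits
            if len(blocks) >= min_blocks and all(len(b) >= min_len for b in blocks):
                yield ["".join(b) for b in blocks]
            return
        # Option 1: add the item to an existing block that still has room
        for block in blocks:
            if len(block) < max_len:
                block.append(items[i])
                yield from place(i + 1, blocks)
                block.pop()
        # Option 2: start a new block with this item
        if len(blocks) < max_blocks:
            blocks.append([items[i]])
            yield from place(i + 1, blocks)
            blocks.pop()

    yield from place(0, [])


combo_combos = list(partitions("abcdef", min_len=1, max_len=4))
random.seed(0)                 # fixed seed only to make the shuffle repeatable
random.shuffle(combo_combos)   # randomise the order of the output list
```

Each item either joins an existing block or opens a new one, so every split is produced exactly once and no duplicates need to be filtered out afterwards.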