I have a dataframe (df_m) with a large number of rows, as shown below. I want to plot the number of occurrences of each month, for the years 2010-2017, from the date_m column of the dataframe (the years in date_m range from 2010 to 2017).
db num date_a date_m date_c zip_b zip_a
0 old HKK10032 2010-07-14 2010-07-26 NaT NaN NaN
1 old HKK10109 2011-07-14 2011-09-15 NaT NaN NaN
2 old HNN10167 2012-07-15 2012-08-09 NaT 177-003 NaN
3 old HKK10190 2013-07-15 2013-09-02 NaT NaN NaN
4 old HKK10251 2014-07-16 2014-05-02 NaT NaN NaN
5 old HKK10253 2015-07-16 2015-05-01 NaT NaN NaN
6 old HNN10275 2017-07-16 2017-07-18 2010-07-18 1070062 NaN
7 old HKK10282 2017-07-16 2017-08-16 NaT NaN NaN
............................................................
First, I extract the number of occurrences of each month (1-12) for every year (2010-2017), but there is an error in my code:
lst_all = []
for i in range(2010, 2018):
    lst_num = [sum(df_m.date_move.dt.month == j & df_m.date_move.dt.year == i) for j in range(1, 13)]
    lst_all.append(lst_num)
print lst_all
You need to add parentheses around the conditions, because & binds more tightly than ==:
lst_all = []
for i in range(2010, 2018):
    lst_num = [((df_m.date_m.dt.month == j) & (df_m.date_m.dt.year == i)).sum() for j in range(1, 13)]
    lst_all.append(lst_num)
Then build the DataFrame:
df1 = pd.DataFrame(lst_all, index=range(2010, 2018), columns=range(1, 13))
print (df1)
1 2 3 4 5 6 7 8 9 10 11 12
2010 0 0 0 0 0 0 1 0 0 0 0 0
2011 0 0 0 0 0 0 0 0 1 0 0 0
2012 0 0 0 0 0 0 0 1 0 0 0 0
2013 0 0 0 0 0 0 0 0 1 0 0 0
2014 0 0 0 0 1 0 0 0 0 0 0 0
2015 0 0 0 0 1 0 0 0 0 0 0 0
2016 0 0 0 0 0 0 0 0 0 0 0 0
2017 0 0 0 0 0 0 1 1 0 0 0 0
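As an aside, the whole table can also be built in one step with pd.crosstab. A minimal sketch, assuming date_m has already been parsed as datetime64 (the tiny df_m below is a hypothetical stand-in for the real data):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real df_m; assumes date_m is datetime64.
df_m = pd.DataFrame({'date_m': pd.to_datetime(['2010-07-26', '2011-09-15',
                                               '2017-07-18', '2017-08-16'])})

# Cross-tabulate year against month, then reindex so every year
# (2010-2017) and month (1-12) appears, filling missing cells with 0.
df1 = (pd.crosstab(df_m.date_m.dt.year, df_m.date_m.dt.month)
         .reindex(index=range(2010, 2018), columns=range(1, 13), fill_value=0))
print(df1)

df1.T.plot(kind='bar')  # months on the x-axis, one bar series per year
plt.show()

This also covers the plotting step the question asks about.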
-2 -1 0
-1 1 1
0 1 2
This is a 3x3 emboss kernel. How should I write it as a 5x5 kernel?
As I understand it, these filters take directional differences (see the Wikipedia page).
We can decompose your filter into three directional components:
0 -1  0      0  0  0     -2  0  0
0  0  0     -1  0  1      0  0  0
0  1  0      0  0  0      0  0  2
So I think you can expand it over these three directions, giving emphasis to each:
0  0 -1  0  0      0  0  0  0  0     -2  0  0  0  0
0  0 -1  0  0      0  0  0  0  0      0 -2  0  0  0
0  0  0  0  0     -1 -1  0  1  1      0  0  0  0  0
0  0  1  0  0      0  0  0  0  0      0  0  0  2  0
0  0  1  0  0      0  0  0  0  0      0  0  0  0  2
So, the final kernel would be
-2 0 -1 0 0
0 -2 -1 0 0
-1 -1 1 1 1
0 0 1 2 0
0 0 1 0 2
Maybe you can also try interpolating the filter coefficients marked as x:
-2 x -1 0 0
x -2 -1 0 0
-1 -1 1 1 1
0 0 1 2 x
0 0 1 x 2
The simple solution to fitting any smaller convolution kernel into a larger matrix is to surround it with zero weights. This is especially true when you're dealing with a concept like embossing, which is arguably more interested in the immediate vector of change than in the rate at which it is changing. That is, for this embossing matrix:
-2 -1  0
-1  1  1
 0  1  2
you could equivalently use this in 5x5:
 0  0  0  0  0
 0 -2 -1  0  0
 0 -1  1  1  0
 0  0  1  2  0
 0  0  0  0  0
Granted, this will give you a different visual effect than a kernel with the outer cells filled in; but sometimes, especially with edge detection, immediate clarity is more important. We aren't always displaying the result. If this were something like a Gaussian blur kernel, a greater range could improve the effect, but embossing isn't that different conceptually from Sobel-Feldman, and it may be better to keep it tight.
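For completeness, here is a minimal sketch of applying the expanded 5x5 kernel from the first answer with scipy.ndimage; the random image is a hypothetical stand-in for real input:

import numpy as np
from scipy import ndimage

# The expanded 5x5 emboss kernel proposed in the first answer.
kernel5 = np.array([[-2,  0, -1,  0,  0],
                    [ 0, -2, -1,  0,  0],
                    [-1, -1,  1,  1,  1],
                    [ 0,  0,  1,  2,  0],
                    [ 0,  0,  1,  0,  2]], dtype=float)

image = np.random.rand(64, 64)  # hypothetical grayscale image

embossed = ndimage.convolve(image, kernel5, mode='nearest')

Swapping in the zero-padded kernel from the second answer reproduces the 3x3 effect exactly, since the zero weights contribute nothing to the sum.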
I now have 250 million lines of text from a database.
I want to highlight certain values that appear only in the third column.
I use \b1011(3[1-9]\d[1-9]|[4]\d\d\d|5[0-8][0-3][0-6])\b to highlight all values between 10113101 and 10115836.
Can one exclude the numbers in the other columns, such as column 4?
Edit: a column, for me, means the text between the spaces:
1 2 3 4 5 ..... columns
307607 1317011864 10113101 -25 13135611 2700 0 0 0 12 0 0 0 walk029h.rwx
2264 910115836 10114632 -15 20111192 900 0 0 0 11 0 0 0 walk029.rwx
326169 1010523891 10115836 -1 20911192 0 0 0 0 11 0 0 0 walk12h.rwx
38718 826265392 10113628 0 10114603 2700 0 0 0 11 0 0 0 street2.rwx
241512 1317011864 636346 0 10113987 900 0 0 0 12 0 0 0 walk029h.rwx
38718 826266129 10113448 0 10114310 900 0 0 0 10 0 0 0 tree5m.rwx
38718 826266243 10113898 0 10114810 900 0 0 0 10 0 0 0 tree9m.rwx
This pattern will capture the numbers you want in the third column only. Refer to capture group 1 for their values.
^(?:\S+\s){2}\b(1011(?:3[1-9]\d{2}|4\d{3}|5[0-8][0-3][0-6]))\b.*
All I did was modify yours to add the column prefix and remove some redundancy.
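To sanity-check the pattern outside your editor, a small sketch with Python's re module, using one sample line from the question:

import re

# The answer's pattern: skip two whitespace-delimited columns, then
# capture the in-range number in column 3 as group 1.
pattern = re.compile(r'^(?:\S+\s){2}\b(1011(?:3[1-9]\d{2}|4\d{3}|5[0-8][0-3][0-6]))\b.*')

line = '307607 1317011864 10113101 -25 13135611 2700 0 0 0 12 0 0 0 walk029h.rwx'
m = pattern.match(line)
if m:
    print(m.group(1))  # prints 10113101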
I am trying to get counts of values between two timestamps:
for example:
time letter
1 A
4 B
5 C
9 C
18 B
30 A
30 B
I am dividing the time range into time windows of size (1 + 30) / 30.
Then I want to know how many A, B, and C values fall into each time window of size 1:
timeseries A B C
1 1 0 0
2 0 0 0
...
30 1 1 0
This should give me a table of 30 rows and 3 columns (A, B, C) of occurrence counts.
The problem is that this takes too long, because the code iterates through the whole master table for every window to slice the data, even though the data is already sorted:
master = mytable
minimum = master.timestamp.min()
maximum = master.timestamp.max()
window = (minimum + maximum) / maximum
wstart = minimum
wend = minimum + window
concurrent_tasks = []
while (wstart <= maximum):
    As = 0
    Bs = 0
    Cs = 0
    for d, row in master.iterrows():
        ttime = row.timestamp
        if ((ttime >= wstart) & (ttime < wend)):
            #print (row.channel)
            if (row.channel == 'A'):
                As = As + 1
            elif (row.channel == 'B'):
                Bs = Bs + 1
            elif (row.channel == 'C'):
                Cs = Cs + 1
    concurrent_tasks.append([m_id, As, Bs, Cs])
    wstart = wstart + window
    wend = wend + window
Could you help me make this perform better? I want to use a map function and prevent Python from looping through the whole table every time.
This is part of a big-data job, and it is taking days to finish.
Thank you.
There is a faster approach - pd.get_dummies():
In [116]: pd.get_dummies(df.set_index('time')['letter'])
Out[116]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 0 0
30 0 1 0
If you want to "compress" (group) it by time:
In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
Out[146]:
A B C
time
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
or using sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)
r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),
                       index=df['time'].unique(),
                       columns=df['letter'].unique(),
                       default_fill_value=0)
Result:
In [143]: r
Out[143]:
A B C
1 1 0 0
4 0 1 0
5 0 0 1
9 0 0 1
18 0 1 0
30 1 1 0
If we want to list all times from 1 to 30:
In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
Out[153]:
A B C
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 1 0
19 0 0 0
20 0 0 0
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
or using a plain pandas approach:
In [159]: pd.get_dummies(df.set_index('time')['letter']) \
...: .groupby(level=0) \
...: .sum() \
...: .reindex(np.arange(r.index.min(), r.index.max()+1), fill_value=0)
...:
Out[159]:
A B C
time
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 1
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 1
10 0 0 0
... .. .. ..
21 0 0 0
22 0 0 0
23 0 0 0
24 0 0 0
25 0 0 0
26 0 0 0
27 0 0 0
28 0 0 0
29 0 0 0
30 1 1 0
[30 rows x 3 columns]
UPDATE:
Timing:
In [163]: df = pd.concat([df] * 10**4, ignore_index=True)
In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop
In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop
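If the raw timestamps are floats rather than the integer window indices shown above, you can compute a window index first and group on it. A minimal sketch, assuming a window size of 1 as in the question:

import pandas as pd

df = pd.DataFrame({'time':   [1, 4, 5, 9, 18, 30, 30],
                   'letter': ['A', 'B', 'C', 'C', 'B', 'A', 'B']})

window = 1  # window size from the question
bucket = (df['time'] // window).astype(int)  # map each timestamp to its window

counts = (pd.get_dummies(df['letter'])
            .groupby(bucket)
            .sum()
            .reindex(range(1, 31), fill_value=0))  # list all windows 1..30
print(counts)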
I am writing a program to simulate a random Knight's tour. (See Wikipedia for what that means: http://en.wikipedia.org/wiki/Knight%27s_tour) First, I create a chess object, which is basically just an 8*8 array with numbers to indicate the position of the knight, and randomly assign a position for the knight. Then I move the knight randomly until there are no more legal moves, and return the number of moves performed.
int runTour()
{
    srand(time(NULL));
    Chess knight(rand() % 8, rand() % 8); // Initialize random chess object.
    knight.printBoard();                  // Print the board before moving.
    int moveNumber = 0; // A number from 0 to 7 that dictates how the knight moves
    int counter = 0;
    while (moveNumber != -1) // A moveNumber of -1 means there are no more legal moves
    {
        // findRandMove returns a legal random move for the knight based on its
        // position. It works perfectly.
        moveNumber = knight.findRandMove(knight.getRow(), knight.getColumn());
        knight.move(moveNumber); // move is a function that moves the knight
        counter++;
    }
    knight.printBoard(); // Print the board when moves are exhausted.
    return counter;      // Return the number of moves performed.
}
The interesting thing is that while the output is random from run to run, both calls output the same thing within a single run. For example, this is the main() function:
int main()
{
    runTour();
    runTour();
    return 0;
}
And BOTH calls to runTour() output the following (where 0 represents positions not reached, 1 the current position of the knight, and 9 positions already visited):
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 9 0 0 0
0 9 9 0 0 0 9 0
0 0 0 0 0 9 9 0
9 0 9 9 9 9 0 1
0 0 9 9 9 9 9 9
0 9 9 9 9 0 9 0
9 0 0 0 9 9 9 9
0 0 9 0 9 9 0 9
And when I run the program again, BOTH calls to runTour() output:
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 9 9
0 9 0 0 9 9 9 0
0 0 9 9 9 9 9 9
1 0 9 0 9 9 0 9
So the random function differs across runs but is the same within each run. Why is this the case? How can I modify the code so that each call to runTour() behaves differently? Thank you very much for reading this clumsy question.
As you're using a timestamp as the srand seed:
If both runTour calls happen in the same second, what do you think will happen with your code?
...
srand is supposed to be called exactly once, not once per call to runTour.
Try moving your srand call to your main function. You should only have to seed the generator once, rather than each time you call the function.
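A minimal sketch of the corrected structure (runTour itself would drop its srand call):

#include <cstdlib>
#include <ctime>

int runTour();  // as defined in the question, minus the srand call

int main()
{
    std::srand(static_cast<unsigned>(std::time(nullptr)));  // seed exactly once
    runTour();  // now produces a different tour...
    runTour();  // ...than this one, even within the same second
    return 0;
}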
FAR (false acceptance rate) and FRR (false rejection rate) are used to express the results of biometric devices. Below is the confusion matrix produced in Weka from biometric data. I couldn't find any resources explaining the procedure for calculating FAR and FRR from an n*n confusion matrix. Any help explaining the procedure would be much appreciated. Thanks in advance!
Weka also gives these values: TP Rate, FP Rate, Precision, Recall, F-Measure, and ROC Area. Please suggest whether the required values can be calculated from these.
=== Confusion Matrix ===
a b c d e f g h i j k l m n o <-- classified as
1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 | a = user1
0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 | b = user2
0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 | c = user3
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 | d = user4
0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 | e = user5
0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 | f = user6
0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 | g = user7
0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 | h = user9
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | i = user10
0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 | j = user11
0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 | k = user14
0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 | l = user15
0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 | m = user16
0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 | n = user17
0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 | o = user19
The accepted answer there, by user "chl", includes a reference to the biometrics literature: https://stats.stackexchange.com/questions/3489/calculating-false-acceptance-rate-for-a-gaussian-distribution-of-scores
He says:
[the ROC curve] is a plot of TAR (= 1 - FRR, where FRR is the false rejection rate) against the false acceptance rate (FAR).
However, the ROC curve is commonly a plot of the TP Rate as a function of the FP Rate.
So it seems you can use TP Rate and FP Rate: FAR corresponds to the FP Rate, and FRR to 1 - TP Rate.
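Under that reading, here is a minimal sketch of the per-user (one-vs-rest) computation from an n*n confusion matrix; the 3x3 matrix is a hypothetical stand-in for the 15x15 one above:

import numpy as np

# Hypothetical confusion matrix: rows = true user, columns = predicted user.
cm = np.array([[2, 0, 0],
               [1, 1, 0],
               [0, 1, 1]], dtype=float)

tp = np.diag(cm)              # genuine attempts accepted
fn = cm.sum(axis=1) - tp      # genuine attempts rejected (false rejections)
fp = cm.sum(axis=0) - tp      # impostor attempts accepted (false acceptances)
tn = cm.sum() - tp - fn - fp  # impostor attempts correctly rejected

frr = fn / (tp + fn)  # FRR = 1 - TP Rate, per user
far = fp / (fp + tn)  # FAR = FP Rate, per user

print(frr.mean(), far.mean())  # macro-averaged FRR and FAR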