Can I use regular expressions to search for multiples of a number? - regex

I'm trying to search a big project for all the places where I've declared an array with [48] as the size, or any multiple of 48.
Can I use a regular expression function to find matches of 48 * n?
Thanks.

Here you go (In PHP's PCRE syntax):
^(0*|(1(01*?0)*?1|0)+?0{4})$
Usage:
preg_match('/^(0*|(1(01*?0)*?1|0)+?0{4})$/', decbin($number));
Now, why it works:
Well, we know that 48 is really just 3 * 16, and 16 is just 2*2*2*2. So any number divisible by 2^4 has the 4 least significant bits of its binary representation set to 0, which means ending the regexp with 0{4}$ is equivalent to saying that the number is divisible by 2^4 (i.e. 16). The bits to the left of those four zeros then need to represent a number divisible by 3, and the regexp taken from this answer tells us whether they do. So if the whole regexp matches, the number is divisible by both 3 and 16, and hence by 48...
QED...
(Note: the leading 0* alternative handles the case where $number is 0.) I've tested this on all numbers from 0 to 48^5, and it gives the correct result each time...
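If you want to sanity-check it yourself, here is a small brute-force test of the same pattern using Python's re module, with bin(n)[2:] standing in for PHP's decbin(); the 10000 limit is just an arbitrary choice for this sketch:
import re
pattern = re.compile(r'^(0*|(1(01*?0)*?1|0)+?0{4})$')
for n in range(10000):
    # the pattern should match exactly the binary strings of multiples of 48
    assert bool(pattern.match(bin(n)[2:])) == (n % 48 == 0), n
print("pattern agrees with n % 48 == 0 for 0..9999")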

A generalization of your question is asking whether x is a string representing a multiple of n in base b. This is the same thing as asking whether the remainder of x divided by n is 0. You can easily create a DFA to compute this.
Create a DFA with n states, numbered from 0 to n - 1. State 0 is both the initial state and the sole accepting state. Each state will have b outgoing transitions, one for each symbol in the alphabet (since base-b gives you b digits to work with).
Each state represents the remainder of the portion of x we've seen so far, divided by n. This is why we have n of them (dividing a number by n yields a remainder in the range 0 to n - 1), and also why state 0 is the accepting state.
Since the digits of x are processed from left to right, if we have a number y from the first few digits of x and read the digit d, we get the new value of y from yb + d. But more importantly, the remainder r changes to (rb + d) mod n. So we now know how to connect the transition arcs and complete the DFA.
You can do this for any n and b. Here, for example, is one that accepts multiples of 18 in base-10 (states on the rows, inputs on the columns):
    |  0  1  2  3  4  5  6  7  8  9
----+-------------------------------
 →0 |  0  1  2  3  4  5  6  7  8  9   ←accept
  1 | 10 11 12 13 14 15 16 17  0  1
  2 |  2  3  4  5  6  7  8  9 10 11
  3 | 12 13 14 15 16 17  0  1  2  3
  4 |  4  5  6  7  8  9 10 11 12 13
  5 | 14 15 16 17  0  1  2  3  4  5
  6 |  6  7  8  9 10 11 12 13 14 15
  7 | 16 17  0  1  2  3  4  5  6  7
  8 |  8  9 10 11 12 13 14 15 16 17
  9 |  0  1  2  3  4  5  6  7  8  9
 10 | 10 11 12 13 14 15 16 17  0  1
 11 |  2  3  4  5  6  7  8  9 10 11
 12 | 12 13 14 15 16 17  0  1  2  3
 13 |  4  5  6  7  8  9 10 11 12 13
 14 | 14 15 16 17  0  1  2  3  4  5
 15 |  6  7  8  9 10 11 12 13 14 15
 16 | 16 17  0  1  2  3  4  5  6  7
 17 |  8  9 10 11 12 13 14 15 16 17
These get really tedious as n and b get larger, but you can obviously write a program to generate them for you no problem.
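For example, here is a small Python sketch of such a generator (the helper names are mine, not from any library): state r moves to (r*b + d) mod n on digit d, and state 0 is both the start state and the accepting state.
def multiple_dfa(n, b=10):
    # transition table: table[state][digit] -> next state   (hypothetical helper)
    return [[(state * b + d) % n for d in range(b)] for state in range(n)]

def accepts(table, digits, b=10):
    state = 0
    for ch in digits:
        state = table[state][int(ch, b)]
    return state == 0   # back in state 0 => remainder 0 => a multiple of n

table = multiple_dfa(18)        # reproduces the base-10 multiples-of-18 table above
print(accepts(table, "54"))     # True  (54 = 3 * 18)
print(accepts(table, "55"))     # False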

1|48|2304|110592|5308416
You are unlikely to have declared an array of size 48^5 or larger.

No, regular expressions can't calculate multiples (except in the unary number system: decimal 4 = unary 1111; decimal 8 = unary 11111111, so the regex ^(1111)+$ matches multiples of 4).
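A quick Python illustration of that unary trick (converting n to its unary form first):
import re
def is_mult_of_4(n):
    # decimal n -> unary '1' * n; the pattern matches only positive multiples of 4
    return bool(re.fullmatch(r'(1111)+', '1' * n))
print([n for n in range(1, 17) if is_mult_of_4(n)])   # [4, 8, 12, 16]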

import re
from random import randint, sample, shuffle
# Build a test string containing integers that are multiples of 48
# and integers that are not.
w = [ 48*randint( 1,10) for j in xrange(10) ]
w.extend( 48*randint(11,20) for j in xrange(10) )
w.extend( 48*randint(21,70) for j in xrange(10) )
a = [ el if el%48!=0 else el+1 for el in sample(xrange(1000),40) ]
w.extend(a)
shuffle(w)
texte = [ ''.join(sample(' abcdefghijklmnopqrstuvwxyz',randint(1,7))) for i in xrange(40) ]
X = ''.join(texte[i]+str(w[i]) for i in xrange(40))
# Search for the multiples of 48 in the string X.
def mult48(match):
    g1 = match.group()
    if int(g1)%48==0:
        return ( g1, X[0:match.end()] )
    else:
        return ( g1, 'not multiple')
for match in re.finditer(r'\d+', X):
    print '%s %s\n' % mult48(match)

Any multiple is difficult, but here's a (python-style) regexp that matches the first 200 multiples of 48.
0$|1(?:0(?:08$|56$)|1(?:04$|52$)|2(?:00$|48$|96$)|3(?:44$|92$)|4(?:4(?:$|0$)|88$\
)|5(?:36$|84$)|6(?:32$|80$)|7(?:28$|76$)|8(?:24$|72$)|9(?:2(?:$|0$)|68$))|2(?:0(\
?:16$|64$)|1(?:12$|60$)|2(?:08$|56$)|3(?:04$|52$)|4(?:0(?:$|0$)|48$|96$)|5(?:44$\
|92$)|6(?:40$|88$)|7(?:36$|84$)|8(?:32$|8(?:$|0$))|9(?:28$|76$))|3(?:0(?:24$|72$\
)|1(?:20$|68$)|2(?:16$|64$)|3(?:12$|6(?:$|0$))|4(?:08$|56$)|5(?:04$|52$)|6(?:00$\
|48$|96$)|7(?:44$|92$)|8(?:4(?:$|0$)|88$)|9(?:36$|84$))|4(?:0(?:32$|80$)|1(?:28$\
|76$)|2(?:24$|72$)|3(?:2(?:$|0$)|68$)|4(?:16$|64$)|5(?:12$|60$)|6(?:08$|56$)|7(?\
:04$|52$)|8(?:$|0(?:$|0$)|48$|96$)|9(?:44$|92$))|5(?:0(?:40$|88$)|1(?:36$|84$)|2\
(?:32$|8(?:$|0$))|3(?:28$|76$)|4(?:24$|72$)|5(?:20$|68$)|6(?:16$|64$)|7(?:12$|6(\
?:$|0$))|8(?:08$|56$)|9(?:04$|52$))|6(?:0(?:00$|48$|96$)|1(?:44$|92$)|2(?:4(?:$|\
0$)|88$)|3(?:36$|84$)|4(?:32$|80$)|5(?:28$|76$)|6(?:24$|72$)|7(?:2(?:$|0$)|68$)|\
8(?:16$|64$)|9(?:12$|60$))|7(?:0(?:08$|56$)|1(?:04$|52$)|2(?:0(?:$|0$)|48$|96$)|\
3(?:44$|92$)|4(?:40$|88$)|5(?:36$|84$)|6(?:32$|8(?:$|0$))|7(?:28$|76$)|8(?:24$|7\
2$)|9(?:20$|68$))|8(?:0(?:16$|64$)|1(?:12$|6(?:$|0$))|2(?:08$|56$)|3(?:04$|52$)|\
4(?:00$|48$|96$)|5(?:44$|92$)|6(?:4(?:$|0$)|88$)|7(?:36$|84$)|8(?:32$|80$)|9(?:2\
8$|76$))|9(?:0(?:24$|72$)|1(?:2(?:$|0$)|68$)|2(?:16$|64$)|3(?:12$|60$)|4(?:08$|5\
6$)|5(?:04$|52$)|6(?:$|0$))
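That pattern is clearly machine-generated; if you don't need the trie-style compression, an equivalent plain alternation is easy to build yourself. A rough Python sketch (the buf declarations below are made-up strings, just for illustration):
import re
# alternation of the first 200 multiples of 48, wrapped in \b word boundaries
# so that e.g. 480 does not match inside 4801
pattern = re.compile(r'\b(?:' + '|'.join(str(48 * k) for k in range(1, 201)) + r')\b')
print(bool(pattern.search('int buf[2304];')))   # True  (2304 = 48 * 48)
print(bool(pattern.search('int buf[2305];')))   # False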

Related

Why does the reduce operator not work the way I expect it to?

I am trying to solve Euler 18 in Dyalog APL, and I am not able to understand why my solution does not work.
The problem is as follow:
By starting at the top of the triangle below and moving to adjacent
numbers on the row below, the maximum total from top to bottom is 23.
3
7 4
2 4 6
8 5 9 3
That is, 3 + 7 + 4 + 9 = 23.
Taking the example that I represent this way:
d ← (3 0 0 0) (7 4 0 0) (2 4 6 0) (8 5 9 3)
I am trying to solve it this way:
{⍵+((2⌈/⍺)),0}/⌽d
This gives me the array 22 19 15 0, where the largest number is 22, but the right answer to the problem is 23.
I am getting this behavior (left to right for ease of reading):
(2⌈/(8 5 9 3),0)+(2⌈/(2 4 6 0),0)+(2⌈/(7 4 0 0),0)+(2⌈/(3 0 0 0),0)
Which gives me the same result as the function.
What I would expect is this behavior (where each statement is substituted directly in the next line):
(2⌈/(8 5 9 3)),0
(2 4 6 0)+8 9 9 0
(2⌈/(10 13 15 0)),0
(7 4 0 0)+13 15 15 0
(2⌈/(20 19 15 0)),0
(3 0 0 0) + 20 19 15 0
23 19 15 0
I am wondering what I am misunderstanding about the APL reduction process that leads to a different result from the one I am expecting.
Thank you!
/ works in the reverse way to what you expected - it evaluates through the array right-to-left.
F/a b c d is ⊂a F b F c F d, or, with parentheses, ⊂(a F (b F (c F d))).
After removing the ⌽ and swapping ⍺ and ⍵, you get {⍺+(2⌈/⍵),0}/d, which gives the result you want.
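For comparison, here is the same bottom-up maximum-path computation written out in Python rather than APL (a rough sketch of the same idea, not a literal translation):
rows = [[3], [7, 4], [2, 4, 6], [8, 5, 9, 3]]
acc = rows[-1]
for row in reversed(rows[:-1]):
    # pairwise max of adjacent entries in the accumulator, added to the row above
    acc = [v + max(acc[i], acc[i + 1]) for i, v in enumerate(row)]
print(acc[0])   # 23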

Merge pandas DataFrames based on irregular time intervals

I'm wondering how I can speed up a merge of two dataframes. One of the dataframes has time-stamped data points (the value column).
import pandas as pd
import numpy as np
data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
                     'value':np.random.uniform(-1,1,size=50)})
The other has time interval information (start_time, end_time, and associated interval_id).
intervals = pd.DataFrame({'interval_id':np.arange(9),
                          'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),
                          'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})
I'd like to merge these two dataframes more efficiently than the for loop below:
data['interval_id'] = np.nan
for index, ser in intervals.iterrows():
    in_interval = (data['time'] >= ser['start_time']) & \
                  (data['time'] <= ser['end_time'])
    data['interval_id'][in_interval] = ser['interval_id']
result = data.merge(intervals, how='outer').sort('time').reset_index(drop=True)
I keep imagining I'll be able to use pandas time series functionality, like a date range or TimeGrouper, but I have yet to figure out anything more pythonic (pandas-y?) than the above.
Example result:
time value interval_id start_time end_time
0 0.575976 0.022727 NaN NaN NaN
1 4.607545 0.222568 0 3.618715 8.294847
2 5.179350 0.438052 0 3.618715 8.294847
3 11.069956 0.641269 1 10.301728 19.870283
4 12.387854 0.344192 1 10.301728 19.870283
5 18.889691 0.582946 1 10.301728 19.870283
6 20.850469 -0.027436 NaN NaN NaN
7 23.199618 0.731316 2 21.488868 28.968338
8 26.631284 0.570647 2 21.488868 28.968338
9 26.996397 0.597035 2 21.488868 28.968338
10 28.601867 -0.131712 2 21.488868 28.968338
11 28.660986 0.710856 2 21.488868 28.968338
12 28.875395 -0.355208 2 21.488868 28.968338
13 28.959320 -0.430759 2 21.488868 28.968338
14 29.702800 -0.554742 NaN NaN NaN
Any suggestions from time series-savvy people out there would be greatly appreciated.
Update, after Jeff's answer:
The main problem is that interval_id has no relation to any regular time interval (e.g., intervals are not always approximately 10 seconds). One interval could be 10 seconds, the next could be 2 seconds, and the next could be 100 seconds, so I can't use any regular rounding scheme as Jeff proposed. Unfortunately, my minimal example above does not make that clear.
You could use np.searchsorted to find the index at which each value in data['time'] would be inserted into intervals['start_time'], and then call np.searchsorted again to find where each value would be inserted into intervals['end_time']. Note that this relies on intervals['start_time'] and intervals['end_time'] being in sorted order.
For each corresponding location in the arrays, where these two indices are equal, data['time'] fits in between interval['start_time'] and interval['end_time']. Note that this relies on the intervals being disjoint.
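Here is a tiny standalone illustration of the index trick with made-up numbers: for sorted, disjoint intervals [1, 4] and [6, 9], a time of 2 falls inside the first interval while a time of 5 falls in the gap between them:
import numpy as np
starts = np.array([1.0, 6.0])
ends = np.array([4.0, 9.0])
times = np.array([2.0, 5.0])
start_idx = np.searchsorted(starts, times) - 1   # array([0, 0])
end_idx = np.searchsorted(ends, times)           # array([0, 1])
print(start_idx == end_idx)                      # [ True False]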
Using searchsorted in this way is about 5 times faster than using the for-loop:
import pandas as pd
import numpy as np

np.random.seed(1)
data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
                     'value':np.random.uniform(-1,1,size=50)})
intervals = pd.DataFrame(
    {'interval_id':np.arange(9),
     'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),
     'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})

def using_loop():
    data['interval_id'] = np.nan
    for index, ser in intervals.iterrows():
        in_interval = (data['time'] >= ser['start_time']) & \
                      (data['time'] <= ser['end_time'])
        data['interval_id'][in_interval] = ser['interval_id']
    result = data.merge(intervals, how='outer').sort('time').reset_index(drop=True)
    return result

def using_searchsorted():
    start_idx = np.searchsorted(intervals['start_time'].values, data['time'].values) - 1
    end_idx = np.searchsorted(intervals['end_time'].values, data['time'].values)
    mask = (start_idx == end_idx)
    result = data.copy()
    result['interval_id'] = result['start_time'] = result['end_time'] = np.nan
    result['interval_id'][mask] = start_idx[mask]
    result.ix[mask, 'start_time'] = intervals['start_time'][start_idx[mask]].values
    result.ix[mask, 'end_time'] = intervals['end_time'][end_idx[mask]].values
    return result
In [254]: %timeit using_loop()
100 loops, best of 3: 7.74 ms per loop
In [255]: %timeit using_searchsorted()
1000 loops, best of 3: 1.56 ms per loop
In [256]: 7.74/1.56
Out[256]: 4.961538461538462
You may want to specify the 'time' intervals slightly differently, but this should give you a start.
In [34]: data['on'] = np.round(data['time']/10)
In [35]: data.merge(intervals,left_on=['on'],right_on=['interval_id'],how='outer')
Out[35]:
time value on end_time interval_id start_time
0 1.301658 -0.462594 0 7.630243 0 0.220746
1 2.202654 0.054903 0 7.630243 0 0.220746
2 10.253593 0.329947 1 17.715596 1 10.299464
3 13.803064 -0.601021 1 17.715596 1 10.299464
4 17.086290 0.484119 2 27.175455 2 24.710704
5 21.797655 0.988212 2 27.175455 2 24.710704
6 26.265165 0.491410 3 37.702968 3 30.670753
7 27.777182 -0.121691 3 37.702968 3 30.670753
8 34.066473 0.659260 3 37.702968 3 30.670753
9 34.786337 -0.230026 3 37.702968 3 30.670753
10 35.343021 0.364505 4 49.489028 4 42.948486
11 35.506895 0.953562 4 49.489028 4 42.948486
12 36.129951 -0.703457 4 49.489028 4 42.948486
13 38.794690 -0.510535 4 49.489028 4 42.948486
14 40.508702 -0.763417 4 49.489028 4 42.948486
15 43.974516 -0.149487 4 49.489028 4 42.948486
16 46.219554 0.893025 5 57.086065 5 53.124795
17 50.206860 0.729106 5 57.086065 5 53.124795
18 50.395082 -0.807557 5 57.086065 5 53.124795
19 50.410783 0.996247 5 57.086065 5 53.124795
20 51.602892 0.144483 5 57.086065 5 53.124795
21 52.006921 -0.979778 5 57.086065 5 53.124795
22 52.682896 -0.593500 5 57.086065 5 53.124795
23 52.836037 0.448370 5 57.086065 5 53.124795
24 53.052130 -0.227245 5 57.086065 5 53.124795
25 57.169775 0.659673 6 65.927106 6 61.590948
26 59.336176 -0.893004 6 65.927106 6 61.590948
27 60.297771 0.897418 6 65.927106 6 61.590948
28 61.151664 0.176229 6 65.927106 6 61.590948
29 61.769023 0.894644 6 65.927106 6 61.590948
30 64.221220 0.893012 6 65.927106 6 61.590948
31 67.907417 -0.859734 7 78.192671 7 72.463468
32 71.460483 -0.271364 7 78.192671 7 72.463468
33 74.514028 0.621174 7 78.192671 7 72.463468
34 75.822643 -0.351684 8 88.820139 8 83.183825
35 84.252778 -0.685043 8 88.820139 8 83.183825
36 84.838361 0.354365 8 88.820139 8 83.183825
37 85.770611 -0.089678 9 NaN NaN NaN
38 85.957559 0.649995 9 NaN NaN NaN
39 86.498339 0.569793 9 NaN NaN NaN
40 91.006735 0.731006 9 NaN NaN NaN
41 91.941862 0.964376 9 NaN NaN NaN
42 94.617522 0.626889 9 NaN NaN NaN
43 95.318288 -0.088918 10 NaN NaN NaN
44 95.595243 0.539685 10 NaN NaN NaN
45 95.818267 -0.989647 10 NaN NaN NaN
46 98.240444 0.931445 10 NaN NaN NaN
47 98.722869 0.442502 10 NaN NaN NaN
48 99.349198 0.585264 10 NaN NaN NaN
49 99.829372 -0.743697 10 NaN NaN NaN
[50 rows x 6 columns]

How to group data in kdb+ using customized groups?

I have a table (allsales) with a column for time (sale_time). I want to group the data by sale_time, but I want to be able to bucket it: e.g. any data where the time is between 00:00:00-03:00:00 should be grouped together, 03:00:00-06:00:00 together, and so on. Is there a way to write such a query?
xbar is useful for rounding values down to interval boundaries, e.g.
q)5 xbar 1 3 5 8 10 11 12 14 18
0 0 5 5 10 10 10 10 15
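(For readers who don't know q: xbar simply floors each value to the start of its bucket. A rough Python equivalent of the example above, just for illustration:)
def xbar(width, xs):
    # floor each x to the start of its width-sized bucket
    return [width * (x // width) for x in xs]
print(xbar(5, [1, 3, 5, 8, 10, 11, 12, 14, 18]))
# [0, 0, 5, 5, 10, 10, 10, 10, 15]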
We can then use this to group rows into time groups, for your example:
q)s:([] t:13:00t+00:15t*til 24; v:til 24)
q)s
t v
--------------
13:00:00.000 0
13:15:00.000 1
13:30:00.000 2
13:45:00.000 3
14:00:00.000 4
14:15:00.000 5
..
q)select count i,sum v by xbar[`int$03:00t;t] from s
t | x v
------------| ------
12:00:00.000| 8 28
15:00:00.000| 12 162
18:00:00.000| 4 86
"by xbar[`int$03:00t;t]" rounds the time column t to the nearest three hour value, then this is used as the group by.
There are a few more ways to achieve the same result.
q)select count i , sum v by t:01:00u*3 xbar t.hh from s
q)select count i , sum v by t:180 xbar t.minute from s
t | x v
-----| ------
12:00| 8 28
15:00| 12 162
18:00| 4 86
But in all cases, be careful to include the date column in the group-by if one is present in the table; otherwise the same time window across different dates will be grouped together and give the wrong results.
q)s:([] d:24#2013.05.07 2013.05.08; t:13:00t+00:15t*til 24; v:til 24)
q)select count i , sum v by d, t:180 xbar t.minute from s
d t | x v
----------------| ----
2013.05.07 12:00| 4 12
2013.05.07 15:00| 6 78
2013.05.07 18:00| 2 42
2013.05.08 12:00| 4 16
2013.05.08 15:00| 6 84
2013.05.08 18:00| 2 44

_mm_cmpistri in reverse

Let's say I have these strings:
char ref[30] = "1234567891234567891";
char oth[30] = "1234567891234567891";
I want to use the SSE 4.2 _mm_cmpistri function in C++. Normally the string is parsed from left to right. Is there a way to tell the function to search in reverse (compare from right to left instead of left to right)?
Instead of searching
--------------->
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
to search this way <-----------------
Later edit:
Here's what I want to do:
I have two strings and I need a function with this header:
int sse_cmp(const char *a, int posA, const char *b, int posB);
This function must compare the strings "backwards":
from posA down to 0, or until posB reaches 0.
The function must return the number of common chars counted from the back.
Ex:
<--------- posA
a : 1 2 3 4 5 6 7 8 9
b : a b c d 7 8 9
<---- posB
will return 3 ( 987 )
What's the most efficient way to do it? ( with SSE )
You can use _SIDD_MOST_SIGNIFICANT as part of the mode parameter to _mm_cmpistri.
See Intel SSE4 programming reference

ROUNDUP? what does it do? in C++

Can someone explain to me what this does?
#define ROUNDUP(n,width) (((n) + (width) - 1) & ~unsigned((width) - 1))
Provided width is a power of 2 (so 2, 4, 8, 16, 32, etc.), it returns the smallest multiple of width that is greater than or equal to n.
So width = 16; 5->16, 7->16, 15->16, 16->16, 17->32, 18->32 etc.
EDIT: I started out providing an explanation of why this works the way it does, as I sense that's really what the OP wants, but it turned into a rather convoluted story. If the OP is still confused, I'd suggest working through a few simple examples, say width = 16 and n = 15, 16, 17. Remember that & is bitwise AND, ~ is bitwise complement, and use binary representation exclusively as you work through the examples.
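Here are those suggested examples evaluated in Python for convenience; the bitwise arithmetic is the same as the macro's for these small non-negative values:
def roundup(n, width):
    # mirrors the macro: (n + width - 1) & ~(width - 1)
    return (n + width - 1) & ~(width - 1)
print([roundup(n, 16) for n in (15, 16, 17)])   # [16, 16, 32]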
It rounds n up to the next multiple of 'width' - but I think width needs to be a power of 2.
For example width == 8, n = 5:
(5 + 8 - 1) & ~(7)
= 12 & ~7
= 8
So 5 rounds to 8. Anything 1 - 8 rounds to 8. 9 to 16 rounds to 16. Etc. (0 rounds to 0)
It defines a macro called ROUNDUP which takes two parameters, n and width, and returns the value (n + width - 1) & ~unsigned(width - 1).
:)
Try this if you think you know what it does:
std::string s("WTF");
std::complex<double> c(-11,5);
ROUNDUP(s, c);
It won't work in C because of the unsigned(...) cast syntax. Here is what it does, as long as width is confined to powers of 2:
n width ROUNDUP(n,width)
----------------
0 4 0
1 4 4
2 4 4
3 4 4
4 4 4
5 4 8
6 4 8
7 4 8
8 4 8
9 4 12
10 4 12
11 4 12
12 4 12
13 4 16
14 4 16
15 4 16
16 4 16
17 4 20
18 4 20
19 4 20