Converting dict of dicts into pandas DataFrame - memory issues - python-2.7

I have a data structure that consists of a three-level nested dict that keeps counts of occurrences of a three-part object. I'd like to build a DataFrame out of it with a specific shape, but I can't figure out a way to do it that doesn't consume a lot of working memory, because the table is quite large (several GB at full extent).
The basic functionality looks like this:
class SparseCubeTable:
    def __init__(self):
        self.table = {}
        self.dim1 = []
        self.dim2 = []
        self.dim3 = []
    def increment(self, dim1, dim2, dim3):
        if dim1 in self.table:
            if dim2 in self.table[dim1]:
                if dim3 in self.table[dim1][dim2]:
                    self.table[dim1][dim2][dim3] += 1
                else:
                    self.dim3.append(dim3)
                    self.table[dim1][dim2][dim3] = 1
            else:
                self.dim2.append(dim2)
                self.dim3.append(dim3)
                self.table[dim1][dim2] = {dim3:1}
        else:
            self.dim1.append(dim1)
            self.dim2.append(dim2)
            self.dim3.append(dim3)
            self.table[dim1] = {dim2:{dim3:1}}
This was constructed to make summing over keys easier, among other things. A SparseCubeTable is used like this:
In [23]: example = SparseCubeTable()
In [24]: example.increment("thing1", "thing2", "thing3")
In [25]: example.increment("thing1", "thing2", "thing3")
In [26]: example.increment("thing4", "thing5", "thing6")
In [27]: example.increment("thing1", "thing3", "thing5")
And you can get the data like this:
In [29]: example.table['thing1']['thing2']['thing3']
Out[29]: 2
The sort of DataFrame I want looks like this:
1 2 3 4
thing1 thing2 thing3 2
thing1 thing3 thing5 1
thing4 thing5 thing6 1
The DataFrame is going to be saved as an HDF5 db with columns 1-3 indexed and statistical transformations on column 4 (that require the whole table be temporarily in memory).
The problem is that the pandas.DataFrame.from_dict function builds a whole other sort of structure with the keys used as row labels, as far as I understand it. However, trying to use from_records forces me to copy out the whole data structure into a list, meaning that I now have double the memory size to worry about.
I tried implementing the solution in:
Create a pandas DataFrame from generator?
but in pandas 0.12.0 what it ends up doing is first building a giant list of strings, which is even worse. I assume writing the structure out to a CSV file and reading it back in is also going to be terrible on memory.
Is there a better way of doing this? Or should I just try to squeeze memory even further in the SparseCubeTable somehow? It seems so wasteful to have to build an intermediate list data structure to use from_records.

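Here is code for a memory-efficient solution: write the rows to the HDF5 store in chunks instead of building the whole DataFrame at once.
Create some data that looks like yours. This is a list of 1000 3-tuples:
In [1]: import random
   ...: import pandas as pd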
In [2]: tags = [ 'thing{0}'.format(i) for i in xrange(100) ]
In [3]: data = [ (random.choice(tags),random.choice(tags),random.choice(tags)) for i in range(1000) ]
Our writing function makes sure that the index is globally unique when we write (it's not actually necessary, but since the index is actually written, it's nicer).
In [4]: def write(store, c):
   ...:     df = pd.DataFrame(c, columns=['dim1','dim2','dim3'])
   ...:     try:
   ...:         nrows = store.get_storer('df').nrows
   ...:     except Exception:
   ...:         # 'df' is not in the store yet on the first write
   ...:         nrows = 0
   ...:     df.index += nrows
   ...:     store.append('df', df, data_columns=True)
   ...:     return []
   ...:
In [5]: collector = []
In [6]: store = pd.HDFStore('data.h5',mode='w')
Iterate through your data (or read it from a stream or whatever), and write it in chunks.
In [7]: for i, d in enumerate(data):
   ...:     collector.append(d)
   ...:     if i % 100 == 0 and i:
   ...:         collector = write(store, collector)
   ...:
In [8]: write(store,collector)
Out[8]: []
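The same chunked pattern can be fed from the nested dict itself rather than from a list. The following is only a minimal sketch, assuming the table has the three-level layout shown in the question; iter_rows and chunked are hypothetical helper names, not pandas functions. Each chunk could then be handed to a function like write above, with a fourth column added for the count.

```python
from itertools import islice

def iter_rows(table):
    """Lazily yield (dim1, dim2, dim3, count) tuples from a three-level dict."""
    for d1, level2 in table.items():
        for d2, level3 in level2.items():
            for d3, count in level3.items():
                yield (d1, d2, d3, count)

def chunked(rows, size):
    """Yield lists of at most `size` rows, so only one chunk is held at a time."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk

# Same counts as the SparseCubeTable example in the question.
table = {
    'thing1': {'thing2': {'thing3': 2}, 'thing3': {'thing5': 1}},
    'thing4': {'thing5': {'thing6': 1}},
}

chunks = list(chunked(iter_rows(table), 2))
for chunk in chunks:
    print(chunk)
```

Because iter_rows is a generator, the only extra memory needed on top of the dict is one chunk of rows.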
The store
In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: data.h5
/df frame_table (typ->appendable,nrows->1000,ncols->3,indexers->[index],dc->[dim1,dim2,dim3])
In [10]: store.select('df')
Out[10]:
dim1 dim2 dim3
0 thing28 thing87 thing29
1 thing62 thing70 thing50
2 thing64 thing12 thing98
3 thing33 thing98 thing46
4 thing46 thing5 thing76
5 thing2 thing9 thing21
6 thing1 thing63 thing68
7 thing42 thing30 thing45
8 thing56 thing71 thing77
9 thing99 thing10 thing91
10 thing40 thing9 thing10
11 thing70 thing54 thing59
12 thing94 thing65 thing3
13 thing93 thing24 thing25
14 thing95 thing94 thing86
15 thing41 thing55 thing3
16 thing88 thing10 thing47
17 thing89 thing58 thing33
18 thing16 thing66 thing55
19 thing68 thing20 thing99
20 thing34 thing71 thing28
21 thing67 thing87 thing97
22 thing77 thing74 thing6
23 thing63 thing41 thing30
24 thing14 thing62 thing66
25 thing20 thing36 thing67
26 thing33 thing19 thing58
27 thing0 thing71 thing24
28 thing1 thing48 thing42
29 thing18 thing12 thing4
30 thing85 thing97 thing20
31 thing73 thing71 thing70
32 thing91 thing43 thing48
33 thing45 thing6 thing87
34 thing0 thing28 thing8
35 thing56 thing38 thing61
36 thing39 thing92 thing35
37 thing69 thing26 thing22
38 thing16 thing16 thing79
39 thing4 thing16 thing12
40 thing81 thing79 thing1
41 thing77 thing90 thing83
42 thing53 thing17 thing89
43 thing53 thing15 thing37
44 thing25 thing7 thing20
45 thing44 thing14 thing25
46 thing62 thing84 thing23
47 thing83 thing50 thing60
48 thing68 thing64 thing24
49 thing73 thing53 thing43
50 thing86 thing67 thing31
51 thing75 thing63 thing82
52 thing8 thing10 thing90
53 thing34 thing23 thing12
54 thing66 thing97 thing26
55 thing66 thing53 thing27
56 thing79 thing22 thing37
57 thing43 thing82 thing66
58 thing87 thing53 thing92
59 thing33 thing71 thing97
... ... ...
[1000 rows x 3 columns]
In [11]: store.close()
Then you can do interesting things. If you are not reading the entire set in, you may want to chunk this (which is a bit more involved if you are counting things).
In [56]: pd.read_hdf('data.h5','df').apply(lambda x: x.value_counts())
Out[56]:
dim1 dim2 dim3
thing0 12 6 8
thing1 14 7 8
thing10 10 10 7
thing11 8 10 14
thing12 11 14 11
thing13 11 12 7
thing14 8 14 3
thing15 12 11 11
thing16 7 10 11
thing17 16 9 13
thing18 13 8 10
thing19 11 7 8
thing2 9 5 17
thing20 6 7 11
thing21 7 8 8
thing22 4 17 14
thing23 14 11 7
thing24 10 5 14
thing25 11 11 12
thing26 13 10 15
thing27 12 15 16
thing28 11 10 8
thing29 7 7 8
thing3 11 14 14
thing30 11 16 8
thing31 7 6 12
thing32 8 12 9
thing33 13 12 12
thing34 12 8 5
thing35 6 10 8
thing36 6 9 13
thing37 8 10 12
thing38 7 10 4
thing39 14 11 7
thing4 9 7 10
thing40 12 8 9
thing41 8 16 11
thing42 9 11 13
thing43 8 6 13
thing44 9 13 11
thing45 7 13 7
thing46 12 8 13
thing47 9 10 9
thing48 8 9 9
thing49 4 8 7
thing5 13 7 7
thing50 14 12 9
thing51 5 7 11
thing52 9 11 12
thing53 9 15 15
thing54 7 9 13
thing55 6 10 10
thing56 12 11 11
thing57 12 9 11
thing58 12 12 10
thing59 6 13 10
thing6 8 5 7
thing60 12 9 6
thing61 5 9 9
thing62 8 10 8
... ... ...
[100 rows x 3 columns]
You can then do a 'groupby' like this:
In [69]: store = pd.HDFStore('data.h5')
In [61]: dim1 = pd.Index(store.select_column('df','dim1').unique())
In [66]: store.close()
In [67]: groups = dim1[0:10]
In [68]: groups
Out[68]: Index([u'thing28', u'thing62', u'thing64', u'thing33', u'thing46', u'thing2', u'thing1', u'thing42', u'thing56', u'thing99'], dtype='object')
In [70]: pd.read_hdf('data.h5','df',where='dim1=groups').apply(lambda x: x.value_counts())
Out[70]:
dim1 dim2 dim3
thing1 14 2 1
thing10 NaN 1 1
thing11 NaN 1 2
thing12 NaN 5 NaN
thing13 NaN 1 NaN
thing14 NaN 1 1
thing15 NaN 1 1
thing16 NaN 1 3
thing17 NaN NaN 2
thing18 NaN 1 1
thing19 NaN 1 2
thing2 9 1 1
thing20 NaN 2 NaN
thing21 NaN NaN 1
thing22 NaN 2 2
thing23 NaN 2 3
thing24 NaN 2 1
thing25 NaN 3 2
thing26 NaN 2 2
thing27 NaN 3 1
thing28 11 NaN NaN
thing29 NaN 1 2
thing30 NaN 2 NaN
thing31 NaN 1 1
thing32 NaN 1 1
thing33 13 1 2
thing34 NaN 1 NaN
thing35 NaN 1 NaN
thing36 NaN 1 1
thing37 NaN 1 2
thing38 NaN 3 NaN
thing39 NaN 3 1
thing4 NaN 2 NaN
thing41 NaN NaN 1
thing42 9 1 1
thing43 NaN NaN 1
thing44 NaN 1 2
thing45 NaN NaN 2
thing46 12 NaN 1
thing47 NaN 1 1
thing48 NaN 1 NaN
thing49 NaN 1 NaN
thing5 NaN 2 2
thing50 NaN NaN 3
thing51 NaN 2 2
thing52 NaN 1 3
thing53 NaN 2 4
thing55 NaN NaN 2
thing56 12 1 1
thing57 NaN NaN 3
thing58 NaN 1 2
thing6 NaN NaN 1
thing60 NaN 1 1
thing61 NaN 1 4
thing62 8 2 1
thing63 NaN 1 1
thing64 15 NaN 1
thing66 NaN 1 2
thing67 NaN 2 NaN
thing68 NaN 1 1
... ... ...
[90 rows x 3 columns]

Related

Pandas grouped differences with variable lags

I have a pandas data frame with three variables. The first is a grouping variable, the second a within group "scenario" and the third an outcome. I would like to calculate the within group difference between the null condition, scenario zero, and the other scenarios within the group. The number of scenarios varies between the different groups. My data looks like:
ipdb> aDf
FieldId Scenario TN_load
0 0 0 134.922952
1 0 1 111.787326
2 0 2 104.805951
3 1 0 17.743467
4 1 1 13.411849
5 1 2 13.944552
6 1 3 17.499152
7 1 4 17.640090
8 1 5 14.220673
9 1 6 14.912306
10 1 7 17.233862
11 1 8 13.313953
12 1 9 17.967438
13 1 10 14.051882
14 1 11 16.307317
15 1 12 12.506358
16 1 13 16.266233
17 1 14 12.913150
18 1 15 18.149811
19 1 16 12.337736
20 1 17 12.008868
21 1 18 13.434605
22 2 0 454.857959
23 2 1 414.372215
24 2 2 478.371387
25 2 3 385.973388
26 2 4 487.293966
27 2 5 481.280175
28 2 6 403.285123
29 3 0 30.718375
... ... ...
29173 4997 3 53.193992
29174 4997 4 45.800968
I will also have to write functions to get percentage differences etc. but this has me stumped. Any help greatly appreciated.
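You can get the difference from scenario 0 within each group using groupby and transform (df here is your aDf). This relies on the scenario 0 row being the first row of each FieldId group, as it is in your data: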
df['TN_load_0'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: x - x.iloc[0])
df
FieldId Scenario TN_load TN_load_0
0 0 0 134.922952 0.000000
1 0 1 111.787326 -23.135626
2 0 2 104.805951 -30.117001
3 1 0 17.743467 0.000000
4 1 1 13.411849 -4.331618
5 1 2 13.944552 -3.798915
6 1 3 17.499152 -0.244315

how to print a list vertically python

list1=[1,2,3,4]
list2=[5,6,7,8]
list3=[9,10,11,12]
list4=[13,14,15,16]
list5=[17,18,19,20]
lists=[list1,list2,list3,list4,list5]
I want to print these lists so that the output looks like this (sorry, I didn't know the post ignored new lines):
4 8 12 16 20
3 7 11 15 19
2 6 10 14 18
1 5 9 13 17
Thanks in advance (new to Python).
One way you could achieve this is to zip up the reversed lists and print one row per iteration.
list1=[1,2,3,4]
list2=[5,6,7,8]
list3=[9,10,11,12]
list4=[13,14,15,16]
list5=[17,18,19,20]
for l1, l2, l3, l4, l5 in zip(reversed(list1), reversed(list2), reversed(list3), reversed(list4), reversed(list5)):
    print(l1, l2, l3, l4, l5)
output
4 8 12 16 20
3 7 11 15 19
2 6 10 14 18
1 5 9 13 17
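For an arbitrary number of lists, the same zip-and-reverse idea can be written once with argument unpacking. This is a small sketch; vertical_rows is a hypothetical helper name, and it assumes all lists have the same length (zip stops at the shortest one).

```python
def vertical_rows(lists):
    """Return one text row per position, starting from the last element of each list."""
    reversed_lists = (reversed(lst) for lst in lists)
    return [' '.join(str(item) for item in row) for row in zip(*reversed_lists)]

lists = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
    [17, 18, 19, 20],
]

for row in vertical_rows(lists):
    print(row)
```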

Comparison of two CSV files in Python

I want to compare two CSV files that look like the ones below and find the signals that do not match.
I need some help doing this in Python.
File 1
2
USER Name
7/31/2015 0:00
<XXXXXXX>
1 Signal_1 10
2 Signal_2 1 2 3 4 5
3 Signal_3 X 5 10 15 20 25 Y 6 11 16 21 26
1 Signal_4 20
1 Signal_5 30
2 Signal_6 6 7 8 9 10 11 12 13
2 Signal_7 55 1.05 1.6 14.1
3 Signal_8 X 30 40 50 60 40 Y 14 15 26 14 26
2 Signal_9 1 1 2 3 2
1 Signal_10 40
File 2
2
USER Name
7/31/2015 0:00
<XXXXXXX>
3 Signal_3 X 20 10 15 17 25 Y 6 11 16 21 26
1 Signal_5 5
2 Signal_7 55 1.05 1.6 14.1
1 Signal_1 10
3 Signal_8 X 30 40 50 60 40 Y 14 15 26 14 26
1 Signal_10 14
2 Signal_9 1 1 2 3 2
2 Signal_6 6 7 8 59 10 15 12 13
1 Signal_4 20
2 Signal_2 1 2 3 4 5
Result:
File 1
3 Signal_3 X 5 10 15 20 25 Y 6 11 16 21 26
1 Signal_5 30
1 Signal_10 40
2 Signal_6 6 7 8 9 10 11 12 13
File 2
3 Signal_3 X 20 10 15 17 25 Y 6 11 16 21 26
1 Signal_5 5
1 Signal_10 14
2 Signal_6 6 7 8 59 10 15 12 13
If you want to check for fairly exact comparisons, you can use sets quite easily:
def sigset(fname):
    # Open in text mode so each line is a str that can be compared with 'Signal'
    with open(fname, 'r') as f:
        # Normalise whitespace so that only real differences count
        data = set(' '.join(line.split()) for line in f
                   if 'Signal' in line)
    return data
s1 = sigset('sig1.txt')
s2 = sigset('sig2.txt')
print('File 1')
for line in sorted(s1 - s2):
    print(line)
print('')
print('File 2')
for line in sorted(s2 - s1):
    print(line)
# Alternative: write every line that appears in only one of the two files to update.csv
with open('Sample1.csv', 'r') as t1, open('Sample2.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()
print(fileone)
print(filetwo)
with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
    for line in fileone:
        if line not in filetwo:
            outFile.write(line)
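If you would rather get the names of the signals that differ than the raw lines, the rows can be keyed by signal name first. This is only a sketch under assumptions: it treats the files as whitespace-separated (as in the samples above), takes the second field as the signal name, and uses the hypothetical helper names parse_signals and changed_signals. It works on any iterable of lines, so an open file object can be passed directly.

```python
def parse_signals(lines):
    """Map each signal name to its whitespace-normalised list of fields."""
    signals = {}
    for line in lines:
        fields = line.split()
        # Signal rows look like: <type> Signal_<n> <values...>; header rows are skipped.
        if len(fields) >= 2 and fields[1].startswith('Signal_'):
            signals[fields[1]] = fields
    return signals

def changed_signals(lines1, lines2):
    """Return the sorted names of signals that differ or exist in only one input."""
    first = parse_signals(lines1)
    second = parse_signals(lines2)
    return sorted(name for name in set(first) | set(second)
                  if first.get(name) != second.get(name))

# Tiny stand-ins for the two files; row order and extra spaces do not matter.
file1 = ["USER Name", "1 Signal_1 10", "1 Signal_5 30", "2 Signal_9 1 1 2 3 2"]
file2 = ["USER Name", "2 Signal_9 1  1 2 3 2", "1 Signal_5 5", "1 Signal_1 10"]

print(changed_signals(file1, file2))
```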

Merge pandas DataFrames based on irregular time intervals

I'm wondering how I can speed up a merge of two dataframes. One of the dataframes has time stamped data points (value col).
import pandas as pd
import numpy as np
data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
'value':np.random.uniform(-1,1,size=50)})
The other has time interval information (start_time, end_time, and associated interval_id).
intervals = pd.DataFrame({'interval_id':np.arange(9),
'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),
'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})
I'd like to merge these two dataframes more efficiently than the for loop below:
data['interval_id'] = np.nan
for index, ser in intervals.iterrows():
    in_interval = (data['time'] >= ser['start_time']) & \
                  (data['time'] <= ser['end_time'])
    # .loc avoids chained assignment; sort_values replaces the older .sort
    data.loc[in_interval, 'interval_id'] = ser['interval_id']
result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)
I keep imagining I'll be able to use pandas time series functionality, like a date range or TimeGrouper, but I have yet to figure out anything more pythonic (pandas-y?) than the above.
Example result:
time value interval_id start_time end_time
0 0.575976 0.022727 NaN NaN NaN
1 4.607545 0.222568 0 3.618715 8.294847
2 5.179350 0.438052 0 3.618715 8.294847
3 11.069956 0.641269 1 10.301728 19.870283
4 12.387854 0.344192 1 10.301728 19.870283
5 18.889691 0.582946 1 10.301728 19.870283
6 20.850469 -0.027436 NaN NaN NaN
7 23.199618 0.731316 2 21.488868 28.968338
8 26.631284 0.570647 2 21.488868 28.968338
9 26.996397 0.597035 2 21.488868 28.968338
10 28.601867 -0.131712 2 21.488868 28.968338
11 28.660986 0.710856 2 21.488868 28.968338
12 28.875395 -0.355208 2 21.488868 28.968338
13 28.959320 -0.430759 2 21.488868 28.968338
14 29.702800 -0.554742 NaN NaN NaN
Any suggestions from time series-savvy people out there would be greatly appreciated.
Update, after Jeff's answer:
The main problem is that interval_id has no relation to any regular time interval (e.g., intervals are not always approximately 10 seconds). One interval could be 10 seconds, the next could be 2 seconds, and the next could be 100 seconds, so I can't use any regular rounding scheme as Jeff proposed. Unfortunately, my minimal example above does not make that clear.
You could use np.searchsorted to find, for each value in data['time'], the index at which it would be inserted into intervals['start_time']. Then you could call np.searchsorted again to find the index at which each value would be inserted into intervals['end_time']. Note that using np.searchsorted relies on intervals['start_time'] and intervals['end_time'] being in sorted order.
Wherever the start index (minus one) equals the end index, the value of data['time'] lies between intervals['start_time'] and intervals['end_time'] of that interval. Note that this relies on the intervals being disjoint.
Using searchsorted in this way is about 5 times faster than using the for-loop:
import pandas as pd
import numpy as np
np.random.seed(1)
data = pd.DataFrame({'time':np.sort(np.random.uniform(0,100,size=50)),
'value':np.random.uniform(-1,1,size=50)})
intervals = pd.DataFrame(
{'interval_id':np.arange(9),
'start_time':np.random.uniform(0,5,size=9) + np.arange(0,90,10),
'end_time':np.random.uniform(5,10,size=9) + np.arange(0,90,10)})
def using_loop():
    data['interval_id'] = np.nan
    for index, ser in intervals.iterrows():
        in_interval = (data['time'] >= ser['start_time']) & \
                      (data['time'] <= ser['end_time'])
        data.loc[in_interval, 'interval_id'] = ser['interval_id']
    result = data.merge(intervals, how='outer').sort_values('time').reset_index(drop=True)
    return result
def using_searchsorted():
    # Index of the last start_time before each time, and of the first end_time at or after it
    start_idx = np.searchsorted(intervals['start_time'].values, data['time'].values)-1
    end_idx = np.searchsorted(intervals['end_time'].values, data['time'].values)
    # Equal indices mean the time falls inside that interval
    mask = (start_idx == end_idx)
    result = data.copy()
    result['interval_id'] = result['start_time'] = result['end_time'] = np.nan
    result.loc[mask, 'interval_id'] = start_idx[mask]
    result.loc[mask, 'start_time'] = intervals['start_time'].values[start_idx[mask]]
    result.loc[mask, 'end_time'] = intervals['end_time'].values[end_idx[mask]]
    return result
In [254]: %timeit using_loop()
100 loops, best of 3: 7.74 ms per loop
In [255]: %timeit using_searchsorted()
1000 loops, best of 3: 1.56 ms per loop
In [256]: 7.74/1.56
Out[256]: 4.961538461538462
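To see why comparing the two indices works, here is the same idea on plain lists. Note the swap: this sketch uses bisect_left from the standard library instead of np.searchsorted (both return the left insertion point by default), and assign_intervals is a hypothetical helper name. It assumes sorted, disjoint intervals, as stated above.

```python
from bisect import bisect_left

def assign_intervals(times, starts, ends):
    """Return, for each time, the index of the interval containing it, or None."""
    ids = []
    for t in times:
        start_idx = bisect_left(starts, t) - 1  # last interval starting before t
        end_idx = bisect_left(ends, t)          # first interval ending at or after t
        ids.append(start_idx if start_idx == end_idx else None)
    return ids

# Three irregular, disjoint intervals: (1, 8), (11, 19) and (21, 28).
starts = [1, 11, 21]
ends = [8, 19, 28]
times = [0.5, 4.6, 9, 11.5, 20, 23, 29]

print(assign_intervals(times, starts, ends))
```

A time that falls in a gap gets a start index one lower than its end index, so the two differ and the result is None.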
You may want to specify the intervals of 'time' slightly differently, but this should give you a start.
In [34]: data['on'] = np.round(data['time']/10)
In [35]: data.merge(intervals,left_on=['on'],right_on=['interval_id'],how='outer')
Out[35]:
time value on end_time interval_id start_time
0 1.301658 -0.462594 0 7.630243 0 0.220746
1 2.202654 0.054903 0 7.630243 0 0.220746
2 10.253593 0.329947 1 17.715596 1 10.299464
3 13.803064 -0.601021 1 17.715596 1 10.299464
4 17.086290 0.484119 2 27.175455 2 24.710704
5 21.797655 0.988212 2 27.175455 2 24.710704
6 26.265165 0.491410 3 37.702968 3 30.670753
7 27.777182 -0.121691 3 37.702968 3 30.670753
8 34.066473 0.659260 3 37.702968 3 30.670753
9 34.786337 -0.230026 3 37.702968 3 30.670753
10 35.343021 0.364505 4 49.489028 4 42.948486
11 35.506895 0.953562 4 49.489028 4 42.948486
12 36.129951 -0.703457 4 49.489028 4 42.948486
13 38.794690 -0.510535 4 49.489028 4 42.948486
14 40.508702 -0.763417 4 49.489028 4 42.948486
15 43.974516 -0.149487 4 49.489028 4 42.948486
16 46.219554 0.893025 5 57.086065 5 53.124795
17 50.206860 0.729106 5 57.086065 5 53.124795
18 50.395082 -0.807557 5 57.086065 5 53.124795
19 50.410783 0.996247 5 57.086065 5 53.124795
20 51.602892 0.144483 5 57.086065 5 53.124795
21 52.006921 -0.979778 5 57.086065 5 53.124795
22 52.682896 -0.593500 5 57.086065 5 53.124795
23 52.836037 0.448370 5 57.086065 5 53.124795
24 53.052130 -0.227245 5 57.086065 5 53.124795
25 57.169775 0.659673 6 65.927106 6 61.590948
26 59.336176 -0.893004 6 65.927106 6 61.590948
27 60.297771 0.897418 6 65.927106 6 61.590948
28 61.151664 0.176229 6 65.927106 6 61.590948
29 61.769023 0.894644 6 65.927106 6 61.590948
30 64.221220 0.893012 6 65.927106 6 61.590948
31 67.907417 -0.859734 7 78.192671 7 72.463468
32 71.460483 -0.271364 7 78.192671 7 72.463468
33 74.514028 0.621174 7 78.192671 7 72.463468
34 75.822643 -0.351684 8 88.820139 8 83.183825
35 84.252778 -0.685043 8 88.820139 8 83.183825
36 84.838361 0.354365 8 88.820139 8 83.183825
37 85.770611 -0.089678 9 NaN NaN NaN
38 85.957559 0.649995 9 NaN NaN NaN
39 86.498339 0.569793 9 NaN NaN NaN
40 91.006735 0.731006 9 NaN NaN NaN
41 91.941862 0.964376 9 NaN NaN NaN
42 94.617522 0.626889 9 NaN NaN NaN
43 95.318288 -0.088918 10 NaN NaN NaN
44 95.595243 0.539685 10 NaN NaN NaN
45 95.818267 -0.989647 10 NaN NaN NaN
46 98.240444 0.931445 10 NaN NaN NaN
47 98.722869 0.442502 10 NaN NaN NaN
48 99.349198 0.585264 10 NaN NaN NaN
49 99.829372 -0.743697 10 NaN NaN NaN
[50 rows x 6 columns]

Can I use regular expressions to search for multiples of a number?

I'm trying to search a big project for all examples of where I've declared an array with [48] as the size or any multiples of 48.
Can I use a regular expression function to find matches of 48 * n?
Thanks.
Here you go (In PHP's PCRE syntax):
^(0*|(1(01*?0)*?1|0)+?0{4})$
Usage:
preg_match('/^(0*|(1(01*?0)*?1|0)+?0{4})$/', decbin($number));
Now, why it works:
Well, we know that 48 is really just 3 * 16, and 16 is just 2*2*2*2. Any number divisible by 2^4 has its 4 least significant bits equal to 0 in its binary representation, so ending the regexp with 0{4}$ is equivalent to saying that the number is divisible by 2^4 (or 16). The bits to the left then need to be divisible by 3, and using the regexp from this answer we can tell whether they are. So if the whole regexp matches, the number is divisible by both 3 and 16, and hence by 48...
QED...
(Note: the leading 0* alternative handles the case where $number is 0, which would otherwise fail to match.) I've tested this on all numbers from 0 to 48^5, and it correctly matches each time...
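The same check can be reproduced in Python, whose re module accepts this pattern unchanged. This is a sketch for verification only: format(n, 'b') stands in for PHP's decbin, is_multiple_of_48 is a hypothetical helper name, and the range is kept small so that it runs quickly.

```python
import re

# Pattern copied from the answer above.
MULTIPLE_OF_48 = re.compile(r'^(0*|(1(01*?0)*?1|0)+?0{4})$')

def is_multiple_of_48(number):
    """Test divisibility by 48 by matching the binary representation."""
    return MULTIPLE_OF_48.match(format(number, 'b')) is not None

# Compare the regex against ordinary arithmetic and collect any disagreements.
mismatches = [n for n in range(10000) if is_multiple_of_48(n) != (n % 48 == 0)]
print(mismatches)
```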
A generalization of your question is asking whether x is a string representing a multiple of n in base b. This is the same thing as asking whether the remainder of x divided by n is 0. You can easily create a DFA to compute this.
Create a DFA with n states, numbered from 0 to n - 1. State 0 is both the initial state and the sole accepting state. Each state will have b outgoing transitions, one for each symbol in the alphabet (since base-b gives you b digits to work with).
Each state represents the remainder of the portion of x we've seen so far, divided by n. This is why we have n of them (dividing a number by n yields a remainder in the range 0 to n - 1), and also why state 0 is the accepting state.
Since the digits of x are processed from left to right, if we have a number y from the first few digits of x and read the digit d, we get the new value of y from yb + d. But more importantly, the remainder r changes to (rb + d) mod n. So we now know how to connect the transition arcs and complete the DFA.
You can do this for any n and b. Here, for example, is one that accepts multiples of 18 in base-10 (states on the rows, inputs on the columns):
| 0 1 2 3 4 5 6 7 8 9
---+-------------------------------
→0 | 0 1 2 3 4 5 6 7 8 9 ←accept
1 | 10 11 12 13 14 15 16 17 0 1
2 | 2 3 4 5 6 7 8 9 10 11
3 | 12 13 14 15 16 17 0 1 2 3
4 | 4 5 6 7 8 9 10 11 12 13
5 | 14 15 16 17 0 1 2 3 4 5
6 | 6 7 8 9 10 11 12 13 14 15
7 | 16 17 0 1 2 3 4 5 6 7
8 | 8 9 10 11 12 13 14 15 16 17
9 | 0 1 2 3 4 5 6 7 8 9
10 | 10 11 12 13 14 15 16 17 0 1
11 | 2 3 4 5 6 7 8 9 10 11
12 | 12 13 14 15 16 17 0 1 2 3
13 | 4 5 6 7 8 9 10 11 12 13
14 | 14 15 16 17 0 1 2 3 4 5
15 | 6 7 8 9 10 11 12 13 14 15
16 | 16 17 0 1 2 3 4 5 6 7
17 | 8 9 10 11 12 13 14 15 16 17
These get really tedious as n and b get larger, but you can obviously write a program to generate them for you no problem.
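A generator for these tables is only a few lines. The sketch below follows the transition rule (rb + d) mod n described above; build_dfa and is_multiple are hypothetical helper names, and digits are decoded with int(ch, base), so it assumes the usual digit characters for bases up to 36.

```python
def build_dfa(n, base=10):
    """Transition table: table[r][d] is the new remainder after reading digit d."""
    return [[(r * base + d) % n for d in range(base)] for r in range(n)]

def is_multiple(digits, table, base=10):
    """Run the DFA over a digit string; state 0 is both the start and the accept state."""
    state = 0
    for ch in digits:
        state = table[state][int(ch, base)]
    return state == 0

table18 = build_dfa(18)
print(table18[1])  # row for state 1 of the multiples-of-18 table

table48 = build_dfa(48)
print(is_multiple('4752', table48))  # 4752 = 48 * 99
```

The table has n rows and b columns, so it stays small for the numbers in the question (48 states in base 10).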
1|48|2304|110592|5308416
You are unlikely to have declared an array of size 48^5 or larger.
No, regular expressions can't calculate multiples (except in the unary number system: decimal 4 = unary 1111; decimal 8 = unary 11111111, so the regex ^(1111)+$ matches multiples of 4).
import re
# For a real example,
# build a string containing integers that are multiples of 48
# and integers that are not multiples of 48.
from random import randint, sample, shuffle
w = [ 48*randint( 1,10) for j in range(10) ]
w.extend( 48*randint(11,20) for j in range(10) )
w.extend( 48*randint(21,70) for j in range(10) )
a = [ el if el%48!=0 else el+1 for el in sample(range(1000),40) ]
w.extend(a)
shuffle(w)
texte = [ ''.join(sample(' abcdefghijklmonopqrstuvwxyz',randint(1,7))) for i in range(40) ]
X = ''.join(texte[i]+str(w[i]) for i in range(40))
# Search for the multiples of 48 in the string X
def mult48(match):
    g1 = match.group()
    if int(g1)%48==0:
        return ( g1, X[0:match.end()] )
    else:
        return ( g1, 'not multiple')
for match in re.finditer(r'\d+',X):
    print('%s %s\n' % mult48(match))
Any multiple is difficult, but here's a (python-style) regexp that matches the first 200 multiples of 48.
0$|1(?:0(?:08$|56$)|1(?:04$|52$)|2(?:00$|48$|96$)|3(?:44$|92$)|4(?:4(?:$|0$)|88$\
)|5(?:36$|84$)|6(?:32$|80$)|7(?:28$|76$)|8(?:24$|72$)|9(?:2(?:$|0$)|68$))|2(?:0(\
?:16$|64$)|1(?:12$|60$)|2(?:08$|56$)|3(?:04$|52$)|4(?:0(?:$|0$)|48$|96$)|5(?:44$\
|92$)|6(?:40$|88$)|7(?:36$|84$)|8(?:32$|8(?:$|0$))|9(?:28$|76$))|3(?:0(?:24$|72$\
)|1(?:20$|68$)|2(?:16$|64$)|3(?:12$|6(?:$|0$))|4(?:08$|56$)|5(?:04$|52$)|6(?:00$\
|48$|96$)|7(?:44$|92$)|8(?:4(?:$|0$)|88$)|9(?:36$|84$))|4(?:0(?:32$|80$)|1(?:28$\
|76$)|2(?:24$|72$)|3(?:2(?:$|0$)|68$)|4(?:16$|64$)|5(?:12$|60$)|6(?:08$|56$)|7(?\
:04$|52$)|8(?:$|0(?:$|0$)|48$|96$)|9(?:44$|92$))|5(?:0(?:40$|88$)|1(?:36$|84$)|2\
(?:32$|8(?:$|0$))|3(?:28$|76$)|4(?:24$|72$)|5(?:20$|68$)|6(?:16$|64$)|7(?:12$|6(\
?:$|0$))|8(?:08$|56$)|9(?:04$|52$))|6(?:0(?:00$|48$|96$)|1(?:44$|92$)|2(?:4(?:$|\
0$)|88$)|3(?:36$|84$)|4(?:32$|80$)|5(?:28$|76$)|6(?:24$|72$)|7(?:2(?:$|0$)|68$)|\
8(?:16$|64$)|9(?:12$|60$))|7(?:0(?:08$|56$)|1(?:04$|52$)|2(?:0(?:$|0$)|48$|96$)|\
3(?:44$|92$)|4(?:40$|88$)|5(?:36$|84$)|6(?:32$|8(?:$|0$))|7(?:28$|76$)|8(?:24$|7\
2$)|9(?:20$|68$))|8(?:0(?:16$|64$)|1(?:12$|6(?:$|0$))|2(?:08$|56$)|3(?:04$|52$)|\
4(?:00$|48$|96$)|5(?:44$|92$)|6(?:4(?:$|0$)|88$)|7(?:36$|84$)|8(?:32$|80$)|9(?:2\
8$|76$))|9(?:0(?:24$|72$)|1(?:2(?:$|0$)|68$)|2(?:16$|64$)|3(?:12$|60$)|4(?:08$|5\
6$)|5(?:04$|52$)|6(?:$|0$))