Multithreading in Python with the threading and queue modules - python-2.7

I have a file with hundreds of thousands of lines, each of which needs to undergo the same process (calculating a co-variance). I was going to multithread because it takes pretty long as is. All the examples/tutorials I have seen have been fairly complicated for what I want to do, however. If anyone could point me to a good tutorial that explains how to use the two modules together, that would be great.

Whenever I have to process something in parallel, I use something similar to this (I just ripped this out of an existing script):
#!/usr/bin/env python2
# This Python file uses the following encoding: utf-8
import os, sys, time
from multiprocessing import Queue, Manager, Process, Value, Event, cpu_count

class ThreadedProcessor(object):
    def __init__(self, parser, input_file, output_file, threads=cpu_count()):
        self.parser = parser
        self.num_processes = threads
        self.input_file = input_file
        self.output_file = output_file
        self.shared_proxy = Manager()
        self.input_queue = Queue()
        self.output_queue = Queue()
        self.input_process = Process(target=self.parse_input)
        self.output_process = Process(target=self.write_output)
        self.processes = [Process(target=self.process_row) for i in range(self.num_processes)]
        self.input_process.start()
        self.output_process.start()
        for process in self.processes:
            process.start()
        self.input_process.join()
        for process in self.processes:
            process.join()
        self.output_process.join()

    def parse_input(self):
        for index, row in enumerate(self.input_file):
            self.input_queue.put([index, row])
        for i in range(self.num_processes):
            self.input_queue.put('STOP')

    def process_row(self):
        for index, row in iter(self.input_queue.get, 'STOP'):
            self.output_queue.put([index, row[0], self.parser.parse(row[1])])
        self.output_queue.put('STOP')

    def write_output(self):
        current = 0
        buffer = {}
        for works in range(self.num_processes):
            for index, id, row in iter(self.output_queue.get, 'STOP'):
                if index != current:
                    buffer[index] = [id] + row
                else:
                    self.output_file.writerow([id] + row)
                    current += 1
                    while current in buffer:
                        self.output_file.writerow(buffer[current])
                        del buffer[current]
                        current += 1
Basically, you have two processes managing the reading and writing of the file: one reads and parses the input, the other reads from the "done" queue and writes to your output file. The worker processes (in this case, as many as your CPU has cores) all pull elements off the input queue and process them.
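For reference, driving this class for the covariance case might look roughly like the following. This is only a sketch: CovarianceParser is a hypothetical stand-in for whatever does the actual calculation (it just needs a parse() method that returns a list of output columns), and the input is assumed to be a CSV whose rows look like [id, data].

import csv

class CovarianceParser(object):
    # Hypothetical parser: replace parse() with the real covariance calculation.
    def parse(self, data):
        values = [float(v) for v in data.split(';')]
        # ... compute the co-variance here; return the result columns as a list
        return [sum(values) / len(values)]

if __name__ == '__main__':
    with open('input.csv', 'rb') as fin, open('output.csv', 'wb') as fout:
        reader = csv.reader(fin)    # yields rows like [id, data]
        writer = csv.writer(fout)   # write_output() calls writer.writerow(...)
        ThreadedProcessor(CovarianceParser(), reader, writer)

Note that this relies on fork-style multiprocessing (Linux/macOS), since the class hands bound methods and open file handles to the worker processes.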

Related

accessing multiple tables concurrently in pytable

I'm trying to store temporal embeddings in PyTables. There are 12 tables, each with more than 130,000 rows and two columns (word varchar, embedding float(numpy.array(300,))). What I want is to calculate the cosine similarity of a given word against all the words in a given table, and to repeat this for all 12 tables. At present I do this sequentially by iterating over each table, but it takes around 15 minutes for all 12 tables.
So my question is: is it possible to read all the tables concurrently? I used multithreading but I get the error
Segmentation fault: 11
Below is my code snippet
def synchronized_open_file():
    with lock:
        return tb.open_file(FILENAME, mode="r", title="embedding DB")

def synchronized_close_file(self, *args, **kwargs):
    with lock:
        return self.close(*args, **kwargs)

outqueue = queue.Queue()
threads = []
for table in list_table:
    thread = threading.Thread(target=self.top_n_similar, args=(table,))
    thread.start()
    threads.append(thread)

try:
    for _ in range(len(threads)):
        result = outqueue.get()
        if isinstance(result, Exception):
            raise result
        else:
            top_n_neighbor_per_period[result[0]] = result[1]
finally:
    for thread in threads:
        thread.join()

def top_n_similar(table_name):
    H5FILE = synchronized_open_file()
    try:
        # do work()
        outqueue.put(result)
    finally:
        synchronized_close_file(H5FILE)
Yes, you can access multiple PyTables objects simultaneously. A simple example is provided below: it creates 3 tables from a (300, 2) numpy record array filled with random data, and demonstrates that you can access all 3 tables as Table objects, as numpy arrays, or both.
I have not done multi-threading with PyTables, so I can't help with that. I suggest you get your code working in serial before adding multi-threading, and review the PyTables docs. I know h5py has specific procedures for parallel access via mpi4py; PyTables may have similar requirements.
Code Sample
import tables as tb
import numpy as np
h5f = tb.open_file('SO_55445040.h5',mode='w')
mydtype = np.dtype([('word',float),('varchar',float)])
arr = np.random.rand(300,2)
recarr = np.core.records.array(arr,dtype=mydtype)
h5f.create_table('/', 'table1', obj=recarr )
recarr = np.core.records.array(2.*arr,dtype=mydtype)
h5f.create_table('/', 'table2', obj=recarr )
recarr = np.core.records.array(3.*arr,dtype=mydtype)
h5f.create_table('/', 'table3', obj=recarr )
h5f.close()
h5f = tb.open_file('SO_55445040.h5',mode='r')
# Return Table objects:
tb1 = h5f.get_node('/table1')
tb2 = h5f.get_node('/table2')
tb3 = h5f.get_node('/table3')
# Return numpy arrays:
arr1 = h5f.get_node('/table1').read()
arr2 = h5f.get_node('/table2').read()
arr3 = h5f.get_node('/table3').read()
h5f.close()
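Separately, since the expensive step is the cosine similarity of one word against every row of a table, it may be worth vectorising that step with numpy before reaching for concurrency. A minimal sketch, assuming each table's embeddings can be stacked into an (N, 300) array:

import numpy as np

def top_n_cosine(query_vec, embeddings, n=10):
    # query_vec:  shape (300,)
    # embeddings: shape (N, 300), e.g. built from table.read()
    # Normalise once, then a single matrix-vector product gives every similarity.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    sims = emb_norm.dot(q_norm)
    return np.argsort(sims)[::-1][:n]

One matrix-vector product per table is typically far faster than looping over 130,000 rows in Python, with or without threads.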

Loading lightgbm model and using predict with parallel for loop freezes (Python)

I need to use my model to make predictions in batches and in parallel in Python. If I load the model, create the data frames in a regular for loop, and use the predict function, it works with no issues. If I create disjoint data frames in parallel using multiprocessing and then use the predict function, the loop freezes indefinitely. Why does this behavior occur?
Here is a snippet of my code:
import pickle
import random

import pandas as pd

# (flat_map and get_mapping are helper functions defined elsewhere in the script)

with open('models/model_test.pkl', 'rb') as fin:
    pkl_bst = pickle.load(fin)

def predict_generator(X):
    df = X
    print(df.head())
    df = (df.groupby(['user_id']).recommender_items.apply(flat_map)
            .reset_index().drop('level_1', axis=1))
    df.columns = ['user_id', 'product_id']
    print('Merge Data')
    user_lookup = pd.read_csv('data/user_lookup.csv')
    product_lookup = pd.read_csv('data/product_lookup.csv')
    product_map = dict(zip(product_lookup.product_id, product_lookup.name))
    print(user_lookup.head())
    df = pd.merge(df, user_lookup, on=['user_id'])
    df = pd.merge(df, product_lookup, on=['product_id'])
    df = df.sort_values(['user_id', 'product_id'])
    users = df.user_id.values
    items = df.product_id.values
    df.drop(['user_id', 'product_id'], axis=1, inplace=True)
    print('Prediction Step')
    prediction = pkl_bst.predict(df, num_iteration=pkl_bst.best_iteration)
    print('Prediction Complete')
    validation = pd.DataFrame(zip(users, items, prediction),
                              columns=['user', 'item', 'prediction'])
    validation['name'] = (validation.item
                          .apply(lambda x: get_mapping(x, product_map)))
    validation = pd.DataFrame(zip(validation.user,
                                  zip(validation.name,
                                      validation.prediction)),
                              columns=['user', 'prediction'])
    print(validation.head())

    def get_items(x):
        sorted_list = sorted(list(x), key=lambda i: i[1], reverse=True)[:20]
        sorted_list = random.sample(sorted_list, 10)
        return [k for k, _ in sorted_list]

    relevance = validation.groupby('user').prediction.apply(get_items)
    return relevance.reset_index()
This works but is very slow:
results = []
for d in df_list_sub:
    r = predict_generator(d)
    results.append(r)
This breaks:
from multiprocessing import Pool
import tqdm

pool = Pool(processes=8)
results = []
for x in tqdm.tqdm(pool.imap_unordered(predict_generator, df_list_sub),
                   total=len(df_list_sub)):
    results.append(x)
pool.close()
pool.join()
I would be very thankful if someone could help me.
Stumbled onto this myself as well. This happens because LightGBM only allows the predict function to be accessed from a single process. The developers added this restriction deliberately: it doesn't make sense to call predict from multiple processes, since the prediction function already makes use of all available CPUs. On top of that, allowing multiprocess predicting would probably result in worse performance. More information about this can be found in this GitHub issue.
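If the goal is simply to work through all of the batches, one option consistent with this is to hand predict a single larger frame and let LightGBM's internal threading do the parallel work. A rough sketch, assuming the frames in df_list_sub share the same columns (the original sequential loop works too):

import pandas as pd

# One call over the concatenated batches; Booster.predict() parallelises
# internally, so there is usually little to gain from splitting the work
# across processes.
full_df = pd.concat(df_list_sub, ignore_index=True)
relevance = predict_generator(full_df)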

Increase recursion limit and stack size in python 2.7

I'm working with large trees and need to increase the recursion limit on Python 2.7.
Using sys.setrecursionlimit(10000) crashes my kernel, so I figured I needed to increase the stack size.
However I don't know how large the stack size should be. I tried 100 MiB like this threading.stack_size(104857600), but the kernel still dies. Giving it 1 GiB throws an error.
I haven't worked with the threading module before, so am I using it wrong by just putting the above statement at the beginning of my script? I'm not doing any kind of parallel processing; everything runs in the same thread.
My computer has 128 GB of physical RAM and runs Windows 10; I'm using the IPython console in Spyder.
The error displayed is simply:
Kernel died, restarting
Nothing more.
EDIT:
Full code to reproduce the problem. Building the tree works well, though it takes quite a while; the kernel only dies during the recursive execution of treeToDict() when reading the whole tree into a dictionary. Maybe there is something wrong with the code of that function. The tree is a non-binary tree:
import pandas as pd
import threading
import sys
import random as rd
import itertools as it
import string

threading.stack_size(104857600)
sys.setrecursionlimit(10000)

class treenode:
    # class to build the tree
    def __init__(self,children,name='',weight=0,parent=None,depth=0):
        self.name = name
        self.weight = weight
        self.children = children
        self.parent = parent
        self.depth = depth
        self.parentname = parent.name if parent is not None else ''

def add_child(node,name):
    # add element to the tree
    # if it already exists at the given node increase weight
    # else add a new child
    for i in range(len(node.children)):
        if node.children[i].name == name:
            node.children[i].weight += 1
            newTree = node.children[i]
            break
    else:
        newTree = treenode([],name=name,weight=1,parent=node,depth=node.depth+1)
        node.children.append(newTree)
    return newTree

def treeToDict(t,data):
    # read the tree into a dictionary
    if t.children != []:
        for i in range(len(t.children)):
            data[str(t.depth)+'_'+t.name] = [t.name, t.children[i].name, t.depth, t.weight, t.parentname]
    else:
        data[str(t.depth)+'_'+t.name] = [t.name, '', t.depth, t.weight, t.parentname]
    for i in range(len(t.children)):
        treeToDict(t.children[i],data)

# Create random dataset that leads to very long tree branches:
# A is an index for each set of data B which becomes one branch
rd.seed(23)
testSet = [''.join(l) for l in it.combinations(string.ascii_uppercase[:20],2)]
A = []
B = []
for i in range(10):
    for j in range(rd.randint(10,6000)):
        A.append(i)
        B.append(rd.choice(testSet))
dd = {"A":A,"B":B}
data = pd.DataFrame(dd)

# The maximum length should be above 5500, use another seed if it's not:
print data.groupby('A').count().max()

# Create the tree
root = treenode([],name='0')
for i in range(len(data.values)):
    if i == 0:
        newTree = add_child(root,data.values[i,1])
        oldses = data.values[i,0]
    else:
        if data.values[i,0] == oldses:
            newTree = add_child(newTree,data.values[i,1])
        else:
            newTree = add_child(root,data.values[i,1])
            oldses = data.values[i,0]

result = {}
treeToDict(root,result)
PS: I'm aware the treeToDict() function is faulty in that it will overwrite entries because there can be duplicate keys. For the error described here, however, that bug is unimportant.
In my experience the problem is not the stack size but the algorithm itself.
It's possible to implement tree traversal without any recursion: use a stack-based depth-first (or breadth-first) search.
Python-like pseudo-code might look like this:
stack = []

def traverse_tree(root):
    stack.append(root)
    while stack:
        cur = stack.pop()
        cur.do_some_awesome_stuff()
        stack.extend(cur.get_children())
This approach is incredibly scalable and allows you to deal with trees of any size.
For further reading, look up iterative depth-first and breadth-first traversal.
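Applied to the treeToDict() function from the question, an iterative version might look like the sketch below. It keeps the same dictionary layout (including the duplicate-key issue mentioned in the PS); only the traversal changes:

def treeToDictIterative(root, data):
    # Explicit stack instead of recursion, so no recursion limit is hit.
    stack = [root]
    while stack:
        t = stack.pop()
        if t.children:
            for child in t.children:
                data[str(t.depth)+'_'+t.name] = [t.name, child.name, t.depth, t.weight, t.parentname]
                stack.append(child)
        else:
            data[str(t.depth)+'_'+t.name] = [t.name, '', t.depth, t.weight, t.parentname]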

Looping in Python Slowing Down with each Iteration

The following code produces the desired result, but it slows down with each append. In the first 100 pages or so it runs at less than one second per page, but by the 500s it is up to 3 seconds, and into the 1000s it runs at about 5 seconds per append. Are there suggestions for how to make this more efficient, or explanations for why this is just the way things are?
import lxml
from lxml import html
import itertools
import datetime

l = []
for pageno in itertools.count(start=1):
    time = datetime.datetime.now()
    url = 'http://example.com/'
    parse = lxml.html.parse(url)
    try:
        for x in parse.xpath('//center'):
            x.getparent().remove(x)
            x.clear()
            while x.getprevious() is not None:
                del x.getparent()[0]
        for n in parse.xpath('//tr[@class="rt"]'):
            l.append([n.find('td/a').text.encode('utf8').decode('utf8').strip(),
                      n.find('td/form/p/a').text.encode('utf8').decode('utf8').strip(),
                      n.find('td/form/p/a').attrib['title'].encode('utf8').decode('utf8').strip()]
                     + [c.text.encode('utf8').decode('utf8').strip() for c in n if c.text.strip() != ''])
            n.clear()
            while n.getprevious() is not None:
                del n.getparent()[0]
    except:
        print 'Page ' + str(pageno) + ' Does Not Exist'
    print '{0} Pages Complete: {1}'.format(pageno, datetime.datetime.now() - time)
I have tried a number of solutions, such as disabling the garbage collector, writing one list as a row to file instead of appending to a large list, etc. I look forward to learning more from potential suggestions/answers.
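For reference, the write-each-row-to-file variant mentioned above can be sketched like this (the output path is a placeholder, and rows stands for the same lists that are currently appended to l inside the scraping loop):

import csv

def write_rows_streaming(rows, path='output.csv'):
    # Write parsed rows straight to disk instead of keeping them all in memory.
    # rows: any iterable of lists, e.g. a generator yielding one parsed row at a time.
    with open(path, 'wb') as fout:      # 'wb' for the csv module on Python 2
        writer = csv.writer(fout)
        for row in rows:
            writer.writerow(row)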

Conditional quit multiprocess in python

I'm trying to build a python script that runs several processes in parallel. Basically, the processes are independent, work on different folders and leave their output as text files in those folders. But in some special cases, a process might terminate with a special (boolean) status. If so, I want all the other processes to terminate right away. What is the best way to do this?
I've fiddled with multiprocessing.Condition() and multiprocessing.Manager() after reading the excellent tutorial by Doug Hellmann:
http://pymotw.com/2/multiprocessing/communication.html
However, I do not seem to understand how to get a multiprocessing process to monitor a status indicator and quit if it takes a special value.
To illustrate this, I've written the small script below. It somewhat does what I want, but ends in an exception. Suggestions on more elegant ways to proceed are gratefully welcomed:
br,
Gro
import multiprocessing

def input(i):
    """arbitrary chosen value(8) gives special status = 1"""
    if i == 8:
        value = 1
    else:
        value = 0
    return value

def sum_list(list):
    """accumulative sum of list"""
    sum = 0
    for x in range(len(list)):
        sum = sum + list[x]
    return sum

def worker(list,i):
    value = input(i)
    list[i] = value
    print 'processname',i

if __name__ == '__main__':
    mgr = multiprocessing.Manager()
    list = mgr.list([0]*20)
    jobs = [ multiprocessing.Process(target=worker, args=(list,i))
             for i in range(20) ]
    for j in jobs:
        j.start()
        sumlist = sum_list(list)
        print sumlist
        if sumlist == 1:
            break
    for j in jobs:
        j.join()
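One common pattern for this kind of early termination is a shared multiprocessing.Event: the worker that hits the special status sets it, and every worker checks it between units of work. A minimal, self-contained sketch (independent of the script above; the values and sleeps are only illustrative):

import multiprocessing
import time

def worker(i, stop_event):
    # Do the work in small steps and check the shared flag between steps.
    for step in range(50):
        if stop_event.is_set():
            print 'process', i, 'stopping early'
            return
        time.sleep(0.1)                   # stand-in for real work
        if i == 8 and step == 3:          # the "special status" case
            print 'process', i, 'hit the special status, signalling the others'
            stop_event.set()
            return

if __name__ == '__main__':
    stop_event = multiprocessing.Event()
    jobs = [multiprocessing.Process(target=worker, args=(i, stop_event))
            for i in range(20)]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()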