How does this code for parallel tasks work in Python? - python-2.7

I've been using the script below to run some tasks in parallel on an Ubuntu server with 16 processors. It works, but I have a few questions about it:
What is the code actually doing?
The more workers I set up, the faster the script runs, but what is the limit on the number of workers? I've run it with 100.
How could I improve it?
#!/usr/bin/env python
from multiprocessing import Process, Queue
from executable import run_model
from database import DB
import numpy as np

def worker(work_queue, db_conection):
    try:
        for phone in iter(work_queue.get, 'STOP'):
            registers_per_number = retrieve_CDRs(phone, db_conection)
            run_model(np.array(registers_per_number), db_conection)
            #print("The phone %s was already run" % (phone))
    except Exception:
        pass
    return True

def retrieve_CDRs(phone, db_conection):
    return db_conection.retrieve_data_by_person(phone)

def main():
    phone_numbers = np.genfromtxt("../listado.csv", dtype="int")[:2000]
    workers = 16
    work_queue = Queue()
    processes = []
    #print("Process started with %s" % (workers))
    for phone in phone_numbers:
        work_queue.put(phone)
        #print("Phone %s put at the queue" % (phone))
    #print("The queue %s" % (work_queue))
    for w in xrange(workers):
        #print("The worker %s" % (w))
        # new conection to data base
        db_conection = DB()
        p = Process(target=worker, args=(work_queue, db_conection))
        p.start()
        #print("Process %s started" % (p))
        processes.append(p)
        work_queue.put('STOP')
    for p in processes:
        p.join()

if __name__ == '__main__':
    main()
Cheers!

Start from the main function:
It creates a NumPy array of 2000 integer phone numbers from a CSV file.
Then it creates a few variables and lists.
Next, it fills a queue with all the phone numbers extracted from the CSV file.
Next, for each of the 16 workers, it opens a DB connection, sets up the process arguments and starts the worker process. It also puts one 'STOP' marker on the queue per worker, so every worker leaves its loop once the phone numbers run out. Finally, main waits for all the processes with join().
Hope that helps you understand the code. Strictly speaking this is multiprocessing rather than multi-threading: each worker is a separate process, so the workers really do run in parallel. That is why more workers make it faster. The hard upper bound is 2000 workers, one per phone number; beyond that the extra workers have nothing to do, and idle workers only cost you performance. In practice the useful number is usually much lower, since it also depends on the 16 processors and on how many connections the database can serve, so it is worth measuring. Finally, to improve parallel processing you want to keep the number of idle processors/workers to a minimum.
Hope that helps. Cheers!
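On the "how many workers" question, here is a rough sketch of one way to cap the worker count instead of guessing. It is not the original script: `pick_worker_count` and `process_phone` are hypothetical names, `process_phone` only stands in for `retrieve_CDRs` plus `run_model`, and the `per_cpu` factor is an assumption you would tune by measuring (a value above 1 can pay off when workers mostly wait on the database).

```python
import multiprocessing


def pick_worker_count(n_tasks, n_cpus=None, per_cpu=1):
    """Cap the worker count by both the number of tasks and a CPU budget."""
    if n_cpus is None:
        n_cpus = multiprocessing.cpu_count()
    return max(1, min(n_tasks, n_cpus * per_cpu))


def process_phone(phone):
    # Hypothetical placeholder for retrieve_CDRs(...) followed by run_model(...).
    return phone * 2


if __name__ == "__main__":
    phones = list(range(20))
    n_workers = pick_worker_count(len(phones))
    pool = multiprocessing.Pool(processes=n_workers)
    try:
        # map() hands the items to the workers and re-raises worker errors here,
        # instead of swallowing them as the bare "except: pass" would.
        results = pool.map(process_phone, phones)
    finally:
        pool.close()
        pool.join()
    print(results[:5])
```

A Pool also removes the need for the manual 'STOP' markers, since it shuts its workers down itself on close() and join().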

Related

multiprocessing Queue deadlock when spawning multiple threads in one process

I created two processes: one spawns multiple threads and is responsible for writing data to a Queue, the other reads data from the Queue. It deadlocks most of the time and only occasionally runs through; it happens less often when a sleep is added to the run method in the write module (see the comment in the code). Here is my code:
Environment: Python 2.7
main.py
from multiprocessing import Process,Queue
from write import write
from read import read

if __name__ == "__main__":
    record_queue = Queue()
    table_queue = Queue()
    pw = Process(target=write,args=[record_queue, table_queue])
    pr = Process(target=read,args=[record_queue, table_queue])
    pw.start()
    pr.start()
    pw.join()
    pr.join()
write.py
from concurrent.futures import ThreadPoolExecutor, as_completed

def write(record_queue, table_queue):
    thread_num = 3
    pool = ThreadPoolExecutor(thread_num)
    futures = [pool.submit(run, record_queue, table_queue) for _ in range (thread_num)]
    results = [r.result() for r in as_completed(futures)]

def run(record_queue, table_queue):
    while True:
        if table_queue.empty():
            break
        table = table_queue.get()
        # adding this code below reduce deadlock opportunity.
        #import time
        #import random
        #time.sleep(random.randint(1, 3))
        process_with_table(record_queue, table_queue, table)

def process_with_table(record_queue, table_queue, table):
    #for short
    for item in [x for x in range(1000)]:
        record_queue.put(item)
read.py
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
import Queue

def read(record_queue, table_queue):
    count = 0
    while True:
        item = record_queue.get()
        count += 1
        print ("item: ", item)
        if count == 4:
            break
I googled it and there are similar questions on SO, but I can't see how they relate to my code, so can anyone help with it? Thanks.
I seem to have found a solution: change the run method in the write module to:
from Queue import Empty  # Python 2.7; multiprocessing queues raise this same exception

def run(record_queue, table_queue):
    while True:
        try:
            if table_queue.empty():
                break
            # another thread may take the last item after the empty() check,
            # so never block forever on get()
            table = table_queue.get(timeout=3)
            process_with_table(record_queue, table_queue, table)
        except Empty:
            import time
            time.sleep(0.1)
With this change I no longer see a deadlock or a blocked get call.
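The likely reason the original run method hangs is the gap between empty() and get(): two threads can both see a non-empty queue, one takes the last item, and the other then blocks in get() forever. A minimal sketch of the non-blocking alternative follows. It uses threads and the standard-library queue as a stand-in for the multiprocessing queue, and the names `drain` and `run_drain` are made up for this illustration.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7


def drain(tables, out, lock):
    """Take items until the queue is empty, without ever blocking on get()."""
    while True:
        try:
            table = tables.get_nowait()
        except queue.Empty:
            return  # nothing left: exit instead of waiting forever
        with lock:
            out.append(table)


def run_drain(n_tables=50, n_threads=3):
    tables = queue.Queue()
    for i in range(n_tables):
        tables.put(i)
    out, lock = [], threading.Lock()
    threads = [threading.Thread(target=drain, args=(tables, out, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(out)


if __name__ == "__main__":
    print(len(run_drain()))
```

With a multiprocessing.Queue, a short timeout as in the fix above is probably the safer variant, because an item may not be visible to get_nowait() immediately after put().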

Getting too many deadlock errors while updating MSSQL table with pyodbc in parallel with multiprocessing

I am trying to open pickle files that have data within them, then update an MSSQL table with that data. It was taking forever: 10 days to update 1,000,000 rows. So I wrote a script for more parallelism. The more processes I run it with, the more errors like this I get:
(<class 'pyodbc.Error'>, Error('40001', '[40001] [Microsoft][ODBC SQL Server Driver][SQL Server]Transaction (Process ID 93) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. (1205) (SQLExecDirectW)'), <traceback object at 0x0000000002791808>)
As you can see in my code, I keep retrying the update until it succeeds, and even sleep for a second between attempts:
while True:
    try:
        updated = cursor.execute(update,'Yes', fileName+'.'+ext, dt, size,uniqueID )
        break
    except:
        time.sleep(1)
        print sys.exc_info()
Is this because the multiprocessing module on Windows uses os.spawn instead of os.fork?
Is there a way to do this that will provide more of a speed-up?
I was told that the table can handle way more transactions than this...
#!C:/Python/python.exe -u
import pyodbc,re,pickle,os,glob,sys,time
from multiprocessing import Lock, Process, Queue, current_process

def UpDater(pickleQueue):
    for pi in iter(pickleQueue.get, 'STOP'):
        name = current_process().name
        f=pi
        cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=database.windows.net;DATABASE=DB;UID=user;PWD=pwd')
        cursor = cnxn.cursor()
        update = ("""UPDATE DocumentList
        SET Downloaded=?, DownLoadedAs=?,DownLoadedWhen=?,DownLoadedSizeKB=?
        WHERE DocNumberSequence=?""")
        r = re.compile('\d+')
        pkl_file = open(pi, 'rb')
        meta = pickle.load(pkl_file)
        fileName = meta[0][0]
        pl = r.findall(fileName)
        l= int(len(pl)-1)
        ext = meta[0][1]
        url = meta[0][2]
        uniqueID = pl[l]
        dt = meta[0][4]
        size = meta[0][5]
        while True:
            try:
                updated = cursor.execute(update,'Yes', fileName+'.'+ext, dt, size,uniqueID )
                break
            except:
                time.sleep(1)
                print sys.exc_info()
        print uniqueID
        cnxn.commit()
        pkl_file.close()
        os.remove(fileName+'.pkl')
        cnxn.close()

if __name__ == '__main__':
    os.chdir('Pickles')
    pickles = glob.glob("*.pkl")
    pickleQueue=Queue()
    processes =[]
    for item in pickles:
        pickleQueue.put(item)
    workers = int(sys.argv[1])
    for x in xrange(workers):
        p = Process(target=UpDater,args=(pickleQueue,))
        p.start()
        processes.append(p)
        pickleQueue.put('STOP')
    for p in processes:
        p.join()
I am using Windows 7 and the Python 2.7 Anaconda distribution.
EDIT
The answer below (use row locks) stopped the error from happening. However, the updates were still slow. It turns out that an old-fashioned index on the primary key was needed for a 100x speed-up.
A few things to try. Using sleeps is a bad idea. First, could you try row level locking?
update = ("""UPDATE DocumentList WITH (ROWLOCK)
SET Downloaded=?, DownLoadedAs=?,DownLoadedWhen=?,DownLoadedSizeKB=?
WHERE DocNumberSequence=? """)
Another option would be to wrap each update in an explicit transaction:
update = ("""
BEGIN TRANSACTION my_trans;
UPDATE DocumentList
SET Downloaded=?, DownLoadedAs=?,DownLoadedWhen=?,DownLoadedSizeKB=?
WHERE DocNumberSequence=?;
COMMIT TRANSACTION my_trans;
""")
Would either of these solutions work for you?
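If some retrying is still needed, a bounded retry with growing delays is usually kinder to the server than a bare `except` with a fixed one-second sleep. The sketch below is only an illustration: `FakeDbError` is a hypothetical stand-in for `pyodbc.Error`, `run_with_retry` and `is_deadlock` are names made up here, and the check on the first exception argument assumes the SQLSTATE sits there, as it does in the error shown in the question ('40001').

```python
import random
import time


class FakeDbError(Exception):
    """Hypothetical stand-in for pyodbc.Error in this sketch."""


def is_deadlock(exc):
    # Assumption: the SQLSTATE is the first exception argument, as in the
    # traceback from the question: Error('40001', '...deadlock victim...').
    return bool(exc.args) and exc.args[0] == '40001'


def run_with_retry(operation, retry_on=(FakeDbError,), max_attempts=5,
                   base_delay=0.1, sleep=time.sleep):
    """Call operation(); retry only on deadlock errors, with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            # Anything that is not a deadlock, or the last attempt, is re-raised
            # so real errors are not hidden in an endless loop.
            if not is_deadlock(exc) or attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, base_delay))
```

In the real script the operation would wrap the `cursor.execute(...)` and `cnxn.commit()` calls, and `retry_on` would be `(pyodbc.Error,)`.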

Python - Multiple processes running sequentially

Here is the code:
# database_extractor.py
class DatabaseExtractor(object):
    def __init__(self, ..):
        ...
    def run_extraction(self):
        # run sql query to extract data to a file

# driver.py
from multiprocessing import Process

def extract_func(db_extractor):
    db_extractor.run_extraction()

if __name__ == "__main__":
    db1 = DatabaseExtractor(..)
    db2 = DatabaseExtractor(..)
    db3 = DatabaseExtractor(..)
    db4 = DatabaseExtractor(..)
    db5 = DatabaseExtractor(..)
    db6 = DatabaseExtractor(..)
    db7 = DatabaseExtractor(..)
    db8 = DatabaseExtractor(..)
    # the callable must be passed as target=; the first positional
    # argument of Process is group, which has to be None
    worker_l = [Process(target=extract_func, args=[db1]),
                Process(target=extract_func, args=[db2]),
                Process(target=extract_func, args=[db3]),
                Process(target=extract_func, args=[db4]),
                Process(target=extract_func, args=[db5]),
                Process(target=extract_func, args=[db6]),
                Process(target=extract_func, args=[db7]),
                Process(target=extract_func, args=[db8])]
    for worker in worker_l: worker.start()
    for worker in worker_l: worker.join()
(In reality, the instances of DatabaseExtractor are generated from an input config file, so there could be more than 8 processes running.)
I referred to the SO post: Reference, quoting the accepted answer "You'll either want to join your processes individually outside of your for loop (e.g., by storing them in a list and then iterating over it) or use something like numpy.Pool and apply_async with a callback". Even though I did the same, all my processes are running sequentially. I know this because 4 of the instances have queries that run for a couple of hours, and when one of them is kicked off I do not see the other queries populating their respective output files. How can I force parallel execution of the instances?
My guess is that something is happening at the DB layer. This example shows everything works as expected as far as processes are concerned. I would recommend checking your database locking etc.
from multiprocessing import Process
from random import randint
from time import sleep

def wait_proc(i, s):
    print "%d - Working for %d seconds" % (i,s)
    sleep(s)
    print "%d - Done." % (i,)

wait_l = [Process(target=wait_proc, args=[i,randint(5,15)]) for i in range(10)]
for w in wait_l:
    w.start()
for w in wait_l:
    w.join()
print "All done."
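To check whether the workers really overlap in time rather than inferring it from output files, each process can report its start and end timestamps. The following is a small sketch under that idea; `timed_job` and `max_overlap` are names invented here, and the sleep is only a placeholder for `run_extraction()`.

```python
import time
from multiprocessing import Process, Queue


def timed_job(job_id, seconds, results):
    start = time.time()
    time.sleep(seconds)  # placeholder for db_extractor.run_extraction()
    results.put((job_id, start, time.time()))


def max_overlap(intervals):
    """Return the largest number of (start, end) intervals active at once."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal timestamps, ends (-1) sort before starts (+1), so intervals
    # that merely touch are not counted as overlapping.
    events.sort()
    active = best = 0
    for _, delta in events:
        active += delta
        best = max(best, active)
    return best


if __name__ == "__main__":
    results = Queue()
    jobs = [Process(target=timed_job, args=(i, 1.0, results)) for i in range(4)]
    for job in jobs:
        job.start()
    # Read the results before joining so the children can flush their queue data.
    spans = [results.get() for _ in jobs]
    for job in jobs:
        job.join()
    print("max concurrent jobs: %d" % max_overlap([(s, e) for _, s, e in spans]))
```

If this reports more than one concurrent job while the real extractions still finish one after another, the serialization is most likely happening in the database, as suggested above.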

ReactorNotRestartable error

I have a tool in which I am implementing UPnP discovery of devices connected to the network.
For that I have written a script that uses the datagram class.
Implementation:
Whenever the scan button in the tool is pressed, it runs the UPnP script and lists the devices in a box in the tool.
This was working fine.
But when I press the scan button again, it gives me the following error:
Traceback (most recent call last):
  File "tool\ui\main.py", line 508, in updateDevices
    upnp_script.main("server", localHostAddress)
  File "tool\ui\upnp_script.py", line 90, in main
    reactor.run()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1191, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1171, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 683, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Main function of the UPnP script:
def main(mode, iface):
    klass = Server if mode == 'server' else Client
    obj = klass
    obj(iface)
    reactor.run()
There is a Server class which sends the M-SEARCH command (UPnP) to discover devices.
MS = 'M-SEARCH * HTTP/1.1\r\nHOST: %s:%d\r\nMAN: "ssdp:discover"\r\nMX: 2\r\nST: ssdp:all\r\n\r\n' % (SSDP_ADDR, SSDP_PORT)
In the Server class constructor, after sending the M-SEARCH, I stop the reactor:
reactor.callLater(10, reactor.stop)
From Google I found that a reactor cannot be restarted; this is a known limitation.
http://twistedmatrix.com/trac/wiki/FrequentlyAskedQuestions#WhycanttheTwistedsreactorberestarted
Please guide me on how I can modify my code so that I can scan for devices more than once without getting this "reactor not restartable" error.
In response to "Please guide me how can i modify my code...": you haven't provided enough code for me to guide you specifically; I would need to understand the Twisted part of the logic around your scan/search.
If I were to offer a generic design pattern or mental model for the Twisted reactor, though, I would say: think of it as your program's main loop. (Thinking about the reactor that way is what makes the problem obvious to me, anyway.)
I.e. most long-running programs have a form something like:
def main():
    while True:
        check_and_update_some_stuff()
        sleep(10)
That same code in Twisted is more like:
def main():
    # the LoopingCall adds the given function to the reactor loop
    l = task.LoopingCall(check_and_update_some_stuff)
    l.start(10.0)
    reactor.run() # <--- this is the endless while loop
If you think of the reactor as "the endless loop that makes up the main() of my program", then you'll understand why no one is bothering to add support for "restarting" the reactor. Why would you want to restart an endless loop? Instead of stopping the core of your program, you should surgically stop only the task inside it that is complete, leaving the main loop untouched.
You seem to be implying that the current code will keep sending M-SEARCHes endlessly while the reactor is running. So change your sending code so that it stops repeating the send. (I can't tell you exactly how because you didn't provide the code, but for instance a LoopingCall can be turned off by calling its .stop method.)
Runnable example as follows:
#!/usr/bin/python
from twisted.internet import task
from twisted.internet import reactor
from twisted.internet.protocol import Protocol, ServerFactory

class PollingIOThingy(object):
    def __init__(self):
        self.sendingcallback = None # Note I'm pushing sendToAll into here in main()
        self.l = None # Also being pushed in from main()
        self.iotries = 0

    def pollingtry(self):
        self.iotries += 1
        if self.iotries > 5:
            print "stoping this task"
            self.l.stop()
            return
        print "Polling runs: " + str(self.iotries)
        if self.sendingcallback:
            self.sendingcallback("Polling runs: " + str(self.iotries) + "\n")

class MyClientConnections(Protocol):
    def connectionMade(self):
        print "Got new client!"
        self.factory.clients.append(self)

    def connectionLost(self, reason):
        print "Lost a client!"
        self.factory.clients.remove(self)

class MyServerFactory(ServerFactory):
    protocol = MyClientConnections

    def __init__(self):
        self.clients = []

    def sendToAll(self, message):
        for c in self.clients:
            c.transport.write(message)

# Normally I would define a class of ServerFactory here but I'm going to
# hack it into main() as they do in the twisted chat, to make things shorter
def main():
    client_connection_factory = MyServerFactory()
    polling_stuff = PollingIOThingy()
    # the following line is what this example is all about:
    polling_stuff.sendingcallback = client_connection_factory.sendToAll
    # push the client connections send def into my polling class
    # if you want to run something ever second (instead of 1 second after
    # the end of your last code run, which could vary) do:
    l = task.LoopingCall(polling_stuff.pollingtry)
    polling_stuff.l = l
    l.start(1.0)
    # from: https://twistedmatrix.com/documents/12.3.0/core/howto/time.html
    reactor.listenTCP(5000, client_connection_factory)
    reactor.run()

if __name__ == '__main__':
    main()
This script has extra cruft in it that you might not care about, so just focus on the self.l.stop() in PollingIOThingy's pollingtry method and the l-related lines in main(), which illustrate the point.
(This code comes from the SO question "Persistent connection in twisted"; check that question if you want to know what the extra bits are about.)
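Applied to the scan button, the same idea could look roughly like the sketch below: each scan ends itself after its time window instead of calling reactor.stop, so the next button press can simply start a new scan. This is an assumption-laden illustration, not the asker's code: `ScanSession`, `wire_to_twisted`, `send_search` and `on_done` are hypothetical names, and it presumes the reactor is started once when the tool starts (for example in a background thread or through a GUI-integrated reactor). The scheduler is injected so the class can be tried out without Twisted.

```python
class ScanSession(object):
    """One discovery scan that finishes itself without stopping the reactor."""

    def __init__(self, call_later, send_search, on_done, window=10):
        self._call_later = call_later    # e.g. reactor.callLater
        self._send_search = send_search  # hypothetical: sends the M-SEARCH
        self._on_done = on_done          # hypothetical: updates the device box
        self._window = window
        self.active = False
        self.devices = []

    def start(self):
        if self.active:
            return False  # a scan is already running; ignore the button press
        self.active = True
        self.devices = []
        self._send_search()
        # End only this scan after the window, not the whole reactor.
        self._call_later(self._window, self._finish)
        return True

    def device_found(self, device):
        if self.active:
            self.devices.append(device)

    def _finish(self):
        self.active = False
        self._on_done(list(self.devices))


def wire_to_twisted(send_search, on_done):
    # Imported lazily so the class above can be exercised without Twisted.
    from twisted.internet import reactor
    return ScanSession(reactor.callLater, send_search, on_done)
```

The button handler would then call start() on a single long-lived session, replacing the reactor.callLater(10, reactor.stop) line from the question.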

using topic exchange to send message from one method to another

Recently, I have been going through the Celery and Kombu documentation, as I need them integrated into one of my projects. I have a basic understanding of how this should work, but the documentation examples using different brokers have me confused.
Here is the scenario:
Within my application I have two views, ViewA and ViewB. Both of them do some expensive processing, so I wanted them to use Celery tasks for the processing. This is what I did:
views.py
def ViewA(request):
    tasks.do_task_a.apply_async(args=[a, b])

def ViewB(request):
    tasks.do_task_b.apply_async(args=[a, b])
tasks.py
@task()
def do_task_a(a, b):
    # Do something Expensive
    pass

@task()
def do_task_b(a, b):
    # Do something Expensive here too
    pass
So far, everything is working fine. The problem is that do_task_a creates a txt file on the system, which I need to use in do_task_b. Right now, in do_task_b I check whether the file exists and call the task's retry method if it does not.
I would rather take a different approach here (this is where messaging comes in): I want do_task_a to send a message to do_task_b once the file has been created, instead of looping through retry until the file exists.
I read through the Celery and Kombu documentation and updated my settings as follows:
BROKER_URL = "django://"
CELERY_RESULT_BACKEND = "database"
CELERY_RESULT_DBURI = "sqlite:///celery"
TASK_RETRY_DELAY = 30 #Define Time in Seconds
DATABASE_ROUTERS = ['portal.db_routers.CeleryRouter']
CELERY_QUEUES = (
    Queue('filecreation', exchange=exchanges.genex, routing_key='file.create'),
)
CELERY_ROUTES = ('celeryconf.routers.CeleryTaskRouter',)
And I am stuck here; I don't know where to go next.
What should I do to make do_task_a broadcast a message to do_task_b when the file is created? And what should I do to make do_task_b receive (consume) the message and carry on processing?
Any ideas and suggestions are welcome.
This is a good use case for Celery's callback/linking feature.
Celery supports linking tasks together so that one task follows another.
You can read more about it in the Celery documentation.
apply_async() has two optional arguments:
+link : execute the linked task on success
+link_error : execute the linked task on an error
@task
def add(a, b):
    return a + b

@task
def total(numbers):
    return sum(numbers)

@task
def error_handler(uuid):
    result = AsyncResult(uuid)
    exc = result.get(propagate=False)
    print('Task %r raised exception: %r\n%r' % (uuid, exc, result.traceback))
Now in your calling function do something like:
def main():
    # for error handling
    add.apply_async((2, 2), link_error=error_handler.subtask())
    # for linking 2 tasks
    add.apply_async((2, 2), link=add.subtask((8, )))
    # output 12
    # what you can do in your case is something like this
    if user_requires:
        add.apply_async((2, 2), link=add.subtask((8, )))
    else:
        add.apply_async((2, 2))
Hope this is helpful
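To make the link semantics concrete for the file case, here is a toy, synchronous model in plain Python. It is not Celery code: `apply_with_link` is a made-up helper that only mimics the one property that matters here, namely that the linked task receives the parent's return value as its first argument. It also assumes do_task_a is changed to return the path of the file it wrote.

```python
import os
import tempfile


def do_task_a(a, b):
    # Stand-in for the expensive work: write the txt file and return its path.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        handle.write("%s,%s" % (a, b))
    return path


def do_task_b(path, a, b):
    # Runs only after do_task_a succeeded, so the file is known to exist.
    with open(path) as handle:
        return handle.read() == "%s,%s" % (a, b)


def apply_with_link(task, args, link=None, link_args=()):
    """Toy model of apply_async(args, link=callback); runs everything inline."""
    result = task(*args)
    if link is not None:
        # The parent's result is passed as the first argument of the callback.
        return link(result, *link_args)
    return result


if __name__ == "__main__":
    print(apply_with_link(do_task_a, (1, 2), link=do_task_b, link_args=(1, 2)))
```

With the real tasks, the corresponding call would presumably be tasks.do_task_a.apply_async(args=[a, b], link=tasks.do_task_b.subtask((a, b))), with do_task_b taking the path as an extra first parameter, so the retry loop on file existence is no longer needed.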