MapReduce code mpi time out

I am using a MapReduce code and have run into a problem: the code stops responding after the map phase (which takes 1 hr) has finished. I dug into the code and found that this function doesn't respond:
def wait(self, running, tag):
    """Test if any worker has finished its job.
    If so, decrease its key and make it available
    """
    atimer = Timer('Wait')
    inittime = time()
    status = MPI.Status()
    while time() - inittime < self.config['jobwait']:
        if world.Iprobe(source=MPI.ANY_SOURCE,tag=tag,status=status):
            jobf = world.recv(source=status.source, tag=tag)
            idx = 0
            for ii, worker in enumerate(self.workers):
                if worker.id == status.source: idx = ii; break
            if self.config['verbosity'] >= 8:
                print('Freeing worker '+str(self.workers[idx].id))
            worker = self.workers[idx]
            # faulty worker's job has already been cleaned
            if not worker.isFaulty():
                del running[jobf]
            else:
                self.nActive += 1
            worker.setFree()
            heapq._siftup(self.workers, idx)
This line doesn't respond:
if world.Iprobe(source=MPI.ANY_SOURCE,tag=tag,status=status):
I wonder whether there is a timeout for Iprobe() in mpi4py, and how to set it. Is there an alternative to Iprobe() that plays the same role here?
Here is the preceding function, which sends the message via .send():
def execTask(self, task):
    """Wrapper function calling mapping/reducing/finalizing phase tasks,
    dispatch tasks to workers until all finished and collect feedback.
    Faulty workers are removed from active duty work list.
    """
    atimer = Timer(task)
    print( 'Entering {0:s} phase...'.format(task) )
    taskDict = { 'Map':(self.mapIn, MAP_START, MAP_FINISH), \
                 'Init':(self.mapIn, INIT_START, MAP_FINISH), \
                 'Reduce':(self.reduceIn, REDUCE_START, REDUCE_FINISH) }
    # line up jobs and workers into priority queues
    jobs = taskDict[task][0][:]
    heapq.heapify(jobs); running = {}
    heapq.heapify(self.workers)
    while (jobs or running) and self.nActive > 0:
        # dispatch all jobs to all free workers
        while jobs and self.workers[0].isFree():
            job = heapq.heappop(jobs)
            worker = heapq.heappop(self.workers)
            world.send(job, dest=worker.id, tag=taskDict[task][1])
            print('hi')
            print job
            worker.setBusy(); heapq.heappush(self.workers, worker)
            running[job] = (time(), worker)
            if self.config['verbosity'] >= 6:
                print('Dispatching file '+os.path.basename(job)+' to worker '+str(worker.id))
            # if no more free workers, break
            if not self.workers[0].isFree(): break
        # wait for finishing workers as well as do cleaning
        self.wait(running, taskDict[task][2])
        # print running
        self.clean(running, jobs)
    print( '{0:s} phase completed'.format(task) )
The whole code can be seen here:
https://drive.google.com/file/d/0B36fJi35SPIedWdjbW5NdzlCeTg/view?usp=sharing
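On the timeout question: Iprobe is MPI's non-blocking probe, so it returns immediately and has no timeout of its own; the while loop in wait() is what bounds the polling, via self.config['jobwait']. Below is a minimal sketch of that poll-with-deadline pattern. The names poll_until, probe and interval are hypothetical, and a stand-in callable replaces world.Iprobe so the sketch runs without MPI.

```python
import time

def poll_until(probe, timeout, interval=0.01):
    """Call probe() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)  # yield the CPU between polls instead of spinning
    return False

# Stand-in for world.Iprobe(...): reports a pending message on the third poll.
calls = {"n": 0}

def fake_probe():
    calls["n"] += 1
    return calls["n"] >= 3

found = poll_until(fake_probe, timeout=1.0)          # message arrives in time
timed_out = poll_until(lambda: False, timeout=0.05)  # nothing ever arrives
```

With mpi4py, the probe would presumably be something like lambda: world.Iprobe(source=MPI.ANY_SOURCE, tag=tag, status=status). If the program still hangs with a bounded loop like this, the blocking call may be elsewhere, for example in a recv or send that never finds its match.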

Related

Django celery task getting stuck on blocking the chain

I am using Celery with Python 3.5. I am using chain to run tasks serially, and I only want to process a single task at a time.
The following piece of code works fine:
@shared_task(name='dummy1')
def dummy1(val):
    logger.info('received: ' + val)
    with RedisLock('testing.pid',60*60):
        try:
            logger.info("Lock acquired by " + val)
            res = chain(dummy2.s(val), dummy3.s(val)).delay()
            logger.info("Lock released")
        except Exception as e:
            logger.exception(str(e))
@shared_task(name='dummy2')
def dummy2(val):
    logger.info('Dummy2 : ' + val)
    time.sleep(3)
    logger.info('Dummy2 complete')
    return None
@shared_task(name='dummy3')
def dummy3(prev, val):
    logger.info('Dummy3 : ' + val)
    time.sleep(3)
    logger.info('Dummy3 complete')
RedisLock is a distributed, cache-based lock for running a single task at a time.
However, when I do:
with RedisLock('testing.pid',60*60):
    try:
        logger.info("Lock acquired by " + val)
        res = chain(dummy2.s(val), dummy3.s(val)).delay()
        res.get() # blocking the process here
        logger.info("Lock released")
the chain does not work as expected and crashes very often.
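One plausible cause, offered as an assumption because the worker setup isn't shown: res.get() keeps the worker slot running dummy1 occupied until the chain finishes, and if no other slot is free to run dummy2 and dummy3, nothing can make progress. Celery's documentation advises against waiting synchronously on subtask results inside a task. The sketch below is only an analogy using Python's standard-library thread pool, not Celery itself; run_demo, parent and child are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def child():
    return "child done"

def parent(pool):
    # Submit a subtask to the same pool, then block waiting for its result.
    fut = pool.submit(child)
    try:
        return fut.result(timeout=0.5)
    except TimeoutError:
        # No free worker could pick up the child while we were blocking.
        return "starved"

def run_demo(workers):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return pool.submit(parent, pool).result()

one_worker = run_demo(1)   # parent holds the only worker, child never starts in time
two_workers = run_demo(2)  # a second worker is free to run the child
```

In the first version of dummy1 the chain is only dispatched, so the task returns at once and its slot is freed; that would be consistent with it working while the blocking version does not.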

What does the "throughput-deadline-time" configuration option do?

I've stumbled on the throughput-deadline-time configuration property for Akka dispatchers, and it looks like an interesting option, however the only mention of it I could find in the whole documentation is the following:
# Throughput deadline for Dispatcher, set to 0 or negative for no deadline
throughput-deadline-time = 0ms
I think we can agree that this is not very helpful.
So what does throughput-deadline-time control, and what impact does it have when set on my dispatcher?

So I had a look at the Akka source code, and found this method in the Mailbox that seems to implement the behavior of throughput-deadline-time:
/**
 * Process the messages in the mailbox
 */
@tailrec private final def processMailbox(
  left: Int = java.lang.Math.max(dispatcher.throughput, 1),
  deadlineNs: Long = if (dispatcher.isThroughputDeadlineTimeDefined == true) System.nanoTime + dispatcher.throughputDeadlineTime.toNanos else 0L): Unit =
  if (shouldProcessMessage) {
    val next = dequeue()
    if (next ne null) {
      if (Mailbox.debug) println(actor.self + " processing message " + next)
      actor invoke next
      if (Thread.interrupted())
        throw new InterruptedException("Interrupted while processing actor messages")
      processAllSystemMessages()
      if ((left > 1) && ((dispatcher.isThroughputDeadlineTimeDefined == false) || (System.nanoTime - deadlineNs) < 0))
        processMailbox(left - 1, deadlineNs)
    }
  }
This piece of code makes it clear: throughput-deadline-time configures the maximum amount of time that will be spent processing the same mailbox, before switching to the mailbox of another actor.
In other words, if you configure a dispatcher with:
my-dispatcher {
  throughput = 100
  throughput-deadline-time = 1ms
}
Then an actor's mailbox will process at most 100 messages in one go, for at most 1ms. Whichever of those limits is hit first, Akka then switches to another actor/mailbox. Note that the deadline is only checked between messages, so a single slow message can still overrun it.
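The same loop can be modelled outside Akka. The Python below is a simplified sketch of the logic in processMailbox, not Akka code: process_batch, handle and the injectable clock are hypothetical names, and system messages and interruption handling are left out.

```python
import time
from collections import deque

def process_batch(mailbox, handle, throughput, deadline_s=0.0, clock=time.monotonic):
    """Process messages until the throughput budget or the deadline is used up.

    Returns the number of messages handled in this batch.
    """
    left = max(throughput, 1)
    # A deadline of 0 or less means "no deadline", as in the config comment.
    deadline = clock() + deadline_s if deadline_s > 0 else None
    processed = 0
    while mailbox:
        handle(mailbox.popleft())
        processed += 1
        out_of_budget = left <= 1
        # The deadline is only checked after a message has been handled.
        past_deadline = deadline is not None and clock() >= deadline
        if out_of_budget or past_deadline:
            break
        left -= 1
    return processed

# Limited by throughput: only 3 of the 10 queued messages are handled.
seen = []
box = deque(range(10))
count_by_throughput = process_batch(box, seen.append, throughput=3)

# Limited by the deadline: a fake clock advances one "second" per call.
ticks = iter(range(1000))
box2 = deque(range(10))
count_by_deadline = process_batch(box2, lambda m: None, throughput=100,
                                  deadline_s=2.5, clock=lambda: next(ticks))
```

In the second call the throughput budget of 100 is never reached; the batch ends because the fake clock passes the 2.5 deadline after the third message.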

Karate - How to delay all scenarios?

I have 10 scenarios, and all of them need a 1 min delay after the background has executed. I call my delay function in the background. The problem is that every scenario runs the background, so I have to wait 10 minutes in total.
Is there a way to call my wait function once for all scenarios?
This is my background and one of my scenarios:
Background:
  * call read('classpath:cleanup.feature')
  * def login = call read('classpath:init/init.user.feature')
  * def sleep =
    """
    function(seconds){
      for(i = 0; i <= seconds; i++)
      {
        java.lang.Thread.sleep(1*1000);
        karate.log(i);
      }
    }
    """
  * call sleep 60
Scenario: Correct
  # Step one: requesting a verification code
  Given url karate.get('urlBase') + "account/resendMobileActivationVerificationCode"
  And request {"mobile": #(defaultMobile)}
  And header X-Authorization = login.token
  And header NESBA-Authorization = login.nesba
  When method post
  Then status 200
  And match response ==
    """
    {
      "status":0,
      "message":"#(status0persianMessage)",
      "result": true
    }
    """

Use callonce, which runs the call only once per feature instead of before every scenario:
* callonce sleep 60

Python Twisted sending large a file across network

I am trying to send a file across the network using Twisted with the LineReceiver protocol. The issue I am seeing is that when I read a binary file and try to send the chunks they simply don't send.
I am reading the file using:
import json
import time
import threading
from twisted.internet import reactor, threads
from twisted.protocols.basic import LineReceiver
from twisted.internet import protocol
MaximumMsgSize = 15500
trySend = True
connectionToServer = None
class ClientInterfaceFactory(protocol.Factory):
    def buildProtocol(self, addr):
        return WoosterInterfaceProtocol(self._msgProcessor, self._logger)
class ClientInterfaceProtocol(LineReceiver):
    def connectionMade(self):
        connectionToServer = self
    def _DecodeMessage(self, rawMsg):
        header, body = json.loads(rawMsg)
        return (header, json.loads(body))
    def ProcessIncomingMsg(self, rawMsg, connObject):
        # Decode raw message.
        decodedMsg = self._DecodeMessage(rawMsg)
        self.ProccessTransmitJobToNode(decodedMsg, connObject)
    def _BuildMessage(self, id, msgBody = {}):
        msgs = []
        fullMsgBody = json.dumps(msgBody)
        msgBodyLength = len(fullMsgBody)
        totalParts = 1 if msgBodyLength <= MaximumMsgSize else \
            int(math.ceil(msgBodyLength / MaximumMsgSize))
        startPoint = 0
        msgBodyPos = 0
        for partNo in range(totalParts):
            msgBodyPos = (partNo + 1) * MaximumMsgSize
            header = {'ID' : id, 'MsgParts' : totalParts,
                'MsgPart' : partNo }
            msg = (header, fullMsgBody[startPoint:msgBodyPos])
            jsonMsg = json.dumps(msg)
            msgs.append(jsonMsg)
            startPoint = msgBodyPos
        return (msgs, '')
    def ProccessTransmitJobToNode(self, msg, connection):
        rootDir = '../documentation/configs/Wooster'
        exportedFiles = ['consoleLog.txt', 'blob.dat']
        params = {
            'Status' : 'buildStatus',
            'TaskID' : 'taskID',
            'Name' : 'taskName',
            'Exports' : len(exportedFiles),
        }
        msg, statusStr = self._BuildMessage(101, params)
        connection.sendLine(msg[0])
        for filename in exportedFiles:
            with open (filename, "rb") as exportFileHandle:
                data = exportFileHandle.read().encode('base64')
            params = {
                ExportFileToMaster_Tag.TaskID : taskID,
                ExportFileToMaster_Tag.FileContents : data,
                ExportFileToMaster_Tag.Filename : filename
            }
            msgs, _ = self._BuildMessage(MsgID.ExportFileToMaster, params)
            for m in msgs:
                connection.sendLine(m)
    def lineReceived(self, data):
        threads.deferToThread(self.ProcessIncomingMsg, data, self)
def ConnectFailed(reason):
    print 'Connection failed..'
    reactor.callLater(20, reactor.callFromThread, ConnectToServer)
def ConnectToServer():
    print 'Connecting...'
    from twisted.internet.endpoints import TCP4ClientEndpoint
    endpoint = TCP4ClientEndpoint(reactor, 'localhost', 8181)
    deferItem = endpoint.connect(factory)
    deferItem.addErrback(ConnectFailed)
netThread = threading.Thread(target=reactor.run, kwargs={"installSignalHandlers": False})
netThread.start()
reactor.callFromThread(ConnectToServer)
factory = ClientInterfaceFactory()
protocol = ClientInterfaceProtocol()
while 1:
    time.sleep(0.01)
    if connectionToServer == None: continue
    if trySend == True:
        protocol.ProccessTransmitJobToNode(None, None)
        trySend = False
Is there something I am doing wrong? If a single write occurs then the file is sent; it's when the write is multi-part, or there is more than one file, that it struggles.
Note: I have updated the question with a crude piece of sample code in the hope it makes sense.

_BuildMessage returns a two-tuple: (msgs, '').
Your network code iterates over this:
msgs = self._BuildMessage(MsgID.ExportFileToMaster, params)
for m in msgs:
So your network code first tries to send a list of json encoded data and then tries to send the empty string. It most likely raises an exception because you cannot send a list of anything using sendLine. If you aren't seeing the exception, you've forgotten to enable logging. You should always enable logging so you can see any exceptions that occur.
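The effect of looping over the returned tuple directly can be seen in isolation. This is a small sketch with a hypothetical build_message stand-in that only mimics the (msgs, '') return shape of _BuildMessage; it involves no Twisted code.

```python
def build_message(parts):
    # Hypothetical stand-in for _BuildMessage: returns (list_of_messages, status_string).
    return (list(parts), '')

result = build_message(['{"part": 0}', '{"part": 1}'])

# Iterating the tuple itself yields the whole list first, then the empty string.
wrong = [m for m in result]

# Unpacking first, then iterating, yields the individual JSON strings.
msgs, _status = result
right = [m for m in msgs]
```

Here wrong holds two items, a list and an empty string, while right holds the two JSON strings that would be suitable arguments for sendLine.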
Also, you're using time.sleep and you shouldn't do this in a Twisted-based program. If you're doing this to try to avoid overloading the receiver, you should use TCP's native backpressure instead by registering a producer which can receive pause and resume notifications. Regardless, time.sleep (and your loop over all the data) will block the entire reactor thread and prevent any progress from being made. The consequence is that most of the data will be buffered locally before being sent.
Also, your code calls LineReceiver.sendLine from a non-reactor thread. This has undefined results but you can probably count on it to not work.
This loop runs in the main thread:
while 1:
    time.sleep(0.01)
    if connectionToServer == None: continue
    if trySend == True:
        protocol.ProccessTransmitJobToNode(None, None)
        trySend = False
while the reactor runs in another thread:
netThread = threading.Thread(target=reactor.run, kwargs={"installSignalHandlers": False})
netThread.start()
ProcessTransmitJobToNode simply calls self.sendLine:
def ProccessTransmitJobToNode(self, msg, connection):
    rootDir = '../documentation/configs/Wooster'
    exportedFiles = ['consoleLog.txt', 'blob.dat']
    params = {
        'Status' : 'buildStatus',
        'TaskID' : 'taskID',
        'Name' : 'taskName',
        'Exports' : len(exportedFiles),
    }
    msg, statusStr = self._BuildMessage(101, params)
    connection.sendLine(msg[0])
You should probably remove the use of threading entirely from the application. Time-based events are better managed using reactor.callLater (your main-thread loop effectively generates a call to ProccessTransmitJobToNode one hundred times a second, modulo the effects of the trySend flag).
You may also want to take a look at https://github.com/twisted/tubes as a better way to manage large amounts of data with Twisted.

Terminating a flailing akka actor

I have the following actor setup, using Akka actors (2.10)
A -spawn-> B -spawn-> C
A -sendWork-> B -sendWork-> C
C -sendResults-> A (repeatedly)
At some point, A notices that it should change the workload sent to B/C because C is sending a large number of messages that turn out to be useless. In such situations, however, C's inbox seems to be very full, and/or C may be blocked.
How can A tell B to shut down C immediately? Losing the state and messages of both B and C is acceptable, so destroying them and spawning new ones is an option.

Given the actors are started the way you described, then using stop in the right way will do what you require. According to the docs, calling stop will both:
1) stop additional messages from going into the mailbox (sent to deadletter)
2) take the current contents of the mailbox and also ship that to deadletter (although this is based on mailbox impl, but the point is they won't be processed)
Note that the actor still needs to finish the message it's currently processing before it's fully stopped, so if it's "stuck", stopping (or anything else, for that matter) won't fix that, but I don't think that's the situation you are describing.
I pulled a little code sample together to demonstrate. Basically, A will send a message to B to start sending work to C. B will flood C with some work and C will send the results of that work back to A. When a certain number of responses have been received by A, it will trigger a stop of B and C by stopping B. When B is completely stopped, A starts the process over again, up to 2 times in total, after which it stops itself. The code looks like this:
case object StartWork
case class DoWork(i:Int, a:ActorRef)
case class WorkResults(i:Int)
class ActorA extends Actor{
  import context._
  var responseCount = 0
  var restarts = 0
  def receive = startingWork
  def startingWork:Receive = {
    case sw @ StartWork =>
      val myb = actorOf(Props[ActorB])
      myb ! sw
      become(waitingForResponses(myb))
  }
  def waitingForResponses(myb:ActorRef):Receive = {
    case WorkResults(i) =>
      println(s"Got back work results: $i")
      responseCount += 1
      if (responseCount > 200){
        println("Got too many responses, terminating children and starting again")
        watch(myb)
        stop(myb)
        become(waitingForDeath)
      }
  }
  def waitingForDeath:Receive = {
    case Terminated(ref) =>
      restarts += 1
      if (restarts <= 2){
        println("children terminated, starting work again")
        responseCount = 0
        become(startingWork)
        self ! StartWork
      }
      else{
        println("too many restarts, stopping self")
        context.stop(self)
      }
  }
}
class ActorB extends Actor{
  import concurrent.duration._
  import context._
  var sched:Option[Cancellable] = None
  override def postStop = {
    println("stopping b")
    sched foreach (_.cancel)
  }
  def receive = starting
  def starting:Receive = {
    case sw @ StartWork =>
      val myc = context.actorOf(Props[ActorC])
      sched = Some(context.system.scheduler.schedule(1 second, 1 second, self, "tick"))
      become(sendingWork(myc, sender))
  }
  def sendingWork(myc:ActorRef, a:ActorRef):Receive = {
    case "tick" =>
      for(j <- 1 until 1000) myc ! DoWork(j, a)
  }
}
class ActorC extends Actor{
  override def postStop = {
    println("stopping c")
  }
  def receive = {
    case DoWork(i, a) =>
      a ! WorkResults(i)
  }
}
It's a little rough around the edges, but it should show the point that cascading the stop from B through to C will stop C from sending responses back to A even though it still had messages in the mailbox. I hope this is what you were looking for.
