Co-ordinating batch processing in Akka

I need to batch process 2 large files one after the other in Akka and I'm trying to figure out the best way to co-ordinate that in a controlling actor. The lines in each file can be processed in parallel but all of the lines from the first file must be processed before any of the lines from the second file can be processed.
I was thinking of having the following actors:
File1WorkerActor - Processes a single line from the first file.
File2WorkerActor - Processes a single line from the second file.
File1Actor - Delegates the lines from the first file to multiple worker actors.
File2Actor - Delegates the lines from the second file to multiple worker actors.
TopLevelActor - Asks File1Actor to process file 1, waits for it to complete then asks File2Actor to process file 2.
The thing I'm not sure about is: how do the file actors know when all the workers have finished, and how does the TopLevelActor know when File1Actor is finished?
I was thinking that the FileActor would just hold a counter for the number of lines in a given file and the workers would send a message back for each processed line. After the counter counts down it would send a message back to TopLevelActor. Is there any problem with this approach? Or would it be better to implement some kind of Future handling?

Your solution sounds correct to me. You may also find it interesting to look at FSM and/or the become/unbecome functionality to avoid submitting another task to the workers while the previous task is not yet complete.

Related

log4cpp stops working properly after some time

I have a log4cpp implementation in a multi-process environment. The logger is configured once during initialization and then shared among forked processes which serve HTTP requests.
During the first minute or so, I see the logs roll perfectly fine at the query-per-second load (say it runs at 100 QPS).
After that, logging slows down dramatically. So I logged the pid as well and noticed that only one process gets to write to the log for a while (around 10-15 seconds), then another process starts writing, and so on. The processes don't die; they just don't get a chance to write.
This is different from what happens when the server starts. At that time, every other log line is written by a different process. (Also, I write one log line per process at the end of serving each request.)
At this point, I can't think of what could be going wrong.
This is how my log4cpp conf file looks:
log4cpp.rootCategory=DEBUG,rootAppender
log4cpp.appender.rootAppender=org.apache.log4cpp.RollingFileAppender
log4cpp.appender.rootAppender.fileName=/tmp/mylogfile.log
log4cpp.appender.rootAppender.layout=org.apache.log4cpp.PatternLayout
log4cpp.appender.rootAppender.layout.ConversionPattern=%d|%p|%m%n
log4cpp.category.http.server.main=INFO,MAIN
log4cpp.additivity.http.server.main=false
log4cpp.appender.MAIN=org.apache.log4cpp.RollingFileAppender
log4cpp.appender.MAIN.maxBackupIndex=10
log4cpp.appender.MAIN.maxFileAge=1
log4cpp.appender.MAIN.append=true
log4cpp.appender.MAIN.fileName=/tmp/mylogfile.log
log4cpp.appender.MAIN.layout=org.apache.log4cpp.PatternLayout
log4cpp.appender.MAIN.layout.ConversionPattern=%d|%p|%m%n
Edit (more updates): Thanks @Botje for your time.
I see that whenever a new child process is created, it is only that process that gets to write to the log. That tells me that the references the other processes were holding have become invalid.
I also tried setting the additivity property to true. With that, the server starts out properly writing into /tmp/myfile.log, then switches to writing into /tmp/myfile.log.1 within a minute, and then stops writing after another minute.
At that point the logs get directed to stderr, which is redirected to another log file.
Also, I did notice that the log4cpp FileAppender uses seek to determine the file size before writing log entries. If the file handle is shared between processes, that will cause writes to end up at the start of the file instead of the end. Even if you fix that, you still have multiple processes that think they are in charge of log file rotation.
I suggest you have all processes write to a common UDP/TCP/Unix socket and designate one process that collects all log entries and actually writes them to a file. You don't have to reinvent the wheel: you can use the syslog protocol and either the system syslog or a copy running in userspace.
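For example, a minimal sketch of the syslog route using the POSIX syslog(3) API (the program name, facility, and message are made up for the example); each forked worker just calls syslog(), and the syslog daemon serializes the writes and owns rotation:

#include <syslog.h>
#include <unistd.h>

int main() {
    // One-time setup; LOG_PID stamps each entry with the writing process's pid.
    openlog("httpserver", LOG_PID, LOG_LOCAL0);

    if (fork() == 0) {
        // Child worker: no shared file handle, no seek, no rotation logic here.
        syslog(LOG_INFO, "served request in %d ms", 42);
        _exit(0);
    }

    syslog(LOG_INFO, "parent still logging normally");
    closelog();
    return 0;
}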

Threading - The fastest way to handle recurring threads?

I am writing my first threaded application for an industrial machine that has a very fast line speed. I am using MFC for the UI, and once the user pushes the machine's "Start" button, I need to be executing three operations simultaneously: collecting data, processing it, and outputting results very quickly, as well as checking to see whether the user has turned the machine "off". When I say very quickly, I expect the analyze portion of the execution to take the longest, and it needs to happen in well under a second. I am mostly concerned about eliminating the overhead associated with threads. What is the fastest way to implement the loop below?
void Scanner(CString& m_StartStop) {
    std::thread Collect(CollectData);
    while (m_StartStop == "Start") {
        Collect.join();
        std::thread Analyze(AnalyzeData);
        std::thread Collect(CollectData);
        Analyze.join();
        std::thread Send(SendData);
        Send.join();
    }
}
I realize this sample is likely way off base, but hopefully it gets the point across. Should I be creating three threads and suspending them instead of creating and joining them over and over? Also, I am a little unclear on whether the UI needs its own thread, since the user needs to be able to pause or stop the line at any time.
In case anyone is wondering why this needs to be threaded as opposed to sequential, the answer is that the line speed of the machine means I need to be collecting data for the second part while the first part is being analyzed. Every second equates to 3 ft of linear part movement down this machine.
Think about the functional problem before thinking about the implementation.
So we have a continuous flow of data that needs to be collected, analyzed and sent elsewhere, with a supervision point to be able to stop or pause the process.
- collection should be limited by the input flow
- analysis should only be CPU limited
- sending should be I/O bound
You just need to make sure that the slowest part is the collection.
That is a correct use case for threads. The implementation could use:
- a pool of input buffers that is filled by the collect task and used by the analyze task
- one thread that continuously:
  - checks whether it should exit (a dedicated variable)
  - takes an input object from the pool
  - fills it with data
  - passes it to the analyze task
- one thread that continuously:
  - waits for the first of: an input object from the collect task, or a request to exit
  - analyzes the object and prepares the output
  - sends the output
Optionally, you can have a separate thread for processing the output. In that case, the last line above becomes:
  - passes an output object to the sending task
and we must add:
- one thread that continuously:
  - waits for the first of: an output object from the analyze task, or a request to exit
  - sends the output
And you must provide a way to signal the request for pause or exit, either with a completely external program and a signalling mechanism, or with a GUI thread.
Any threads you need should already be running, waiting for work. You should not create or join threads.
If job A has to finish before job B can start, the completion of job A should trigger the start of job B. That is, when the thread doing job A finishes, it should either do job B itself or trigger the dispatch of job B. There shouldn't need to be some other thread waiting for job A to finish so that it can start job B.
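A minimal sketch of the structure both answers describe, using std::thread and condition variables rather than MFC primitives. Sample, Result, and the three stage functions are placeholders for the real machine I/O, and the timings are invented; three long-lived threads are connected by queues, and nothing is created or joined per part.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct Sample {};   // raw data for one part (placeholder)
struct Result {};   // analysis output for one part (placeholder)

// Simple unbounded thread-safe queue connecting two stages.
template <typename T>
class Channel {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    // Returns false once 'stop' is set and the queue has been drained.
    bool pop(T& out, const std::atomic<bool>& stop) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || stop; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
    void wake_all() { cv_.notify_all(); }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

// Placeholder stage functions -- stand-ins for the real collect/analyze/send code.
Sample CollectOne() {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));  // simulate the line rate
    return Sample{};
}
Result Analyze(const Sample&) { return Result{}; }
void   Send(const Result&)    {}

int main() {
    std::atomic<bool> stop{false};
    Channel<Sample> collected;
    Channel<Result> analyzed;

    std::thread collector([&] {                      // paced by the input flow
        while (!stop) collected.push(CollectOne());
        collected.wake_all();                        // let the analyzer drain and exit
    });
    std::thread analyzer([&] {                       // CPU-bound stage
        Sample s;
        while (collected.pop(s, stop)) analyzed.push(Analyze(s));
        analyzed.wake_all();
    });
    std::thread sender([&] {                         // I/O-bound stage
        Result r;
        while (analyzed.pop(r, stop)) Send(r);
    });

    // In the real application the UI thread would set 'stop' when the user presses Stop.
    std::this_thread::sleep_for(std::chrono::seconds(1));
    stop = true;
    collector.join();
    analyzer.join();
    sender.join();
}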

How to add callbacks to write a file while subscribing

This is probably a question about python callbacks as much as using pika. I'm trying to develop some code that subscribes to a queue in RabbitMQ, processes the payload of any delivered message and then write that payload to a series of (disk) files. So using the simple "Hello World" example at http://www.rabbitmq.com/tutorials/tutorial-one-python.html, I've added in logic to the callback function (that is co-incidentally called "callback") to write any received message payloads to a file.
Here's the main problem: I want to write some additional code so that, once a certain time period has elapsed, for example 300 s (5 min), the process closes the current file, creates a new one, and writes any subsequent messages to that. And so on...
BUT - the issue as I see it is that the callback function ONLY gets called when a message arrives in the queue. I think I need some process outside of that callback function that measures elapsed time...
The rationale is that I want to create a set of disk files (all with unique names based on a timestamp) that contain the messages received from the MQ queue. If messages are slow in coming, then I close the currently open file (so it can be processed further downstream) and open up another.
I also notice that after issuing the start-consuming call (channel.start_consuming) no code below it is reached. Why is that?
I've played around with python's multiprocessing module but no luck so far.
Here's some skeleton code with pseudo-code comments:
#!/usr/bin/env python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

channel.queue_declare(queue='hello')

print ' [*] Waiting for messages. To exit press CTRL+C'

def callback(ch, method, properties, body):
    print " [x] Received %r" % (body,)
    # want to put code here to write message payloads to a file (unique name)
    # if n secs have elapsed then close the file and create a new file

channel.basic_consume(callback, queue='hello', no_ack=True)
channel.start_consuming()
Thanks !
It might be worth taking a look at an alternative implementation to Pika. As Pika is blocking by nature, it is difficult to create something like this with it. You would essentially need another thread to watch the IO, to see whether anything has been written within the last five minutes, and otherwise close the file.
You could also keep a timestamp, and once you get a new callback, if enough time has passed, close the file and create a new one. This would, however, keep a file open for longer durations, but it prevents any file from holding more than five minutes' worth of data.
However, I would recommend that you take a look at Puka instead. It is a non-blocking alternative to Pika that would let you implement a solution to your problem more easily.

update a file simultaneously without locking file

Problem: multiple processes want to update a file simultaneously. I do not want to use file locking, because in a highly loaded environment a process may block for a while, which I don't want. I want something like this: all processes send their data to a queue or some shared place, and one master process keeps taking data from there and writing it to the file, so that no process gets blocked.
One possibility is socket programming: all the processes send data to a single port, and the master keeps listening on this port and stores the data to the file. But what if the master goes down for a few seconds? If that happens, I could write to some file based on a timestamp and sync later, but I am putting this on hold and looking for some other solution (no data loss).
Another possibility may be taking a lock on the particular segment of the file the process wants to write to. Basically each process writes a line. I am not sure how well that would work for a highly loaded system.
Please suggest a solution for this problem.
Have a 0mq instance handle the writes (as you initially proposed for the socket) and have the workers connect to it and add their writes to the queue (there are examples in many languages).
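A rough sketch of that idea using the plain libzmq C API (the ipc endpoint and file path are invented for the example): workers PUSH lines, and a single collector PULLs them and is the only process that ever touches the file.

#include <zmq.h>
#include <cstdio>
#include <cstring>
#include <unistd.h>

// Collector process: the only writer of the file.
int collector() {
    void* ctx  = zmq_ctx_new();
    void* pull = zmq_socket(ctx, ZMQ_PULL);
    zmq_bind(pull, "ipc:///tmp/file-writer");            // workers connect here

    FILE* out = std::fopen("/tmp/shared.data", "a");
    char buf[4096];
    for (;;) {                                            // runs until interrupted
        int n = zmq_recv(pull, buf, sizeof(buf), 0);      // blocks until a line arrives
        if (n < 0) break;
        if (n > (int)sizeof(buf)) n = sizeof(buf);        // zmq_recv reports truncated messages
        std::fwrite(buf, 1, n, out);
        std::fputc('\n', out);
        std::fflush(out);
    }
    std::fclose(out);
    zmq_close(pull);
    zmq_ctx_destroy(ctx);
    return 0;
}

// Any worker process: queues its line instead of writing the file itself.
void send_line(const char* line) {
    void* ctx  = zmq_ctx_new();
    void* push = zmq_socket(ctx, ZMQ_PUSH);
    zmq_connect(push, "ipc:///tmp/file-writer");
    zmq_send(push, line, std::strlen(line), 0);           // never blocks on file I/O
    zmq_close(push);
    zmq_ctx_destroy(ctx);
}

int main() {
    if (fork() == 0) {                                    // demo: one worker process
        send_line("hello from a worker");
        return 0;
    }
    return collector();
}

A nice property here for the "what if the master goes down for a few seconds" worry: the workers' PUSH sockets buffer outgoing messages (up to their high-water mark) and deliver them once the collector comes back.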
Each process can write to its own file (pid.temp) and periodically rename that file (pid-0.data, pid-1.data, ...) so that a master process can grab all of these files.
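A small sketch of that approach (the file names, line content, and publish interval are arbitrary): each process appends to its private pid.temp and periodically renames it so the master can collect the finished chunks.

#include <unistd.h>    // getpid, sleep
#include <cstdio>      // std::rename
#include <fstream>
#include <string>

int main() {
    const std::string tmp = std::to_string(getpid()) + ".temp";

    for (int generation = 0; generation < 3; ++generation) {
        {
            std::ofstream out(tmp, std::ios::app);
            out << "a line of data from pid " << getpid() << "\n";  // never contended
        }   // stream closed before the rename below

        // Hand the finished chunk over to the master, e.g. 1234-0.data, 1234-1.data, ...
        const std::string published =
            std::to_string(getpid()) + "-" + std::to_string(generation) + ".data";
        std::rename(tmp.c_str(), published.c_str());    // atomic on the same filesystem
        sleep(5);                                        // made-up publish interval
    }
}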
You may not need to construct something like this. If you do not want processes to get blocked, just use the LOCK_NB flag of Perl's flock. Periodically try to flock: if it does not succeed, continue processing and store the values in an array; once the file can be locked, write the buffered data to it from the array.
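The answer refers to Perl's flock, but the same non-blocking pattern in C/C++ looks roughly like this (the file name and buffering strategy are made up for the sketch):

#include <sys/file.h>   // flock, LOCK_EX, LOCK_NB, LOCK_UN
#include <fcntl.h>      // open
#include <unistd.h>     // write, close
#include <string>
#include <vector>

static std::vector<std::string> pending;    // lines we could not write yet

void write_or_buffer(const std::string& line) {
    pending.push_back(line + "\n");
    int fd = open("/tmp/shared.data", O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0) return;                                // keep buffering, try again later
    if (flock(fd, LOCK_EX | LOCK_NB) == 0) {           // got the lock without blocking
        for (const std::string& l : pending)
            (void)write(fd, l.c_str(), l.size());
        pending.clear();
        flock(fd, LOCK_UN);
    }                                                   // else EWOULDBLOCK: keep buffering
    close(fd);
}

int main() { write_or_buffer("hello"); }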

Multithreading – known similarity solution

I am looking for a known solution (like the producer-consumer problem) for this situation.
In my case there are two kinds of items:
- a link to an image,
- a text file with links to images and links to other text files (which in turn contain more links).
I'm trying to create a multi-threaded downloader in C++ (on Unix) using a POSIX mutex and a POSIX semaphore.
The application has a link to the first text file.
- The threads sleep (semaphore = 0).
- The main thread downloads the first text file.
- It parses it for other links and puts the links in a queue (semaphore += links_count, so the other threads wake up).
- The other threads produce further links.
What should the main thread do? How do I detect that the other threads have finished?
With a finite queue there can be a deadlock: a text file may contain many links (the queue is already full of other text files), so no text file can ever be finished.
Thank you for your ideas.
Well, your problem is still kind of a producer/consumer problem, but your consumers are also producers. Some ways to deal with it:
Do not limit your queue size. Simply fail when your process runs out of memory. Not very elegant, but it will probably work in 99.99% of all download scenarios (assuming 100 bytes per download link on average and about 2 GB of available memory, you would have to store more than 20 million links in your queue before running out of memory).
Split your producer and consumer by using the hard drive as a buffer. Download files into a temporary folder. Have a thread watch that folder for new files. Once a new file appears, parse it and put the items in the consumer queue. Once the file has finished parsing, move it to the final download location. This way you are only limited by disk space, and your producer (the parser) is a different thread from your consumers (the downloaders).
Edit
You can wait on your worker threads with pthread_join in the main thread.
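Putting these points together, here is a minimal sketch: an unbounded link queue, a counter of links still in flight so the workers can tell when everything is finished, and the main thread simply joining the workers at the end. The start URL and the download/parse helpers are placeholders, and it uses std::thread and std::mutex instead of raw POSIX mutexes/semaphores, but the structure is the same.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

class LinkQueue {
public:
    void push(const std::string& link) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(link); }
        cv_.notify_one();
    }
    // Blocks for the next link; returns false when the queue is empty
    // and no worker is still processing anything (i.e. all work is done).
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || in_flight_ == 0; });
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        ++in_flight_;
        return true;
    }
    // Call after the popped link (and every link it produced) has been handled.
    void finished_one() {
        std::lock_guard<std::mutex> lk(m_);
        if (--in_flight_ == 0 && q_.empty()) cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    int in_flight_ = 0;
};

// Placeholder helpers -- assumptions for the sketch, not a real implementation.
bool is_text_file(const std::string& link) { return link.find(".txt") != std::string::npos; }
std::vector<std::string> download_and_parse(const std::string&) { return {}; }
void download_image(const std::string&) {}

void worker(LinkQueue& q) {
    std::string link;
    while (q.pop(link)) {
        if (is_text_file(link)) {
            // Push newly discovered links BEFORE marking this one finished,
            // so the "all done" condition cannot trigger too early.
            for (const std::string& child : download_and_parse(link))
                q.push(child);
        } else {
            download_image(link);
        }
        q.finished_one();
    }
}

int main() {
    LinkQueue q;
    q.push("http://example.com/start.txt");        // the first text file
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker, std::ref(q));
    for (std::thread& t : pool) t.join();          // the pthread_join equivalent
}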