How to add callbacks to write a file while subscribing - python-2.7

This is probably a question about Python callbacks as much as about using pika. I'm trying to develop code that subscribes to a queue in RabbitMQ, processes the payload of any delivered message, and then writes that payload to a series of (disk) files. So, starting from the simple "Hello World" example at http://www.rabbitmq.com/tutorials/tutorial-one-python.html, I've added logic to the callback function (which happens to be called "callback") to write any received message payloads to a file.
Here's the main problem: I want to write some additional code so that once a certain time period has elapsed, for example 300 seconds (5 minutes), the process closes the current file, creates a new one, and writes any subsequent messages to that. And so on ...
BUT - the issue as I see it is that the callback function ONLY gets called when a message arrives in the queue. I think I need some process outside that callback function that measures the elapsed time ...
The rationale is that I want to create a set of disk files (each with a unique timestamp-based name) containing the messages received from the queue. If messages are slow in coming, I close the currently open file (so it can be processed further downstream) and open a new one.
I also notice that no code placed after the channel.start_consuming() call is ever reached - why?
I've played around with Python's multiprocessing module, but no luck so far.
Here's some skeleton code with pseudo-code comments:
#!/usr/bin/env python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(
    host='localhost'))
channel = connection.channel()

channel.queue_declare(queue='hello')

print ' [*] Waiting for messages. To exit press CTRL+C'

def callback(ch, method, properties, body):
    print " [x] Received %r" % (body,)
    # want to put code here to write message payloads to a file (unique name)
    # if n secs have elapsed then close the file and create a new file

channel.basic_consume(callback, queue='hello', no_ack=True)
channel.start_consuming()
Thanks!

It might be worth taking a look at an alternative implementation to Pika. As Pika is blocking by nature, it makes it difficult to create something like this: you would essentially need another thread to watch the file I/O and close the file if nothing has been written to it within the last five minutes.
You could also keep a timestamp, and once you get a new callback, close the file and create a new one if enough time has passed. This would keep files open for longer durations, but it prevents any one file from holding more than five minutes' worth of data.
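For example, a minimal sketch of that timestamp approach, using the pika 0.x callback signature from the question; the 300-second window and the file-naming scheme are just illustrative:

import time

ROTATE_AFTER = 300  # close and replace the file after 5 minutes

current_file = None
opened_at = 0

def open_new_file():
    global current_file, opened_at
    if current_file is not None:
        current_file.close()  # the finished file can now go downstream
    opened_at = time.time()
    current_file = open('messages-%d.log' % int(opened_at), 'w')

def callback(ch, method, properties, body):
    # Rotation is only checked here, so a quiet queue leaves the last
    # file open until the next message arrives (the caveat noted above).
    if current_file is None or time.time() - opened_at >= ROTATE_AFTER:
        open_new_file()
    current_file.write(body + '\n')

You would plug this callback into the existing basic_consume call from the question.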
However, I would recommend that you take a look at Puka instead. It is a non-blocking alternative to Pika that would make it easier to implement a solution to your problem.

Related

Launching all threads at exactly the same time in C++

I have a rosbag file which contains messages on various topics, each topic with its own frequency. This data was captured from a hardware device streaming data, and data from all topics would arrive at the same time to be used by different algorithms.
I wish to simulate this using the rosbag file (think of every topic as having an associated array of data), and it is imperative that the data-streaming processes start at the same time so that the data stays in sync.
I do this by launching different publishers on different threads (I am open to other approaches as well; this was the only one I could think of), but the threads do not start at the same time: by the time thread 3 starts, thread 1 is considerably ahead.
How may I achieve this?
Edit - I understand that launching at exactly the same time is not possible, but maybe I can get away with launches that are extremely close to each other. Is there any way to ensure this?
Edit2 - Since the main aim is to keep the data streams in sync, I was wondering about the warm-up effect on a thread (suppose thread 1 ramps from 3.3 GHz to 4.2 GHz by the time thread 2 starts at 3.2 GHz). Would this have a significant effect? (I can always warm the threads up before starting the publishing process, but I am curious whether it would have a pronounced effect.)
TIA
As others have stated in the comments, you cannot guarantee that threads launch at exactly the same time. To address your overall goal: you're going about solving this problem the wrong way, from a ROS perspective. Instead of manually publishing data and trying to get it in sync, you should be using the rosbag API. This way you can actually guarantee that messages have the same timestamp. Note that this doesn't guarantee they will be sent out at exactly the same time, because they won't. You can put a message into a bag file directly like this:
import rosbag
from std_msgs.msg import Int32, String

bag = rosbag.Bag('test.bag', 'w')

try:
    s = String()
    s.data = 'foo'

    i = Int32()
    i.data = 42

    bag.write('chatter', s)
    bag.write('numbers', i)
finally:
    bag.close()
For more complex types that include a Header field, simply set the header.stamp field to keep the timestamps consistent.
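For instance, a hedged sketch (PointStamped is just one example of a stamped type; the timestamp value and topic name are arbitrary):

import rospy
import rosbag
from geometry_msgs.msg import PointStamped

bag = rosbag.Bag('synced.bag', 'w')
try:
    stamp = rospy.Time.from_sec(10.0)  # one shared timestamp for every topic
    p = PointStamped()
    p.header.stamp = stamp
    bag.write('points', p, t=stamp)  # t also sets the bag record time
finally:
    bag.close()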

log4cpp stops working properly after some time

I have a log4cpp implementation in a multi-process environment. The logger is configured once during initialization and then shared among forked processes which serve HTTP requests.
During the first minute or so, I see the log roll perfectly fine at the query-per-second load (say it runs at 100 qps).
After that, logging slows down dramatically. So I logged the pid as well and noticed that only one process gets to write to the log for a while (around 10-15 seconds), then another process starts writing, and so on. The processes don't die; they just don't get a chance to write.
This is different from what happens when the server starts. At that time, every other log line is written by a different process. (Also, I write one log line per process at the end of serving each request.)
At this point, I can't think of what could be going wrong.
This is how my log4cpp conf file looks:
log4cpp.rootCategory=DEBUG,rootAppender
log4cpp.appender.rootAppender=org.apache.log4cpp.RollingFileAppender
log4cpp.appender.rootAppender.fileName=/tmp/mylogfile.log
log4cpp.appender.rootAppender.layout=org.apache.log4cpp.PatternLayout
log4cpp.appender.rootAppender.layout.ConversionPattern=%d|%p|%m%n
log4cpp.category.http.server.main=INFO,MAIN
log4cpp.additivity.http.server.main=false
log4cpp.appender.MAIN=org.apache.log4cpp.RollingFileAppender
log4cpp.appender.MAIN.maxBackupIndex=10
log4cpp.appender.MAIN.maxFileAge=1
log4cpp.appender.MAIN.append=true
log4cpp.appender.MAIN.fileName=/tmp/mylogfile.log
log4cpp.appender.MAIN.layout=org.apache.log4cpp.PatternLayout
log4cpp.appender.MAIN.layout.ConversionPattern=%d|%p|%m%n
Edit: more updates. Thanks @Botje for your time.
I see that whenever a new child process is created, only that process gets to write to the log. That tells me that all the references the other processes were holding become invalid.
I also tried setting the additivity property to true. With that, the server starts properly writing into /tmp/myfile.log, then switches to writing into /tmp/myfile.log.1 within a minute, and then stops writing after another minute.
At that point the logs get directed to stderr, which is redirected to another log file.
I did notice that the log4cpp FileAppender uses seek to determine the file size before writing log entries. If the file handle is shared between processes, that will cause writes to end up at the start of the file instead of the end. Even if you fix that, you still have multiple processes that each think they are in charge of log file rotation.
I suggest you have all processes write to a common UDP/TCP/Unix socket and designate one process that collects all log entries and actually writes them to a file. You don't have to reinvent the wheel: you can use the syslog protocol and either the system syslog or a copy running in userspace.
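As a rough illustration of that collector idea (sketched in Python for brevity, since the pattern is language-agnostic; the socket path and buffer size are assumptions):

import os
import socket

SOCK_PATH = '/tmp/log.sock'
if os.path.exists(SOCK_PATH):
    os.remove(SOCK_PATH)

sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
sock.bind(SOCK_PATH)

# One process owns the file, so the seek and rotation races disappear;
# each worker just sendto()s its already-formatted log lines.
with open('/tmp/mylogfile.log', 'a') as logfile:
    while True:
        line = sock.recv(65536)
        logfile.write(line + '\n')
        logfile.flush()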

How to implement long running gRPC async streaming data updates in C++ server

I'm creating an async gRPC server in C++. One of the methods streams data from the server to clients - it's used to send data updates to clients. The frequency of the data updates isn't predictable. They could be nearly continuous or as infrequent as once per hour. The model used in the gRPC example with the "CallData" class and the CREATE/PROCESS/FINISH states doesn't seem like it would work very well for that. I've seen an example that shows how to create a 'polling' loop that sleeps for some time and then wakes up to check for new data, but that doesn't seem very efficient.
Is there another way to do this? If I use the "CallData" method can it block in the 'PROCESS' state until there's data (which probably wouldn't be my first choice)? Or better, can I structure my code so I can notify a gRPC handler when data is available?
Any ideas or examples would be appreciated.
In a server-side streaming example, you probably need more states, because you need to track whether there is currently a write already in progress. I would add two states, one called WRITE_PENDING that is used when a write is in progress, and another called WRITABLE that is used when a new message can be sent immediately. When a new message is produced, if you are in state WRITABLE, you can send immediately and go into state WRITE_PENDING, but if you are in state WRITE_PENDING, then the newly produced message needs to go into a queue to be sent after the current write finishes. When a write finishes, if the queue is non-empty, you can grab the next message from the queue and immediately start a write for it; otherwise, you can just go into state WRITABLE and wait for another message to be produced.
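Here is a sketch of that state machine (in Python for brevity; start_write stands in for kicking off the async Write on the stream, and on_write_done for the completion-queue event that fires when it finishes):

import collections

WRITABLE, WRITE_PENDING = 'WRITABLE', 'WRITE_PENDING'

class StreamingCall(object):
    def __init__(self, start_write):
        self.state = WRITABLE
        self.queue = collections.deque()
        self.start_write = start_write  # begins one async write on the stream

    def on_new_message(self, msg):
        if self.state == WRITABLE:
            self.state = WRITE_PENDING
            self.start_write(msg)
        else:
            self.queue.append(msg)  # a write is in flight; buffer for later

    def on_write_done(self):
        if self.queue:
            self.start_write(self.queue.popleft())  # stay in WRITE_PENDING
        else:
            self.state = WRITABLE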
There should be no need to block here, and you probably don't want to do that anyway, because it would tie up a thread that should otherwise be polling the completion queue. If all of your threads wind up blocked that way, you will be blind to new events (such as new calls coming in).
An alternative here would be to use the C++ sync API, which is much easier to use. In that case, you can simply write straight-line blocking code. But the cost is that it creates one thread on the server for each in-progress call, so it may not be feasible, depending on the amount of traffic you're handling.
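For comparison, a hedged sketch of the sync-API shape (shown with gRPC's Python binding, since the structure is the same; the Subscribe method name and the updates queue are assumptions):

import Queue  # Python 2 naming

class UpdateServicer(object):  # would derive from the generated Servicer class
    def __init__(self):
        self.updates = Queue.Queue()

    def Subscribe(self, request, context):
        # Straight-line blocking code: each in-progress call ties up a thread,
        # blocking on the queue until the next update is produced.
        while context.is_active():
            yield self.updates.get()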
I hope this information is helpful!

Reading/writing Cap’n Proto messages partially

I'm trying to use Cap’n Proto in an existing project consisting of a client and server communicating over UDS. I don't have the resources (and I doubt it would be accepted) to redo all of the client-server RPC, but I wanted to benefit from Cap’n Proto's serialization mechanisms. Unfortunately, it seems to me that it's impossible.
The biggest problem is the server side, which is single-threaded (and will remain so unless there are serious arguments for multithreading) and uses its own poll-based loop. All events are read partially; the server can't block waiting for any event to be fully read - and this is where I am stuck. We have our own protocol and classes which wrap a message, consume bytes from a file descriptor, and send a notification when the event has been fully read, so the server can process it. I think I've analysed most of the Cap’n Proto interfaces (serialization, async serialization), and it seems that they can't be used in the same way without modification.
I really hope that I've missed something. Did I?
There are two ways you can solve this:
Hard way: You can attempt to integrate with the KJ async I/O framework (used by Cap'n Proto). The KJ event loop can actually integrate with other event loops and run on top of them -- but it's tricky. For example, node-capnp includes code to integrate the KJ event loop with libuv, as seen in the first part of this source file. Once you have the necessary glue, you can write KJ-style async code that uses the interfaces in capnp/serialize-async.h.
Easy way: Instead of trying to integrate KJ, you can write simple code using your event infrastructure which reads data from the file descriptor directly and then uses capnp::expectedSizeInWordsFromPrefix() (from capnp/serialize.h) to figure out if it has received the whole message yet. If that function returns a number greater than what you already have, then you don't have the full message and have to keep waiting. Once you have the full message, you can then use capnp::FlatArrayMessageReader to parse it.
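A rough sketch of that buffering loop (in Python for brevity; expected_size_in_words is a hypothetical stand-in for whatever exposes capnp::expectedSizeInWordsFromPrefix(), and a Cap'n Proto word is 8 bytes):

WORD_BYTES = 8  # Cap'n Proto words are 8 bytes

class PartialReader(object):
    """Accumulates bytes from a nonblocking fd until a full message arrives."""

    def __init__(self, expected_size_in_words, on_message):
        self.buf = b''
        self.expected_size_in_words = expected_size_in_words  # hypothetical binding
        self.on_message = on_message  # e.g. hand the bytes to FlatArrayMessageReader

    def feed(self, data):
        self.buf += data
        while self.buf:
            needed = self.expected_size_in_words(self.buf) * WORD_BYTES
            if len(self.buf) < needed:
                return  # prefix is incomplete; wait for the next poll event
            msg, self.buf = self.buf[:needed], self.buf[needed:]
            self.on_message(msg)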

How to prevent other workers from accessing a message which is being currently processed?

I am working on a project that will require multiple workers to access the same queue to get information about a file which they will manipulate. Files range in size from mere megabytes to hundreds of gigabytes. For this reason, a visibility timeout doesn't seem to make sense, because I cannot be certain how long processing will take. I have thought of a couple of ways, but if there is a better way, please let me know.
The message is deleted from the original queue and put into a ‘waiting’ queue. When the program finishes processing the file, it deletes the message; otherwise the message is deleted from the waiting queue and put back into the original queue.
The message id is checked against a database. If the message id is found, it is ignored. Otherwise the program starts processing the message and inserts the message id into the database.
Thanks in advance!
Use the default-provided SQS timeout but take advantage of ChangeMessageVisibility.
You can specify the timeout in several ways:
When the queue is created (default timeout)
When the message is retrieved
By having the worker call back to SQS and extend the timeout
If you are worried that you do not know the appropriate processing time, use a default value that is good for most situations, but don't make it so big that things become unnecessarily delayed.
Then, modify your workers to make a ChangeMessageVisibility call to SQS periodically to extend the timeout. If a worker dies, the message stops being extended and will reappear on the queue to be processed by another worker.
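For example, a hedged boto3 sketch of that heartbeat pattern (the queue URL, the 60-second extension, and the process_chunk callable are all assumptions):

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/files'

def process_with_heartbeat(message, process_chunk):
    # process_chunk does a bounded amount of work and returns True when done.
    while not process_chunk():
        sqs.change_message_visibility(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=message['ReceiptHandle'],
            VisibilityTimeout=60,  # hide the message for 60 more seconds
        )
    sqs.delete_message(QueueUrl=QUEUE_URL,
                       ReceiptHandle=message['ReceiptHandle'])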
See: MessageVisibility documentation