MessageProducer.send() is too slow for a particular topic - C++

I've narrowed down the area of the problem I'm facing, and it turns out that MessageProducer.send() is too slow when the producer is created for a particular topic, "replyfeserver":
auto producer = context.CreateProducerFromTopic("replyfeserver");
producer->send(textMessage); //it is slow
Here the call to send() occasionally blocks for 55-65 seconds (roughly once every 4-5 calls), and for 5-15 seconds in general.
However, if I use some other topic, say "feserver.action.status":
auto producer = context.CreateProducerFromTopic("feserver.action.status");
producer->send(textMessage); //it is fast!
Now the call to send() returns immediately, within a fraction of a second. I've tried send() with several other topics and all of them are fast enough.
What could be the possible issues with this particular topic, "replyfeserver"? What should I look at in order to diagnose the issue with it? I have been using this topic for the last 2 months.
I'm using the XMS C++ API; please assume that the context object is an abstraction that creates the session, destination, consumer, producer, and so on.
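For reference, here is roughly how I measure the per-call latency (just a sketch: Context stands in for my abstraction above, and the timing is plain std::chrono):
#include <chrono>
#include <iostream>

// Sketch: create a producer for the given topic and time a single send() against it.
// Context and xms::TextMessage stand for the abstractions described above.
void timed_send(Context& context, const char* topicName, xms::TextMessage& textMessage) {
    auto producer = context.CreateProducerFromTopic(topicName);

    auto start = std::chrono::steady_clock::now();
    producer->send(textMessage);
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);

    std::cout << topicName << ": send() took " << elapsed.count() << " ms" << std::endl;
}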
I'd also like to know if there is any difference between these two approaches:
xms::Destination dest("topic://replyfeserver");
vs
xms::Destination dest = session.createTopic("replyfeserver");
I tried both; it doesn't make any difference, at least none that I noticed.

There shouldn't be any difference. Personally, I like to have my topics in a hierarchy, e.g. A.B.C.
I would run an MQ trace, then open a PMR with IBM, give them the trace, and ask them to explain the delay.

Related

Launching all threads at exactly the same time in C++

I have a rosbag file which contains messages on various topics; each topic has its own frequency. This data was captured from a hardware device streaming data, and data from all topics would arrive at the same time to be used by different algorithms.
I wish to simulate this using the rosbag file (think of it as each topic having an associated array of data), and it is imperative that the data streaming processes start at the same time so that the data stays in sync.
I do this by launching different publishers on different threads (I am open to other approaches as well; this was the only one I could think of), but the threads do not start at the same time: by the time thread 3 starts, thread 1 is already considerably ahead.
How may I achieve this?
Edit: I understand that launching at exactly the same time is not possible, but maybe I can get away with launches that are extremely close to each other. Is there any way to ensure this?
Edit 2: Since the main aim is to get the data streams in sync, I was wondering about the warm-up effect of a thread (suppose thread 1 starts at 3.3 GHz and has reached 4.2 GHz by the time thread 2 starts at 3.2 GHz). Would this have a significant effect? (I can always warm them up before starting the publishing process, but I am curious whether it would have a pronounced effect.)
TIA
As others have stated in the comments, you cannot guarantee that threads launch at exactly the same time. To address your overall goal: you're going about solving this problem the wrong way from a ROS perspective. Instead of manually publishing data and trying to get it in sync, you should be using the rosbag API. This way you can actually guarantee that messages have the same timestamp. Note that this doesn't guarantee they will be sent out at the exact same time, because they won't. You can put a message into a bag file directly like this:
import rosbag
from std_msgs.msg import Int32, String

bag = rosbag.Bag('test.bag', 'w')
try:
    s = String()
    s.data = 'foo'
    i = Int32()
    i.data = 42
    bag.write('chatter', s)
    bag.write('numbers', i)
finally:
    bag.close()
For more complex types that include a Header field, simply set the header.stamp field to keep the timestamps consistent.

gRPC C++ client calls against Bigtable hang occasionally

I am having a problem with a gRPC C++ client making calls against Google Cloud Bigtable. These calls occasionally hang, and the call only returns if a call deadline is set. There is an issue filed with the gRPC team, https://github.com/grpc/grpc/issues/6278, with a stack trace and a piece of the gRPC tracing log provided.
The call that hangs most often is the ReadRows streaming read call. I have seen the MutateRow call hang a few times as well, but that is rather rare.
gRPC tracing shows that there is some response coming back from the server; however, that response seems to be insufficient for the gRPC client to go on.
I did spend a fair amount of time debugging the code; no obvious problems have been found so far on the client side, and no memory corruption has been seen. This is a single-threaded application making one call at a time, so client-side concurrency is not a suspect. The client runs on a Google Compute Engine box, so the network is likely not an issue either. The gRPC client is kept up to date with the GitHub repository main line.
Any suggestions would be appreciated. If anyone has debugging ideas, that would be great as well. Using valgrind, gdb, and reducing the application to a subset with reproducible results has not helped so far; I have not been able to find out what the problem is. The problem is random and shows up occasionally.
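For clarity, the deadline workaround mentioned above looks roughly like this (a sketch: BigtableStub, ReadRequest, and ReadResponse are placeholders for whatever generated client types are in use; the deadline handling itself is the standard grpc::ClientContext API):
#include <grpc++/grpc++.h>
#include <chrono>
#include <iostream>

// Sketch: placeholder stub and message types; only the deadline pattern is the point here.
grpc::Status ReadWithDeadline(BigtableStub& stub, const ReadRequest& request) {
    grpc::ClientContext context;
    // Without a deadline a hung call blocks forever; with one it eventually
    // returns DEADLINE_EXCEEDED instead.
    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(30));

    auto reader = stub.ReadRows(&context, request);   // streaming read, as in the question
    ReadResponse row;
    while (reader->Read(&row)) {
        // process each streamed row as it arrives
    }
    grpc::Status status = reader->Finish();
    if (status.error_code() == grpc::StatusCode::DEADLINE_EXCEEDED) {
        std::cerr << "call hung and was cut off by the deadline" << std::endl;
    }
    return status;
}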
Additional note on May 17, 2016
There was a suggestion that retries might help deal with the issue.
Unfortunately retries do not work very well for us, because we would have to carry them over into the application logic. We can easily retry updates, i.e. MutateRow calls, and we do; these are not streaming calls and are easy to retry. However, once the application has begun iterating over the DB query results and the call fails, retrying means the application has to re-issue the query and start iterating over the results again, which is problematic. It is always possible to consider a change that would make our applications read the whole result set at once, so that iteration could then be done in memory at the application level and retries could be handled. But that is problematic for all kinds of reasons, such as memory footprint and application latency. We want to process DB query results as soon as they arrive, not once all of them are in memory. There is also the timeout added to the call latency when the call hangs. So retries of query-result iteration are really costly, to such a degree that they are not practical.
We've experienced hanging issues with gRPC in various languages. The gRPC team is investigating.

How to set up a ZeroMQ architecture to deal with workers of different speeds

[As a small bit of context: I am new to networking and ZeroMQ, but I did spend quite a bit of time on the guide and the examples.]
I have the following challenge (done in C++, but that is irrelevant to the question). I have a single source that generates tasks. I have multiple engines that need to process those tasks and send back the results.
First attempt:
I created a client with a ZMQ_PUSH socket. The engines have a ZMQ_PULL socket. To get the answers back to the client, I created the reverse: a ZMQ_PUSH on the workers and a ZMQ_PULL on the client. It worked out of the box, only for me to find out that after some time the client ran out of memory, since I was pushing far more requests than the workers could process. I need some backpressure.
Second attempt:
I added a counter on the client that took care of only pushing when no more than, say, 1000 tasks were 'in progress'. The out-of-memory issue was solved, since I never had more than 1000 'in progress' tasks. But some workers were slower than others. Since PUSH/PULL distributes work round-robin regardless of how busy each worker is, the amount of work for a slow worker kept increasing and increasing... until the slowest worker had all 1000 requests queued and the others were starved. I was not using my workers effectively.
Now, what architecture could I use that solves the issue of workers of different speeds? Is the 'count the number of in-progress tasks' approach a good way of balancing the number of pushed requests? Or is there a way I can PUSH tasks to the workers and have the push block at a predefined point? Can I do that with the HWM?
I am sure this problem is of such a generic nature that I should be able to deal with it easily. Can anyone point me in the right direction?
Thanks!
We used the Paranoid Pirate Protocol (http://rfc.zeromq.org/spec:6), but in the case of many very small jobs, where the communication overhead might be high, a credit-based flow-control pattern might be more efficient (http://unprotocols.org/blog:15).
In both cases it is necessary for the requester to assign jobs directly to individual workers. This is abstracted away, of course, and, depending on the use case, could be exposed as a synchronous call that returns when all tasks have been processed.
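As an illustration of the "assign jobs directly to individual workers" idea, here is a minimal broker sketch using the libzmq C API (the endpoint, the task list, and the READY handshake are assumptions; each worker would connect with a REQ socket, send "READY", and from then on reply with the result of its previous task):
#include <zmq.h>
#include <queue>
#include <string>

int main() {
    void* ctx = zmq_ctx_new();
    void* broker = zmq_socket(ctx, ZMQ_ROUTER);   // talks to the workers' REQ sockets
    zmq_bind(broker, "tcp://*:5555");             // assumed endpoint

    std::queue<std::string> tasks;
    for (int i = 0; i < 100; ++i)
        tasks.push("task-" + std::to_string(i));

    while (!tasks.empty()) {
        // A REQ worker's message arrives as [identity][empty][payload].
        char identity[256], empty[8], payload[256];
        int id_len = zmq_recv(broker, identity, sizeof(identity), 0);
        if (id_len < 0) break;
        zmq_recv(broker, empty, sizeof(empty), 0);
        zmq_recv(broker, payload, sizeof(payload), 0);

        // The payload is either "READY" or the result of the worker's previous task;
        // either way this worker is idle now, so hand it the next task.
        const std::string& task = tasks.front();
        zmq_send(broker, identity, id_len, ZMQ_SNDMORE);
        zmq_send(broker, "", 0, ZMQ_SNDMORE);
        zmq_send(broker, task.data(), task.size(), 0);
        tasks.pop();
    }
    // A real broker would also collect the final round of results before shutting down.

    zmq_close(broker);
    zmq_ctx_term(ctx);
    return 0;
}
Because a worker only gets a new task after announcing it is idle, slow workers never accumulate a backlog and fast workers are never starved.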

Use Post or PostAndAsyncReply with F#'s MailboxProcessor?

I've seen different snippets demonstrating a Put message that returns unit with F#'s MailboxProcessor. In some, only the Post method is used, while others use PostAndAsyncReply, with the reply channel replying immediately once the message is processed. In doing some testing, I found a significant time lag when awaiting the reply, so it seems that unless you need a real reply, you should use Post.
Note: I started asking this in another thread but thought it useful to post as a full question. In the other thread, Tomas Petricek mentioned that the reply channel could be used as a wait mechanism to ensure the caller is delayed until the Put message has been processed.
Does using PostAndAsyncReply help with message ordering, or is it just there to force a pause until the first message is processed? In terms of performance, Post appears to be the right solution. Is that accurate?
Update:
I just thought of a reason why PostAndAsyncReply might be necessary in the BlockingQueueAgent example: Scan is used to find Get messages when the queue is full, so you don't want to Put and then Get before the previous Put has completed.
I think I generally agree with your summary - it makes sense that PostAndAsyncReply is slower than Post, so if the caller doesn't need to get a notification from the agent when the operation (such as putting a value into the queue) completes, the agent should definitely expose a way to do that using just Post. The fact that PostAndAsyncReply is a lot slower probably means that some agents should expose both options and let the caller decide.
Regarding the specific example of BlockingQueueAgent (or a similar one that I used to implement a one-place buffer), the typical application of the agent is to solve the producer-consumer problem. In the producer-consumer problem, we want to block the producer when the queue is full and block the consumer when it is empty. The .NET BlockingCollection supports only synchronous blocking, which is a bit bad (i.e. it can block the whole thread pool).
When using the BlockingQueueAgent and sending the Put message with PostAndAsyncReply, we can wait asynchronously until the element is added to the queue (so it blocks the producer, but without blocking threads!). An example of typical usage is the image-processing pipeline that I wrote some time ago. Here is one snippet from it:
// Phase 2: Scale to a thumbnail size and add frame
let scalePipelinedImages = async {
    while true do
        let! info = loadedImages.AsyncGet()
        scaleImage info
        do! scaledImages.AsyncAdd(info) }
This loop repeatedly gets an image from the loadedImages queue, does some processing and writes the result to scaledImages. The blocking using the queue (both when reading and when writing) controls the parallelism, so that the steps of pipeline run in parallel, but it does not keep loading more and more images if the pipeline cannot handle them at the required speed.
My advice is to design your system so you can use Post as much as possible.
This technology was designed for asynchronous concurrency where the objective is to fire-and-forget messages. The idea of waiting for a response goes directly against the grain of this.

Platform independent parallelization without changing the framework?

I hope the title did not mislead you.
My problem is the following: I am currently trying to speed up a ray tracer with the help of the graphics card. It works fine, apart from the fact that it actually got slower. :)
This is caused by the fact that I trace one ray against the whole geometry at a time on the graphics card (my "tracing server") and then fetch the result, which is awfully slow, so I have to gather a batch of rays, calculate them, and fetch the results together to speed this up.
The next problem is that I am not allowed to rewrite the surrounding framework, which should know nothing, or as little as possible, about this parallelization.
So here is my approach:
I thought about using several threads, where each one gets a ray and asks my "tracing server" to calculate the intersections. Each thread is then suspended until enough rays have been gathered to calculate the intersections on the graphics card and fetch the results back efficiently. This means that each thread will wait until its results have been fetched.
You see, I already have some plan, but I do not know the following:
Which threading framework should I use to stay platform-independent?
Should I use a thread pool of fixed size, or create threads as needed?
Can any given thread library handle at least 1000 waiting threads (because that is the number of rays I would need to gather for my fetch to be efficient)?
But I could also imagine doing this with a single thread that dumps its load (a new ray) to the "tracing server" and fetches the next load until there is enough to fetch the results. The thread would then take the results one by one, do the further calculations until all results are processed, and then go back to step one until all rays are done.
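Sketched in code, that single-threaded batching loop would look something like this (Ray, Hit, and the trace_batch callable are placeholders for the framework's types and the GPU "tracing server" call):
#include <cstddef>
#include <functional>
#include <vector>

struct Ray {};   // placeholder for the framework's ray type
struct Hit {};   // placeholder for an intersection result

// Gather rays into batches and hand each full batch to the "tracing server" in one go.
void trace_all(const std::vector<Ray>& rays,
               const std::function<std::vector<Hit>(const std::vector<Ray>&)>& trace_batch,
               std::vector<Hit>& hits,
               std::size_t batch_size = 1000) {
    std::vector<Ray> batch;
    batch.reserve(batch_size);

    auto flush = [&]() {
        std::vector<Hit> results = trace_batch(batch);   // one round trip to the GPU
        hits.insert(hits.end(), results.begin(), results.end());
        batch.clear();
    };

    for (const Ray& ray : rays) {
        batch.push_back(ray);
        if (batch.size() == batch_size)   // enough rays gathered for an efficient fetch
            flush();
    }
    if (!batch.empty())                   // don't forget the final partial batch
        flush();
}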
Also, if you have a better idea of how to parallelize this, tell me about it.
Regards,
Nobody
PS
If you need this information: The two platforms I want to use are Linux and Windows.
Use either Threading Building Blocks or boost::thread.
http://www.boost.org/doc/libs/1_46_0/doc/html/thread.html
http://threadingbuildingblocks.org/
As far as a thread pool vs. on-demand threads goes: a thread pool is generally the better idea, as it avoids thread-creation overhead.
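For illustration only (not tied to either library, and assuming a C++11 compiler is available on both platforms), a minimal fixed-size pool with a job queue and a condition variable could be sketched like this:
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Sketch of a fixed-size pool: submit() enqueues a job, worker threads drain the queue.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();   // run the task outside the lock
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
};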
The number of waiting threads is going to depend on the underlying system more than anything else:
Maximum number of threads per process in Linux?