Fast synchronised cout for multithreading - C++

Recently I ran into a rather common problem with using cout in a multithreaded application, but with a little twist. I've got several callback functions which get called by external hardware via a driver. The main objective of the callback functions is to receive some data, store it in a queue, and signal a processing task as soon as a certain amount of datasets has been collected. The callback function needs to run as fast as possible in order to respond to the hardware in soft real time.
My problem is this: from time to time my queue gets full and I have to handle this case by printing a warning to the console (hard requirement). As I work with several threads, I created a wrapper function which uses a mutex to synchronise cout. Unfortunately, in some cases waiting for access to cout takes so much time that my callback function doesn't finish fast enough to respond to the hardware before a timeout. My workaround was to use an atomic variable for each possible error to count the number of occurrences, plus a further task that checks these variables periodically and prints the messages afterwards, but I'm pretty sure this is not the best approach to solve my performance problem.
Are there any general approaches for this type of problem?
Any recommendations on how I could improve or simplify my solution?
Thank you in advance

Don't write output in the hot path.
Instead, queue up the stuff you want to log (preferably raw data rather than a fully formatted string). Have another out-of-band thread running which picks up this stuff and logs it.
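For instance, here is a minimal sketch of that idea (the record layout, buffer size, and all names are illustrative, not from the question): the callback pushes a small raw record into a bounded lock-free single-producer/single-consumer ring, so it never blocks, and a dedicated logger thread drains the ring and does all the formatting and console output at its leisure. With several callback threads you would give each producer its own ring. If a ring is full, the record is dropped and could be counted, much like the atomic-counter workaround in the question.

#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>

// Hypothetical raw log record: no strings, no allocation in the hot path.
struct LogRecord {
    int code;        // which warning occurred (e.g. "queue full")
    long long arg;   // optional payload, e.g. the queue size
};

// Bounded single-producer/single-consumer ring buffer.
// Safe if exactly one thread pushes and exactly one thread pops.
template <size_t N>
class SpscRing {
    LogRecord buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};
public:
    bool push(const LogRecord& r) {          // called from the callback
        size_t h = head_.load(std::memory_order_relaxed);
        size_t next = (h + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                    // full: drop, never block
        buf_[h] = r;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(LogRecord& r) {                 // called from the logger thread
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t == head_.load(std::memory_order_acquire))
            return false;                    // empty
        r = buf_[t];
        tail_.store((t + 1) % N, std::memory_order_release);
        return true;
    }
};

SpscRing<1024> g_log;
std::atomic<bool> g_running{true};

void loggerThread() {                        // formatting happens here only
    LogRecord r;
    while (g_running.load(std::memory_order_relaxed)) {
        while (g_log.pop(r))
            std::printf("warning %d (arg=%lld)\n", r.code, r.arg);
        // Polling with a short sleep keeps the sketch simple; a condition
        // variable on the consumer side would remove the drain latency.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}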

Related

multithreaded one read one write time_t

I'm writing a multithreaded application that has two threads.
One thread receives data from a queue and aggregates it, and the other sends the aggregated data to a server.
I want to be able to know the last time data was received, so I use:
time_t last_data = time(NULL);
to get the current time on each event (I don't need it to be super accurate, but I do need it to be fast), and the other thread then sends this value along with the aggregated data.
My questions are:
Do I have to synchronize this, even though it is not very important that I get the most recent update?
I tested it with std::atomic<time_t> and it seems to have some performance issues; is there a faster way?
What is the worst that can happen if I don't synchronize the read/write?
Is there a faster way to get the current time than time(NULL)? (It doesn't have to be super accurate.)
UPDATE
Here is an explanation of my application's workflow.
Application needs:
1. Consume data from external sources over IPC (currently nanomsg).
2. Aggregate the data into batches.
3. Send the aggregated data to a remote server every given interval (1 second).
Current implementation:
Create two buffers to hold the aggregated data (one for receiving and one for sending).
Create a consumer thread that consumes data from the IPC and fills the receiving buffer.
Create a sending thread that sends the data to the server.
Every iteration of the interval, the sending thread swaps the buffers (swapping pointers under a mutex) and sends the data to the server.
I don't want the consumer to wait on network I/O, so I created this flow.
Can I use an event-driven approach here instead of this complex mechanism, without all the locking? (Currently it is working fine, but I'm sure it can be better.)
Don't do it that way. You only need one thread. You can use select/poll/epoll, which can wait on your inputs and, at the same time, on your output finishing. You will be doing event-driven programming with non-blocking output. It is something worth learning: a bit harder at first, but it soon makes life easier, i.e. you avoid the problem you have now. The program will also be faster.
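A minimal sketch of that single-threaded shape (all names assumed; it treats the IPC endpoint as a plain readable descriptor and the server connection as a non-blocking socket, which is a simplification): epoll watches the input for readability and the output for writability, and EPOLLOUT is only requested while there is aggregated data waiting to go out, so neither side ever blocks the other.

#include <sys/epoll.h>
#include <unistd.h>
#include <string>

// Hypothetical descriptors: ipc_fd is the IPC input,
// srv_fd is a non-blocking TCP socket to the server.
void eventLoop(int ipc_fd, int srv_fd) {
    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;  ev.data.fd = ipc_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, ipc_fd, &ev);
    ev.events = 0;        ev.data.fd = srv_fd;   // EPOLLOUT only when we have data
    epoll_ctl(ep, EPOLL_CTL_ADD, srv_fd, &ev);

    std::string outbuf;                           // aggregated, not-yet-sent data
    epoll_event events[8];
    for (;;) {
        int n = epoll_wait(ep, events, 8, -1);
        for (int i = 0; i < n; ++i) {
            if (events[i].data.fd == ipc_fd && (events[i].events & EPOLLIN)) {
                char buf[4096];
                ssize_t r = read(ipc_fd, buf, sizeof buf);
                if (r > 0) outbuf.append(buf, r);         // aggregate
            }
            if (events[i].data.fd == srv_fd && (events[i].events & EPOLLOUT)) {
                ssize_t w = write(srv_fd, outbuf.data(), outbuf.size());
                if (w > 0) outbuf.erase(0, w);            // partial writes are fine
            }
        }
        // Ask for EPOLLOUT only while there is something to send.
        ev.events = outbuf.empty() ? 0 : EPOLLOUT;
        ev.data.fd = srv_fd;
        epoll_ctl(ep, EPOLL_CTL_MOD, srv_fd, &ev);
    }
}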
Supposing one thread executes:
last_data = time(NULL);
And the other uses last_data, but there is no synchronization event between the two, then there are no guarantees about when, or whether, the revised value of last_data will become visible to the reading thread.
The more serious possibility, however, is that a write of time_t (typically a long) isn't atomic, so the other thread could read a corrupt, part-written value.
That could cause glitches in delay and time calculations that might foul downstream processing.
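If the torn read is the only concern and exact ordering doesn't matter (which is what the question says), one common compromise, sketched below, is std::atomic<time_t> with relaxed memory ordering. On mainstream hardware this compiles to a plain load/store without the fences that the default sequentially consistent ordering adds, and those fences may be where the reported performance issue came from.

#include <atomic>
#include <ctime>

std::atomic<time_t> last_data{0};

// Writer thread: record the time of the last received datum.
// memory_order_relaxed guarantees the value is never torn,
// but imposes no ordering on surrounding operations.
void on_data() {
    last_data.store(time(nullptr), std::memory_order_relaxed);
}

// Reader thread: a slightly stale value is acceptable here.
time_t read_last_data() {
    return last_data.load(std::memory_order_relaxed);
}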
You might analyse your program and find that, because the two threads interact at some other point, there is a sufficient memory fence there that guarantees eventual update.
NB: this is an odd situation, in that I suspect you think something isn't synchronized when it actually is! The usual experience is the other way around...
Basically there's not really enough information to understand what problem you're having.
For example, if the reader thread is the only one that reads the time, I'd expect to see code like:
Thread 1:
If data received: lock L, update time, add to queue, unlock L.
Thread 2:
Lock L; if items in queue, read an item and the time; unlock L; then process the item.
In which case the time will be synchronized already.
Please provide a minimal, complete, verifiable example...

How to write-off an expensive file I/O in the middle of a C++ program

I am working on some code which is extremely demanding performance-wise (I am using microsecond timers!). It has a server<->client architecture where a lot of data is shared at high speed. To keep the client and the server in sync, a simple sequence-number-based approach is used: if the client's program crashes, the client can resume communication by sending the server the last sequence number, and they can resume operations without missing anything.
The issue is that this forces me to write sequence numbers to disk, and sadly this has to be done on every "transaction". This file write incurs a huge time cost (as we would expect).
So I thought I would use threads to get around the problem. However, if I create a regular thread, I have to wait until the file write finishes anyway; and if I use a detached thread, I am doing something risky, as the thread might not have finished when my actual process is killed (say), and the sequence number would get messed up.
What are my options here? Kindly note that, sadly, I do not have access to C++11. I am using pthreads (-lpthread) on Linux.
You can just add the data to a queue, and have a secondary thread dequeue it, write it, and signal when it's done.
You can also take some inspiration from log-structured file systems. They get around this problem by having the main thread first write a small record to a log file and return control immediately to the rest of the program. Meanwhile, secondary threads carry out the actual data write and signal completion by also writing to the log file. This helps you maintain throughput by deferring writes to when more system resources are available, and it doesn't block the main thread. Read more about it here
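Since C++11 isn't available, here is a minimal pthreads sketch of that queue idea (the names and the file format are illustrative): the hot path only enqueues the sequence number; a dedicated writer thread does the disk I/O; and an orderly shutdown drains the queue before joining, so no number is lost when the process exits normally.

#include <pthread.h>
#include <deque>
#include <cstdio>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static std::deque<long> pending;   // sequence numbers waiting to be persisted
static bool stopping = false;

// Hot path: enqueue and return immediately; no disk I/O here.
void persist_async(long seq) {
    pthread_mutex_lock(&mtx);
    pending.push_back(seq);
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
}

// Writer thread: blocks until work arrives, then writes to disk.
void* writer(void*) {
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (pending.empty() && !stopping)
            pthread_cond_wait(&cv, &mtx);
        if (pending.empty() && stopping) {
            pthread_mutex_unlock(&mtx);
            return NULL;
        }
        long seq = pending.front();
        pending.pop_front();
        pthread_mutex_unlock(&mtx);

        FILE* f = fopen("seqno.dat", "w");   // illustrative file name
        if (f) { fprintf(f, "%ld\n", seq); fclose(f); }
    }
}

// On orderly shutdown: flush everything, then join, so no number is lost.
void shutdown_writer(pthread_t tid) {
    pthread_mutex_lock(&mtx);
    stopping = true;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
    pthread_join(tid, NULL);
}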

Are there performance issues with CSocket::Send?

UPDATE 14 June 2011
A quick update... Most respondents have focused on the dodgy method for handling the queue of messages to be logged. While there is certainly a lack of optimisation there, it's not the root of the problem. We switched the Yield over to a short sleep (yes, the Yield did result in 100% CPU once the system went quiet), but the system still can't keep up with the logging even when it goes nowhere near that sleep. From what I can see, the Send is just not very efficient. One respondent commented that we should batch the Send() calls up into one send, and that seems the most appropriate solution to the larger underlying issue, which is why I have marked it as the answer to the original question. I certainly agree the queue model is very flawed, though, so thanks for the feedback on that; I have up-voted all answers that contributed to the discussion.
However, this exercise made us review why we're using external logging over a socket at all. While it may well have made sense previously, when the logging server did a lot of processing on the log entries, it no longer does any of that, so we have opted to remove that entire module and go for a direct-to-file approach via a pre-existing logging framework. This should eliminate the problem entirely, as well as remove unnecessary complexity from the system.
Thanks again for all the feedback.
ORIGINAL QUESTION
In our system we have two components important to this problem - one is developed in Visual C++ and the other is Java (don't ask, historic reasons).
The C++ component is the main service and generates log entries. These log entries are sent via CSocket::Send to a Java logging service.
The problem
Performance of sending data seems very low. If we queue on the C++ side, the queue backs up progressively on busier systems.
If I hit the Java logging server with a simple C# application, I can hammer it far faster than I will ever need to from the C++ tool, and it keeps up beautifully.
In the C++ world, the function that adds messages to the queue is:
void MyLogger::Log(const CString& buffer)
{
    struct _timeb timebuffer;
    _ftime64_s(&timebuffer);
    CString message;
    message.Format("%d%03d,%04d,%s\r\n", (int)timebuffer.time, (int)timebuffer.millitm, GetCurrentThreadId(), (LPCTSTR)buffer);
    CString* queuedMessage = new CString(message);
    sendMessageQueue.push(queuedMessage);
}
The function run in a separate thread that sends to the socket is:
void MyLogger::ProcessQueue()
{
    CString* queuedMessage = NULL;
    while (!sendMessageQueue.try_pop(queuedMessage))
    {
        if (!running)
        {
            break;
        }
        Concurrency::Context::Yield();
    }
    if (queuedMessage == NULL)
    {
        return;
    }
    else
    {
        socket.Send((LPCTSTR)*queuedMessage, queuedMessage->GetLength());
        delete queuedMessage;
    }
}
Note that ProcessQueue is run repeatedly by the thread's outer loop, which (excluding a bunch of nonsense preamble) is:
while (parent->running)
{
    try
    {
        logger->ProcessQueue();
    }
    catch (...)
    {
    }
}
The queue is:
Concurrency::concurrent_queue<CString*> sendMessageQueue;
So the effect we're seeing is that the queue is just getting bigger and bigger, log entries are being sent out to the socket but at a much lower rate than they're going in.
Is this a limitation of CSocket::Send that makes it less than useful for us? A mis-use of it? Or an entire red-herring and the problem lies elsewhere?
Your advice is much appreciated.
Kind Regards
Matt Peddlesden
Well, you could start by using a blocking producer-consumer queue and getting rid of the Yield. I'm not surprised that messages get backed up: when one is posted, the logger thread on a busy system is typically ready but not running. That introduces a lot of avoidable latency before any message on the queue can be processed. The background thread then gets one quantum to try and get rid of all the messages that have accumulated on the queue. If there are a lot of ready threads on a busy system, the logger thread may simply not get sufficient time to handle the messages, especially if a lot have built up and socket.Send blocks.
Also, almost completely wasting one CPU core on queue polling cannot be good for overall performance.
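For illustration, a blocking queue of that shape could look like this (a C++11 sketch of the idea, not the poster's code; a Win32 CONDITION_VARIABLE version has the same structure): pop() sleeps on a condition variable instead of yielding, so the logger thread burns no CPU while idle and wakes as soon as a message is pushed.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Blocking producer-consumer queue: pop() sleeps until work arrives,
// so the consumer uses no CPU while the queue is empty.
class BlockingQueue {
    std::queue<std::string> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(std::string msg) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(msg)); }
        cv_.notify_one();
    }
    std::string pop() {                       // blocks instead of polling
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        std::string msg = std::move(q_.front());
        q_.pop();
        return msg;
    }
};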
Rgds,
Martin
In my opinion, you're definitely not looking at the most efficient solution. You should call Send() once, for all the messages: concatenate all the messages currently in the queue on the user side, send them all at once with a single Send(), and only then yield.
In addition, this really isn't how you're meant to do it. The PPL contains constructs explicitly intended for asynchronous callbacks, like the call object. You should use that instead of hand-rolling your own. A sketch of the batching suggestion follows below.
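This is what the batching could look like (ProcessQueueBatched is my name, not from the post; it reuses the question's sendMessageQueue and socket members): drain everything currently queued into one buffer and hand the whole batch to the socket in a single call.

void MyLogger::ProcessQueueBatched()
{
    // Drain the queue without blocking, concatenating as we go.
    CString* queuedMessage = NULL;
    CString batch;
    while (sendMessageQueue.try_pop(queuedMessage))
    {
        batch += *queuedMessage;
        delete queuedMessage;
    }
    // One Send() per batch instead of one per message.
    if (!batch.IsEmpty())
        socket.Send((LPCTSTR)batch, batch.GetLength());
}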
Things here that might be slowing you down:
The queue you are using. I think this is a classic example of premature optimization. There's no reason here to use the Concurrency::concurrent_queue class rather than a regular message queue with a blocking pop() method. If I understand correctly, the Concurrency classes use non-blocking algorithms, whereas in this case you want to block while the queue is empty and release the CPU to other threads.
The use of new and delete for each message, and the internal allocations of the CString class. You should try recycling the messages and strings (using a pool), which can help performance in two ways: 1. it avoids the allocation and deallocation of the message and string objects; 2. the allocations done inside the strings can perhaps be avoided if the string class internally recycles its buffers.
Have you tried profiling to see where your application is having trouble? Is it only when logging that there are issues with the sender? Is it CPU bound or blocking?
The only thing I can see is that you don't protect the message queue with any sort of locking so the container state could get weird causing all sorts of unexpected behavior.

Save data periodically during execution

I have a program which runs continuously, and I need to save data every minute.
The program processes data, and every minute I want to save the value of a variable and perform some statistical operations to track how this variable varies.
I thought I could do it with a signal: SIGALRM and alarm(60). My sub-question is: can I use a class method as the handler for SIGALRM?
Any other ideas for running a method that saves data and does some operations every minute?
The program is written in C++ and runs on Linux, on a single-core processor.
Your solution using alarm will work, since both open and write are async-signal-safe. You do have to be aware that interactions between alarm and sleep are undefined, so don't use both in the same program.
A different solution, especially if you already use an epoll loop, would be to have a timerfd trigger the epoll. That also avoids the undefined interactions.
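A minimal timerfd sketch (Linux-specific; error handling omitted): the timer is just another file descriptor that your epoll loop wakes up on once a minute.

#include <sys/timerfd.h>
#include <unistd.h>
#include <cstdint>

int make_minute_timer() {
    int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
    itimerspec spec{};
    spec.it_value.tv_sec    = 60;   // first expiry
    spec.it_interval.tv_sec = 60;   // then every minute
    timerfd_settime(tfd, 0, &spec, NULL);
    return tfd;   // add to your epoll set with EPOLLIN
}

// In the epoll loop, when tfd becomes readable:
void on_timer(int tfd) {
    uint64_t expirations;
    read(tfd, &expirations, sizeof expirations);  // reading clears the expiration count
    // save_data_and_compute_statistics();  // hypothetical: your periodic work
}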
As for the actual saving, consider forking. This is a technique I learned from redis (maybe someone else invented it, but that's where I picked it up), and which I consider totally cool. The point is that the forked process can take all the time in the world to finish writing as much data as you want to disk. It accesses the snapshot as of the time of the fork, while the other process keeps running and modifying data. Thanks to copy-on-write page magic in the kernel, it all works seamlessly, without any risk of corruption, without ever stalling, and without ever needing to look at something like asynchronous I/O, which is great.
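A sketch of that fork-and-save pattern (the save function is a placeholder): the child gets a copy-on-write snapshot of the process's memory as of the fork and can write it out at leisure, while the parent continues immediately.

#include <unistd.h>
#include <sys/wait.h>

void save_snapshot() {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: sees a copy-on-write snapshot of all data as of fork().
        // write_all_data_to_disk();   // hypothetical: can take as long as it needs
        _exit(0);                      // _exit, not exit: skip parent's atexit handlers
    }
    // Parent: continues immediately and may modify data freely.
    // Reap the child later (e.g. waitpid(pid, NULL, WNOHANG) or a SIGCHLD
    // handler) to avoid zombies.
}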
You can call a class method using something like boost::bind.
Apart from that, I wouldn't recommend using signals for this; they are not all that reliable and could, for example, make one of your syscalls return prematurely.
I would spawn a thread (assuming your mono-core doesn't mean no threads) that waits 60 seconds, takes the locks, does the calculations, writes the output, and releases the locks.
As others have already suggested, if you have an async-compatible (event-driven) system, you could use timerfd to generate the events.
Saving data from a signal handler is a very bad idea. Even if open and write are async-signal-safe, your data could very well be in an inconsistent state due to a signal interrupting a function that was modifying it.
A much better approach would be to add to all functions which modify the data:
if (current_time > last_save_time + 60) save();
This also avoids useless saves when the data has not been modified. If you don't want the overhead of a system call to get the current time on every operation, you can instead install a timer/signal handler that updates current_time, as long as you declare it volatile.
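A sketch of that variant (with the caveat that, strictly, only volatile sig_atomic_t is guaranteed safe to touch from a signal handler; a plain time_t works in practice on common platforms, subject to the torn-write discussion earlier on this page): SIGALRM fires once a second and refreshes a volatile timestamp, so the hot path only ever reads a variable. Both time and alarm are async-signal-safe.

#include <csignal>
#include <ctime>
#include <unistd.h>

static volatile time_t current_time = 0;

// Handler calls only async-signal-safe functions (time, alarm).
extern "C" void on_alarm(int) {
    current_time = time(NULL);
    alarm(1);                 // re-arm for one second from now
}

void install_clock() {
    current_time = time(NULL);
    signal(SIGALRM, on_alarm);   // sigaction would be more portable
    alarm(1);
}

// Hot path: if (current_time > last_save_time + 60) save();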
Another good approach is to use threads instead of signals. Then you should use a mutex (or better, an rwlock) to synchronize access to the data.

console out in multi-threaded applications

When developing applications, I am used to printing to the console to get useful debugging/tracing information. The application I am working on now is multi-threaded, and sometimes I see my printf outputs overlapping each other.
I tried to synchronize the screen using a mutex, but I ended up slowing down and blocking the app. How can I solve this issue?
I am aware of MT logging libraries, but since I log a lot, using them slows my app (a bit).
I was thinking of the following idea: instead of logging within my application, why not log outside it? I could send the logging information via a socket to a second application process that actually prints it to the screen.
Are you aware of any library already doing this?
I use Linux/gcc.
thanks
afg
You have 3 options. In increasing order of complexity:
Just use a simple mutex within each thread. The mutex is shared by all threads.
Send all the output to a single thread that does nothing but the logging.
Send all the output to a separate logging application.
Under most circumstances, I would go with #2. #1 is fine as a starting point, but in all but the most trivial applications you can run into problems serializing the application. #2 is still very simple, and simple is a good thing, but it is also quite scalable. You still end up doing the processing in the main application, but for the vast majority of applications you gain nothing by spinning this off to its own dedicated application.
Number 3 is what you would do in performance-critical server-type applications, but the minimal performance gain you get from it is: 1. very difficult to achieve, 2. very easy to screw up, and 3. not the only, or even the most compelling, reason people generally take this approach. Rather, people typically take it when they need the logging service to be separated from the applications using it.
Which OS are you using?
Not sure about specific libraries, but one of the classical approaches to this sort of problem is to use a logging queue serviced by a writer thread whose job is purely to write the log file.
You need to be aware, with either a threaded or a multi-process approach, that the write queue may back up, which means it needs to be managed, either by discarding entries or by slowing down your application (which is obviously easier with the threaded approach).
It's also common to have some way of categorising your logging output, so that you can have one section of your code logging at a high level, whilst another section of your code logs at a much lower level. This makes it much easier to manage the amount of output that's being written to files and offers you the option of releasing the code with the logging in it, but turned off so that it can be used for fault diagnosis when installed.
As far as I know, a critical section is lighter-weight than a mutex (on Windows). See:
Critical section
Using critical section
If you use gcc, you could use atomic accesses. Link.
Frankly, a mutex is the only way you really want to do this, so it's always going to be slow in your case because you're using so many print statements... so to answer your question: don't use so many printf statements; that's your problem to begin with.
OK, so your current solution uses a mutex around the print? Perhaps you should instead put the mutex around a message queue that another thread drains to do the printing; that has a potential hang-up, but I think it will be faster. So, use an active logging thread that waits for incoming messages to print. The networking solution could work too, but it requires more work; try this first.
What you can do is have one queue per thread, and have the logging thread routinely go through each of them and post the messages somewhere.
This is fairly easy to set up, and the amount of contention can be very low (just a pointer swap or two, which can be done without locking anything).
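A sketch of that per-thread arrangement (all names mine): each producer appends to its own buffer under a tiny mutex shared only with the logger, and the logger steals the whole batch with one O(1) swap. This keeps a small lock rather than being truly lock-free as the answer suggests; a fully lock-free version would use per-thread single-producer/single-consumer rings like the sketch at the top of this page.

#include <mutex>
#include <string>
#include <vector>

// One of these per producer thread; the logger thread polls all of them.
class ThreadLogBuffer {
    std::mutex m_;                       // shared by exactly two threads
    std::vector<std::string> buf_;
public:
    void log(std::string msg) {
        std::lock_guard<std::mutex> lk(m_);
        buf_.push_back(std::move(msg)); // usually uncontended: lock is cheap
    }
    // Logger thread: steal the whole batch with one swap under the lock.
    std::vector<std::string> take() {
        std::vector<std::string> out;
        std::lock_guard<std::mutex> lk(m_);
        buf_.swap(out);                 // O(1) pointer swap, lock held briefly
        return out;
    }
};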