How to avoid cached data in a multithreaded C++ application

Can someone clarify the following point for me. I have seen some implementations of std::queue for multithreading purposes, where all operations of pushing/popping/erasing elements were protected with a mutex, but when I see that I imagine the following scenario: we have two threads (thread1 and thread2) running on different cores of the processor, so they have different L1 caches. And we have this queue:
struct Message {
char* buf;
size_t size;
};
std::queue<Message> messageQueue;
Thread1 adds some element to the queue, then thread2 tries to access an element with the front() method, but what if that piece of memory was previously cached for thread2's core (so the size variable may not reflect the current size of the buffer, or the buf pointer may hold a stale, not-yet-updated address)?
I ran into this problem while designing the server side of a client/server application. In my app the server runs in one thread and works directly with sockets; when it receives a new message, it allocates memory for that message and adds it to a message queue, then another thread reads from this queue, processes the message, and deletes it. I was always afraid of caching problems, and because of that I created my own queue implementation with volatile pointers.
What is the proper way to work with such things? Do I have to avoid using std::list or std::queue with locks? If this problem cannot actually occur, could you please explain why?

It's not your problem. You're writing C++ code. It's the compiler's job to ensure your code makes the CPU and its caches do the right thing, not yours. You just have to comply with the rules for whatever threading standard you are using.
But if I lock some mutex, how can I be sure that this memory isn't cached? Does mutex locking somehow guarantee reading directly from memory?
You can't. And that's a good thing. Caching massively improves performance and main memory is terribly slow. Fortunately, no modern CPU that you're likely to write multi-threaded code on requires you to sacrifice performance like that. They have incredibly sophisticated optimizations such as cache coherency hardware and prefetch pinning to avoid things that hurt performance that much. What you want is for your code to work, not for it to work awfully.

You can safely use mutex locking as long as you use it everywhere the data MIGHT be accessed or updated from multiple threads. For this example, that means adding to, removing from, and traversing your list must all be mutex-guarded.
You also have to be careful that the objects you're storing in your queue are properly protected.
Also, please please please do not use char *. That's what std::string is for.
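To illustrate both points, here is a minimal sketch of such a mutex-guarded queue (the class name and method set are my own, not from the question; note the std::string payload, which makes the size/pointer consistency worry from the question go away entirely):
#include <mutex>
#include <queue>
#include <string>

struct Message {
    std::string buf;    // owns its memory; no separate size field to go stale
};

class MessageQueue {
    std::queue<Message> q;
    std::mutex m;       // guards every access, reads included
public:
    void push(Message msg) {
        std::lock_guard<std::mutex> lock(m);
        q.push(std::move(msg));
    }
    bool try_pop(Message& out) {    // returns false if the queue is empty
        std::lock_guard<std::mutex> lock(m);
        if (q.empty()) return false;
        out = std::move(q.front());
        q.pop();
        return true;
    }
};
The mutex acquire/release pair also gives you the memory-visibility guarantees the question worries about, so no volatile is needed.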

Related

Simple mutex protected queue vs. thread safe queue in C++

I want to use a simple thread-safe std::queue in my program, where multiple threads access the same queue. The first thing that came to my mind is protecting the queue operations with a mutex, as below:
/* Enqueue */
mutex.lock();
queue.push(item);
mutex.unlock();
/* Dequeue */
mutex.lock();
val = queue.front();   // read the front element
queue.pop();           // remove it while still holding the lock
mutex.unlock();
/* ...process val... */
I've seen many robust thread-safe queue implementations that use condition variables, e.g. https://stackoverflow.com/a/16075550/3598205. Will there be a significant difference in performance if only two threads access the same queue?
Mutexes and condition variables do two different things, although they are often used together.
To ensure that only one thread can access a resource at a time, use a mutex. The code you posted shows an example of this.
To block a worker thread until there is something for it to do, have it wait on a condition variable (which is then signalled by another thread providing some kind of work item). There is an example of this over at cppreference.
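For reference, a minimal sketch of that pattern, assuming C++11 (the class name is illustrative; the answer linked in the question is more complete):
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class BlockingQueue {
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(T v) {
        {
            std::lock_guard<std::mutex> lock(m);
            q.push(std::move(v));
        }
        cv.notify_one();            // wake one waiting consumer
    }
    T pop() {                       // blocks until an item is available
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); });
        T v = std::move(q.front());
        q.pop();
        return v;
    }
};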
Your first thought when writing multi-threaded code should be to write robust, safe code. It's very easy to make mistakes, especially if you're new to the area, and bugs are very hard to diagnose since they lead to sporadic, unpredictable errors. Worry about performance later.
Your application will be limited by one main thing. We call this the bottleneck. In my experience, 90% of the applications I've seen were limited by bus bandwidth, i.e. transferring data between main memory and the CPU.
Different apps will have different bottlenecks. It could be GPU performance, or disk access. It could be raw CPU power, or maybe contention accessing the queue above.
Very likely, the mutex will be just as good as the fancy lock-free queue. But you don't know until you profile.
It might very well happen that your application is strictly limited by the access to this queue. For example, if your app is a low-latency market data exchange for a financial institution and the queue is holding the buy/sell directives, then it will make a difference. The two threads could be constantly writing to different locations on a queue that has a couple hundred items (so, on different memory pages).
Or it might be that your application is always waiting on the GPU to render frames and the queue holds the player weapon changes that rendering and gameplay threads access just a couple times per frame.
Profile and check.

Is it safe to kill a thread that is executing a memcpy?

Context:
I am working on an application that needs fast access to large files, so I use memory-mapping. Reading and writing consequently become a simple memcpy. I am now trying to add the ability to abort any reads or writes in progress.
The first thing that came to mind (since I don't know of any interruptible memcpy function) was to periodically memcpy a few KB and check whether the operation should be aborted. This should ensure a near-instantaneous abort, if the copy is reasonably fast.
If it isn't, however, the application still shouldn't take ages to abort, so my second idea was using multithreading. The memcpy happens in its own thread, and a controlling thread uses WaitForMultipleObjects on an event that signals abortion and on the memcpy thread. It would then kill the memcpy thread if the abortion event was signaled. However, the documentation on TerminateThread states that one should be absolutely sure that one does not leave the system in a bad state, e.g. by not releasing resources.
Question:
Does a memcpy do anything that would make it unsafe to kill it while copying mapped memory? Is it safe to do so? Is it implementation-dependent (on operating systems/architectures other than Windows x86-64)?
I do realize that the second approach may be complete overkill, since no 1 KB read/write is ever going to take that long realistically, but I just want to be safe.
If at all possible you should choose a different design; TerminateThread should not be thought of as a normal function, it is more of a debugging/power tool.
I would recommend that you create a wrapper around memcpy that copies in chunks. The chunk size is really up to you, depends on your responsiveness requirements. 1 MiB is probably a good starting point.
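A sketch of such a wrapper, assuming an atomic flag is used to request the abort (the chunk size, function name, and flag are illustrative choices, not requirements):
#include <atomic>
#include <cstddef>
#include <cstring>

// Copies in chunks, checking the abort flag between chunks. On abort the
// destination range must be treated as undefined by the caller.
bool abortable_memcpy(void* dst, const void* src, std::size_t size,
                      const std::atomic<bool>& abortRequested)
{
    const std::size_t kChunk = 1024 * 1024;    // 1 MiB per iteration
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    while (size > 0) {
        if (abortRequested.load(std::memory_order_relaxed))
            return false;
        std::size_t n = size < kChunk ? size : kChunk;
        std::memcpy(d, s, n);
        d += n; s += n; size -= n;
    }
    return true;
}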
If you absolutely want to kill threads you have to take a couple of things into account:
You obviously don't know anything about how memcpy works internally, nor how much it has copied, so you have to assume that the whole destination range is undefined when you abort.
Terminating a thread will leak memory on some versions of Windows. There are workarounds for that.
Don't hold any locks in the thread.

C/C++ - ring buffer in shared memory (POSIX compatible)

I have an application where producers and consumers ("clients") want to send broadcast messages to each other, i.e. an n:m relationship. All of them could be different programs, so they are separate processes, not threads.
To reduce the n:m to something more maintainable, I was thinking of a setup like introducing a little central server. That server would offer a socket that each client connects to.
Each client would send a new message through that socket to the server - resulting in 1:n.
The server would also offer shared memory that is read-only for the clients. It would be organized as a ring buffer where new messages are added by the server, overwriting older ones.
This would give the clients some time to process a message - but if a client is too slow, that's bad luck; the message wouldn't be relevant anymore anyway...
The advantage I see in this approach is that I avoid synchronisation as well as unnecessary data copying and buffer hierarchies - the central one should be enough, shouldn't it?
That's the architecture so far - I hope it makes sense...
Now to the more interesting aspect of implementing that:
The index of the newest element in the ring buffer is a variable in shared memory and the clients would just have to wait till it changes. Instead of a stupid while( central_index == my_last_processed_index ) { /* do nothing */ } I want to free CPU resources, e.g. by using a pthread_cond_wait().
But that needs a mutex that I think I don't need - on the other hand, "Why do pthreads' condition variable functions require a mutex?" gave me the impression that I'd better ask whether my architecture makes sense and can be implemented like that...
Can you give me a hint if all of that makes sense and could work?
(Side note: the client programs could also be written in the common scripting languages like Perl and Python. So the communication with the server has to be recreated there and thus shouldn't be too complicated or even proprietary)
If memory serves, the reason a mutex accompanies a condition variable is that under POSIX, signalling the condition variable causes the kernel to wake up all waiters on it. In these circumstances, the first thing consumer threads need to do is check that there is actually something to consume - by accessing a variable shared between producer and consumer threads. The mutex protects against concurrent access to the variable used for this purpose. This of course means that if there are many consumers, n-1 of them are needlessly awoken.
Having implemented precisely the arrangement described above, I can say the choice of IPC object to use is not obvious. We were buffering audio between high-priority real-time threads in separate processes, and didn't want to block the consumer. As the audio was produced and consumed in real time, we were already getting scheduled regularly on both ends, and if there wasn't anything to consume (or space to produce into) we trashed the data, because we'd already missed the deadline.
In the arrangement you describe, you will need a mutex to prevent the consumers concurrently consuming items that are queued (and believe me, on a lightly loaded SMP system, they will). However, you don't need to have the producer contend on it as well.
I don't understand your comment about the consumer having read-only access to the shared memory. In the classic lockless ring-buffer implementation, the producer writes the queue tail pointer and the consumer(s) the head - while all parties need to be able to read both.
You might of course arrange for the queue head and tail to be in a different shared memory region from the queue data itself.
Also be aware that there is a theoretical data-coherency hazard on SMP systems when implementing a ring buffer like this - namely that write-back to memory of the queue content may occur out of order with respect to the head or tail pointer (they sit in cache, usually per CPU core). There are other variants on this theme to do with synchronization of caches between CPUs. To guard against these, you need memory barriers (load and store) to enforce ordering. See Memory Barrier on Wikipedia. You avoid this hazard explicitly by using kernel synchronisation primitives such as mutexes and condition variables.
The C11 atomic operations can help with this.
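As an illustration, a sketch of a single-producer/single-consumer ring buffer using C++11 atomics (the C11 equivalents work the same way); this shows the acquire/release ordering that replaces explicit barriers, and is a sketch of the idea, not production code:
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class SpscRing {
    T buf[N];
    std::atomic<std::size_t> head{0};   // advanced by the consumer
    std::atomic<std::size_t> tail{0};   // advanced by the producer
public:
    bool push(const T& v) {             // producer only
        std::size_t t = tail.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head.load(std::memory_order_acquire))
            return false;               // full (one slot kept empty)
        buf[t] = v;
        // The release store is the barrier: the item is fully written
        // before the new tail becomes visible to the consumer.
        tail.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {                  // consumer only
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;               // empty
        out = buf[h];
        head.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};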
You do need a mutex with pthread_cond_wait(), as far as I know. The reason is that checking the condition and going to sleep is not one atomic step: without the mutex, the condition could change between your check and the wait, and you could sleep through a wakeup.
It's possible that you can ignore this situation - the client might sleep past message 1, but when the subsequent message is sent then the client will wake up and find two messages to process. If that's unacceptable then use a mutex.
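The canonical usage looks like this, applied to the central_index from the question (a sketch; the function names are mine):
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
int central_index = 0;                  // shared state the waiters watch

void wait_for_new_message(int* my_last_processed_index) {
    pthread_mutex_lock(&lock);
    while (central_index == *my_last_processed_index) {
        // Atomically releases the mutex and sleeps; the mutex is
        // reacquired before the call returns.
        pthread_cond_wait(&cond, &lock);
    }
    *my_last_processed_index = central_index;
    pthread_mutex_unlock(&lock);
}

void publish_new_message(void) {
    pthread_mutex_lock(&lock);
    ++central_index;
    pthread_mutex_unlock(&lock);
    pthread_cond_broadcast(&cond);      // wake all waiting consumers
}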
You could perhaps use a somewhat different design based on sem_t, if your system has it; some POSIX systems are still stuck on the 2001 version of POSIX.
You don't necessarily need a mutex/condition pair; that is just how it was designed for POSIX a long time ago.
Modern C (C11) and C++ (C++11) now bring you atomic operations, a feature implemented in all modern processors but long lacking support in higher-level languages. Atomic operations are part of the answer for resolving the race condition in a ring buffer such as you want to implement. But they are not sufficient on their own, because with them you can only do an active wait through polling, which is probably not what you want.
Linux, as an extension to POSIX, has the futex, which solves both problems: it avoids races for updates by using atomic operations, and it can put waiters to sleep via a system call. Futexes are often considered too low-level for everyday programming, but I think they actually aren't too difficult to use. I have written up some notes here.
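To give a flavour of it, a sketch of the two raw calls on Linux (there is no glibc wrapper, so you go through syscall(2); this assumes the counter lives in the shared-memory segment, and the wrapper names are mine):
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// FUTEX_WAIT sleeps only if *addr still equals `expected`, which closes
// the lost-wakeup race between checking the value and going to sleep.
long futex_wait(std::atomic<int>* addr, int expected) {
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, nullptr, nullptr, 0);
}

long futex_wake(std::atomic<int>* addr, int nwaiters) {
    return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, nullptr, nullptr, 0);
}

// Consumer side: sleep until central_index moves past what we processed.
void wait_for_index(std::atomic<int>& central_index, int last_processed) {
    while (central_index.load(std::memory_order_acquire) == last_processed)
        futex_wait(&central_index, last_processed);
}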

STL containers thread-safety for producer/consumer pattern

I am planning to do the following:
store a deque of pre-built objects to be consumed. The main thread consumes these objects here and there. I have another junky thread used for logging and other non-time-critical but expensive things. When the pre-built objects run low, I refill the deque in the junky thread.
Now my question is, is there going to be a race condition here? Technically, one thread consumes objects from the front, and the other pushes objects onto the back. As long as I don't let the size run down to zero, it should be fine. The only thing that concerns me is the "size" of this deque. Do STL containers store an integer "size" variable? Could modifying that size variable from two threads introduce a race condition?
What's the best way to solve this problem? I don't really want to use locks, because the main thread is performance-critical (the reason I pre-built these objects in the first place!)
STL containers are not thread-safe, period; don't play with this. Specifically, deque elements are usually stored in a chain of short arrays, and that chain is modified when operating on the deque, so there's a lot of room for messing things up.
Another option would be to have two deques, one for reading and another for writing. The main thread reads, and the other writes. When the read deque is empty, switch the deques (just swap two pointers), which involves a lock, but only occasionally.
The consumer thread would drive the switch, so it would only need to take the lock when switching. The producer thread would need to lock per write in case the switch happens in the middle of a write, but as you mention the producer is less performance-critical, so no worries there.
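A sketch of the idea, assuming C++11 (the class name is illustrative):
#include <deque>
#include <mutex>
#include <utility>

template <typename T>
class SwappingQueue {
    std::deque<T> readQ;    // touched only by the consumer (main) thread
    std::deque<T> writeQ;   // filled by the producer (junky) thread
    std::mutex m;           // taken per push, and per swap
public:
    void push(T v) {                    // producer: locks on every write
        std::lock_guard<std::mutex> lock(m);
        writeQ.push_back(std::move(v));
    }
    bool pop(T& out) {                  // consumer: locks only to swap
        if (readQ.empty()) {
            std::lock_guard<std::mutex> lock(m);
            std::swap(readQ, writeQ);
        }
        if (readQ.empty()) return false;    // nothing produced yet
        out = std::move(readQ.front());
        readQ.pop_front();
        return true;
    }
};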
What you're suggesting regarding no locks is indeed dangerous as others mention.
As @sharptooth mentioned, STL containers aren't thread-safe. Are you using a C++11-capable compiler? If so, you could implement a lock-free queue using atomic types. Otherwise you'd need to use assembler for compare-and-swap, or use a platform-specific API (see here). See this question for information on how to do that.
I would emphasise that you should measure performance with standard thread synchronisation first, and see whether you actually need a lock-free technique.
There will be a data race even with a non-empty deque.
You'll have to protect all accesses (not just writes) to the deque with locks, or use a queue specifically designed for the producer/consumer model in a multi-threaded environment (such as Microsoft's unbounded_buffer).
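For what it's worth, unbounded_buffer comes from Microsoft's Asynchronous Agents Library (ConcRT), so it is Windows/MSVC-only; basic usage looks roughly like this:
#include <agents.h>

concurrency::unbounded_buffer<int> buf;

void producer() {
    concurrency::send(buf, 42);         // enqueue a message
}

void consumer() {
    int v = concurrency::receive(buf);  // blocks until a message arrives
    // ...use v...
}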

What's the advantage of a message queue over shared data in thread communication?

I read an article about multithreaded program design, http://drdobbs.com/architecture-and-design/215900465. It says it is a best practice to "replace shared data with asynchronous messages. As much as possible, prefer to keep each thread's data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data".
What confuses me is that I don't see the difference between using shared data and message queues. I am now working on a non-GUI project on Windows, so let's use Windows message queues, and take the traditional producer-consumer problem as an example.
Using shared data, there would be a shared container and a lock guarding the container between the producer thread and the consumer thread. When the producer outputs a product, it first waits for the lock, then writes something to the container, then releases the lock.
Using a message queue, the producer could simply call PostThreadMessage without blocking. And that is the advantage of asynchronous messages. But I think there must exist some lock guarding the message queue between the two threads, otherwise the data would definitely be corrupted. The PostThreadMessage call just hides the details. I don't know whether my guess is right, but if it is, the advantage seems to no longer exist, since both methods do the same thing and the only difference is that the system hides the details when using message queues.
P.S. Maybe the message queue uses a non-blocking container, but I could use a concurrent container in the former approach too. I want to know how the message queue is implemented, and is there any performance difference between the two ways?
Update:
I still don't get the concept of asynchronous messages if the message queue operations still block somewhere else. Correct me if my guess is wrong: when we use shared containers and locks, we block in our own thread; but when using message queues, my own thread returns immediately and the blocking work is left to some system thread.
Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. It's much easier to implement than shared memory for intercomputer communication. Also, as you've already noticed, message passing has the advantage that application developers don't need to worry about the details of protections like shared memory.
Shared memory allows maximum speed and convenience of communication, as it can be done at memory speeds within a computer. Shared memory is usually faster than message passing, as message passing is typically implemented using system calls and thus requires the more time-consuming task of kernel intervention. In contrast, in shared-memory systems, system calls are required only to establish the shared-memory regions. Once established, all accesses are treated as normal memory accesses without extra assistance from the kernel.
Edit: One case where you might want to implement your own queue is when there are lots of messages to be produced and consumed, e.g. a logging system. With PostThreadMessage, the queue capacity is fixed. Messages will most likely get lost if that capacity is exceeded.
Imagine you have 1 thread producing data and 4 threads processing that data (presumably to make use of a multi-core machine). If you have one big global pool of data, you are likely to have to lock it whenever any of the threads needs access, potentially blocking the 3 other threads. As you add more processing threads, you increase the chance of a lock having to wait and increase how many things may have to wait. Eventually, adding more threads achieves nothing because all you do is spend more time blocking.
If instead you have one thread sending messages into message queues, one for each consumer thread, then the consumers can't block each other. You still have to lock each queue between the producer and its consumer thread, but since every thread has a separate queue there is a separate lock, and no thread can block all the others while waiting for data.
If you suddenly get a 32-core machine, you can add 20 more processing threads (and queues) and expect performance to scale fairly linearly, unlike the first case where the new threads would just run into each other all the time.
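A sketch of that fan-out arrangement, assuming C++11 (the round-robin dispatch and all names are illustrative, one possible policy among many):
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

template <typename T>
struct WorkerQueue {        // one per consumer: a private lock, no cross-talk
    std::queue<T> q;
    std::mutex m;
    std::condition_variable cv;
};

template <typename T>
void dispatch(std::vector<WorkerQueue<T>>& queues, T item, std::size_t& next) {
    WorkerQueue<T>& wq = queues[next++ % queues.size()];
    {
        std::lock_guard<std::mutex> lock(wq.m);
        wq.q.push(std::move(item));
    }
    wq.cv.notify_one();     // only this worker contends with the producer
}

template <typename T>
T take(WorkerQueue<T>& wq) {            // run by the owning worker thread
    std::unique_lock<std::mutex> lock(wq.m);
    wq.cv.wait(lock, [&wq] { return !wq.q.empty(); });
    T v = std::move(wq.q.front());
    wq.q.pop();
    return v;
}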
I have used a shared-memory model where the pointers to the shared memory are managed in a message queue with careful locking. In a sense, this is a hybrid between a message queue and shared memory. It is very useful when large quantities of data must be passed between threads while retaining the safety of the message queue.
The entire queue can be packaged in a single C++ class with appropriate locking and the like. The key is that the queue owns the shared storage and takes care of the locking. Producers acquire a lock for input to the queue and receive a pointer to the next available storage chunk (usually an object of some sort), populate it, and release it. The consumer blocks until the next shared object has been released by the producer. It can then acquire a lock on the storage, process the data, and release it back to the pool. A suitably designed queue can perform multiple-producer/multiple-consumer operations with great efficiency. Think of Java's thread-safe java.util.concurrent.BlockingQueue semantics, but for pointers to storage.
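A condensed sketch of such a hybrid (the Chunk type and all method names are mine, not from the description above; a single condition variable serves both directions for brevity):
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>

struct Chunk {
    std::unique_ptr<char[]> data;   // the shared storage a message lives in
    std::size_t used = 0;
};

class PooledQueue {
    std::queue<Chunk*> freePool;    // empty chunks, owned by the queue
    std::queue<Chunk*> filled;      // chunks populated by producers
    std::mutex m;
    std::condition_variable cv;
public:
    Chunk* acquire() {              // producer: blocks until storage is free
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !freePool.empty(); });
        Chunk* c = freePool.front(); freePool.pop();
        return c;
    }
    void submit(Chunk* c) {         // producer: hand a filled chunk over
        { std::lock_guard<std::mutex> lock(m); filled.push(c); }
        cv.notify_all();
    }
    Chunk* consume() {              // consumer: blocks until data is ready
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !filled.empty(); });
        Chunk* c = filled.front(); filled.pop();
        return c;
    }
    void release(Chunk* c) {        // consumer: return storage to the pool
        { std::lock_guard<std::mutex> lock(m); freePool.push(c); }
        cv.notify_all();
    }
};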
Of course there is "shared data" when you pass messages. After all, the message itself is some sort of data. However, the important distinction is that when you pass a message, the consumer receives a copy.
the PostThreadMessage call just hides the details
Yes, it does, but being a WINAPI call, you can be reasonably sure that it does it right.
I still don't get the concept of asynchronous messages if the message queue operations still block somewhere else.
The advantage is more safety. You have a locking mechanism that is systematically enforced when you are passing a message. You don't even need to think about it, you can't forget to lock. Given that multi-thread bugs are some of the nastiest ones (think of race conditions), this is very important. Message passing is a higher level of abstraction built on locks.
The disadvantage is that passing large amounts of data would probably be slow. In that case, you need to use shared memory.
For passing state (e.g. a worker thread reporting progress to the GUI), messages are the way to go.
It's quite simple (I'm amazed others wrote such length responses!):
Using a message queue system instead of 'raw' shared data means that you have to get the synchronization (locking/unlocking of resources) right only once, in a central place.
With a message-based system, you can think in higher terms of "messages" without having to worry about synchronization issues anymore. For what it's worth, it's perfectly possible that a message queue is implemented using shared data internally.
I think this is the key piece of info there: "As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data". I.e. use producer-consumer :)
You can do your own message passing or use something provided by the OS. That's an implementation detail (it needs to be done right, of course). The key is to avoid shared data, as in having the same region of memory modified by multiple threads. This can cause hard-to-find bugs, and even if the code is perfect, it will eat performance because of all the locking.
I had exactly the same question. After reading the answers, I feel:
in most typical use cases, queue = async, shared memory (locks) = sync. Indeed, you can do an async version of shared memory, but that's more code, similar to reinventing the message-passing wheel.
Less code = fewer bugs and more time to focus on other stuff.
The pros and cons are already mentioned in previous answers, so I will not repeat them.