Cache efficiency with static member in thread

I'm currently making an application with multiple worker threads running in parallel. The main part of the program executes before the workers, and each worker is put to sleep when it has finished its tasks:
MainLoop()
{
    // ...
    SoundManager::PlaySound("sound1.mp3"); // Add a sound to be played; the sound is stored in a list in SoundManager
    SoundManager::PlaySound("sound2.mp3");
    SoundManager::PlaySound("sound3.mp3");
    // ...
    SoundThreadWorker.RunJob();        // Wake up the thread and play every sound pushed into SoundManager
    // Running other threads
    SoundThreadWorker.WaitForFinish(); // Wait until the thread has finished its tasks; the thread is put to sleep (but not closed)
    // Waiting for other threads
    // ...
}
// In the SoundThreadWorker class, running in a different thread from the main loop
RunJob()
{
    SoundManager::PlayAllSound(); // Play all sounds stored in SoundManager
}
In this case, the static variable storing all the sounds should be safe, because no sounds are added while the thread is running.
Is this cache efficient?
I have read the following here: https://www.agner.org/optimize/optimizing_cpp.pdf
"The different threads need separate storage. No function or class
that is used by multiple threads should rely on static or global
variables. (See thread-local storage p. 28) The threads have each
their stack. This can cause cache contentions if the threads share
the same cache."
I have a hard time understanding how static variables are stored in cache and how they are used by each thread. Do I have two instances of SoundManager in cache, since threads do not share their stacks? Do I need to create shared memory to avoid this problem?

That passage is about memory that is changed, not about memory that remains constant. Sharing constants between threads is fine.
When you have multiple CPUs each updating the same place, they have to be sending their changes back and forth to each other all the time. This results in contention for 'owning' a particular piece of memory.
Often the ownership isn't explicit. But when one CPU tells all the others that a particular cache line needs to be invalidated because it just changed something there, all the other CPUs have to evict the value from their caches. The effect is that the CPU that last modified a piece of memory effectively 'owns' the cache line it was in.
And, again, this is only an issue for things that are changed.
Also, the view of memory and cache that I gave you is rather simplistic. Please don't use it when reasoning about the thread safety of a particular piece of code. It's sufficient to understand why multiple CPUs updating the same piece of memory is bad for your cache, but it's not sufficient for understanding which CPU's version of a particular memory location ends up being used by the others.
A memory location that doesn't change during the lifetime of a thread being used by multiple threads will result in that memory location appearing in multiple CPU caches. But this isn't a problem. Nor is it a problem for a particular memory location that doesn't change to be stored in the L2 and L3 caches that are shared between CPUs.
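A minimal sketch of the distinction the answer draws, with illustrative names (not the question's SoundManager): read-only data shared by every thread can be cached by each core with no contention, while per-thread mutable state is padded onto separate cache lines so one thread's writes never invalidate the line another thread is using.

#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Read-only after startup: every core may keep its own cached copy, no contention.
static const std::vector<int> lookupTable = {1, 2, 3, 4};

// Mutable per-thread state: alignas(64) pads each counter to its own cache line
// (64 bytes is a common cache-line size, assumed here rather than queried from the CPU).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

void Worker(PaddedCounter& mine) {
    for (int x : lookupTable)   // shared reads of constant data: fine
        mine.value += x;        // writes stay on this thread's own cache line
}

int main() {
    PaddedCounter counters[2];
    std::thread a(Worker, std::ref(counters[0]));
    std::thread b(Worker, std::ref(counters[1]));
    a.join();
    b.join();
}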


How to solve deadlock in multiple mutexes

I have code that requires locking multiple mutexes.
void AttackAoeRequest(Player* attacker, int range)
{
    std::lock_guard<std::mutex> lk_attacker(attacker->mtx);
    if (attacker->isInVehicle)
    {
        return;
    }
    // There is a lot of code that runs checks before the loop, and this code
    // needs to access the attacker's properties.
    // s_map is the global map class that contains all players in the map.
    for (Player* defender : s_map.GetAllPlayers())
    {
        if (attacker == defender) continue;
        std::lock_guard<std::mutex> lk_defender(defender->mtx);
        if (GetDistance(attacker->position, defender->position) <= 5)
        {
            printf("%d attack %d damage : %d\n", attacker->id, defender->id
                , attacker->attackUpgrade - defender->defenseUpgrade);
        }
    }
}
A deadlock occurs when a player is the attacker and the defender at the same time.
e.g.
//playerA and playerB are in the global map class.
std::thread threadA = std::thread(AttackAoeRequest, &playerA, 5);
std::thread threadB = std::thread(AttackAoeRequest, &playerB, 5);
UPDATE
Actually, threadA and threadB above just illustrate the situation that causes the deadlock.
AttackAoeRequest is called from multithreaded networking code.
The networking code handles messages from clients and calls AttackAoeRequest. There might be a situation where clientA (playerA) and clientB (playerB) attack each other.
As the code shows, a player might be the attacker and the defender at the same time, and this causes the deadlock.
I have read about std::lock for locking multiple mutexes at the same time, but in this case the mutexes aren't locked at the same time.
Presumably who is "attacker" and who is "defender" is very fluid, and so you are getting opposite locking order issues.
One defense against deadlocks is to write the code so that it avoids holding multiple locks at the same time. Or, going the other way, make the locking more coarse-grained so that a single lock covers all the objects.
If you have to lock an attacker and a defender, you could have the code always do it in the same order. For instance, by address: the object with the lower address in memory is locked first, then the higher one. Acquire both locks this way, and then execute all the code that has to work with both of them.
You could have some scoped lock for this which takes two objects. Make a template class supporting lock_double_guard<std::mutex> dbl_lk(attacker->mtx, defender->mtx); which puts the two objects in sorted order, and locks them in that order.
In C and C++, pointers to distinct objects may not be compared other than for exact equality, but being able to do ptrObj1 < ptrObj2 is a common extension (and std::less on pointers is guaranteed to give a total order). If that makes you nervous, you could just assign an unsigned integer serial number to each object, incremented whenever a new object is made. The object with the lower serial number is locked first.
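Here is a minimal sketch of such a guard, assuming the Player objects expose their mtx member as in the question; std::less is used for the comparison because it gives a guaranteed total order over pointers, unlike the built-in <. Note also that since C++17, std::scoped_lock lk(attacker->mtx, defender->mtx); achieves the same deadlock avoidance without manual ordering.

#include <functional>
#include <mutex>

// Sketch only: locks two distinct mutexes in a globally consistent order (by address).
template <typename Mutex>
class lock_double_guard {
public:
    lock_double_guard(Mutex& a, Mutex& b)
        : first_(std::less<Mutex*>{}(&a, &b) ? a : b),
          second_(std::less<Mutex*>{}(&a, &b) ? b : a) {
        first_.lock();
        second_.lock();
    }
    ~lock_double_guard() {
        second_.unlock();
        first_.unlock();
    }
    lock_double_guard(const lock_double_guard&) = delete;
    lock_double_guard& operator=(const lock_double_guard&) = delete;
private:
    Mutex& first_;
    Mutex& second_;
};

// Usage, as suggested above:
//     lock_double_guard<std::mutex> dbl_lk(attacker->mtx, defender->mtx);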
There is no universal answer to your question. You will have to evaluate what makes most sense in your design and possibly redesign your code. Here are a few avenues to explore:
Avoid locking in the first place. Use atomics and lock-free techniques to work with player structures. This is not always easy or even possible to do, but may provide good performance.
Make locking more coarse grained. For example, don't lock individual players, instead lock all players with a single lock. This, obviously, limits parallelism, but this may not be an issue in your code at a large scale.
Avoid locking multiple players at the same time. For example, complete everything you need to do with the attacker in AttackAoeRequest, release lk_attacker, and then proceed to iterate over defenders. Copy/cache the necessary data from the attacker if you have to, so that you avoid accessing the attacker during iteration. Your design should tolerate that some of the cached data may become stale during iteration if another thread modifies the attacker while you're iterating.
Introduce asynchronicity or retries. For example, try locking the defender opportunistically, using try_lock (see the sketch after this list). If it fails, postpone processing that player and go on with the rest. After you've completed the iteration, release all locks and retry the whole operation on the leftover defenders a bit later. Hopefully, by that time the other threads will have completed their work with those defenders and released their locks. You may need to redo some work on the attacker on the retry, or reuse the previously cached data.
Separate player processing onto different threads. Or, more generally, make sure that a given player is never accessed by multiple threads concurrently. Use message passing between threads to implement interaction between players. The message-passing mechanism does not need to lock any players; in fact, locking the players should not be necessary at all. This also introduces some asynchronicity, in the sense that the effects of AttackAoeRequest may be applied to defenders with a delay, when the corresponding thread processes the damage notifications from the attacker.
I'm sure there are other ideas as well.
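As an illustration of the try_lock idea above, here is a hedged sketch reworking the question's AttackAoeRequest (it assumes the Player type and s_map from the question; ApplyDamage is a hypothetical helper standing in for the distance check and printf, and the retry here simply happens right after the first pass):

#include <mutex>
#include <vector>

// Hypothetical helper wrapping the range check and damage printf from the question.
void ApplyDamage(Player* attacker, Player* defender, int range);

void AttackAoeRequest(Player* attacker, int range)
{
    std::vector<Player*> deferred;
    {
        std::lock_guard<std::mutex> lk_attacker(attacker->mtx);
        for (Player* defender : s_map.GetAllPlayers())
        {
            if (attacker == defender) continue;
            // Opportunistic lock: never block on a defender while holding attacker->mtx.
            std::unique_lock<std::mutex> lk_defender(defender->mtx, std::try_to_lock);
            if (!lk_defender.owns_lock())
            {
                deferred.push_back(defender); // busy right now, handle it later
                continue;
            }
            ApplyDamage(attacker, defender, range);
        }
    } // attacker->mtx released here
    // Retry the leftovers; std::scoped_lock (C++17) locks both mutexes without deadlock.
    for (Player* defender : deferred)
    {
        std::scoped_lock lk(attacker->mtx, defender->mtx);
        ApplyDamage(attacker, defender, range);
    }
}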

Synchronization of particular shared memory write operations (MPI)

To keep things simple, and in order to concentrate on the core of my problem, let's assume that a memory location, addressed locally by a pointer variable ptr, is shared among several processes. In particular, I use MPI shared memory windows in C/C++ to allocate and share the memory. To be concrete, let's say ptr references a floating-point variable, so locally we have
float* ptr;
Now assume that all processes attempt to write the same value const float f to ptr, i.e.
*ptr = f;
My question is: does this operation require synchronization, or can it be executed concurrently, given that all processes attempt to modify the bytes in the same way, i.e. given that f has the same value for every process? My question therefore boils down to: for concurrent write operations to, e.g., floating-point variables, is there the possibility that the race condition results in an inconsistent byte pattern, even though every process attempts to modify the memory in the same way? I.e., if I know for sure that every process writes the same data, can I then omit synchronization?
Yes, you must synchronize the shared memory. The fact that the modifying threads reside in different processes has no meaning; it is still a data race (writing to shared memory from different threads).
Do note that there are other problems that synchronization objects solve, like visibility and memory reordering; what is written to the shared memory is irrelevant.
Currently, the standard does not define the idea of a process (only a thread), and does not provide any easy means of synchronizing between processes.
You can allocate a std::mutex in shared memory and use that as your synchronization primitive, or rely on Win32 inter-process synchronization primitives like a mutex, semaphore or event.
Alternatively, if you only want to synchronize a primitive value, you can allocate a std::atomic<T> in shared memory and use that as your synchronized primitive.
In C++, if multiple processes write to the same memory location without proper use of synchronization primitives or atomic operations, undefined behavior occurs. (That is, it might work, it might not work, the computer might catch on fire.)
In practice, on your computer, it's basically certain to work the way you think it should work. It actually is plausible that on some architectures things don't go the way you expect, though: If the CPU cannot read/write a block of memory as small as your shared value, or if the storage of the shared value crosses an alignment boundary, such a write can actually involve a read as well, and that read-modify-write can have the effect of reverting or corrupting other changes to memory.
The easiest way to get what you want is simply to do the write as a "relaxed" atomic operation. Since ptr is a plain float*, C++20's std::atomic_ref gives you an atomic view of it:
std::atomic_ref<float>(*ptr).store(f, std::memory_order_relaxed);
That ensures that the write is "atomic" in the sense of not causing a data race, and won't incur any overhead except on architectures where there would be potential problems with *ptr = f.
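Putting both answers together, here is a hedged sketch of the usual MPI-3 shared-memory pattern: rank 0 allocates the window, the other ranks query its base address, and every rank then performs the relaxed atomic store. Placing a std::atomic<float> in the window (rather than a plain float) is an illustrative choice, not something prescribed by the question.

#include <mpi.h>
#include <atomic>
#include <new>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    // Communicator containing only ranks that can share memory (same node).
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    int rank;
    MPI_Comm_rank(node_comm, &rank);

    std::atomic<float>* ptr = nullptr;
    MPI_Win win;
    MPI_Aint bytes = (rank == 0) ? sizeof(std::atomic<float>) : 0;
    MPI_Win_allocate_shared(bytes, 1, MPI_INFO_NULL, node_comm, &ptr, &win);
    if (rank != 0) {
        MPI_Aint size; int disp_unit;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &ptr); // point at rank 0's memory
    }
    if (rank == 0) new (ptr) std::atomic<float>(0.0f); // construct the atomic once
    MPI_Barrier(node_comm);                            // everyone sees the constructed object

    const float f = 3.14f;
    ptr->store(f, std::memory_order_relaxed);          // concurrent same-value writes, no data race

    MPI_Barrier(node_comm);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}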

C++ constructor memory synchronization

Assume that I have code like:
void InitializeComplexClass(ComplexClass* c);
class Foo {
public:
Foo() {
i = 0;
InitializeComplexClass(&c);
}
private:
ComplexClass c;
int i;
};
If I now do something like Foo f; and hand a pointer to f over to another thread, what guarantees do I have that any stores done by InitializeComplexClass() will be visible to the CPU executing the other thread that accesses f? What about the store writing zero into i? Would I have to add a mutex to the class, take a writer lock on it in the constructor and take corresponding reader locks in any methods that access the members?
Update: Assume I hand a pointer over to a bunch of other threads once the constructor has returned. I'm not assuming that the code is running on x86; it could instead be running on something like PowerPC, which has a lot of freedom to do memory reordering. I'm essentially interested in what sorts of memory barriers the compiler has to inject into the code when the constructor returns.
In order for the other thread to be able to know about your new object, you have to hand over the object / signal the other thread somehow. To signal a thread, you write to memory. Both x86 and x64 perform all memory writes in order; the CPU does not reorder these operations with regard to each other. This is called "Total Store Ordering", so the CPU write queue works like "first in, first out".
Given that you create an object first and then pass it on to another thread, these changes to memory also occur in order, and the other thread will always see them in the same order. By the time the other thread learns about the new object, the contents of the object are guaranteed to have been available to that thread even earlier (if the thread had somehow known where to look).
In conclusion, you do not have to synchronise anything this time. Handing over the object after it has been initialised is all the synchronisation you need.
Update: On non-TSO architectures you do not have this TSO guarantee, so you need to synchronise. Use the MemoryBarrier() macro (or any interlocked operation), or some synchronisation API. Signalling the other thread through the corresponding API also performs synchronisation; otherwise it would not be a synchronisation API.
x86 and x64 CPUs may reorder writes past reads, but that is not relevant here. Just for better understanding: writes can be ordered after reads because writes to memory go through a write queue, and flushing that queue may take some time. On the other hand, the read cache is always consistent with the latest updates from other processors (that have gone through their own write queues).
This topic has been made unbelievably confusing for so many, but in the end there are only a couple of things an x86/x64 programmer has to worry about:
- First, the existence of the write queue (and one should not be worried about the read cache at all!).
- Second, concurrent writing and reading of the same variable from different threads when the variable is not of atomic length, which may cause data tearing; for that case you need synchronisation mechanisms.
- And finally, concurrent updates to the same variable from multiple threads, for which we have interlocked operations, or again synchronisation mechanisms.
If you do:
Foo f;
// HERE: InitializeComplexClass() and the "i" member init are guaranteed to be completed
passToOtherThread(&f);
/* From this point, you cannot guarantee the state/members
   of 'f', since another thread can modify it */
If you're passing an instance pointer to another thread, you need to implement guards so that both threads can safely interact with the same instance. If you ONLY plan to use the instance on the other thread, you do not need guards. However, do not pass a pointer to a stack object like in your example; pass a newly allocated instance instead:
passToOtherThread(new Foo());
And make sure to delete it when you are done with it.
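A small sketch of one common way the hand-off is done, assuming the Foo class above and a hypothetical consumer function: the completion of the std::thread constructor synchronizes-with the start of the thread function, so the fully constructed object is visible to the other thread without extra barriers.

#include <memory>
#include <thread>

void consumer(Foo* f); // hypothetical worker that runs on the other thread

int main()
{
    auto f = std::make_unique<Foo>(); // constructor runs entirely on this thread
    std::thread t(consumer, f.get()); // hand-off: the thread start provides the synchronization
    t.join();                         // keep f alive until the worker is done with it
    return 0;
}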

Is double-check locking safe in C++ for unidirectional data transfer?

I have inherited an application whose performance I'm trying to improve. It currently uses mutexes (std::lock_guard<std::mutex>) to transfer data from one thread to another. One thread is a low-frequency (slow) one which simply modifies the data to be used by the other.
The other thread (which we'll call fast) has rather stringent performance requirements (it needs to do the maximum number of cycles per second possible), and we believe this is being impacted by the use of the mutexes.
Basically, the current logic is:
slow thread:          fast thread:
  occasionally:         very-often:
    claim mutex           claim mutex
    change data           use data
    release mutex         release mutex
In order to get the fast thread running at maximum throughput, I'd like to experiment with reducing the number of mutex locks it has to do.
I suspect a variation of the double-checked locking pattern may be of use here. I know it has serious issues with bidirectional data flow (or singleton creation), but the areas of responsibility in my case are a little more limited in terms of which thread performs which operations (and when) on the shared data.
Basically, the slow thread sets up the data and never reads or writes to it again unless a new change comes in. The fast thread uses and changes the data but never expects to pass any information back to the other thread. In other words, ownership mostly flows strictly one way.
I wanted to see if anyone could pick any holes in the strategy I'm thinking of.
The new idea is to have two sets of data, one current and one pending. There is no need for a queue in my case as incoming data overwrites previous data.
The pending data will only ever be written to by the slow thread under the control of the mutex and it will have an atomic flag to indicate that it has written and relinquished control (for now).
The fast thread will continue to use current data (without the mutex) until such time as the atomic flag is set. Since it is responsible for transferring pending to current, it can ensure the current data is always consistent.
At the point where the flag is set, it will lock the mutex, transfer pending to current, clear the flag, unlock the mutex and carry on.
So, basically, the fast thread runs at full speed and only does mutex locks when it knows the pending data needs to be transferred.
Getting into more concrete details, the class will have the following data members:
std::atomic_bool m_newDataReady;
std::mutex m_protectData;
MyStruct m_pendingData;
MyStruct m_currentData;
The method for receiving new data in the slow thread would be:
void NewData(const MyStruct &newData) {
    std::lock_guard<std::mutex> guard(m_protectData);
    m_newDataReady = false;
    Transfer(newData, m_pendingData); // copy newData into m_pendingData
    m_newDataReady = true;
}
Clearing the flag prevents the fast thread from even trying to check for new data until the immediate transfer operation is complete.
The fast thread is a little trickier, using the flag to keep mutex locks to a minimum:
while (true) {
    if (m_newDataReady) {
        std::lock_guard<std::mutex> guard(m_protectData);
        if (m_newDataReady) {
            Transfer(m_pendingData, m_currentData); // copy pending into current
            m_newDataReady = false;
        }
    }
    Use(m_currentData);
}
Now it appears to me that the use of this method in the fast thread could improve performance quite a bit:
There is only one place where the atomic flag is used outside the control of the mutex, and the fact that it's atomic means its state should be consistent there.
Even if it's not consistent, the second check inside the mutex-locked area should provide a safety valve (it's rechecked when we know it's consistent).
The transfer of data is only ever performed under the control of the mutex so that should always be consistent.
The outer check in the fast thread means that unnecessary mutex locks will be avoided: they'll only be attempted if the flag is true (or "half-true", a possibly inconsistent state).
The inner if takes care of that "half-true" possibility: between checking the flag and locking the mutex, the flag may have been cleared.
I can't see any holes in this strategy but, given I'm only just getting into atomics/threading in the standard-C++ world, it may be I'm missing something.
Are there any clear problems in using this method?

The significance of separate stack-space for threads

I have long known that threads each have separate stack space but share heap memory.
But I recently found some code that made me question exactly what that means.
Here is a shortened version of the code:
void SampleFunction()
{
    CRemoteMessage rMessage;
    rMessage.StartBackgroundAsync(); // Kick off a background thread.
    /* Do other long-running work here...
     * but don't leave function SampleFunction
     */
    rMessage.GetReply(); // Blocks if needed, but the message background work is mostly done by now.
    rMessage.ProcessReply();
}
In this code, rMessage is a local stack variable, but it spends most of its time being used by a background thread. Is this safe? How exactly is the background thread able to access a stack variable of this thread?
Generally speaking, the stack and the heap are both part of the process's memory space and can be shared between threads. Nothing prevents you from sharing stack-addressed variables.
Each thread, however, has its own set of registers, including a stack pointer (and related registers), so separate stacks can be maintained where needed (otherwise it would be impossible) and each thread can call functions and do whatever it needs. You can choose to break this separation if you want.
I think the confusion here is that you think of a thread's stack as a separate entity that can only be accessed by that one thread. That's not how this works.
Every process has one large memory space for its use, and every thread can read (and write!) everything in this space; the separation into stack space and heap is a higher-level design decision. For the background thread it doesn't matter whether the memory it receives is allocated on another thread's stack or on the heap.
There are even rare situations where you want to create a new stack for a thread yourself; it makes no difference to the thread itself.
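A minimal sketch of the point being made, with illustrative names rather than the question's CRemoteMessage: a worker thread writes to a variable that lives on the main thread's stack, which is safe only because join() guarantees the worker finishes before that stack frame goes away.

#include <cstdio>
#include <thread>

int main()
{
    int counter = 0;                                   // lives on main's stack
    std::thread worker([&counter] { counter = 42; });  // another thread writes to it
    worker.join();                                     // worker is done before 'counter' is destroyed
    std::printf("%d\n", counter);                      // prints 42
    return 0;
}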