Windows SetThreadAffinityMask has no effect - c++

I have written a small test program in which I try to use the Windows API call SetThreadAffinityMask to lock the thread to a single NUMA node. I retrieve the CPU bitmask of a node with the GetNumaNodeProcessorMask API call, then pass that bitmask to SetThreadAffinityMask along with the thread handle returned by GetCurrentThread. Here is a greatly simplified version of my code:
// Inside a function called from a boost::thread
unsigned long long nodeMask = 0;
GetNumaNodeProcessorMask(1, &nodeMask);
HANDLE thread = GetCurrentThread();
SetThreadAffinityMask(thread, nodeMask);
DoWork(); // make-work function
I of course check whether the API calls return 0 in my code, and I've also printed out the NUMA node mask and it is exactly what I would expect. I've also followed advice given elsewhere and printed out the mask returned by a second identical call to SetThreadAffinityMask, and it matches the node mask.
However, from watching the resource monitor when the DoWork function executes, the work is split among all cores instead of only those it is ostensibly bound to. Are there any trip-ups I may have missed when using SetThreadAffinityMask? I am running Windows 7 Professional 64-bit, and the DoWork function contains a loop parallelized with OpenMP which performs operations on the elements of three very large arrays (which combined are still able to fit in the node).
Edit: To expand on the answer given by David Schwartz: on Windows, any threads spawned with OpenMP do NOT inherit the affinity of the thread which spawned them. The problem lies there, not with SetThreadAffinityMask.

Did you confirm that the particular thread whose affinity mask you set was actually running on a core in another NUMA node? Otherwise, it's working as intended: you are setting the processor mask on one thread and then observing the behavior of a group of threads.
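To make the resolution in the edit concrete, here is a minimal sketch of pinning each OpenMP worker explicitly. The array arguments and the add loop are placeholders standing in for DoWork; node 1 is taken from the question:

#include <windows.h>

void DoWorkPinned(double* a, const double* b, const double* c, int n)
{
    ULONGLONG nodeMask = 0;
    GetNumaNodeProcessorMask(1, &nodeMask); // mask for NUMA node 1

    #pragma omp parallel
    {
        // Affinity is per thread and is not inherited by the OpenMP
        // worker pool, so each worker must pin itself.
        SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)nodeMask);

        #pragma omp for
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i]; // stand-in for the real work
    }
}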

Related

c++: can I give a new value to a thread, while it is still running or do I have to end it first?

I have just started using threads in my code, and I'm not sure whether I've properly understood how they work.
If I've got it right, with a thread you can make two functions run at the same time. Is it possible to change the value given to one of the functions while it is still running in parallel?
In my case I read instructions from a csv file such as:
colAction=VELOCITY; colTime=0; colParam1=-30; colParam2=2;
colAction=VELOCITY; colTime=10; colParam1=-15; colParam2=0.4;
colAction=VELOCITY; colTime=0; colParam1=-10; colParam2=1;
colAction=VELOCITY; colTime=45; colParam1=-60; colParam2=11;
colAction=TEMPERATURE; colTime=120; colParam1=95;
colAction=TEMPERATURE; colTime=20; colParam1=57;
colAction=TEMPERATURE; colTime=25; colParam1=95;
colAction=LOOP; colParam1=22; colParam2=7; colParam3=23;
colAction=TEMPERATURE; colTime=20; colParam1=95;
colAction=VELOCITY; colTime=0; colParam1=-10; colParam2=11;
colAction=VELOCITY; colTime=1; colParam1=-1; colParam2=5;
colAction=VELOCITY; colTime=5; colParam1=-20; colParam2=11;
I have a function that sets a temperature and a function that sets a velocity. The parameter colTime tells me how long I have to hold the velocity or the temperature before following the next instruction. When colTime has expired, I move on to the next instruction: if a temperature is followed by another temperature, I just give the function the next value, but if a temperature is followed by a velocity, I need to keep the temperature function running while starting the velocity function.
The problem arises when, after setting a temperature and then a velocity, another temperature follows. Now I need to keep the velocity running while setting another temperature, and I don't know how to do this.
I hope I have made my problem clear and that it is not too confusing.
Typically a process can be seen as a stack of function/procedure/method calls. At any point in time your program will be at a single point in your code.
When we add multiple threads to the program, instead of a single stack we now have multiple stacks of function/procedure/method calls, each of which will, at any point in time, be at a different point in your code.
In C/C++, your main thread (the one that started it all) will have the int main(int argc, char** argv) function at the bottom of its stack. That applies both to a single-threaded program and to the main thread of a multithreaded program.
What about the rest of the threads? For each of them you specify a starting function. A thread starts execution at the beginning of its start function and runs until the end; the thread is alive while its starting function is executing.
Now, thinking of your code: one of many possibilities is to spawn a thread to execute your temperature or velocity function. That would be the starting function for the newly spawned thread. You can call join() from the spawning thread to wait for it to complete.
The thing about threads versus other multiprocessing ways of organizing code (e.g. heavyweight processes) is that although, as we've seen, each thread has its own call stack, they all share the same memory.
So, while it is not possible to modify the arguments of the thread's starting function (that train has already passed), other parts of the code can simply change some value in memory (which is shared), and the thread's starting function can periodically check that memory to detect the change and modify its behaviour.
But that brings a different problem: reading and writing shared memory from multiple threads can lead to unpredictable results, so any such access must be protected with some sort of synchronization (mutexes, atomics...).
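As a minimal sketch of that pattern (the setpoint variable, the timings, and the loop body are hypothetical; std::atomic provides the synchronization):

#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

std::atomic<double> g_setpoint(95.0); // shared value, safe to read/write concurrently

void temperature_loop(std::atomic<bool>& stop)
{
    while (!stop.load()) {
        double target = g_setpoint.load(); // pick up the latest value
        // ... drive the temperature toward 'target' ...
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}

int main()
{
    std::atomic<bool> stop(false);
    std::thread t(temperature_loop, std::ref(stop));

    // The main thread "gives a new value" while the other thread is running:
    g_setpoint.store(57.0);
    std::this_thread::sleep_for(std::chrono::seconds(1));

    stop.store(true);
    t.join();
}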

Making a gather/barrier function with System V Semaphores

I'm trying to implement a gather function that waits for N processes to continue.
struct sembuf operations[2];
operations[0].sem_num = 0;
operations[0].sem_op = -1; // wait() or p()
operations[0].sem_flg = 0;
operations[1].sem_num = 0;
operations[1].sem_op = 0; // wait until it becomes 0
operations[1].sem_flg = 0;
semop( this->id, operations, 2 );
Initially, the value of the semaphore is N.
The problem is that it freezes even when all processes have executed the semop function. I think it is related to the fact that the operations are executed atomically (but I don't know exactly what that means). But I don't understand why it doesn't work.
Does the code subtract 1 from the semaphore and then block the process if it's not the last or is the code supposed to act in a different way?
It's hard to see what the code does without the whole function and algorithm.
By the looks of it, you apply two actions in a single atomic call: subtract 1 from the semaphore and wait for it to become 0. Since semop applies the whole operation array atomically, the decrement does not take effect unless the wait-for-zero could succeed at the same moment; with the semaphore at N, every process tentatively computes N-1, sees that it is not 0, and blocks without ever changing the value, so they all freeze.
There could be several other issues if all processes freeze: the semaphore is not shared between all processes; you got the number of processes wrong when initializing the semaphore; or one process leaves the barrier, later increments the semaphore, and returns to the barrier.
I suggest debugging to check that all processes actually reach the barrier, and perhaps printing each time you perform any action on the semaphore (preferably to the same console).
As for what an atomic action is: it is a single operation, or sequence of operations, that is guaranteed not to be interrupted while being executed. This means no other process/thread will interfere with the action.
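A common fix (a sketch, assuming semaphore 0 of the set was initialized to N, the number of processes) is to split the two operations into separate semop calls, so the decrement becomes visible to the other processes before the wait-for-zero starts:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

// Sketch: semaphore 0 of set 'id' must have been initialized to N.
void barrier_wait(int id)
{
    struct sembuf dec  = { 0, -1, 0 }; // "I have arrived": subtract 1
    struct sembuf wait = { 0,  0, 0 }; // block until the value reaches 0

    semop(id, &dec, 1);  // first call: the decrement takes effect immediately
    semop(id, &wait, 1); // second call: wait for the remaining processes
}

Note that this barrier is single-use; reusing it would require re-initializing the semaphore to N.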

Creating parallel threads using Win 32 API

Here is the problem:
I have two sparse matrices described as vector of triplets.
The task is to write a multiplication function for them using parallel processing with the Win32 API. So I need to know how to:
1) Create a thread with the Win32 API
2) Pass input parameters to it
3) Get the return value.
Thanks in advance!
Edit: "Process" changed for "Thread"
Well, the answer to your question as originally asked is CreateProcess and GetExitCodeProcess.
But the solution to your problem isn't another process at all; it's more threads. And OpenMP is probably a much more suitable mechanism than creating your own threads.
If you have to use the Win32 API directly for threads, the process is something like this:
Build a work item descriptor by allocating some memory, storing pointers to the real data, indexes for what this thread is going to work on, etc. Use a structure to keep this organized.
Call CreateThread and pass the address of the work item descriptor.
In your thread procedure, cast the pointer back to a structure pointer, access your work item descriptor, and process the data.
In your main thread, call WaitForMultipleObjects to join with the worker threads.
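A minimal sketch of those four steps (the WorkItem fields and the row ranges are placeholders for however you split the triplet vectors):

#include <windows.h>

struct WorkItem {              // step 1: descriptor for one thread's share
    const void* data;          // pointer to the real input (e.g. the triplets)
    int firstRow, lastRow;     // which rows this thread works on
    int result;                // filled in by the thread
};

DWORD WINAPI Worker(LPVOID param)
{
    WorkItem* item = static_cast<WorkItem*>(param); // step 3: cast back
    item->result = item->lastRow - item->firstRow;  // placeholder processing
    return 0;
}

int main()
{
    const int kThreads = 4;
    WorkItem items[kThreads] = {};
    HANDLE threads[kThreads];

    for (int i = 0; i < kThreads; ++i) {
        items[i].firstRow = i * 10;
        items[i].lastRow  = (i + 1) * 10;
        // step 2: pass the descriptor's address to the thread
        threads[i] = CreateThread(NULL, 0, Worker, &items[i], 0, NULL);
    }

    // step 4: join with the worker threads
    WaitForMultipleObjects(kThreads, threads, TRUE, INFINITE);
    for (int i = 0; i < kThreads; ++i)
        CloseHandle(threads[i]);
    return 0;
}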
For even greater efficiency, you can use the Windows thread pool and call QueueUserWorkItem. While you won't have to create the threads yourself, you'd then need event handles to join the tasks back to the main thread; I suspect it's about the same amount of code.
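A sketch of that variant (the doubling "work" and the fixed task count are placeholders):

#include <windows.h>

struct Task {
    HANDLE done;       // event signaled when this task finishes
    int input, result;
};

DWORD WINAPI TaskProc(LPVOID param)
{
    Task* t = static_cast<Task*>(param);
    t->result = t->input * 2;   // placeholder work
    SetEvent(t->done);          // join point for the main thread
    return 0;
}

int main()
{
    Task tasks[4];
    HANDLE events[4];
    for (int i = 0; i < 4; ++i) {
        tasks[i].input = i;
        tasks[i].done = events[i] = CreateEvent(NULL, TRUE, FALSE, NULL);
        QueueUserWorkItem(TaskProc, &tasks[i], WT_EXECUTEDEFAULT);
    }
    WaitForMultipleObjects(4, events, TRUE, INFINITE);
    for (int i = 0; i < 4; ++i)
        CloseHandle(events[i]);
    return 0;
}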

thread local and context switch

I've got some C++ code making use of thread-local storage; each thread has a vector it can push data into.
I use TLS to store an index ID per thread, which can be used to look up which vector to push data into. The code then executes a fair amount of logic which pushes data into the vector.
What I'm wondering is whether the OS might reschedule my code to execute on a different thread after it has acquired the pointer to the thread-local object. (So far the code executes fine and I have not seen this happen.) If that were possible, it would seem certain to break my program, since two threads could then end up with the same object.
Assuming that were true, it would be a problem for any code that uses TLS with any complexity. Is TLS only intended for simple objects where you don't take the address?
Thanks!
Thread-local storage is just that: storage per thread. Each thread has its own private data structure. That thread, whichever processor it runs on, is the same thread. The OS doesn't schedule work WITHIN threads; it schedules which of the threads runs.
Thread-local storage is accomplished by having some sort of indirection which changes along with the thread itself. There are several ways to do this; for example, the OS may have a particular page at a particular offset from the start of virtual memory in the process, and when a thread is scheduled, the page table is updated to match the thread.
On x86 processors, FS or GS is typically used for per-thread data, so the OS will switch the FS register [or the base address behind the register, in the case of 64-bit processors]. When reading the TLS, the compiler will use the FS or GS segment register to prefix the memory read/write operations, and thus you always get your private data, not some other thread's.
Of course, OSes may have bugs, but this is something quite a few things rely on, so if it were broken it would show up pretty soon (unless it's very subtle and you have to be standing in just the right place, with the moon in the right phase, wearing the right colour clothes, the wind in the right direction, the date divisible by both 3 and 7, etc., etc.).
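As an illustration of why the pattern in the question is safe, here is a minimal sketch using C++11 thread_local (the original code may use a compiler-specific mechanism such as __declspec(thread), but the idea is the same; the names are illustrative):

#include <thread>
#include <vector>

std::vector<std::vector<int>> g_perThreadData(2); // one slot per thread

thread_local int t_index = -1; // each thread sees its own copy

void worker(int index)
{
    t_index = index; // stays valid no matter which core runs this thread
    std::vector<int>& myVec = g_perThreadData[t_index];
    for (int i = 0; i < 1000; ++i)
        myVec.push_back(i); // safe: no other thread uses this slot
}

int main()
{
    std::thread a(worker, 0), b(worker, 1);
    a.join();
    b.join();
}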
TLS means thread-local. From your description, it sounds as if each thread accesses a shared vector of vectors through TLS (I'm not sure); if so, you should use some kind of lock. Any sample code?

Threading Building Blocks combined with pthreads

I have a queue of elements which need to be processed. I want to process these elements in parallel. There will be some sections of each element's processing which need to be synchronized. At any point in time there can be at most num_threads running threads.
I'll provide a template to give you an idea of what I want to achieve.
queue q

process_element(e)
{
    lock()
    some synchronized area
    // a matrix access is performed here, so a spin lock would do
    unlock()
    ...
    unsynchronized area
    ...
    if( condition )
    {
        new_element = generate_new_element()
        q.push(new_element) // synchronized access to queue
    }
}

process_queue()
{
    while( elements in q ) // algorithm-is-finished condition
    {
        e = get_elem_from_queue(q) // synchronized access to queue
        process_element(e)
    }
}
I can use
pthreads
OpenMP
Intel Threading Building Blocks
Top problems I have:
Making sure that at any point in time there are at most num_threads running threads
Finding lightweight synchronization methods to use on the queue
My plan is to use the Intel TBB concurrent_queue as the queue container. But then, will I be able to use pthreads primitives (mutexes, condition variables)? Let's assume this works (it should). Then, how can I use pthreads to have at most num_threads running at one point in time? I was thinking of creating the threads once, and then, after one element is processed, having each thread access the queue and get the next element. However, it is more complicated than that, because an empty queue is no guarantee that the algorithm is finished.
My question
Before I start implementing, I'd like to know whether there is an easy way to use Intel TBB or pthreads to obtain the behaviour I want: more precisely, processing elements from a queue in parallel.
Note: I have tried to use tasks, but with no success.
First off, pthreads gives you portability, which is hard to walk away from. The following appear to be true from your question; let us know if they aren't, because the answer will then change:
1) You have one or more multi-core processors on which you're running the code
2) You want to have no more than num_threads threads because of (1)
Assuming the above to be true, the following approach might work well for you:
Create num_threads pthreads using pthread_create
Optionally, bind each thread to a different core
q.push(new_element) atomically adds new_element to the queue. pthread_mutex_lock and pthread_mutex_unlock can help you here. Examples here: http://pages.cs.wisc.edu/~travitch/pthreads_primer.html
Use pthread mutexes for dequeueing elements as well
Termination is tricky. One way to do it is to add a TERMINATE element to the queue which, upon being dequeued, causes the dequeuer to queue up another TERMINATE element (for the next dequeuer) and then terminate. You will end up with one extra TERMINATE element in the queue, which you can remove by having a designated thread dequeue it after all the threads are done.
Depending on how often you add/remove elements from the queue, you may want to use something lighter weight than pthread_mutex_... to enqueue/dequeue elements. This is where you might want to use a more machine-specific construct.
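A sketch of that scheme (the Element payload is a placeholder, and a condition variable is added so idle workers sleep instead of spinning):

#include <pthread.h>
#include <queue>

struct Element { bool terminate; int payload; };

std::queue<Element> g_queue;
pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  g_cv   = PTHREAD_COND_INITIALIZER;

void enqueue(const Element& e)
{
    pthread_mutex_lock(&g_lock);
    g_queue.push(e); // atomic with respect to the other threads
    pthread_cond_signal(&g_cv);
    pthread_mutex_unlock(&g_lock);
}

void* worker(void*)
{
    for (;;) {
        pthread_mutex_lock(&g_lock);
        while (g_queue.empty())
            pthread_cond_wait(&g_cv, &g_lock);
        Element e = g_queue.front();
        g_queue.pop();
        pthread_mutex_unlock(&g_lock);

        if (e.terminate) {           // TERMINATE sentinel: pass it on, then quit
            Element t = { true, 0 };
            enqueue(t);
            return NULL;
        }
        // ... process_element(e), possibly calling enqueue() with new work ...
    }
}

int main()
{
    const int num_threads = 4;
    pthread_t tids[num_threads];
    for (int i = 0; i < num_threads; ++i)
        pthread_create(&tids[i], NULL, worker, NULL);

    // ... enqueue the real work here ...

    Element t = { true, 0 };
    enqueue(t); // the sentinel propagates itself to every worker
    for (int i = 0; i < num_threads; ++i)
        pthread_join(tids[i], NULL);
    return 0;
}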
TBB is compatible with other threading packages.
TBB also emphasizes scalability: when you port your program from a dual-core to a quad-core machine, you do not have to adjust it. With data-parallel programming, program performance increases (scales) as you add processors.
Cilk Plus is another runtime that provides good results.
www.cilkplus.org
Since pthreads is a low-level threading library, you have to decide how much control you need in your application; it offers flexibility, but at a high cost in terms of programmer effort, debugging time, and maintenance.
My recommendation is to look at tbb::parallel_do. It was designed to process elements from a container in parallel, even if the container itself is not concurrent; i.e. parallel_do works with a plain standard container (anything that provides iterators, e.g. std::deque) without any user synchronization (of course you would still need to protect your matrix access inside process_element()). Moreover, with parallel_do you can add more work on the fly, which looks like what you need, as process_element() creates and adds new elements to the work queue. The only caution is that newly added work will be processed immediately, unlike putting it in a queue, which would postpone processing until after all "older" items have been processed. Also, you don't have to worry about termination: parallel_do completes automatically as soon as all initial queue items and all items created on the fly have been processed.
However, if, besides the computation itself, the work queue can be concurrently fed from another source (e.g. from an I/O processing thread), then parallel_do is not suitable. In this case, it might make sense to look at parallel_pipeline or, better, the TBB flow graph.
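A minimal sketch of the parallel_do approach (the Element type, the splitting condition, and the matrix lock are placeholders):

#include <deque>
#include <tbb/parallel_do.h>
#include <tbb/spin_mutex.h>

struct Element { int value; };

tbb::spin_mutex g_matrixLock; // protects the synchronized matrix access

struct Body {
    void operator()(Element e, tbb::parallel_do_feeder<Element>& feeder) const
    {
        {
            tbb::spin_mutex::scoped_lock lock(g_matrixLock);
            // ... synchronized matrix access ...
        }
        // ... unsynchronized area ...
        if (e.value > 1) {
            Element child = { e.value / 2 };
            feeder.add(child); // add new work on the fly, no explicit queue needed
        }
    }
};

int main()
{
    std::deque<Element> work; // initial items; the container need not be concurrent
    for (int i = 0; i < 100; ++i) {
        Element e = { i };
        work.push_back(e);
    }
    // Returns only when the initial items and everything fed on the fly are done.
    tbb::parallel_do(work.begin(), work.end(), Body());
    return 0;
}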
Lastly, an application can control the number of active threads with TBB, though it's not a recommended approach.