Why my std::atomic<int> variable isn't thread-safe? - c++

I don't know why my code isn't thread-safe, as it outputs some inconsistent results.
value 48
value 49
value 50
value 54
value 51
value 52
value 53
My understanding of an atomic object is it prevents its intermediate state from exposing, so it should solve the problem when one thread is reading it and the other thread is writing it.
I used to think I could use std::atomic without a mutex to solve the multi-threading counter increment problem, and it didn't look like the case.
I probably misunderstood what an atomic object is, Can someone explain?
void
inc(std::atomic<int>& a)
{
while (true) {
a = a + 1;
printf("value %d\n", a.load());
std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}
}
int
main()
{
std::atomic<int> a(0);
std::thread t1(inc, std::ref(a));
std::thread t2(inc, std::ref(a));
std::thread t3(inc, std::ref(a));
std::thread t4(inc, std::ref(a));
std::thread t5(inc, std::ref(a));
std::thread t6(inc, std::ref(a));
t1.join();
t2.join();
t3.join();
t4.join();
t5.join();
t6.join();
return 0;
}

I used to think I could use std::atomic without a mutex to solve the multi-threading counter increment problem, and it didn't look like the case.
You can, just not the way you have coded it. You have to think about where the atomic accesses occur. Consider this line of code …
a = a + 1;
1. First the value of a is fetched atomically. Let's say the value fetched is 50.
2. We add one to that value getting 51.
3. Finally we atomically store that value into a using the = operator
4. a ends up being 51
5. We atomically load the value of a by calling a.load()
6. We print the value we just loaded by calling printf()
So far so good. But between steps 1 and 3 some other threads may have changed the value of a - for example to the value 54. So, when step 3 stores 51 into a it overwrites the value 54 giving you the output you see.
As @Sopel and @Shawn suggest in the comments, you can atomically increment the value in a using one of the appropriate functions (like fetch_add) or operator overloads (like operator ++ or operator +=). See the std::atomic documentation for details.
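A minimal sketch of that fix, assuming a hypothetical helper count_with_fetch_add (not part of the question) that drops the sleep and the printf so it finishes quickly. It only shows that no increments are lost when the load, add, and store happen as one atomic read-modify-write:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: each of `threads` threads increments a shared counter
// `iters` times; the final counter value is returned.
int count_with_fetch_add(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::atomic<int> a(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&a, iters] {
            for (int i = 0; i < iters; ++i) {
                // One indivisible read-modify-write: no other thread can
                // slip in between the load and the store.
                a.fetch_add(1);  // same effect as ++a or a += 1
            }
        });
    }
    for (auto& th : pool) th.join();
    return a.load();
}
```

Calling count_with_fetch_add(6, 10000) should return 60000 on every run, whereas the a = a + 1 version gives no such guarantee.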
Update
I added steps 5 and 6 above. Those steps can also lead to results that may not look correct.
Between the store at step 3 and the call to a.load() at step 5, other threads can modify the contents of a. After our thread stores 51 in a at step 3, it may find that a.load() returns some different number at step 5. Thus the thread that set a to the value 51 may not pass the value 51 to printf().
Another source of problems is that nothing coordinates the execution of steps 5. and 6. between two threads. So, for example, imagine two threads X and Y running on a single processor. One possible execution order might be this …
Thread X executes steps 1 through 5 above incrementing a from 50 to 51 and getting the value 51 back from a.load()
Thread Y executes steps 1 through 5 above incrementing a from 51 to 52 and getting the value 52 back from a.load()
Thread Y executes printf() sending 52 to the console
Thread X executes printf() sending 51 to the console
We've now printed 52 on the console, followed by 51.
Finally, there's another problem lurking at step 6. because printf() doesn't make any promises about what happens if two threads call printf() at the same time (at least I don't think it does).
On a multiprocessor system threads X and Y above might call printf() at exactly the same moment (or within a few ticks of exactly the same moment) on two different processors. We can't make any prediction about which printf() output will appear first on the console.
Note: The documentation for printf mentions a lock introduced in C++17 "… used to prevent data races when multiple threads read, write, position, or query the position of a stream." In the case of two threads simultaneously contending for that lock we still can't tell which one will win.

Besides the increment of a being done non-atomically, the fetch of the value to display after the increment is non-atomic with respect to the increment. It is possible that one of the other threads increments a after the current thread has incremented it but before the fetch of the value to display. This would possibly result in the same value being shown twice, with the previous value skipped.
Another issue here is that the threads do not necessarily run in the order they were created. One thread could print its output before three of the others print theirs, but after all four have incremented a (which is how 54 can appear before 51, 52, and 53 in the question's output). Since the thread that did the last increment displays its output earlier, you end up with the output not being sequential. This is more likely to happen on a system with fewer than six hardware threads available to run on.
Adding a small sleep between the thread creations (e.g., sleep_for(std::chrono::milliseconds(10))) would make this less likely to occur, but would still not eliminate the possibility. The only sure way to keep the output ordered is to use some sort of exclusion (like a mutex) to ensure only one thread has access to the increment and output code, and treat both the increment and output code as a single transaction that must run together before another thread tries to do an increment.
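A rough sketch of the single-transaction idea, under the assumption that a vector stands in for the console so the order can be inspected afterwards. The helper name ordered_increments is hypothetical:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: returns the values in the order they were "printed".
std::vector<int> ordered_increments(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    int a = 0;                 // plain int is enough: the mutex protects it
    std::vector<int> printed;  // stands in for the console
    std::mutex m;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                // Increment and output form one critical section, so no
                // other thread can run between them.
                std::lock_guard<std::mutex> lock(m);
                ++a;
                printed.push_back(a);  // printf("value %d\n", a) would go here
            }
        });
    }
    for (auto& th : pool) th.join();
    return printed;
}
```

With the lock held across both steps, the recorded values come out as 1, 2, 3, … with no gaps or reordering. The cost is that the threads are fully serialized while they hold the lock.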

The other answers point out the non-atomic increment and various problems. I mostly want to point out some interesting practical details about exactly what we see when running this code on a real system. (x86-64 Arch Linux, gcc9.1 -O3, i7-6700k 4c8t Skylake).
It can be useful to understand why certain bugs or design choices lead to certain behaviours, for troubleshooting / debugging.
Use int tmp = ++a; to capture the fetch_add result in a local variable instead of reloading it from the shared variable. (And as 1202ProgramAlarm says, you might want to treat the whole increment and print as an atomic transaction if you insist on having your counts printed in order as well as being done properly.)
Or you might want to have each thread record the values it saw in a private data structure to be printed later, instead of also serializing threads with printf during the increments. (In practice all trying to increment the same atomic variable will serialize them waiting for access to the cache line; ++a will go in order so you can tell from the modification order which thread went in which order.)
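A small sketch of the private-log idea, assuming a hypothetical helper collect_counts: each thread keeps the result of ++a in its own vector, and the logs are merged only after all threads have joined.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: returns every value produced by ++a, sorted.
std::vector<int> collect_counts(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::atomic<int> a(0);
    std::vector<std::vector<int>> seen(threads);  // one private log per thread
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&a, &seen, t, iters] {
            seen[t].reserve(iters);
            for (int i = 0; i < iters; ++i) {
                int tmp = ++a;           // the value this thread's increment produced
                seen[t].push_back(tmp);  // no lock needed: only this thread writes here
            }
        });
    }
    for (auto& th : pool) th.join();

    std::vector<int> all;
    for (const auto& log : seen) all.insert(all.end(), log.begin(), log.end());
    std::sort(all.begin(), all.end());
    return all;
}
```

Because ++a is an atomic RMW, the merged and sorted log should be exactly 1..threads*iters with no duplicates; the unsorted per-thread logs show which thread obtained which count.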
Fun fact: a.store(1 + a.load(std::memory_order_relaxed), std::memory_order_release) is what you might do for a variable that was only written by 1 thread, but read by multiple threads. You don't need an atomic RMW because no other thread ever modifies it. You just need a thread-safe way to publish updates. (Or better, in a loop keep a local counter and just .store() it without loading from the shared variable.)
If you used the default a = ... for a sequentially-consistent store, you might as well have done an atomic RMW on x86. One good way to compile that is with an atomic xchg, or mov+mfence is as expensive (or more).
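The single-writer pattern can be sketched roughly as follows. The helper single_writer_publish is hypothetical, and the reader simply polls until it has seen the final value:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: one writer publishes 1..updates with plain stores
// while one reader polls. Returns true if the reader never saw the counter
// go backwards and the final value is `updates`.
bool single_writer_publish(int updates) {
    assert(updates > 0);
    std::atomic<int> a(0);
    bool monotonic = true;  // written only by the reader, read after join

    std::thread reader([&] {
        int last = 0;
        while (last < updates) {
            int now = a.load(std::memory_order_acquire);
            if (now < last) monotonic = false;
            last = now;
        }
    });
    std::thread writer([&] {
        int local = 0;  // the writer's private copy of the counter
        for (int i = 0; i < updates; ++i) {
            ++local;
            // Plain store, no RMW: safe only because nobody else writes a.
            a.store(local, std::memory_order_release);
        }
    });

    writer.join();
    reader.join();
    return monotonic && a.load() == updates;
}
```

This is only valid with exactly one writer. With two writers, the same store pattern loses updates just like a = a + 1.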
What's interesting is that despite the massive problems with your code, no counts were lost or stepped on (no duplicate counts), merely printing reordered. So in practice the danger wasn't encountered because of other effects going on.
I tried it on my own machine and did lose some counts. But after removing the sleep, I just got reordering. (I copy-pasted about 1000 lines of the output into a file, and sort -u to uniquify the output didn't change the line count. It did move some late prints around though; presumably one thread got stalled for a while.) My testing didn't check for the possibility of lost counts, skipped by not saving the value being stored into a, and instead reloading it. I'm not sure there's a plausible way for that to happen here without multiple threads reading the same count, which would be detected.
Store + reload, even a seq-cst store which has to flush the store buffer before it can reload, is very fast compared to printf making a write() system call. (The format string includes a newline and I didn't redirect output to a file so stdout is line-buffered and can't just append the string to a buffer.)
(write() system calls on the same file descriptor are serializing in POSIX: write(2) is atomic. Also, printf(3) itself is thread-safe on GNU/Linux, as required by C++17, and probably by POSIX long before that.)
Stdio locking in printf happens to be enough serialization in almost all cases: the thread that just unlocked stdout and left printf can do the atomic increment and then try to take the stdout lock again.
The other threads were all blocked trying to take the lock on stdout. One (other?) thread can wake up and take the lock on stdout, but for its increment to race with the other thread it would have to enter and leave printf and load a the first time before that other thread commits its a = ... seq-cst store.
This does not mean it's actually safe
Just that testing this specific version of the program (at least on x86) doesn't easily reveal the lack of safety. Interrupts or scheduling variations, including competition from other things running on the same machine, certainly could block a thread at just the wrong time.
My desktop has 8 logical cores so there were enough for every thread to get one, not having to get descheduled. (Although normally that would tend to happen on I/O or when waiting on a lock anyway).
With the sleep there, it is not unlikely for multiple threads to wake up at nearly the same time and race with each other in practice on real x86 hardware. It's so long that timer granularity becomes a factor, I think. Or something like that.
Redirecting output to a file
With stdout open on a non-TTY file, it's full-buffered instead of line-buffered, and doesn't always make a system call while holding the stdout lock.
(I got a 17MiB file in /tmp from hitting control-C a fraction of a second after running ./a.out > output.)
This makes it fast enough for threads to actually race with each other in practice, showing the expected bugs of duplicate values. (A thread reads a but loses ownership of the cache line before it stores (tmp)+1, resulting in two or more threads doing the same increment. And/or multiple threads reading the same value when they reload a after flushing their store buffer.)
1228589 unique lines (sort -u | wc) but total output of
1291035 total lines. So ~5% of the output lines were duplicates.
I didn't check if it was usually one value duplicated multiple times or if it was usually only one duplicate. Or how far backward the value ever jumped. If a thread happened to be stalled by an interrupt handler after loading but before storing val+1, it could be quite far. Or if it actually slept or blocked for some reason, it could rewind indefinitely far.
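For anyone who wants to reproduce the lost counts without printf in the way, here is a rough sketch. The helper racy_final_value is hypothetical; it repeats the question's separate load and store and reports the final value, so the shortfall from threads*iters is the number of lost increments. How many are lost varies from run to run and from machine to machine.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: performs the question's `a = a + 1` (written out as a
// separate load and store) from several threads, without sleep or printf.
int racy_final_value(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::atomic<int> a(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&a, iters] {
            for (int i = 0; i < iters; ++i) {
                int tmp = a.load();  // atomic load
                a.store(tmp + 1);    // separate atomic store; any update made
                                     // in between is overwritten
            }
        });
    }
    for (auto& th : pool) th.join();
    return a.load();
}
```

The result can never exceed threads*iters, and on a multi-core machine it is usually smaller. Each individual access is atomic, so this is a logic bug rather than undefined behaviour.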

Related

Synchronization with "versioning" in c++

Please consider the following synchronization problem:
initially:
version = 0 // atomic variable
data = 0 // normal variable (there could be many)
Thread A:
version++
data = 3
Thread B:
d = data
v = version
assert(d != 3 || v == 1)
Basically, if thread B sees data = 3 then it must also see version++.
What's the weakest memory order and synchronization we must impose so that the assertion in thread B is always satisfied?
If I understand C++ memory_order correctly, the release-acquire ordering won't do because that guarantees that operations BEFORE version++, in thread A, will be seen by the operations AFTER v = version, in thread B.
Acquire and release fences also work in the same directions, but are more general.
As I said, I need the other direction: B sees data = 3 implies B sees version = 1.
I'm using this "versioning approach" to avoid locks as much as possible in a data structure I'm designing. When I see something has changed, I step back, read the new version and try again.
I'm trying to keep my code as portable as possible, but I'm targeting x86-64 CPUs.
You might be looking for a SeqLock, as long as your data doesn't include pointers. (If it does, then you might need something more like RCU to protect readers that might load a pointer, stall / sleep for a while, then deref that pointer much later.)
You can use the SeqLock sequence counter as the version number. (version = tmp_counter >> 1 since you need two increments per write of the payload to let readers detect tearing when reading the non-atomic data. And to make sure they see the data that goes with this sequence number. Make sure you don't read the atomic counter a 3rd time; use the local tmp that you read it into to verify match before/after copying data.)
Readers will have to retry if they happen to attempt a read while data is being modified. But data is non-atomic, so "thread B sees data = 3" can never be part of what creates synchronization; it can only be something you observe as a result of synchronizing with a version number from the writer.
See:
Implementing 64 bit atomic counter with 32 bit atomics - my attempt at a SeqLock in C++, with lots of comments. It's a bit of a hack because ISO C++'s data-race UB rules are overly strict; a SeqLock relies on detecting possible tearing and not using torn data, rather than avoiding concurrent access entirely. That's fine on a machine without hardware race detection so that doesn't fault (like all real CPUs), but C++ still calls that UB, even with volatile (although that puts it more into implementation-defined territory). In practice it's fine.
GCC reordering up across load with `memory_order_seq_cst`. Is this allowed? - A GCC bug fixed in 8.1 that could break a seqlock implementation.
If you have multiple writers, you can use the sequence-counter itself as a spinlock for mutual exclusion between writers. e.g. using an atomic_fetch_or or CAS to attempt to set the low bit to make the counter odd. (tmp = seq.fetch_or(1, std::memory_order_acq_rel);, hopefully compiling to x86 lock bts). If it previously didn't have the low bit set, this writer won the race, but if it did then you have to try again.
But with only a single writer, you don't need to RMW the atomic sequence counter, just store new values (ordered with writes to the payload), so you can either keep a local copy of it, or just do a relaxed load of it, and store tmp+1 and tmp+2.
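A minimal single-writer sketch of the idea, not the implementation from the linked answer. To stay clear of the data-race UB mentioned above, this version assumes the payload is two ints stored as relaxed atomics. The names SeqLock and seqlock_torn_reads are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class SeqLock {
public:
    // Single writer only: store tmp+1 (odd), the payload, then tmp+2 (even).
    void write(int x, int y) {
        unsigned s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);         // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);  // payload stores stay after it
        x_.store(x, std::memory_order_relaxed);
        y_.store(y, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);         // even: write complete
    }

    // Copies a consistent payload into x and y and returns its version.
    unsigned read(int& x, int& y) const {
        for (;;) {
            unsigned s1 = seq_.load(std::memory_order_acquire);
            if (s1 & 1u) continue;  // writer active, retry
            x = x_.load(std::memory_order_relaxed);
            y = y_.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            unsigned s2 = seq_.load(std::memory_order_relaxed);
            if (s1 == s2) return s1 >> 1;  // compare with the local copy s1
        }
    }

private:
    std::atomic<unsigned> seq_{0};
    std::atomic<int> x_{0}, y_{0};
};

// Hypothetical check: the writer stores (i, 2*i) as version i; the reader
// counts any snapshot whose fields or version do not match.
int seqlock_torn_reads(int writes) {
    assert(writes > 0);
    SeqLock lock;
    int torn = 0;  // written only by the reader, read after join
    std::thread reader([&] {
        int x = 0, y = 0;
        unsigned version = 0;
        while (version < static_cast<unsigned>(writes)) {
            version = lock.read(x, y);
            if (y != 2 * x || static_cast<unsigned>(x) != version) ++torn;
        }
    });
    for (int i = 1; i <= writes; ++i) lock.write(i, 2 * i);
    reader.join();
    return torn;
}
```

The reader returns the version as the counter shifted right by one, so a payload and its version number always belong together. That is the "B sees data implies B sees version" direction the question asks for.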

output 10 with memory_order_seq_cst

When I run this program I get 10 as output, which seems impossible to me. I'm running this on an x86_64 Core i3 with Ubuntu.
If the output is 10, then 1 must have come from either c or d.
Also in thread t[0], we assign c as 1. Now a is 1 since it occurs before c=1. c is equal to b which was set to 1 by thread 1. So when we store d it should be 1 as a=1.
Can output 10 happen with memory_order_seq_cst? I tried inserting an atomic_thread_fence(seq_cst) in both threads between the 1st line (variable = 1) and the 2nd line (printf), but it still didn't work.
Uncommenting both fences doesn't help.
Tried running with g++ and clang++. Both give the same result.
#include<thread>
#include<unistd.h>
#include<cstdio>
#include<atomic>
using namespace std;
atomic<int> a,b,c,d;
void foo(){
a.store(1,memory_order_seq_cst);
// atomic_thread_fence(memory_order_seq_cst);
c.store(b,memory_order_seq_cst);
}
void bar(){
b.store(1,memory_order_seq_cst);
// atomic_thread_fence(memory_order_seq_cst);
d.store(a,memory_order_seq_cst);
}
int main(){
thread t[2];
t[0]=thread(foo); t[1]=thread(bar);
t[0].join();t[1].join();
printf("%d%d\n",c.load(memory_order_seq_cst),d.load(memory_order_seq_cst));
}
bash$ while [ true ]; do ./a.out | grep "10" ; done
10
10
10
10
10 (c=1, d=0) is easily explained: bar happened to run first, and finished before foo read b.
Quirks of inter-core communication to get threads started on different cores mean it's easily possible for this to happen even though thread(foo) ran first in the main thread. e.g. maybe an interrupt arrived at the core the OS chose for foo, delaying it from actually getting into that code (see footnote 1).
Remember that seq_cst only guarantees that some total order exists for all seq_cst operations which is compatible with the sequenced-before order within each thread. (And any other happens-before relationship established by other factors). So the following order of atomic operations is possible without even breaking out the a.load (see footnote 2) in bar separately from the d.store of the resulting int temporary.
b.store(1,memory_order_seq_cst); // bar1. b=1
d.store(a,memory_order_seq_cst); // bar2. a.load reads 0, d=0
a.store(1,memory_order_seq_cst); // foo1
c.store(b,memory_order_seq_cst); // foo2. b.load reads 1, c=1
// final: c=1, d=0
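To see which outputs a total order allows at all, one can enumerate every interleaving of the two threads by hand. The sketch below is a single-threaded model, not a test of real hardware. It assumes each thread is three steps (store, load, store) executed in program order, and the function name sc_outcomes is hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <set>
#include <string>

// Enumerates all interleavings of foo (a=1; tmp=b; c=tmp) and
// bar (b=1; tmp=a; d=tmp) and collects the resulting "cd" strings.
std::set<std::string> sc_outcomes() {
    std::set<std::string> results;
    // Bit i of mask says whether global step i belongs to foo (1) or bar (0).
    for (unsigned mask = 0; mask < 64; ++mask) {
        if (std::bitset<6>(mask).count() != 3) continue;  // 3 steps per thread
        int a = 0, b = 0, c = 0, d = 0;
        int foo_tmp = 0, bar_tmp = 0;
        int foo_pc = 0, bar_pc = 0;
        for (int step = 0; step < 6; ++step) {
            if ((mask >> step) & 1u) {
                switch (foo_pc++) {
                    case 0: a = 1; break;        // a.store(1)
                    case 1: foo_tmp = b; break;  // b.load()
                    case 2: c = foo_tmp; break;  // c.store(tmp)
                }
            } else {
                switch (bar_pc++) {
                    case 0: b = 1; break;        // b.store(1)
                    case 1: bar_tmp = a; break;  // a.load()
                    case 2: d = bar_tmp; break;  // d.store(tmp)
                }
            }
        }
        assert(foo_pc == 3 && bar_pc == 3);
        results.insert(std::to_string(c) + std::to_string(d));
    }
    return results;
}
```

The set contains 01, 10, and 11 but never 00: for both loads to read 0, each load would have to precede the other thread's store, which contradicts program order within the threads.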
atomic_thread_fence(seq_cst) has no impact anywhere because all your operations are already seq_cst. A fence basically just stops reordering of this thread's operations; it doesn't wait for or sync with fences in other threads.
(Only a load that sees a value stored by another thread can create synchronization. But such a load doesn't wait for the other store; it has no way of knowing there is another store. If you want to keep loading until you see the value you expect, you have to write a spin-wait loop.)
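A minimal sketch of such a spin-wait loop, assuming a hypothetical flag named ready and a plain int payload that it publishes:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: the consumer keeps loading `ready` until it sees the
// producer's store, and only then reads the payload.
int wait_for_published_value() {
    std::atomic<bool> ready(false);
    int payload = 0;  // plain variable, published via `ready`
    int seen = -1;

    std::thread consumer([&] {
        // A single load would not wait; the loop is what does the waiting.
        while (!ready.load(std::memory_order_acquire))
            std::this_thread::yield();
        seen = payload;  // safe: the acquire load saw the release store
    });
    std::thread producer([&] {
        payload = 42;
        ready.store(true, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

The acquire load that finally reads true synchronizes with the release store, so the consumer is guaranteed to read 42 regardless of which thread started first.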
Footnote 1:
Since all your atomic vars are probably in the same cache line, even if execution did reach the top of foo and bar at the same time on two different cores, false-sharing is likely going to let both operations from one thread happen while the other core is still waiting to get exclusive ownership. Although seq_cst stores are slow enough (on x86 at least) that hardware fairness stuff might relinquish exclusive ownership after committing the first store of 1. Anyway, lots of ways for both operations in one thread to happen before the other thread and get 10 or 01. Even possible to get 11 if we get b=1 then a=1 before either load. Using seq_cst does stop the hardware from doing the load early (before the store is globally visible), so it's very possible.
Footnote 2: The lvalue-to-rvalue evaluation of bare a uses the overloaded (int) conversion which is equivalent to a.load(seq_cst). The operations from foo could happen between that load and the d.store that gets a temporary value from it. d.store(a) is not an atomic copy; it's equivalent to int tmp = a; d.store(tmp);. That isn't necessary to explain your observations.
The printf statements are unsynchronized so output of 10 can be just a reordered 01.
01 happens when the functions before the printf run serially.

EnterSynchronizationBarrier hangs in windows 8

I tried to use new API for synchronization barriers from Windows 8, but the following simple code sometimes hangs in Windows 8:
#undef WINVER
#define WINVER 0x0603
#include "windows.h"
#include <thread>
#include <vector>
int main()
{
SYNCHRONIZATION_BARRIER barrier;
int count = 32;
InitializeSynchronizationBarrier (&barrier, count, -1);
std::vector<std::thread> threads;
for (int thr_num = 0; thr_num < count; thr_num++)
{
threads.emplace_back ([&barrier]
{
for (int i = 0; i < 100000; i++)
EnterSynchronizationBarrier (&barrier, 0);
});
}
for (auto &thr : threads)
thr.join ();
return 0;
}
Tested on Windows 8.1 64-bit on 32-core dual-Xeon E5 2630. It hangs roughly one time out of ten launches.
It seems that in Windows 10 it works normally (on another machine). Is this a bug in Windows 8 that got fixed, or is this not a correct usage of EnterSynchronizationBarrier (maybe you can't call it in a loop?). There isn't much information about this function; has anybody even used it?
Not that it matters years later, except perhaps to show that some problems are too obscure for Stack Overflow to deliver close attention in useful time, but your usage is correct, if extreme, and the stress you have put the called function to does look to have exposed its problems with memory barriers.
In your fragment, a synchronisation barrier is prepared for 32 threads and you create 32 threads which each proceed to 100000 phases of synchronised work. All 32 reach their call number N to EnterSynchronizationBarrier before all are released on their way to their call number N+1. It should work. It likely would if your phases had any substance.
The stress is that each phase between calls is just however few instructions are involved in looping back to repeat the call. While the last thread to end phase N is in its call, it signals the others to leave, and they have a good chance of leaving (and even of reentering the function to end their phase N+1) while the thread that ends phase N is still doing its internal bookkeeping.
In this bookkeeping are two counters. One, named Barrier according to Microsoft's symbol files, is decremented as threads enter the synchronisation barrier. The other, named LeftBarrier, is incremented as they leave it. The thread that ends a phase resets Barrier from LeftBarrier (which should be the count of all participating threads) and resets LeftBarrier to 1. Or so it goes as a simplification.
The complicated reality is that the Barrier count is overloaded: its high bit signifies the change of phase. If a thread that waits at the synchronisation barrier is spinning rather than blocking on an event, then what it checks for while spinning is whether the high bit in Barrier has changed. It therefore really matters exactly how the counters get reset in the ending thread's bookkeeping. The sequence is: read LeftBarrier; write LeftBarrier as 1; write Barrier as the old LeftBarrier with the high bit toggled.
What I think happens is that without a memory barrier, the Barrier count can be written before LeftBarrier, but because Barrier has a toggled high bit, a spinning thread comes out of its spin and increments LeftBarrier from another processor before the first resets it to 1. The increment gets lost, after which all bets are off because subsequent phases will find that LeftBarrier at the end of a phase is no longer the count of participating threads.
Windows 8 and 8.1 have no memory barrier here. Windows 10 does, though I believe it's in the wrong place and that Windows Vista and Windows 7 had it correctly between the two writes. The implementation was anyway reworked completely for Version 1607 so that it now uses the WaitOnAddress functionality, much as sketched by a later Raymond Chen blog than the one cited by one of your correspondents. At the time of the cited blog, Microsoft, though possibly not Raymond, surely knew of the function's two earlier code changes regarding memory barriers.
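To illustrate why the order of the two writes matters, here is a small sense-reversing spin barrier in portable C++. It is an illustration of the general technique only, not a reconstruction of Microsoft's code. The names SpinBarrier and barrier_violations are hypothetical, and a single boolean plays the role of the toggled high bit.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class SpinBarrier {
public:
    explicit SpinBarrier(int n) : total_(n), remaining_(n) { assert(n > 0); }

    void arrive_and_wait() {
        // The phase flag cannot flip again until this thread has arrived,
        // so reading it on entry gives the current phase.
        const bool target = !phase_.load(std::memory_order_acquire);
        if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last thread in: reset the counter FIRST, then flip the flag.
            remaining_.store(total_, std::memory_order_relaxed);
            // The release store keeps the reset ordered before the flip. If
            // waiters could see the flip first, their next fetch_sub could
            // be overwritten by the reset and lost.
            phase_.store(target, std::memory_order_release);
        } else {
            while (phase_.load(std::memory_order_acquire) != target)
                std::this_thread::yield();
        }
    }

private:
    const int total_;
    std::atomic<int> remaining_;
    std::atomic<bool> phase_{false};
};

// Hypothetical stress check: counts how often a thread left phase p before
// all threads had arrived at it.
int barrier_violations(int threads, int phases) {
    SpinBarrier barrier(threads);
    std::atomic<int> arrivals(0);
    std::atomic<int> violations(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int p = 1; p <= phases; ++p) {
                arrivals.fetch_add(1);
                barrier.arrive_and_wait();
                if (arrivals.load() < p * threads) violations.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
    return violations.load();
}
```

Like the question's loop, the stress check does no work between phases, which is the case where a misordered reset would show up.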

do integer reads need to be critical section protected?

I have come across some C++03 code that takes this form:
struct Foo {
int a;
int b;
CRITICAL_SECTION cs;
};
// DoFoo::Foo foo_;
void DoFoo::Foolish()
{
if( foo_.a == 4 )
{
PerformSomeTask();
EnterCriticalSection(&foo_.cs);
foo_.b = 7;
LeaveCriticalSection(&foo_.cs);
}
}
Does the read from foo_.a need to be protected? e.g.:
void DoFoo::Foolish()
{
EnterCriticalSection(&foo_.cs);
int a = foo_.a;
LeaveCriticalSection(&foo_.cs);
if( a == 4 )
{
PerformSomeTask();
EnterCriticalSection(&foo_.cs);
foo_.b = 7;
LeaveCriticalSection(&foo_.cs);
}
}
If so, why?
Please assume the integers are 32-bit aligned. The platform is ARM.
Technically yes, but no on many platforms. First, let us assume that int is 32 bits (which is pretty common, but not nearly universal).
It is possible that the two words (16 bit parts) of a 32 bit int will be read or written to separately. On some systems, they will be read separately if the int isn't aligned properly.
Imagine a system where you can only do 32-bit aligned 32 bit reads and writes (and 16-bit aligned 16 bit reads and writes), and an int that straddles such a boundary. Initially the int is zero (ie, 0x00000000)
One thread writes 0xBAADF00D to the int, the other reads it "at the same time".
The writing thread first writes 0xBAAD to the high word of the int. The reader thread then reads the entire int (both high and low) getting 0xBAAD0000 -- which is a state that the int was never put into on purpose!
The writer thread then writes the low word 0xF00D.
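The intermediate value in this scenario can be worked out with a few lines of arithmetic. The function name torn_after_high_half is hypothetical; it models only the case described above, where the high 16-bit half is written first.

```cpp
#include <cassert>
#include <cstdint>

// Value a reader would observe if it ran after the writer stored the high
// 16-bit half of `incoming` but before it stored the low half.
std::uint32_t torn_after_high_half(std::uint32_t old_value, std::uint32_t incoming) {
    const std::uint32_t high = incoming & 0xFFFF0000u;   // already written
    const std::uint32_t low = old_value & 0x0000FFFFu;   // still the old bits
    assert((high & low) == 0u);  // the two halves never overlap
    return high | low;
}
```

For an old value of 0x00000000 and a new value of 0xBAADF00D this gives 0xBAAD0000, the value that was never stored on purpose.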
As noted, on some platforms all 32 bit reads/writes are atomic, so this isn't a concern. There are other concerns, however.
Most lock/unlock code includes instructions to the compiler to prevent reordering across the lock. Without that prevention of reordering, the compiler is free to reorder things so long as it behaves "as-if" in a single threaded context it would have worked that way. So if you read a then b in code, the compiler could read b before it reads a, so long as it doesn't see an in-thread opportunity for b to be modified in that interval.
So possibly the code you are reading is using these locks to make sure that the read of the variable happens in the order written in the code.
Other issues are raised in the comments below, but I don't feel competent to address them: cache issues, and visibility.
Looking at this, it seems that ARM has quite a relaxed memory model, so you need some form of memory barrier to ensure that writes in one thread are visible when you'd expect them in another thread. So what you are doing, or else using std::atomic, seems likely to be necessary on your platform. Unless you take this into account, you can see updates out of order in different threads, which would break your example.
I think you can use C++11 to ensure that integer reads are atomic, using (for example) std::atomic<int>.
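A sketch of what that could look like, assuming C++11 is available (the question's code base is C++03) and using std::mutex as a portable stand-in for CRITICAL_SECTION. The function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

struct Foo {
    std::atomic<int> a{0};  // may be read without taking the lock
    int b = 0;              // still guarded by `m`
    std::mutex m;           // stand-in for CRITICAL_SECTION
};

// Returns true if the guarded write to b was performed.
bool foolish(Foo& foo) {
    // Atomic read: no tearing and no data race, even if another thread
    // stores to `a` at the same time.
    if (foo.a.load(std::memory_order_acquire) != 4) return false;
    // PerformSomeTask() from the question would run here.
    std::lock_guard<std::mutex> lock(foo.m);
    foo.b = 7;
    return true;
}

// Small self-check of both paths.
void foolish_demo() {
    Foo foo;
    assert(!foolish(foo));
    foo.a.store(4, std::memory_order_release);
    assert(foolish(foo) && foo.b == 7);
}
```

Writers of a must also go through the atomic (store, or an RMW); mixing atomic reads with plain writes to the same variable is not possible with std::atomic<int> anyway.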
The C++ standard says that there's a data race if one thread writes to a variable at the same time as another thread reads from that variable, or if two threads write to the same variable at the same time. It further says that a data race produces undefined behavior. So, formally, you must synchronize those reads and writes.
There are three separate issues when one thread reads data that was written by another thread. First, there is tearing: if writing requires more than a single bus cycle, it's possible for a thread switch to occur in the middle of the operation, and another thread could see a half-written value; there's an analogous problem if a read requires more than a single bus cycle. Second, there's visibility: each processor has its own local copy of the data that it's been working on recently, and writing to one processor's cache does not necessarily update another processor's cache. Third, there's compiler optimizations that reorder reads and writes in ways that would be okay within a single thread, but will break multi-threaded code. Thread-safe code has to deal with all three problems. That's the job of synchronization primitives: mutexes, condition variables, and atomics.
Although the integer read/write operation indeed will most likely be atomic, the compiler optimizations and processor cache will still give you problems if you don't do it properly.
To explain - the compiler will normally assume that the code is single-threaded and make many optimizations that rely on that. For example, it might change the order of instructions. Or, if it sees that the variable is only written and never read, it might optimize it away entirely.
The CPU will also cache that integer, so if one thread writes it, the other one might not get to see it until a lot later.
There are two things you can do. One is to wrap it in a critical section like in your original code. The other is to mark the variable as volatile. That will signal the compiler that this variable will be accessed by multiple threads and will disable a range of optimizations, as well as placing special cache-sync instructions (aka "memory barriers") around accesses to the variable (or so I understand). Apparently this is wrong.
Added: Also, as noted by another answer, Windows has Interlocked APIs that can be used to avoid these issues for non-volatile variables.

Proper Usage of SetThreadAffinityMask

There are 12 cores, and 12 threads running. I want to bind 1 thread to each core. This is what I call at the beginning of each thread.
int core=12;
SetThreadAffinityMask(GetCurrentThread(),(1<<core)-1);
This is what I have. I don't know if this is the proper way to call it. I'm not sure I understand how the 2nd parameter works.
Do I also need to call SetProcessAffinityMask?
The second parameter to SetThreadAffinityMask() is a bit vector. Each bit corresponds to a logical processor: a CPU core or a hyper-thread. If a bit in the second parameter is set to 1, the thread is allowed to run on the corresponding core.
For core equal to 12, your mask (1<<core)-1 has bits 0..11 set, so every thread is allowed to run on any of the 12 cores. Presumably you wanted to set each thread to run on a dedicated core. For this, you need each thread to have a unique number between 0 and 11, and set only the corresponding bit of the affinity mask. Hint: you may use InterlockedIncrement() to get the unique number. Alternatively, if your threads are all started in a loop, the unique number is already known (it's the loop trip count) and you may use it, e.g. pass to each thread as an argument, or set affinity for new threads in that same loop.
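The difference between the two masks can be checked with portable arithmetic. In this sketch std::uint64_t is an assumed stand-in for DWORD_PTR on 64-bit Windows, and both function names are hypothetical; nothing here calls the Windows API.

```cpp
#include <cassert>
#include <cstdint>

using Mask = std::uint64_t;  // assumption: same width as DWORD_PTR on 64-bit Windows

// Mask that allows exactly one logical processor (numbered from 0).
Mask single_core_mask(unsigned core) {
    assert(core < 64);
    return Mask{1} << core;
}

// Mask that allows processors 0..n-1, i.e. the question's (1<<n)-1.
Mask first_n_cores_mask(unsigned n) {
    assert(n <= 64);
    return n == 64 ? ~Mask{0} : (Mask{1} << n) - 1;
}
```

first_n_cores_mask(12) is 0xFFF (all twelve processors allowed), while single_core_mask(i) for i from 0 to 11 gives 0x001, 0x002, ..., 0x800. On Windows, the per-thread value would be cast to DWORD_PTR and passed as the second argument of SetThreadAffinityMask.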
And please, pay attention to the caution in David Heffernan's answer: unless you know how to use affinity for good, you had better not play with affinity. In addition to the reasons David already mentioned, I will add application portability across computers having different numbers of sockets, cores, and hyper-threads.
You appear to be setting affinity to all 12 processors which is not what you intend.
I would, in the main thread, loop over all 12 threads setting affinity. Don't set the affinity inside the thread because that requires the thread to know its index which it often does not need to know. I'd declare a mask variable and assign it the value 1. Each time round the loop you set the thread affinity and then shift by 1. You should not change the process affinity.
A word of caution. Setting affinity is dangerous. If the user changes process affinity then you may end up with a thread that is not able to run on any processor. Be careful.
Also, it is my experience that manually setting affinity has no performance benefits and sometimes is slower. Usually the system does a good job.
You could write code like below.
GetThreadHandle(i) is the function that get the handle of each thread.
int core = 12;
for(int i=0; i<core; i++)
SetThreadAffinityMask(GetThreadHandle(i), 1<<i);
The bitmask is typically 64 bit. A more portable solution that avoids arithmetic overflow, for cases where there are more than 32 processors would be:
auto mask = (static_cast<DWORD_PTR>(1) << core);//core number starts from 0
auto ret = SetThreadAffinityMask(GetCurrentThread(), mask);