Several (two or more) client threads need to run at a high frequency, but once a minute a background service thread updates a variable used by the client threads.
What is the best way to lock a variable (in fact, a vector) during the brief moment of the update, with little impact on the client threads?
There is no need to protect the vector during 'normal' operation (when the background thread is not running), since all threads only read the values.
boost::thread is used with an endless while loop that updates the vector and then sleeps for 60 seconds.
This seems like a good occasion for a Reader-Writer lock. All the clients lock the vector for reading only, and the background service thread locks it for writing only once every minute.
This is the SharedLockable concept from C++14, which is implemented in Boost.Thread as boost::shared_mutex.
The class boost::shared_mutex provides an implementation of a multiple-reader / single-writer mutex. It implements the SharedLockable concept.
Multiple concurrent calls to lock(), try_lock(), try_lock_for(), try_lock_until(), timed_lock(), lock_shared(), try_lock_shared_for(), try_lock_shared_until(), try_lock_shared() and timed_lock_shared() are permitted.
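To make the reader-writer idea concrete, here is a minimal sketch. It assumes the standard library's std::shared_timed_mutex (C++14) as a stand-in for boost::shared_mutex, and the names g_values, sum_values, replace_values and demo_reader_writer are made up for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical shared state: the vector the client threads read.
std::vector<int> g_values{1, 2, 3};
std::shared_timed_mutex g_values_mutex;  // stand-in for boost::shared_mutex

// Client threads: any number of them may hold the shared (read) lock at once.
long sum_values() {
    std::shared_lock<std::shared_timed_mutex> read_lock(g_values_mutex);
    long sum = 0;
    for (int v : g_values) sum += v;
    return sum;
}

// Background thread: the exclusive (write) lock is held only for the swap.
// The old contents end up in `fresh` and are freed after the lock is released.
void replace_values(std::vector<int> fresh) {
    std::unique_lock<std::shared_timed_mutex> write_lock(g_values_mutex);
    g_values.swap(fresh);
}

// Four readers run while one update happens; returns the sum seen afterwards.
long demo_reader_writer() {
    std::vector<std::thread> readers;
    for (int i = 0; i < 4; ++i)
        readers.emplace_back([] {
            for (int n = 0; n < 1000; ++n) {
                long s = sum_values();
                assert(s == 6 || s == 60);  // never a half-updated vector
            }
        });
    replace_values({10, 20, 30});
    for (auto& t : readers) t.join();
    return sum_values();
}
```

Building the new vector before taking the write lock keeps the exclusive section down to a swap, which is the part the clients actually have to wait for.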
That said, depending on your actual platform and CPU model, you might have better luck with an atomic variable.
If it's a primitive value, just using boost::atomic_int or similar would be fine. For a vector, consider using std::shared_ptr (which has atomic support). See e.g. "Confirmation of thread safety with std::unique_ptr/std::shared_ptr".
You can also do without the dynamic allocation (although, you're using vector already) by using two vectors, and switching a reference to the "actual" version atomically.
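A rough sketch of the shared_ptr variant follows. It assumes the C++11 free functions std::atomic_load and std::atomic_store for shared_ptr (deprecated in C++20 in favour of std::atomic<std::shared_ptr<T>>, and not necessarily lock-free). Snapshot, g_snapshot, current_snapshot, publish and demo_snapshot are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

using Snapshot = std::shared_ptr<const std::vector<int>>;

// Hypothetical global: the currently published, immutable vector.
Snapshot g_snapshot = std::make_shared<const std::vector<int>>();

// Readers take a reference-counted snapshot; it stays valid even if the
// writer publishes a newer one while the reader is still using it.
Snapshot current_snapshot() {
    return std::atomic_load(&g_snapshot);
}

// Writer: build the new vector off to the side, then publish it atomically.
void publish(std::vector<int> fresh) {
    Snapshot next = std::make_shared<const std::vector<int>>(std::move(fresh));
    std::atomic_store(&g_snapshot, next);
}

// Returns the size of the latest snapshot after two publications.
int demo_snapshot() {
    publish({1, 2, 3});
    Snapshot old_view = current_snapshot();
    publish({4, 5});
    assert(old_view->size() == 3);  // the old snapshot is untouched
    return static_cast<int>(current_snapshot()->size());
}
```

The old vector is freed when the last reader drops its snapshot, so the writer never has to wait for readers to finish.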
Related
I'm currently working on a lightweight filter (LWF) in the NDIS stack. I'm trying to inject a packet, which is stored in a global variable as an NBL. On the receive-NBL path, if an injected NBL is pending, then the thread takes a lock before picking up the injected NBL to process it. Originally I was looking at using a spin lock or a FAST_MUTEX. But according to the documentation for FAST_MUTEX, any other thread attempting to take the lock will wait for the lock to be released before continuing.
The problem is that receive NBL runs in DPC mode. This would cause a thread running a DPC to pause and wait for the lock to be released. Additionally, I'd like to be able to assert a thread's ownership of a lock.
My question is: does the Windows kernel support unique mutex locks, can these locks be taken in DPC mode, and how expensive is asserting ownership of the lock? I'm fairly new to C++, so forgive any syntax errors.
I attempted to define a mutex in the LWF object
// Header file
#pragma once
#include <mutex.h>
class LWFobject
{
public:
LWFobject()
std::mutex ExampleMutex;
std::unique_lock ExampleLock;
}
//=============================================
// CPP file
#include "LWFobject.h"
LWFobject::LWFObject()
{
ExmapleMutex = CreateMutex(
NULL,
FALSE,
NULL);
ExampleLock(ExampleMutex, std::defer_lock);
}
Is the use of unique_locks supported in the kernel? When I attempt to compile it, it throws hundreds of compilation errors when attempting to use mutex.h. I'd like to use try_lock and owns_lock.
You can't use standard ISO C++ synchronization mechanisms while inside a Windows kernel.
A Windows kernel is a whole other world in itself, and requires you to live by its rules (which are vast - see for example these two 700-page books: 1, 2).
Processing inside a Windows kernel is largely asynchronous and event-based; you handle events and schedule deferred calls or use other synchronization techniques for work that needs to be done later.
Having said that, it is possible to have a mutex in the traditional sense inside a Windows driver. It's called a Fast Mutex and requires raising IRQL to APC_LEVEL. Then you can use calls like ExAcquireFastMutex, ExTryToAcquireFastMutex and ExReleaseFastMutex to lock/try-lock/release it.
A fundamental property of a lock is which priority (IRQL) it's synchronized at. A lock can be acquired from lower priorities, but can never be acquired from a higher priority.
(Why? Imagine how the lock is implemented. The lock must raise the current task priority up to the lock's natural priority. If it didn't do this, then a task running at a low priority could grab the lock, get pre-empted by a higher priority task, which would then deadlock if it tried to acquire the same lock. So every lock has a documented natural IRQL, and the lock will first raise the current thread to that IRQL before attempting to acquire exclusivity.)
The NDIS datapath can run at any IRQL between PASSIVE_LEVEL and DISPATCH_LEVEL, inclusive. This means that anything on the datapath must only ever use locks that are synchronized at DISPATCH_LEVEL (or higher). This really limits your choices: you can use KSPIN_LOCKs, NDIS_RW_LOCKs, and a handful of other uncommon ones.
This gets viral: if you have one function that can sometimes run at DISPATCH_LEVEL (like the datapath), it forces the lock to be synchronized at DISPATCH_LEVEL, which forces any other functions that hold the lock to also run at DISPATCH_LEVEL. That can be inconvenient, for example you might want to hold the locks while reading from the registry too.
There are various approaches to design your driver:
* Use spinlocks everywhere. When reading from the registry, read into temporary variables, then grab a spinlock and copy the temporary variables into global state.
* Use mutexes (or better yet: pushlocks) everywhere. Quarantine the datapath into a component that runs at dispatch level, and carefully copy any configuration state into this component's private state.
* Somehow avoid having your datapath interact with the rest of your driver, so there's no shared state, and thus no shared locks.
* Have the datapath rush to PASSIVE_LEVEL by queuing all packets to a worker thread.
I have a few threads writing to a vector. It's possible that different threads try to write the same byte. There are no reads. Can I use only atomic_fetch_or(), as in the example, to make the vector thread safe? It compiled with GCC without errors or warnings.
std::vector<std::atomic<uint8_t>> MapVis(1024*1024);
void threador()
{
...
std::atomic_fetch_or(&MapVis[i], testor1);
}
It compiled with GCC without errors or warnings
That doesn't mean anything because compilers don't perform that sort of concurrency analysis. There are dedicated static analysis tools that may do this with varying levels of success.
Can I use only an atomic_fetch_or ...
You certainly can, and it will be safe at the level of each individual std::atomic<uint8_t>.
... the vector will become thread safe?
It's not sufficient that each element is accessed safely. You specifically need to avoid any operation that invalidates iterators (swap, resize, insert, push_back etc.).
I'd hesitate to say vector is thread-safe in this context - but you're limiting yourself to a thread-safe subset of its interface, so it will work correctly.
Note that as VTT suggests, keeping a separate partial vector per thread is better if possible. Partly because it's easier to prove correct, and partly because it avoids false sharing between cores.
Yes, this is guaranteed to be thread safe, because atomic operations guarantee:
Isolation from interrupts, signals, concurrent processes and threads
Thus when you access an element of MapVis atomically, you're guaranteed that any other thread's write to it has either fully completed or not yet started, and that your own read-modify-write will not be interleaved with another thread's.
The concern if you were using non-atomic variables would be that:
Thread A fetches the value of MapVis[i]
Thread B fetches the value of MapVis[i]
Thread A writes the ored value to MapVis[i]
Thread B writes the ored value to MapVis[i]
As you can see, Thread B needed to wait until Thread A had finished writing; otherwise it's just going to stomp on Thread A's changes to MapVis[i]. With atomic variables, the fetch and write cannot be interrupted by concurrent threads, meaning that Thread B can't interleave with Thread A's read-modify-write operation.
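As a small illustration of the atomic case, the sketch below has eight threads OR different bits into the same bytes. The names g_map, set_bit and demo_fetch_or are made up, and the vector is sized once and never resized while the threads run.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Sized once up front; no resize/push_back happens while threads are running.
std::vector<std::atomic<std::uint8_t>> g_map(1024);

// Each worker ORs its own bit into every byte of the map.
void set_bit(std::uint8_t mask) {
    assert(mask != 0);
    for (auto& cell : g_map) cell.fetch_or(mask);
}

// Runs eight workers, one per bit, and counts the bytes that ended up 0xFF.
std::size_t demo_fetch_or() {
    std::vector<std::thread> workers;
    for (int bit = 0; bit < 8; ++bit)
        workers.emplace_back(set_bit, static_cast<std::uint8_t>(1u << bit));
    for (auto& t : workers) t.join();

    std::size_t full = 0;
    for (auto& cell : g_map)
        if (cell.load() == 0xFF) ++full;
    return full;
}
```

With a plain uint8_t and `|=` instead of fetch_or, some bits could be lost to the interleaving listed above (and the program would have a data race); with fetch_or every byte ends up with all eight bits set.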
In a program I have a class M:
class M{
/*
very big immutable fields
*/
int status;
};
And I need a linked-list of objects of type M.
Four types of threads are accessing the list:
Producers: Produce and append objects to the end of the list. All of the newly produced objects have the status=NEW. (Operation time = O(1))
Consumers: Consume objects at the beginning of the list. An object can be consumed by a consumer if it has status=CONSUMER_ID. Each of the consumers keeps the first item in the linked-list that it can consume so the consumption is (amortized?) O(1)(see note below).
Destructor: Deletes consumed objects when there is a notification that says the object has been consumed correctly (Operation time = O(1)).
Modifier: Changes the status of the objects based on a state diagram. The final status of any object is the id of a consumer (Operation time = O(1) per object).
The number of consumers is less than 10. The number of Producers may be as big as a couple of hundreds. There is one modifier.
note: The modifier may modify the already consumed objects and thus the stored items of consumers may move back and forth. I did not find any better solutions for this problem (Although, the comparison between objects is O(1), the operation is no more amortized O(1)).
The performance is very important. Therefore, I want to use atomic operations or fine-grained locks (one per object) to avoid unnecessary blocking.
My questions are:
Atomic operations are preferred because they are lighter. I guess I must use locks for updating the pointers in destructor thread only and I can use atomic operations for handling contention between other threads. Please let me know if I am missing something and there is a reason that I cannot use atomic operations on status field.
I think I cannot use STL list because it does not support fine-grained locks. But would you recommend using Boost::Intrusive lists (instead of writing my own)? Here it is mentioned that intrusive data structures are harder to make thread-safe? Is this true for fine-grained locks?
The producers, consumers and destructor would be called asynchronously based on some events (I am planning to use Boost::asio). But I don't know how to run the modifier to minimize its contention with other threads. The options are:
Asynchronously from producers.
Asynchronously from consumers.
Using its own timer.
Any such call would operate on the list only if some conditions hold. My own intuition is that there is no difference between how I call the modifier. Am I missing something?
My system is Linux/GCC and I am using boost 1.47 in case it matters.
Similar question: Thread-safe deletion of a linked list node, using the fine-grained approach
The performance is very important. Therefore, I want to use atomic operations or fine-grained locks (one per object) to avoid unnecessary blocking.
This will make performance worse by increasing the probability that threads that contend (access the same data) will run at the same time on different cores. If the locks are too fine, threads may contend (ping-pong data between their caches) and run in slow lock step without ever blocking on a lock, causing terrible performance.
You want to use coarse enough locks that threads that contend over the same data block each other as soon as possible. That will force the scheduler to schedule non-contending threads, eliminating the cache ping-ponging that destroys performance.
You have a common misconception that blocking is bad. In fact, contention is bad, because it slows cores down to bus speeds. Blocking ends contention. Blocking is good because it de-schedules contending threads, allowing non-contending threads (that can run concurrently at full speed) to be scheduled.
If you're already planning to use Boost Asio, then good news! You can stop writing your custom asynchronous producer-consumer queue right now.
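The Boost Asio io_service class is an asynchronous queue, so you can easily use it to pass objects from producers to consumers. Use the io_service::post() method to enqueue a bound function object for asynchronous callback by another thread.
#include <boost/asio.hpp>
#include <boost/bind.hpp>

boost::asio::io_service io_service_;

void consume(M* m);  // declared here because produce() refers to it

void produce()
{
    M* m = new M;
    // Queue consume(m) to be invoked by a thread that is running io_service_.run()
    io_service_.post(boost::bind(&consume, m));
}

void consume(M* m)
{
    delete m;
}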
Have your producer threads call produce(), then have your consumer threads call io_service_.run(), and then consume() will be called back on your consumer threads. Instant producer-consumer!
Plus, you can enqueue all kinds of other heterogeneous events into the io_service_ to be handled by your consumer threads if you like, such as network reads and waiting for signals. Boost Asio is more than just a network library-- it's also an easy way to express a proactor, reactor, producer-consumer, thread-pool, or any other kind of threading architecture.
EDIT
Oh, and one more tip. Don't make separate pools of dedicated producer threads and dedicated consumer threads. Just make one thread for each core available on your machine (4 core machine => 4 threads). Then have all those threads call io_service_.run(). Use the io_service_ to asynchronously read stuff to produce, from files or the network or whatever, then use the io_service_ again to asynchronously consume whatever was produced.
That's the most performant threading architecture. One thread per core.
As @David Schwartz rightly noted, blocking is not always slow, and spinning (in user-space multithreaded applications) can be quite dangerous.
Moreover, the Linux pthread library has a "smart" implementation of pthread_mutex. It's designed to be lightweight: when a thread tries to lock an already acquired mutex, it spins for a while, making several attempts to get the lock before it blocks. The number of attempts is not large enough to harm your system or break real-time requirements (if any). An additional Linux-specific feature is the so-called fast user-space mutex (futex), which reduces the number of syscalls. The main idea is that a syscall is made only when a thread really needs to block on a mutex (when a thread locks an unacquired mutex, it makes no syscall).
Actually, in most cases you don't need to reinvent the wheel or introduce very specific locking techniques. If you have to, then either something is wrong with the design or you're dealing with a highly concurrent environment (at first sight, 10 consumers don't seem to be that, and all of this looks like over-engineering).
If I were you, I'd prefer to use a condition variable plus a mutex protecting the list.
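For reference, a bare-bones version of "one mutex plus one condition variable around the list" might look like the sketch below. WorkList and demo_work_list are hypothetical names, and plain ints stand in for the objects of type M.

```cpp
#include <cassert>
#include <condition_variable>
#include <list>
#include <mutex>
#include <thread>

// Hypothetical work list guarded by a single mutex and condition variable.
struct WorkList {
    std::mutex mutex;
    std::condition_variable not_empty;
    std::list<int> items;

    void push(int value) {
        {
            std::lock_guard<std::mutex> lock(mutex);
            items.push_back(value);
        }
        not_empty.notify_one();  // wake one waiting consumer
    }

    // Blocks (without spinning) until an item is available.
    int pop() {
        std::unique_lock<std::mutex> lock(mutex);
        not_empty.wait(lock, [this] { return !items.empty(); });
        int value = items.front();
        items.pop_front();
        return value;
    }
};

// One producer (the calling thread) and one consumer; returns the consumed sum.
int demo_work_list() {
    WorkList list;
    int total = 0;
    std::thread consumer([&] {
        for (int i = 0; i < 100; ++i) total += list.pop();
    });
    for (int i = 1; i <= 100; ++i) list.push(i);
    consumer.join();
    assert(list.items.empty());
    return total;
}
```

The critical sections are a few pointer updates long, so even with many producers the lock is held very briefly; measuring this first is cheaper than designing a lock-free list.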
Another thing I'd do is go over the design again. Why use one global list when a consumer needs to search it to find out whether it contains an item with its ID (and if so, remove/dequeue it)? Maybe it's better to make a separate list for each consumer? In that case you can probably get rid of the status field.
Is read access more frequent than write access? If so, it would be better to use R/W locks or RCU.
If I weren't satisfied with pthread primitives and futexes (and only after tests had proved that the locking primitives are the bottleneck, rather than the number of consumers or the algorithm I'd chosen), then I'd try to come up with a more complicated algorithm with reference counting, a separate GC thread, and a restriction that all updates be atomic.
I would advise a slightly different approach to the problem:
Producers: Enqueue objects at the end of a shared queue (SQ) and wake up the Modifier via a semaphore.
producer()
{
while (true)
{
o = get_object_from_somewhere ()
atomic_enqueue (SQ.queue, o)
signal(SQ.sem)
}
}
Consumers: Dequeue objects from the front of a per-consumer queue (CQ[i]).
consumer()
{
while (true)
{
wait (CQ[self].sem)
o = atomic_dequeue (CQ[self].queue)
process (o)
destroy (o)
}
}
Destructor: The destructor does not exist; after a consumer is done with an object, the consumer destroys it.
Modifier: The modifier dequeues objects from the shared queue, processes them and enqueues them to the private queue of the appropriate consumer.
modifier()
{
while (true)
{
wait (SQ.sem)
o = atomic_dequeue (SQ.queue)
FSM (o)
atomic_enqueue (CQ [o.status].queue, o)
signal (CQ [o.status].sem)
}
}
A note on the various atomic_xxx functions in the pseudocode: this does not necessarily mean using atomic instructions like CAS, CAS2, LL/SC, etc. It can be done using atomics, spinlocks or plain mutexes. I would advise implementing it in the most straightforward way (e.g. mutexes) and optimizing it later if it proves to be a performance issue.
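The pseudocode above could be fleshed out roughly as follows, taking the "most straightforward way" literally: each queue is a mutex-protected deque, with a condition variable playing the role of the semaphore. Object, Queue and run_pipeline are illustrative names, and `payload % consumers` is only a stand-in for the real FSM.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Object { int payload; int status; };  // status = consumer id once routed

// Queue + "semaphore": a mutex-protected deque with a condition variable.
class Queue {
public:
    void enqueue(Object o) {
        { std::lock_guard<std::mutex> lock(m_); q_.push_back(o); }
        cv_.notify_one();
    }
    Object dequeue() {  // blocks until an object is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Object o = q_.front();
        q_.pop_front();
        return o;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<Object> q_;
};

// Pushes `count` objects through producer -> modifier -> consumers and
// returns the sum of all payloads the consumers processed.
int run_pipeline(int count, int consumers) {
    assert(consumers > 0 && count % consumers == 0);
    Queue shared;                                  // SQ
    std::vector<Queue> per_consumer(consumers);    // CQ[i]
    std::vector<int> sums(consumers, 0);

    std::thread modifier([&] {
        for (int i = 0; i < count; ++i) {
            Object o = shared.dequeue();
            o.status = o.payload % consumers;      // stand-in for FSM(o)
            per_consumer[o.status].enqueue(o);
        }
    });
    std::vector<std::thread> workers;
    for (int id = 0; id < consumers; ++id)
        workers.emplace_back([&, id] {
            for (int i = 0; i < count / consumers; ++i)
                sums[id] += per_consumer[id].dequeue().payload;  // process(o)
        });

    for (int p = 0; p < count; ++p) shared.enqueue({p, -1});     // producer
    modifier.join();
    for (auto& w : workers) w.join();

    int total = 0;
    for (int s : sums) total += s;
    return total;
}
```

Because each consumer only ever touches its own queue, the only contended lock is the one on the shared queue, and that is held just long enough to push or pop one element.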
I have a thread pool with some threads (e.g. as many as number of cores) that work on many objects, say thousands of objects. Normally I would give each object a mutex to protect access to its internals, lock it when I'm doing work, then release it. When two threads would try to access the same object, one of the threads has to wait.
Now I want to save some resources and be scalable, as there may be thousands of objects, and still only a handful of threads. I'm thinking about a class design where the thread has some sort of mutex or lock object, and assigns the lock to the object when the object should be accessed. This would save resources, as I would only have as many lock objects as I have threads.
Now comes the programming part, where I want to transfer this design into code, but don't know quite where to start. I'm programming in C++ and want to use Boost classes where possible, but self written classes that handle these special requirements are ok. How would I implement this?
My first idea was to have a boost::mutex object per thread, and each object has a boost::shared_ptr that initially is unset (or NULL). Now when I want to access the object, I lock it by creating a scoped_lock object and assign it to the shared_ptr. When the shared_ptr is already set, I wait on the present lock. This idea sounds like a heap full of race conditions, so I sort of abandoned it. Is there another way to accomplish this design? A completely different way?
Edit:
The above description is a bit abstract, so let me add a specific example. Imagine a virtual world with many objects (think > 100.000). Users could move through the world and modify objects (e.g. shoot arrows at monsters). When only using one thread, I'm good with a work queue where modifications to objects are queued. I want a more scalable design, though. If 128-core processors are available, I want to use all 128 cores, so use that number of threads, each with its own work queue. One solution would be to use spatial separation, e.g. use a lock for an area. This could reduce the number of locks used, but I'm more interested in whether there's a design that saves as many locks as possible.
You could use a mutex pool instead of allocating one mutex per resource or one mutex per thread. As mutexes are requested, first check the object in question. If it already has a mutex tagged to it, block on that mutex. If not, assign a mutex to that object and signal it, taking the mutex out of the pool. Once the mutex is unsignaled, clear the slot and return the mutex to the pool.
Without knowing it, what you were looking for is Software Transactional Memory (STM).
STM systems manage the needed locks internally to ensure the ACI properties (Atomic, Consistent, Isolated). This is a research activity. You can find a lot of STM libraries; in particular I'm working on Boost.STM (the library is not yet ready for beta testing, and the documentation is not really up to date, but you can play with it). There are also some compilers that are introducing TM (such as the Intel, IBM, and SUN compilers). You can get the draft specification from here
The idea is to identify the critical regions as follows
transaction {
// transactional block
}
and let the STM system manage the needed locks, as long as it ensures the ACI properties.
The Boost.STM approach lets you write things like
int inc_and_ret(stm::object<int>& i) {
BOOST_STM_TRANSACTION {
return ++i;
} BOOST_STM_END_TRANSACTION
}
You can see the pair BOOST_STM_TRANSACTION/BOOST_STM_END_TRANSACTION as a way to delimit a scoped implicit lock.
The cost of this pseudo-transparency is 4 bytes of metadata for each stm::object.
Even if this is far from your initial design, I really think it is what was behind your goal and initial design.
I doubt there's any clean way to accomplish your design. The problem is that assigning the mutex to the object looks like it'll modify the contents of the object, so you need a mutex to protect the object from several threads trying to assign mutexes to it at once; to keep your first mutex assignment safe, you'd need another mutex to protect the first one.
Personally, I think what you're trying to cure probably isn't a problem in the first place. Before I spent much time on trying to fix it, I'd do a bit of testing to see what (if anything) you lose by simply including a Mutex in each object and being done with it. I doubt you'll need to go any further than that.
If you need to do more than that I'd think of having a thread-safe pool of objects, and anytime a thread wants to operate on an object, it has to obtain ownership from that pool. The call to obtain ownership would release any object currently owned by the requesting thread (to avoid deadlocks), and then give it ownership of the requested object (blocking if the object is currently owned by another thread). The object pool manager would probably operate in a thread by itself, automatically serializing all access to the pool management, so the pool management code could avoid having to lock access to the variables telling it who currently owns what object and such.
Personally, here's what I would do. You have a number of objects, all probably have a key of some sort, say names. So take the following list of people's names:
Bill Clinton
Bill Cosby
John Doe
Abraham Lincoln
Jon Stewart
So now you would create a number of lists: one per letter of the alphabet, say. Bill and Bill would go in one list, John and Jon in another, and Abraham in a list by himself.
Each list would be assigned to a specific thread - access would have to go through that thread (you would have to marshal operations to an object onto that thread - a great use of functors). Then you only have two places to lock:
thread() {
loop {
scoped_lock lock(list.mutex);
list.objectAccess();
}
}
list_add() {
scoped_lock lock(list.mutex);
list.add(..);
}
Keep the locks to a minimum, and if you're still doing a lot of locking, you can optimise the number of iterations you perform on the objects in your lists from 1 to 5, to minimize the amount of time spent acquiring locks. If your data set grows or is keyed by number, you can do any amount of segregating data to keep the locking to a minimum.
It sounds to me like you need a work queue. If the lock on the work queue became a bottleneck, you could switch it around so that each thread had its own work queue; then some sort of scheduler would give each incoming object to the thread with the least amount of work to do. The next level up from that is work stealing, where threads that have run out of work look at the work queues of other threads. (See Intel's Threading Building Blocks library.)
If I follow you correctly ....
struct table_entry {
void * pObject; // substitute with your object
sem_t sem; // init to empty
int nPenders; // init to zero
};
struct table_entry * table;
object_lock (void * pObject) {
goto label; // yes it is an evil goto
do {
pEntry->nPenders++;
unlock (mutex);
sem_wait (pEntry->sem);
label:
lock (mutex);
found = search (table, pObject, &pEntry);
} while (found);
add_object_to_table (table, pObject);
unlock (mutex);
}
object_unlock (void * pObject) {
lock (mutex);
pEntry = remove (table, pObject); // assuming it is in the table
if (pEntry->nPenders != 0) {
pEntry->nPenders--;
sem_post (pEntry->sem);
}
unlock (mutex);
}
The above should work, but it does have some potential drawbacks such as ...
A possible bottleneck in the search.
Thread starvation. There is no guarantee that any given thread will get out of the do-while loop in object_lock().
However, depending upon your setup, these potential draw-backs might not matter.
Hope this helps.
We here have an interest in a similar model. A solution we have considered is to have a global (or shared) lock but used in the following manner:
A flag that can be atomically set on the object. If you set the flag you then own the object.
You perform your action then reset the variable and signal (broadcast) a condition variable.
If the acquire failed you wait on the condition variable. When it is broadcast you check its state to see if it is available.
It does appear though that we need to lock the mutex each time we change the value of this variable. So there is a lot of locking and unlocking but you do not need to keep the lock for any long period.
With a "shared" lock you have one lock applying to multiple items. You would use some kind of "hash" function to determine which mutex/condition variable applies to this particular entry.
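A small sketch of the "shared" lock idea, assuming a fixed array of mutexes and std::hash on the object's address as the hash function. kStripes, mutex_for, Monster, hit and demo_striping are all made-up names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// A fixed number of mutexes shared by all objects ("lock striping").
constexpr std::size_t kStripes = 16;
std::array<std::mutex, kStripes> g_stripes;

// Maps an object to the mutex responsible for it. Different objects may map
// to the same mutex; that costs some concurrency but never correctness.
std::mutex& mutex_for(const void* object) {
    return g_stripes[std::hash<const void*>{}(object) % kStripes];
}

struct Monster { int health = 100; };

void hit(Monster& m, int damage) {
    assert(damage >= 0);
    std::lock_guard<std::mutex> lock(mutex_for(&m));
    m.health -= damage;
}

// Four threads each hit all 1000 monsters once; returns the total health left.
int demo_striping() {
    std::vector<Monster> monsters(1000);
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([&] {
            for (auto& m : monsters) hit(m, 10);
        });
    for (auto& th : threads) th.join();

    int total = 0;
    for (auto& m : monsters) total += m.health;
    return total;
}
```

Two caveats with this sketch: some standard libraries hash a pointer to its own value, so aligned addresses may use only a few of the stripes (a hash that mixes the bits, or hashing an object ID, spreads better), and a thread should not hold two stripes at once unless it always acquires them in a fixed order.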
Answering the following question, posted under @JohnDibling's answer:
Did you implement this solution? I have a similar problem and would like to know how you handled releasing the mutex back to the pool. I mean, when you release the mutex, how do you know it can safely be put back in the queue if you don't know whether another thread is holding it?
by @LeonardoBernardini
I'm currently trying to solve the same kind of problem. My approach is to create your own mutex struct (call it counterMutex) with a counter field and the real resource mutex field. Every time you try to lock the counterMutex, you first increment the counter and then lock the underlying mutex. When you're done with it, you decrement the counter and unlock the mutex, and after that check the counter to see if it's zero, which means no other thread is trying to acquire the lock. If so, put the counterMutex back in the pool. Is there a race condition when manipulating the counter, you may ask? The answer is no. Remember that you have a global mutex to ensure that only one thread can access the counterMutex at a time.
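One way this counter scheme could be written down is sketched below. It is an interpretation, not the poster's actual code: KeyedLocks, Entry and demo_keyed_locks are invented names, and "returning the mutex to the pool" is simplified to erasing the entry from a map.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

class KeyedLocks {
public:
    void lock(const void* key) {
        Entry* entry;
        {
            // The global mutex protects the table and every counter.
            std::lock_guard<std::mutex> guard(table_mutex_);
            std::unique_ptr<Entry>& slot = table_[key];
            if (!slot) slot.reset(new Entry);   // "take a mutex from the pool"
            ++slot->users;                      // holders plus waiters
            entry = slot.get();
        }
        entry->mutex.lock();  // may block, but never while holding table_mutex_
    }

    void unlock(const void* key) {
        std::lock_guard<std::mutex> guard(table_mutex_);
        auto it = table_.find(key);
        assert(it != table_.end());
        it->second->mutex.unlock();
        if (--it->second->users == 0)
            table_.erase(it);                   // "return it to the pool"
    }

    std::size_t live_entries() {
        std::lock_guard<std::mutex> guard(table_mutex_);
        return table_.size();
    }

private:
    struct Entry { std::mutex mutex; int users = 0; };
    std::mutex table_mutex_;
    std::map<const void*, std::unique_ptr<Entry>> table_;
};

// Four threads increment two counters, each guarded by its own keyed lock.
int demo_keyed_locks() {
    KeyedLocks locks;
    int counters[2] = {0, 0};
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < 1000; ++i) {
                int* target = &counters[i % 2];
                locks.lock(target);
                ++*target;
                locks.unlock(target);
            }
        });
    for (auto& th : threads) th.join();
    assert(locks.live_entries() == 0);  // every entry went back to the "pool"
    return counters[0] + counters[1];
}
```

The counter includes waiting threads, so an entry is only released once nobody holds it and nobody is queued on it; the number of live entries can never exceed the number of threads.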
When double-buffering data that's due to be shared between threads, I've used a system where one thread reads from one buffer, one thread reads from the other buffer and reads from the first buffer. The trouble is, how am I going to implement the pointer swap? Do I need to use a critical section? There's no Interlocked function available that will actually swap values. I can't have thread one reading from buffer one, then start reading from buffer two, in the middle of reading, that would be appcrash, even if the other thread didn't then begin writing to it.
I'm using native C++ on Windows in Visual Studio Ultimate 2010 RC.
Using critical sections is the accepted way of doing it. Just share a CRITICAL_SECTION object between all your threads and call EnterCriticalSection and LeaveCriticalSection on that object around your pointer manipulation/buffer reading/writing code. Try to finish your critical sections as soon as possible, leaving as much code outside the critical sections as possible.
Even if you use the double interlocked exchange trick, you still need a critical section or something to synchronize your threads, so might as well use it for this purpose too.
This sounds like a reader-writer-mutex type problem to me.
[ ... but I mostly do embedded development so this may make no sense for a Windows OS.
Actually, in an embedded OS with a priority-based scheduler, you can do this without any synchronization mechanism at all, if you guarantee that the swap is atomic and only allow the lower-priority thread to swap the buffers. ]
Suppose you have two buffers, B1 and B2, and you have two threads, T1 and T2. It's OK if T1 is using B1 while T2 is using B2. By "using" I mean reading and/or writing the buffer. Then at some time, the buffers need to swap so that T1 is using B2 and T2 is using B1. The thing you have to be careful of is that the swap is done while neither thread is accessing its buffer.
Suppose you used just one simple mutex. T1 could acquire the mutex and use B1. If T2 wanted to use B2, it would have to wait for the mutex. When T1 completed, T2 would unblock and do its work with B2. If either thread (or some third-party thread) wanted to swap the buffers, it would also have to take the mutex. Thus, using just one mutex serializes access to the buffers -- not so good.
It might work better if you use a reader-writer mutex instead. T1 could acquire a read-lock on the mutex and use B1. T2 could also acquire a read-lock on the mutex and use B2. When one of those threads (or a third-party thread) decides it's time to swap the buffers, it would have to take a write-lock on the mutex. It won't be able to acquire the write-lock until there are no more read-locks. At that point, it can swap the buffer pointers, knowing that nobody is using either buffer because when there is a write-lock on the mutex, all attempts to read-lock will block.
You have to build your own function to swap the pointers which uses a semaphore or critical section to control it. The same protection needs to be added to all users of pointers, since any code which reads a pointer which is in the midst of being modified is bad.
One way to manage this is to have all the pointer manipulation logic work under the protection of the lock.
Why can't you use InterlockedExchangePointer ?
edit: Ok, I get what you are saying now, IEP doesn't actually swap 2 live pointers with each other since it only takes a single value by reference.
See, I did originally design the threads so that they would be fully asynchronous and don't require any synchronizing in their regular operations. But, since I'm performing operations on a per-object basis in a thread pool, if a given object is unreadable because it's currently being synced, I can just do another while I'm waiting. In a sense, I can both wait and operate at the same time, since I have plenty of threads to go around.
Create two critical sections, one for each of the threads.
While rendering, hold the render crit section. The other thread can still do what it likes to the other crit section though. Use TryEnterCriticalSection, and if it's held, then return false, and add the object in a list to be re-rendered later. This should allow us to keep rendering even if a given object is currently being updated.
While updating, hold both crit sections.
While doing game logic, hold the game logic crit section. If it's already held, that's no problem, because we have more threads than actual processors. So if this thread is blocked, then another thread will just use the CPU time and this doesn't need to be managed.
You haven't mentioned what your Windows platform limitations are, but if you don't need compatibility with older versions than Windows Server 2003, or Vista on the client side, you can use the InterlockedExchange64() function to exchange a 64 bit value. By packing two 32-bit pointers into a 64-bit pair structure, you can effectively swap two pointers.
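As a portable illustration of the packing trick, the sketch below uses std::atomic<std::uint64_t> with a compare-exchange loop in place of the Windows Interlocked*64 functions, and packs two 32-bit buffer indices instead of raw pointers. pack, front_of, back_of, swap_buffers and demo_swap are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Two 32-bit values packed into one 64-bit word: high half = front buffer,
// low half = back buffer.
constexpr std::uint64_t pack(std::uint32_t front, std::uint32_t back) {
    return (static_cast<std::uint64_t>(front) << 32) | back;
}
std::uint32_t front_of(std::uint64_t packed) {
    return static_cast<std::uint32_t>(packed >> 32);
}
std::uint32_t back_of(std::uint64_t packed) {
    return static_cast<std::uint32_t>(packed);
}

std::atomic<std::uint64_t> g_pair{pack(0, 1)};

// Exchanges the two halves in one atomic step; the loop retries if another
// thread changed the pair between the load and the compare-exchange.
void swap_buffers() {
    std::uint64_t old_value = g_pair.load();
    std::uint64_t swapped;
    do {
        swapped = (old_value << 32) | (old_value >> 32);
    } while (!g_pair.compare_exchange_weak(old_value, swapped));
}

// Starts from (front = 0, back = 1) and checks that one swap reverses them.
bool demo_swap() {
    g_pair.store(pack(0, 1));
    assert(front_of(g_pair.load()) == 0);
    swap_buffers();
    std::uint64_t now = g_pair.load();
    return front_of(now) == 1 && back_of(now) == 0;
}
```

Note that this only makes the swap itself atomic. A thread that read the pair just before the swap is still working in its old buffer, so some form of synchronization around buffer use is still needed.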
There are the usual Interlocked* variations on that: InterlockedExchangeAcquire64(), InterlockedCompareExchange64(), etc.
If you need to run on, say, XP, I'd go for a critical section. When the chance of contention is low, they perform quite well.