Is this request frequency limiter thread safe? - c++

To prevent excessive load on the server, I implemented a request frequency limiter using a sliding-window algorithm, which decides whether the current request is allowed to pass based on its parameters. To make the algorithm thread-safe, I used an atomic type to control how many steps the window slides, and a unique_lock to correctly sum the total number of requests in the current window.
But I'm not sure whether my implementation is thread-safe, and if it is safe, whether it will hurt service performance. Is there a better way to implement it?
class SlideWindowLimiter
{
public:
    bool TryAcquire();
    void SlideWindow(int64_t window_number);

private:
    int32_t limit_;                       // maximum number of requests per window
    int32_t split_num_;                   // number of subwindows
    int32_t window_size_;                 // size of the big window
    int32_t sub_window_size_;             // size of a subwindow = window_size_ / split_num_
    int16_t index_{0};                    // current index into the window vector
    std::mutex mtx_;
    std::vector<int32_t> sub_windows_;    // window vector
    std::atomic<int64_t> start_time_{0};  // start time of the limiter
};
bool SlideWindowLimiter::TryAcquire() {
    int64_t cur_time = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
    auto time_stamp = start_time_.load();
    int64_t window_num = std::max(cur_time - window_size_ - start_time_, int64_t(0)) / sub_window_size_;
    std::unique_lock<std::mutex> guard(mtx_, std::defer_lock);
    if (window_num > 0 &&
        start_time_.compare_exchange_strong(time_stamp, start_time_.load() + window_num * sub_window_size_)) {
        guard.lock();
        SlideWindow(window_num);
        guard.unlock();
    }
    monitor_->TotalRequestQps();
    {
        guard.lock();
        int32_t total_req = 0;
        std::cout << " " << std::endl;
        for (auto &p : sub_windows_) {
            std::cout << p << " " << std::this_thread::get_id() << std::endl;
            total_req += p;
        }
        if (total_req >= limit_) {
            monitor_->RejectedRequestQps();
            return false;
        } else {
            monitor_->PassedRequestQps();
            sub_windows_[index_] += 1;
            return true;
        }
        guard.unlock();
    }
}

void SlideWindowLimiter::SlideWindow(int64_t window_num) {
    int64_t slide_num = std::min(window_num, int64_t(split_num_));
    for (int i = 0; i < slide_num; i++) {
        index_ += 1;
        index_ = index_ % split_num_;
        sub_windows_[index_] = 0;
    }
}

First of all, thread safety is a relative property: two sequences of operations are thread-safe relative to each other. A single piece of code cannot be thread-safe by itself.
I'll instead answer "am I handling threading in such a way that reasonable thread-safety guarantees could be made with other reasonable code".
The answer is "No".
I found one concrete problem: your use of atomic and compare_exchange_strong isn't in a loop, and you access start_time_ atomically at several spots without proper care. If start_time_ changes between the three places where you read and write it, the compare_exchange_strong fails, you skip the call to SlideWindow, and then... proceed as if you had made it.
I can't think of why that would be a reasonable response to contention, so that is a "No, this code isn't written to behave reasonably under multiple threads using it".
There is a lot of code smell here. You are mixing concurrency code with a whole pile of state, which makes it unclear which mutex is guarding which data.
You have a pointer in your code (monitor_) that is never defined. Maybe it is supposed to be a global variable?
You are writing to cout using multiple << on one line. That is a bad plan in a multithreaded environment; even if your cout is concurrency-hardened, you get scrambled output. Build a buffer string and do one <<.
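For example, a sketch of that single-insertion idea inside TryAcquire (needs <sstream>; the names are the ones from the question, and whether you keep the logging at all is up to you):

    // Build the whole line first, then hand it to cout in one call.
    std::ostringstream line;
    for (const auto &p : sub_windows_)
        line << p << ' ';
    line << std::this_thread::get_id() << '\n';
    std::cout << line.str();   // one insertion per line, so output from different threads rarely interleaves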
You are passing data between functions via the back door - index_, for example. One function sets a member variable, another reads it. Is there any possibility it gets edited by another thread? Hard to audit, but it seems reasonably likely; you set it under one .lock(), then .unlock(), then read it in a later .lock() as if it were still in a sensible state. What's more, you use it to index a vector; if the vector or the index changed in unplanned ways, that could crash or lead to memory corruption.
...
I would be shocked if this code didn't have a pile of race conditions, crashes and the like in production. I see no sign of any attempt to prove that this code is concurrency safe, or simplify it to the point where it is easy to sketch such a proof.
In actual practice, any code that you haven't proven to be concurrency-safe is going to be unsafe to use concurrently. So complex concurrency code is almost guaranteed to be unsafe to use concurrently.
...
Start with a really, really simple model. If you have a mutex and some data, make that mutex and the data into a single struct, so you know exactly what that mutex is guarding.
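A minimal sketch of that idea applied to your limiter (names are illustrative; needs <mutex>, <vector>, <cstdint>):

    struct GuardedWindows
    {
        std::mutex mtx;                   // guards everything below, and nothing else
        std::vector<int32_t> sub_windows;
        int16_t index = 0;
    };

Every access then takes exactly that mutex, e.g. std::lock_guard<std::mutex> lock(windows.mtx);, and nothing outside the struct is touched while it is held.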
If you are messing with an atomic, don't use it in the middle of other code mixed up with other variables. Put it in its own class. Give that class a name, representing some concrete semantics, ideally ones that you have found elsewhere. Describe what it is supposed to do, and what the methods guarantee before and after. Then use that.
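For instance, a hedged sketch of pulling the start-time atomic behind a named interface with documented semantics (illustrative only, not necessarily your exact semantics; needs <atomic>, <cstdint>):

    // Owns the window start time; exactly one caller "wins" each advance.
    class WindowStartTime
    {
    public:
        explicit WindowStartTime(int64_t start) : start_(start) {}

        int64_t Load() const { return start_.load(); }

        // Returns true if this call moved the start time from 'expected' to
        // 'expected + delta'; false means another thread advanced it first.
        bool TryAdvance(int64_t expected, int64_t delta)
        {
            return start_.compare_exchange_strong(expected, expected + delta);
        }

    private:
        std::atomic<int64_t> start_;
    };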
Elsewhere, avoid any kind of global state. This includes class member variables used to pass state around. Pass your data explicitly from one function to another. Avoid pointers to anything mutable.
If your data is all value types in automatic storage and pointers to immutable (never changing in the lifetime of your threads) data, that data can't be directly involved in a race condition.
The remaining data is bundled up and firewalled into as small a spot as possible, and you can look at how you interact with it and determine if you are messing up.
...
Multithreaded programming is hard, especially in an environment with mutable data. If you aren't working to make it possible to prove your code is correct, you are going to produce code that isn't correct, and you won't know it.
Well, based on my experience, I do know it: any code that isn't obviously written so that it is easy to show it is correct is simply incorrect. If the code is old and has piles of patches over a decade or more, the incorrectness is probably rarer and harder to find, but it is probably still there. If it is new code, the incorrectness is probably easier to find.

Related

multiple locks on different elements of an array

If I have 8 threads and an array of 1,000,000,000 elements, I could have 1,000,000,000 mutexes, where each index represents the element within the array that is being locked and written to. However, this seems fairly wasteful to me and requires a lot of memory.
Is there a way that I can use only 8 mutexes and have the same functionality?
Thinking out loud here... and not really sure how efficient this would be, but:
You could create a method of locking certain indexes:
// Needs <vector>, <mutex>, <algorithm>.
std::vector<int> mutexed_slots;   // indexes currently locked
std::mutex mtx;

bool lock_element(int index)
{
    std::lock_guard<std::mutex> lock(mtx);
    // Check if the index is already in the mutexed list
    if (std::find(mutexed_slots.begin(), mutexed_slots.end(), index) == mutexed_slots.end())
    {
        // If it's not, then add it - now that array element is safe from other threads
        mutexed_slots.emplace_back(index);
        return true;
    }
    return false;
}

void unlock_element(int index)
{
    std::lock_guard<std::mutex> lock(mtx);
    // No need to check because you will only unlock the element that you accessed
    // (unless you are a very naughty boy indeed)
    mutexed_slots.erase(std::find(mutexed_slots.begin(), mutexed_slots.end(), index));
}
Note: This is the start of an idea, so don't knock it too hard just yet! It's also untested pseudo-code. It's not really intended as a final answer - but as a starting point. Please add comments to improve it or to suggest whether it is or isn't plausible.
Further points:
There may be a more efficient STL container to use
You could probably wrap all of this up in a class along with your data
You would need to loop through lock_element() until it returns true - again not pretty at the moment. This mechanism could be improved.
Each thread needs to remember which index it is currently working on so that it only unlocks that particular one - again, this could be better integrated within a class to ensure that behaviour.
But as a concept - workable? I would think if you need really fast access (which maybe you do) this might not be that efficient, thoughts?
Update
This could be made much more efficient if each thread/worker "registers" its own entry in mutexed_slots. Then there would be no push_back/removes from the vector (except at the start/end). Each thread just sets the index that it has locked - if it has nothing locked, the entry is set to -1 (or such). I am thinking there are many more such efficiency improvements to be made. Again, a complete class to do all this for you would be the way to implement it.
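A rough sketch of that "register one slot per worker" idea (illustrative only, not the exact code in the repo; claims can fail spuriously under contention, so the caller retries; needs <atomic>, <vector>):

    class SlotRegistry
    {
    public:
        explicit SlotRegistry(size_t workers) : owned_(workers)
        {
            for (auto &slot : owned_) slot.store(-1);   // -1 means "this worker owns nothing"
        }

        // Claim 'index' for 'worker'. May fail spuriously under contention; just retry.
        bool try_lock_element(size_t worker, int index)
        {
            owned_[worker].store(index);                 // announce the claim first
            for (size_t w = 0; w < owned_.size(); ++w) {
                if (w != worker && owned_[w].load() == index) {
                    owned_[worker].store(-1);            // someone else wants/has it: back off
                    return false;
                }
            }
            return true;
        }

        void unlock_element(size_t worker) { owned_[worker].store(-1); }

    private:
        std::vector<std::atomic<int>> owned_;            // one slot per worker, written only by that worker
    };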
Testing / Results
I implemented a tester for this, just because I quite enjoy that sort of thing. My implementation is here
I think it's a public GitHub repo - so you are welcome to take a look. But I posted the results on the top-level readme (so scroll a little to see them). I implemented some improvements such that:
There are no insert/removal to the protection array at run-time
There is no need for a lock_guard to do the "unlock" because I am relying on a std::atomic index.
Below is a printout of my summary:
Summary:
When the workload is 1ms (the time taken to perform each action) then the amount of work done was:
9808 for protected
8117 for normal
Note these values varied, sometimes the normal was higher, there appeared no clear winner.
When the workload is 0ms (basically increment a few counters) then the amount of work done was:
9791264 for protected
29307829 for normal
So here you can see that using the mutexed protection reduces the amount of work done to about a third (1/3) of the unprotected version. This ratio is consistent between tests.
I also ran the same tests for 1 worker, and the same ratios roughly held true. However, when I make the array smaller (~1000 elements), the amount of work done is still roughly the same when the workload is 1ms. But when the workload is very light I got results like:
5621311
39157931
Which is about 7 times slower.
Conclusion
The larger the array, the fewer collisions occur - and the better the performance.
The longer the workload is (per item), the less noticeable the difference made by the protection mechanism.
It appears that the locking generally only adds an overhead that is 2-3 times slower than incrementing a few counters. This is probably skewed by actual collisions, because (from the results) the longest lock time recorded was a huge 40ms - but this was when the work time was very fast, so many collisions occurred (~8 successful locks per collision).
It depends on the access pattern: do you have a way to partition the work effectively? Basically, you can partition the array into 8 chunks (or as many as you can afford) and cover each part with a mutex, but if the access pattern is random you're still going to have a lot of collisions.
Do you have TSX support on your system? That would be a classic use case: just have one global lock, and have the threads ignore it unless there's an actual collision.
You can write a class that will create locks on the fly when a particular index requires it; std::optional would be helpful for this (C++17 code ahead):
class IndexLocker {
public:
    explicit IndexLocker(size_t size) : index_locks_(size) {}

    std::mutex& get_lock(size_t i) {
        if (std::lock_guard guard(instance_lock_); index_locks_[i] == std::nullopt) {
            index_locks_[i].emplace();
        }
        return *index_locks_[i];
    }

private:
    std::vector<std::optional<std::mutex>> index_locks_;
    std::mutex instance_lock_;
};
You could also use std::unique_ptr to minimize stack-space but maintain identical semantics:
class IndexLocker {
public:
    explicit IndexLocker(size_t size) : index_locks_(size) {}

    std::mutex& get_lock(size_t i) {
        if (std::lock_guard guard(instance_lock_); index_locks_[i] == nullptr) {
            index_locks_[i] = std::make_unique<std::mutex>();
        }
        return *index_locks_[i];
    }

private:
    std::vector<std::unique_ptr<std::mutex>> index_locks_;
    std::mutex instance_lock_;
};
Using this class doesn't necessarily mean you need to create all 1,000,000 elements. You can use modulo operations to treat the locker as a "hash table" of mutexes:
constexpr size_t kLockLimit = 8;

IndexLocker index_locker(kLockLimit);

auto thread_code = [&](size_t i) {
    std::lock_guard guard(index_locker.get_lock(i % kLockLimit));
    // Do work with lock.
};
Worth mentioning that the "hash table" approach makes it very easy to deadlock (get_lock(0) followed by get_lock(16), for example). If each thread does work on exactly one element at a time, however, this shouldn't be an issue.
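If a thread ever does need two elements at once, one way to sidestep that deadlock is to take both mutexes in a single deadlock-avoiding operation; a sketch assuming C++17's std::scoped_lock and the IndexLocker/kLockLimit from above (needs <mutex>):

    void work_on_pair(IndexLocker &locker, size_t i, size_t j) {
        std::mutex &mi = locker.get_lock(i % kLockLimit);
        std::mutex &mj = locker.get_lock(j % kLockLimit);
        if (&mi == &mj) {
            // Both indexes hash to the same mutex; locking it twice would deadlock.
            std::scoped_lock guard(mi);
            // ... work on elements i and j ...
        } else {
            // std::scoped_lock acquires both with a deadlock-avoidance algorithm (std::lock).
            std::scoped_lock guard(mi, mj);
            // ... work on elements i and j ...
        }
    }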
There are other trade-offs with fine-grain locking. Atomic operations are expensive, so a parallel algorithm that locks every element can take longer than the sequential version.
How to lock efficiently depends. Are the array elements dependent on other elements in the array? Are you mostly reading? mostly writing?
I don't want to split the array into 8 parts because that will cause a high likelihood of waiting (access is random). The elements of the array are a class that I will write that will be multiple Golomb coded values.
I don't think having 8 mutexes is the way to go here. If a given lock protects an array section, you can't switch it to protect a different section in the midst of parallel execution without introducing a race condition (rendering the mutex pointless).
Are the array items small? If you can get them down to 8 bytes, you can declare your class with alignas(8) and instantiate std::atomic<YourClass> objects. (Size depends on architecture. Verify is_lock_free() returns true.) That could open up the possibility of lock-free algorithms. It almost seems like a variant of hazard pointers would be useful here. That's complex, so it's probably better to look into other approaches to parallelism if time is limited.
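A minimal sketch of that check (the Element fields are placeholders for the real Golomb-coded layout):

    #include <atomic>
    #include <cassert>
    #include <cstdint>

    // Placeholder layout: two values packed into one 8-byte word.
    struct alignas(8) Element
    {
        uint32_t a;
        uint32_t b;
    };
    static_assert(sizeof(Element) == 8, "Element must fit in a single 8-byte word");

    std::atomic<Element> slot{Element{0, 0}};

    void BumpA()
    {
        assert(slot.is_lock_free());                  // verify the platform really avoids an internal lock
        Element seen = slot.load();                   // read the whole element atomically
        Element wanted;
        do {
            wanted = seen;
            wanted.a += 1;
        } while (!slot.compare_exchange_weak(seen, wanted));   // lock-free read-modify-write
    }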

Thread-safe Settings

I'm writing some settings classes that can be accessed from everywhere in my multithreaded application. I will read these settings very often (so read access should be fast), but they are not written very often.
For primitive datatypes it looks like boost::atomic offers what I need, so I came up with something like this:
class UInt16Setting
{
private:
    boost::atomic<uint16_t> _Value;

public:
    uint16_t getValue() const { return _Value.load(boost::memory_order_relaxed); }
    void setValue(uint16_t value) { _Value.store(value, boost::memory_order_relaxed); }
};
Question 1: I'm not sure about the memory ordering. I think in my application I don't really care about memory ordering (do I?). I just want to make sure that getValue() always returns a non-corrupted value (either the old or the new one). So are my memory ordering settings correct?
Question 2: Is this approach using boost::atomic recommended for this kind of synchronization? Or are there other constructs that offer better read performance?
I will also need some more complex setting types in my application, like std::string or for example a list of boost::asio::ip::tcp::endpoints. I consider all these setting values as immutable. So once I set the value using setValue(), the value itself (the std::string or the list of endpoints itself) does not change anymore. So again I just want to make sure that I get either the old value or the new value, but not some corrupted state.
Question 3: Does this approach work with boost::atomic<std::string>? If not, what are alternatives?
Question 4: How about more complex setting types like the list of endpoints? Would you recommend something like boost::atomic<boost::shared_ptr<std::vector<boost::asio::ip::tcp::endpoint>>>? If not, what would be better?
Q1: Correct, as long as you don't try to read any shared non-atomic variables after reading the atomic. Memory barriers only synchronize access to non-atomic variables that may happen between atomic operations.
Q2: I don't know (but see below).
Q3: Should work (if it compiles). However, atomic<string> possibly isn't lock-free.
Q4: Should work, but again the implementation possibly isn't lock-free (implementing a lock-free shared_ptr is a challenging and patent-mined field).
So a readers-writer lock (as Damon suggests in the comments) will probably be simpler, and may even be more effective, if your config includes data larger than one machine word (for which native CPU atomics usually work).
[EDIT] However, atomic<shared_ptr<TheWholeStructContainingAll>> may make sense even if it is not lock-free: this approach minimizes the collision probability for readers that need more than one coherent value, though the writer has to make a new copy of the whole "parameter sheet" every time it changes something.
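As a hedged sketch of that copy-and-publish idea using the standard library (std::atomic_load/std::atomic_store on shared_ptr are the pre-C++20 equivalents of what boost offers; the Settings fields are just placeholders):

    #include <atomic>
    #include <memory>
    #include <string>

    struct Settings                                     // "the whole parameter sheet"
    {
        int timeout_ms = 0;
        std::string server_name;
    };

    std::shared_ptr<const Settings> g_settings = std::make_shared<Settings>();

    // Readers grab a snapshot; it stays valid even if a writer publishes a newer one meanwhile.
    std::shared_ptr<const Settings> LoadSettings()
    {
        return std::atomic_load(&g_settings);
    }

    // The (single) writer copies, modifies the copy, then publishes it atomically.
    void SetTimeout(int new_timeout_ms)
    {
        auto next = std::make_shared<Settings>(*std::atomic_load(&g_settings));
        next->timeout_ms = new_timeout_ms;
        std::atomic_store(&g_settings, std::shared_ptr<const Settings>(std::move(next)));
    }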
For question 1, the answer is "depends, but probably not". If you really only care that a single value isn't garbled, then yes, this is fine, and you don't care about memory order either.
Usually, though, this is a false premise.
For questions 2, 3, and 4 yes, this will work, but it will likely use locking for complex objects such as string (internally, for every access, without you knowing). Only rather small objects which are roughly the size of one or two pointers can normally be accessed/changed atomically in a lockfree manner. This depends on your platform, too.
It's a big difference whether one successfully updates one or two values atomically. Say you have the values left and right which delimit the left and right boundaries of where a task will do some processing in an array. Assume they are 50 and 100, respectively, and you change them to 101 and 150, each atomically. So the other thread picks up the change from 50 to 101 and starts doing its calculation, sees that 101 > 100, finishes, and writes the result to a file. After that, you change the output file's name, again, atomically.
Everything was atomic (and thus, more expensive than normal), but none of it was useful. The result is still wrong, and was written to the wrong file, too.
This may not be a problem in your particular case, but usually it is (and, your requirements may change in the future). Usually you really want the complete set of changes being atomic.
That said, if you have either many or complex (or, both many and complex) updates like this to do, you might want to use one big (reader-writer) lock for the whole config in the first place anyway, since that is more efficient than acquiring and releasing 20 or 30 locks or doing 50 or 100 atomic operations. Do however note that in any case, locking will severely impact performance.
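For completeness, a minimal sketch of that "one big reader-writer lock" variant, assuming C++17's std::shared_mutex (boost::shared_mutex would look the same); the Config fields are placeholders:

    #include <cstdint>
    #include <shared_mutex>
    #include <string>

    struct Config                          // placeholder for the whole settings set
    {
        uint16_t port = 0;
        std::string server_name;
    };

    class SharedConfig
    {
    public:
        Config Get() const                                 // many readers may hold the lock at once
        {
            std::shared_lock<std::shared_mutex> lock(mtx_);
            return cfg_;                                   // copy out a consistent snapshot
        }

        void Set(const Config &next)                       // writers get exclusive access
        {
            std::unique_lock<std::shared_mutex> lock(mtx_);
            cfg_ = next;
        }

    private:
        mutable std::shared_mutex mtx_;
        Config cfg_;
    };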
As pointed out in the comments above, I would preferably make a deep copy of the configuration from the one thread that modifies the configuration, and schedule updates of the reference (shared pointer) used by consumers as normal tasks. That copy-modify-publish approach is a bit similar to how MVCC databases work, too (these, too, have the problem that locking kills their performance).
Modifying a copy ensures that only readers are accessing any shared state, so no synchronization is necessary either for readers or for the single writer. Reading and writing are fast.
Swapping the configuration set happens only at well-defined points in time, when the set is guaranteed to be in a complete, consistent state and threads are guaranteed not to be doing something else with it, so no ugly surprises of any kind can happen.
A typical task-driven application would look somewhat like this (in C++-like pseudocode):
// consumer/worker thread(s)
for (;;)
{
    task = queue.pop();
    switch (task.code)
    {
    case EXIT:
        return;
    case SET_CONFIG:
        my_conf = task.data;
        break;
    default:
        task.func(task.data, &my_conf); // can read without sync
    }
}

// thread that interacts with user (also producer)
for (;;)
{
    input = get_input();
    if (input.action == QUIT)
    {
        queue.push(task(EXIT, 0, 0));
        for (auto& thread : threads)
            thread.join();
        return 0;
    }
    else if (input.action == CHANGE_SETTINGS)
    {
        new_config = new config(config); // copy, readonly operation, no sync
        // assume we have operator[] overloaded
        new_config[...] = ...;           // I own this exclusively, no sync
        task t(SET_CONFIG, 0, shared_ptr<...>(input.data));
        queue.push(t);
    }
    else if (input.action == ADD_TASK)
    {
        task t(RUN, input.func, input.data);
        queue.push(t);
    }
    ...
}
For anything more substantial than a pointer, use a mutex. The tbb (open-source) library supports the concept of reader-writer mutexes, which allow multiple simultaneous readers; see the documentation.

C - faster locking of integer when using PThreads

I have a counter that's used by multiple threads to write to a specific element in an array. Here's what I have so far...
int count = 0;
pthread_mutex_t count_mutex;

void *Foo()
{
    // something = random value from I/O redirection
    pthread_mutex_lock(&count_mutex);
    count = count + 1;
    currentCount = count;
    pthread_mutex_unlock(&count_mutex);
    // do quick assignment operation. array[currentCount] = something
}

main()
{
    // create n pthreads with the task Foo
}
The problem is that it is ungodly slow. I'm accepting a file of integers as I/O redirection and writing them into an array. It seems like each thread spends a lot of time waiting for the lock to be removed. Is there a faster way to increment the counter?
Note: I need to keep the numbers in order which is why I have to use a counter vs giving each thread a specific chunk of the array to write to.
You need to use interlocking. Check out the Interlocked* functions on Windows, or Apple's OSAtomic* functions, or maybe libatomic on Linux.
If you have a compiler that supports C++11 well you may even be able to use std::atomic.
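For instance, a minimal sketch with std::atomic, assuming a C++11 compiler and keeping the array/count names from the question:

    #include <atomic>

    std::atomic<int> count{0};

    // pthread thread functions take a void* argument and return void*.
    void *Foo(void *arg)
    {
        (void)arg;
        // something = random value from I/O redirection
        int currentCount = count.fetch_add(1) + 1;   // same value the mutexed version produced
        // array[currentCount] = something;          // each thread now owns a distinct slot
        return nullptr;
    }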
Well, one option is to batch up the changes locally somewhere before applying the batch to your protected resource.
For example, have each thread gather ten pieces of information (or less if it runs out before it's gathered ten) then modify Foo to take a length and array - that way, you amortise the cost of the locking, making it much more efficient.
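A rough sketch of that batching idea, assuming the values come from scanf on the redirected stdin and reusing count, count_mutex and array from the question (reading stdin has to be serialized anyway, so the batch can be read and its slots claimed under one lock):

    #include <pthread.h>
    #include <stdio.h>

    enum { BATCH = 10 };

    extern int array[];                    // destination array from the question
    int count = 0;
    pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;

    void *Foo(void *arg)
    {
        (void)arg;
        int batch[BATCH];
        for (;;) {
            pthread_mutex_lock(&count_mutex);
            int n = 0;
            while (n < BATCH && scanf("%d", &batch[n]) == 1)   // stdin reads are serialized here
                ++n;
            int base = count;                                  // claim slots base+1 .. base+n
            count += n;
            pthread_mutex_unlock(&count_mutex);
            if (n == 0)
                return NULL;
            for (int i = 0; i < n; ++i)                        // the writes themselves need no lock
                array[base + 1 + i] = batch[i];
        }
    }

Because the read and the slot claim happen under the same lock, the values still land in input order, but the lock is taken once per BATCH values instead of once per value.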
I'd also be very wary of doing:
// do quick assignment operation. array[currentCount] = something
outside the protected area - that's a recipe for disaster since another thread may change currentCount from underneath you. That's not a problem if it's a local variable since each thread will have its own copy but it's not clear from the code what scope that variable has.

how do i properly design a worker thread? (avoid for example Sleep(1))

I am still a beginner at multi-threading, so bear with me please:
I am currently writing an application that does some FVM calculation on a grid. It's a time-explicit model, so at every timestep I need to calculate new values for the whole grid. My idea was to distribute this calculation to 4 worker threads, which then deal with the cells of the grid (the first thread calculating 0, 4, 8..., the second thread 1, 5, 9..., and so forth).
I create those 4 threads at program start.
They look something like this:
void __fastcall TCalculationThread::Execute()
{
    bool alive = true;
    THREAD_SIGNAL ts;
    while (alive)
    {
        Sleep(1);
        if (TryEnterCriticalSection(&TMS))
        {
            ts = thread_signal;
            LeaveCriticalSection(&TMS);
            alive = !ts.kill;
            if (ts.go && !ts.done.at(this->index))
            {
                double delta_t = ts.dt;
                for (unsigned int i = this->index; i < cells.size(); i += this->steps)
                {
                    calculate_one_cell();
                }
                EnterCriticalSection(&TMS);
                thread_signal.done.at(this->index) = true;
                LeaveCriticalSection(&TMS);
            }
        }
    }
}
They use a global struct to communicate with the main thread (the main thread sets ts.go to true when the workers need to start).
Now I am sure this is not the way to do it! Not only does it feel wrong, it also doesn't perform very well...
I read, for example here, that a semaphore or an event would work better. The answer to this guy's question talks about a lockless queue.
I am not very familiar with these concepts and would like some pointers on how to continue.
Could you outline any of the ways to do this better?
Thank you for your time. (And sorry for the formatting.)
I am using Borland C++ Builder and its thread object (TThread).
A definitely more effective algorithm would be to calculate yields for 0, 1, 2, 3 on one thread, 4, 5, 6, 7 on another, etc. Interleaving memory accesses like that is very bad, even if the variables are completely independent - you'll get false sharing problems. This is the equivalent of the CPU locking every write.
Calling Sleep(1) in a calculation thread can't be a good solution to any problem. You want your threads to be doing useful work rather than blocking for no good reason.
I think your basic problem can be expressed as a serial algorithm of this basic form:
for (int i=0; i<N; i++)
cells[i]->Calculate();
You are in the happy position that calls to Calculate() are independent of each other—what you have here is a parallel for. This means that you can implement this without a mutex.
There are a variety of ways to achieve this. OpenMP would be one; a threadpool class another. If you are going to roll your own thread based solution then use InterlockedIncrement() on a shared variable to iterate through the array.
You may hit some false sharing problems, as #DeadMG suggests, but quite possibly not. If you do have false sharing then yet another approach is to stride across larger sub-arrays. Essentially the increment (i.e. stride) passed to InterlockedIncrement() would be greater than one.
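A rough sketch of that chunked, mutex-free iteration using std::atomic as a portable stand-in for the Interlocked functions; cells and Calculate() are assumed to be the ones from the serial loop above:

    #include <algorithm>
    #include <atomic>
    #include <cstddef>

    std::atomic<std::size_t> next_index{0};
    const std::size_t kStride = 64;          // chunk size; tune so neighbouring threads don't share cache lines

    void WorkerLoop()
    {
        for (;;) {
            // Atomically claim the next chunk of cells (InterlockedExchangeAdd would do the same on Win32).
            std::size_t begin = next_index.fetch_add(kStride);
            if (begin >= cells.size())
                break;
            std::size_t end = std::min(begin + kStride, cells.size());
            for (std::size_t i = begin; i < end; ++i)
                cells[i]->Calculate();       // independent work, no mutex needed
        }
    }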
The bottom line is that the way to make the code faster is to remove both the critical section (and hence the contention on it) and the Sleep(1).

Lockless Deque in Win32 C++

I'm pretty new to lockless data structures, so for an exercise I wrote (what I hope functions as) a bounded lockless deque (no resizing yet, I just want to get the base cases working). I'd just like to have some confirmation from people who know what they're doing as to whether I've got the right idea and/or how I might improve this.
class LocklessDeque
{
public:
    LocklessDeque() : m_empty(false),
                      m_bottom(0),
                      m_top(0) {}

    ~LocklessDeque()
    {
        // Delete remaining tasks
        for( unsigned i = m_top; i < m_bottom; ++i )
            delete m_tasks[i];
    }

    void PushBottom(ITask* task)
    {
        m_tasks[m_bottom] = task;
        InterlockedIncrement(&m_bottom);
    }

    ITask* PopBottom()
    {
        if( m_bottom - m_top > 0 )
        {
            m_empty = false;
            InterlockedDecrement(&m_bottom);
            return m_tasks[m_bottom];
        }
        m_empty = true;
        return NULL;
    }

    ITask* PopTop()
    {
        if( m_bottom - m_top > 0 )
        {
            m_empty = false;
            InterlockedIncrement(&m_top);
            return m_tasks[m_top];
        }
        m_empty = true;
        return NULL;
    }

    bool IsEmpty() const
    {
        return m_empty;
    }

private:
    ITask* m_tasks[16];
    bool m_empty;
    volatile unsigned m_bottom;
    volatile unsigned m_top;
};
Looking at this I would think this would be a problem:
void PushBottom(ITask* task)
{
    m_tasks[m_bottom] = task;
    InterlockedIncrement(&m_bottom);
}
If this is used in an actual multithreaded environment I would think you'd collide when setting m_tasks[m_bottom]. Think of what would happen if you have two threads trying to do this at the same time - you couldn't be sure of which one actually set m_tasks[m_bottom].
Check out this article which is a reasonable discussion of a lock-free queue.
Your use of the m_bottom and m_top members to index the array is not okay. You can use the return value of InterlockedXxxx() to get a safe index. You'll need to lose IsEmpty(); it can never be accurate in a multi-threading scenario. Same problem with the empty check in PopXxx. I don't see how you could make that work without a mutex.
The key to doing almost impossible stuff like this is to use InterlockedCompareExchange. (This is the name Win32 uses, but any multithreaded-capable platform will have an InterlockedCompareExchange equivalent.)
The idea is that you make a copy of the structure, which must be small enough to read atomically (64 bits, or 128 bits on x86 if you can handle some unportability).
You make another copy with your proposed update, do your logic and update the copy, then you update the "real" structure using InterlockedCompareExchange. What InterlockedCompareExchange does is atomically make sure the value is still the value you started with before your state update, and if it is, atomically update it with the new state. Generally this is wrapped in an infinite loop that keeps trying until no one else has changed the value in the middle. Here is roughly the pattern:
union State
{
    struct
    {
        short a;
        short b;
    };
    uint32_t volatile rawState;
} state;

void DoSomething()
{
    // Keep looping until nobody else changed it behind our back
    for (;;)
    {
        State origState;
        State newState;
        // It's important that you only read the state once per try
        origState.rawState = state.rawState;
        // This must copy origState, NOT read the state again
        newState.rawState = origState.rawState;
        // Now you can do something impossible to do atomically...
        // This example takes a lot of cycles, there is huge
        // opportunity for another thread to come in and change
        // it during this update
        if (newState.b == 3 || newState.b % 6 != 0)
        {
            newState.a++;
        }
        // Now we atomically update the state,
        // this ONLY changes state.rawState if it's still == origState.rawState
        // In either case, InterlockedCompareExchange returns what value it has now
        if (InterlockedCompareExchange(&state.rawState, newState.rawState,
                                       origState.rawState) == origState.rawState)
            return;
    }
}
(Please forgive if the above code doesn't actually compile - I wrote it off the top of my head)
Great. Now you can make lockless algorithms easy. WRONG! The trouble is that you are severely limited on the amount of data that you can update atomically.
Some lockless algorithms use a technique where they "help" concurrent threads. For example, say you have a linked list that you want to be able to update from multiple threads, other threads can "help" by performing updates to the "first" and "last" pointers if they are racing through and see that they are at the node pointed to by "last" but the "next" pointer in the node is not null. In this example, upon noticing that the "last" pointer is wrong, they update the last pointer, only if it still points to the current node, using an interlocked compare exchange.
Don't fall into a trap where you "spin" or loop (like a spinlock). While there is value in spinning briefly because you expect the "other" thread to finish something - they may not. The "other" thread may have been context switched and may not be running anymore. You are just eating CPU time, burning electricity (killing a laptop battery perhaps) by spinning until a condition is true. The moment you begin to spin, you might as well chuck your lockless code and write it with locks. Locks are better than unbounded spinning.
Just to go from hard to ridiculous, consider the mess you can get yourself into with other architectures - things are generally pretty forgiving on x86/x64, but when you get into other "weakly ordered" architectures, you get into territory where things happen that make no sense - memory updates won't happen in program order, so all your mental reasoning about what the other thread is doing goes out the window. (Even x86/x64 have a memory type called "write combining" which is often used when updating video memory but can be used for any memory buffer hardware, where you need fences) Those architectures require you to use "memory fence" operations to guarantee that all reads/writes/both before the fence will be globally visible (by other cores). A write fence guarantees that any writes before the fence will be globally visible before any writes after the fence. A read fence will guarantee that no reads after the fence will be speculatively executed before the fence. A read/write fence (aka full fence or memory fence) will make both guarantees. Fences are very expensive. (Some use the term "barrier" instead of "fence")
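To illustrate, here is a hedged sketch of the release/acquire fence pairing in C++11 terms (std::atomic_thread_fence; the spin loop is only there to keep the example tiny):

    #include <atomic>

    int payload = 0;                          // ordinary, non-atomic data
    std::atomic<bool> ready{false};

    void producer()
    {
        payload = 42;                                           // plain write
        std::atomic_thread_fence(std::memory_order_release);    // "write fence": payload is visible before 'ready'
        ready.store(true, std::memory_order_relaxed);
    }

    void consumer()
    {
        while (!ready.load(std::memory_order_relaxed)) { }      // spinning only to keep the example tiny
        std::atomic_thread_fence(std::memory_order_acquire);    // "read fence": later reads can't be hoisted above it
        // payload is guaranteed to be 42 here
    }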
My suggestion is to implement it first with locks/condition variables. If you have any trouble getting that working perfectly, it's hopeless to attempt a lockless implementation. And always measure, measure, measure. You'll probably find the performance of the implementation using locks is perfectly fine - without the uncertainty of some flaky lockless implementation with a nasty hang bug that will only show up when you're doing a demo to an important customer. Perhaps you can fix the problem by redefining the original problem into something more easily solved, perhaps by restructuring the work so bigger items (or batches of items) go into the collection, which reduces the pressure on the whole thing.
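For reference, a minimal locked version to start from and measure against (a sketch, assuming the same ITask interface as the question):

    #include <deque>
    #include <mutex>

    class LockedDeque
    {
    public:
        void PushBottom(ITask *task)
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_tasks.push_back(task);
        }

        ITask *PopBottom()                    // returns NULL when empty, like the original
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_tasks.empty())
                return NULL;
            ITask *task = m_tasks.back();
            m_tasks.pop_back();
            return task;
        }

        ITask *PopTop()
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_tasks.empty())
                return NULL;
            ITask *task = m_tasks.front();
            m_tasks.pop_front();
            return task;
        }

    private:
        std::mutex m_mutex;
        std::deque<ITask *> m_tasks;
    };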
Writing lockless concurrent algorithms is very difficult (as you've seen written 1000x elsewhere I'm sure). It is often not worth the effort either.
Addressing the problem pointed out by Aaron, I'd do something like:
void PushBottom(ITask *task) {
    int pos = InterlockedIncrement(&m_bottom);
    m_tasks[pos] = task;
}
Likewise, to pop:
ITask* PopTop() {
    int pos = InterlockedIncrement(&m_top);
    if (pos == m_bottom) // This is still subject to a race condition.
        return NULL;
    return m_tasks[pos];
}
I'd eliminate both m_empty and IsEmpty() from the design completely. The result returned by IsEmpty is subject to a race condition, meaning by the time you look at that result, it may well be stale (i.e. what it tells you about the queue may be wrong by the time you look at what it returned). Likewise, m_empty provides nothing but a record of information that's already available without it, a recipe for producing stale data. Using m_empty doesn't guarantee it can't work right, but it does increase the chances of bugs considerably, IMO.
I'm guessing it's due to the interim nature of the code, but right now you also have some serious problems with the array bounds. You aren't doing anything to force your array indexes to wrap around, so as soon as you try to push the 17th task onto the queue, you're going to have a major problem.
Edit: I should point out that the race condition noted in the comment is quite serious -- and it's not the only one either. Although somewhat better than the original, this should not be mistaken for code that will work correctly.
I'd say that writing correct lock-free code is considerably more difficult than writing correct code that uses locks. I don't know of anybody who has done so without a solid understanding of code that does use locking. Based on the original code, I'd say it would be much better to start by writing and understanding code for a queue that does use locks, and only when you've used that to gain a much better understanding of the issues involved really make an attempt at lock-free code.