Lock Free Queue -- Single Producer, Multiple Consumers - c++

I am looking for a method to implement lock-free queue data structure that supports single producer, and multiple consumers. I have looked at the classic method by Maged Michael and Michael Scott (1996) but their version uses linked lists. I would like an implementation that makes use of bounded circular buffer. Something that uses atomic variables?
On a side note, I am not sure why these classic methods are designed for linked lists that require a lot of dynamic memory management. In a multi-threaded program, all memory management routines are serialized. Aren't we defeating the benefits of lock-free methods by using them in conjunction with dynamic data structures?
I am trying to code this in C/C++ using pthread library on a Intel 64-bit architecture.
Thank you,
Shirish

The use of a circular buffer makes a lock necessary, since blocking is needed to prevent the head from going past the tail. But otherwise the head and tail pointers can easily be updated atomically. Or in some cases the buffer can be so large that overwriting is not an issue. (in real life you will see this in automated trading systems, with circular buffers sized to hold X minutes of market data. If you are X minutes behind, you have wayyyy worse problems than overwriting your buffer).
When I implemented the MS queue in C++, I built a lock free allocator using a stack, which is very easy to implement. If I have MSQueue then at compile time I know sizeof(MSQueue::node). Then I make a stack of N buffers of the required size. The N can grow, i.e. if pop() returns null, it is easy to go ask the heap for more blocks, and these are pushed onto the stack. Outside of the possibly blocking call for more memory, this is a lock free operation.
Note that the T cannot have a non-trivial dtor. I worked on a version that did allow for non-trivial dtors, that actually worked. But I found that it was easier just to make the T a pointer to the T that I wanted, where the producer released ownership, and the consumer acquired ownership. This of course requires that the T itself is allocated using lockfree methods, but the same allocator I made with the stack works here as well.
In any case the point of lock-free programming is not that the data structures themselves are slower. The points are this:
lock free makes me independent of the scheduler. Lock-based programming depends on the scheduler to make sure that the holders of a lock are running so that they can release the lock. This is what causes "priority inversion" On Linux there are some lock attributes to make sure this happens
If I am independent of the scheduler, the OS has a far easier time managing timeslices, and I get far less context switching
it is easier to write correct multithreaded programs using lockfree methods since I dont have to worry about deadlock , livelock, scheduling, syncronization, etc This is espcially true with shared memory implementations, where a process could die while holding a lock in shared memory, and there is no way to release the lock
lock free methods are far easier to scale. In fact, I have implemented lock free methods using messaging over a network. Distributed locks like this are a nightmare
That said, there are many cases where lock-based methods are preferable and/or required
when updating things that are expensive or impossible to copy. Most lock free methods use some sort of versioning, i.e. make a copy of the object, update it, and check if the shared version is still the same as when you copied it, then make the current version you update version. Els ecopy it again, apply the update, and check again. Keep doing this until it works. This is fine when the objects are small, but it they are large, or contain file handles, etc then not recommended
Most types are impossible to access in a lock free way, e.g. any STL container. These have invariants that require non atomic access , for example assert(vector.size()==vector.end()-vector.begin()). So if you are updating/reading a vector that is shared, you have to lock it.

This is an old question, but no one has provided an accepted solution. So I offer this info for others who may be searching.
This website: http://www.1024cores.net
Provides some really useful lockfree/waitfree data structures with thorough explanations.
What you are seeking is a lock-free solution to the reader/writer problem.
See: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem

For a traditional one-block circular buffer I think this simply cannot be done safely with atomic operations. You need to do so much in one read. Suppose you have a structure that has this:
uint8_t* buf;
unsigned int size; // Actual max. buffer size
unsigned int length; // Actual stored data length (suppose in write prohibited from being > size)
unsigned int offset; // Start of current stored data
On a read you need to do the following (this is how I implemented it anyway, you can swap some steps like I'll discuss afterwards):
Check if the read length does not surpass stored length
Check if the offset+read length do not surpass buffer boundaries
Read data out
Increase offset, decrease length
What should you certainly do synchronised (so atomic) to make this work? Actually combine steps 1 and 4 in one atomic step, or to clarify: do this synchronised:
check read_length, this can be sth like read_length=min(read_length,length);
decrease length with read_length: length-=read_length
get a local copy from offset unsigned int local_offset = offset
increase offset with read_length: offset+=read_length
Afterwards you can just do a memcpy (or whatever) starting from your local_offset, check if your read goes over circular buffer size (split in 2 memcpy's), ... . This is 'quite' threadsafe, your write method could still write over the memory you're reading, so make sure your buffer is really large enough to minimize that possibility.
Now, while I can imagine you can combine 3 and 4 (I guess that's what they do in the linked-list case) or even 1 and 2 in atomic operations, I cannot see you do this whole deal in one atomic operation :).
You can however try to drop 'length' checking if your consumers are very smart and will always know what to read. You'd also need a new woffset variable then, because the old method of (offset+length)%size to determine write offset wouldn't work anymore. Note this is close to the case of a linked list, where you actually always read one element (= fixed, known size) from the list. Also here, if you make it a circular linked list, you can read to much or write to a position you're reading at that moment!
Finally: my advise, just go with locks, I use a CircularBuffer class, completely safe for reading & writing) for a realtime 720p60 video streamer and I have got no speed issues at all from locking.

This is an old question but no one has provided an answer that precisely answers it. Given that still comes up high in search results for (nearly) the same question, there should be an answer, given that one exists.
There may be more than one solution, but here is one that has an implementation:
https://github.com/tudinfse/FFQ
The conference paper referenced in the readme details the algorithm.

Related

Is it necessary to lock an array that is *only written to* from one thread and *only read from* another?

I have two threads running. They share an array. One of the threads adds new elements to the array (and removes them) and the other uses this array (read operations only).
Is it necessary for me to lock the array before I add/remove to/from it or read from it?
Further details:
I will need to keep iterating over the entire array in the other thread. No write operations over there as previously mentioned. "Just scanning something like a fixed-size circular buffer"
The easy thing to do in such cases is to use a lock. However locks can be very slow. I did not want to use locks if their use can be avoided. Also, as it came out from the discussions, it might not be necessary (it actually isn't) to lock all operations on the array. Just locking the management of an iterator for the array (count variable that will be used by the other thread) is enough
I don't think the question is "too broad". If it still comes out to be so, please let me know. I know the question isn't perfect. I had to combine at least 3 answers in order to be able to solve the question - which suggests most people were not able to fully understand all the issues and were forced to do some guess work. But most of it came out through the comments which I have tried to incorporate in the question. The answers helped me solve my problem quite objectively and I think the answers provided here are quite a helpful resource for someone starting out with multithreading.
If two threads perform an operation on the same memory location, and at least one operation is a write operation, you have a so-called data race. According to C11 and C++11, the behaviour of programs with data races is undefined.
So, you have to use some kind of synchronization mechanism, for example:
std::atomic
std::mutex
If you are writing and reading from the same location from multiple threads you will need to to perform locking or use atomics. We can see this by looking at the C11 draft standard(The C++11 standard looks almost identical, the equivalent section would be 1.10) says the following in section 5.1.2.4 Multi-threaded executions and data races:
Two expression evaluations conflict if one of them modifies a memory
location and the other one reads or modifies the same memory location.
and:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
and:
Compiler transformations that introduce assignments to a potentially
shared memory location that would not be modified by the abstract
machine are generally precluded by this standard, since such an
assignment might overwrite another assignment by a different thread in
cases in which an abstract machine execution would not have
encountered a data race. This includes implementations of data member
assignment that overwrite adjacent members in separate memory
locations. We also generally preclude reordering of atomic loads in
cases in which the atomics in question may alias, since this may
violate the "visible sequence" rules.
If you were just adding data to the array then in the C++ world a std::atomic index would be sufficient since you can add more elements and then atomically increment the index. But since you want to grow and shrink the array then you will need to use a mutex, in the C++ world std::lock_guard would be a typical choice.
To answer your question: maybe.
Simply put, the way that the question is framed doesn't provide enough information about whether or not a lock is required.
In most standard use cases, the answer would be yes. And most of the answers here are covering that case pretty well.
I'll cover the other case.
When would you not need a lock given the information you have provided?
There are some other questions here that would help better define whether you need a lock, whether you can use a lock-free synchronization method, or whether or not you can get away with no explicit synchronization.
Will writing data ever be non-atomic? Meaning, will writing data ever result in "torn data"? If your data is a single 32 bit value on an x86 system, and your data is aligned, then you would have a case where writing your data is already atomic. It's safe to assume that if your data is of any size larger than the size of a pointer (4 bytes on x86, 8 on x64), then your writes cannot be atomic without a lock.
Will the size of your array ever change in a way that requires reallocation? If your reader is walking through your data, will the data suddenly be "gone" (memory has been "delete"d)? Unless your reader takes this into account (unlikely), you'll need a lock if reallocation is possible.
When you write data to your array, is it ok if the reader "sees" old data?
If your data can be written atomically, your array won't suddenly not be there, and it's ok for the reader to see old data... then you won't need a lock. Even with those conditions being met, it would be appropriate to use the built in atomic functions for reading and storing. But, that's a case where you wouldn't need a lock :)
Probably safest to use a lock since you were unsure enough to ask this question. But, if you want to play around with the edge case of where you don't need a lock... there you go :)
One of the threads adds new elements to the array [...] and the other [reads] this array
In order to add and remove elements to/from an array, you will need an index that specifies the last place of the array where the valid data is stored. Such index is necessary, because arrays cannot be resized without potential reallocation (which is a different story altogether). You may also need a second index to mark the initial location from which the reading is allowed.
If you have an index or two like this, and assuming that you never re-allocate the array, it is not necessary to lock when you write to the array itself, as long as you lock the writes of valid indexes.
int lastValid = 0;
int shared[MAX];
...
int count = toAddCount;
// Add the new data
for (int i = lastValid ; count != 0 ; count--, i++) {
shared[i] = new_data(...);
}
// Lock a mutex before modifying lastValid
// You need to use the same mutex to protect the read of lastValid variable
lock_mutex(lastValid_mutex);
lastValid += toAddCount;
unlock_mutex(lastValid_mutex);
The reason this works is that when you perform writes to shared[] outside the locked region, the reader does not "look" past the lastValid index. Once the writing is complete, you lock the mutex, which normally causes a flush of the CPU cache, so the writes to shared[] would be complete before the reader is allowed to see the data.
Lock? No. But you do need some synchronization mechanism.
What you're describing sounds an awful like a "SPSC" (Single Producer Single Consumer) queue, of which there are tons of lockfree implementations out there including one in the Boost.Lockfree
The general way these work is that underneath the covers you have a circular buffer containing your objects and an index. The writer knows the last index it wrote to, and if it needs to write new data it (1) writes to the next slot, (2) updates the index by setting the index to the previous slot + 1, and then (3) signals the reader. The reader then reads until it hits an index that doesn't contain the index it expects and waits for the next signal. Deletes are implicit since new items in the buffer overwrite previous ones.
You need a way to atomically update the index, which is provided by atomic<> and has direct hardware support. You need a way for a writer to signal the reader. You also might need memory fences depending on the platform s.t. (1-3) occur in order. You don't need anything as heavy as a lock.
"Classical" POSIX would indeed need a lock for such a situation, but this is overkill. You just have to ensure that the reads and writes are atomic. C and C++ have that in the language since their 2011 versions of their standards. Compilers start to implement it, at least the latest versions of Clang and GCC have it.
It depends. One situation where it could be bad is if you are removing an item in one thread then reading the last item by its index in your read thread. That read thread would throw an OOB error.
As far as I know, this is exactly the usecase for a lock. Two threads which access one array concurrently must ensure that one thread is ready with its work.
Thread B might read unfinished data if thread A did not finish work.
If it's a fixed-size array, and you don't need to communicate anything extra like indices written/updated, then you can avoid mutual exclusion with the caveat that the reader may see:
no updates at all
If your memory ordering is relaxed enough that this happens, you need a store fence in the writer and a load fence in the consumer to fix it
partial writes
if the stored type is not atomic on your platform (int generally should be)
or your values are un-aligned, and especially if they may span cache lines
This is all dependent on your platform though - hardware, OS and compiler can all affect it. You haven't told us what they are.
The portable C++11 solution is to use an array of atomic<int>. You still need to decide what memory ordering constraints you require, and what that means for correctness and performance on your platform.
If you use e.g. vector for your array (so that it can dynamically grow), then reallocation may occur during the writes, you lose.
If you use data entries larger than is always written and read atomically (virtually any complex data type), you lose.
If the compiler / optimizer decides to keep certain things in registers (such as the counter holding the number of valid entries in the array) during some operations, you lose.
Or even if the compiler / optimizer decides to switch order of execution for your array element assignments and counter increments/decrements, you lose.
So you certianly do need some sort of synchronization. What is the best way to do so (for example it may be worth while to lock only parts of the array), depends on your specifics (how often and in what pattern do the threads access the array).

Multithreading - In an array what should I protect?

I'm working on some code that has a global array that can be accessed by two threads for reading writing purposes.
There will be no batch processing where a range of indexes are read or written, so I'm trying to figure out if I should lock the entire array or only the array index I am currently using.
The easiest solution would be to consider the array a CS and put a big fat lock around it, but can I avoid this and just lock an index?
Cheers.
Locking one index implies that you can keep track of which thread is accessing what part of the array. Keeping track of this information, which is shared between the reading and the writing thread, implies that you have one lock around this information. So, you still end up with a global lock.
In this situation, I think that the most efficient approaches are:
- using a reader/writer lock
- or dividing the big array into a few subsets, each subset using a distinct lock.
If this is C++ i suggest you to use STL containers. std::vector or something else which suits your job. They are fast, easy to use, no memory leaks.
If you want to do it all by your self, then of course one method will be to use a single mutex ( which is bad ).
or you can use some reader writer thingy for the whole array.
I think its not feasible to make each element of an array thread safe with its own lock!! that would eat your memory. Check the link and there are 3 solutions with different out comes. Test them out and use the best for your case. ( don't think like "ok i think my program needs the readers preference algorithm". try using it in your system and decide. because we really cant assume such things sometimes )
There is no way of knowing what will be optimal unless you profile under realistic running conditions. I would suggest implementing an array-like class, where you can lock a varying number of elements in groups. Then you fine-tune the size of these groups.
Another option would be to enqueue all read/write operations using an active object. This would make all access sequential, and means you could use a non-concurrent array type to store the data. It would require some sort of concurrent queue data structure under the hood.

TBB concurrent_queue, how unsafe is unsafe_size?

The documentation is not very clear on how unsafe concurrent_queue::unsafe_size() is.
The doxygen comment in the file tbb/internal/concurrent_queue.h mentions:
Get size of queue; result may be invalid if queue is modified
concurrently
What I am interested in knowing is will it return a valid size at some point in the past, or may it return garbage ? (Provided that several threads are both reading and writing).
I am not interested in the exact value, but more on a "load" indication. I want to prevent my queue to explode in capacity when the consumer threads have a problem. I'd use concurrent_bounded_queue which as the name suggest has a capacity control, but I would lose the lock-free property.
You really should treat it as garbage because it is marked unsafe. I think it may actually return questionable numbers because if memory serves its really a difference between pushes and pops. However even if it were 'safe' it would still only return a number that was valid in the past, that's one of the joys of wait free programming.
If you really need to reason about the counts store an atomic count in the class that holds the queue.
Is there a reason you are not using one of the tasking primitives in tbb instead of building your own?

Optimal strategy to make a C++ hash table, thread safe

(I am interested in design of implementation NOT a readymade construct that will do it all.)
Suppose we have a class HashTable (not hash-map implemented as a tree but hash-table)
and say there are eight threads.
Suppose read to write ratio is about 100:1 or even better 1000:1.
Case A) Only one thread is a writer and others including writer can read from HashTable(they may simply iterate over entire hash table)
Case B) All threads are identical and all could read/write.
Can someone suggest best strategy to make the class thread safe with following consideration
1. Top priority to least lock contention
2. Second priority to least number of locks
My understanding so far is thus :
One BIG reader-writer lock(semaphore).
Specialize the semaphore so that there could be eight instances writer-resource for case B, where each each writer resource locks one row(or range for that matter).
(so i guess 1+8 mutexes)
Please let me know if I am thinking on the correct line, and how could we improve on this solution.
With such high read/write ratios, you should consider a lock free solution, e.g. nbds.
EDIT:
In general, lock free algorithms work as follows:
arrange your data structures such that for each function you intend to support there is a point at which you are able to, in one atomic operation, determine whether its results are valid (i.e. other threads have not mutated its inputs since they have been read) and commit to them; with no changes to state visible to other threads unless you commit. This will involve leveraging platform-specific functions such as Win32's atomic compare-and-swap or Cell's cache line reservation opcodes.
each supported function becomes a loop that repeatedly reads the inputs and attempts to perform the work, until the commit succeeds.
In cases of very low contention, this is a performance win over locking algorithms since functions mostly succeed the first time through without incurring the overhead of acquiring a lock. As contention increases, the gains become more dubious.
Typically the amount of data it is possible to atomically manipulate is small - 32 or 64 bits is common - so for functions involving many reads and writes, the resulting algorithms become complex and potentially very difficult to reason about. For this reason, it is preferable to look for and adopt a mature, well-tested and well-understood third party lock free solution for your problem in preference to rolling your own.
Hashtable implementation details will depend on various aspects of the hash and table design. Do we expect to be able to grow the table? If so, we need a way to copy bulk data from the old table into the new safely. Do we expect hash collisions? If so, we need some way of walking colliding data. How do we make sure another thread doesn't delete a key/value pair between a lookup returning it and the caller making use of it? Some form of reference counting, perhaps? - but who owns the reference? - or simply copying the value on lookup? - but what if values are large?
Lock-free stacks are well understood and relatively straightforward to implement (to remove an item from the stack, get the current top, attempt to replace it with its next pointer until you succeed, return it; to add an item, get the current top and set it as the item's next pointer, until you succeed in writing a pointer to the item as the new top; on architectures with reserve/conditional write semantics, this is enough, on architectures only supporting CAS you need to append a nonce or version number to the atomically manipulated data to avoid the ABA problem). They are one way of keeping track of free space for keys/data in an atomic lock free manner, allowing you to reduce a key/value pair - the data actually stored in a hashtable entry - to a pointer/offset or two, a small enough amount to be manipulated using your architecture's atomic instructions. There are others.
Reads then become a case of looking up the entry, checking the kvp against the requested key, doing whatever it takes to make sure the value will remain valid when we return it (taking a copy / increasing its reference count), checking the entry hasn't been modified since we began the read, returning the value if so, undoing any reference count changes and repeating the read if not.
Writes will depend on what we're doing about collisions; in the trivial case, they are simply a case of finding the correct empty slot and writing the new kvp.
The above is greatly simplified and insufficient to produce your own safe implementation, especially if you are not familiar with lock-free/wait-free techniques. Possible complications include the ABA problem, priority inversion, starvation of particular threads; I have not addressed hash collisions.
The nbds page links to an excellent presentation on a real world approach that allows growth / collisions. Others exist, a quick Google finds lots of papers.
Lock free and wait free algorithms are fascinating areas of research; I encourage the reader to Google around. That said, naive lock free implementations can easily look reasonable and behave correctly much of the time while in reality being subtly unsafe. While it is important to have a solid grasp on the principles, I strongly recommend using an existing, well-understood and proven implementation over rolling your own.
You may want to look at Java's ConcurrentHashMap implementation for one possible implementation.
The basic idea is NOT to lock for every read operation but only for writes. Since in your interview they specifically mentioned an extremely high read:write ratio it makes sense trying to stuff as much overhead as possible into writes.
The ConcurrentHashMap divides the hashtable into so called "Segments" that are themselves concurrently readable hashtables and keep every single segment in a consistent state to allow traversing without locking.
When reading you basically have the usual hashmap get() with the difference that you have to worry about reading stale values, so things like the value of the correct node, the first node of the segment table and next pointers have to be volatile (with c++'s non-existent memory model you probably can't do this portably; c++0x should help here, but haven't looked at it so far).
When putting a new element in there you get all the overhead, first of all having to lock the given segment. After locking it's basically a usual put() operation, but you have to guarantee atomic writes when updating the next pointer of a node (pointing to the newly created node whose next pointer has to be already correctly pointing to the old next node) or overwriting the value of a node.
When growing the segment, you have to rehash the existing nodes and put them into the new, larger table. The important part is to clone nodes for the new table as not to influence the old table (by changing their next pointers too early) until the new table is complete and replaces the old one (they use some clever trick there that means they only have to clone about 1/6 of the nodes - nice that but I'm not really sure how they reach that number).
Note that garbage collection makes this a whole lot easier because you don't have to worry about the old nodes that weren't reused - as soon as all readers are finished they will automatically be GCed. That's solvable though, but I'm not sure what the best approach would be.
I hope the basic idea is somewhat clear - obviously there are several points that aren't trivially ported to c++, but it should give you a good idea.
No need to lock the whole table, just have a lock per bucket. That immediately gives parallelism. Inserting a new node to the table requires a lock on the bucket about to have the head node modified. New nodes are always added at the head of the table so that readers can iterate through the nodes without worrying about seeing new nodes.
Each node has a r/w lock; readers iterating get a read lock lock. Node modification requires a write lock.
Iteration without the bucket lock leading to node removal requires an attempt to take the bucket lock, and if it fails it must release the locks and retry to avoid deadlock because the lock order is different.
Brief overview.
You can try atomic_hashtable for c
https://github.com/Taymindis/atomic_hashtable for read, write, and delete without locking while multithreading accessing the data, Simple and Stable
API documents given in README.

Any issues with large numbers of critical sections?

I have a large array of structures, like this:
typedef struct
{
int a;
int b;
int c;
etc...
}
data_type;
data_type data[100000];
I have a bunch of separate threads, each of which will want to make alterations to elements within data[]. I need to make sure that no to threads attempt to access the same data element at the same time. To be precise: one thread performing data[475].a = 3; and another thread performing data[475].b = 7; at the same time is not allowed, but one thread performing data[475].a = 3; while another thread performs data[476].a = 7; is allowed. The program is highly speed critical. My plan is to make a separate critical section for each data element like so:
typedef struct
{
CRITICAL_SECTION critsec;
int a;
int b;
int c;
etc...
}
data_type;
In one way I guess it should all work and I should have no real questions, but not having had much experience in multithreaded programming I am just feeling a little uneasy about having so many critical sections. I'm wondering if the sheer number of them could be creating some sort of inefficiency. I'm also wondering if perhaps some other multithreading technique could be faster? Should I just relax and go ahead with plan A?
With this many objects, most of their critical sections will be unlocked, and there will be almost no contention. As you already know (other comment), critical sections don't require a kernel-mode transition if they're unowned. That makes critical sections efficient for this situation.
The only other consideration would be whether you would want the critical sections inside your objects or in another array. Locality of reference is a good reason to put the critical sections inside the object. When you've entered the critical section, an entire cacheline (e.g. 16 or 32 bytes) will be in memory. With a bit of padding, you can make sure each object starts on a cacheline. As a result, the object will be (partially) in cache once its critical section is entered.
Your plan is worth trying, but I think you will find that Windows is unhappy creating that many Critical Sections. Each CS contains some kernel handle(s) and you are using up precious kernel space. I think, depending on your version of Windows, you will run out of handle memory and InitializeCriticalSection() or some other function will start to fail.
What you might want to do is have a pool of CSs available for use, and store a pointer to the 'in use' CS inside your struct. But then this gets tricky quite quickly and you will need to use Atomic operations to set/clear the CS pointer (to atomically flag the array entry as 'in use'). Might also need some reference counting, etc...
Gets complicated.
So try your way first, and see what happens. We had a similar situation once, and we had to go with a pool, but maybe things have changed since then.
Depending on the data member types in your data_type structure (and also depending on the operations you want to perform on those members), you might be able to forgo using a separate synchronization object, using the Interlocked functions instead.
In your sample code, all the data members are integers, and all the operations are assignments (and presumably reads), so you could use InterlockedExchange() to set the values atomically and InterlockedCompareExchange() to read the values atomically.
If you need to use non-integer data member types, or if you need to perform more complex operations, or if you need to coordinate atomic access to more than one operation at a time (e.g., read data[1].a and then write data[1].b), then you will have to use a synchronization object, such as a CRITICAL_SECTION.
If you must use a synchronization object, I recommend that you consider partitioning your data set into subsets and use a single synchronization object per subset. For example, you might consider using one CRITICAL_SECTION for each span of 1000 elements in the data array.
You could also consider MUTEX.
This is nice method.
Each client could reserve the resource by itself with mutex (mutual-exclusion).
This is more common, some libraries also support this with threads.
Read about boost::thread and it's mutexes
With Your approach:
data_type data[100000];
I'd be afraid of stack overflow, unless You're allocating it at the heap.
EDIT:
Boost::MUTEX
uses win32 Critical Sections
As others have pointed out, yes there is an issue and it is called too fine-grained locking.. it's resource wasteful and even though the chances are small you will start creating a lot of backing primitives and data when the things do get an occasional, call it longer than usual or whatever, contention. Plus you are wasting resources as it is not really a trivial data structure as for example in VM impls..
If I recall correctly you will have a higher chance of a SEH exception from that point onwards on Win32 or just higher memory usage. Partitioning and pooling them is probably the way to go but it is a more complex implementation. Paritioning on something else (re:action) and expecting some short-lived contention is another way to deal with it.
In any case, it is a problem of resource management with what you have right now.