What is the best architecture to frequently communicate values between multiple threads?

What is the best architecture to frequently communicate values between multiple threads? - c++

I am writing an application in C++14 that consists of a master thread and multiple slave threads. The master thread coordinates the slave threads which coordinately perform a search, each exploring a part of the search space. A slave thread sometimes encounters a bound on the search. Then it communicates this bound to the master thread which sends the bound to all other slave threads so that they can possibly narrow their searches.
A slave thread must very frequently check whether there is a new bound available, possibly at the entrance of a loop.
What would be the best way to communicate the bound to the slave threads? I can think of using std::atomic<int>, but I am afraid of the performance implications this has whenever the variable is read inside the loop.

The simplest way here is IMO to not overthink this. Just use a std::mutex for each thread, protecting a std::queue that the boundary information is in. Have the main thread wait on a std::condition_variable that each child can lock, write to a "new boundary" queue , then signals te cv, which the main thread then wakes up and copies the value to each child one at at time. As you said in your question, at the top of their loops, the child threads can check their thread-specific queue to see if there's additional bounding conditions.
You actually don't NEED the "main thread" in this. You could have the children write to all other children's queues directly (still mutex-protected), as long as you're careful to avoid deadlock, it would work that way too.
All of these classes can be seen in the thread support library, with decent documentation here.
Yes there's interrupt-based ways of doing things, but in this case polling is relatively cheap because it's not a lot of threads smashing on one mutex, but mostly thread-specific mutexes, and mutexes aren't all that expensive to lock, check, unlock quickly. You're not "holding" on to them for long periods, and thus it's OK. It's a bit of a test really: do you NEED the additional complexity of lock-free? If it's only a dozen (or less) threads, then probably not.

Basically you could make a bet with your architecture that a single write to a primitive datatype is atomic. As you only have one writer, your program would not break if you use the volatile keyword to prevent compiler optimizations that might perform updates to it only in local caches.
However everybody serious about doing things right(tm) will tell you otherwise. Have a look at this article to get a pretty good riskassessment: http://preshing.com/20130618/atomic-vs-non-atomic-operations/
So if you want to be on the safe side, which I recommend, you need to follow the C++ standard. As the C++ standard does not guarantee any atomicity even for the simplest operations, you are stuck with using std::atomic. But honestly, I don't think it is too bad. Sure there is a lock involved, but you can balance out the reading frequency with the benefit of knowing the new boundary early.
To prevent polling the atomic variable, you could use the POSIX signal mechanism to notify slave threads of an update (make sure it works with the platform you are programming for). If that benefits performance or not needs to be seen.

This is actually very simple. You only have to be aware of how things work to be confident the simple solution is not broken. So, what you need is two things:
1. Be sure the variable is written/read to/from memory every time you access it.
2. Be sure you read it in an atomic way, which means you have to read the full value in one go, or if it is not done naturally, have a cheap test to verify it.
To address #1, you have to declare it volatile. Make sure the volatile keyword is applied to the variable itself. Not it's pointer of anything like that.
To address #2, it depends on the type. On x86/64 accesses to integer types is atomic as long as they are aligned to their size. That is, int32_t has to be aligned to 4 bit boundary, and int64_t has to be aligned to 8 byte boundary.
So you may have something like this:
struct Params {
volatile uint64_t bound __attribute__((aligned(8)));
};
If your bounds variable is more complex (a struct) but still fits in 64 bits, you may union it with uint64_t and use the same attribute and volatile as above.
If it's too big for 64 bit, you will need some sort of a lock to ensure you did not read half stale value. The best lock for your circumstances (single writer, multiple readers) is a sequence lock. A sequence lock is simply an volatile int, like above, that serves as the version of the data. Its value starts from 0 and advances 2 on every update. You increment it by 1 before updating the protected value, and again afterwards. The net result is that even numbers are stable states and odd numbers are transient (value updating). In the readers you do this:
1. Read the version. If not changed - return
2. Read till you get an even number
3. Read the protected variable
4. Read the version again. If you get the same number as before - you're good
5. Otherwise - back to step 2
This is actually one of the topics in my next article. I'll implement that in C++ and let you know. Meanwhile, you can look at the seqlock in the linux kernel.
Another word of caution - you need compiler barriers between your memory accesses so that the compiler does not reorder things it should really not. That's how you do it in gcc:
asm volatile ("":::"memory");

Related

Is it possible that a store with memory_order_relaxed never reaches other threads?

Suppose I have a thread A that writes to an atomic_int x = 0;, using x.store(1, std::memory_order_relaxed);. Without any other synchronization methods, how long would it take before other threads can see this, using x.load(std::memory_order_relaxed);? Is it possible that the value written to x stays entirely thread-local given the current definition of the C/C++ memory model that the standard gives?
The practical case that I have at hand is where a thread B reads an atomic_bool frequently to check if it has to quit; Another thread, at some point, writes true to this bool and then calls join() on thread B. Clearly I do not mind to call join() before thread B can even see that the atomic_bool was set, nor do I mind when thread B already saw the change and exited execution before I call join(). But I am wondering: using memory_order_relaxed on both sides, is it possible to call join() and block "forever" because the change is never propagated to thread B?
Edit
I contacted Mark Batty (the brain behind mathematically verifying and subsequently fixing the C++ memory model requirements). Originally about something else (which turned out to be a known bug in cppmem and his thesis; so fortunately I didn't make a complete fool of myself, and took the opportunity to ask him about this too; his answer was:
Q: Can it theoretically be that such a store [memory_order_relaxed without (any following) release operation] never reaches the other thread?
Mark: Theoretically, yes, but I don't think that has been observed.
Q: In other words, do relaxed stores make no sense
whatsoever unless you combine them with some release operation (and
acquire on the other thread), assuming you want another thread to
see it?
Mark: Nearly all of the use cases for them do use release and acquire, yes.

This is all the standard has to say on the matter, I believe:
[intro.multithread]/25 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

In practice
Without any other synchronization methods, how long would it take
before other threads can see this, using
x.load(std::memory_order_relaxed);?
No time. It's a normal write, it goes to the store buffer, so it will be available in the L1d cache in less time than a blink. But that's only when the assembly instruction is run.
Instructions can be reordered by the compiler, but no reasonable compiler would reorder atomic operation over arbitrarily long loops.
In theory
Q: Can it theoretically be that such a store [memory_order_relaxed
without (any following) release operation] never reaches the other
thread?
Mark: Theoretically, yes,
You should have asked him what would happen if the "following release fence" was added back. Or with atomic store release operation.
Why wouldn't these be reordered and delayed a loooong time? (so long that it seems like an eternity in practice)
Is it possible that the value written to x stays entirely thread-local
given the current definition of the C/C++ memory model that the
standard gives?
If an imaginary and especially perverse implementation wanted to delay the visibility of atomic operation, why would it do that only for relaxed operations? It could well do it for all atomic operations.
Or never run some threads.
Or run some threads so slowly that you would believe they aren't running.

This is what the standard says in 29.3.12:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
There is no guarantee a store will become visible in another thread, there is no guaranteed timing and there is no formal relationship with memory order.
Of course, on each regular architecture a store will become visible, but on rare platforms that do not support cache coherency, it may never become visible to a load.
In that case, you would have to reach for an atomic read-modify-write operation to get the latest value in the modification order.

Is it necessary to lock an array that is only written to from one thread and only read from another?

I have two threads running. They share an array. One of the threads adds new elements to the array (and removes them) and the other uses this array (read operations only).
Is it necessary for me to lock the array before I add/remove to/from it or read from it?
Further details:
I will need to keep iterating over the entire array in the other thread. No write operations over there as previously mentioned. "Just scanning something like a fixed-size circular buffer"
The easy thing to do in such cases is to use a lock. However locks can be very slow. I did not want to use locks if their use can be avoided. Also, as it came out from the discussions, it might not be necessary (it actually isn't) to lock all operations on the array. Just locking the management of an iterator for the array (count variable that will be used by the other thread) is enough
I don't think the question is "too broad". If it still comes out to be so, please let me know. I know the question isn't perfect. I had to combine at least 3 answers in order to be able to solve the question - which suggests most people were not able to fully understand all the issues and were forced to do some guess work. But most of it came out through the comments which I have tried to incorporate in the question. The answers helped me solve my problem quite objectively and I think the answers provided here are quite a helpful resource for someone starting out with multithreading.

If two threads perform an operation on the same memory location, and at least one operation is a write operation, you have a so-called data race. According to C11 and C++11, the behaviour of programs with data races is undefined.
So, you have to use some kind of synchronization mechanism, for example:
std::atomic
std::mutex

If you are writing and reading from the same location from multiple threads you will need to to perform locking or use atomics. We can see this by looking at the C11 draft standard(The C++11 standard looks almost identical, the equivalent section would be 1.10) says the following in section 5.1.2.4 Multi-threaded executions and data races:
Two expression evaluations conflict if one of them modifies a memory
location and the other one reads or modifies the same memory location.
and:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
and:
Compiler transformations that introduce assignments to a potentially
shared memory location that would not be modified by the abstract
machine are generally precluded by this standard, since such an
assignment might overwrite another assignment by a different thread in
cases in which an abstract machine execution would not have
encountered a data race. This includes implementations of data member
assignment that overwrite adjacent members in separate memory
locations. We also generally preclude reordering of atomic loads in
cases in which the atomics in question may alias, since this may
violate the "visible sequence" rules.
If you were just adding data to the array then in the C++ world a std::atomic index would be sufficient since you can add more elements and then atomically increment the index. But since you want to grow and shrink the array then you will need to use a mutex, in the C++ world std::lock_guard would be a typical choice.

To answer your question: maybe.
Simply put, the way that the question is framed doesn't provide enough information about whether or not a lock is required.
In most standard use cases, the answer would be yes. And most of the answers here are covering that case pretty well.
I'll cover the other case.
When would you not need a lock given the information you have provided?
There are some other questions here that would help better define whether you need a lock, whether you can use a lock-free synchronization method, or whether or not you can get away with no explicit synchronization.
Will writing data ever be non-atomic? Meaning, will writing data ever result in "torn data"? If your data is a single 32 bit value on an x86 system, and your data is aligned, then you would have a case where writing your data is already atomic. It's safe to assume that if your data is of any size larger than the size of a pointer (4 bytes on x86, 8 on x64), then your writes cannot be atomic without a lock.
Will the size of your array ever change in a way that requires reallocation? If your reader is walking through your data, will the data suddenly be "gone" (memory has been "delete"d)? Unless your reader takes this into account (unlikely), you'll need a lock if reallocation is possible.
When you write data to your array, is it ok if the reader "sees" old data?
If your data can be written atomically, your array won't suddenly not be there, and it's ok for the reader to see old data... then you won't need a lock. Even with those conditions being met, it would be appropriate to use the built in atomic functions for reading and storing. But, that's a case where you wouldn't need a lock :)
Probably safest to use a lock since you were unsure enough to ask this question. But, if you want to play around with the edge case of where you don't need a lock... there you go :)

One of the threads adds new elements to the array [...] and the other [reads] this array
In order to add and remove elements to/from an array, you will need an index that specifies the last place of the array where the valid data is stored. Such index is necessary, because arrays cannot be resized without potential reallocation (which is a different story altogether). You may also need a second index to mark the initial location from which the reading is allowed.
If you have an index or two like this, and assuming that you never re-allocate the array, it is not necessary to lock when you write to the array itself, as long as you lock the writes of valid indexes.
int lastValid = 0;
int shared[MAX];
...
int count = toAddCount;
// Add the new data
for (int i = lastValid ; count != 0 ; count--, i++) {
shared[i] = new_data(...);
}
// Lock a mutex before modifying lastValid
// You need to use the same mutex to protect the read of lastValid variable
lock_mutex(lastValid_mutex);
lastValid += toAddCount;
unlock_mutex(lastValid_mutex);
The reason this works is that when you perform writes to shared[] outside the locked region, the reader does not "look" past the lastValid index. Once the writing is complete, you lock the mutex, which normally causes a flush of the CPU cache, so the writes to shared[] would be complete before the reader is allowed to see the data.

Lock? No. But you do need some synchronization mechanism.
What you're describing sounds an awful like a "SPSC" (Single Producer Single Consumer) queue, of which there are tons of lockfree implementations out there including one in the Boost.Lockfree
The general way these work is that underneath the covers you have a circular buffer containing your objects and an index. The writer knows the last index it wrote to, and if it needs to write new data it (1) writes to the next slot, (2) updates the index by setting the index to the previous slot + 1, and then (3) signals the reader. The reader then reads until it hits an index that doesn't contain the index it expects and waits for the next signal. Deletes are implicit since new items in the buffer overwrite previous ones.
You need a way to atomically update the index, which is provided by atomic<> and has direct hardware support. You need a way for a writer to signal the reader. You also might need memory fences depending on the platform s.t. (1-3) occur in order. You don't need anything as heavy as a lock.

"Classical" POSIX would indeed need a lock for such a situation, but this is overkill. You just have to ensure that the reads and writes are atomic. C and C++ have that in the language since their 2011 versions of their standards. Compilers start to implement it, at least the latest versions of Clang and GCC have it.

It depends. One situation where it could be bad is if you are removing an item in one thread then reading the last item by its index in your read thread. That read thread would throw an OOB error.

As far as I know, this is exactly the usecase for a lock. Two threads which access one array concurrently must ensure that one thread is ready with its work.
Thread B might read unfinished data if thread A did not finish work.

If it's a fixed-size array, and you don't need to communicate anything extra like indices written/updated, then you can avoid mutual exclusion with the caveat that the reader may see:
no updates at all
If your memory ordering is relaxed enough that this happens, you need a store fence in the writer and a load fence in the consumer to fix it
partial writes
if the stored type is not atomic on your platform (int generally should be)
or your values are un-aligned, and especially if they may span cache lines
This is all dependent on your platform though - hardware, OS and compiler can all affect it. You haven't told us what they are.
The portable C++11 solution is to use an array of atomic<int>. You still need to decide what memory ordering constraints you require, and what that means for correctness and performance on your platform.

If you use e.g. vector for your array (so that it can dynamically grow), then reallocation may occur during the writes, you lose.
If you use data entries larger than is always written and read atomically (virtually any complex data type), you lose.
If the compiler / optimizer decides to keep certain things in registers (such as the counter holding the number of valid entries in the array) during some operations, you lose.
Or even if the compiler / optimizer decides to switch order of execution for your array element assignments and counter increments/decrements, you lose.
So you certianly do need some sort of synchronization. What is the best way to do so (for example it may be worth while to lock only parts of the array), depends on your specifics (how often and in what pattern do the threads access the array).

C++ Thread Safe Integer

I have currently created a C++ class for a thread safe integer which simply stores an integer privately and has public get a set functions which use a boost::mutex to ensure that only one change at a time can be applied to the integer.
Is this the most efficient way to do it, I have been informed that mutexes are quite resource intensive? The class is used a lot, very rapidly so it could well be a bottleneck...
Googleing C++ Thread Safe Integer returns unclear views and oppinions on the thread safety of integer operations on different architectures.
Some say that a 32bit int on a 32bit arch is safe, but 64 on 32 isn't due to 'alignment' Others say it is compiler/OS specific (which I don't doubt).
I am using Ubuntu 9.10 on 32 bit machines, some have dual cores and so threads may be executed simultaneously on different cores in some cases and I am using GCC 4.4's g++ compiler.
Thanks in advance...
Please Note: The answer I have marked as 'correct' was most suitable for my problem - however there are some excellent points made in the other answers and they are all worth reading!

There is the C++0x atomic library, and there is also a Boost.Atomic library under development that use lock free techniques.

It's not compiler and OS specific, it's architecture specific. The compiler and OS come into it because they're the tools you work through, but they're not the ones setting the real rules. This is why the C++ standard won't touch the issue.
I have never in my life heard of an 64-bit integer write, which can be split into two 32-bit writes, being interrupted halfway through. (Yes, that's an invitation to others to post counterexamples.) Specifically, I have never heard of a CPU's load/store unit allowing a misaligned write to be interrupted; an interrupting source has to wait for the whole misaligned access to complete.
To have an interruptible load/store unit, its state would have to be saved to the stack... and the load/store unit is what saves the rest of the CPU's state to the stack. This would be hugely complicated, and bug prone, if the load/store unit were interruptible... and all that you would gain is one cycle less latency in responding to interrupts, which, at best, is measured in tens of cycles. Totally not worth it.
Back in 1997, A coworker and I wrote a C++ Queue template which was used in a multiprocessing system. (Each processor had its own OS running, and its own local memory, so these queues were only needed for memory shared between processors.) We worked out a way to make the queue change state with a single integer write, and treated this write as an atomic operation. Also, we required that each end of the queue (i.e. the read or write index) be owned by one and only one processor. Thirteen years later, the code is still running fine, and we even have a version that handles multiple readers.
Still, if you want to treat a 64-bit integer write as atomic, align the field to a 64-bit bound. Why worry?
EDIT: For the case you mention in your comment, I'd need more information to be sure, so let me give an example of something that could be implemented without specialized synchronization code.
Suppose you have N writers and one reader. You want the writers to be able to signal events to the reader. The events themselves have no data; you just want an event count, really.
Declare a structure for the shared memory, shared between all writers and the reader:
#include <stdint.h>
struct FlagTable
{ uint32_t flag[NWriters];
};
(Make this a class or template or whatever as you see fit.)
Each writer needs to be told its index and given a pointer to this table:
class Writer
{public:
Writer(FlagTable* flags_, size_t index_): flags(flags_), index(index_) {}
void SignalEvent(uint32_t eventCount = 1);
private:
FlagTable* flags;
size_t index;
}
When the writer wants to signal an event (or several), it updates its flag:
void Writer::SignalEvent(uint32_t eventCount)
{ // Effectively atomic: only one writer modifies this value, and
// the state changes when the incremented value is written out.
flags->flag[index] += eventCount;
}
The reader keeps a local copy of all the flag values it has seen:
class Reader
{public:
Reader(FlagTable* flags_): flags(flags_)
{ for(size_t i = 0; i < NWriters; ++i)
seenFlags[i] = flags->flag[i];
}
bool AnyEvents(void);
uint32_t CountEvents(int writerIndex);
private:
FlagTable* flags;
uint32_t seenFlags[NWriters];
}
To find out if any events have happened, it just looks for changed values:
bool Reader::AnyEvents(void)
{ for(size_t i = 0; i < NWriters; ++i)
if(seenFlags[i] != flags->flag[i])
return true;
return false;
}
If something happened, we can check each source and get the event count:
uint32_t Reader::CountEvents(int writerIndex)
{ // Only read a flag once per function call. If you read it twice,
// it may change between reads and then funny stuff happens.
uint32_t newFlag = flags->flag[i];
// Our local copy, though, we can mess with all we want since there
// is only one reader.
uint32_t oldFlag = seenFlags[i];
// Next line atomically changes Reader state, marking the events as counted.
seenFlags[i] = newFlag;
return newFlag - oldFlag;
}
Now the big gotcha in all this? It's nonblocking, which is to say that you can't make the Reader sleep until a Writer writes something. The Reader has to choose between sitting in a spin-loop waiting for AnyEvents() to return true, which minimizes latency, or it can sleep a bit each time through, which saves CPU but could let a lot of events build up. So it's better than nothing, but it's not the solution to everything.
Using actual synchronization primitives, one would only need to wrap this code with a mutex and condition variable to make it properly blocking: the Reader would sleep until there was something to do. Since you used atomic operations with the flags, you could actually keep the amount of time the mutex is locked to a minimum: the Writer would only need to lock the mutex long enough to send the condition, and not set the flag, and the reader only needs to wait for the condition before calling AnyEvents() (basically, it's like the sleep-loop case above, but with a wait-for-condition instead of a sleep call).

C++ has no real atomic integer implementation, neither do most common libraries.
Consider the fact that even if said implementation would exist, it would have to rely on some sort of mutex - due to the fact that you cannot guarantee atomic operations across all architectures.

As you're using GCC, and depending on what operations you want to perform on the integer, you might get away with GCC's atomic builtins.
These might be a bit faster than mutexes, but in some cases still a lot slower than "normal" operations.

For full, general purpose synchronization, as others have already mentioned, the traditional synchronization tools are pretty much required. However, for certain special cases it is possible to take advantage of hardware optimizations. Specifically, most modern CPUs support atomic increment & decrement on integers. The GLib library has pretty good cross-platform support for this. Essentially, the library wraps CPU & compiler specific assembly code for these operations and defaults to mutex protection where they're not available. It's certainly not very general-purpose but if you're only interested in maintaining a counter, this might be sufficient.

you can also have a look at the atomic ops section of intels Thread Building Blocks or the atomic_ops project

Synchronizing access to variable

I need to provide synchronization to some members of a structure.
If the structure is something like this
struct SharedStruct {
int Value1;
int Value2;
}
and I have a global variable
SharedStruct obj;
I want that the write from a processor
obj.Value1 = 5; // Processor B
to be immediately visible to the other processors, so that when I test the value
if(obj.Value1 == 5) { DoSmth(); } // Processor A
else DoSmthElse();
to get the new value, not some old value from the cache.
First I though that if I use volatile when writing/reading the values, it is enough. But I read that volatile can't solve this kind o issues.
The members are guaranteed to be properly aligned on 2/4/8 byte boundaries, and writes should be atomic in this case, but I'm not sure how the cache could interfere with this.
Using memory barriers (mfence, sfence, etc.) would be enough ? Or some interlocked operations are required ?
Or maybe something like
lock mov addr, REGISTER
?
The easiest would obviously be some locking mechanism, but speed is critical and can't afford locks :(
Edit
Maybe I should clarify a bit. The value is set only once (behaves like a flag). All the other threads need just to read it. That's why I think that it may be a way to force the read of this new value without using locks.
Thanks in advance!

There Ain't No Such Thing As A Free Lunch. If your data is being accessed from multiple threads, and it is necessary that updates are immediately visible by those other threads, then you have to protect the shared struct by a mutex, or a readers/writers lock, or some similar mechanism.
Performance is a valid concern when synchronizing code, but it is trumped by correctness. Generally speaking, aim for correctness first and then profile your code. Worrying about performance when you haven't yet nailed down correctness is premature optimization.

Use explicitly atomic instructions. I believe most compilers offer these as intrinsics. Compare and Exchange is another good one.
If you intend to write a lockless algorithm, you need to write it so that your changes only take effect when conditions are as expected.
For example, if you intend to insert a linked list object, use the compare/exchange stuff so that it only inserts if the pointer still points at the same location when you actually do the update.
Or if you are going to decrement a reference count and free the memory at count 0, you will want to pre-free it by making it unavailable somehow, check that the count is still 0 and then really free it. Or something like that.
Using a lock, operate, unlock design is generally a lot easier. The lock-free algorithms are really difficult to get right.

All the other answers here seem to hand wave about the complexities of updating shared variables using mutexes, etc. It is true that you want the update to be atomic.
And you could use various OS primitives to ensure that, and that would be good
programming style.
However, on most modern processors (certainly the x86), writes of small, aligned scalar values is atomic and immediately visible to other processors due to cache coherency.
So in this special case, you don't need all the synchronizing junk; the hardware does the
atomic operation for you. Certainly this is safe with 4 byte values (e.g., "int" in 32 bit C compilers).
So you could just initialize Value1 with an uninteresting value (say 0) before you start the parallel threads, and simply write other values there. If the question is exiting the loop on a fixed value (e.g., if value1 == 5) this will be perfectly safe.
If you insist on capturing the first value written, this won't work. But if you have a parallel set of threads, and any value written other than the uninteresting one will do, this is also fine.

I second peterb's answer to aim for correctness first. Yes, you can use memory barriers here, but they will not do what you want.
You said immediately. However, how immediate this update ever can be, you could (and will) end up with the if() clause being executed, then the flag being set, and than the DoSmthElse() being executed afterwards. This is called a race condition...
You want to synchronize something, it seems, but it is not this flag.

Making the field volatile should make the change "immediately" visible in other threads, but there is no guarantee that the instant at which thread A executes the update doesn't occur after thread B tests the value but before thread B executes the body of the if/else statement.
It sounds like what you really want to do is make that if/else statement atomic, and that will require either a lock, or an algorithm that is tolerant of this sort of situation.

Any issues with large numbers of critical sections?

I have a large array of structures, like this:
typedef struct
{
int a;
int b;
int c;
etc...
}
data_type;
data_type data[100000];
I have a bunch of separate threads, each of which will want to make alterations to elements within data[]. I need to make sure that no to threads attempt to access the same data element at the same time. To be precise: one thread performing data[475].a = 3; and another thread performing data[475].b = 7; at the same time is not allowed, but one thread performing data[475].a = 3; while another thread performs data[476].a = 7; is allowed. The program is highly speed critical. My plan is to make a separate critical section for each data element like so:
typedef struct
{
CRITICAL_SECTION critsec;
int a;
int b;
int c;
etc...
}
data_type;
In one way I guess it should all work and I should have no real questions, but not having had much experience in multithreaded programming I am just feeling a little uneasy about having so many critical sections. I'm wondering if the sheer number of them could be creating some sort of inefficiency. I'm also wondering if perhaps some other multithreading technique could be faster? Should I just relax and go ahead with plan A?

With this many objects, most of their critical sections will be unlocked, and there will be almost no contention. As you already know (other comment), critical sections don't require a kernel-mode transition if they're unowned. That makes critical sections efficient for this situation.
The only other consideration would be whether you would want the critical sections inside your objects or in another array. Locality of reference is a good reason to put the critical sections inside the object. When you've entered the critical section, an entire cacheline (e.g. 16 or 32 bytes) will be in memory. With a bit of padding, you can make sure each object starts on a cacheline. As a result, the object will be (partially) in cache once its critical section is entered.

Your plan is worth trying, but I think you will find that Windows is unhappy creating that many Critical Sections. Each CS contains some kernel handle(s) and you are using up precious kernel space. I think, depending on your version of Windows, you will run out of handle memory and InitializeCriticalSection() or some other function will start to fail.
What you might want to do is have a pool of CSs available for use, and store a pointer to the 'in use' CS inside your struct. But then this gets tricky quite quickly and you will need to use Atomic operations to set/clear the CS pointer (to atomically flag the array entry as 'in use'). Might also need some reference counting, etc...
Gets complicated.
So try your way first, and see what happens. We had a similar situation once, and we had to go with a pool, but maybe things have changed since then.

Depending on the data member types in your data_type structure (and also depending on the operations you want to perform on those members), you might be able to forgo using a separate synchronization object, using the Interlocked functions instead.
In your sample code, all the data members are integers, and all the operations are assignments (and presumably reads), so you could use InterlockedExchange() to set the values atomically and InterlockedCompareExchange() to read the values atomically.
If you need to use non-integer data member types, or if you need to perform more complex operations, or if you need to coordinate atomic access to more than one operation at a time (e.g., read data[1].a and then write data[1].b), then you will have to use a synchronization object, such as a CRITICAL_SECTION.
If you must use a synchronization object, I recommend that you consider partitioning your data set into subsets and use a single synchronization object per subset. For example, you might consider using one CRITICAL_SECTION for each span of 1000 elements in the data array.

You could also consider MUTEX.
This is nice method.
Each client could reserve the resource by itself with mutex (mutual-exclusion).
This is more common, some libraries also support this with threads.
Read about boost::thread and it's mutexes
With Your approach:
data_type data[100000];
I'd be afraid of stack overflow, unless You're allocating it at the heap.
EDIT:
Boost::MUTEX
uses win32 Critical Sections

As others have pointed out, yes there is an issue and it is called too fine-grained locking.. it's resource wasteful and even though the chances are small you will start creating a lot of backing primitives and data when the things do get an occasional, call it longer than usual or whatever, contention. Plus you are wasting resources as it is not really a trivial data structure as for example in VM impls..
If I recall correctly you will have a higher chance of a SEH exception from that point onwards on Win32 or just higher memory usage. Partitioning and pooling them is probably the way to go but it is a more complex implementation. Paritioning on something else (re:action) and expecting some short-lived contention is another way to deal with it.
In any case, it is a problem of resource management with what you have right now.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js