Asynchronous File I/O in C++

I can't find information about asynchronous reading and writing in C++, so I wrote some code. The read() function works correctly, but the synchronization doesn't: sync() doesn't wait for the end of the read.
In my opinion, the state_read variable has an incorrect value in the thread. Please help me understand why.
#include <windows.h>
#include <cstdio>

DWORD WINAPI ThreadFileRead(void* lpParameter);

struct IOParams {
    char* buf;
    unsigned int nBytesForRead;
    FILE* fp;
};

struct AsyncFile {
    FILE* fp;
    bool state_read;
    HANDLE hThreadRead;
    IOParams read_params;

    void read(char* buf, unsigned int nBytesForRead) {
        sync();
        read_params.buf = buf;
        read_params.fp = fp;
        read_params.nBytesForRead = nBytesForRead;
        hThreadRead = CreateThread(0, 0, ThreadFileRead, this, 0, 0);
    }

    void sync() {
        if (state_read) {
            WaitForSingleObject(hThreadRead, INFINITE);
            CloseHandle(hThreadRead);
        }
        state_read = false;
    }

    void setReadState(bool state) { state_read = state; }
    IOParams* getReadParams() { return &read_params; }
};
DWORD WINAPI ThreadFileRead(void* lpParameter) {
    AsyncFile* asf = (AsyncFile*)lpParameter;
    asf->setReadState(true);
    IOParams& read_params = *asf->getReadParams();
    fread(read_params.buf, 1, read_params.nBytesForRead, read_params.fp);
    asf->setReadState(false);
    return 0;
}
Maybe you know how to write the asynchronous reading in a more reasonable way.

Since your question is tagged "Windows", you might look into FILE_FLAG_OVERLAPPED and ReadFileEx, which do asynchronous reading without extra threads (synchronisation via an event, a callback, or a completion port).
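For illustration, here is a minimal sketch of event-based overlapped reading; the file name and buffer size are placeholders, and error handling is abbreviated:
#include <windows.h>

int main() {
    // Open the file for overlapped (asynchronous) I/O.
    HANDLE hFile = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    char buf[4096];
    OVERLAPPED ov = {};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL); // manual-reset event

    // Start the read; it typically returns immediately with ERROR_IO_PENDING.
    if (!ReadFile(hFile, buf, sizeof(buf), NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        return 1; // real failure
    }

    // ... do other work here while the read is in flight ...

    // "sync": block until the read completes, then fetch the byte count.
    DWORD bytesRead = 0;
    WaitForSingleObject(ov.hEvent, INFINITE);
    GetOverlappedResult(hFile, &ov, &bytesRead, FALSE);

    CloseHandle(ov.hEvent);
    CloseHandle(hFile);
    return 0;
}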
If you insist on using a separate loader thread (there may be valid reasons for that, though few), you do not want to read and write a flag repeatedly from two threads and use that for synchronisation. Although your code looks correct, the mere fact that it does not work as intended shows that it's a bad idea.
Always use a proper synchronisation primitive (event or semaphore) for synchronisation, do not tamper with some flag that's (possibly inconsistently) written and read from different threads.
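As a hedged sketch of the event approach applied to your AsyncFile (names reused from the question, untested): replace the state_read flag with a manual-reset event.
// hDoneEvent starts signalled, meaning "no read in progress".
HANDLE hDoneEvent = CreateEvent(NULL, TRUE, TRUE, NULL);

// Before starting a read:
ResetEvent(hDoneEvent);                 // mark "read in progress"
// ... CreateThread(...) as before ...

// In the loader thread, after fread() finishes:
SetEvent(hDoneEvent);                   // mark "read complete"

// sync() becomes a single wait, with no flag to race on:
WaitForSingleObject(hDoneEvent, INFINITE);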
Alternatively, if you don't want an extra event object, you could always wait on the thread to die, unconditionally (but, read the next paragraph).
Generally, spawning a thread and letting it die for every read is not a good design. Not only is spawning a thread considerable overhead (both in CPU and memory), it can also introduce hard-to-predict "funny effects" and turn out to be a total anti-optimization. Imagine, for example, having 50 threads thrashing the hard drive with seeks, all of them trying to get a bit of it. This will be asynchronous for sure, but it will be a hundred times slower, too.
Using a small pool of workers (emphasis on small) will probably be a much superior design, if you do not want to use the operating system's native asynchronous mechanisms.

Related

Efficient way to have a thread wait for a value to change in memory?

For some silly reason, there's a piece of hardware on my (GNU/Linux) machine that can only communicate a certain occurrence by writing a value to memory. Assume that by some magic, the area of memory the hardware writes to is visible to a process I'm running. Now, I want to have a thread within that process keep track of that value, and as soon as possible after it has changed - execute some code. However, it is more important to me that the thread not waste CPU time than for it to absolutely minimize the response delay. So - no busy-waiting on a volatile...
How should I best do this (using modern C++)?
Notes:
I don't mind a solution involving atomics, or synchronization mechanisms (in fact, that would perhaps be preferable) - as long as you bear in mind that the hardware doesn't support atomic operations on host memory - it performs a plain write.
The value the hardware writes can be whatever I like, as can the initial value in the memory location it writes to.
I used C++11 since it's the popular tag for Modern C++, but really, C++14 is great and C++17 is ok. On the other hand, even a C-based solution will do.
So, the naive thing to do would be non-busy sleeping, e.g.:
volatile int32_t* special_location = get_special_location();
auto polling_interval_in_usec = perform_tradeoff_between_accuracy_and_cpu_load();
auto polling_interval = std::chrono::microseconds(polling_interval_in_usec);
while(should_continue_polling()) {
if (*special_location == HardwareIsDone) {
do_stuff();
return;
}
std::this_thread::sleep_for(polling_interval);
}
This is usually done via std::condition_variable.
... as long as you bear in mind that the hardware doesn't support atomic operations on host memory - it performs a plain write.
Implementations of std::atomic may fall back to mutexes in such cases.
UPD - Possible implementation details: assuming you have some data structure in the form of:
struct MyData {
    std::mutex mutex;
    std::condition_variable cv;
    some_user_type value;
};
and you have access to it from several threads. The writer overwrites value and notifies cv via notify_one; the reader waits on cv in a manner superficially similar to a busy wait, except that the thread yields for the duration of each wait. Everything else I could add is already present in the referred examples.
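A minimal sketch of that pattern (the value 42 stands in for the question's HardwareIsDone; types are assumptions):
#include <condition_variable>
#include <mutex>
#include <cstdint>

struct MyData {
    std::mutex mutex;
    std::condition_variable cv;
    int32_t value = 0;
};

void reader(MyData& d) {
    std::unique_lock<std::mutex> lk(d.mutex);
    // Sleeps until notified; the predicate guards against spurious wakeups.
    d.cv.wait(lk, [&] { return d.value == 42; }); // 42 = "HardwareIsDone"
    // ... do_stuff() ...
}

void writer(MyData& d) {
    {
        std::lock_guard<std::mutex> lk(d.mutex);
        d.value = 42;
    }
    d.cv.notify_one();
}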

Kill std::thread while reading a big file

I have a std::thread function that is calling fopen to load a big file into an array:
void loadfile(char *fname, char *fbuffer, long fsize)
{
    FILE *fp = fopen(fname, "rb");
    fread(fbuffer, 1, fsize, fp);
    fclose(fp);
}
This is called by:
std::thread loader(loadfile, fname, fbuffer, fsize);
loader.detach();
At some point, something in my program wants to stop reading that file and asks for another file. The problem is that by the time I delete the fbuffer pointer, the loader thread is still going, and I get a race condition that throws an exception.
How can I kill that thread? My idea was to check for the existence of fbuffer and maybe split the fread into small chunks:
void loadfile(char *fname, char *fbuffer, long fsize)
{
    FILE *fp = fopen(fname, "rb");
    long ch = 0;
    while (ch < fsize)
    {
        if (fbuffer == NULL) return;
        fread(fbuffer + ch, 1, 256, fp);
        ch += 256;
    }
    fclose(fp);
}
Will this slow down the reading of the file? Do you have a better idea?
You should avoid killing a thread at all costs. Doing so causes evil things to happen, like resources left in a permanently locked state.
The thread must be given a reference to a flag, the value of which can be set from elsewhere, to tell the thread to voluntarily quit.
You cannot use a buffer for this purpose; if one thread deletes the memory of the buffer while the other is writing to it, very evil things will happen. (Memory corruption.) So, pass a reference to a boolean flag.
Of course, in order for the thread to be able to periodically check the flag, it must have small chunks of work to do, so splitting your freads to small chunks was a good idea.
256 bytes might be a bit too small though; definitely use 4k or more, perhaps even 64k.
Killing threads is usually not the way to go - doing this may lead to leaked resources, critical sections you cannot exit and inconsistent program state.
Your idea is almost spot-on, but you need a way to signal the thread to finish. You can use a boolean, shared between your thread and the rest of your code, that your thread checks after every read; once it is set, the thread stops reading into the buffer, cleans up the file handle, and exits cleanly.
On another note, handling the deletion of pointers with owning semantics by yourself is most of the time frowned upon in modern C++ - unless you have a very good reason not to, I'd recommend using the stl fstream and string classes.
You need proper thread synchronization. The comments about resource leaks and the proposal by @Mike Nakis about making the thread exit voluntarily by setting a boolean are almost correct (well, they're correct, but not complete). You need to go even farther than that.
You must ensure not only that the loader thread exits on its own, you must also ensure that it has exited before you delete the buffer it is writing to. Or, at least, you must ensure that it isn't ever touching that buffer in any way after you deleted it. Checking the pointer for null-ness does not work for two reasons. First, it doesn't work anyway, since you are looking at a copy of the original pointer (you would have to use a pointer-pointer or a reference). Second, and more importantly, even if the check worked, there is a race condition between the if statement and fread. In other words, there is no way to guarantee that you aren't freeing the buffer while fread is accessing it (no matter how small you make your chunks).
At the very minimum, you need two boolean flags, but preferably you would use a proper synchronization primitive such as a condition variable to notify the main thread (so you don't have to spin waiting for the loader to exit, but can block).
The correct sequence of operations would be:
1. Notify the loader thread.
2. Wait for the loader thread to signal me (block on the cond var).
3. The loader thread picks up the notification, sets the condition variable, never touches the buffer afterwards, and exits.
4. Resume (delete the buffer, allocate a new buffer, etc.).
If you do not insist on detaching the loader thread, you could instead simply join it after telling it to exit (so you would not need a cond var).
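Here is a minimal sketch of that join-based variant, using an atomic stop flag instead of inspecting the buffer pointer (the chunk size and names are illustrative):
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> stop_requested{false};

void loadfile(const char* fname, char* fbuffer, long fsize)
{
    FILE* fp = fopen(fname, "rb");
    if (!fp) return;
    const long chunk = 64 * 1024; // 64k chunks, per the advice above
    for (long off = 0; off < fsize && !stop_requested.load(); off += chunk) {
        long n = (fsize - off < chunk) ? (fsize - off) : chunk;
        fread(fbuffer + off, 1, n, fp);
    }
    fclose(fp);
}

// Caller side: tell the thread to quit, then join before freeing the buffer.
// std::thread loader(loadfile, fname, fbuffer, fsize);
// ...
// stop_requested = true;
// loader.join();        // guarantees the thread no longer touches fbuffer
// delete[] fbuffer;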

How to make a multiple-read/single-write lock from more basic synchronization primitives?

We have found that we have several spots in our code where concurrent reads of data protected by a mutex are rather common, while writes are rare. Our measurements seem to say that using a simple mutex seriously hinders the performance of the code reading that data. So what we would need is a multiple-read/single-write mutex. I know that this can be built atop of simpler primitives, but before I try myself at this, I'd rather ask for existing knowledge:
What is an approved way to build a multiple-read/single-write lock out of simpler synchronization primitives?
I do have an idea how to make it, but I'd rather have answers unbiased by what I (probably wrongly) came up with. (Note: What I expect is an explanation how to do it, probably in pseudo code, not a full-fledged implementation. I can certainly write the code myself.)
Caveats:
This needs to have reasonable performance. (What I have in mind would require two lock/unlock operations per access. Now that might not be good enough, but needing many of them instead seems unreasonable.)
Commonly, reads are more numerous, but writes are more important and performance-sensitive than reads. Readers must not starve writers.
We are stuck on a rather old embedded platform (proprietary variant of VxWorks 5.5), with a rather old compiler (GCC 4.1.2), and boost 1.52 – except for most of boost's parts relying on POSIX, because POSIX isn't fully implemented on that platform. The locking primitives available are basically several kinds of semaphores (binary, counting etc.), on top of which we have already created mutexes, condition variables, and monitors.
This is IA32, single-core.
At first glance I thought I recognized this answer as the same algorithm that Alexander Terekhov introduced. But after studying it I believe that it is flawed. It is possible for two writers to simultaneously wait on m_exclusive_cond. When one of those writers wakes and obtains the exclusive lock, it will set exclusive_waiting_blocked = false on unlock, thus setting the mutex into an inconsistent state. After that, the mutex is likely hosed.
N2406, which first proposed std::shared_mutex, contains a partial implementation, which is repeated below with updated syntax.
class shared_mutex
{
    mutex               mut_;
    condition_variable  gate1_;
    condition_variable  gate2_;
    unsigned            state_;

    static const unsigned write_entered_ = 1U << (sizeof(unsigned)*CHAR_BIT - 1);
    static const unsigned n_readers_ = ~write_entered_;

public:
    shared_mutex() : state_(0) {}

    // Exclusive ownership
    void lock();
    bool try_lock();
    void unlock();

    // Shared ownership
    void lock_shared();
    bool try_lock_shared();
    void unlock_shared();
};
// Exclusive ownership

void
shared_mutex::lock()
{
    unique_lock<mutex> lk(mut_);
    while (state_ & write_entered_)
        gate1_.wait(lk);
    state_ |= write_entered_;
    while (state_ & n_readers_)
        gate2_.wait(lk);
}

bool
shared_mutex::try_lock()
{
    unique_lock<mutex> lk(mut_, try_to_lock);
    if (lk.owns_lock() && state_ == 0)
    {
        state_ = write_entered_;
        return true;
    }
    return false;
}

void
shared_mutex::unlock()
{
    {
        lock_guard<mutex> _(mut_);
        state_ = 0;
    }
    gate1_.notify_all();
}

// Shared ownership

void
shared_mutex::lock_shared()
{
    unique_lock<mutex> lk(mut_);
    while ((state_ & write_entered_) || (state_ & n_readers_) == n_readers_)
        gate1_.wait(lk);
    unsigned num_readers = (state_ & n_readers_) + 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
}

bool
shared_mutex::try_lock_shared()
{
    unique_lock<mutex> lk(mut_, try_to_lock);
    unsigned num_readers = state_ & n_readers_;
    if (lk.owns_lock() && !(state_ & write_entered_) && num_readers != n_readers_)
    {
        ++num_readers;
        state_ &= ~n_readers_;
        state_ |= num_readers;
        return true;
    }
    return false;
}

void
shared_mutex::unlock_shared()
{
    lock_guard<mutex> _(mut_);
    unsigned num_readers = (state_ & n_readers_) - 1;
    state_ &= ~n_readers_;
    state_ |= num_readers;
    if (state_ & write_entered_)
    {
        if (num_readers == 0)
            gate2_.notify_one();
    }
    else
    {
        if (num_readers == n_readers_ - 1)
            gate1_.notify_one();
    }
}
The algorithm is derived from an old newsgroup posting of Alexander Terekhov. It starves neither readers nor writers.
There are two "gates", gate1_ and gate2_. Readers and writers have to pass gate1_, and can get blocked in trying to do so. Once a reader gets past gate1_, it has read-locked the mutex. Readers can get past gate1_ as long as there are not a maximum number of readers with ownership, and as long as a writer has not gotten past gate1_.
Only one writer at a time can get past gate1_. And a writer can get past gate1_ even if readers have ownership. But once past gate1_, a writer still does not have ownership. It must first get past gate2_. A writer can not get past gate2_ until all readers with ownership have relinquished it. Recall that new readers can't get past gate1_ while a writer is waiting at gate2_. And neither can a new writer get past gate1_ while a writer is waiting at gate2_.
The characteristic that both readers and writers are blocked at gate1_ with (nearly) identical requirements imposed to get past it, is what makes this algorithm fair to both readers and writers, starving neither.
The mutex "state" is intentionally kept in a single word so as to suggest that the partial use of atomics (as an optimization) for certain state changes is a possibility (i.e. for an uncontended "fast path"). However, that optimization is not demonstrated here. One example would be: if a writer thread could atomically change state_ from 0 to write_entered_, then it obtains the lock without having to block or even lock/unlock mut_. And unlock() could be implemented with an atomic store. Etc. These optimizations are not shown herein because they are much harder to implement correctly than this simple description makes it sound.
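A short usage sketch, assuming the class above:
shared_mutex m;
int shared_data = 0;

void reader_thread() {
    m.lock_shared();            // many readers may hold this at once
    int snapshot = shared_data;
    m.unlock_shared();
    (void)snapshot;
}

void writer_thread() {
    m.lock();                   // exclusive: waits out readers and writers
    ++shared_data;
    m.unlock();
}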
It seems like you only have mutex and condition_variable as synchronization primitives, so I'll write a reader-writer lock here, which starves readers. It uses one mutex, two condition_variables and three integers.
active_readers - readers currently holding the read lock
waiting_writers - writers waiting in the cv writerQ, plus the currently writing writer
active_writers - the writer currently writing; can only be 1 or 0
It starves readers in this way: if several writers want to write, readers never get the chance to read until all writers have finished writing, because later readers have to check the waiting_writers variable. At the same time, the active_writers variable guarantees that only one writer can write at a time.
class RWLock {
public:
    RWLock()
        : shared()
        , readerQ(), writerQ()
        , active_readers(0), waiting_writers(0), active_writers(0)
    {}

    void ReadLock() {
        std::unique_lock<std::mutex> lk(shared);
        while (waiting_writers != 0)
            readerQ.wait(lk);
        ++active_readers;
        lk.unlock();
    }

    void ReadUnlock() {
        std::unique_lock<std::mutex> lk(shared);
        --active_readers;
        lk.unlock();
        writerQ.notify_one();
    }

    void WriteLock() {
        std::unique_lock<std::mutex> lk(shared);
        ++waiting_writers;
        while (active_readers != 0 || active_writers != 0)
            writerQ.wait(lk);
        ++active_writers;
        lk.unlock();
    }

    void WriteUnlock() {
        std::unique_lock<std::mutex> lk(shared);
        --waiting_writers;
        --active_writers;
        if (waiting_writers > 0)
            writerQ.notify_one();
        else
            readerQ.notify_all();
        lk.unlock();
    }

private:
    std::mutex              shared;
    std::condition_variable readerQ;
    std::condition_variable writerQ;
    int                     active_readers;
    int                     waiting_writers;
    int                     active_writers;
};
Concurrent reads of data protected by a mutex are rather common, while writes are rare
That sounds like an ideal scenario for User-space RCU:
URCU is similar to its Linux-kernel counterpart, providing a replacement for reader-writer locking, among other uses. This similarity continues with readers not synchronizing directly with RCU updaters, thus making RCU read-side code paths exceedingly fast, while furthermore permitting RCU readers to make useful forward progress even when running concurrently with RCU updaters—and vice versa.
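As a rough sketch of that pattern with liburcu (treat the exact header and flavour as assumptions; consult the liburcu documentation for the real build setup):
#include <urcu.h>            // liburcu; link with the matching flavour library

struct Config { int a; int b; };
Config* g_cfg = nullptr;     // shared pointer, published via RCU

void reader_thread() {
    rcu_register_thread();                    // once per reader thread
    rcu_read_lock();                          // very cheap, never blocks writers
    Config* c = rcu_dereference(g_cfg);
    if (c) { /* ... read *c ... */ }
    rcu_read_unlock();
    rcu_unregister_thread();
}

void writer(Config* next) {
    Config* old = g_cfg;
    rcu_assign_pointer(g_cfg, next);          // publish the new version
    synchronize_rcu();                        // wait out pre-existing readers
    delete old;                               // safe: no reader can still see it
}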
There are some good tricks you can do to help.
First, good performance. VxWorks is notable for its very good context switch times. Whatever locking solution you use, it will likely involve semaphores. I wouldn't be afraid of using semaphores (plural) for this; they're pretty well optimised in VxWorks, and the fast context switch times help minimise the degradation in performance from assessing many semaphore states, etc.
Also, I would forget about using POSIX semaphores, which are simply going to be layered on top of VxWorks' own semaphores. VxWorks provides native counting, binary and mutex semaphores; using the one that suits makes it all a bit faster. The binary ones can be quite useful sometimes: posted to many times, they never exceed a value of 1.
Second, writes being more important than reads. When I've had this kind of requirement in VxWorks and have been using a semaphore(s) to control access, I've used task priority to indicate which task is more important and should get first access to the resource. This works quite well; literally everything in VxWorks is a task (well, thread) like any other, including all the device drivers, etc.
VxWorks also resolves priority inversions (the kind of thing that Linus Torvalds hates). So if you implement your locking with a semaphore(s), you can rely on the OS scheduler to chivvy up lower priority readers if they're blocking a higher priority writer. It can lead to much simpler code, and you're getting the most of the OS too.
So a potential solution is to have a single VxWorks counting semaphore protecting the resource, initialised to a value equal to the number of readers. Each time a reader wants to read, it takes the semaphore (reducing the count by 1). Each time a read is done, it posts the semaphore, increasing the count by 1. Each time the writer wants to write, it takes the semaphore n (n = number of readers) times, and posts it n times when done. Finally, make the writer task of higher priority than any of the readers, and rely on the OS's fast context switch time and its handling of priority inversion.
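A sketch of that scheme, written with C++20's std::counting_semaphore purely for readability (the original platform would use native VxWorks semaphores, and task priorities are not modelled; it also assumes a single writer, since two writers acquiring slots one at a time could deadlock):
#include <semaphore>

constexpr int kMaxReaders = 4;                       // assumed reader count
std::counting_semaphore<kMaxReaders> sem(kMaxReaders);

void reader() {
    sem.acquire();          // take one slot (count - 1)
    // ... read the shared resource ...
    sem.release();          // give the slot back (count + 1)
}

void writer() {
    // Take all reader slots: blocks until every active reader is done,
    // and keeps new readers out while writing.
    for (int i = 0; i < kMaxReaders; ++i) sem.acquire();
    // ... write the shared resource ...
    sem.release(kMaxReaders);
}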
Remember that you're programming on a hard-realtime OS, not Linux. Taking / posting a native VxWorks semaphore doesn't involve the same amount of runtime as a similar act on Linux, though even Linux is pretty good these days (I'm using PREEMPT_RT nowadays). The VxWorks scheduler and all the device drivers can be relied upon to behave. You can even make your writer task the highest priority in the whole system if you wish, higher even than all the device drivers!
To help things along, also consider what it is that each of your threads are doing. VxWorks allows you to indicate that a task is/isn't using the FPU. If you're using native VxWorks TaskSpawn routines instead of pthread_create then you get an opportunity to specify this. What it means is that if your thread/task isn't doing any floating point maths, and you've said as such in your call to TaskSpawn, the context switch times will be even faster because the scheduler won't bother to preserve / restore the FPU state.
This stands a reasonable chance of being the best solution on the platform you're developing on. It's playing to the OS's strengths (fast semaphores, fast context switch times) without introducing a load of extra code to recreate an alternate (and possibly more elegant) solution commonly found on other platforms.
Third, being stuck with old GCC and old Boost. Basically I can't help there, other than low-value suggestions about phoning up WindRiver and discussing buying an upgrade. Personally speaking, when I've been programming for VxWorks I've used VxWorks' native API rather than POSIX. Ok, so the code hasn't been very portable, but it has ended up being fast; POSIX is merely a layer on top of the native API anyway, and that will always slow things down.
That said, POSIX counting and mutex semaphores are very similar to VxWork's native counting and mutex semaphores. That probably means that the POSIX layering isn't very thick.
General Notes About Programming for VxWorks
Debugging: I always sought to use the development tools (Tornado) available for Solaris. This is by far the best multi-threaded debugging environment I've ever come across. Why? It allows you to start up multiple debug sessions, one for each thread/task in the system. You end up with a debug window per thread, and you are individually and independently debugging each one. Step over a blocking operation, and that debug window gets blocked. Move mouse focus to another debugging window, step over the operation that will release the block, and watch the first window complete its step.
You end up with a lot of debug windows, but it's by far the best way to debug multi-threaded stuff. It made it veeeeery easy to write really quite complex stuff and see problems. You can easily explore the different dynamic interactions in your application because you had simple and all powerful control over what each thread is doing at any time.
Ironically the Windows version of Tornado didn't let you do this; one miserable single debug windows per system, just like any other boring old IDE such as Visual Studio, etc. I've never seen even modern IDEs come anywhere close to being as good as Tornado on Solaris for multi-threaded debugging.
Hard drives: If your readers and writers are using files on disk, consider that VxWorks 5.5 is pretty old. Things like NCQ aren't going to be supported. In this case my proposed solution (outlined above) might be better done with a single mutex semaphore to stop multiple readers tripping over each other in their struggle to read different parts of the disk. It depends on what exactly your readers are doing, but if they're reading contiguous data from a file this would avoid thrashing the read/write head to and fro across the disk surface (very slow).
In my case I was using this trick to shape traffic across a network interface; each task was sending a different sort of data, and the task priority reflected the priority of the data on the network. It was very elegant, no message was ever fragmented, but the important messages got the lions share of the available bandwidth.
As always the best solution will depend on details. A read-write spin lock may be what you're looking for, but other approaches such as read-copy-update as suggested above might be a solution - though on an old embedded platform the extra memory used might be an issue. With rare writes I often arrange the work using a tasking system such that the writes can only occur when there are no reads from that data structure, but this is algorithm dependent.
One algorithm for this based on semaphores and mutexes is described in Concurrent Control with Readers and Writers; P.J. Courtois, F. Heymans, and D.L. Parnas; MBLE Research Laboratory; Brussels, Belgium.
This is a simplified answer based on my Boost headers (I would call Boost an approved way). It only requires Condition Variables and Mutexes. I rewrote it using Windows primitives because I find them descriptive and very simple, but view this as Pseudocode.
This is a very simple solution, which does not support things like mutex upgrading, or try_lock() operations. I can add those if you want. I also took out some frills like disabling interrupts that aren't strictly necessary.
Also, it's worth checking out boost\thread\pthread\shared_mutex.hpp (this being based on that). It's human-readable.
class SharedMutex {
    CRITICAL_SECTION   m_state_mutex;
    CONDITION_VARIABLE m_shared_cond;
    CONDITION_VARIABLE m_exclusive_cond;

    size_t shared_count;
    bool   exclusive;
    // This causes write blocks to prevent further read blocks
    bool   exclusive_waiting_blocked;

public:
    SharedMutex()
        : shared_count(0), exclusive(false), exclusive_waiting_blocked(false)
    {
        InitializeConditionVariable(&m_shared_cond);
        InitializeConditionVariable(&m_exclusive_cond);
        InitializeCriticalSection(&m_state_mutex);
    }

    ~SharedMutex()
    {
        DeleteCriticalSection(&m_state_mutex);
        // Win32 CONDITION_VARIABLEs need no explicit destruction.
    }

    // Write lock
    void lock(void)
    {
        EnterCriticalSection(&m_state_mutex);
        while (shared_count > 0 || exclusive)
        {
            exclusive_waiting_blocked = true;
            SleepConditionVariableCS(&m_exclusive_cond, &m_state_mutex, INFINITE);
        }
        // This thread now 'owns' the mutex
        exclusive = true;
        LeaveCriticalSection(&m_state_mutex);
    }

    void unlock(void)
    {
        EnterCriticalSection(&m_state_mutex);
        exclusive = false;
        exclusive_waiting_blocked = false;
        LeaveCriticalSection(&m_state_mutex);
        WakeConditionVariable(&m_exclusive_cond);
        WakeAllConditionVariable(&m_shared_cond);
    }

    // Read lock
    void lock_shared(void)
    {
        EnterCriticalSection(&m_state_mutex);
        while (exclusive || exclusive_waiting_blocked)
        {
            SleepConditionVariableCS(&m_shared_cond, &m_state_mutex, INFINITE);
        }
        ++shared_count;
        LeaveCriticalSection(&m_state_mutex);
    }

    void unlock_shared(void)
    {
        EnterCriticalSection(&m_state_mutex);
        --shared_count;
        if (shared_count == 0)
        {
            exclusive_waiting_blocked = false;
            LeaveCriticalSection(&m_state_mutex);
            WakeConditionVariable(&m_exclusive_cond);
            WakeAllConditionVariable(&m_shared_cond);
        }
        else
        {
            LeaveCriticalSection(&m_state_mutex);
        }
    }
};
Behavior
Okay, there is some confusion about the behavior of this algorithm, so here is how it works.
During a Write Lock - Both readers and writers are blocked.
At the end of a Write Lock - Reader threads and one writer thread will race to see which one starts.
During a Read Lock - Writers are blocked. Readers are also blocked if and only if a Writer is blocked.
At the release of the final Read Lock - Reader threads and one writer thread will race to see which one starts.
This could cause readers to starve writers if the processor frequently context switches over to an m_shared_cond thread before an m_exclusive_cond thread during notification, but I suspect that issue is theoretical rather than practical, since this is Boost's algorithm.
Now that Microsoft has opened up the .NET source code, you can look at their ReaderWriterLockSlim implementation.
I'm not sure the more basic primitives they use are available to you, some of them are also part of the .NET library and their code is also available.
Microsoft has spent quite a lot of time on improving the performance of their locking mechanisms, so this can be a good starting point.

Lockless reader/writer

I have some data that is both read and updated by multiple threads. Both reads and writes must be atomic. I was thinking of doing it like this:
// Values must be read and updated atomically
struct SValues
{
    double a;
    double b;
    double c;
    double d;
};

class Test
{
public:
    Test()
    {
        m_pValues = &m_values;
    }

    SValues* LockAndGet()
    {
        // Spin forever until we get ownership of the pointer
        while (true)
        {
            SValues* pValues = (SValues*)::InterlockedExchange((long*)&m_pValues, 0xffffffff);
            if (pValues != (SValues*)0xffffffff)
            {
                return pValues;
            }
        }
    }

    void Unlock(SValues* pValues)
    {
        // Return the pointer so other threads can lock it
        ::InterlockedExchange((long*)&m_pValues, (long)pValues);
    }

private:
    SValues* m_pValues;
    SValues  m_values;
};

void TestFunc()
{
    Test test;
    SValues* pValues = test.LockAndGet();
    // Update or read values
    test.Unlock(pValues);
}
The data is protected by stealing the pointer to it for every read and write, which should make it threadsafe, but it requires two interlocked instructions for every access. There will be plenty of both reads and writes and I cannot tell in advance if there will be more reads or more writes.
Can it be done more efficiently than this? This also locks when reading, but since it's quite possible to have more writes than reads, there is no point in optimizing for reading unless it doesn't inflict a penalty on writing.
I was thinking of reads acquiring the pointer without an interlocked instruction (along with a sequence number), copying the data, and then having a way of telling if the sequence number had changed, in which case it should retry. This would require some memory barriers, though, and I don't know whether or not it could improve the speed.
----- EDIT -----
Thanks all, great comments! I haven't actually run this code, but I will try to compare the current method with a critical section later today (if I get the time). I'm still looking for an optimal solution, so I will get back to the more advanced comments later. Thanks again!
What you have written is essentially a spinlock. If you're going to do that, then you might as well just use a mutex, such as boost::mutex. If you really want a spinlock, use a system-provided one, or one from a library rather than writing your own.
Other possibilities include doing some form of copy-on-write. Store the data structure by pointer, and just read the pointer (atomically) on the read side. On the write side then create a new instance (copying the old data as necessary) and atomically swap the pointer. If the write does need the old value and there is more than one writer then you will either need to do a compare-exchange loop to ensure that the value hasn't changed since you read it (beware ABA issues), or a mutex for the writers. If you do this then you need to be careful how you manage memory --- you need some way to reclaim instances of the data when no threads are referencing it (but not before).
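A sketch of that copy-on-write idea, using shared_ptr for the reclamation problem just mentioned (std::atomic_load/atomic_store on shared_ptr; a writer mutex serialises updates, as suggested when there is more than one writer):
#include <atomic>
#include <memory>
#include <mutex>

struct SValues { double a, b, c, d; };

std::shared_ptr<const SValues> g_values = std::make_shared<SValues>();
std::mutex g_write_mutex;

// Readers: one atomic pointer load, then work on an immutable snapshot.
// The shared_ptr refcount keeps the old instance alive until the last
// reader drops it, which is the reclamation guarantee described above.
std::shared_ptr<const SValues> read_values() {
    return std::atomic_load(&g_values);
}

// Writers: copy the current data, modify the copy, publish it atomically.
void update_values(double new_a) {
    std::lock_guard<std::mutex> lk(g_write_mutex);
    auto next = std::make_shared<SValues>(*std::atomic_load(&g_values));
    next->a = new_a;
    std::atomic_store(&g_values, std::shared_ptr<const SValues>(std::move(next)));
}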
There are several ways to resolve this, specifically without mutexes or locking mechanisms. The problem is that I'm not sure what the constraints on your system is.
Remember that atomic operations are something that often get moved around by compilers in C++.
Generally I would solve the issue like this:
Multiple-producer-single-consumer, by having one single-producer-single-consumer queue per writing thread. Each thread writes into its own queue. A single consumer thread gathers the produced data and stores it in a single-consumer-multiple-reader data storage. The implementation of this is a lot of work, and it's only recommended if you are writing a time-critical application and have the time to put into this solution.
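A minimal single-producer-single-consumer ring buffer along those lines (fixed power-of-two capacity; acquire/release atomics; illustrative, not production-hardened):
#include <atomic>
#include <cstddef>

template <typename T, size_t N>  // N must be a power of two
class SpscQueue {
    T buf_[N];
    std::atomic<size_t> head_{0};  // written only by the consumer
    std::atomic<size_t> tail_{0};  // written only by the producer

public:
    bool push(const T& v) {        // producer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;          // full
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release); // publish the element
        return true;
    }

    bool pop(T& out) {             // consumer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;          // empty
        out = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release); // free the slot
        return true;
    }
};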
There are more things to read up about this, since the implementation is platform specific:
Atomic etc operations on windows/xbox360:
http://msdn.microsoft.com/en-us/library/ee418650(VS.85).aspx
The multithreaded single-producer-single-consumer without locks:
http://www.codeproject.com/KB/threads/LockFree.aspx#heading0005
What "volatile" really is and can be used for:
http://www.drdobbs.com/cpp/212701484
Herb Sutter has written a good article that reminds you of the dangers of writing this kind of code:
http://www.drdobbs.com/cpp/210600279;jsessionid=ZSUN3G3VXJM0BQE1GHRSKHWATMY32JVN?pgno=2

About write buffer in general network programming

I'm writing a server using boost.asio. I have a read and a write buffer for each connection and use the asynchronous read/write functions (async_write_some / async_read_some).
With the read buffer and async_read_some there's no problem: just invoking async_read_some is okay, because the read buffer is only used in the read handler (which usually means the same thread).
But the write buffer needs to be accessed from several threads, so it needs to be locked for modification.
FIRST QUESTION!
Is there any way to avoid a LOCK for the write buffer?
I write my own packet into a stack buffer and copy it to the write buffer, then call async_write_some to send the packet. If I send two packets back to back this way, is it okay to invoke async_write_some twice?
SECOND QUESTION!
What is the common way to do asynchronous writing in socket programming?
Thanks for reading.
Sorry, but you have two choices:
1. Serialise the writes, either with locks or, better, by starting a separate writer thread that reads requests from a queue; other threads can then stack up requests on the queue without too much contention (some mutexing would be required).
2. Give each writing thread its own socket! This is actually the better solution if the program at the other end of the wire can support it.
Answer #1:
You are correct that locking is a viable approach, but there is a much simpler way to do all of this. Boost has a nice little construct in ASIO called a strand. Any callback that has been wrapped using the strand will be serialized, guaranteed, no matter which thread executes the callback. Basically, it handles any locking for you.
This means that you can have as many writers as you want, and if they are all wrapped by the same strand (so, share your single strand among all of your writers) they will execute serially. One thing to watch out for is to make sure that you aren't trying to use the same actual buffer in memory for doing all of the writes. For example, this is what to avoid:
char buffer_to_write[256]; // shared among threads
/* ... in thread 1 ... */
memcpy(buffer_to_write, packet_1, std::min(sizeof(packet_1), sizeof(buffer_to_write)));
my_socket.async_write_some(boost::asio::buffer(buffer_to_write, sizeof(buffer_to_write)), &my_callback);
/* ... in thread 2 ... */
memcpy(buffer_to_write, packet_2, std::min(sizeof(packet_2), sizeof(buffer_to_write)));
my_socket.async_write_some(boost::asio::buffer(buffer_to_write, sizeof(buffer_to_write)), &my_callback);
There, you're sharing your actual write buffer (buffer_to_write). If you did something like this instead, you'll be okay:
/* A utility class that you can use */
class PacketWriter
{
private:
    typedef std::vector<char> buffer_type;

    static void WriteIsComplete(boost::shared_ptr<buffer_type> op_buffer,
                                const boost::system::error_code& error,
                                std::size_t bytes_transferred)
    {
        // Handle your write completion here.
        // op_buffer keeps the data alive until the write has finished.
    }

public:
    template<class IO>
    static bool WritePacket(const std::vector<char>& packet_data, IO& asio_object)
    {
        boost::shared_ptr<buffer_type> op_buffer(new buffer_type(packet_data));
        if (!op_buffer)
        {
            return false;
        }
        asio_object.async_write_some(
            boost::asio::buffer(*op_buffer),
            boost::bind(&PacketWriter::WriteIsComplete, op_buffer,
                        boost::asio::placeholders::error,
                        boost::asio::placeholders::bytes_transferred));
        return true;
    }
};
/* ... in thread 1 ... */
PacketWriter::WritePacket(packet_1, my_socket);
/* ... in thread 2 ... */
PacketWriter::WritePacket(packet_2, my_socket);
Here, it would help if you passed your strand into WritePacket as well. You get the idea, though.
Answer #2:
I think you are already taking a very good approach. One suggestion I would offer is to use async_write instead of async_write_some so that you are guaranteed the whole buffer is written before your callback gets called.
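For completeness, here is a sketch of the usual outbound-queue pattern that combines both suggestions: all access to the queue goes through one strand, and a new async_write is started only when the previous one completes (modern Boost.Asio names; treat this as an illustration, not a drop-in):
#include <boost/asio.hpp>
#include <deque>
#include <string>

using boost::asio::ip::tcp;

class Connection {
    tcp::socket socket_;
    boost::asio::strand<boost::asio::any_io_executor> strand_;
    std::deque<std::string> queue_;   // pending messages, touched only on the strand

public:
    explicit Connection(tcp::socket s)
        : socket_(std::move(s)), strand_(socket_.get_executor()) {}

    // Callable from any thread: the strand serialises access to queue_.
    void send(std::string msg) {
        boost::asio::post(strand_, [this, m = std::move(msg)]() mutable {
            bool idle = queue_.empty();
            queue_.push_back(std::move(m));
            if (idle) do_write();      // kick off a write chain if none is running
        });
    }

private:
    void do_write() {                  // always invoked on the strand
        boost::asio::async_write(
            socket_, boost::asio::buffer(queue_.front()),
            boost::asio::bind_executor(strand_,
                [this](boost::system::error_code ec, std::size_t) {
                    if (ec) return;
                    queue_.pop_front();
                    if (!queue_.empty()) do_write();   // next message
                }));
    }
};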
You could queue your modifications and perform them on the data in the write handler.
The network will most probably be the slowest part of the pipe (assuming your modifications are not computationally expensive), so you can perform the modifications while the socket layer is sending the previous data.
In case you are handling a large number of clients with frequent connects/disconnects, take a look at IO completion ports or a similar mechanism.