CUDA default streams and CUDA_API_PER_THREAD_DEFAULT_STREAM - c++

The documentation here tries to explain how default streams are handled.
Given code like this (ignoring allocation errors):
char *ptr;
char source[1000000];
cudaMalloc((void**)&ptr, 1000000);
cudaMemcpyAsync(ptr, source, 1000000, cudaMemcpyHostToDevice);
myKernel<<<1000, 1000>>>(ptr);
Is there a risk that myKernel will start before cudaMemcpyAsync finishes copying? I think "No" because this is a "Legacy default stream" as described in the documentation.
However, if I compile with CUDA_API_PER_THREAD_DEFAULT_STREAM what happens? The text for "Per-thread default stream" says:
The per-thread default stream is an implicit stream local to both the thread and the CUcontext, and which does not synchronize with other streams (just like explicitly created streams). The per-thread default stream is not a non-blocking stream and will synchronize with the legacy default stream if both are used in a program.
I think this might also be OK as both cudaMemcpyAsync and myKernel are effectively using CU_STREAM_PER_THREAD; am I correct?
The reason I ask is that I have a really weird intermittent CUDA error 77 in a kernel that I can only explain by a cudaMemcpyAsync not finishing before calling myKernel, which would mean that I am not understanding the documentation. The real code is too involved and too proprietary to make an MCVE, though.

Is there a risk that myKernel will start before cudaMemcpyAsync finishes copying? I think "No" because this is a "Legacy default stream" as described in the documentation.
No, that can't happen, because, as you note, the legacy default stream (stream 0) is blocking under all circumstances.
However, if I compile with CUDA_API_PER_THREAD_DEFAULT_STREAM what happens?
Almost nothing changes. The per-thread default stream isn't blocking, so other streams and other threads using their own default streams could operate concurrently within the context. The two operations here are, however, still issued into the same stream and are sequential with respect to one another: stream ordering guarantees the kernel will not begin until the copy has completed. Whether source is pageable or pinned (non-pageable) memory only affects how the transfer is staged and when the cudaMemcpyAsync call returns to the host; it does not change the ordering of the copy and the kernel on the device.
If you are having a real problem with suspected unexpected overlap of operations, you should be able to confirm this by profiling.
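As a debugging aid, here is a minimal sketch (CUDA C++) of how one might make the ordering explicit with a user-created stream; the kernel body and the launchExplicit wrapper are assumptions for illustration, and sizes follow the question:

__global__ void myKernel(char *ptr) { /* ... */ }

void launchExplicit(char *source)
{
    char *ptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc((void**)&ptr, 1000000);
    // Both operations go into the same user stream, so stream ordering
    // guarantees the copy finishes before the kernel starts.
    cudaMemcpyAsync(ptr, source, 1000000, cudaMemcpyHostToDevice, stream);
    myKernel<<<1000, 1000, 0, stream>>>(ptr);
    // While chasing the intermittent error 77, forcing completion here
    // rules ordering in or out as the culprit.
    cudaStreamSynchronize(stream);
    cudaFree(ptr);
    cudaStreamDestroy(stream);
}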

Related

Relying on network I/O to provide cross-thread synchronization in C++

Can external I/O be relied upon as a form of cross-thread synchronization?
To be specific, consider the pseudocode below, which assumes the existence of network/socket functions:
int a; // Globally accessible data.
socket s1, s2; // Platform-specific.
int main() {
    // Set up + connect two sockets to (the same) remote machine.
    s1 = ...;
    s2 = ...;
    std::thread t1{thread1}, t2{thread2};
    t1.join();
    t2.join();
}

void thread1() {
    a = 42;
    send(s1, "foo");
}

void thread2() {
    recv(s2); // Blocking receive (error handling omitted).
    f(a);     // Use a; should be 42.
}
We assume that the remote machine only sends data to s2 upon receiving the "foo" from s1. If this assumption fails, then certainly undefined behavior will result. But if it holds (and no other external failure occurs like network data corruption, etc.), does this program produce defined behavior?
"Never", "unspecified (depends on implementation)", "depends on the guarantees provided by the implementation of send/recv" are example answers of the sort I'm expecting, preferably with justification from the C++ standard (or other relevant standards, such as POSIX for sockets/networking).
If "never", then changing a to be a std::atomic<int> initialized to a definite value (say 0) would avoid undefined behaviour, but then is the value guaranteed to be read as 42 in thread2 or could a stale value be read? Do POSIX sockets provide a further guarantee that ensures a stale value will not be read?
If "depends", do POSIX sockets provide the relevant guarantee to make it defined behavior? (How about if s1 and s2 were the same socket instead of two separate sockets?)
For reference, the standard I/O library has a clause which seems to provide an analogous guarantee when working with iostreams (27.2.3¶2 in N4604):
If one thread makes a library call a that writes a value to a stream and, as a result, another thread reads this value from the stream through a library call b such that this does not result in a data race, then a’s write synchronizes with b’s read.
So is it a matter of the underlying network library/functions being used providing a similar guarantee?
In practical terms, it seems the compiler can't reorder accesses to the global a with respect to the send and recv calls (since, in principle, those functions could use a). However, the thread running thread2 could still read a stale value of a unless the send/recv pair itself provides some kind of memory barrier or synchronization guarantee.
Short answer: No, there is no generic guarantee that a will be updated. My suggestion would be to send the value of a along with "foo" - e.g. "foo,42", or something like it. That is guaranteed to work, and is probably not significant overhead. [There may of course be other reasons why that doesn't work well.]
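A minimal sketch of that suggestion, reusing the hypothetical send/recv and f from the question's pseudocode, and assuming recv returns the received message as a std::string:

#include <string>

void thread1() {
    // Put the value in the message itself instead of relying on the
    // global being visible to the other thread.
    send(s1, "foo,42");
}

void thread2() {
    std::string msg = recv(s2);                            // e.g. "foo,42"
    int value = std::stoi(msg.substr(msg.find(',') + 1));  // parse "42"
    f(value);  // no dependence on the global a
}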
Long rambling stuff that doesn't really answer the problem:
Global data is not guaranteed to be "visible" immediately across the cores of a multicore processor without further operations. Yes, most modern processors are cache-coherent, but not every model of every brand is guaranteed to be. So if thread2 runs on a core that has already cached a copy of a, there is no guarantee that the value of a is 42 at the point when you call f.
The C++ standard guarantees that the global is (re)loaded after the opaque function call, so the compiler is not allowed to transform the code into:
tmp = a;
recv(...);
f(tmp);
but as I said above, cache operations may be needed to guarantee that all processors see the same value at the same time. If send and recv take long enough or touch enough memory [there is no direct measure of how long or how much], you may see the correct value most or even all of the time, but for ordinary types there is no guarantee that the value is ACTUALLY visible outside the thread that last wrote it.
std::atomic will help on some types of processors, but there is no guarantee that the change is "visible" in a second thread or on a second processor core within any reasonable time after it was made.
The only practical solution is some kind of "repeat until I see it change" code. This may require two values: one that acts as a counter, and one that holds the actual data, so you can distinguish "a is now 42" from "I've set a again, and it's 42 this time too". If a represents, for example, the number of data items available in a buffer, it is probably "it changed value" that matters, and checking "is this the same as last time" suffices. The std::atomic operations come with ordering guarantees, which let you ensure that "if I update this field, the other field is guaranteed to be visible at the same time or before it". So you can use that to publish a pair of items: a counter indicating the "version number" of the current data, and the new value X itself (see the sketch below).
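A minimal sketch of that counter-plus-value idea using C++11 atomics (all names here are illustrative):

#include <atomic>

std::atomic<unsigned> version{0};
int payload;  // ordinary data, published via 'version'

void publish(int v) {
    payload = v;                                      // write the data first
    version.fetch_add(1, std::memory_order_release);  // then bump the counter
}

int read_when_changed(unsigned last_seen) {
    // "Repeat until I see it change": spin until the writer bumps the counter.
    while (version.load(std::memory_order_acquire) == last_seen) { /* spin */ }
    // The acquire load pairs with the writer's release, so this read is
    // guaranteed to see the payload written before the bump.
    return payload;
}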
Of course, if you KNOW which processor architectures your code will run on, you can make more educated guesses about the behaviour. For example, all x86 and many ARM processors use the cache interface to implement atomic updates of a variable, so after an atomic update on one core you know that no other processor will hold a stale value. But there are processors where even an update with an atomic instruction will not be visible on other cores or in other threads until "some time in the future, uncertain when".
In general, no, external I/O can't be relied upon for cross-thread synchronization.
The question is out-of-scope of the C++ standard itself, as it involves the behavior of external/OS library functions. So whether the program is undefined behavior depends on any synchronization guarantees provided by the network I/O functions. In the absence of such guarantees, it is indeed undefined behavior. Switching to (initialized) atomics to avoid undefined behavior still wouldn't guarantee the "correct" up-to-date value will be read. To ensure that within the realms of the C++ standard would require some kind of locking (e.g. spinlock or mutex), even though it seems like waiting shouldn't be required due to the real-time ordering of the situation.
In general, the notion of "real-time" synchronization (involving visibility rather than merely ordering) required to avoid having to potentially wait after the recv returns before loading a isn't supported by the C++ standard. At a lower level, this notion does exist, however, and would typically be implemented through inter-processor interrupts, e.g. FlushProcessWriteBuffers on Windows, or sys_membarrier on x86 Linux. This would be inserted after the store to a and before the send in thread1. No synchronization or barrier would be required in thread2. (It also seems like a simple SFENCE in thread1 might suffice on x86 due to its strong memory model, at least in the absence of non-temporal loads/stores.)
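A hedged sketch of that idea on Windows, where FlushProcessWriteBuffers is available (everything else follows the question's pseudocode):

#include <windows.h>

void thread1() {
    a = 42;
    // Forces every other core to execute a memory barrier before this call
    // returns, so no core can continue to hold a stale view of 'a' formed
    // before this point.
    FlushProcessWriteBuffers();
    send(s1, "foo");
}
// thread2 is unchanged: recv() cannot return before thread1 has sent,
// which is after the flush.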
A compiler barrier shouldn't be needed in either thread for the reasons outlined in the question (call to an external function send, which for all the compiler knows could be acquiring an internal mutex to synchronize with the other call to recv).
Insidious problems of the sort described in section 4.3 of Hans Boehm's paper "Threads Cannot be Implemented as a Library" should not be a concern as the C++ compiler is thread-aware (and in particular the opaque functions send and recv could contain synchronization operations), so transformations introducing writes to a after the send in thread1 are not permissible under the memory model.
This leaves the open question of whether the POSIX network functions provide the necessary guarantees. I highly doubt it, as on some of the architectures with weak memory models, they are highly non-trivial and/or expensive to provide (requiring a process-wide mutex or IPI as mentioned earlier). On x86 specifically, it's almost certain that accessing a shared resource like a socket will entail an SFENCE or MFENCE (or even a LOCK-prefixed instruction) somewhere along the line, which should be sufficient, but this is unlikely to be enshrined in a standard anywhere. Edit: In fact, I think even the INT to switch to kernel mode entails a drain of the store buffer (the best reference I have to hand is this forum post).

Is std::iostream non-blocking?

According to the boost reference for Boost.Iostreams (In section 3.6, at the very bottom):
http://www.boost.org/doc/libs/1_64_0/libs/iostreams/doc/index.html
Although the Boost.Iostreams Filter and Device concepts can accommodate non-blocking i/o, the C++ standard library stream and stream buffer interfaces cannot, since they lack a means to distinguish between temporary and permanent failures to satisfy a read or write request.
However, the function std::istream::readsome appears to be non-blocking, in that the available characters are returned immediately, without any blocking wait (beyond an in-memory copy). My understanding is that:
std::istream::read will block until eof or until the requested number of characters has been read.
std::istream::readsome will return immediately with characters copied from the internal buffer.
I agree with you that readsome is not a blocking operation. However, as specified, it is wholly inadequate as an interface for performing what is usually called "non-blocking I/O".
First, there is no guarantee that readsome will ever return new data, even if it is available. So to guarantee you actually make progress, you must use one of the blocking interfaces eventually.
Second, there is no way to know when readsome will return data. There is no way to "poll" the stream, or to get a "notification" or "event" or "callback". A usable non-blocking interface needs at least one of these.
In short, readsome appears to be a half-baked and under-specified attempt to provide a non-blocking interface to I/O streams. But I have never seen it used in production code, and I would not expect to.
I think the Boost documentation overstates the argument, because as you observe, readsome is certainly capable of distinguishing temporary from permanent failure. But their conclusion is still correct for the reasons above.
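To illustrate the progress problem, here is a minimal sketch of the usual pattern: drain whatever readsome will give you, then fall back to a blocking interface (the function name is illustrative):

#include <istream>
#include <vector>

std::streamsize drain_buffered(std::istream& in, std::vector<char>& out) {
    char tmp[4096];
    std::streamsize total = 0, n;
    // readsome() only returns characters already available in the stream
    // buffer; it may return 0 even when a blocking read() would succeed.
    while ((n = in.readsome(tmp, sizeof tmp)) > 0) {
        out.insert(out.end(), tmp, tmp + n);
        total += n;
    }
    return total;  // caller must eventually use read()/get() to make progress
}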
When I looked into portable non-blocking I/O, I didn't find anything in the C++ standard library that looked like it does what you think readsome does.
If your goal is portability, my interpretation was that the section that mattered most was this:
http://en.cppreference.com/w/cpp/io/basic_istream/readsome
For example, when used with std::ifstream, some library implementations fill the underlying filebuf with data as soon as the file is opened (and readsome() on such implementations reads data, potentially, but not necessarily, the entire file), while other implementations only read from file when an actual input operation is requested (and readsome() issued after file opening never extracts any characters).
This says that different implementations that use the iostream interface are allowed to do their work lazily, and readsome() doesn't guarantee that the work even gets kicked off.
However, I think your interpretation that readsome is guaranteed not to block is true.

Thread Safety: Multiple threads reading from a single const source

What should I be concerned about as far as thread safety and undefined behavior goes in a situation where multiple threads are reading from a single source that is constant?
I am working on a signal processing model that allows for parallel execution of independent processes. These processes may share an input buffer, but the process that fills the input buffer will always be complete before the next stage of possibly parallel processes will execute.
Do I need to worry about thread safety issues in this situation? And what could I do about it?
I would like to note that a lock-free solution would be best, if possible.
but the process that fills the input buffer will always be complete before the next stage of possibly parallel processes will execute
If this is guaranteed, then there is no problem with multiple threads reading from const objects.
I don't have the official standard so the following is from n4296:
17.6.5.9 Data race avoidance
3 A C++ standard library function shall not directly or indirectly modify objects (1.10) accessible by threads other than the current thread unless the objects are accessed directly or indirectly via the function's non-const arguments, including this.
4 [ Note: This means, for example, that implementations can't use a static object for internal purposes without synchronization because it could cause a data race even in programs that do not explicitly share objects between threads. —end note ]
Here is the Herb Sutter video where I first learned about the meaning of const in the C++11 standard. (see around 7:00 to 10:30)
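For what it's worth, here is a minimal sketch of the pattern under discussion, where thread creation itself provides the needed synchronization (fill and process are placeholders, not from the question):

#include <functional>
#include <thread>
#include <vector>

void fill(std::vector<float>& buffer);                         // producer
void process(const std::vector<float>& input, int which_half); // reader

void run_stage(std::vector<float>& buffer) {
    fill(buffer);  // completes before any reader thread exists
    // std::thread construction synchronizes-with the start of the thread
    // function, so both readers are guaranteed to see the filled buffer;
    // read-only access then needs no further locking.
    std::thread t1(process, std::cref(buffer), 0);
    std::thread t2(process, std::cref(buffer), 1);
    t1.join();
    t2.join();
}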
No, you are OK. Multiple reads from the same constant source are fine and pose no risk in any threading model I know of (namely, POSIX and Windows).
However,
but the process that fills the input buffer will always be complete
What are the guarantees here? How do you really know this is the case? Do you have synchronization in place?

Can I use fstream in C++ to read or write file when I'm implementing a disk management component of DBMS

In C++, I know I can read and write files using system calls like read and write, and I can also do it with fstream's help.
Now I'm implementing a disk manager, which is a component of a DBMS. For simplicity, I only use the disk manager to manage the space of a Unix file.
All I know is that fstream wraps system calls like read and write and provides some buffering.
However, I was wondering whether this will affect atomicity and synchronization or not?
My question is: which way should I use, and why?
No. Particularly not on Unix. A DBMS is going to want contiguous files. That means either a Unix variant that supports them or creating a disk partition.
You're also going to want to handle the buffering yourself, not rely on the C++ library's buffering.
I could go on, but streams are for -- streams of data -- not secure, reliable, structured data.
The following information about the synchronization and thread safety of fstream can be found in the ISO C++ standard.
27.2.3 Thread safety [iostreams.threadsafety]
Concurrent access to a stream object (27.8, 27.9), stream buffer object (27.6), or C Library stream (27.9.2) by multiple threads may result in a data race (1.10) unless otherwise specified (27.4). [ Note: Data races result in undefined behavior (1.10). —end note ]
If one thread makes a library call a that writes a value to a stream and, as a result, another thread reads this value from the stream through a library call b such that this does not result in a data race, then a's write synchronizes with b's read.
C/C++ file I/O operations are not thread-safe by default. So whether you use fstream or the open/write/read system calls, you will have to provide the synchronization yourself in your implementation. You may use the std::mutex mechanism provided in the new C++ standard (i.e. C++11) to synchronize your file I/O.
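For example, a minimal sketch of serializing all access to one fstream with a std::mutex (the class and its names are illustrative, not a real API):

#include <fstream>
#include <mutex>
#include <string>

class SyncedFile {
    std::fstream file_;
    std::mutex mtx_;
public:
    explicit SyncedFile(const std::string& path)
        : file_(path, std::ios::in | std::ios::out | std::ios::binary) {}

    void write_at(std::streampos pos, const char* data, std::streamsize n) {
        std::lock_guard<std::mutex> lock(mtx_);  // one thread in the file at a time
        file_.seekp(pos);
        file_.write(data, n);
        file_.flush();
    }

    void read_at(std::streampos pos, char* data, std::streamsize n) {
        std::lock_guard<std::mutex> lock(mtx_);
        file_.seekg(pos);
        file_.read(data, n);
    }
};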

c++ volatile multithreading variables

I'm writing a C++ app.
I have a class variable that more than one thread is writing to.
In C++, anything that can be modified without the compiler "realizing" that it's being changed needs to be marked volatile, right? So if my code is multithreaded, and one thread may write to a variable while another reads from it, do I need to mark that variable volatile?
[I don't have a race condition since I'm relying on writes to ints being atomic]
Thanks!
C++ doesn't yet have any provision for multithreading. In practice, volatile doesn't do what you mean (it was designed for memory-mapped hardware, and while the two issues are similar, they are different enough that volatile doesn't do the right thing -- note that volatile has been used in other languages for multithreaded contexts).
So if you want to write an object in one thread and read it in another, you'll have to use the synchronization features your implementation provides, where it requires them. For the ones I know of, volatile plays no role in that.
FYI, the next standard will take multithreading into account, and volatile will play no role in that, so this won't change. You'll just have standard-defined conditions in which synchronization is needed and standard-defined ways of achieving them.
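For reference, a minimal sketch of what that standardized mechanism became in C++11: std::atomic, which provides both atomicity and ordering, neither of which volatile gives you (names here are illustrative):

#include <atomic>

std::atomic<int> shared_value{0};

void writer() {
    shared_value.store(42, std::memory_order_release);  // atomic, ordered write
}

void reader() {
    // Atomic, ordered read; pairs with the writer's release store.
    int v = shared_value.load(std::memory_order_acquire);
    // ... use v ...
}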
Yes, volatile is the absolute minimum you'll need. It ensures that the code generator won't produce code that caches the variable in a register, and always performs reads and writes from/to memory. Most code generators can provide atomicity guarantees for variables that have the same size as the native CPU word; they'll ensure the memory address is aligned so that the variable cannot straddle a cache-line boundary.
That is, however, not a very strong contract on modern multi-core CPUs. volatile does not promise that another thread running on another core can see updates to the variable. That requires a memory barrier, an instruction that makes pending writes visible to other cores. If you don't provide a barrier, the thread will in effect keep running until such a flush occurs naturally. That will eventually happen; the thread scheduler is bound to provide one. It can take milliseconds.
Once you've taken care of details like this, you'll eventually have re-invented a condition variable (aka event) that isn't likely to be any faster than the one provided by a threading library. Or as well tested. Don't invent your own, threading is hard enough to get right, you don't need the FUD of not being sure that the very basic primitives are solid.
volatile instructs the compiler not to optimize based on assumptions about a variable's value or usage, since the variable could be modified "from the outside".
volatile won't provide any synchronization, however, and your assumption that writes to an int are atomic is anything but realistic!
We'd need to see some of the usage to tell whether volatile is needed in your case (or to check the behavior of your program), and more importantly whether you have any sort of synchronization.
I think that volatile only really applies to reading, especially reading memory-mapped I/O registers.
It can be used to tell the compiler to not assume that once it has read from a memory location that the value won't change:
while (*p)
{
    // ...
}
In the above code, if *p is not written to within the loop, the compiler might decide to move the read outside the loop, more like this:
cached_p = *p;
while (cached_p)
{
    // ...
}
If p is a pointer to a memory-mapped I/O port, you would want the first version where the port is checked before the loop is entered every time.
If p is a pointer to memory in a multi-threaded app, you're still not guaranteed that writes are atomic.
Without locking you may still get 'impossible' re-orderings done by the compiler or processor. And there's no guarantee that writes to ints are atomic.
It would be better to use proper locking.
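A minimal sketch of that "proper locking" alternative using std::mutex (names are illustrative):

#include <mutex>

int shared = 0;
std::mutex m;

void write_value(int v) {
    std::lock_guard<std::mutex> lock(m);  // releasing the mutex publishes the write
    shared = v;
}

int read_value() {
    std::lock_guard<std::mutex> lock(m);  // acquiring the mutex sees prior writes
    return shared;
}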
volatile will solve your problem, i.e. it will guarantee consistency among all the caches of the system. However, it will be inefficient, since it updates the variable in memory on every R or W access. You might consider using a memory barrier instead, only where it is needed.
If you are working with gcc or icc, have a look at the sync built-ins: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
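A tiny sketch of those builtins (the legacy __sync interface documented at the link above; the functions shown are real, the usage is illustrative):

int counter = 0;

void bump() {
    __sync_fetch_and_add(&counter, 1);  // atomic read-modify-write, full barrier
}

void flush() {
    __sync_synchronize();               // issues a full memory barrier
}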
EDIT (mostly about pm100 comment):
I understand that my beliefs are not a reference so I found something to quote :)
The volatile keyword was devised to prevent compiler optimizations that might render code incorrect in the presence of certain asynchronous events. For example, if you declare a primitive variable as volatile, the compiler is not permitted to cache it in a register
From Dr Dobb's
More interesting:
Volatile fields are linearizable. Reading a volatile field is like acquiring a lock; the working memory is invalidated and the volatile field's current value is reread from memory. Writing a volatile field is like releasing a lock: the volatile field is immediately written back to memory.
(this is all about consistency, not about atomicity)
from The Art of multiprocessor programming, Maurice Herlihy & Nir Shavit
A lock contains memory-synchronization code; if you don't lock, you must do something else, and using the volatile keyword is probably the simplest thing you can do (even though it was designed for external devices with memory bound into the address space -- that's not the point here).