The Vulkan doc says: "vkInvalidateMappedMemoryRanges guarantees that device writes to the memory ranges described by pMemoryRanges, which have been made available to the host memory domain using the VK_ACCESS_HOST_WRITE_BIT and VK_ACCESS_HOST_READ_BIT access types, are made visible to the host."
I wonder which command uses VK_ACCESS_HOST_WRITE_BIT and VK_ACCESS_HOST_READ_BIT to perform the memory domain operation that makes writes available to the device domain also available to the host domain. Or does it mean vkInvalidateMappedMemoryRanges performs the device-to-host domain operation itself?
I know vkQueueSubmit internally performs a host-to-device domain operation after we call vkFlushMappedMemoryRanges, but which command does the opposite?
As with anything else in Vulkan, you have to establish a barrier between the write and the read to make the write available to the reader. That the read happens on the host does not change this; it merely changes how you go about creating that barrier.
You need a memory barrier after the device writes, with VK_ACCESS_HOST_READ_BIT in the destination access mask, to make the writes available to the host. This is typically done with a pipeline barrier whose destination stage is VK_PIPELINE_STAGE_HOST_BIT.
However, the CPU cannot act on any of this until it has synchronized in some way with that barrier. So it needs to wait either on an event set after the barrier or on the fence from the queue submit operation. Note that the event cannot itself contain the barrier; the barrier must happen before the event is set.
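For concreteness, here is a minimal sketch of the whole chain, assuming the device writes come from a compute shader and that cmd, device, fence and memory are objects created elsewhere (a sketch of the pattern, not the only valid arrangement):

    #include <vulkan/vulkan.h>
    #include <cstdint>

    // Recorded at the end of the command buffer, after the device writes:
    void recordHostBarrier(VkCommandBuffer cmd) {
        VkMemoryBarrier barrier{};
        barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
        barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // the device writes
        barrier.dstAccessMask = VK_ACCESS_HOST_READ_BIT;    // make them host-available
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // source scope
                             VK_PIPELINE_STAGE_HOST_BIT,           // destination scope
                             0, 1, &barrier, 0, nullptr, 0, nullptr);
    }

    // On the CPU, after vkQueueSubmit() of that command buffer with 'fence':
    void readOnHost(VkDevice device, VkFence fence, VkDeviceMemory memory) {
        vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
        VkMappedMemoryRange range{};
        range.sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
        range.memory = memory;
        range.offset = 0;
        range.size   = VK_WHOLE_SIZE;
        // Only needed when the memory lacks HOST_COHERENT:
        vkInvalidateMappedMemoryRanges(device, 1, &range);
        // ... the mapped pointer now shows the device's writes ...
    }

The fence wait is what synchronizes the host with the barrier; only after it may the invalidate (and the reads) take place.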
In the JVM spec it says:
When data is copied from the main memory to a working memory, two actions must occur: a read operation performed by the main memory, followed some time later by a corresponding load operation performed by the working memory.
Must the read and load operations happen back to back? Can there be other machine instructions between read and load? A read operation only says where to read data from, but it does not say where to keep the data. So if there are other machine instructions between read and load, where is the data kept in the meantime?
A lock action (by a thread tightly synchronized with main memory) causes a thread to acquire one claim on a particular lock. An unlock action (by a thread tightly synchronized with main memory) causes a thread to release one claim on a particular lock.
What is the relation between the lock/unlock actions specified here and the lock/unlock implemented by the synchronized keyword and by the locks in the java.util.concurrent library? How is the lock mentioned in the spec implemented? Is it implemented by some machine instructions, or by setting a mark in the object header?
I'm running a fully operational IOCP TCP socket application. Today I was thinking about the critical section design, and now I have one endless question in my head: a global critical section, or one per client? I came to this because, as I see it, there is no point in using multiple worker threads if every thread depends on a single lock, right? I mean... right now I don't see any performance issue with 100 simultaneous clients, but what if there were 10000?
My shared resource is a preallocated per-client struct, so each client has its own I/O context, socket and so on. There is no inter-client resource sharing, so I think that is another point in favour of the per-client CS. I use one accept thread and 8 (processors * 2) worker threads. This application is basically designed for small (< 1KB) packets, but sometimes also for file streaming.
The "correct" answer probably depends on your design, the number of concurrent clients and the performance that you require from the hardware that you have available.
In general, I find it best to go with the simplest thing that works and then profile to locate hot spots.
However... You say that you have no inter-client shared resources so I assume the only synchronisation that you need to do is around 'per-connection' state.
Since it's per connection the obvious (to me) design would be for the per-connection state to contain its own critical section. What do you perceive to be the downside of this approach?
The problem with a single shared lock is that you introduce contention between connections (and threads) that have no reason to block each other. This will adversely affect performance and will likely become a hot-spot as connection numbers rise.
Once you have a per-connection lock you might want to avoid taking it more often than necessary by having the IOCP threads simply lock it to place completions in a per-connection queue for processing. This has the advantage of allowing a single IOCP thread to work on each connection and preventing a single connection from having additional IOCP threads blocking on it. It also works well with 'skip completion port on success' processing.
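To illustrate that queueing idea, a rough sketch - IoCompletion, DrainConnection and the surrounding IOCP loop are hypothetical, and error handling is omitted:

    #include <winsock2.h>
    #include <windows.h>
    #include <queue>

    struct IoCompletion { DWORD bytes; OVERLAPPED* ov; };  // hypothetical payload

    struct Connection {
        SOCKET                   socket;
        CRITICAL_SECTION         lock;       // protects only this connection
        std::queue<IoCompletion> pending;    // completions queued by IOCP threads
        bool                     processing; // true while one thread drains the queue
    };

    void DrainConnection(Connection* c);     // does the real work, outside the lock

    void OnCompletion(Connection* c, IoCompletion item) {
        EnterCriticalSection(&c->lock);
        c->pending.push(item);
        bool firstIn = !c->processing;       // only one thread may process at a time
        if (firstIn) c->processing = true;
        LeaveCriticalSection(&c->lock);
        if (firstIn) DrainConnection(c);     // other IOCP threads never block on it
    }

The lock is held only long enough to enqueue, so even under heavy completion traffic no IOCP thread stalls behind a busy connection.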
I have an application where producers and consumers ("clients") want to send broadcast messages to each other, i.e. an n:m relationship. They could all be different programs, so they are separate processes, not threads.
To reduce the n:m to something more maintainable, I was thinking of introducing a small, central server. That server would offer a socket that each client connects to.
And each client would send a new message through that socket to the server - resulting in 1:n.
The server would also offer a shared memory segment that is read-only for the clients. It would be organized as a ring buffer where new messages are added by the server, overwriting older ones.
This would give the clients some time to process each message - but if a client is too slow, it's bad luck; the message wouldn't be relevant anymore anyway...
The advantage I see in this approach is that I avoid synchronisation as well as unnecessary data copying and buffer hierarchies; the central one should be enough, shouldn't it?
That's the architecture so far - I hope it makes sense...
Now to the more interesting aspect of implementing that:
The index of the newest element in the ring buffer is a variable in shared memory, and the clients would just have to wait until it changes. Instead of a stupid while( central_index == my_last_processed_index ) { /* do nothing */ } loop, I want to free CPU resources, e.g. by using pthread_cond_wait().
But that needs a mutex that I think I don't need - on the other hand, Why do pthreads’ condition variable functions require a mutex? gave me the impression that I'd better ask whether my architecture makes sense and could be implemented like that...
Can you give me a hint if all of that makes sense and could work?
(Side note: the client programs could also be written in the common scripting languages like Perl and Python. So the communication with the server has to be recreated there and thus shouldn't be too complicated or even proprietary)
If memory serves, the reason for the mutex accompanying a condition variable is that under POSIX, signalling the condition variable causes the kernel to wake up all waiters on it. In these circumstances, the first thing that consumer threads need to do is check that there is something to consume - by means of accessing a variable shared between producer and consumer threads. The mutex protects against concurrent access to the variable used for this purpose. This of course means that if there are many consumers, n-1 of them are needlessly awoken.
Having implemented precisely the arrangement described above, I can say the choice of IPC object to use is not obvious. We were buffering audio between high-priority real-time threads in separate processes, and didn't want to block the consumer. As the audio was produced and consumed in real time, we were already getting scheduled regularly on both ends, and if there was nothing to consume (or no space to produce into) we trashed the data because we'd already missed the deadline.
In the arrangement you describe, you will need a mutex to prevent the consumers concurrently consuming items that are queued (and believe me, on a lightly loaded SMP system, they will). However, you don't need to have the producer contend on this as well.
I don't understand your comment about the consumer having read-only access to the shared memory. In the classic lockless ring buffer implementation, the producer writes the queue tail pointer and the consumer(s) the head - whilst all parties need to be able to read both.
You might of course arrange for the queue head and tail to be in a different shared memory region from the queue data itself.
Also be aware that there is a theoretical data coherency hazard on SMP systems when implementing a ring buffer such as this - namely that write-back to memory of the queue content may occur out of order with respect to the head or tail pointer (they sit in caches, usually one per CPU core). There are other variants on this theme to do with synchronisation of caches between CPUs. To guard against these, you need to use memory, load and store barriers to enforce ordering. See Memory Barrier on Wikipedia. You avoid this hazard explicitly by using kernel synchronisation primitives such as mutexes and condition variables.
The C11 atomic operations can help with this.
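For illustration, here is a minimal single-producer/single-consumer ring written with the equivalent C++11 std::atomic (the semantics match C11's); the acquire/release pairs are exactly the barriers described above. With multiple consumers you would still need the mutex mentioned earlier:

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>
    struct SpscRing {
        T data[N];
        std::atomic<std::size_t> head{0};  // advanced by the consumer
        std::atomic<std::size_t> tail{0};  // advanced by the producer

        bool push(const T& v) {            // producer only
            std::size_t t = tail.load(std::memory_order_relaxed);
            if ((t + 1) % N == head.load(std::memory_order_acquire))
                return false;              // full
            data[t] = v;                   // write the payload first...
            tail.store((t + 1) % N, std::memory_order_release); // ...then publish
            return true;
        }

        bool pop(T& v) {                   // consumer only
            std::size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;              // empty
            v = data[h];                   // read the payload first...
            head.store((h + 1) % N, std::memory_order_release); // ...then free the slot
            return true;
        }
    };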
You do need a mutex with pthread_cond_wait(), as far as I know. The reason is that checking your condition and going to sleep are not one atomic step: the condition could change between your check and the call, and the wakeup would be missed, unless both are protected by the mutex.
It's possible that you can ignore this situation - the client might sleep past message 1, but when the subsequent message is sent the client will wake up and find two messages to process. If that's unacceptable, then use a mutex.
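The resulting pattern, using the variable names from the question (a sketch; for separate processes, the mutex and condition variable must live in shared memory and be initialized with the PTHREAD_PROCESS_SHARED attribute):

    #include <pthread.h>

    extern pthread_mutex_t mutex;   // assumed process-shared, in the SHM segment
    extern pthread_cond_t  cond;
    extern int central_index;

    void client_wait(int my_last_processed_index) {
        pthread_mutex_lock(&mutex);
        while (central_index == my_last_processed_index)
            pthread_cond_wait(&cond, &mutex); // atomically unlocks and sleeps
        pthread_mutex_unlock(&mutex);
        /* process messages up to central_index */
    }

    void server_publish(int new_index) {
        pthread_mutex_lock(&mutex);
        central_index = new_index;
        pthread_cond_broadcast(&cond);        // wake every waiting client
        pthread_mutex_unlock(&mutex);
    }

Re-checking the predicate in the while loop is what makes both missed and spurious wakeups harmless.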
You could probably use a slightly different design with sem_t, if your system has them; some POSIX systems are still stuck on the 2001 version of POSIX.
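A sketch of that variant - a counting semaphore per client replaces the mutex/condvar pair, and it too must live in the shared memory segment:

    #include <semaphore.h>

    void setup(sem_t* sem)   { sem_init(sem, 1, 0); } // pshared = 1, count 0
    void publish(sem_t* sem) { sem_post(sem); }       // server: one post per message
    void consume(sem_t* sem) { sem_wait(sem); }       // client: sleeps until posted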
You don't necessarily need a mutex/condition pair; that is just how it was designed for POSIX a long time ago.
Modern C (C11) and C++ (C++11) now bring you (or will bring you) atomic operations, a feature implemented in all modern processors but long lacking support in higher-level languages. Atomic operations are part of the answer for resolving a race condition in a ring buffer such as you want to implement. But they are not sufficient on their own, because with them you can only do an active wait through polling, which is probably not what you want.
Linux, as an extension to POSIX, has futexes, which resolve both problems: races on updates are avoided by using atomic operations, and waiters can be put to sleep via a system call. Futexes are often considered too low-level for everyday programming, but I think they actually aren't too difficult to use. I have written up things here.
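A hedged, Linux-only sketch of the idea; glibc offers no wrapper, so the raw syscall is used, and central_index is assumed to be a 32-bit counter in shared memory:

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <limits.h>
    #include <stdint.h>

    // Sleeps only if *addr still equals 'expected' - the kernel re-checks
    // atomically, so no wakeup can be lost between the check and the sleep.
    static void futex_wait(uint32_t* addr, uint32_t expected) {
        syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
    }

    // Wakes every process currently sleeping on 'addr'.
    static void futex_wake_all(uint32_t* addr) {
        syscall(SYS_futex, addr, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
    }

    // Client: while (central_index == last) futex_wait(&central_index, last);
    // Server: atomically increment central_index, then futex_wake_all(&central_index);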
First of all: sorry for my English.
I have trouble with POSIX sockets and/or pthreads. I'm developing on an embedded device (ARM9 CPU). A multithreaded TCP server will run on the device, and it should be able to process a lot of incoming connections. The server gets a connection from a client and increments a counter variable (unsigned int counter). Client routines run in separate threads. All clients use one singleton class instance (in this class the same files are opened and closed). A client works with the files, then the client thread closes the connection socket and calls pthread_exit().
So my TCP server can't handle more than 250 threads (counter = 249, plus 1 server thread), and I get "Resource temporarily unavailable". What's the problem?
Whenever you hit the thread limit - or, as mentioned, run out of virtual process address space due to the number of threads - you're... doing it wrong. More threads don't scale, especially not when doing embedded programming. You can handle requests on a thread pool instead. Use poll(2) to handle many connections on fewer threads. This is pretty well-trodden territory, and libraries (like ACE and Asio) have been leveraging this model for good reason.
The 'thread-per-request' model is mainly popular because of its (perceived) simple design.
As long as you keep connections on a single logical thread (sometimes known as a strand) there is no real difference, though.
Also, if the handling of a request involves no blocking operations, you can never do better than polling and handling on a single thread after all: you can use the 'backlog' feature of bind/accept to let the kernel worry about pending connections for you! (Note: this assumes a single-core CPU; on a dual-core CPU this kind of processing would be optimal with one thread per CPU.)
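A minimal sketch of the poll(2) approach; accept_new_connection and handle_request are hypothetical helpers, and error handling is trimmed:

    #include <poll.h>

    void accept_new_connection(struct pollfd fds[], nfds_t* nfds); // hypothetical
    void handle_request(int fd);                                   // hypothetical

    void event_loop(struct pollfd fds[], nfds_t* nfds) {
        for (;;) {
            int ready = poll(fds, *nfds, -1);     // sleep until any fd is ready
            if (ready < 0) continue;              // e.g. interrupted by a signal
            if (fds[0].revents & POLLIN)          // fds[0] is the listening socket
                accept_new_connection(fds, nfds);
            for (nfds_t i = 1; i < *nfds; ++i)
                if (fds[i].revents & POLLIN)
                    handle_request(fds[i].fd);    // one thread serves them all
        }
    }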
Edit: addition regarding:
ulimit shows how many threads the OS can handle, right? If so, ulimit does not solve my problem because my app only uses ~10-15 threads at the same time.
If that's the case, you should really double-check that you are joining or detaching all threads properly. Also think of the synchronization objects: if you consistently forget to call the relevant pthread_*_destroy functions, you'll run into the limits even without needing them. That would of course be a resource leak. Some tools may be able to help you spot them (valgrind/helgrind come to mind).
Use ulimit -n to check the limit on open file descriptors. You can increase it for your current session if the number is too low.
You can also edit /etc/security/limits.conf to set a permanent limit.
Usually, the first limit you are hitting on 32-bit systems is that you are running out of virtual address space when using default stack sizes.
Try explicitly specifying the stack size when creating threads (to less than 1 MB) or setting the default stack size with "ulimit -s".
Also note that you need to either pthread_detach or pthread_join your threads so that all resources will be freed.
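For example (client_routine and client_ctx are placeholders for your thread function and its argument):

    #include <pthread.h>

    void* client_routine(void* ctx);   // your per-client handler

    void spawn_client(void* client_ctx) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 256 * 1024);  // 256 KB instead of megabytes
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); // auto-free on exit
        pthread_t tid;
        pthread_create(&tid, &attr, client_routine, client_ctx);
        pthread_attr_destroy(&attr);
    }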
I am trying to improve the performance of a framework I use.
Currently it makes use of a shared memory segment (SHM) to allow for inter-process communication between two C++ threads. Access to the SHM is controlled through a semaphore. This is the current system, which works quite well, albeit slightly slower than I'd like, and with a need for flags for communication.
I have thought of using a master/slave configuration with each signal only ever being driven by one side. Hence a signal such as Slave_Ready would be written by the slave and read by the master to show that the slave can take a request.
I would expect this behaviour to be supported, since only one side ever writes to a given signal. However, when the slave is polling the master-driven signals, the master seems unable to change their values. I've done this in Eclipse, and when I try to step through the write instruction it just doesn't get executed. This is what it looks like:
    shmp->MREADY = true; // at the same time, the slave is polling this signal
So this instruction never goes through.
By my understanding, a concurrent read/write should be irrelevant here. The write should go through, or it should be handled as an atomic request by the memory controller. Even if a read happens halfway through a write, I do not have an issue with corrupted data. If the read sees true, it goes through and accesses data that was written before the ready signal was asserted. If it reads false, it accesses the signals on the next cycle. Either way, data integrity is preserved. Hence I would expect this to work without issue, but there is obviously something at play here.
Are concurrent reads/writes not supported? Does the constant spam of polling read requests drown out the write request?
I would suggest using C++11's std::atomic<> for the MREADY member, or at least issuing std::atomic_thread_fence(std::memory_order_acq_rel) after the assignment.
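A sketch of what that could look like, assuming shmp points at a struct placed in the shared segment (note that std::atomic is only dependable across processes where it is lock-free, which it is for bool on mainstream platforms):

    #include <atomic>

    struct Shared {
        std::atomic<bool> MREADY{false};
        // ... payload that the master writes before raising MREADY ...
    };

    void master_publish(Shared* shmp) {
        // write the payload first, then publish it:
        shmp->MREADY.store(true, std::memory_order_release);
    }

    void slave_wait(Shared* shmp) {
        while (!shmp->MREADY.load(std::memory_order_acquire))
            ; // spin; once this exits, the payload written before the store is visible
    }

Besides the ordering guarantees, the atomic also stops the compiler from hoisting the flag read out of the polling loop, which is one plausible reason the plain-bool version appeared never to see the write.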