Non-Overlapped Serial - Do ReadFile() calls from separate threads block each other? - c++

I've inherited a large code base that contains multiple serial interface classes to various hardware components. Each of these serial classes uses non-overlapped (synchronous) serial I/O for its communication. I have an issue where the CPU randomly spikes to 100%, which causes the threads to stall briefly; the CPU then returns to normal usage after ~10-20 seconds.
My theory is that, due to the blocking nature of non-overlapped serial I/O, there are times when multiple threads are calling ReadFile() and blocking each other.
My question is: if multiple threads call ReadFile() (or WriteFile()) at the same time, will they block each other? Based on my research I believe that's true, but I would like confirmation.
The platform is Windows XP running C++03, so I don't have many modern tools available.

"if multiple threads are calling readFile() (or writeFile()) at the same time will they block each other?"
As far as I can tell, they will block each other: a synchronous call on a handle does not return until the operation completes, so other synchronous calls on that handle must wait.
I suggest you refer to the documentation: Synchronization and Overlapped Input and Output
When a function is executed synchronously, it does not return until the operation has been completed. This means that the execution of the calling thread can be blocked for an indefinite period while it waits for a time-consuming operation to finish. Functions called for overlapped operation can return immediately, even though the operation has not been completed. This enables a time-consuming I/O operation to be executed in the background while the calling thread is free to perform other tasks.
Using the same event on multiple threads can lead to a race condition in which the event is signaled correctly for the thread whose operation completes first and prematurely for other threads using that event.
And the operating system is in charge of the CPU. Your code only gets to run when the operating system schedules it, and the OS will not bother running threads that are blocked. Blocking will not occupy the CPU. I suggest you try the Windows Performance Toolkit to check CPU utilization.
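For illustration, here is a minimal sketch (my own, untested, not from the answer) of an overlapped read on a COM port with a dedicated event per operation, which sidesteps both the serialization of synchronous calls and the shared-event race the documentation warns about. "COM3" is a placeholder port name and error handling is abbreviated; this compiles as C++03 against the Win32 API:

#include <windows.h>
#include <stdio.h>

int main()
{
    // FILE_FLAG_OVERLAPPED opens the port for asynchronous I/O.
    HANDLE h = CreateFileA("\\\\.\\COM3", GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    char buf[256];
    OVERLAPPED ov = {0};
    // One manual-reset event per outstanding operation avoids the
    // shared-event race quoted above from the documentation.
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    DWORD bytesRead = 0;
    if (!ReadFile(h, buf, sizeof(buf), NULL, &ov))
    {
        if (GetLastError() == ERROR_IO_PENDING)
        {
            // The call returned immediately; the thread is free to do other
            // work here instead of blocking inside ReadFile().
            WaitForSingleObject(ov.hEvent, INFINITE);
            GetOverlappedResult(h, &ov, &bytesRead, FALSE);
        }
    }
    else
    {
        GetOverlappedResult(h, &ov, &bytesRead, TRUE); // completed synchronously
    }
    printf("read %lu bytes\n", bytesRead);

    CloseHandle(ov.hEvent);
    CloseHandle(h);
    return 0;
}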

Related

Futex throughput on Linux

I have an async API which wraps some IO library. The library uses C-style callbacks; the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build this API. Something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation I saw that the bottleneck is the future/promise - more precisely, the futex used to implement promise/future. Since the futex is, AFAIK, user space and the fastest mechanism I know of to sync two threads, I just switched to raw futexes, which somewhat improved the situation, but nothing drastic. The performance floats somewhere around 200k futex WAKEs per second. Then I stumbled upon this article - Futex Scaling for Multi-core Systems - which quite matches the effect I observe with futexes. My question is: since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just to signal IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.
Bottom line: user space, a simple synchronization primitive, shared between two threads only; only one thread sets the completion, only one thread waits for completion.
EDIT001:
What if... Previously I said no spinning in busy wait. But the futex already spins in busy wait, right? The difference is that the implementation covers the more general case, which requires a global hash table to hold the futexes, queues for all subscribers, etc. Is it a good idea to mimic the same behaviour on some simple entity (like an int) - no locks, no atomics, no global data structures - and busy wait on it like the futex already does?
In my experience, the bottleneck is due to Linux's poor support for IPC. This probably isn't a multicore scaling issue unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waiter. Well, Google proposed a solution to this with a FUTEX_SWAP API that does exactly this, but it has yet to be accepted into the Linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups, which will hopefully be able to do something similar; however, at the time of writing this is yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem - 'You are on the fastest route'! UNIX sockets, TCP loopback, and pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others (with TCP you get about 100k pings per second, about half the speed of a futex implementation). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker, then every time you post the futex you need to pin the wakee to the current core, and the system call to do this has too much overhead.)
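As a concrete illustration of the primitive asked for (and of the 'mimic the futex on a simple entity' idea from the edit), here is a minimal sketch, my own and untested: a two-thread binary semaphore that spins briefly in user space before falling back to FUTEX_WAIT. It uses an atomic rather than a plain int, since a plain int would be a data race; the spin count of 4000 is an arbitrary tuning knob:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>

static std::atomic<int> flag(0); // 0 = not signalled, 1 = signalled

static long futex(void* addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

void signal_completion()                 // called by the IO/completion thread
{
    flag.store(1, std::memory_order_release);
    futex(&flag, FUTEX_WAKE, 1);         // wake the single waiter, if asleep
}

void wait_completion()                   // called by the single waiting thread
{
    // Spin briefly first: with tens-of-microsecond operations this often
    // consumes the signal without any kernel transition or core wake-up.
    for (int i = 0; i < 4000; ++i)
        if (flag.exchange(0, std::memory_order_acquire) == 1)
            return;
    // Fall back to sleeping in the kernel while the flag is still 0.
    while (flag.exchange(0, std::memory_order_acquire) != 1)
        futex(&flag, FUTEX_WAIT, 0);
}

Note that whenever the waiter actually sleeps, this still pays the cross-core wake-up cost described above; pinning both threads to one core (e.g. with pthread_setaffinity_np) is what produced the >1 million pings/sec figure.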

RMA MPI window access latency

I use Fortran (with gfortran) and MPI 2 (OpenMPI). Through MPI_Win_lock and MPI_Win_unlock together with put and get operations (in non-overlapping regions of memory) all processes update a variable on my master process, which is exposed through a window.
In a test case I have noticed, however, that the unlock operations from processes that are not the master do not return until the master has finished some task, in this case sleeping for some seconds.
If instead of sleeping the master, I make it wait for some seconds using a while loop and a timer, and in the meantime I make the master lock and unlock the window, everything goes much faster:
call start_time(time_left)
do while (time_left .gt. 0)
  call MPI_Win_lock(...)
  call MPI_Win_unlock(...)
  call update_time(time_left)
end do
However, in my real code the master performs operations just like the other processes, so it cannot continuously lock and unlock the window. Also, it seems rather wasteful to me.
My question is therefore how to decrease this latency?
Do I really need to sprinkle my code with locks and unlocks for the master? Or is there another solution? Is this compiler / implementation dependent?
The behaviour is implementation-dependent. Most MPI libraries do not perform asynchronous progression of the operations; progression only happens while execution control is explicitly transferred to the library by calling MPI_Something. A relatively lightweight and portable hack is to call MPI_Iprobe periodically, which should enable the library to progress any outstanding non-blocking send and receive operations used to implement the RMA.
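A sketch of what that could look like in the master's loop (illustrative only, using the C API rather than the Fortran bindings; done() and do_some_work() are placeholders for the master's real logic):

#include <mpi.h>

extern bool done();           // placeholder: when the master should stop
extern void do_some_work();   // placeholder: one slice of the master's work

void master_loop()
{
    int flag;
    while (!done())
    {
        do_some_work();
        // Entering the MPI library gives it a chance to progress the
        // passive-target RMA (lock/put/get/unlock) issued by other ranks.
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                   &flag, MPI_STATUS_IGNORE);
    }
}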

I/O Completion Port vs. QueueUserApc?

Under Windows, there are two ways to queue work items while avoiding the creation of too many threads:
Means 1: Use IOCP;
Means 2: Use QueueUserApc.
However, means 1 is far more intricate than means 2.
So my question is: what are the advantages of means 1 over means 2?
When you call QueueUserApc, you must target a specific thread.
IOCP has a built-in thread dispatch mechanism that QueueUserApc lacks that allows you to target the most efficient thread out of a pool of threads. The thread dispatch mechanism automatically prevents too many threads from running at the same time (which causes extra context switches and extra contention) or too few threads from running at the same time (which causes poor performance).
Windows actually keeps track of the number of threads running IOCP jobs. It initially sets the number of threads it allows to run equal to the number of virtual cores on the machine. However, if a thread blocks for I/O or synchronization, another thread blocked on the IOCP port is automatically released, avoiding thread starvation.
In addition, IOCP can be easily hooked up to I/O so that I/O events trigger dispatches of threads blocked on the IOCP port. This is the most efficient way to do I/O to a large number of destinations on Windows.
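To make the dispatch mechanism concrete, here is a minimal sketch (my own, not from the answer) of an IOCP used as a plain work queue: worker threads block in GetQueuedCompletionStatus, any thread posts items with PostQueuedCompletionStatus, and Windows decides which worker runs each item. A completion key of 0 is used here as an arbitrary shutdown signal:

#include <windows.h>
#include <cstdio>

DWORD WINAPI Worker(LPVOID param)
{
    HANDLE iocp = static_cast<HANDLE>(param);
    DWORD bytes;
    ULONG_PTR key;
    LPOVERLAPPED ov;
    // At most N workers run concurrently (N fixed when the port is created).
    while (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
    {
        if (key == 0)
            break;                        // shutdown signal
        printf("item %lu on thread %lu\n",
               (unsigned long)key, GetCurrentThreadId());
    }
    return 0;
}

int main()
{
    // Concurrency limit 0 = one active worker per processor.
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    HANDLE threads[4];
    for (int i = 0; i < 4; ++i)
        threads[i] = CreateThread(NULL, 0, Worker, iocp, 0, NULL);

    for (ULONG_PTR job = 1; job <= 10; ++job)
        PostQueuedCompletionStatus(iocp, 0, job, NULL);  // queue work items

    for (int i = 0; i < 4; ++i)
        PostQueuedCompletionStatus(iocp, 0, 0, NULL);    // stop the workers

    WaitForMultipleObjects(4, threads, TRUE, INFINITE);
    CloseHandle(iocp);
    return 0;
}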

Thread or timer to read sensor data out?

My Linux C++ application periodically reads sensor data. Readout is done by a simple file I/O operation (the OS writes to a file, the application reads from this file).
Some information about my platform:
I have a single-core processor with hyper-threading
sensor data update frequency is 1 second
application GUI runs in main thread and shouldn't be blocked
I considered two approaches for sensor data read out:
timer running in main application thread
separate thread with infinite loop which does sensor data readout and then sleeps
Which approach makes more sense? Are there any other alternatives? What are the costs of both solutions (e.g. blocking of the main thread in the first approach, or context switching in the second)?
I don't know anything about your application or the hardware, but here are a few things to consider:
If you use a thread, you will have to create a communication channel of some sort to tell the main thread that data has been updated. Usually this would be a pipe(), as signals are inherently unreliable and condition locks don't work with I/O multiplexing (i.e. select()/poll()).
Can you get the entire set of data without blocking? If so, then just reading it in the main thread is probably easier. However, if your read can block, you'll probably need more "keep track of my read state to incorporate it into my central select()" logic, whereas a thread can just block until more data is available.
Thus, neither solution is automatically "easier" to do.
I wouldn't worry about "context switching" for a read that only occurs once per second; that's irrelevant.
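A minimal sketch (my own, untested) of the pipe() notification channel this answer describes: the reader thread writes one byte when new data is ready, and the main thread includes the pipe's read end in its select() loop alongside its other descriptors:

#include <unistd.h>
#include <pthread.h>
#include <sys/select.h>

static int g_notify[2];   // [0]: read end (main thread), [1]: write end (reader)

void* reader_thread(void*)
{
    for (;;)
    {
        // ... read the sensor file here (omitted) ...
        char tick = 1;
        write(g_notify[1], &tick, 1);    // tell the main loop data is ready
        sleep(1);
    }
    return 0;
}

void main_loop()
{
    pipe(g_notify);
    pthread_t t;
    pthread_create(&t, 0, reader_thread, 0);
    for (;;)
    {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(g_notify[0], &rfds);
        // The GUI's other descriptors would be added to rfds here as well.
        if (select(g_notify[0] + 1, &rfds, 0, 0, 0) > 0 &&
            FD_ISSET(g_notify[0], &rfds))
        {
            char tick;
            read(g_notify[0], &tick, 1); // drain the notification
            // ... pick up the new sensor data ...
        }
    }
}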
What else does the main thread have to do? Is it OK if it blocks? If so, then you don't need to run the timer, etc. in a separate thread.
If the main thread can't block waiting for the periodic timer, then a separate thread must be created. The communication of data between the threads can be via an object that is accessible to both threads and protected by a mutex (look up pthread_mutex_t), which is quite simple to do; see the sketch below.
As for which solution would be better and what the costs are, it depends on what else the main thread is doing. But for something this simple, either way should be about the same, and the context switching shouldn't affect anything. What should affect performance the most is how expensive the reads themselves are.
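A minimal sketch (my own, untested) of that mutex-protected shared object: a reader thread polls the sensor file once per second and the GUI thread grabs the latest sample whenever it likes. The path /dev/sensor0 and the SensorData layout are placeholders:

#include <pthread.h>
#include <unistd.h>
#include <cstdio>

struct SensorData { double value; };

static SensorData g_latest = { 0.0 };
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

void* reader_thread(void*)
{
    for (;;)
    {
        SensorData sample;
        FILE* f = fopen("/dev/sensor0", "r");   // placeholder device file
        if (f)
        {
            if (fscanf(f, "%lf", &sample.value) == 1)
            {
                pthread_mutex_lock(&g_lock);
                g_latest = sample;              // publish the latest sample
                pthread_mutex_unlock(&g_lock);
            }
            fclose(f);
        }
        sleep(1);                               // matches the 1 s update rate
    }
    return 0;
}

SensorData read_latest()                        // called from the GUI thread
{
    pthread_mutex_lock(&g_lock);
    SensorData copy = g_latest;
    pthread_mutex_unlock(&g_lock);
    return copy;
}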
I believe the cost of one context switch per second is not an issue, even for a single-core CPU without hyper-threading, especially taking into account that the application runs in user space and thus is not really time-critical. Polling the sensor in the main thread complicates the logic of the application, so I would recommend starting a thread for that purpose.
A sleep-loop will skew the timing, because each iteration is going to take longer than 1 second. Timers don't have that problem, and they are made for this scenario. So choose a timer.
Performance-wise there is no difference because you are only triggering once a second.
If the Linux driver reads the sensor data and writes it to a device file every second, you shouldn't duplicate the timer logic in your application: it may happen that after a 1-second sleep your application still reads the same data as 1 second ago. A better approach would be a thread that calls a blocking read on the device file. When new sensor data is available, the blocking read returns; the thread can process the data and call read again.

My thread pool only makes 4~5 threads. Why?

I use the QueueUserWorkItem() function to invoke the thread pool, and I tried lots of work with it (about 30000 items).
But according to the task manager, my application only makes 4~5 threads after I push the start button.
I read on MSDN that the default thread limit is about 500.
Why are only a few threads made in my application?
I'm trying to speed up my application, and I suspect this thread pool is one of the reasons it is slow.
thanks
It is important to understand how the threadpool scheduler works. It was designed to fine-tune the number of running threads against the capabilities of your machine. Your machine can probably run only two threads at the same time; dual-core CPUs are the current standard. Maybe four.
So when you dump a bunch of threads in its lap, it starts out by activating only two. The rest of them wait in a queue for CPU cores to become available. As soon as one of those two threads completes, it activates another one. Twice a second, it evaluates what's going on with active threads that haven't completed. It makes the rough assumption that those threads are blocking, and thus not making progress, and allows another thread to activate. You've now got three running threads. Getting up to 500 threads, the default maximum, would take 249 seconds.
Clearly, this behavior spells out what a thread should do to be suitable to run as a threadpool thread: it should complete quickly and not block often. Note that blocking on I/O requests is dealt with separately.
If this behavior doesn't suit you, then you can use a regular Thread. It will start running right away and compete with other threads in your program (and the operating system) for CPU time. Creating 30,000 such threads is not possible; there isn't enough virtual memory available for that. A 32-bit operating system poops out somewhere south of 2000 threads, consuming all available virtual memory. You can get about 50,000 threads on a 64-bit operating system before the paging file runs out. Testing these limits in a production program is not recommended.
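For what it's worth, here is a minimal sketch (my own, untested) of the Win32 side of this: queue 30,000 short items with QueueUserWorkItem and watch in Task Manager how few pool threads actually run them. The counting event is just scaffolding so main() knows when everything has finished:

#include <windows.h>
#include <cstdio>

static volatile LONG g_remaining = 30000;
static HANDLE g_done;                        // signalled when all items ran

DWORD WINAPI Work(LPVOID param)
{
    // A short, non-blocking work item: the kind the pool is tuned for.
    printf("item %d on thread %lu\n", (int)(INT_PTR)param, GetCurrentThreadId());
    if (InterlockedDecrement(&g_remaining) == 0)
        SetEvent(g_done);
    return 0;
}

int main()
{
    g_done = CreateEvent(NULL, TRUE, FALSE, NULL);
    for (int i = 0; i < 30000; ++i)
        QueueUserWorkItem(Work, (LPVOID)(INT_PTR)i, WT_EXECUTEDEFAULT);
    WaitForSingleObject(g_done, INFINITE);   // the pool runs ~#cores at a time
    CloseHandle(g_done);
    return 0;
}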
I think you may have misunderstood the use of the threadpool. Spawning and killing threads involves the Windows kernel and is an expensive operation. If you continuously needed threads to perform asynchronous operations and then threw them away, you would be making a great many system calls.
So the threadpool is actually a group of threads, created once, which instead of exiting when they complete their task wait for another item from QueueUserWorkItem. The threadpool then tunes itself based on how many threads your process needs concurrently. If you wish to test this, write this code:
for (int i = 0; i < 30000; i++)
{
    ThreadPool.QueueUserWorkItem(myMethod);
}
You will see this creates a whole bunch of threads - but maybe not 30000, as some of the threads that are created will be reused as the ThreadPool works through your function calls.
The threadpool is there so you can avoid creating a thread for every asynchronous operation, for the very reason that threads are expensive. If you wanted 30,000 threads, you would use a lot of memory for the thread stacks and waste a lot of CPU time doing context switches. Creating that many threads would only be justified if you had 30,000 CPU cores...