Concurrent interrupts in ARM - concurrency

I am new to ARM processors. The Atmel ATSAMD20E implements an ARM Cortex-M0+ processor based on the ARMv6-M architecture. It allows up to 32 external interrupts, with the interrupt signals connected to the Nested Vectored Interrupt Controller (NVIC). Would it be possible to have concurrent interrupts using the NVIC? If so, how can we determine the maximum number of interrupts that can run concurrently? Could someone please point me to any documentation that explains the handling of concurrent interrupts. Thanks

The maximum number of interrupts that can run "concurrently" is limited by stack space, the number of priority levels, and the number of interrupt sources you have in the system. You say you have 32 interrupts, the M0+ has four programmable priority levels, and I have no idea how much stack you're willing to sacrifice to get this behavior. (And "concurrent" is really a misnomer; they're preempting each other, not running concurrently.)
In practice, however, it really doesn't buy you much to support more than a few priority levels, if even that. You only need this if you have an interrupt whose deadline is shorter than the running time of your longest interrupt handler.
See here (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0337e/Cihcbadd.html) for description of what happens on the stack as interrupts get preempted by other interrupts.
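In practice, "concurrent" handling on this part comes down to assigning NVIC priorities so that urgent interrupts can preempt longer-running handlers. A minimal CMSIS-style sketch, assuming the vendor device header and using EIC_IRQn / SERCOM0_IRQn as illustrative IRQ names from that header:
#include "samd20.h"   // vendor device header (assumed name); pulls in the CMSIS core functions

// Lower numeric value = higher priority. The Cortex-M0+ has 2 priority
// bits, so only levels 0..3 are available.
void ConfigureIrqPriorities(void)
{
    NVIC_SetPriority(EIC_IRQn, 0);      // external-interrupt controller: most urgent, can preempt
    NVIC_SetPriority(SERCOM0_IRQn, 2);  // example peripheral at a lower priority
    NVIC_EnableIRQ(EIC_IRQn);
    NVIC_EnableIRQ(SERCOM0_IRQn);
}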

Related

How do I increase I/O priority for a C++ process on Windows

I'm working on a project that consumes data from a piece of FPGA hardware over USB 3.0.
Right now, the buffer in the hardware is extremely small, so I have very little time to consume data from the hardware without getting a buffer overrun in the hardware's FIFO. In my tests, injecting a 6 millisecond sleep in the critical thread was enough to cause an overrun in the hardware buffer.
Right now, I'm barely managing to make things work, with a combination of pinning to a core (it turns out the OS scheduler moving my thread from one core to another was enough to cause it to overrun) and setting my thread priority as high as I can (+15, which corresponds to non-admin "Realtime". Apparently values up to 31 are technically valid, but there are Windows permission issues going higher).
Basically, I'm trying to do a near-real-time thing in Windows, which I realize is not the greatest idea, but that ship has sailed, unfortunately.
So, given the above, I'm interested to see if increasing the I/O priority can possibly help.
How can I (programmatically) increase the I/O priority for a process on Windows?
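For reference, the core-pinning and thread-priority setup described above corresponds roughly to the following Win32 calls (a sketch; the affinity mask is illustrative):
#include <windows.h>

// Sketch of the mitigation described in the question: pin the consumer
// thread to one core and raise it to time-critical (+15) priority.
void PinAndBoostCurrentThread()
{
    SetThreadAffinityMask(GetCurrentThread(), 1 << 2);                     // core 2, illustrative
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);  // +15
}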

Should I "bind" a "spinning" thread to a certain core?

My application contains several latency-critical threads that "spin", i.e. never block.
Such a thread is expected to take 100% of one CPU core. However, it seems modern operating systems often move threads from one core to another. So, for example, with this Windows code:
void Processor::ConnectionThread()
{
    while (work)
    {
        Iterate();
    }
}
I do not see a "100% occupied" core in Task Manager; overall system load is 36-40%.
But if I change it to this:
void Processor::ConnectionThread()
{
    SetThreadAffinityMask(GetCurrentThread(), 2);
    while (work)
    {
        Iterate();
    }
}
Then I do see that one of the CPU cores is 100% occupied, and overall system load is reduced to 34-36%.
Does this mean that I should prefer SetThreadAffinityMask for "spin" threads? Did I improve latency by adding SetThreadAffinityMask in this case? What else should I do for "spin" threads to improve latency?
I'm in the middle of porting my application to Linux, so this question is more about Linux, if that matters.
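For the Linux port, the rough equivalent of the SetThreadAffinityMask call above is pthread_setaffinity_np (a sketch; core 1 is an arbitrary choice):
#include <pthread.h>
#include <sched.h>

void Processor::ConnectionThread()
{
    // Pin the calling thread to core 1 (arbitrary example), the Linux
    // counterpart of SetThreadAffinityMask(GetCurrentThread(), 2).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    while (work)
    {
        Iterate();
    }
}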
Update: I found this slide which shows that binding a busy-waiting thread to a CPU may help:
Running a thread locked to a single core gives the best latency for that thread in most circumstances if this is the most important thing in your code.
The reasons (R) are:
your code is likely to be in your iCache
the branch predictors are tuned to your code
your data is likely to be ready in your dCache
the TLB points to your code and data.
Unless:
You're running an SMT system (e.g. hyperthreaded), in which case the evil twin will "help" you by causing your code to be washed out, your branch predictors to be tuned to its code, its data to push yours out of the dCache, and your TLB to be impacted by its use.
The cost is unknown; each cache miss costs roughly ~4 ns, ~15 ns or ~75 ns for data depending on the level, and this quickly runs up to several thousand ns.
You still save on each reason (R) mentioned above that still holds.
If the evil twin also just spins, the costs should be much lower.
Or you're allowing interrupts on your core, in which case you get the same problems and:
your TLB is flushed
you take a 1000-20000 ns hit on the context switch; most should be in the low end if the drivers are well programmed.
Or you allow the OS to switch your process out, in which case you have the same problems as with interrupts, just at the high end of the range.
Switching out could also cause the thread to pause for the entire time slice, as it can only be run on one (or two) hardware threads.
Or you use any system calls that cause context switches.
No disk I/O at all.
Otherwise only asynchronous I/O.
Having more active (non-paused) threads than cores increases the likelihood of problems.
So if you need less than 100 ns latency to keep your application from exploding, you need to prevent or lessen the impact of SMT, interrupts and task switching on your core.
The perfect solution would be a real-time operating system with static scheduling. This is a nearly perfect match for your target, but it's a new world if you have mostly done server and desktop programming.
The disadvantages of locking a thread to a single core are:
It will cost some total throughput,
as some threads that might otherwise have run cannot, because the context could not be switched;
but latency is more important in this case.
If the thread gets context-switched out, it will take some time before it can be scheduled again, potentially one or more time slices, typically 10-16 ms, which is unacceptable in this application.
Locking it to a core and its SMT will lessen this problem, but not eliminate it. Each added core will lessen the problem.
Setting its priority higher will lessen the problem, but not eliminate it.
Scheduling with SCHED_FIFO at the highest priority will prevent most context switches; interrupts can still cause temporary switches, as do some system calls.
If you have a multi-CPU setup you might be able to take exclusive ownership of one of the CPUs through cpuset. This prevents other applications from using it.
Using pthread_setschedparam with SCHED_FIFO and the highest priority, running as superuser, and locking the thread to the core and its evil twin should secure the best latency of all of these; only a real-time operating system can eliminate all context switches.
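On Linux that might look roughly like the following sketch (it needs root or CAP_SYS_NICE, and the core pinning itself is the same pthread_setaffinity_np call sketched earlier):
#include <pthread.h>
#include <sched.h>

// Sketch: give the calling thread SCHED_FIFO at the highest priority.
// Combine with pinning the thread to a core (and avoiding its SMT twin).
bool MakeFifoRealtime()
{
    sched_param param{};
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param) == 0;
}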
Other links:
Discussion on interrupts.
Your Linux might accept that you call sched_setscheduler using SCHED_FIFO, but this demands that you use your own PID, not just a TID, or that your threads are cooperatively multitasking.
This might not be ideal, as all your threads would then only be switched "voluntarily", removing flexibility for the kernel to schedule them.
Interprocess communication in 100ns
Pinning a task to a specific processor will generally give better performance for the task. But there are a lot of nuances and costs to consider when doing so.
When you force affinity, you restrict the operating system's scheduling choices. You increase cpu contention for the remaining tasks. So EVERYTHING else on the system is impacted including the operating system itself. You also need to consider that if tasks need to communicate across memory, and affinities are set to cpus that don't share cache, you can drastically increase latency for communication across tasks.
One of the biggest reasons setting task CPU affinity is beneficial, though, is that it gives more predictable cache and TLB (translation lookaside buffer) behavior. When a task switches CPUs, the operating system can switch it to a CPU that doesn't have access to the last CPU's cache or TLB. This can increase cache misses for the task. It's particularly an issue when communicating across tasks, as it takes more time to communicate across higher-level caches and, worst of all, memory. To measure cache statistics on Linux (and performance in general) I recommend using perf.
The best suggestion is really to measure before you try to fix affinities. A good way to quantify latency would be by using the rdtsc instruction (at least on x86). This reads the cpu's time source, which will generally give the highest precision. Measuring across events will give roughly nanosecond accuracy.
#include <cstdint>

// Read the CPU's timestamp counter (".byte 0x0f, 0x31" is the rdtsc opcode).
uint64_t rdtsc() {
    uint32_t eax, edx;
    asm volatile (".byte 0x0f, 0x31" : "=d"(edx), "=a"(eax) : : );
    return ((uint64_t) edx << 32) | (uint64_t) eax;
}
Note: the rdtsc instruction needs to be combined with a load fence to ensure all previous instructions have completed (or use rdtscp).
Also note: if rdtsc is used without an invariant time source (on Linux, grep constant_tsc /proc/cpuinfo), you may get unreliable values across frequency changes and if the task switches CPU (time source).
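A measurement with that function might look roughly like this (a sketch; DoWorkUnderTest is a hypothetical stand-in for the code being timed, and the fences address the ordering note above):
#include <cstdint>
#include <x86intrin.h>   // _mm_lfence

void DoWorkUnderTest();  // hypothetical code being measured

uint64_t MeasureCycles()
{
    _mm_lfence();                  // wait for earlier instructions to finish
    uint64_t start = rdtsc();
    DoWorkUnderTest();
    _mm_lfence();
    uint64_t end = rdtsc();
    return end - start;            // elapsed TSC ticks, not nanoseconds
}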
So, in general, yes, setting the affinity does give lower latency, but this is not always true, and there are very serious costs when you do it.
Some additional reading...
Intel 64 Architecture Processor Topology Enumeration
What Every Programmer Needs to Know About Memory (Parts 2, 3, 4, 6, and 7)
Intel Software Developer Reference (Vol. 2A/2B)
Acquire and Release Fences
TCMalloc
I came across this question because I'm dealing with exactly the same design problem. I'm building HFT systems where every nanosecond counts.
After reading all the answers, I decided to implement and benchmark 4 different approaches:
busy wait with no affinity set
busy wait with affinity set
observer pattern
signals
The unbeatable winner was "busy wait with affinity set". No doubt about it.
Now, as many have pointed out, make sure to leave a couple of cores free to allow the OS to run freely.
My only concern at this point is whether there is some physical harm to cores that run at 100% for hours.
Binding a thread to a specific core is probably not the best way to get the job done. You can do that; it will not harm a multi-core CPU.
The best way to reduce latency is to raise the priority of the process and the polling thread(s). Normally the OS will interrupt your threads hundreds of times a second and let other threads run for a while, so your thread may not run for several milliseconds.
Raising the priority will reduce the effect (but not eliminate it).
Read more about SetThreadPriority and SetProcessPriorityBoost.
There are some details in the docs you need to understand.
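A minimal sketch of those calls (HIGH_PRIORITY_CLASS is an illustrative choice; REALTIME_PRIORITY_CLASS requires elevated rights):
#include <windows.h>

// Sketch: raise the process class and the polling thread's priority, and
// disable dynamic priority boosting so the level stays predictable.
void BoostPollingThread()
{
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
    SetProcessPriorityBoost(GetCurrentProcess(), TRUE);  // TRUE disables boosting
}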
This is simply foolish. All it does is reduce the scheduler's flexibility. Whereas before it could run it on whatever core it thought was best, now it can't. Unless the scheduler was written by idiots, it would only move the thread to a different core if it had a good reason to do that.
So you're just saying to the scheduler, "even if you have a really good reason to do this, don't do it anyway". Why would you say that?

Linux Timing across Kernel & User Space

I'm writing a kernel module for a special camera, working through V4L2 to handle transfer of frames to userspace code. Then I do lots of userspace stuff in the app.
Timing is very critical here, so I've been doing lots of performance profiling and plain old std::chrono::steady_clock stuff to track timing, but I've reached the point where I need to also collect timing data from the kernel side of things so that I can analyze the entire path from hardware interrupt through V4L DQBuf to userspace.
Can anyone recommend a good way to get high-resolution timing data that would be consistent with application userspace data, so that I could use it for such comparisons? Right now I'm measuring activity in microseconds.
Ubuntu 12.04 LTS
At the lowest level, there are the rdtsc and rdtscp instructions if you're on an x86/x86-64 processor. That should provide the lowest overhead, highest possible resolution across the kernel/userspace boundary.
However, there are things you need to worry about. You need to make sure you're executing on the same core/CPU, that the process isn't being context-switched, and that the frequency isn't changing across invocations. If the CPU supports an invariant TSC (constant_tsc in /proc/cpuinfo), it's a little more reliable across CPUs/cores and frequencies.
This should provide roughly nanosecond accuracy.
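A sketch using the compiler intrinsic, where the returned aux value lets you detect whether two reads happened on the same CPU (on Linux it usually encodes the CPU number):
#include <cstdint>
#include <x86intrin.h>   // __rdtscp

// Sketch: read the TSC together with the CPU it was read on, so a
// migration between two reads can be detected and the sample discarded.
static inline uint64_t ReadTscp(uint32_t &cpu)
{
    unsigned int aux = 0;
    uint64_t t = __rdtscp(&aux);   // waits for earlier instructions to complete
    cpu = aux;                     // IA32_TSC_AUX; on Linux, usually the CPU number
    return t;
}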
There are a lot of kernel-level utilities available that can get timing-related traces for you, e.g. ptrace, ftrace, LTTng, Kprobes. Check out this link for more information.
http://elinux.org/Kernel_Trace_Systems

How many threads can a C++ application create

I'd like to know how many threads a C++ application can create at most.
Do the OS, hardware caps and other factors influence these bounds?
[C++11: 1.10/1]: [..] Under a hosted implementation, a C++ program can have more than one thread running concurrently. [..] Under a freestanding implementation, it is implementation-defined whether a program can have more than one thread of execution.
[C++11: 30.3/1]: 30.3 describes components that can be used to create and manage threads. [ Note: These threads are intended to map one-to-one with operating system threads. —end note ]
So, basically, it's totally up to the implementation & OS; C++ doesn't care!
It doesn't even list a recommendation in Annex B "Implementation quantities"! (which seems like an omission, actually).
C++ as a language does not specify a maximum (or even a minimum beyond one). A particular implementation can, but I have never seen it done directly. The OS also can, but normally it just states something like "limited by system resources". Each thread uses up some nonpaged memory, selector-table entries and other bounded things, so you may run out of those. If you don't, the system will become pretty unresponsive if the threads actually do work.
Looking at it from the other side, real parallelism is limited by the actual cores in the system, and you should not have too many threads. Applications that could logically spawn hundreds or thousands usually use thread pools for good practical reasons.
Basically, there are no limits at the C++ application level. The maximum number of threads is more of an OS-level matter (based on your architecture and available memory).
On Linux, there is no limit on the maximum number of threads per process; the number of threads is limited system-wide. You can check the maximum number of allowed threads with:
cat /proc/sys/kernel/threads-max
On Windows you can use the testlimit tool to check the maximum number of threads:
http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx
On Mac OS, please read this table to find the number of threads based on your hardware configuration.
However, please keep in mind that you are on a multitasking system. The number of threads executing at the same time is limited by the total number of processor cores available. To do more things, the system switches between all these threads. Each "switch" has a performance cost (a few milliseconds). If your system is "switching" too much, it won't spend much time doing actual "work" and your overall system will be slow.
Generally, the limit on the number of threads is the amount of memory available, but there have been systems around that have lower limits.
Unless you go mad with creating threads, it's very unlikely that the limit will be a problem. Creating more threads is rarely beneficial once you reach a certain number - that number may be around the same as, or a few times higher than, the number of cores (which for really big, heavy hardware can be a few hundred these days, with 16-core processors and 8 sockets).
Threads that are CPU bound should not be more numerous than the number of processors - nothing good comes from that.
Threads that are doing I/O or otherwise "sitting around waiting" can be higher in number - 2-5 per processor core seems reasonable. Given that modern machines have 8 sockets and 16 cores at the higher end of the spectrum, that's still only around 1000 threads.
Sure, it's possible to design, say, a webserver system where each connection is a thread, and the system has 10k or 20k connections active at any given time. But it's probably not the most efficient.
I'd like to know how many threads a C++ application can create at most.
Implementation/OS-dependent.
Keep in mind that there were no threads in C++ prior to C++11.
Do the OS, hardware caps and other factors influence these bounds?
Yes.
The OS might be able to limit the number of threads a process can create.
The OS can limit the total number of threads running simultaneously (to prevent fork bombs, etc.; Linux can definitely do that).
Available physical (and virtual) memory will limit the number of threads you can create IF each thread allocates its own stack.
There can be a (possibly hard-coded) limit on how many thread "handles" the OS can provide.
The underlying OS/platform might not have threads at all (a real-mode compiler for DOS/FreeDOS or something similar).
Apart from the general impracticality of having many more threads than cores, yes, there are limits. For example, a system may keep a unique "process ID" for each thread, and there may be only 65535 of them available. Also, each thread will have its own stack, and those stacks will eventually consume too much memory (you can, however, adjust the size of each stack when you spawn threads).
Here's an informative article--ignore the fact that it mentions Windows, as the concepts are similar on other common systems: http://blogs.msdn.com/b/oldnewthing/archive/2005/07/29/444912.aspx
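To illustrate the stack-size point, creating a thread with a smaller per-thread stack might look roughly like this (a sketch using POSIX threads; 64 KiB is an arbitrary size that may be too small for real workloads):
#include <pthread.h>
#include <cstdio>

void* Worker(void*) { return nullptr; }

int main()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    // 64 KiB per thread instead of the default (often 8 MiB of reserved
    // address space), so many more threads fit in the address space.
    pthread_attr_setstacksize(&attr, 64 * 1024);

    pthread_t t;
    if (pthread_create(&t, &attr, Worker, nullptr) != 0)
        perror("pthread_create");
    else
        pthread_join(t, nullptr);
    pthread_attr_destroy(&attr);
}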
There is nothing in the C++ standard that limits the number of threads. However, the OS will certainly have a hard limit.
Having too many threads decreases the throughput of your application, so it's recommended that you use a thread pool.
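A common starting point is to size such a pool from std::thread::hardware_concurrency() (a sketch; the fallback of 4 is an arbitrary choice for when the call returns 0):
#include <functional>
#include <thread>
#include <vector>

// Sketch: spawn roughly one worker per hardware thread instead of one
// thread per task.
void RunWorkers(const std::function<void()>& worker)
{
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                    // the value is only a hint; pick a fallback
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
}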

Send an interrupt to the CPU as the keyboard does?

Is it possible to simulate hardware interrupts somehow from a user program?
I've seen this question posted many times, but it is never answered.
I want to know about low-level interrupts (for example, simulating the situation when a key is pressed on the keyboard, so that the keyboard driver would service the interrupt).
High-level events and APIs are outside the scope, and the question is rather theoretical than practical (to prevent "why" discussions :)
Yes and no.
On an x86 CPU (for one example) there's an int instruction that generates an interrupt. Once the interrupt is generated, the CPU won't necessarily[1] distinguish between an interrupt generated by hardware and one generated by software. For one example, in the original PC BIOS, IBM chose an interrupt that would cause the print-screen command to execute. The interrupt they chose (interrupt 5) was one that wasn't then in use, but which Intel had said was reserved for future use. Intel eventually did put that interrupt to use: in the 286 they added a bound instruction that checks that a value is within bounds, and generates an interrupt if it's not. The bound instruction is essentially never used, though, because it generates interrupt 5 if a value is out of bounds. This means (if you're running something like MS-DOS that allows it) that executing the bound instruction with a value that's out of bounds will print the screen.
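In such an environment, generating that interrupt from software is a single instruction. A sketch in GCC inline-assembly syntax (it only does something useful under something like MS-DOS, as described above):
int main()
{
    // Raise interrupt 5 in software - the print-screen vector described above.
    asm volatile ("int $0x05");
}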
On a modern OS, however, this won't generally be allowed. All generation and handling of interrupts happens in the kernel. The hardware has 4 levels of protection ("rings") and support for specifying the ring at which the int instruction can be executed. If you try to execute it from code running at ring 3, it won't execute directly -- instead, execution will switch to the OS kernel, which can treat it as it chooses.
This allows (for example) Windows to emulate MS-DOS, so MS-DOS programs (which do use the int instruction) can execute in a virtual machine, with virtualized input and output, so even though they "think" they're working directly with the keyboard and screen hardware, they're actually using emulations of them provided by software.
For "native" programs, however, using most int instructions (i.e. any but a tiny number of interrupts intended for communication with the kernel) will simply result in the program being shut down.
So, bottom line: yes, the hardware supports it -- but the hardware also supports prohibiting it, and nearly every modern OS does exactly that, at least for most code outside the OS kernel itself.
[1] Though, with typical hardware, the interrupt handler can read data from the programmable interrupt controller (PIC) chip that will tell it whether the interrupt came through the PIC (i.e., a hardware interrupt) or not (a software interrupt). Most hardware also supports at least a few interrupts that can be generated only by hardware, such as NMI on the x86. These are usually reserved for fairly narrow uses, though (e.g., NMI on a PC is normally used for things like memory parity errors).