Zero MQ with C++ bindings and Open MP blocking issue. Why?

Zero MQ with C++ bindings and Open MP blocking issue. Why? - c++

I wrote a test for ZeroMQ to convince myself that it manages to map replies to the client independent from processing order, which would prove it thread safe.
It is a multi-threaded server, which just throws the received messages back at the sender. The client sends some messages from several threads and checks, if it receives the same message back. For multi-threading I use OpenMP.
That test worked fine and I wanted to move on and re-implement it with C++ bindings for ZeroMQ. And now it doesn't work in the same way anymore.
Here's the code with ZMQPP:
#include <gtest/gtest.h>
#include <zmqpp/zmqpp.hpp>
#include <zmqpp/proxy.hpp>
TEST(zmqomp, order) {
zmqpp::context ctx;
std::thread proxy([&ctx] {
zmqpp::socket dealer(ctx, zmqpp::socket_type::xrequest);
zmqpp::socket router(ctx, zmqpp::socket_type::xreply);
router.bind("tcp://*:1234");
dealer.bind("inproc://workers");
zmqpp::proxy(router, dealer);
});
std::thread worker_starter([&ctx] {
#pragma omp parallel
{
zmqpp::socket in(ctx, zmqpp::socket_type::reply);
in.connect("inproc://workers");
#pragma omp for
for (int i = 0; i < 1000; i++) {
std::string request;
in.receive(request);
in.send(request);
}
}
});
std::thread client([&ctx] {
#pragma omp parallel
{
zmqpp::socket out(ctx, zmqpp::socket_type::request);
out.connect("tcp://localhost:1234");
#pragma omp for
for (int i = 0; i < 1000; i++) {
std::string msg("Request " + std::to_string(i));
out.send(msg);
std::string reply;
out.receive(reply);
EXPECT_EQ(reply, msg);
}
}
});
client.join();
worker_starter.join();
ctx.terminate();
proxy.join();
}
The test blocks and doesn't get executed to the end. I played around with #pragmas a little bit and found out that only one change can "fix" it:
//#pragma omp parallel for
for (int i = 0; i < 250; i++) {
The code is still getting executed parallel in that case, but I have to divide the loop executions number by a number of my physical cores.
Does anybody have a clue what's going on here?

Prologue: ZeroMQ is by-definition and by-design NOT Thread-Safe.
This normally does not matter as there are some safe-guarding design practices, but situation here goes even worse, once following the proposed TEST(){...} design.
Having spent some time with ZeroMQ, your proposal headbangs due to violations on several principal things, that otherwise help distributed architectures to work smarter, than a pure SEQ of monolithic code.
ZeroMQ convinces in ( almost ) every third paragraph to avoid sharing of resources. Zero-sharing is one of the ZeroMQ's fabulous scalable performance and minimised latency maxims, so to say in short.
So one has better to avoid sharing zmq.Context() instance at all ( unless one knows pretty well, why and how the things work under the hood ).
Thus an attempt to fire 1000-times ( almost ) in parallel ( well, not a true PAR ) some flow of events onto a shared instance of zmq.Context ( the less once it was instantiated with default parameters and having none performance tuning adaptations ) will certainly suffer from doing the very opposite from what is, performance-wise and design-wise, recommended to do.
What are some of the constraints, not to headbang into?
1) Each zmq.Context() instance has a limited amount of I/O-threads, that were created during the instantiation process. Once a fair design needs some performance-tuning, it is possible to increase such number of I/O-threads and data-pumps will work that better ( sure, none amount of data-pumps will salvage a poor, the less a disastrous design / architecture of a distributed computing system. This is granted. ).
2) Each zmq.Socket() instance has an { implicit | explicit } mapping onto a respective I/O-thread ( Ref. 1) ). Once a fair design needs some increased robustness against sluggish event-loop handlings or against other adverse effects arisen from data-flow storms ( or load-balancing or you name it ), there are chances to benefit from a divide-and-conquer approach to use .setsockopt( zmq.AFFINITY, ... ) method to directly map each zmq.Socket() instance onto a respective I/O-thread, and remain thus in control of what buffering and internal queues are fighting for which resources during the real operations. In any case, where a total amount of threads goes over the localhost number of cores, the just-CONCURRENT scheduling is obvious ( so a dream of a true PAR execution is principally and inadvertently lost. This is granted. ).
3) Each zmq.Socket() has also a pair of "Hidden Queue Devastators", called High-Watermarks. These get set either { implicitly | explicitly }, the latter being for sure a wiser manner for performance tuning. Why Devastators? Because these stabilise and protect the distributed computing systems from overflows and are permitted to simply discard each and every message above the HWM level(s) so as to protect the systems capability to run forever, even under heavy storms, spurious blasts of crippled packets or DDoS-types of attack. There are many tools for tuning this domain of ZeroMQ Context()-instance's behaviour, which go beyond the scope of this answer ( Ref.: other my posts on ZeroMQ AFFINITY benefits or the ZeroMQ API specifications used in .setsockopt() method ).
4) Each tcp:// transport-class based zmq.Socket() instance has also inherited some O/S dependent heritage. Some O/S demonstrate this risk by extended accumulation of ip-packets ( outside of any ZeroMQ control ) until some threshold got passed and thus a due design care ought be taken for such cases to avoid adverse effects on the intended application signalling / messaging dynamics and robustness against such uncontrollable ( exosystem ) buffering habits.
5) Each .recv() and .send() method call is by-definition blocking, a thing a massively distributed computing system ought never risk to enter into. Never ever. Even in a school-book example. Rather use non-blocking form of these calls. Always. This is granted.
6) Each zmq.Socket() instance ought undertake a set of careful and graceful termination steps. A preventive step of .setsockopt( zmq.LINGER, 0 ) + an explicit .close() methods are fair to be required to be included in every use-case ( and made robust to get executed irrespective of any exceptions that may get appeared. ). A poor { self- | team- }-discipline in this practice is a sure ticket into hanging up the whole application infrastructure due to just not paying due care on a mandatory resources management policy. This is a must-have part of any serious distributed computing Project. Even the school-book examples ought have this. No exceptions. No excuse. This is granted.

Related

Best way to implement a periodic linux task in c++20

I have a periodic task in c++, running on an embedded linux platform, and have to run at 5 ms intervals. It seems to be working as expected, but is my current solution good enough?
I have implemented the scheduler using sleep_until(), but some comments I have received is that setitimer() is better. As I would like the application to be at least some what portable, I would prefer c++ standard... of course unless there are other problems.
I have found plenty of sites that show implementation with each, but I have not found any arguments for why one solution is better than the other. As I see it, sleep_until() will implement an "optimal" on any (supported) platform, and I'm getting a feeling the comments I have received are focused more on usleep() (which I do not use).
My implementation looks a little like this:
bool is_submilli_capable() {
return std::ratio_greater<std::milli,
std::chrono::system_clock::period>::value;
}
int main() {
if (not is_submilli_capable())
exit(1);
while (true) {
auto next_time = next_period_start();
do_the_magic();
std::this_thread::sleep_until(next_time);
}
}
A short summoning of the issue.
I have an embedded linux platform, build with yocto and with RT capabilities
The application need to read and process incoming data every 5 ms
Building with gcc 11.2.0
Using c++20
All the "hard work" is done in separate threads, so this question is only regards triggering the task periodically and with minimal jitter

Since the application is supposed to read and process the data every 5 ms, it is possible that a few times, it does not perform the required operations. What I mean to say is that in a time interval of 20 ms, do_the_magic() is supposed to be invoked 4 times... But if the time taken to execute do_the_magic() is 10 ms, it will get invoked only 2 times. If that is an acceptable outcome, the current implementation is good enough.
Since the application is reading data, it probably receives it from the network or disk. And adding the overhead of processing it, it likely takes more than 5 ms to do so (depending on the size of the data). If it is not acceptable to miss out on any invocation of do_the_magic, the current implementation is not good enough.
What you could probably do is create a few threads. Each thread executes the do_the_magic function and then goes to sleep. Every 5 ms, you wake a sleeping thread which will most likely take less than 5 ms to happen. This way no invocation of do_the_magic is missed. Also, the number of threads depends on how long will do_the_magic take to execute.
bool is_submilli_capable() {
return std::ratio_greater<std::milli,
std::chrono::system_clock::period>::value;
}
void wake_some_thread () {
static int i = 0;
release_semaphore (i); // Release semaphore associated with thread i
i++;
i = i % NUM_THREADS;
}
void * thread_func (void * args) {
while (true) {
// Wait for a semaphore
do_the_magic();
}
int main() {
if (not is_submilli_capable())
exit(1);
while (true) {
auto next_time = next_period_start();
wake_some_thread (); // Releases a semaphore to wake a thread
std::this_thread::sleep_until(next_time);
}
Create as many semaphores as the number of threads where thread i is waiting for semaphore i. wake_some_thread can then release a semaphore starting from index 0 till NUM_THREADS and start again.

5ms is a pretty tight timing.
You can get a jitter-free 5ms tick only if you do the following:
Isolate a CPU for this thread. Configure it with nohz_full and rcu_nocbs
Pin your thread to this CPU, assign it a real-time scheduling policy (e.g., SCHED_FIFO)
Do not let any other threads run on this CPU core.
Do not allow any context switches in this thread. This includes avoiding system calls altogether. I.e., you cannot use std::this_thread::sleep_until(...) or anything else.
Do a busy wait in between processing (ensure 100% CPU utilisation)
Use lock-free communication to transfer data from this thread to other, non-real-time threads, e.g., for storing the data to files, accessing network, logging to console, etc.
Now, the question is how you're going to "read and process data" without system calls. It depends on your system. If you can do any user-space I/O (map the physical register addresses to your process address space, use DMA without interrupts, etc.) - you'll have a perfectly real-time processing. Otherwise, any system call will trigger a context switch, and latency of this context switch will be unpredictable.
For example, you can do this with certain Ethernet devices (SolarFlare, etc.), with 100% user-space drivers. For anything else you're likely to have to write your own user-space driver, or even implement your own interrupt-free device (e.g., if you're running on an FPGA SoC).

Best practice to implement fixed byte serial protocol in C++?

I have a device connected via serial interface to a BeagleBone computer. I communicates in a simple binary format like
|MessagID (1 Byte) | Data (n Bytes) | checksum (2 bytes) |
The message length is fixed for each command, meaning that it is known how many bytes to read after the First byte of a command was received. After some initial setup communication it sends packets of data every 20 ms.
My approach would be to use either termios or something like serial lib and then start a loop doing like that (a:
while(keepRunning)
{
char* buffer[256];
serial.read(buffer, 1)
switch(buffer[0])
{
case COMMAND1:
serial.read(&buffer[1], sizeof(MessageHello)+2); //Read data + checksum
if (calculateChecksum(buffer, sizeof(MessageHello)+3) )
{
extractDatafromCommand(buffer);
}
else
{
doSomeErrorHandling(buffer[0]);
}
break;
case COMMAND2:
serial.read(&buffer[1], sizeof(MessageFoo)+2);
[...]
}
}
extractDatafromCommand would then create some structs like:
struct MessageHello
{
char name[20];
int version;
}
Put everything in an own read thread and signal the availability of a new packet to other parts of the program using a semaphore (or a simple flag).
Is this a viable solution or are there better improvements to do (I assume so)?
Maybe make a abstract class Message and derive the other messages?

It really depends. The two major ways would be threaded (like you mentioned) and evented.
Threaded code is tricky because you can easily introduce race conditions. Code that you tested a million times could occasionally stumble and do the wrong thing after working for days or weeks or years. It's hard to 'prove' that things will always behave correctly. Seemingly trivial things like "i++" suddenly become leaky abstractions. (See why is i++ not thread safe on a single core machine? )
The other alternative is evented programming. Basically, you have a main loop that does a select() on all your file handles. Anything that is ready gets looked at, and you try to read/write as many bytes as you can without blocking. (pass O_NONBLOCK). There are two tricky parts: 1) you must never do long calculations without having a way to yield back to the main loop, and 2) you must never do a blocking operation (where the kernel stops your process waiting for a read or write).
In practice, most programs don't have long computations and it's easier to audit a small amount of your code for blocking calls than for races. (Although doing DNS without blocking is trickier than it should be.)
The upside of evented code is that there's no need for locking (no other threads to worry about) and it wastes less memory (in the general case where you're creating lots of threads.)
Most likely, you want to use a serial lib. termios processing is just overhead and a chance for stray bytes to do bad things.

Simulating CPU Load In C++

I am currently writing an application in Windows using C++ and I would like to simulate CPU load.
I have the following code:
void task1(void *param) {
unsigned elapsed =0;
unsigned t0;
while(1){
if ((t0=clock())>=50+elapsed){//if time elapsed is 50ms
elapsed=t0;
Sleep(50);
}
}
}
int main(){
int ThreadNr;
for(int i=0; i < 4;i++){//for each core (i.e. 4 cores)
_beginthread( task1, 0, &ThreadNr );//create a new thread and run the "task1" function
}
while(1){}
}
I wrote this code using the same methodology as in the answers given in this thread: Simulate steady CPU load and spikes
My questions are:
Have I translated the C# code from the other post correctly over to C++?
Will this code generate an average CPU load of 50% on a quad-core processor?
How can I, within reasonable accuracy, find out the load percentage of the CPU? (is task manager my only option?)
EDIT: The reason I ask this question is that I want to eventually be able to generate CPU loads of 10,20,30,...,90% within a reasonable tolerance. This code seems to work well for to generate loads 70%< but seems to be very inaccurate at any load below 70% (as measured by the task manager CPU load readings).
Would anyone have any ideas as to how I could generate said loads but still be able to use my program on different computers (i.e. with different CPUs)?

At first sight, this looks like not-pretty-but-correct C++ or C (an easy way to be sure is to compile it). Includes are missing (<windows.h>, <process.h>, and <time.h>) but otherwise it compiles fine.
Note that clock and Sleep are not terribly accurate, and Sleep is not terribly reliable either. On the average, the thread function should kind of work as intended, though (give or take a few percent of variation).
However, regarding question 2) you should replace the last while(1){} with something that blocks rather than spins (e.g. WaitForSingleObject or Sleep if you will). otherwise the entire program will not have 50% load on a quadcore. You will have 100% load on one core due to the main thread, plus the 4x 50% from your four workers. This will obviously sum up to more than 50% per core (and will cause threads to bounce from one core to the other, resulting in nasty side effects).
Using Task Manager or a similar utility to verify whether you get the load you want is a good option (and since it's the easiest solution, it's also the best one).
Also do note that simulating load in such a way will probably kind of work, but is not 100% reliable.
There might be effects (memory, execution units) that are hard to predict. Assume for example that you're using 100% of the CPU's integer execution units with this loop (reasonable assumption) but zero of it's floating point or SSE units. Modern CPUs may share resources between real or logical cores, and you might not be able to predict exactly what effects you get. Or, another thread may be memory bound or having significant page faults, so taking away CPU time won't affect it nearly as much as you think (might in fact give it enough time to make prefetching work better). Or, it might block on AGP transfers. Or, something else you can't tell.
EDIT:
Improved version, shorter code that fixes a few issues and also works as intended:
Uses clock_t for the value returned by clock (which is technically "more correct" than using a not specially typedef'd integer. Incidentially, that's probably the very reason why the original code does not work as intended, since clock_t is a signed integer under Win32. The condition in if() always evaluates true, so the workers sleep almost all the time, consuming no CPU.
Less code, less complicated math when spinning. Computes a wakeup time 50 ticks in the future and spins until that time is reached.
Uses getchar to block the program at the end. This does not burn CPU time, and it allows you to end the program by pressing Enter. Threads are not properly ended as one would normally do, but in this simple case it's probably OK to just let the OS terminate them as the process exits.
Like the original code, this assumes that clock and Sleep use the same ticks. That is admittedly a bold assumption, but it holds true under Win32 which you used in the original code (both "ticks" are milliseconds). C++ doesn't have anything like Sleep (without boost::thread, or C++11 std::thread), so if non-Windows portability is intended, you'd have to rethink anyway.
Like the original code, it relies on functions (clock and Sleep) which are unprecise and unreliable. Sleep(50) equals Sleep(63) on my system without using timeBeginPeriod. Nevertheless, the program works "almost perfectly", resulting in a 50% +/- 0.5% load on my machine.
Like the original code, this does not take thread priorities into account. A process that has a higher than normal priority class will be entirely unimpressed by this throttling code, because that is how the Windows scheduler works.
#include <windows.h>
#include <process.h>
#include <time.h>
#include <stdio.h>
void task1(void *)
{
while(1)
{
clock_t wakeup = clock() + 50;
while(clock() < wakeup) {}
Sleep(50);
}
}
int main(int, char**)
{
int ThreadNr;
for(int i=0; i < 4; i++) _beginthread( task1, 0, &ThreadNr );
(void) getchar();
return 0;
}

Here is an a code sample which loaded my CPU to 100% on Windows.
#include "windows.h"
DWORD WINAPI thread_function(void* data)
{
float number = 1.5;
while(true)
{
number*=number;
}
return 0;
}
void main()
{
while (true)
{
CreateThread(NULL, 0, &thread_function, NULL, 0, NULL);
}
}
When you build the app and run it, push Ctrl-C to kill the app.

You can use the Windows perf counter API to get the CPU load. Either for the entire system or for your process.

QueryPerformanceCounter and overflows

I'm using QueryPerformanceCounter to do some timing in my application. However, after running it for a few days the application seems to stop functioning properly. If I simply restart the application it starts working again. This makes me a believe I have an overflow problem in my timing code.
// Author: Ryan M. Geiss
// http://www.geisswerks.com/ryan/FAQS/timing.html
class timer
{
public:
timer()
{
QueryPerformanceFrequency(&freq_);
QueryPerformanceCounter(&time_);
}
void tick(double interval)
{
LARGE_INTEGER t;
QueryPerformanceCounter(&t);
if (time_.QuadPart != 0)
{
int ticks_to_wait = static_cast<int>(static_cast<double>(freq_.QuadPart) * interval);
int done = 0;
do
{
QueryPerformanceCounter(&t);
int ticks_passed = static_cast<int>(static_cast<__int64>(t.QuadPart) - static_cast<__int64>(time_.QuadPart));
int ticks_left = ticks_to_wait - ticks_passed;
if (t.QuadPart < time_.QuadPart) // time wrap
done = 1;
if (ticks_passed >= ticks_to_wait)
done = 1;
if (!done)
{
// if > 0.002s left, do Sleep(1), which will actually sleep some
// steady amount, probably 1-2 ms,
// and do so in a nice way (cpu meter drops; laptop battery spared).
// otherwise, do a few Sleep(0)'s, which just give up the timeslice,
// but don't really save cpu or battery, but do pass a tiny
// amount of time.
if (ticks_left > static_cast<int>((freq_.QuadPart*2)/1000))
Sleep(1);
else
for (int i = 0; i < 10; ++i)
Sleep(0); // causes thread to give up its timeslice
}
}
while (!done);
}
time_ = t;
}
private:
LARGE_INTEGER freq_;
LARGE_INTEGER time_;
};
My question is whether the code above should work deterministically for weeks of running continuously?
And if not where the problem is? I thought the overflow was handled by
if (t.QuadPart < time_.QuadPart) // time wrap
done = 1;
But maybe thats not enough?
EDIT: Please observe that I did not write the original code, Ryan M. Geiss did, the link to the original source of the code is in the code.

QueryPerformanceCounter is notorious for its unreliability. It's fine to use for individual short-interval timing, if you're prepared to handle abnormal results. It is not exact - It's typically based on the PCI bus frequency, and a heavily loaded bus can lead to lost ticks.
GetTickCount is actually more stable, and can give you 1ms resolution if you've called timeBeginPeriod. It will eventually wrap, so you need to handle that.
__rdtsc should not be used, unless you're profiling and have control of which core you're running on and are prepared to handle variable CPU frequency.
GetSystemTime is decent for longer periods of measurements, but will jump when the system time is adjusted.
Also, Sleep(0) does not do what you think it does. It will yield the cpu if another context wants it - otherwise it'll return immediately.
In short, timing on windows is a mess. One would think that today it'd be possible to get accurate long-term timing from a computer without going through hoops - but this isn't the case. In our game framework we're using several time sources and corrections from the server to ensure all connected clients have the same game time, and there's a lot of bad clocks out there.
Your best bet would likely be to just use GetTickCount or GetSystemTime, wrap it into something that adjusts for time jumps/wrap arounds.
Also, you should convert your double interval to an int64 milliseconds and then use only integer math - this avoids problems due to floating point types' varying accuracy based on their contents.

Based on your comment, you probably should be using Waitable Timers instead.
See the following examples:
Using Waitable Timer Objects
Using Waitable Timers with an Asynchronous Procedure Call

Performance counters are 64-bit, so they are large enough for years of running continuously. For example, if you assume the performance counter increments 2 billion times each second (some imaginary 2 GHz processor) it will overflow in about 290 years.

Using a nanosecond-scale timer to control something like Sleep() that at best is precise to several milliseconds (and usually, several dozen milliseconds) is somewhat controversary anyway.
A different approach you might consider would be to use WaitForSingleObject or a similar function. This burns less CPU cycles, causes a trillion fewer context switches over the day, and is more reliable than Sleep(0), too.
You could for example create a semapore and never touch it in normal operation. The semaphore exists only so you can wait on something, if you don't have anything better to wait on. Then you can specify a timeout in milliseconds up to 49 days long with a single syscall. And, it will not only be less work, it will be much more accurate too.
The advantage is that if "something happens", so you want to break up earlier than that, you only need to signal the semaphore. The wait call will return instantly, and you will know from the WAIT_OBJECT_0 return value that it was due to being signaled, not due to time running out. And all that without complicated logic and counting cycles.

The problem you asked about most directly:
if (t.QuadPart < time_.QuadPart)
should instead be this:
if (t.QuadPart - time_.QuadPart < 0)
The reason for that is that you want to look for wrapping in relative time, not absolute time. Relative time will wrap (1ull<<63) time units after the reference call to QPC. Absolute time might wrap (1ull<<63) time units after reboot, but it could wrap at any other time it felt like it, that's undefined.
QPC is a little bugged on some systems (older RDTSC-based QPCs on early multicore CPUs, for instance) so it may be desirable to allow small negative time deltas like so:
if (t.QuadPart - time_.QuadPart < -1000000) //time wrap
An actual wrap will produce a very large negative time deltas, so that's safe. It shouldn't be necessary on modern systems, but trusting microsoft is rarely a good idea.
...
However, the bigger problem there with time wrapping is in the fact that ticks_to_wait, ticks_passed, and ticks_left are all int, not LARGE_INT or long long like they should be. This makes most of that code wrap if any significant time periods are involved - and "significant" in this context is platform dependent, it can be on the order of 1 second in a few (rare these days) cases, or even less on some hypothetical future system.
Other issues:
if (time_.QuadPart != 0)
Zero is not a special value there, and should not be treated as such. My guess is that the code is conflating QPC returning a time of zero with QPCs return value being zero. The return value is not the 64 bit time passed by pointer, it's the BOOL that QPC actually returns.
Also, that loop of Sleep(0) is foolish - it appears to be tuned to behave correctly only on a particular level of contention and a particular per-thread CPU performance. If you need resolution that's a horrible idea, and if you don't need resolution then that entire function should have just been a single call to Sleep.

How much footprint does C++ exception handling add

This issue is important especially for embedded development. Exception handling adds some footprint to generated binary output. On the other hand, without exceptions the errors need to be handled some other way, which requires additional code, which eventually also increases binary size.
I'm interested in your experiences, especially:
What is average footprint added by your compiler for the exception handling (if you have such measurements)?
Is the exception handling really more expensive (many say that), in terms of binary output size, than other error handling strategies?
What error handling strategy would you suggest for embedded development?
Please take my questions only as guidance. Any input is welcome.
Addendum: Does any one have a concrete method/script/tool that, for a specific C++ object/executable, will show the percentage of the loaded memory footprint that is occupied by compiler-generated code and data structures dedicated to exception handling?

When an exception occurs there will be time overhead which depends on how you implement your exception handling. But, being anecdotal, the severity of an event that should cause an exception will take just as much time to handle using any other method. Why not use the highly supported language based method of dealing with such problems?
The GNU C++ compiler uses the zero–cost model by default i.e. there is no time overhead when exceptions don't occur.
Since information about exception-handling code and the offsets of local objects can be computed once at compile time, such information can be kept in a single place associated with each function, but not in each ARI. You essentially remove exception overhead from each ARI and thus avoid the extra time to push them onto the stack. This approach is called the zero-cost model of exception handling, and the optimized storage mentioned earlier is known as the shadow stack. - Bruce Eckel, Thinking in C++ Volume 2
The size complexity overhead isn't easily quantifiable but Eckel states an average of 5 and 15 percent. This will depend on the size of your exception handling code in ratio to the size of your application code. If your program is small then exceptions will be a large part of the binary. If you are using a zero–cost model than exceptions will take more space to remove the time overhead, so if you care about space and not time than don't use zero-cost compilation.
My opinion is that most embedded systems have plenty of memory to the extent that if your system has a C++ compiler you have enough space to include exceptions. The PC/104 computer that my project uses has several GB of secondary memory, 512 MB of main memory, hence no space problem for exceptions - though, our micorcontrollers are programmed in C. My heuristic is "if there is a mainstream C++ compiler for it, use exceptions, otherwise use C".

Measuring things, part 2. I have now got two programs. The first is in C and is compiled with gcc -O2:
#include <stdio.h>
#include <time.h>
#define BIG 1000000
int f( int n ) {
int r = 0, i = 0;
for ( i = 0; i < 1000; i++ ) {
r += i;
if ( n == BIG - 1 ) {
return -1;
}
}
return r;
}
int main() {
clock_t start = clock();
int i = 0, z = 0;
for ( i = 0; i < BIG; i++ ) {
if ( (z = f(i)) == -1 ) {
break;
}
}
double t = (double)(clock() - start) / CLOCKS_PER_SEC;
printf( "%f\n", t );
printf( "%d\n", z );
}
The second is C++, with exception handling, compiled with g++ -O2:
#include <stdio.h>
#include <time.h>
#define BIG 1000000
int f( int n ) {
int r = 0, i = 0;
for ( i = 0; i < 1000; i++ ) {
r += i;
if ( n == BIG - 1 ) {
throw -1;
}
}
return r;
}
int main() {
clock_t start = clock();
int i = 0, z = 0;
for ( i = 0; i < BIG; i++ ) {
try {
z += f(i);
}
catch( ... ) {
break;
}
}
double t = (double)(clock() - start) / CLOCKS_PER_SEC;
printf( "%f\n", t );
printf( "%d\n", z );
}
I think these answer all the criticisms made of my last post.
Result: Execution times give the C version a 0.5% edge over the C++ version with exceptions, not the 10% that others have talked about (but not demonstrated)
I'd be very grateful if others could try compiling and running the code (should only take a few minutes) in order to check that I have not made a horrible and obvious mistake anywhere. This is knownas "the scientific method"!

I work in a low latency environment. (sub 300 microseconds for my application in the "chain" of production) Exception handling, in my experience, adds 5-25% execution time depending on the amount you do!
We don't generally care about binary bloat, but if you get too much bloat then you thrash like crazy, so you need to be careful.
Just keep the binary reasonable (depends on your setup).
I do pretty extensive profiling of my systems.
Other nasty areas:
Logging
Persisting (we just don't do this one, or if we do it's in parallel)

I guess it'd depend on the hardware and toolchain port for that specific platform.
I don't have the figures. However, for most embedded developement, I have seen people chucking out two things (for VxWorks/GCC toolchain):
Templates
RTTI
Exception handling does make use of both in most cases, so there is a tendency to throw it out as well.
In those cases where we really want to get close to the metal, setjmp/longjmp are used. Note, that this isn't the best solution possible (or very powerful) probably, but then that's what _we_ use.
You can run simple tests on your desktop with two versions of a benchmarking suite with/without exception handling and get the data that you can rely on most.
Another thing about embedded development: templates are avoided like the plague -- they cause too much bloat. Exceptions tag along templates and RTTI as explained by Johann Gerell in the comments (I assumed this was well understood).
Again, this is just what we do. What is it with all the downvoting?

One thing to consider: If you're working in an embedded environment, you want to get the application as small as possible. The Microsoft C Runtime adds quite a bit of overhead to programs. By removing the C runtime as a requirement, I was able to get a simple program to be a 2KB exe file instead of a 70-something kilobyte file, and that's with all the optimizations for size turned on.
C++ exception handling requires compiler support, which is provided by the C runtime. The specifics are shrouded in mystery and are not documented at all. By avoiding C++ exceptions I could cut out the entire C runtime library.
You might argue to just dynamically link, but in my case that wasn't practical.
Another concern is that C++ exceptions need limited RTTI (runtime type information) at least on MSVC, which means that the type names of your exceptions are stored in the executable. Space-wise, it's not an issue, but it just 'feels' cleaner to me to not have this information in the file.

It's easy to see the impact on binary size, just turn off RTTI and exceptions in your compiler. You'll get complaints about dynamic_cast<>, if you're using it... but we generally avoid using code that depends on dynamic_cast<> in our environments.
We've always found it to be a win to turn off exception handling and RTTI in terms of binary size. I've seen many different error handling methods in the absence of exception handling. The most popular seems to be passing failure codes up the callstack. In our current project we use setjmp/longjmp but I'd advise against this in a C++ project as they won't run destructors when exiting a scope in many implementations. If I'm honest I think this was a poor choice made by the original architects of the code, especially considering that our project is C++.

In my opinion exception handling is not something that's generally acceptable for embedded development.
Neither GCC nor Microsoft have "zero-overhead" exception handling. Both compilers insert prologue and epilogue statements into each function that track the scope of execution. This leads to a measurable increase in performance and memory footprint.
The performance difference is something like 10% in my experience, which for my area of work (realtime graphics) is a huge amount. The memory overhead was far less but still significant - I can't remember the figure off-hand but with GCC/MSVC it's easy to compile your program both ways and measure the difference.
I've seen some people talk about exception handling as an "only if you use it" cost. Based on what I've observed this just isn't true. When you enable exception handling it affects all code, whether a code path can throw exceptions or not (which makes total sense when you consider how a compiler works).
I would also stay away from RTTI for embedded development, although we do use it in debug builds to sanity check downcasting results.

Define 'embedded'. On an 8-bit processor I would not certainly not work with exceptions (I would certainly not work with C++ on an 8-bit processor). If you're working with a PC104 type board that is powerful enough to have been someone's desktop a few years back then you might get away with it. But I have to ask - why are there exceptions? Usually in embedded applications anything like an exception occurring is unthinkable - why didn't that problem get sorted out in testing?
For instance, is this in a medical device? Sloppy software in medical devices has killed people. It is unacceptable for anything unplanned to occur, period. All failure modes must be accounted for and, as Joel Spolsky said, exceptions are like GOTO statements except you don't know where they're called from. So when you handle your exception, what failed and what state is your device in? Due to your exception is your radiation therapy machine stuck at FULL and is cooking someone alive (this has happened IRL)? At just what point did the exception happen in your 10,000+ lines of code. Sure you may be able to cut that down to perhaps 100 lines of code but do you know the significance of each of those lines causing an exception is?
Without more information I would say do NOT plan for exceptions in your embedded system. If you add them then be prepared to plan the failure modes of EVERY LINE OF CODE that could cause an exception. If you're making a medical device then people die if you don't. If you're making a portable DVD player, well, you've made a bad portable DVD player. Which is it?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js