When should you not use [[carries_dependency]]?

When should you not use [[carries_dependency]]? - c++

I've found questions (like this one) asking what [[carries_dependency]] does, and that's not what I'm asking here.
I want to know when you shouldn't use it, because the answers I've read all make it sound like you can plaster this code everywhere and magically you'd get equal or faster code. One comment said the code can be equal or slower, but the poster didn't elaborate.
I imagine appropriate places to use this is on any function return or parameter that is a pointer or reference and that will be passed or returned within the calling thread, and it shouldn't be used on callbacks or thread entry points.
Can someone comment on my understanding and elaborate on the subject in general, of when and when not to use it?
EDIT: I know there's this tome on the subject, should any other reader be interested; it may contain my answer, but I haven't had the chance to read through it yet.

In modern C++ you should generally not use std::memory_order_consume or [[carries_dependency]] at all. They're essentially deprecated while the committee comes up with a better mechanism that compilers can practically implement.
And that hopefully doesn't require sprinkling [[carries_dependency]] and kill_dependency all over the place.
2016-06 P0371R1: Temporarily discourage memory_order_consume
It is widely accepted that the current definition of memory_order_consume in the standard is not useful. All current compilers essentially map it to memory_order_acquire. The difficulties appear to stem both from the high implementation complexity, from the fact that the current definition uses a fairly general definition of "dependency", thus requiring frequent and inconvenient use of the kill_dependency call, and from the frequent need for [[carries_dependency]] annotations. Details can be found in e.g. P0098R0.
Notably that in C++ x - x still carries a dependency but most compilers would naturally break the dependency and replace that expression with a constant 0. But also compilers do sometimes turn data dependencies into control dependencies if they can prove something about value-ranges after a branch.
On modern compilers that just promote mo_consume to mo_acquire, fully aggressive optimizations can always happen; there's never anything to gain from [[carries_dependency]] and kill_dependency even in code that uses mo_consume, let alone in other code.
This strengthening to mo_acquire has potentially-significant performance cost (an extra barrier) for real use-cases like RCU on weakly-ordered ISAs like POWER and ARM. See this video of Paul E. McKenney's CppCon 2015 talk C++ Atomics: The Sad Story of memory_order_consume. (Link includes a summary).
If you want real dependency-ordering read-only performance, you have to "roll your own", e.g. by using mo_relaxed and checking the asm to verify it compiled to asm with a dependency. (Avoid doing anything "weird" with such a value, like passing it across functions.) DEC Alpha is basically dead and all other ISAs provide dependency ordering in asm without barriers, as long as the asm itself has a data dependency.
If you don't want to roll your own and live dangerously, it might not hurt to keep using mo_consume in "simple" use-cases where it should be able to work; perhaps some future mo_consume implementation will have the same name and work in a way that's compatible with C++11.
There is ongoing work on making a new consume, e.g. 2018's http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html

because the answers I've read all make it sound like you can plaster
this code everywhere and magically you'd get equal or faster code
The only way you can get faster code is when that annotation allows the omission of a fence.
So the only case where it could possibly be useful is:
your program uses consume ordering on an atomic load operation, in an important frequently executed code;
the "consume value" isn't just used immediately and locally, but also passed to other functions;
the target CPU gives specific guarantees for consuming operations (as strong as a given fence before that operation, just for that operation);
the compiler writers take their job seriously: they manage to translate high level language consuming of a value to CPU level consuming, to get the benefit from CPU guarantees.
That's a bunch of necessary conditions to possibly get measurably faster code.
(And the latest trend in the C++ community is to give up inventing a proper compiling scheme that's safe in all cases and to come up with a completely different way for the user to instruct the compiler to produce code that "consumes" values, with much more explicit, naively translatable, C++ code.)
One comment said the code can be equal or slower, but the poster
didn't elaborate.
Of course annotations of the kind that you can randomly put on programs simply cannot make code more efficient in general! That would be too easy and also self contradictory.
Either some annotation specifies a constrain on your code, that is a promise to the compiler, and you can't put it anywhere where it doesn't correspond an guarantee in the code (like noexcept in C++, restrict in C), or it would break code in various ways (an exception in a noexcept function stops the program, aliasing of restricted pointers can cause funny miscompilation and bad behavior (formerly the behavior is not defined in that case); then the compiler can use it to optimize the code in specific ways.
Or that annotation doesn't constrain the code in any way, and the compiler can't count on anything and the annotation does not create any more optimization opportunity.
If you get more efficient code in some cases at no cost of breaking program with an annotation then you must potentially get less efficient code in other cases. That's true in general and specifically true with consume semantic, which imposes the previously described constrained on translation of C++ constructs.
I imagine appropriate places to use this is on any function return or
parameter that is a pointer or reference and that will be passed or
returned within the calling thread
No, the one and only case where it might be useful is when the intended calling function will probably use consume memory order.

Related

Require compiler to emit branchless/constant-time code

In cryptography, any piece of code that depends on secret data (such as a private key) must execute in constant time in order to avoid side-channel timing attacks.
The most popular architectures currently (x86-64 and ARM AArch64) both support certain kinds of conditional execution instructions, such as:
CMOVcc, SETcc for x86-64
CSINCcc, CSINVcc, CSNEGcc for AArch64
Even when such instructions are not available, there are techniques to convert a piece of code into a branchless version. Performance may suffer, but in this scenario it's not the primary goal -- running in constant time is.
Therefore, it should in principle be possible to write branchless code in e.g. C/C++, and indeed it is seen that gcc/clang will often emit branchless code with optimizations turned on (there is even a specific flag for this in gcc: -fif-conversion2). However, this appears to be an optimization decision, and if the compiler thinks branchless will perform worse (say, if the "then" and "else" clauses perform a lot of computation, more than the cost of flushing the pipeline in case of a wrongly predicted branch), then I assume the compiler will emit regular code.
If constant-time is a non-negotiable goal, one may be forced to use some of the aforementioned tricks to generate branchless code, making the code less clear. Also, performance is often a secondary and quite important goal, so the developer has to hope that the compiler will infer the intended operation behind the branchless code and emit an efficient instruction sequence, often using the instructions mentioned above. This may require rewriting the code over and over while looking at the assembly output, until a magic incantation satisfies the compilers -- and this may change from compiler to compiler, or when a new version comes out.
Overall, this is an awful situation on both sides: compiler writers must infer intent from obfuscated code, transforming it into a much simpler instruction sequence; while developers must write such obfuscated code, since there are no guarantees that simple, clear code would actually run in constant time.
Making this into a question: if a certain piece of code must be emitted in constant-time (or not at all), is there a compiler flag or pragma that will force the code to be emitted as such, even if the compiler predicts worse performance than the branched version, or abort the compilation if it is not possible? Developers would be able to write clear code with the peace of mind that it will be constant-time, while supplying the compiler with clear and easy to analyze code. I understand this is probably a language- and compiler-dependent question, so I would be satisfied with either C or C++ answers, for either gcc or clang.

I found this question by going down a similar rabbit hole. For security purposes I require my code to not branch on secret data and to not leak information trough timing attacks.
While not an answer per se I can recommend this paper from the S&P 2018: https://ieeexplore.ieee.org/document/8406587.
The authors also wrote and extension for CLang/LLVM. I am not sure how well this extension works but it's a first step and gives a good overview on where we currently stand in the research context.

Is undefined behavior only an issue if you are deploying on several platforms?

Most of the conversations around undefined behavior (UB) talk about how there are some platforms that can do this, or some compilers do that.
What if you are only interested in one platform and only one compiler (same version) and you know you will be using them for years?
Nothing is changing but the code, and the UB is not implementation-defined.
Once the UB has manifested for that architecture and that compiler and you have tested, can't you assume that from then on whatever the compiler did with the UB the first time, it will do that every time?
Note: I know undefined behavior is very, very bad, but when I pointed out UB in code written by somebody in this situation, they asked this, and I didn't have anything better to say than, if you ever have to upgrade or port, all the UB will be very expensive to fix.
It seems there are different categories of Behavior:
Defined - This is behavior documented to work by the standards
Supported - This is behavior documented to be supported a.k.a
implementation defined
Extensions - This is a documented addition, support for low level
bit operations like popcount, branch hints, fall into this category
Constant - While not documented, these are behaviors that will
likely be consistent on a given platform things like endianness,
sizeof int while not portable are likely to not change
Reasonable - generally safe and usually legacy, casting from
unsigned to signed, using the low bit of a pointer as temp space
Dangerous - reading uninitialized or unallocated memory, returning
a temp variable, using memcopy on a non pod class
It would seem that Constant might be invariant within a patch version on one platform. The line between Reasonable and Dangerous seems to be moving more and more behavior towards Dangerous as compilers become more aggressive in their optimizations

OS changes, innocuous system changes (different hardware version!), or compiler changes can all cause previously "working" UB to not work.
But it is worse than that.
Sometimes a change to an unrelated compilation unit, or far away code in the same compilation unit, can cause previously "working" UB to not work; as an example, two inline functions or methods with different definitions but the same signature. One is silently discarded during linking; and completely innocuous code changes can change which one is discarded.
The code that is working in one context can suddenly stop working in the same compiler, OS and hardware when you use it in a different context. An example of this is violating strong aliasing; the compiled code might work when called at spot A, but when inlined (possibly at link-time!) the code can change meaning.
Your code, if part of a larger project, could conditionally call some 3rd party code (say, a shell extension that previews an image type in a file open dialog) that changes the state of some flags (floating point precision, locale, integer overflow flags, division by zero behavior, etc). Your code, which worked fine before, now exhibits completely different behavior.
Next, many kinds of undefined behavior are inherently non-deterministic. Accessing the contents of a pointer after it is freed (even writing to it) might be safe 99/100, but 1/100 the page was swapped out, or something else was written there before you got to it. Now you have memory corruption. It passes all your tests, but you lacked complete knowledge of what can go wrong.
By using undefined behavior, you commit yourself to a complete understanding of the C++ standard, everything your compiler can do in that situation, and every way the runtime environment can react. You have to audit the produced assembly, not the C++ source, possibly for the entire program, every time you build it! You also commit everyone who reads that code, or who modifies that code, to that level of knowledge.
It is sometimes still worth it.
Fastest Possible Delegates uses UB and knowledge about calling conventions to be a really fast non-owning std::function-like type.
Impossibly Fast Delegates competes. It is faster in some situations, slower in others, and is compliant with the C++ standard.
Using the UB might be worth it, for the performance boost. It is rare that you gain something other than performance (speed or memory usage) from such UB hackery.
Another example I've seen is when we had to register a callback with a poor C API that just took a function pointer. We'd create a function (compiled without optimization), copy it to another page, modify a pointer constant within that function, then mark that page as executable, allowing us to secretly pass a pointer along with the function pointer to the callback.
An alternative implementation would be to have some fixed size set of functions (10? 100? 1000? 1 million?) all of which look up a std::function in a global array and invoke it. This would put a limit on how many such callbacks we install at any one time, but practically was sufficient.

No, that's not safe. First of all, you will have to fix everything, not only the compiler version. I do not have particular examples, but I guess that a different (upgraded) OS, or even an upgraded processor might change UB results.
Moreover, even having a different data input to your program can change UB behavior. For example, an out-of-bound array access (at least without optimizations) usually depend on whatever is in the memory after the array.
UPD: see a great answer by Yakk for more discussion on this.
And a bigger problem is optimization and other compiler flags. UB may manifest itself in a different ways depending on optimization flags, and it's quite difficult to imagine somebody to use always the same optimization flags (at least you'll use different flags for debug and release).
UPD: just noticed that you did never mention fixing a compiler version, you only mentioned fixing a compiler itself. Then everything is even more unsafe: new compiler versions might definitely change UB behavior. From this series of blog posts:
The important and scary thing to realize is that just about any
optimization based on undefined behavior can start being triggered on
buggy code at any time in the future. Inlining, loop unrolling, memory
promotion and other optimizations will keep getting better, and a
significant part of their reason for existing is to expose secondary
optimizations like the ones above.

This is basically a question about a specific C++ implementation. "Can I assume that a specific behavior, undefined by the standard, will continue to be handled by ($CXX) on platform XYZ in the same way under circumstances UVW?"
I think you either should clarify by saying exactly what compiler and platform you are working with, and then consult their documentation to see if they make any guarantees, otherwise the question is fundamentally unanswerable.
The whole point of undefined behavior is that the C++ standard doesn't specify what happens, so if you are looking for some kind of guarantee from the standard that it's "ok" you aren't going to find it. If you are asking whether the "community at large" considers it safe, that's primarily opinion based.
Once the UB has manifested for that architecture and that compiler and you have tested, can't you assume that from then on whatever the compiler did with the UB the first time, it will do that every time?
Only if the compiler makers guarantee that you can do this, otherwise, no, it's wishful thinking.
Let me try to answer again in a slightly different way.
As we all know, in normal software engineering, and engineering at large, programmers / engineers are taught to do things according to a standard, the compiler writers / parts manufacturers produce parts / tools that meet a standard, and at the end you produce something where "under the assumptions of the standards, my engineering work shows that this product will work", and then you test it and ship it.
Suppose you had a crazy uncle jimbo and one day, he got all his tools out and a whole bunch of two by fours, and worked for weeks and made a makeshift roller coaster in your backyard. And then you run it, and sure enough it doesn't crash. And you even run it ten times, and it doesn't crash. Now jimbo is not an engineer, so this is not made according to standards. But if it didn't crash after even ten times, that means it's safe and you can start charging admission to the public, right?
To a large extent what's safe and what isn't is a sociological question. But if you want to just make it a simple question of "when can I reasonably assume that no one would get hurt by me charging admission, when I can't really assume anything about the product", this is how I would do it. Suppose I estimate that, if I start charging admission to the public, I'll run it for X years, and in that time, maybe 100,000 people will ride it. If it's basically a biased coin flip whether it breaks or not, then what I would want to see is something like, "this device has been run a million times with crash dummies, and it never crashed or showed hints of breaking." Then I could quite reasonably believe that if I start charging admission to the public, the odds that anyone will ever get hurt are quite low, even though there are no rigorous engineering standards involved. That would just be based on a general knowledge of statistics and mechanics.
In relation to your question, I would say, if you are shipping code with undefined behavior, which no one, either the standard, the compiler maker, or anyone else will support, that's basically "crazy uncle jimbo" engineering, and it's only "okay" if you do vastly increased amounts of testing to verify that it meets your needs, based on a general knowledge of statistics and computers.

What you are referring to is more likely implementation defined and not undefined behavior. The former is when the standard doesn't tell you what will happen but it should work the same if you are using the same compiler and the same platform. An example for this is assuming that an int is 4 bytes long. UB is something more serious. There the standard doesn't say anything. It is possible that for a given compiler and platform it works, but it is also possible that it works only in some of the cases.
An example is using uninitialized values. If you use an uninitialized bool in an if, you may get true or false, and it may happen that it is always what you want, but the code will break in several surprising ways.
Another example is dereferencing a null pointer. While it will probably result in a segfault in all cases, but the standard doesn't require the program to even produce the same results every time a program is run.
In summary, if you are doing something that is implementation defined, then you are safe if you are only developing to one platform and you tested that it works. If you are doing something that is undefined behavior, then you are probably not safe in any case. There may be that it works but nothing guarantees it.

Think about it a different way.
Undefined behavior is ALWAYS bad, and should never be used, because you never know what you will get.
However, you can temper that with
Behavior can be defined by parties other than just the language specification
Thus you should never rely on UB, ever, but you can find alternate sources which state that a certain behavior is DEFINED behavior for your compiler in your circumstances.
Yakk gave great examples regarding the fast delegate classes. In those cases, the author explicitly claims that they are engaging in undefined behavior, according to the spec. However, they then go to explain a business reason why the behavior is better defined than that. For example, they declare that the memory layout of a member function pointer is unlikely to change in Visual Studio because there would be rampant business costs due to incompatibilities which are distasteful to Microsoft. Thus they declare that the behavior is "de facto defined behavior."
Similar behavior can be seen in the typical linux implementation of pthreads (to be compiled by gcc). There are cases where they make assumptions about what optimizations a compiler is allowed to invoke in multithreaded scenarios. Those assumptions are stated plainly in comments in the sourcecode. How is this "de facto defined behavior?" Well, pthreads and gcc go kind of hand in hand. It would be considered unacceptable to add an optimization to gcc which broke pthreads, so nobody will ever do it.
However, you cannot make the same assumption. You may say "pthreads does it, so I should be able to as well." Then, someone makes an optimization, and updates gcc to work with it (perhaps using __sync calls instead of relying on volatile). Now pthreads keeps functioning... but your code doesn't anymore.
Also consider the case of MySQL (or was it Postgre?) where they found a buffer overflow error. The overflow had actually been caught in the code, but it did so using undefined behavior, so the latest gcc started optimizing the entire check out.
So, in all, look for an alternate source of defining the behavior, rather than using it while it is undefined. It is totally legit to find a reason why you know 1.0/0.0 equals NaN, rather than causing a floating point trap to occur. But never use that assumption without first proving that it is a valid definition of behavior for you and your compiler.
And please oh please oh please remember that we upgrade compilers every now and then.

Historically, C compilers have generally tended to act in somewhat-predictable fashion even when not required to do so by the Standard. On most platforms, for example, a comparison between a null pointer and a pointer to a dead object will simply report that they are not equal (useful if code wishes to safely assert that the pointer is null and trap if it isn't). The Standard does not require compilers to do these things, but historically compilers which could do them easily have done so.
Unfortunately, some compiler writers have gotten the idea that if such a comparison could not be reached while the pointer was validly non-null, the compiler should omit the assertion code. Worse, if it can also determine that certain input would cause the code to be reached with an invalid non-null pointer, it should assume that such input will never be received, and omit all code which would handle such input.
Hopefully such compiler behavior will turn out to be a short-lived fad. Supposedly, it's driven by a desire to "optimize" code, but for most applications robustness is more important than speed, and having compilers mess with code that would have limited the damage caused by errant inputs or errand program behavior is a recipe for disaster.
Until then, however, one must be very careful when using compilers to read the documentation carefully, since there's no guarantee that a compiler writer won't have decided that it was less important to support useful behaviors which, though widely supported, aren't mandated by the Standard (such as being able to safely check whether two arbitrary objects overlap), than to exploit every opportunity to eliminate code which the Standard doesn't require it to execute.

Undefined behavior can be altered by things such as the ambient temperature, which causes rotating hard disk latencies to change, which causes thread scheduling to change, which in turn changes the contents of the random garbage that's getting evaluated.
In short, not safe unless the compiler or the OS specifies the behavior (since the language standard didn't).

There is a fundamental problem with undefined behavior of any kind: It is diagnosed by sanitizers and optimizers. A compiler can silently change behavior corresponding to those from one version to another (e.g. by expanding its repertoire), and suddenly you'll have some untraceable error in your program. This should be avoided.
There is undefined behavior that is made "defined" by your particular implementation, though. A left shift by a negative amount of bits can be defined by your machine, and it would be safe to use it there, as breaking changes of documented features occur quite rarely. One more common example is strict aliasing: GCC can disable this restriction with -fno-strict-aliasing.

While I agree with the answers that say that it's not safe even if you don't target multiple platforms, every rule can have exceptions.
I would like to present two examples where I'm confident that allowing undefined / implementation-defined behavior was the right choice.
A single-shot program. It's not a program which is intended to be used by anyone, but it's a small and quickly written program created to calculate or generate something now. In such a case a "quick and dirty" solution can be the right choice, for example, if I know the endianness of my system and I don't want to bother with writing a code which works with the other endianness. For example, I only needed it to perform a mathematical proof to know if I'll be able to use a specific formula in my other, user-oriented program or not.
Very small embedded devices. The cheapest microcontrollers have memory measured in a few hundred bytes. If you develop a small toy with blinking LEDs or a musical postcard, etc, every penny counts, because it will be produced in the millions with a very low profit per unit. Neither the processor nor the code ever changes, and if you have to use a different processor for the next generation of your product, you will probably have to rewrite your code anyway. A good example of an undefined behavior in this case is that there are microcontrollers which guarantee a value of zero (or 255) for every memory location at power-up. In this case you can skip the initialization of your variables. If your microcontroller has only 256 bytes of memory, this can make a difference between a program which fits into the memory and a code which doesn't.
Anyone who disagrees with point 2, please imagine what would happen if you told something like this to your boss:
"I know the hardware costs only $ 0.40 and we plan selling it for $ 0.50. However, the program with 40 lines of code I've written for it only works for this very specific type of processor, so if in the distant future we ever change to a different processor, the code will not be usable and I'll have to throw it out and write a new one. A standard-conforming program which works for every type of processor will not fit into our $ 0.40 processor. Therefore I request to use a processor which costs $ 0.60, because I refuse to write a program which is not portable."

"Software that doesn't change, isn't being used."
If you are doing something unusual with pointers, there's probably a way to use casts to define what you want. Because of their nature, they will not be "whatever the compiler did with the UB the first time".
For example, when you refer to memory pointed at by an uninitialize pointer, you get a random address that is different every time you run the program.
Undefined behavior generally means you are doing something tricky, and you would be better off doing the task another way.
For instance, this is undefined:
printf("%d %d", ++i, ++i);
It's hard to know what the intent would even be here, and should be re-thought.

Changing the code without breaking it requires reading and understanding the current code. Relying on undefined behavior hurts readability: If I can't look it up, how am I supposed to know what the code does?
While portability of the program might not be an issue, portability of the programmers might be. If you need to hire someone to maintain the program, you'll want to be able to look simply for a '<language x> developer with experience in <application domain>' that fits well into your team rather than having to find a capable '<language x> developer with experience in <application domain> knowing (or willing to learn) all the undefined behavior intrinsics of version x.y.z on platform foo when used in combination with bar while having baz on the furbleblawup'.

Nothing is changing but the code, and the UB is not implementation-defined.
Changing the code is sufficient to trigger different behavior from the optimizer with respect to undefined behavior and so code that may have worked can easily break due to seemingly minor changes that expose more optimization opportunities. For example a change that allows a function to be inlined, this is covered well in What Every C Programmer Should Know About Undefined Behavior #2/3 which says:
While this is intentionally a simple and contrived example, this sort of thing happens all the time with inlining: inlining a function often exposes a number of secondary optimization opportunities. This means that if the optimizer decides to inline a function, a variety of local optimizations can kick in, which change the behavior of the code. This is both perfectly valid according to the standard, and important for performance in practice.
Compiler vendors have become very aggressive with optimizations around undefined behavior and upgrades can expose previously unexploited code:
The important and scary thing to realize is that just about any optimization based on undefined behavior can start being triggered on buggy code at any time in the future. Inlining, loop unrolling, memory promotion and other optimizations will keep getting better, and a significant part of their reason for existing is to expose secondary optimizations like the ones above.

Does C++ code run faster if there is no structure in program

I know it helps a lot if we structure our programs using classes, structs etc. but does it help in terms of running speed that we avoid these structures and write code plain in terms of basic C++ syntax?
For example, I am trying to write a program that works on vectors. Now it sounds tempting to write a class vector and define its methods like set_at_index(int i) that sets the value of specific row i of this vector. Furthermore I can check whether i<=N where N is the length of the vector in question.
My confusion is that with these routine every set_at_index method that is used a lot will require one 'if' statement. So if I want my code to run faster should I avoid it and go with declaring an array and manually take care that there is no memory leak?
Is there any way I can check for the memory leaks without putting burden on the code speed?

Yes, bounds checking will take slightly more time. But it will take so little extra time that it will only matter if the code is being run 28894389375 times and then it might add up to a millisecond. Note that std::vector only performs bounds checking if you use the at member function, not if you use operator[]. Also, if you are doing anything like writing to a file or printing text to the console, doing that one time will likely take more time than ten million bounds-checked array accesses, because I/O is relatively very very slow.
Typically, without bounds checking code using classes will run at the same speed as code using plain arrays. The problem with manually managing memory like you suggest is that it's easy to forget to clean it up, or to clean it up only through one path of execution through the program, or to fail to clean it up in the event of an exception. It's really hardly ever worth it. Also, it'll be just as fast to use a vector class without bounds checking as it will be to use a dynamic array without bounds checking. You pay for it either way.
I also suggest using std::vector instead of writing your own vector class since they do pretty much every optimisation you could do yourself, and they usually have the advantage of being able to write the code for their specific compiler and perhaps be able to take advantage of things that only that compiler does because they know more of its implementation. The STL classes are also rigorously tested and written by experts (usually).
You should write your code first, then measure with a profiler to see the bottlenecks in your code if it is not fast enough already, then optimise the bottlenecks. I will bet that bounds checking on arrays is probably not going to be one of those bottlenecks.
Checking for memory leaks can be done with a tool like valgrind. You don't do it in the code itself.

Don't try to over optimize before you even start writing. Go ahead and write code that is easily maintainable, readable, and as bug free as possible. Once you have things working, you can start profiling to see the real bottlenecks.

"Premature optimization is root of all evils" - Donald Knuth. (this is true 97% of the time).
Unless you profile your application and see that your class encapsulation is a bottleneck that does slow your application in a significant amount, don't hesitate to have high level structures. It will brings you plenty of benefits like readability, maintainance, and understanding what you do. That's what brings OOP: Big scale programs.

Some good answers have already been posted, and premature optimization is indeed inadvisable as others have said. However, let me put your question in slightly another light.
I know it helps a lot if we structure our programs using classes,
structs etc. but does it help in terms of running speed that we avoid
these structures and write code plain in terms of basic C++ syntax?
Theoretically, most properly written C++ code should run just as fast with fully developed classes as without, but
there are exceptions to the rule;
the effort required to write the C++ code theoretically properly may be too great; and
the same features of the C++ compiler that make it hard to write incorrect code can make it all too easy to write grossly inefficient code.
Point-by-point remarks follow.
Consider a complex three-dimensional vector type, of which each instance consists of six doubles (three real parts and three imaginary parts). If there were not so many doubles, your compiler might load them directly into your microprocessor's registers, but with six they are likely to remain on the stack when the complex three-dimensional vector is loaded. Some operations however on a complex three-dimensional vector do not require all six doubles, but only one, two or three of them. If so, then it might be preferable to store the six floating-point components separately. Thus, rather than an array of 1000 vectors, you'd keep six arrays of 1000 doubles each. Of course, one can (and probably should) bind the arrays together in a class of some kind, but -- for efficiency reasons only -- a good design might never explicitly associate individual elements from one array to another.
Sometimes, you know where your data is and what you want to do with it, and C++'s elaborate organizational and access-control facilities only get in your way. In this case, you might skip the high-level C++ and just do what you want in primitive, hackworthy, brutish, machete-wielding C-style code. Indeed, C++ explicitly supports this style of coding by making it possible -- nay, easy -- to wrap the primitive C code safely within a module and thus to hide its horror from the rest of your beautiful C++ program. Of course, if you hand your code a machete, so to speak, then you take a risk, don't you, because your code may hack up data you never wanted it to, and your compiler will stand aside and let it do it; but sometimes the risk is worth the gain, and sometimes the risk is even fun (and character-building) for a programmer's change of pace.
This point is the most subtle of the three. Where a user-defined type consists partly of other user-defined types, multiple layers of constructors will be called and implicitly invoked. This is great, and usually it is what you want, especially if you have a good unit-testing regime at each layer. The rose however has a thorn, as it were. A properly written constructor is careful never to lose anything it needs. So, unless the programmer is most careful, a constructor may quietly make a lot of strictly unnecessary copies of very large objects. Sometimes, the programmer will mentally lose track of all the levels of implicit invocation, which he never would have done if he had had to handle each invocation explicitly. Also, your data in an object of one type may lack access to a member function to which it can easily gain access, so long is it temporarily copies itself to an object of another type (you can avoid the copy with the use of handle types, reference counting and so forth, but this is not free: it's quite a bit of work). Even if the programmer is conscious of the implicit copy, the implicit copy is so much easier to code in the moment that the temptation to do so is sometimes too great -- especially when a deadline looms! Several hidden inefficiencies can arise in these ways. One can, and should, work around such inefficiencies, of course, but it can take a lot of coding effort to do so and, even then, your compiler is so busy helping you to avoid logical errors that it tends to cause you to create inadvertent inefficiencies that you would never purposely have created. The unnecessary, hidden copying of data is a much bigger problem in C++ than it ever was in C.
All in all, I would say that the C++ trade-off is worth it 80 percent of the time. C++'s organizational and access-control facilities merit the effort it takes to apply them properly. If your question regards the 20 percent, well, there is more than one valid approach to programming, in my view. Sometimes it really does help "that we avoid these structures and write code plain in terms of basic C++ syntax," as you have said.
Usually, no. Sometimes, yes. I think that the earlier answers are right, though, that the particular example you have posed is probably better treated in boring, neat, orderly C++, without tricks.

Two things:
DO NOT do any kind of premature optimization, CPUs are fast nowadays, compilers are smart and able to figure out optimizations that you wouldn't think of in months of looking at your code.
you can easily check things like memory leaks by profiling your code and/or using conditional compilation. Leaks shouldn't occur on release versions so you should just skip that checks.

Is it worth writing part of code in C instead of C++ as micro-optimization?

I am wondering if it is still worth with modern compilers and their optimizations to write some critical code in C instead of C++ to make it faster.
I know C++ might lead to bad performance in case classes are copied while they could be passed by reference or when classes are created automatically by the compiler, typically with overloaded operators and many other similar cases; but for a good C++ developer who knows how to avoid all of this, is it still worth writing code in C to improve performance?

I'm going to agree with a lot of the comments. C syntax is supported, intentionally (with divergence only in C99), in C++. Therefore all C++ compilers have to support it. In fact I think it's hard to find any dedicated C compilers anymore. For example, in GCC you'll actually end up using the same optimization/compilation engine regardless of whether the code is C or C++.
The real question is then, does writing plain C code and compiling in C++ suffer a performance penalty. The answer is, for all intents and purposes, no. There are a few tricky points about exceptions and RTTI, but those are mainly size changes, not speed changes. You'd be so hard pressed to find an example that actually takes a performance hit that it doesn't seem worth it do write a dedicate module.
What was said about what features you use is important. It is very easy in C++ to get sloppy about copy semantics and suffer huge overheads from copying memory. In my experience this is the biggest cost -- in C you can also suffer this cost, but not as easily I'd say.
Virtual function calls are ever so slightly more expensive than normal functions. At the same time forced inline functions are cheaper than normal function calls. In both cases it is likely the cost of pushing/popping parameters from the stack that is more expensive. Worrying about function call overhead though should come quite late in the optimization process -- as it is rarely a significant problem.
Exceptions are costly at throw time (in GCC at least). But setting up catch statements and using RAII doesn't have a significant cost associated with it. This was by design in the GCC compiler (and others) so that truly only the exceptional cases are costly.
But to summarize: a good C++ programmer would not be able to make their code run faster simply by writing it in C.

measure! measure before thinking about optimizing, measure before applying optimization, measure after applying optimization, measure!
If you must run your code 1 nanosecond faster (because it's going to be used by 1000 people, 1000 times in the next 1000 days and that second is very important) anything goes.
Yes! it is worth ...
changing languages (C++ to C; Python to COBOL; Mathlab to Fortran; PHP to Lisp)
tweaking the compiler (enable/disable all the -f options)
use different libraries (even write your own)
etc
etc
What you must not forget is to measure!.

pmg nailed it. Just measure instead of global assumptions. Also think of it this way, compilers like gcc separate the front, middle, and back end. so the frontend fortran, c, c++, ada, etc ends up in the same internal middle language if you will that is what gets most of the optimization. Then that generic middle language is turned into assembler for the specific target, and there are target specific optimizations that occur. So the language may or may not induce more code from the front to middle when the languages differ greatly, but for C/C++ I would assume it is the same or very similar. Now the binary size is another story, the libraries that may get sucked into the binary for C only vs C++ even if it is only C syntax can/will vary. Doesnt necessarily affect execution performance but can bulk up the program file costing storage and transfer differences as well as memory requirements if the program loaded as a while into ram. Here again, just measure.
I also add to the measure comment compile to assembler and/or disassemble the output and compare the results of your different languages/compiler choices. This can/will supplement the timing differences you see when you measure.

The question has been answered to death, so I won't add to that.
Simply as a generic question, assuming you have measured, etc, and you have identified that a certain C++ (or other) code segment is not running at optimal speed (which generally means you have not used the right tool for the job); and you know you can get better performance by writing it in C, then yes, definitely, it is worth it.
There is a certain mindset that is common, trying to do everything from one tool (Java or SQL or C++). Not just Maslow's Hammer, but the actual belief that they can code a C construct in Java, etc. This leads to all kinds of performance problems. Architecture, as a true profession, is about placing code segments in the appropriate architectural location or platform. It is the correct combination of Java, SQL and C that will deliver performance. That produces an app that does not need to be re-visited; uneventful execution. In which case, it will not matter if or when C++ implements this constructors or that.

I am wondering if it is still worth with modern compilers and their optimizations to write some critical code in C instead of C++ to make it faster.
no. keep it readable. if your team prefers c++ or c, prefer that - especially if it is already functioning in production code (don't rewrite it without very good reasons).
I know C++ might lead to bad performance in case classes are copied while they could be passed by reference
then forbid copying and assigning
or when classes are created automatically by the compiler, typically with overloaded operators and many other similar cases
could you elaborate? if you are referring to templates, they don't have additional cost in runtime (although they can lead to additional exported symbols, resulting in a larger binary). in fact, using a template method can improve performance if (for example) a conversion would otherwise be necessary.
but for a good C++ developer who knows how to avoid all of this, is it still worth writing code in C to improve performance?
in my experience, an expert c++ developer can create a faster, more maintainable program.
you have to be selective about the language features that you use (and do not use). if you break c++ features down to the set available in c (e.g., remove exceptions, virtual function calls, rtti) then you're off to a good start. if you learn to use templates, metaprogramming, optimization techniques, avoid type aliasing (which becomes increasingly difficult or verbose in c), etc. then you should be on par or faster than c - with a program which is more easily maintained (since you are familiar with c++).
if you're comfortable using the features of c++, use c++. it has plenty of features (many of which have been added with speed/cost in mind), and can be written to be as fast as c (or faster).
with templates and metaprogramming, you could turn many runtime variables into compile-time constants for exceptional gains. sometimes that goes well into micro-optimization territory.

Implementing atomic<T>::store

I'm attempting to implement the atomic library from the C++0x draft. Specifically, I'm implementing §29.6/8, the store method:
template <typename T>
void atomic<T>::store(T pDesired, memory_order pOrder = memory_order_seq_cst);
The requirement states:
The order argument shall not be memory_order_consume, memory_order_acquire, nor memory_order_acq_rel.
I'm not sure what to do if it is one of these. Should I do nothing, throw an exception, get undefined behavior, or do something else?
P.S.: "C++0X" looks kinda like a dead fish :3

Do what you want. It doesn't matter.
When ISO state that you "shall not do something", doing it is undefined behaviour. If a user does that, they have violated the contract with the implementation, and the implementation is within its rights to do as it pleases.
What you decide to do is entirely up to you. I would opt for whatever makes your implementation "better" (in your eyes, be that faster, more readable, subject to the principle of least astonishment, and so forth).
Myself, I'd go for readability (since I would have to maintain the thing) with speed taking a close second.

I prefer a compile time error. If not that, then an assert() failure.
Assert is good because it compiles out of the release version and will not impact performance.
Compile time errors are even better because they provide more immediate feedback without waiting for the software to trip over the bug. Compile time error checks are a thing I love about C++ code over Python, Ruby, Perl code.

I'd rather get vaguey sane behaviour that something crazy.
Well, as a potential consumer of your library, here's what I'd like: if there's no performance cost to the documented usage, then see if one of the memory_order values provides a functional superset of the others, particularly something corresponding to what a caller might naively expect the unsupported modes to do (if any sensible expectation can be formed). The caller may get the slowest, safest mode, but that's better than something functionally wrong. You minimise the client code's dependency on getting everything perfect for your code. The problem with this - compared to an assert/exception - is that it can go unnoticed in a test environment, so consider also writing an explanation to std::cerr, using a static variable to limit the messages to one per process run. That's a very useful diagnostic.
An exception, fatal assertion etc. might bring down a client application at a very inconvenient moment.... Seems a bit draconian, and not something I'd appreciate particularly. Another option is to have an environment variable control this behaviour.
(There's presumably a similar issue for values that aren't even in your current enumeration.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js