How to disable out of order execution of CPU [duplicate]

How to disable out of order execution of CPU [duplicate] - c++

This question already has answers here:
Enforcing statement order in C++
(7 answers)
Closed 4 years ago.
As we know, modern CPU is able to execute multi commands concurrently: https://en.wikipedia.org/wiki/Out-of-order_execution
As a result, we may get such a fact as below:
When we execute the c++ code: x = a + b; y = c + d;, y = c + d; may be executed before x = a + b;.
My question is if it is possible to disable the out-of-order execution of CPU?

No, you can't deactivate such hardware mechanism when it has it. That's what gives you performance. That's how the CPU is designed.
What C++ guarantees is that you won't be able to see the difference between what you want with the proper order, and what you will get. And that's also something that vendors like Intel will make sure for the assembly.
Have a look at https://www.youtube.com/watch?v=FJIn1YhPJJc&frags=pl%2Cwn for the C++ execution model.

All you should care about is the meaning of your program. That's not being pedantic, it is the fundamental basis around which the entire language has been designed.
A C++ program describes the meaning of a program. It is not a one-to-one mapping of source code to what a computer should literally do.
If you want that, you will have to code in assembly or perhaps some old-fashioned language from the middle ages, but even then you are going to have a hard time telling a modern CPU not to do all the clever things that it is designed to do in order to support useful programs. Certainly I'm not aware of any out-of-the-box switch, flag or setting that makes this happen; it would go against the grain of the very architecture of a CPU.
Ultimately you may be better off building and programming a Difference Engine ;)

You cannot deactivate the reordering at the hardware level (CPU level).
However, you can ensure that the compiler will not reorder by using the optimization level of debug.
This will help in debugging the program, but it will make your code slow. It is not recommended in production code.

It's not possible to disable out-of-order execution from the program, this is the well-known problem of Memory Barrier. As the problem consequences appear in multi-threading programs only, you have to use synchronization primitives to ensure the execution order.

Related

Will compiler automatically optimize repeating code?

If I have some code with simple arithmetics that is repeating several times. Will the compiler automatically optimize it?
Here the example:
someArray[index + 1] = 5;
otherArray[index + 1] = 7;
Does it make sense to introduce variable nextIndex = index + 1 from the perfomance point of view, (not from the point of view of good readable and maintanable code) or the compiler will do such optimization automatically?

You should not worry about trivial optimization like this because almost all compilers do it last 10-15 years or longer.
But if you have a really critical place in your code and want to get maximal speed of running, than you can check generated assembler code for this lines to be sure that compiler did this trivial optimization.
In some cases one more arithmetic addition could be more faster version of code than saving in register or memory, and compilers knows about this. You can make your code slower if you try optimize trivial cases manually.
And you can use online services like https://gcc.godbolt.org for check generated code (support gcc, clang, icc in several version).

The old adage "suck it and see" seems to be appropriate here. We often forget that by far the most common processors are 4/8/16 bit micros with weird and wonderful application specific architectures and suitably odd vendor specific compilers to go with them. They frequently have compiler extensions to "aid" (or confuse) the compiler into producing "better" code.
One DSP from early 2000s carried out 8 instructions per clock-cycle in parallel in a pipeline (complex - "load+increment+multiply+add+round"). The precondition for this to work was that everything had to be preloaded into the registers beforehand. This meant that registers were obviously at a premium (as always). With this architecture it was frequently better to bin results to free registers and use free slots that couldn't be paralleled (some instructions precluded the use of others in the same cycle) to recalculate it later. Did the compiler get this "right"?. Yes, it often kept the result to reuse later with the result that it stalled the pipeline due to lack of registers which resulted in slower execution speed.
So, you compiled it, examined it, profiled it etc. to make sure that the when when the compiler got it "right" we could go in and fix it. Without additional semantic information which is not supported by the language it is really hard to know what "right" is.
Conclusion: Suck it and see

Yes. It's a common optimization. https://en.wikipedia.org/wiki/Common_subexpression_elimination

On the implementation of multiple simultaneous tasks of a compiler [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
In my struggle to part ways with interpreters, and while attempting to advance my knowledge in c++, I recently puchased a copy of "C++ In a Nutshell: A Desktop Quick Reference (In a Nutshell (O'Reilly))" and "Writing Compilers and Interpreters (Wiley)". While my college course on C++ taught me how to sort stacks and lists, it taught me nothing on those subjects. I decided I will write a compiler taylored to my unique habits and coding style.
In the case of the recent advent of real multiprocessing capacity, what is worth the required effort? I am fully aware of a multitude of libraries capable of providing threading, and of providing multiprocessing. Defaulting to a resort of pre-existing code does not actualize an efficient learning process due to the fact that that old intimate connection with personally written code would be quite lacking.

Compilers are generally implemented as a pipeline, where source code goes in one end and a number of processing steps are applied in sequence, and an object file comes out the other end. This kind of processing does not generally lend itself well to multithreading.
With the speed of today's computers, it is rare that a compiler needs to do so much processing that it would benefit from a multithreaded implementation. Within a pipeline, there's only a few things you could usefully do anyway, such as running per-function optimisations in parallel.
A much easier gain can be seen by simply running the compiler in parallel against more than one source file. This can be doing without any multithreading support in the compiler; just write a single-threaded compiler (much less prone to error), and let your build system take care of the parallelism.

Writing a compiler is hard; languages are complex, people want good code, and they don't want to wait long get it. This should be obvious from your own experience with using C++ compilers. Writing a C++ compiler is especially hard, because the language is especially complex (the new C++11 standard making it considerably more complex), and people expect really good code from C++ compilers.
All this complexity, and your lack of background in compilers, suggests that you are not likely to write a C++ compiler, let alone a parallel one. The GCC and CLANG communities have hundreds of peoples and decades of elapsed development time. It may be worth your effort to build a toy compiler, to understand the issues better.
Regarding parallelism in the compiler itself, one can take several approaches.
The first is to use a standard library, as you have suggested, and attempt to retrofit an existing compiler with parallelism. It is odd that few seem to have attempted this task given that GCC and CLANG are open source tools. But it is also very difficult to parallelize a program that was designed without parallelism in mind. Where does one find units of parallelism (processing individual methods?), how do you insert the parallelism, how do you ensure that the now-retrofitted compiler doesn't have synchronization problems (if the compiler processes all methods in parallel, where's the guarantee that a signature from one method is actually ready and available to other methods being compiled?) Finally, how does one guarantee that the parallel work dominates the additional thread initialization/teardown/synchronization overhead so that the compiler is actually faster given multiple CPUs? In addition, thread libraries are a bit difficult to use, and it is relatively easy to make dumb mistake coding a threading call. If you have lots of these in your compiler, you have a high probability of such a dumb mistake. Debugging will be hard.
The second is build a new compiler from scratch, using such libraries. This requires a lot of work just to get the basic compiler in place, but has the advantage that the individual elements of the compiler can be considered during design, and parallel interlocks designed in I don't know of any compilers built this way (surely there are some research compilers like this) but its a lot of work, and clearly more work than just writing a non-parallel compiler. You still suffer from the clumsiness of thread libraries, too.
The third is to find a parallel programming language, and write the compiler in that. This makes is easier to write parallel code without error, and can allow you to implement kinds of parallelism that might not be possible with a thread library (e.g., dynamic teams of computation, partial orders, ...). It also has the advantage that the parallel-language compiler can see the parallelism in the code, and can thus generate lower-overhead thread operations. Because compilers do many computations of varying duration, you don't want a synchronized-data-parallel language; you want one with task parallelism.
Our PARLANSE compiler is such a programming language, designed with goal of doing parallel symbolic (eg non-numeric) computations appropriate for compilers. Now you need a a parallel compiler and the energy to build a new compiler from scratch.
The fourth approach is to use a parallel language and a predefined library of compiler-typical activities (parsing, symbol table lookup, flow analysis, code generation) and build your compiler that way. At then you don't have to reinvent the basic facilities of the compiler and can get on with building the compiler itself.
Our DMS Software Reengineering Toolkit is exactly such a set of tools and libraries designed to allow one to build complex code generation/transformation or analysis tools. DMS has an full C++11 front end available that uses all the DMS (parallel) support machinery.
We've used DMS to carry out massive C++ analysis and transformation tasks. The parallelism is there, and it works; we could do more if we tried. We have not attempted to build a real parallel compiler; we don't see the market for it considering that other C++ compilers are free and well established. I expect someday that somebody will find a niche place where a parallel C++ compiler is needed, and then this machinery like this is likely to be a foundation; its too much work to start from scratch.

It's unlikely that you're going to build something that's substantially useful to you in real coding, especially as a first learning experience -- or, in other words, the value is in the journey, not in what you reach at the end. This is a learning experience.
Thus, I would suggest that you take a couple of things that are of interest to you and that annoy you about existing languages and compilers, and try to improve on them (or at least try out something different). Beyond that, though, you want to write something that's as simple as possible, so that you can actually complete it.
If you are doing enough multithreaded programming that that's a critical part of what you're thinking about, then it may be worth trying to write something that does multiprocessing. Otherwise, I would strongly recommend that you leave it out of your first compiler project; it's a big hairy nest of complication, and a compiler is a big enough and hairy enough project without it.
And, if you decide that multithreading is important, you can always add it later. Better to have something that works first, though, so that you have the fun of using it to keep you from getting bogged down while you add more to it.
Oh, and you definitely don't want to make your first compiler multithreaded internally! Start with simple, then add complexity once you've done it once and know how much effort is involved.

It depends on the type of language you wish to compile.
If you think of creating a language today, it should be based on modules, rather than the header/source way C++ is built. With such languages, the compiler is often given a single "target" (a module) and automatically compiles the dependencies (other modules), this is for example what the Python compiler does.
In this case, multithreading yields an immediate gain: you can simply parallelize the multiple modules compilations. This is relatively easy, just think about leaving a "mark" that you are compiling a given module to avoid doing it several times in parallel.
Going further is not immediately useful. While optimizations could probably be applied in parallel for example, it would require paying attention to the synchronization of the models they are applied to. The nice thing about compiling multiple modules in parallel is that they are mostly independent during compilation and read-only once ready, which alleviates most synchronization issues.

Is it worth writing part of code in C instead of C++ as micro-optimization?

I am wondering if it is still worth with modern compilers and their optimizations to write some critical code in C instead of C++ to make it faster.
I know C++ might lead to bad performance in case classes are copied while they could be passed by reference or when classes are created automatically by the compiler, typically with overloaded operators and many other similar cases; but for a good C++ developer who knows how to avoid all of this, is it still worth writing code in C to improve performance?

I'm going to agree with a lot of the comments. C syntax is supported, intentionally (with divergence only in C99), in C++. Therefore all C++ compilers have to support it. In fact I think it's hard to find any dedicated C compilers anymore. For example, in GCC you'll actually end up using the same optimization/compilation engine regardless of whether the code is C or C++.
The real question is then, does writing plain C code and compiling in C++ suffer a performance penalty. The answer is, for all intents and purposes, no. There are a few tricky points about exceptions and RTTI, but those are mainly size changes, not speed changes. You'd be so hard pressed to find an example that actually takes a performance hit that it doesn't seem worth it do write a dedicate module.
What was said about what features you use is important. It is very easy in C++ to get sloppy about copy semantics and suffer huge overheads from copying memory. In my experience this is the biggest cost -- in C you can also suffer this cost, but not as easily I'd say.
Virtual function calls are ever so slightly more expensive than normal functions. At the same time forced inline functions are cheaper than normal function calls. In both cases it is likely the cost of pushing/popping parameters from the stack that is more expensive. Worrying about function call overhead though should come quite late in the optimization process -- as it is rarely a significant problem.
Exceptions are costly at throw time (in GCC at least). But setting up catch statements and using RAII doesn't have a significant cost associated with it. This was by design in the GCC compiler (and others) so that truly only the exceptional cases are costly.
But to summarize: a good C++ programmer would not be able to make their code run faster simply by writing it in C.

measure! measure before thinking about optimizing, measure before applying optimization, measure after applying optimization, measure!
If you must run your code 1 nanosecond faster (because it's going to be used by 1000 people, 1000 times in the next 1000 days and that second is very important) anything goes.
Yes! it is worth ...
changing languages (C++ to C; Python to COBOL; Mathlab to Fortran; PHP to Lisp)
tweaking the compiler (enable/disable all the -f options)
use different libraries (even write your own)
etc
etc
What you must not forget is to measure!.

pmg nailed it. Just measure instead of global assumptions. Also think of it this way, compilers like gcc separate the front, middle, and back end. so the frontend fortran, c, c++, ada, etc ends up in the same internal middle language if you will that is what gets most of the optimization. Then that generic middle language is turned into assembler for the specific target, and there are target specific optimizations that occur. So the language may or may not induce more code from the front to middle when the languages differ greatly, but for C/C++ I would assume it is the same or very similar. Now the binary size is another story, the libraries that may get sucked into the binary for C only vs C++ even if it is only C syntax can/will vary. Doesnt necessarily affect execution performance but can bulk up the program file costing storage and transfer differences as well as memory requirements if the program loaded as a while into ram. Here again, just measure.
I also add to the measure comment compile to assembler and/or disassemble the output and compare the results of your different languages/compiler choices. This can/will supplement the timing differences you see when you measure.

The question has been answered to death, so I won't add to that.
Simply as a generic question, assuming you have measured, etc, and you have identified that a certain C++ (or other) code segment is not running at optimal speed (which generally means you have not used the right tool for the job); and you know you can get better performance by writing it in C, then yes, definitely, it is worth it.
There is a certain mindset that is common, trying to do everything from one tool (Java or SQL or C++). Not just Maslow's Hammer, but the actual belief that they can code a C construct in Java, etc. This leads to all kinds of performance problems. Architecture, as a true profession, is about placing code segments in the appropriate architectural location or platform. It is the correct combination of Java, SQL and C that will deliver performance. That produces an app that does not need to be re-visited; uneventful execution. In which case, it will not matter if or when C++ implements this constructors or that.

I am wondering if it is still worth with modern compilers and their optimizations to write some critical code in C instead of C++ to make it faster.
no. keep it readable. if your team prefers c++ or c, prefer that - especially if it is already functioning in production code (don't rewrite it without very good reasons).
I know C++ might lead to bad performance in case classes are copied while they could be passed by reference
then forbid copying and assigning
or when classes are created automatically by the compiler, typically with overloaded operators and many other similar cases
could you elaborate? if you are referring to templates, they don't have additional cost in runtime (although they can lead to additional exported symbols, resulting in a larger binary). in fact, using a template method can improve performance if (for example) a conversion would otherwise be necessary.
but for a good C++ developer who knows how to avoid all of this, is it still worth writing code in C to improve performance?
in my experience, an expert c++ developer can create a faster, more maintainable program.
you have to be selective about the language features that you use (and do not use). if you break c++ features down to the set available in c (e.g., remove exceptions, virtual function calls, rtti) then you're off to a good start. if you learn to use templates, metaprogramming, optimization techniques, avoid type aliasing (which becomes increasingly difficult or verbose in c), etc. then you should be on par or faster than c - with a program which is more easily maintained (since you are familiar with c++).
if you're comfortable using the features of c++, use c++. it has plenty of features (many of which have been added with speed/cost in mind), and can be written to be as fast as c (or faster).
with templates and metaprogramming, you could turn many runtime variables into compile-time constants for exceptional gains. sometimes that goes well into micro-optimization territory.

Mixing assembler code with c/c++

Why is assembly language code often needed along with C/C++ ?
What can't be done in C/C++, which is possible when assembly language code is mixed?
I have some source code of some 3D computer games. There are a lot of assembler code in use.

Things that pop to mind, in no particular order:
Special instructions. In an embedded application, I need to invalidate the cache after a DMA transfer has filled the memory buffer. The only way to do that on an SH-4 CPU is to execute a special instruction, so inline assembly (or a free-standing assembly function) is the only way to go.
Optimizations. Once upon a time, it was common for compilers to not know every trick that was possible to do. In some of those cases, it was worth the effort to replace an inner loop with a hand-crafted version. On the kinds of CPUs you find in small embedded systems (think 8051, PIC, and so forth) it can be valuable to push inner loops into assembly. I will emphasize that for modern processors with pipelines, multi-issue execution, extensive caching and more, it is often exceptionally difficult for hand coding to even approach the capabilities of the optimizer.
Interrupt handling. In an embedded application it is often needed to catch system events such as interrupts and exceptions. It is often the case that the first few instructions executed by an interrupt have special responsibilities and the only way to guarantee that the right things happen is to write the outer layer of a handler in assembly. For example, on a ColdFire (or any descendant of the 68000) only the very first instruction is guaranteed to execute. To prevent nested interrupts, that instruction must modify the interrupt priority level to mask out the priority of the current interrupt.
Certain portions of an OS kernel. For example, task switching requires that the execution state (at least most registers including PC and stack pointer) be saved for the current task and the state loaded for the new task. Fiddling with execution state of the CPU is well outside of the feature set of the language, but can be wrapped in a small amount of assembly code in a way that allows the rest of the kernel to be written in C or C++.
Edit: I've touched up the wording about optimization. Let me emphasize that for targets with large user populations and well supported compilers with decent optimization, it is highly unlikely that an assembly coder can beat the performance of the optimizer.
Before attempting, start by careful profiling to determine where the bottlenecks really lie. With that information in hand, examine assumptions and algorithms carefully, because the best optimization of all is usually to find a better way to handle the larger picture. Then, if all else fails, isolate the bottleneck in a test case, benchmark it carefully, and begin tweaking in assembly.

Why is assembly language code often
needed along with C/C++ ?
Competitive advantage. Like, if you are writing software for the (soon-to-be) #1 gaming company in the world.
What can't be done in C/C++, which is
possible when assembly language code
is mixed?
Nothing, unless some absolute performance level is needed, say, X frames per second or Y billions of polygons per second.
Edit: based on other replies, it seems the consensus is that embedded systems (iPhone, Android etc) have hardware accelerators that certainly require the use of assembly.
I have some source code of some 3D
computer games. There are a lot of
assembler code in use.
They are either written in the 80's-90's, or they are used sparingly (maybe 1% - 5% of total source code) inside a game engine.
Edit: to this date, compiler auto-vectorization quality is still poor. So, you may see programs that contain vectorization intrinsics, and since it's not really much different from writing in actual assembly (most intrinsics have one-one mapping to assembly instructions) some folks might just decide to write in assembly.
Update:
According to anecdotal evidence, RollerCoaster Tycoon is written in 99% assembly.
http://www.chrissawyergames.com/faq3.htm

In the past, compilers used to be pretty poor at optimizing for a particular architecture, and architectures used to be simpler. Now the reverse is true. These days, it's pretty hard for a human to write better assembly than an optimizing compiler, for deeply-pipelined, branch-predicting processors. And so you won't see it much. What there is will be short, and highly targeted.
In short, you probably won't need to do this. If you think you do, profile your code to make sure you've identified a hotspot - don't optimize something just because it's slow, if you're only spending 0.1% of your execution time there. See if you can improve your design or algorithm. If you don't find any improvement there, or if you need functionality not exposed by your higher-level language, look into hand-coding assembly.

There are certain things that can only be done in assembler and cannot be done in C/C++.
These include:
generating software interrupts (SWI or INT instructions)
Use of instructions like SWP for creating mutexes
specialist coporcessor instructions (such as those needed to program the MMU and manage RAM caches)
Access to carry and overflow flags.
You may also be able to optimize code better in assembler than C/C++ (eg memcpy on Android is written in assembler)

There may be new instructions that your compiler cannot yet generate, or the compiler does a bad job, or you may need to control the CPU directly.

Why is assembly language code often
needed along with C/C++ ?needed along with C/C++ ?
It isn't
What can't be done in C/C++, which is
possible when assembly language code
is mixed?
Accessing system registers or IO ports on the CPU.
Accessing BIOS functions.
Using specialized instructions that doesn't map directly to the programming language,
e.g. SIMD instructions.
Provide optimized code that's better than the compiler produces.
The two first points you usually don't need unless you're writing an operating system, or code
running without an operatiing system.
Modern CPUs are quite complex, and you'll be hard pressed to find people that actually can write assembly than what the compiler produces. Many compilers come with libraries giving you access
to more advanced features, like SIMD instructions, so nowadays you often don't need to fall back to
assembly for that.

One more thing worth mentioning is:
C & C++ do not provide any convenient way to setup stack frames when one needs to implement a binary level interop with a script language - or to implement some kind of support for closures.

Assembly can be very optimal than what any compiler can generate in certain situations.

Languages faster than C++ [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
It is said that Blitz++ provides near-Fortran performance.
Does Fortran actually tend to be faster than regular C++ for equivalent tasks?
What about other HL languages of exceptional runtime performance? I've heard of a few languages suprassing C++ for certain tasks... Objective Caml, Java, D...
I guess GC can make much code faster, because it removes the need for excessive copying around the stack? (assuming the code is not written for performance)
I am asking out of curiosity -- I always assumed C++ is pretty much unbeatable barring expert ASM coding.

Fortran is faster and almost always better than C++ for purely numerical code. There are many reasons why Fortran is faster. It is the oldest compiled language (a lot of knowledge in optimizing compilers). It is still THE language for numerical computations, so many compiler vendors make a living of selling optimized compilers. There are also other, more technical reasons. Fortran (well, at least Fortran77) does not have pointers, and thus, does not have the aliasing problems, which plague the C/C++ languages in that domain. Many high performance libraries are still coded in Fortran, with a long (> 30 years) history. Neither C or C++ have any good array constructs (C is too low level, C++ has as many array libraries as compilers on the planet, which are all incompatible with each other, thus preventing a pool of well tested, fast code).

Whether fortran is faster than c++ is a matter of discussion. Some say yes, some say no; I won't go into that. It depends on the compiler, the architecture you're running it on, the implementation of the algorithm ... etc.
Where fortran does have a big advantage over C is the time it takes you to implement those algorithms. And that makes it extremely well suited for any kind of numerical computing. I'll state just a few obvious advantages over C:
1-based array indexing (tremendously helpful when implementing larger models, and you don't have to think about it, but just FORmula TRANslate
has a power operator (**) (God, whose idea was that a power function will do ? Instead of an operator?!)
it has, I'd say the best support for multidimensional arrays of all the languages in the current market (and it doesn't seem that's gonna change so soon) - A(1,2) just like in math
not to mention avoiding the loops - A=B*C multiplies the arrays (almost like matlab syntax with compiled speed)
it has parallelism features built into the language (check the new standard on this one)
very easily connectible with languages like C, python, so you can make your heavy duty calculations in fortran, while .. whatever ... in the language of your choice, if you feel so inclined
completely backward compatible (since whole F77 is a subset of F90) so you have whole century of coding at your disposal
very very portable (this might not work for some compiler extensions, but in general it works like a charm)
problem oriented solving community (since fortran users are usually not cs, but math, phy, engineers ... people with no programming, but rather problem solving experience whose knowledge about your problem can be very helpful)
Can't think of anything else off the top of my head right now, so this will have to do.

What Blitz++ is competing against is not so much the Fortran language, but the man-centuries of work going into Fortran math libraries. To some extent the language helps: an older language has had a lot more time to get optimizing compilers (and , let's face it, C++ is one of the most complex languages). On the other hand, high level C++ libraries like Blitz++ and uBLAS allows you to state your intentions more clearly than relatively low-level Fortran code, and allows for whole new classes of compile-time optimizations.
However, using any library effectively all the time requires developers to be well acquainted with the language, the library and the mathematics. You can usually get faster code by improving any one of the three...

FORTAN is typically faster than C++ for array processing because of the different ways the languages implement arrays - FORTRAN doesn't allow aliasing of array elements, whereas C++ does. This makes the FORTRAN compilers job easier. Also, FORTRAN has many very mature mathematical libraries which have been worked on for nearly 50 years - C++ has not been around that long!

This will depend a lot on the compiler, programmers, whether it has gc and can vary too much. If it is compiled directly to machine code then expect to have better performance than interpreted most of the time but there is a finite amount of optimization possible before you have asm speed anyway.
If someone said fortran was slightly faster would you code a new project in that anyway?

the thing with c++ is that it is very close to the hardware level. In fact, you can program at the hardware level (via assembly blocks). In general, c++ compilers do a pretty good job at optimisations (for a huge speed boost, enable "Link Time Code Generation" to allow the inlining of functions between different cpp files), but if you know the hardware and have the know-how, you can write a few functions in assembly that work even faster (though sometimes, you just can't beat the compiler).
You can also implement you're own memory managers (which is something a lot of other high level languages don't allow), thus you can customize them for your specific task (maybe most allocations will be 32 bytes or less, then you can just have a giant list of 32-byte buffers that you can allocate/deallocate in O(1) time). I believe that c++ CAN beat any other language, as long as you fully understand the compiler and the hardware that you are using. The majority of it comes down to what algorithms you use more than anything else.

You must be using some odd managed XML parser as you load this page then. :)
We continously profile code and the gain is consistently (and this is not naive C++, it is just modern C++ with boos). It consistensly paves any CLR implementation by at least 2x and often by 5x or more. A bit better than Java days when it was around 20x times faster but you can still find good instances and simply eliminate all the System.Object bloat and clearly beat it to a pulp.
One thing managed devs don't get is that the hardware architecture is against any scaling of VM and object root aproaches. You have to see it to believe it, hang on, fire up a browser and go to a 'thin' VM like Silverlight. You'll be schocked how slow and CPU hungry it is.
Two, kick of a database app for any performance, yes managed vs native db.

It's usually the algorithm not the language that determines the performance ballpark that you will end up in.
Within that ballpark, optimising compilers can usually produce better code than most assembly coders.
Premature optimisation is the root of all evil
This may be the "common knowledge" that everyone can parrot, but I submit that's probably because it's correct. I await concrete evidence to the contrary.

D can sometimes be faster than C++ in practical applications, largely because the presence of garbage collection helps avoid the overhead of RAII and reference counting when using smart pointers. For programs that allocate large amounts of small objects with non-trivial lifecycles, garbage collection can be faster than C++-style memory management. Also, D's builtin arrays allow the compiler to perform better optimizations in some cases than C++'s STL vector, which the compiler doesn't understand. Furthermore, D2 supports immutable data and pure function annotations, which recent versions of DMD2 optimize based on. Walter Bright, D's creator, wrote a JavaScript interpreter in both D and C++, and according to him, the D version is faster.

C# is much faster than C++ - in C# I can write an XML parser and data processor in a tenth the time it takes me to write it C++.
Oh, did you mean execution speed?
Even then, if you take the time from the first line of code written to the end of the first execution of the code, C# is still probably faster than C++.
This is a very interesting article about converting a C++ program to C# and the effort required to make the C++ faster than the C#.
So, if you take development speed into account, almost anything beats C++.
OK, to address tht OP's runtime only performance requirement: It's not the langauge, it's the implementation of the language that determines the runtime performance. I could write a C++ compiler that produces the slowest code imaginable, but it's still C++. It is also theoretically possible to write a compiler for Java that targets IA32 instructions rather than the Java VM byte codes, giving a runtime speed boost.
The performance of your code will depend on the fit between the strengths of the language and the requirements of the code. For example, a program that does lots of memory allocation / deallocation will perform badly in a naive C++ program (i.e. use the default memory allocator) since the C++ memory allocation strategy is too generalised, whereas C#'s GC based allocator can perform better (as the above link shows). String manipulation is slow in C++ but quick in languages like php, perl, etc.

It all depends on the compiler, take for example the Stalin Scheme compiler, it beats almost all languages in the Debian micro benchmark suite, but do they mention anything about compile times?
No, I suspect (I have not used Stalin before) compiling for benchmarks (iow all optimizations at maximum effort levels) takes a jolly long time for anything but the smallest pieces of code.

if the code is not written for performance then C# is faster than C++.
A necessary disclaimer: All benchmarks are evil.
Here's benchmarks that in favour of C++.
The above two links show that we can find cases where C++ is faster than C# and vice versa.

Performance of a compiled language is a useless concept: What's important is the quality of the compiler, ie what optimizations it is able to apply. For example, often - but not always - the Intel C++ compiler produces better performing code than g++. So how do you measure the performance of C++?
Where language semantics come in is how easy it is for the programmer to get the compiler to create optimal output. For example, it's often easier to parallelize Fortran code than C code, which is why Fortran is still heavily used for high-performance computation (eg climate simulations).
As the question and some of the answers mentioned assembler: the same is true here, it's just another compiled language and thus not inherently 'faster'. The difference between assembler and other languages is that the programmer - who ideally has absolute knowledge about the program - is responsible for all of the optimizations instead of delegating some of them to the 'dumb' compiler.
Eg function calls in assembler may use registers to pass arguments and don't need to create unnecessary stack frames, but a good compiler can do this as well (think inlining or fastcall). The downside of using assembler is that better performing algorithms are harder to implement (think linear search vs. binary seach, hashtable lookup, ...).

Doing much better than C++ is mostly going to be about making the compiler understand what the programmer means. An example of this might be an instance where a compiler of any language infers that a region of code is independent of its inputs and just computes the result value at compile time.
Another example of this is how C# produces some very high performance code simply because the compiler knows what particular incantations 'mean' and can cleverly use the implementation that produces the highest performance, where a transliteration of the same program into C++ results in needless alloc/delete cycles (hidden by templates) because the compiler is handling the general case instead of the particular case this piece of code is giving.
A final example might be in the Brook/Cuda adaptations of C designed for exotic hardware that isn't so exotic anymore. The language supports the exact primitives (kernel functions) that map to the non von-neuman hardware being compiled for.

Is that why you are using a managed browser? Because it is faster. Or managed OS because it is faster. Nah, hang on, it is the SQL database.. Wait, it must be the game you are playing. Stop, there must be a piece of numerical code Java adn Csharp frankly are useless with. BTW, you have to check what your VM is written it to slag the root language and say it is slow.
What a misconecption, but hey show me a fast managed app so we can all have a laugh. VS? OpenOffice?

Ahh... The good old question - which compiler makes faster code?
It only matters in code that actually spends much time at the bottom of the call stack, i.e. hot spots that don't contain function calls, such as matrix inversion, etc.
(Implied by 1) It only matters in code the compiler actually sees. If your program counter spends all its time in 3rd-party libraries you don't build, it doesn't matter.
In code where it does matter, it all comes down to which compiler makes better ASM, and that's largely a function of how smartly or stupidly the source code is written.
With all these variables, it's hard to distinguish between good compilers.
However, as was said, if you've got a lot of Fortran code to compile, don't re-write it.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js