Convert Freepascal function to assembly?

Convert Freepascal function to assembly? - c++

Due to performance issues, I'd like to attempt to convert a Freepascal function (SHA1Update, from the SHA1 unit) to assembly. I use Freepascal 2.6.4 and Lazxarus 1.2.4.
The reason is, I have a loop structure (repeat...until) that reads 64Kb blocks of raw data from disk into a buffer, and then it is hashed. Without the hashing, I can read the disk at 4Gb p\min. With the hashing, it slows to just over 1Gb p\min. So someone suggested converting the hashing routine to assembly.
I am a below average programmer when using high-level languages, let alone assembly, but the potential for performance improvement is drving me to at least enquire.
So my question is : is there a program or script that can take a procedure or function and magically convert it to assembly that I can then compile using the Freepascal compiler? I know it can be done for C\C++ using even web based system like this one

Assembly is indeed what you would use for optimising selected sequences of code. But, because native code compilers generate machine code, usually using an intermediate assembly source representation, which is then run through an assembler, the advantage you gain from using a compiler to "magically convert" your section of code, subject to optimisation, to assembly which then is linked to the rest of the program, compared to simply compiling the whole program with the compiler, is about zero - you're using the same compiler for converting, after all. From that angle, a compiler is nothing else than such a program which "magically converts it to assembly". For optimisation purposes, you want to hand code those section of code - and you need to be good at it. Many compilers generate code nowadays which performs better than non-expert crafted code, for various reasons. One is that target CPUs are very different in what is best performing code for them, and the rules to determine how efficient code for a specific CPU must look like, are often extremely complex. As a hand coder, you need to know the differences between them, to know how to write code which performs well. This knowledge is something many compilers have, and are therefore able to generate code such that one or another CPU architecture or model can benefit from the differences the compiler puts into code generation.
Often much better performance gains can be achieved by choosing more efficient algorithms. A better algorithm, coded in high level, usually outperforms a less adequate algorithm, hand coded in assembly. Therefore I'd look into possibilities to make the hashing process as such faster, by looking at alternative and faster algorithms, rather than trying to improve speed using assembly at this stage - consider assembly optimisation as a last, final step optimisation, when other means to speed up your code have been exhausted.

As #Bushmills already explained your code is converted to assembly automatically by the FreePascal compiler - before producing the machine code in the Portable Executable (*.exe) format.
What you would need is not the assembly language, but hand-optimized code written in assembly language. This is task for experienced assembly programmer. You can 1) become an assembly language expert by yourself, this Stack Overflow question can give you some starting points: A good NASM/FASM tutorial?
My guess is that any programmer can become an assembly language expert (either CISC or RISC architectures) in about a year. Depending on your previous experience and the courses you'd take and your eagerness. For theoretical background (processor-neutral) I'd recommend Donald Knuth's MMIX lectures
You should be able to 2) see the intermediate assembly files produced by the FreePascal compiler by following instructions in this: http://free-pascal-general.1045716.n5.nabble.com/Assembler-file-generate-by-compiler-td5710837.html discussion
If you want to really move on in a reasonable time-frame then I'd suggest you to create Minimal, Complete and Verifiable example and 3) ask for code review at some code review sites where some more experienced programmers will take a look at your code and propose some changes. These sites should be a good candidates:
https://codereview.stackexchange.com/
https://www.codementor.io/
Those are sites designed especially for helping beginners and intermediate programmers with problems like the one of yours

Related

How much faster if write in-line assembly rather than regular c/c++ code? [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 11 years ago.
One of my senior collegues optimizes a function (he is implementing image filtering) by writing in-line assembly. Is that really necessary? Wouldn't modern compiler do that for us? Typically, how much gain do we have by converting C code into assembly? If assembly code really brings lots of benefits, when should we convert C/C++ code into assembly and when should we leave the code as it is, since assembly code is hard to read and maintain.

If you are smarter than the compiler, you may be able to make your code faster on one specific platform by writing it by hand in assembly.
However, most big C/C++ compilers are extremely good optimizers; you are unlikely to be smarter than them.

No, that's not really necessary, and this also makes porting the app much more different. This is the main concern about inline assembly.
And, of course, 80% of the time compiler can do this better.

First find an efficient algorithm.
Then implement it in clearly readable code.
Then evaluate its performance.
If your code's performance is inadequate, consider alternative algorithms
Repeat steps 3 and 4 until either performance is acceptable or you have exhausted all algorithmic alternatives
Drink some coffee.
Take a walk.
Repeat steps 3 and 4 again some more.
Have a beer.
Give steps 3 and 4 another few tries.
Get some rest
Back to 3 and 4.
Spend years studying the architecture of the CPU(s) your code will run on
Now consider hand-writing some assembly.

I'd imagine that for image filtering you might benefit from e.g. the availability of SIMD instructions, but not all compilers can automatically compile your code to use them, and not all the time. So in-line assembly or intrinsics can help with that.

One of my senior collegues optimizes a function (he is implementing image filtering) by writing in-line assembly. Is that really necessary?
Obviously I can't comment on your colleagues exact situation, but I wouldn't be surprised if it was necessary. There's many specialised instructions that are used for image filters that won't necessarily be used by the compiler. Inline assembly is often the only way to access those instructions (or through intrinsics).
Wouldn't modern compiler do that for us?
Obviously this depends on what 'that' is, but while modern compilers are certainly good at generating code, they aren't magic. It is often the case where you know something about your code that the compiler doesn't (or can't).
If your line of work involves high performance code then there are definitely places where you can get major improvements from using inline assembly (or even just compiler intrinsics).
If assembly code really brings lots of benefits, when should we convert C/C++ code into assembly and when should we leave the code as it is, since assembly code is hard to read and maintain.
Here's how:
First, profile your code to see what potential benefits are to be gained.
Look at the disassembly to see what the compiler is doing. If it is already doing things optimally then there is no point going further.
If there are opportunities for improvement, consider using compiler intrinsics before hand-written assembly as it is generally easier to maintain and more portable.
Only if all that fails should you go to inline assembly.

The short answer is no it's not necessary, the longer answer is... well, it depends. Modern compilers do indeed do a very good job of optimizing code, but they don't necessarily have access to all the assumptions a human does when optimizing. Hand coded assembler can beat compiled code, but there is a tradeoff between portability and maintenance.
Assuming that you have already determined this bit of code is a hotspot, the first thing you should do is tweak algorithms, then tweak the C++ code to make it faster (for example unrolling loops), and then tweak compiler flags. As a last resort, if you still can't make it go as fast as you need, consider whether its worth paying the cost of hand-optimizing, given all the future cost you will incur in maintenance and portability.

Where image processing is concerned I would be cautious either way as it depends on the input data, the algorithm and the compiler. Intel's ICC has a very good parallelizer and vectorizer for generating SSE code, it may be hard to beat by hand in most general purpose image processing cases. VCC on the other hand might not do such a good job.
However, I would expect that most benefit could be gained using compiler intrinsics rather than inline assembler.

programming languages are very well coded. Unless you are using very simple bitwise operations, like add, bitshift or using pointers or new instruction sets, you should use a practical programming lanugage. you DO NOT need assembly language for anything in your life. standard c operations call the relevant cpu instructions. If somebody makes a new CPU and it supports new instructions and you want to use those instructions, the programming languages or libraries do not support them and adaptation takes time. A new instruction in a cpu, makes things faster but you won't ever work in teams like DirectX or Opengl or MMX, SSE bla bla bla. Think of a day when graphic libraries like directx or opengl were not developed and intel,for say, creates some isntruction sets currently supported by none of the languages or inexistent in none of the libraries developed. then you would want to call some method from cpu and pass your parameters to that, for better performance. You can still do the same things without the new instruction in cpu. Another example, a new cpu by intel can support md5 hash checking, it doesnt mean you can't use md5, it means a library developed which uses md5 instructions will work faster because the cpu has a separate entity inside which will execute the operation efficiently. but normally you would wait until somebody publishes a library which uses md5 instructions int the cpu. cpus today add instrucion sets for zip, hash check, encryption and so on. you would use assembly language for some specific instruction. not for the good old add, multiply, subtract or divide because your programming language is already using them in the most efficient way possible.

How do I find how C++ compiler implements something except inspecting emitted machine code?

Suppose I crafted a set of classes to abstract something and now I worry whether my C++ compiler will be able to peel off those wrappings and emit really clean, concise and fast code. How do I find out what the compiler decided to do?
The only way I know is to inspect the disassembly. This works well for simple code, but there're two drawbacks - the compiler might do it different when it compiles the same code again and also machine code analysis is not trivial, so it takes effort.
How else can I find how the compiler decided to implement what I coded in C++?

I'm afraid you're out of luck on this one. You're trying to find out "what the compiler did". What the compiler did is to produce machine code. The disassembly is simply a more readable form of the machine code, but it can't add information that isn't there. You can't figure out how a meat grinder works by looking at a hamburger.

I was actually wondering about that.
I have been quite interested, for the last few months, in the Clang project.
One of Clang particular interests, wrt optimization, is that you can emit the optimized LLVM IR code instead of machine code. The IR is a high-level assembly language, with the notion of structure and type.
Most of the optimizations passes in the Clang compiler suite are indeed performed on the IR (the last round is of course architecture specific and performed by the backend depending on the available operations), this means that you could actually see, right in the IR, if the object creation (as in your linked question) was optimized out or not.
I know it is still assembly (though of higher level), but it does seem more readable to me:
far less opcodes
typed objects / pointers
no "register" things or "magic" knowledge required
Would that suit you :) ?

Timing the code will directly measure its speed and can avoid looking at the disassembly entirely. This will detect when compiler, code modifications or subtle configuration changes have affected the performance (either for better or worse). In that way it's better than the disassembly which is only an indirect measure.
Things like code size can also serve as possible indicators of problems. At the very least they suggest that something has changed. It can also point out unexpected code bloat when the compiler should have boiled down a bunch of templates (or whatever) into a concise series of instructions.
Of course, looking at the disassembly is an excellent technique for developing the code and helping decide if the compiler is doing a sufficiently good translation. You can see if you're getting your money's worth, as it were.
In other words, measure what you expect and then dive in if you think the compiler is "cheating" you.

You want to know if the compiler produced "clean, concise and fast code".
"Clean" has little meaning here. Clean code is code which promotes readability and maintainability -- by human beings. Thus, this property relates to what the programmer sees, i.e. the source code. There is no notion of cleanliness for binary code produced by a compiler that will be looked at by the CPU only. If you wrote a nice set of classes to abstract your problem, then your code is as clean as it can get.
"Concise code" has two meanings. For source code, this is about saving the scarce programmer eye and brain resources, but, as I pointed out above, this does not apply to compiler output, since there is no human involved at that point. The other meaning is about code which is compact, thus having lower storage cost. This can have an impact on execution speed, because RAM is slow, and thus you really want the innermost loops of your code to fit in the CPU level 1 cache. The size of the functions produced by the compiler can be obtained with some developer tools; on systems which use GNU binutils, you can use the size command to get the total code and data sizes in an object file (a compiled .o), and objdump to get more information. In particular, objdump -x will give the size of each individual function.
"Fast" is something to be measured. If you want to know whether your code is fast or not, then benchmark it. If the code turns out to be too slow for your problem at hand (this does not happen often) and you have some compelling theoretical reason to believe that the hardware could do much better (e.g. because you estimated the number of involved operations, delved into the CPU manuals, and mastered all the memory bandwidth and cache issues), then (and only then) is it time to have a look at what the compiler did with your code. Barring these conditions, cleanliness of source code is a much more important issue.
All that being said, it can help quite a lot if you have a priori notions of what a compiler can do. This requires some training. I suggest that you have a look at the classic dragon book; but otherwise you will have to spend some time compiling some example code and looking at the assembly output. C++ is not the easiest language for that, you may want to begin with plain C. Ideally, once you know enough to be able to write your own compiler, then you know what a compiler can do, and you can guess what it will do on a given code.

You might find a compiler that had an option to dump a post-optimisation AST/representation - how readable it would be is another matter. If you're using GCC, there's a chance it wouldn't be too hard, and that someone might have already done it - GCCXML does something vaguely similar. Of little use if the compiler you want to build your production code on can't do it.
After that, some compiler (e.g. gcc with -S) can output assembly language, which might be usefully clearer than reading a disassembly: for example, some compilers alternate high-level source as comments then corresponding assembly.
As for the drawbacks you mentioned:
the compiler might do it different when it compiles the same code again
absolutely, only the compiler docs and/or source code can tell you the chance of that, though you can put some performance checks into nightly test runs so you'll get alerted if performance suddenly changes
and also machine code analysis is not trivial, so it takes effort.
Which raises the question: what would be better. I can image some process where you run the compiler over your code and it records when variables are cached in registers at points of use, which function calls are inlined, even the maximum number of CPU cycles an instruction might take (where knowable at compile time) etc. and produces some record thereof, then a source viewer/editor that colour codes and annotates the source correspondingly. Is that the kind of thing you have in mind? Would it be useful? Perhaps some more than others - e.g. black-and-white info on register usage ignores the utility of the various levels of CPU cache (and utilisation at run-time); the compiler probably doesn't even try to model that anyway. Knowing where inlining was really being done would give me a warm fuzzy feeling. But, profiling seems more promising and useful generally. I fear the benefits are more intuitively real than actually, and compiler writers are better off pursuing C++0x features, run-time instrumentation, introspection, or writing D "on the side" ;-).

The answer to your question was pretty much nailed by Karl. If you want to see what the compiler did, you have to start going through the assembly code it produced--elbow grease is required. As to discovering the "why" behind the "how" of how it implemented your code...every compiler (and every build, potentially), as you mentioned, is different. There are different approaches, different optimizations, etc. However, I wouldn't worry about whether it's emitting clean, concise machine code--cleanliness and concision should be left to the source code. Speed, on the other hand, is pretty much the programmer's responsibility (profiling ftw). More interesting concerns are correctness, maintainability, readability, etc. If you want to see if it made a specific optimization, the compiler docs might help (if they're available for your compiler). You can also just trying searching to see if the compiler implements a known technique for optimizing whatever. If those approaches fail, though, you're right back to reading assembly code. Keep in mind that the code that you're checking out might have little to no impact on performance or executable size--grab some hard data before diving into any of this stuff.

Actually, there is a way to get what you want, if you can get your compiler to
produce DWARF debugging information. There will be a DWARF description for each
out-of-line function and within that description there will (hopefully) be entries
for each inlined function. It's not trivial to read DWARF, and sometimes compilers
don't produce complete or accurate DWARF, but it can be a useful source of information
about what the compiler actually did, that's not tied to any one compiler or CPU.
Once you have a DWARF reading library there are all sorts of useful tools you can
build around it.
Don't expect to use it with Visual C++ as that uses a different debugging format.
(But you might be able to do similar queries through the debug helper library
that comes with it.)

If your compiler manages to translate your "wrappings and emit really clean, concise and fast code" the effort to follow-up the emitted code should be reasonable.
Contrary to another answer I feel that emitted assembly code may well be "clean" if it is (relatively) easily mappable to the original source code, if it doesn't consist of calls all over the place and that the system of jumps is not too complex. With code scheduling and re-ordering an optimized machine code which is also readable is, alas, a thing of the past.

Mixing assembler code with c/c++

Why is assembly language code often needed along with C/C++ ?
What can't be done in C/C++, which is possible when assembly language code is mixed?
I have some source code of some 3D computer games. There are a lot of assembler code in use.

Things that pop to mind, in no particular order:
Special instructions. In an embedded application, I need to invalidate the cache after a DMA transfer has filled the memory buffer. The only way to do that on an SH-4 CPU is to execute a special instruction, so inline assembly (or a free-standing assembly function) is the only way to go.
Optimizations. Once upon a time, it was common for compilers to not know every trick that was possible to do. In some of those cases, it was worth the effort to replace an inner loop with a hand-crafted version. On the kinds of CPUs you find in small embedded systems (think 8051, PIC, and so forth) it can be valuable to push inner loops into assembly. I will emphasize that for modern processors with pipelines, multi-issue execution, extensive caching and more, it is often exceptionally difficult for hand coding to even approach the capabilities of the optimizer.
Interrupt handling. In an embedded application it is often needed to catch system events such as interrupts and exceptions. It is often the case that the first few instructions executed by an interrupt have special responsibilities and the only way to guarantee that the right things happen is to write the outer layer of a handler in assembly. For example, on a ColdFire (or any descendant of the 68000) only the very first instruction is guaranteed to execute. To prevent nested interrupts, that instruction must modify the interrupt priority level to mask out the priority of the current interrupt.
Certain portions of an OS kernel. For example, task switching requires that the execution state (at least most registers including PC and stack pointer) be saved for the current task and the state loaded for the new task. Fiddling with execution state of the CPU is well outside of the feature set of the language, but can be wrapped in a small amount of assembly code in a way that allows the rest of the kernel to be written in C or C++.
Edit: I've touched up the wording about optimization. Let me emphasize that for targets with large user populations and well supported compilers with decent optimization, it is highly unlikely that an assembly coder can beat the performance of the optimizer.
Before attempting, start by careful profiling to determine where the bottlenecks really lie. With that information in hand, examine assumptions and algorithms carefully, because the best optimization of all is usually to find a better way to handle the larger picture. Then, if all else fails, isolate the bottleneck in a test case, benchmark it carefully, and begin tweaking in assembly.

Why is assembly language code often
needed along with C/C++ ?
Competitive advantage. Like, if you are writing software for the (soon-to-be) #1 gaming company in the world.
What can't be done in C/C++, which is
possible when assembly language code
is mixed?
Nothing, unless some absolute performance level is needed, say, X frames per second or Y billions of polygons per second.
Edit: based on other replies, it seems the consensus is that embedded systems (iPhone, Android etc) have hardware accelerators that certainly require the use of assembly.
I have some source code of some 3D
computer games. There are a lot of
assembler code in use.
They are either written in the 80's-90's, or they are used sparingly (maybe 1% - 5% of total source code) inside a game engine.
Edit: to this date, compiler auto-vectorization quality is still poor. So, you may see programs that contain vectorization intrinsics, and since it's not really much different from writing in actual assembly (most intrinsics have one-one mapping to assembly instructions) some folks might just decide to write in assembly.
Update:
According to anecdotal evidence, RollerCoaster Tycoon is written in 99% assembly.
http://www.chrissawyergames.com/faq3.htm

In the past, compilers used to be pretty poor at optimizing for a particular architecture, and architectures used to be simpler. Now the reverse is true. These days, it's pretty hard for a human to write better assembly than an optimizing compiler, for deeply-pipelined, branch-predicting processors. And so you won't see it much. What there is will be short, and highly targeted.
In short, you probably won't need to do this. If you think you do, profile your code to make sure you've identified a hotspot - don't optimize something just because it's slow, if you're only spending 0.1% of your execution time there. See if you can improve your design or algorithm. If you don't find any improvement there, or if you need functionality not exposed by your higher-level language, look into hand-coding assembly.

There are certain things that can only be done in assembler and cannot be done in C/C++.
These include:
generating software interrupts (SWI or INT instructions)
Use of instructions like SWP for creating mutexes
specialist coporcessor instructions (such as those needed to program the MMU and manage RAM caches)
Access to carry and overflow flags.
You may also be able to optimize code better in assembler than C/C++ (eg memcpy on Android is written in assembler)

There may be new instructions that your compiler cannot yet generate, or the compiler does a bad job, or you may need to control the CPU directly.

Why is assembly language code often
needed along with C/C++ ?needed along with C/C++ ?
It isn't
What can't be done in C/C++, which is
possible when assembly language code
is mixed?
Accessing system registers or IO ports on the CPU.
Accessing BIOS functions.
Using specialized instructions that doesn't map directly to the programming language,
e.g. SIMD instructions.
Provide optimized code that's better than the compiler produces.
The two first points you usually don't need unless you're writing an operating system, or code
running without an operatiing system.
Modern CPUs are quite complex, and you'll be hard pressed to find people that actually can write assembly than what the compiler produces. Many compilers come with libraries giving you access
to more advanced features, like SIMD instructions, so nowadays you often don't need to fall back to
assembly for that.

One more thing worth mentioning is:
C & C++ do not provide any convenient way to setup stack frames when one needs to implement a binary level interop with a script language - or to implement some kind of support for closures.

Assembly can be very optimal than what any compiler can generate in certain situations.

C/C++ compiler feedback optimization

Has anyone seen any real world numbers for different programs which are using the feedback optimization that C/C++ compilers offer to support the branch prediction, cache preloading functions etc.
I searched for it and amazingly not even the popular interpreter development groups seem to have checked the effect. And increasing ruby,python,php etc. performance by 10% or so should be considered usefull.
Is there really no benefit or is the whole developer community just to lazy to use it?

10% is a good ballpark figure. That said, ...
You have to REALLY care about the performance to go this route. The product I work on (DB2) uses PGO and other invasive and agressive optimizations. Among the costs are significant build time (triple on some platforms) and development and support nightmares.
When something goes wrong it can be non-trivial to map the fault location in the optimized code back to the source. Developers don't usually expect that functions in different modules can end up merged and inlined and this can have "interesting" effects.
Problems with pointer aliasing, which are nasty to track down also usually show up with these sorts of optimizations. You have the additional fun of having non-deterministic builds (an aliasing problem can show up in monday's build, vanish again till thursday's, ...).
The line between what is correct or incorrect compiler behaviour under these sorts of aggressive optimizations also becomes fairly blurred. Even with the luxury of having our compiler guys in house (literally) the optimization issues (either in our source or the compiler) are still not easy to understand and resolve.

From unladen-swallow (a project optimizing the CPython VM):
For us, the final nail in PyBench's coffin was when experimenting with gcc's feedback-directed optimization tools, we were able to produce a universal 15% performance increase across our macrobenchmarks; using the same training workload, PyBench got 10% slower.
So some people are at least looking at it. That said, PGO sets some pretty tricky requirements on the build environment that are hard to satisfy for open-source projects meant to be built by a distributed heterogeneous group of people. Heavy optimization also creates difficult to debug heisenbugs. It's less work to give the compiler explicit hints for the performance critical parts.
That said, I expect significant performance increases from runtime profile guided optimization. JIT'ing allows the optimizer to cope with the profile of data changing across the execution of a program and do many extremely runtime data specific optimizations that would explode the code size for static compilation. Especially dynamic languages need good runtime data based optimization to perform well. With dynamic language performance getting significant attention lately (JavaScript VM's, MS DLR, JSR-292, PyPy and so on) there's a lot of work being done in this area.

Traditional methods in improving the compiler efficiency via profiling is done by performance analysis tools. However, how the data from the tools may be of use in optimization still depends on the compiler you use. For example, GCC is a framework being worked on to produce compilers for different domains. Providing profiling mechanism in the such compiler framework will be extremely difficult.
We can rely on statistical data to do certain optimization. For instance, GCC unrolls a loop if the loop count is less than a constant (say 7). How it fixes up the constant will be based on statistical result of the code size generated for different target architecture.
Profile guided optimizations track the special areas of the source. Details regarding previous run results needs to be stored which is an overhead. The input on the other hand requires a statistical representation of the target application which may use the compiler. So the complexity level rises with the number of different inputs and outputs. In short, deciding profile guided optimization needs extreme data collection. Automation or embedding such profiling into source needs careful monitoring. If not, the entire result will be awry and in our effort to swim we actually will drown.
However, experimentation on this regard is ongoing. Just have a look at POGO.

Languages faster than C++ [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
It is said that Blitz++ provides near-Fortran performance.
Does Fortran actually tend to be faster than regular C++ for equivalent tasks?
What about other HL languages of exceptional runtime performance? I've heard of a few languages suprassing C++ for certain tasks... Objective Caml, Java, D...
I guess GC can make much code faster, because it removes the need for excessive copying around the stack? (assuming the code is not written for performance)
I am asking out of curiosity -- I always assumed C++ is pretty much unbeatable barring expert ASM coding.

Fortran is faster and almost always better than C++ for purely numerical code. There are many reasons why Fortran is faster. It is the oldest compiled language (a lot of knowledge in optimizing compilers). It is still THE language for numerical computations, so many compiler vendors make a living of selling optimized compilers. There are also other, more technical reasons. Fortran (well, at least Fortran77) does not have pointers, and thus, does not have the aliasing problems, which plague the C/C++ languages in that domain. Many high performance libraries are still coded in Fortran, with a long (> 30 years) history. Neither C or C++ have any good array constructs (C is too low level, C++ has as many array libraries as compilers on the planet, which are all incompatible with each other, thus preventing a pool of well tested, fast code).

Whether fortran is faster than c++ is a matter of discussion. Some say yes, some say no; I won't go into that. It depends on the compiler, the architecture you're running it on, the implementation of the algorithm ... etc.
Where fortran does have a big advantage over C is the time it takes you to implement those algorithms. And that makes it extremely well suited for any kind of numerical computing. I'll state just a few obvious advantages over C:
1-based array indexing (tremendously helpful when implementing larger models, and you don't have to think about it, but just FORmula TRANslate
has a power operator (**) (God, whose idea was that a power function will do ? Instead of an operator?!)
it has, I'd say the best support for multidimensional arrays of all the languages in the current market (and it doesn't seem that's gonna change so soon) - A(1,2) just like in math
not to mention avoiding the loops - A=B*C multiplies the arrays (almost like matlab syntax with compiled speed)
it has parallelism features built into the language (check the new standard on this one)
very easily connectible with languages like C, python, so you can make your heavy duty calculations in fortran, while .. whatever ... in the language of your choice, if you feel so inclined
completely backward compatible (since whole F77 is a subset of F90) so you have whole century of coding at your disposal
very very portable (this might not work for some compiler extensions, but in general it works like a charm)
problem oriented solving community (since fortran users are usually not cs, but math, phy, engineers ... people with no programming, but rather problem solving experience whose knowledge about your problem can be very helpful)
Can't think of anything else off the top of my head right now, so this will have to do.

What Blitz++ is competing against is not so much the Fortran language, but the man-centuries of work going into Fortran math libraries. To some extent the language helps: an older language has had a lot more time to get optimizing compilers (and , let's face it, C++ is one of the most complex languages). On the other hand, high level C++ libraries like Blitz++ and uBLAS allows you to state your intentions more clearly than relatively low-level Fortran code, and allows for whole new classes of compile-time optimizations.
However, using any library effectively all the time requires developers to be well acquainted with the language, the library and the mathematics. You can usually get faster code by improving any one of the three...

FORTAN is typically faster than C++ for array processing because of the different ways the languages implement arrays - FORTRAN doesn't allow aliasing of array elements, whereas C++ does. This makes the FORTRAN compilers job easier. Also, FORTRAN has many very mature mathematical libraries which have been worked on for nearly 50 years - C++ has not been around that long!

This will depend a lot on the compiler, programmers, whether it has gc and can vary too much. If it is compiled directly to machine code then expect to have better performance than interpreted most of the time but there is a finite amount of optimization possible before you have asm speed anyway.
If someone said fortran was slightly faster would you code a new project in that anyway?

the thing with c++ is that it is very close to the hardware level. In fact, you can program at the hardware level (via assembly blocks). In general, c++ compilers do a pretty good job at optimisations (for a huge speed boost, enable "Link Time Code Generation" to allow the inlining of functions between different cpp files), but if you know the hardware and have the know-how, you can write a few functions in assembly that work even faster (though sometimes, you just can't beat the compiler).
You can also implement you're own memory managers (which is something a lot of other high level languages don't allow), thus you can customize them for your specific task (maybe most allocations will be 32 bytes or less, then you can just have a giant list of 32-byte buffers that you can allocate/deallocate in O(1) time). I believe that c++ CAN beat any other language, as long as you fully understand the compiler and the hardware that you are using. The majority of it comes down to what algorithms you use more than anything else.

You must be using some odd managed XML parser as you load this page then. :)
We continously profile code and the gain is consistently (and this is not naive C++, it is just modern C++ with boos). It consistensly paves any CLR implementation by at least 2x and often by 5x or more. A bit better than Java days when it was around 20x times faster but you can still find good instances and simply eliminate all the System.Object bloat and clearly beat it to a pulp.
One thing managed devs don't get is that the hardware architecture is against any scaling of VM and object root aproaches. You have to see it to believe it, hang on, fire up a browser and go to a 'thin' VM like Silverlight. You'll be schocked how slow and CPU hungry it is.
Two, kick of a database app for any performance, yes managed vs native db.

It's usually the algorithm not the language that determines the performance ballpark that you will end up in.
Within that ballpark, optimising compilers can usually produce better code than most assembly coders.
Premature optimisation is the root of all evil
This may be the "common knowledge" that everyone can parrot, but I submit that's probably because it's correct. I await concrete evidence to the contrary.

D can sometimes be faster than C++ in practical applications, largely because the presence of garbage collection helps avoid the overhead of RAII and reference counting when using smart pointers. For programs that allocate large amounts of small objects with non-trivial lifecycles, garbage collection can be faster than C++-style memory management. Also, D's builtin arrays allow the compiler to perform better optimizations in some cases than C++'s STL vector, which the compiler doesn't understand. Furthermore, D2 supports immutable data and pure function annotations, which recent versions of DMD2 optimize based on. Walter Bright, D's creator, wrote a JavaScript interpreter in both D and C++, and according to him, the D version is faster.

C# is much faster than C++ - in C# I can write an XML parser and data processor in a tenth the time it takes me to write it C++.
Oh, did you mean execution speed?
Even then, if you take the time from the first line of code written to the end of the first execution of the code, C# is still probably faster than C++.
This is a very interesting article about converting a C++ program to C# and the effort required to make the C++ faster than the C#.
So, if you take development speed into account, almost anything beats C++.
OK, to address tht OP's runtime only performance requirement: It's not the langauge, it's the implementation of the language that determines the runtime performance. I could write a C++ compiler that produces the slowest code imaginable, but it's still C++. It is also theoretically possible to write a compiler for Java that targets IA32 instructions rather than the Java VM byte codes, giving a runtime speed boost.
The performance of your code will depend on the fit between the strengths of the language and the requirements of the code. For example, a program that does lots of memory allocation / deallocation will perform badly in a naive C++ program (i.e. use the default memory allocator) since the C++ memory allocation strategy is too generalised, whereas C#'s GC based allocator can perform better (as the above link shows). String manipulation is slow in C++ but quick in languages like php, perl, etc.

It all depends on the compiler, take for example the Stalin Scheme compiler, it beats almost all languages in the Debian micro benchmark suite, but do they mention anything about compile times?
No, I suspect (I have not used Stalin before) compiling for benchmarks (iow all optimizations at maximum effort levels) takes a jolly long time for anything but the smallest pieces of code.

if the code is not written for performance then C# is faster than C++.
A necessary disclaimer: All benchmarks are evil.
Here's benchmarks that in favour of C++.
The above two links show that we can find cases where C++ is faster than C# and vice versa.

Performance of a compiled language is a useless concept: What's important is the quality of the compiler, ie what optimizations it is able to apply. For example, often - but not always - the Intel C++ compiler produces better performing code than g++. So how do you measure the performance of C++?
Where language semantics come in is how easy it is for the programmer to get the compiler to create optimal output. For example, it's often easier to parallelize Fortran code than C code, which is why Fortran is still heavily used for high-performance computation (eg climate simulations).
As the question and some of the answers mentioned assembler: the same is true here, it's just another compiled language and thus not inherently 'faster'. The difference between assembler and other languages is that the programmer - who ideally has absolute knowledge about the program - is responsible for all of the optimizations instead of delegating some of them to the 'dumb' compiler.
Eg function calls in assembler may use registers to pass arguments and don't need to create unnecessary stack frames, but a good compiler can do this as well (think inlining or fastcall). The downside of using assembler is that better performing algorithms are harder to implement (think linear search vs. binary seach, hashtable lookup, ...).

Doing much better than C++ is mostly going to be about making the compiler understand what the programmer means. An example of this might be an instance where a compiler of any language infers that a region of code is independent of its inputs and just computes the result value at compile time.
Another example of this is how C# produces some very high performance code simply because the compiler knows what particular incantations 'mean' and can cleverly use the implementation that produces the highest performance, where a transliteration of the same program into C++ results in needless alloc/delete cycles (hidden by templates) because the compiler is handling the general case instead of the particular case this piece of code is giving.
A final example might be in the Brook/Cuda adaptations of C designed for exotic hardware that isn't so exotic anymore. The language supports the exact primitives (kernel functions) that map to the non von-neuman hardware being compiled for.

Is that why you are using a managed browser? Because it is faster. Or managed OS because it is faster. Nah, hang on, it is the SQL database.. Wait, it must be the game you are playing. Stop, there must be a piece of numerical code Java adn Csharp frankly are useless with. BTW, you have to check what your VM is written it to slag the root language and say it is slow.
What a misconecption, but hey show me a fast managed app so we can all have a laugh. VS? OpenOffice?

Ahh... The good old question - which compiler makes faster code?
It only matters in code that actually spends much time at the bottom of the call stack, i.e. hot spots that don't contain function calls, such as matrix inversion, etc.
(Implied by 1) It only matters in code the compiler actually sees. If your program counter spends all its time in 3rd-party libraries you don't build, it doesn't matter.
In code where it does matter, it all comes down to which compiler makes better ASM, and that's largely a function of how smartly or stupidly the source code is written.
With all these variables, it's hard to distinguish between good compilers.
However, as was said, if you've got a lot of Fortran code to compile, don't re-write it.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js