Cross-platform floating point consistency - C++

I'm developing a cross-platform game which plays over a network using a lockstep model. As a brief overview, this means that only inputs are communicated, and all game logic is simulated on each client's computer. Therefore, consistency and determinism are very important.
I'm compiling the Windows version on MinGW32, which uses GCC 4.8.1, and on Linux I'm compiling using GCC 4.8.2.
What struck me recently was that, when my Linux version connected to my Windows version, the program would diverge, or de-sync, instantly, even though the same code was compiled on both machines! It turns out the problem was that the Linux build was being compiled as 64-bit, whereas the Windows version was 32-bit.
After compiling a 32-bit Linux version, I was relieved to find that the problem was resolved. However, it got me thinking and researching about floating point determinism.
This is what I've gathered:
A program will be generally consistent if it's:
run on the same architecture
compiled using the same compiler
So if I assume, targeting a PC market, that everyone has an x86 processor, then that solves requirement one. However, the second requirement seems a little silly.
MinGW, GCC, and Clang (for Windows, Linux, and Mac, respectively) are all different compilers that are based on, or compatible with, GCC. Does this mean it's impossible to achieve cross-platform determinism, or does it only apply to Visual C++ vs. GCC?
As well, do optimization flags such as -O1 or -O2 affect this determinism? Would it be safer to leave them off?
In the end, I have three questions to ask:
1) Is cross-platform determinism possible when using MinGW, GCC, and Clang for compilers?
2) What flags should be set across these compilers to ensure the most consistency between operating systems / CPUs?
3) Floating point accuracy isn't that important to me -- what's important is that the results are consistent. Is there any method of reducing floating point numbers to a lower precision (like 3-4 decimal places) to ensure that the little rounding errors across systems become non-existent? (Every implementation I've tried to write so far has failed.)
Edit: I've done some cross-platform experiments.
Using floating point values for velocity and position, I kept a Linux Intel laptop and a Windows AMD desktop computer in sync for up to 15 decimal places of the float values. Both systems are, however, x86_64. The test was simple, though -- it was just moving entities around over a network, trying to spot any visible error.
Would it make sense to assume that the same results would hold if an x86 computer were to connect to an x86_64 computer? (32-bit vs. 64-bit operating system)

Cross-platform and cross-compiler consistency is of course possible. Anything is possible given enough knowledge and time! But it might be very hard, or very time-consuming, or indeed impractical.
Here are the problems I can foresee, in no particular order:
Remember that even an extremely small error of plus-or-minus 1/10^15 can blow up to become significant (multiply a number carrying that error margin by one billion, and you now have a plus-or-minus 0.000001 error, which might be significant). These errors can accumulate over time, over many frames, until you have a desynchronized simulation. Or they can manifest when you compare values (even naively using "epsilons" in floating-point comparisons might not help; it may only displace or delay the manifestation).
The above problem is not unique to distributed deterministic simulations (like yours). They touch on the issue of "numerical stability", which is a difficult and often neglected subject.
Different compiler optimization switches, and different floating-point behavior switches, might lead the compiler to generate slightly different sequences of CPU instructions for the same statements. Obviously these must be the same across compilations, using the exact same compilers, or the generated code must be rigorously compared and verified.
32-bit and 64-bit programs (note: I'm saying programs and not CPUs) will probably exhibit slightly different floating-point behaviors. By default, 32-bit programs cannot rely on anything more advanced than the x87 instruction set from the CPU (no SSE, SSE2, AVX, etc.) unless you specify this on the compiler command line (or use intrinsics/inline assembly instructions in your code). On the other hand, a 64-bit program is guaranteed to run on a CPU with SSE2 support, so the compiler will use those instructions by default (again, unless overridden by the user). While x87 and SSE2 float datatypes and operations on them are similar, they are - AFAIK - not identical, which will lead to inconsistencies in the simulation if one program uses one instruction set and another program uses the other.
The x87 instruction set includes a "control word" register, which contains flags that control some aspects of floating-point operations (e.g. exact rounding behavior, etc.). This is a runtime thing: your program can do one set of calculations, then change this register, and after that do the exact same calculations and get a different result. Obviously, this register must be checked and handled and kept identical on the different machines. It is possible for the compiler (or the libraries you use in your program) to generate code that changes these flags at runtime inconsistently across the programs.
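(For what it's worth, the rounding-mode part of that state can at least be pinned from standard C++ via <cfenv>; a minimal sketch follows. Note that the x87 precision-control bits are not exposed by the standard library, so this is only a partial safeguard.)

#include <cfenv>
#include <cstdio>

// Minimal sketch: set the rounding mode explicitly at startup on every client,
// so the floating-point environment at least starts out identical everywhere.
void pin_fp_environment()
{
    std::fesetround(FE_TONEAREST);                 // round-to-nearest, the usual default
    if (std::fegetround() != FE_TONEAREST)
        std::fprintf(stderr, "warning: could not set rounding mode\n");
}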
Again, in the case of the x87 instruction set, Intel and AMD have historically implemented things a little differently. For example, one vendor's CPU might internally do some calculations using more bits (and therefore arrive at a more accurate result) than the other, which means that if you happen to run on two different CPUs (both x86) from two different vendors, the results of simple calculations might not be the same. I don't know how and under what circumstances these higher-accuracy calculations are enabled, and whether they happen under normal operating conditions or you have to ask for them specifically, but I do know these discrepancies exist.
Random numbers, and generating them consistently and deterministically across programs, have nothing to do with floating-point consistency. They're important and a source of many bugs, but in the end they're just a few more bits of state that you have to keep synced.
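(As an illustration of that last point - not taken from the question's code - a deterministic, integer-only generator whose whole state is one 64-bit word is easy to synchronize or hash along with the rest of the game state; this is an xorshift64* sketch, and the seed constant is illustrative.)

#include <cstdint>

// Deterministic, integer-only PRNG (xorshift64*).  Its entire state is one
// 64-bit word, so it is trivial to serialize, compare or hash alongside the
// rest of the simulation state.
struct DeterministicRng
{
    std::uint64_t state = 0x9E3779B97F4A7C15ull;   // must be non-zero, agreed on by all clients

    std::uint64_t next()
    {
        state ^= state >> 12;
        state ^= state << 25;
        state ^= state >> 27;
        return state * 0x2545F4914F6CDD1Dull;
    }

    // A die roll in [0, sides) using only integer math -- no floating point involved.
    std::uint32_t roll(std::uint32_t sides) { return static_cast<std::uint32_t>(next() % sides); }
};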
And here are a couple of techniques that might help:
Some projects use "fixed-point" numbers and fixed-point arithmetic to avoid rounding errors and general unpredictability of floating-point numbers. Read the Wikipedia article for more information and external links.
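(A minimal 16.16 fixed-point sketch, just to show the shape of the idea - not a production-ready library:)

#include <cstdint>

// 16.16 fixed-point sketch: every operation is exact integer arithmetic, so all
// compilers and CPUs produce bit-identical results.
struct Fixed
{
    std::int32_t raw = 0;                              // stored value times 2^16
    static constexpr int FRACTION_BITS = 16;

    static Fixed from_int(std::int32_t v) { return Fixed{ v * (1 << FRACTION_BITS) }; }

    Fixed operator+(Fixed o) const { return Fixed{ raw + o.raw }; }
    Fixed operator-(Fixed o) const { return Fixed{ raw - o.raw }; }

    Fixed operator*(Fixed o) const
    {
        // Widen to 64 bits so the intermediate product cannot overflow.
        return Fixed{ static_cast<std::int32_t>(
            (static_cast<std::int64_t>(raw) * o.raw) >> FRACTION_BITS) };
    }

    double to_double() const { return raw / 65536.0; } // for display only, never for game logic
};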
In one of my own projects, during development, I used to hash all the relevant state (including a lot of floating-point numbers) in all the instances of the game and send the hash across the network each frame to make sure even one bit of that state wasn't different on different machines. This also helped with debugging, where instead of trusting my eyes to see when and where inconsistencies existed (which wouldn't tell me where they originated, anyways) I would know the instant some part of the state of the game on one machine started diverging from the others, and know exactly what it was (if the hash check failed, I would stop the simulation and start comparing the whole state.)
This feature was implemented in that codebase from the beginning, and was used only during the development process to help with debugging (because it had performance and memory costs.)
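(For the curious, a stripped-down sketch of that idea - FNV-1a is used here purely because it is short, and the real code hashed far more than a vector of floats:)

#include <cstdint>
#include <cstring>
#include <vector>

// Hash the exact bit patterns of the relevant floats each frame and exchange
// only the 64-bit digest; a mismatch pinpoints the first divergent frame.
std::uint64_t hash_state(const std::vector<float>& state)
{
    std::uint64_t h = 1469598103934665603ull;          // FNV-1a offset basis
    for (float f : state)
    {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);           // hash the exact bits, not a rounded value
        for (int i = 0; i < 4; ++i)
        {
            h ^= (bits >> (8 * i)) & 0xFFu;
            h *= 1099511628211ull;                      // FNV-1a prime
        }
    }
    return h;
}

// Each frame: send {frame_number, hash_state(...)} to the peers; if a peer's
// hash for the same frame differs, stop the simulation and dump the full state.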
Update (in answer to first comment below): As I said in point 1, and others have said in other answers, that doesn't guarantee anything. If you do that, you might decrease the probability and frequency of an inconsistency occurring, but the likelihood doesn't become zero. If you don't analyze what's happening in your code and the possible sources of problems carefully and systematically, it is still possible to run into errors no matter how much you "round off" your numbers.
For example, if you have two numbers (e.g. as results of two calculations that were supposed to produce identical results) that are 1.111499999 and 1.111500001 and you round them to three decimal places, they become 1.111 and 1.112 respectively. The original numbers' difference was only 2E-9, but it has now become 1E-3. In fact, you have increased your error 500'000 times. And still they are not equal even with the rounding. You've exacerbated the problem.
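(A few lines of code make the point concrete; the values are the illustrative ones above, not taken from any real simulation:)

#include <cmath>
#include <cstdio>

int main()
{
    double a = 1.111499999;                        // two results that "should" be equal
    double b = 1.111500001;

    double ra = std::round(a * 1000.0) / 1000.0;   // 1.111
    double rb = std::round(b * 1000.0) / 1000.0;   // 1.112

    std::printf("before rounding: difference = %g\n", b - a);    // about 2e-9
    std::printf("after  rounding: difference = %g\n", rb - ra);  // 0.001
}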
True, this doesn't happen much, and the examples I gave are two unlucky numbers to get in this situation, but it is still possible to find yourself with these kinds of numbers. And when you do, you're in trouble. The only sure-fire solution, even if you use fixed-point arithmetic or whatever, is to do rigorous and systematic mathematical analysis of all your possible problem areas and prove that they will remain consistent across programs.
Short of that, for us mere mortals, you need to have a water-tight way to monitor the situation and find exactly when and how the slightest discrepancies occur, to be able to solve the problem after the fact (instead of relying on your eyes to see problems in game animation or object movement or physical behavior.)

No, not in practice. For example, sin() might come from a library or from a compiler intrinsic, and differ in rounding. Sure, that's only one bit, but that's already out of sync. And that one bit error may add up over time, so even an imprecise comparison may not be sufficient.
N/A
You can't reduce FP precision for a given type, and I don't even see how it would help you. You'd turn the occasional 1E-6 difference into an occasional 1E-4 difference.

Aside from your concerns about determinism, I have another remark: if you are worried about calculation consistency on a distributed system, you may have a design issue.
You could think about your application as a bunch of nodes, each responsible for its own calculations. If information about another node is needed, it should be sent to you by that node.

1.)
In principle, cross-platform/OS/hardware compatibility is possible, but in practice it's a pain.
In general your results will depend on which OS you use, which compiler, and which hardware you use. Change any one of those and your results might change. You have to test all changes.
I use Qt Creator and qmake (cmake is probably better but qmake works for me) and test my code in MSVC on Windows, GCC on Linux, and MinGW-w64 on Windows. I test both 32-bit and 64-bit. This has to be done whenever code changes.
2.) and 3.)
In terms of floating point, some compilers will use x87 instead of SSE in 32-bit mode. See, as an example of the consequences when that happens, Why does a number crunching program start running much slower when it diverges into NaNs? All 64-bit systems have SSE, so I think most use SSE/AVX in 64-bit mode; otherwise, e.g. in 32-bit mode, you might need to force SSE with something like -mfpmath=sse and -msse2.
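(One way to avoid accidentally shipping a mixed x87/SSE build is a compile-time check; this is only a sketch using predefined macros of GCC/Clang and MSVC, to be adapted to the compilers you actually use.)

// Fail the build early if floating-point code generation is not SSE2-based,
// instead of discovering x87-vs-SSE divergence at runtime.
#if defined(__GNUC__) && !defined(__SSE2__)
#  error "Build with -msse2 -mfpmath=sse so all clients use the same FP instructions"
#endif

#if defined(_MSC_VER) && !defined(_M_X64) && (!defined(_M_IX86_FP) || _M_IX86_FP < 2)
#  error "Build with /arch:SSE2 so all clients use the same FP instructions"
#endif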
But if you want a more compatible version of GCC on Windows, then I would use MinGW-w64 for 32-bit (aka MinGW-w32) or MinGW-w64 for 64-bit. This is not the same thing as MinGW (aka mingw32); the projects have diverged. MinGW depends on MSVCRT (the MSVC C runtime library) and MinGW-w64 does not. The Qt project has a pretty good description of MinGW-w64 and its installation: http://qt-project.org/wiki/MinGW-64-bit
You might also want to consider writing a CPU dispatcher (see "CPU dispatcher for Visual Studio for AVX and SSE").
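(On GCC/Clang, a bare-bones runtime dispatcher can be as simple as the sketch below. __builtin_cpu_supports is a real GCC/Clang builtin; the three do_work_* functions are hypothetical stand-ins for your own per-ISA implementations built in separate translation units.)

// Pick an implementation once at startup, based on what the CPU supports.
void do_work_avx();     // translation unit built with -mavx
void do_work_sse2();    // translation unit built with -msse2
void do_work_scalar();  // lowest common denominator

void (*do_work)() = nullptr;

void init_dispatch()
{
    if (__builtin_cpu_supports("avx"))
        do_work = do_work_avx;
    else if (__builtin_cpu_supports("sse2"))
        do_work = do_work_sse2;
    else
        do_work = do_work_scalar;
}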

Related

Will a statically linked C++ binary work on every system with the same architecture?

I'm making a very simple program with C++ for Linux usage, and I'd like to know if it is possible to make just one big binary containing all the dependencies that would work on any Linux system.
If my understanding is correct, any compiler turns source code into machine instructions, but since there are often common parts of code that can be reused by different programs, most programs depend on other libraries.
However, if I have the source code for all my dependencies, should I be able to compile a binary in a way that would not require anything from the system? And will I be able to run something compiled on a 64-bit system on a 32-bit system?
In short: Maybe.
The longer answer is:
It depends. You can't, for example, run a 64-bit binary on a 32-bit system; that's just not even nearly possible. Yes, it's the same processor family, but there are twice as many registers in 64-bit mode, and those registers are twice as wide. What is the 32-bit processor going to "give back" for the values of those bits and registers that don't exist in its hardware? It just plain won't work. Some of the instructions also completely change meaning, so the system really needs to be "right" for the compiled code, or it won't work - fortunately, Linux will check this and simply refuse if it's not right.
You can BUILD a 32-bit binary on a 64-bit system (assuming you have all the right libraries, etc, installed for both 64- and 32-bit, etc).
Similarly, if you try to run ARM code on an x86 processor, or MIPS code on an ARM processor, it simply has no chance of working, because the actual instructions are completely different (or they would be in breach of some patent/copyright or similar, because processor instruction sets contain portions that are "protected intellectual property" in nearly all cases - so designers have to make sure they do NOT do "the same as someone else's design"). Like for 32-bit and 64-bit, you simply won't get a chance to run the wrong binary here, it just won't work.
Sometimes, there are subtle differences, for example ARM code can be compiled with "hard" or "soft" floating point. If the code is compiled for hard float, and there isn't the right support in the OS, then it won't run the binary. Worse yet, if you compile on x86 for SSE instructions, and try to run on a non-SSE processor, the code will simply crash [unless you specifically build code to "check for SSE, and display error if not present"].
So, even if you have a binary that passes the above criteria, the Linux system tends to change a tiny bit between releases, and different distributions have subtle "fixes" that change things. Most of the time, these are completely benign (they fix some obscure corner case that someone found during testing, but the general, non-corner-case behaviour is "normal"). However, if you go from Linux version 2.2 to Linux version 3.15, there will be some substantial differences between the two versions, and the binary from the old one may very well be incompatible with the newer (and almost certainly the other way around) - it's hard to know exactly which versions are and aren't compatible. Within releases that are close, it should work OK as long as you are not specifically relying on some feature that is present in only one (after all, new things ARE added to the Linux kernel from time to time). Here the answer is "maybe".
Note that the above also includes your implementation of the C and C++ runtime, so if you have a "new" C or C++ runtime library that uses Linux kernel feature X, and try to run it on an older kernel from before feature X was implemented (or before it worked correctly for the case the C or C++ runtime is trying to use it for), it won't work.
Static linking is indeed a good way to REDUCE the dependency on different releases. It is also a good way to make your binary huge, which may discourage people from downloading it.
Making the code open source is a much better way to solve this problem: you just distribute your source code and a list of "minimum requirements", and let other people deal with recompiling it.
In practice, it depends on "sufficiently simple". If you're using C++11, you'll quickly find that the C++11 libraries have dependencies on modern libc releases. In turn, those only ship with modern Linux distributions. I'm not aware of any "Long Term Support" Linux distribution which today (June 2014) ships with libc support for GCC 4.8.
The short answer is no, at least not without serious hacks.
Different Linux distributions may have different glue code between user space and the kernel. For instance, a hello world seemingly without dependencies built on Ubuntu cannot be executed under CentOS.
EDIT: Thanks for the comment. I re-verified this and the cause was that I was using a 32-bit VM. Sorry for causing confusion. However, as noted above, the rule of thumb is that even the same Linux distribution may sometimes break compatibility in order to deploy a bug fix, so the conclusion stands.

Filesize difference when cross compiling

I am writing a small application in C++ that runs on my host machine (Linux x86) and on a target machine (ARM).
The problem I have is that on the host machine my binary is about 700 KB in size, but on the target machine it is about 7 MB.
I am using the same compile switches for both platforms. My first thought was that a library on the target machine got linked statically, but I checked both binaries with objdump and both use the same dynamically linked libraries.
So can anyone give me a hint how I can figure out why there is such a huge difference in size?
While different computer architectures can theoretically require completely different amounts of executable code for the same program, a factor of 10 is not really expected among modern architectures. ARM and x86 may be different, but they are still designed in the same universe where memory and bandwidth is not something to waste, leading CPU designers to try to keep the executable code as tight as possible.
I would, therefore, look at the following possibilities, in order of probability:
Symbol stripping: if one of the two binaries has been stripped of its symbols, then it would be significantly smaller, especially if compiled with debugging information. You might want to try stripping both binaries and see what happens.
Static linking: I have occasionally encountered build systems for embedded targets that would prefer static linking over using shared libraries. Examining the library dependencies of each binary would probably detect this.
Additional enabled code: the larger binary may have additional code enabled because e.g. the build system found an additional optional library, or because the target platform requires specific handling.
Still, a factor of 10 is probably too much for this, unless the smaller binary is missing a lot of functionality or the larger one has linked in some optional library statically.
Different compiler configuration: you should not only look at the compiler options that you supply, but also at the defaults the compiler uses for each target. For example, if the compiler has significantly higher inlining or loop-unrolling limits on one architecture, the resulting executable could balloon noticeably.
First, there is no reason to expect the same code compiled for different architectures to have any kind of size relationship to each other. You can easily have A be larger than B, then change one line of code, and then B is larger than A.
Second the "binaries" you are talking about are I am guessing elf, which is a little bit of binary and some to a lot of overhead. The overhead can vary between architectures and other such things.
Bottom line if you are compiling the same code for two architectures/platforms or with different compilers or compile options for the same architecture there is no reason to expect the file sizes to have any relationship to each other.
Different architectures can have completely different ways to handle the same thing. For example loading immediate value on CISC (e.g. x86) architecture is usually one instruction, while on RISC (e.g. ppc, arm) it usually is more than one instruction, the actual number needed being dependent on the value. For example if the instruction set only allows 16bit immediate values, you may need up to 7 instructions to load a 64bit value (loading by 16bits and shifting in between the loads). Hence the code is inherently different.
One reason not mentioned so far, but relevant to ARM/x86 comparisons, is floating point emulation. All x86 chips today come with native FP support (and x86-64 even with SIMD FP support via SSE), but ARM CPUs often lack an FP unit. That in turn means even a simple FP addition has to be turned into a long sequence of integer operations on exponents and mantissas.

Fortran: differences between generated code compiled using two different compilers

I have to work on a Fortran program which used to be compiled using Microsoft Compaq Visual Fortran 6.6. I would prefer to work with gfortran, but I have run into lots of problems.
The main problem is that the generated binaries have different behaviours. My program takes an input file and then has to generate an output file. But sometimes, when using the binary compiled by gfortran, it crashes before its end, or gives different numerical results.
This is a program written by researchers which uses a lot of floating point numbers.
So my question is: what are the differences between these two compilers which could lead to this kind of problem?
edit:
My program computes the values of some parameters and there are numerous iterations. At the beginning, everything goes well. After several iterations, some NaN values appear (only when compiled by gfortran).
edit:
Thank you everybody for your answers.
So I used the Intel compiler, which helped me by giving some useful error messages.
The origin of my problems is that some variables are not initialized properly. It looks like when compiling with Compaq Visual Fortran these variables automatically take 0 as a value, whereas with gfortran (and Intel) they take arbitrary values, which explains some numerical differences which add up over the following iterations.
So now the solution is a better understanding of the program to correct these missing initializations.
There can be several reasons for such behaviour.
What I would do is:
Switch off any optimization
Switch on all debug options. If you have access to e.g. intel compiler, use ifort -CB -CU -debug -traceback. If you have to stick to gfortran, use valgrind, its output is somewhat less human-readable, but it's often better than nothing.
Make sure there are no implicitly typed variables; use implicit none in all the modules and all the code blocks.
Use consistent float types. I personally always use real*8 as the only float type in my codes. If you are using external libraries, you might need to change call signatures for some routines (e.g., BLAS has different routine names for single and double precision variables).
If you are lucky, it's just that some variable doesn't get initialized properly, and you'll catch it with one of these techniques. Otherwise, as M.S.B. was suggesting, a deeper understanding of what the program really does is necessary. And, yes, it might be necessary to just check the algorithm manually starting from the point where you say 'some NaN values appear'.
Different compilers can emit different instructions for the same source code. If a numerical calculation is on the boundary of working, one set of instructions might work, and another not. Most compilers have options to use more conservative floating point arithmetic, versus optimizations for speed -- I suggest checking the compiler options that you are using for the available options. More fundamentally this problem -- particularly that the compilers agree for several iterations but then diverge -- may be a sign that the numerical approach of the program is borderline. A simplistic solution is to increase the precision of the calculations, e.g., from single to double. Perhaps also tweak parameters, such as a step size or similar parameter. Better would be to gain a deeper understanding of the algorithm and possibly make a more fundamental change.
I don't know about the crash, but some differences in the results of numerical code on an Intel machine can be due to one compiler using 80-bit doubles and the other 64-bit doubles, even if not for variables then perhaps for temporary values. Moreover, floating-point computation is sensitive to the order in which elementary operations are performed. Different compilers may generate different sequences of operations.
Differences in different type implementations, differences in various non-Standard vendor extensions, could be a lot of things.
Here are just some of the language features that differ (look at gfortran and Intel). Programs written to the Fortran standard work the same on every compiler, but a lot of people don't know which are standard language features and which are language extensions, and so use them; when compiled with a different compiler, troubles arise.
If you post the code somewhere I could take a quick look at it; otherwise, like this, 'tis hard to say for certain.

How is Assembly used in the modern day (with C/C++ for example)?

I understand how a computer works on basic principles, such as how a program can be written in a "high-level" language like C# or C, and then broken down into object code and then binary for the processor to understand. However, I really want to learn about assembly, and how it's used in modern day applications.
I know processors have different instruction sets above the basic x86 instruction set. Do all assembly languages support all instruction sets?
How many assembly languages are there? How many work well with other languages?
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
How would someone then reference the functions/routines within that assembly code from a language like C or C++?
How do we know the code we've written in assembly is the fastest it possibly can be?
Are there any recommended books on assembly languages/using them with modern programs?
Sorry for the quantity of questions, I do hope they're general enough to be useful for other people as well as simple enough for others to answer!
However, I really want to learn about assembly, and how it's used in modern day applications.
On "normal" PCs it's used just for time-critical processing, I'd say that realtime multimedia processing can still benefit quite a bit from hand-forged assembly. On embedded systems, where there's a lot less horsepower, it may have more areas of use.
However, keep in mind that it's not just "hey, this code is slow, I'll rewrite it in assembly and by magic it will go fast": it must be carefully written assembly, written knowing what is fast and what is slow on your specific architecture, and keeping in mind all the intricacies of modern processors (branch mispredictions, out-of-order execution, ...). Often, the assembly written by a beginner-to-intermediate assembly programmer will be slower than the final machine code generated by a good, modern optimizing compiler. Performance tuning on x86 is often really complicated, and should be left to people who know what they're doing, and most of them are compiler writers. :) Have a look, for example, at C++ code for testing the Collatz conjecture faster than hand-written assembly - why?, which gets into some of the specific x86 details for that case which you have to understand to match or beat a compiler with optimization enabled, for a single small loop.
I know processors have different instruction sets above the basic x86 instruction set. Do all assembly languages support all instruction sets?
I think you're confusing some things here. Many (= all modern) x86 processors support additional instructions and instruction sets that were introduced after the original x86 instruction set was defined. Actually, almost all x86 software now is compiled to exploit post-Pentium features like cmovcc; you can query the processor to see if it supports some feature using the CPUID instruction. Obviously, if you want to use a mnemonic for some newer instruction, your assembler (i.e. the software which translates mnemonics into actual machine code) must be aware of it.
Most C compilers have intrinsics like _mm_popcnt_u32 and/or command line options like -mpopcnt to enable them that let you take advantage of new instructions without hand-written asm. x86 -mbmi / -mbmi2 extensions have several instructions that compilers know how to use when optimizing ordinary C like x << y (shlx instead of the more clunky shl) or x &= x-1; (blsr / _blsr_u32()). GCC has a -march=native option to enable all the instruction sets your CPU supports, and to set the -mtune= option to optimize for your CPU in terms of how much loop unrolling is a good idea, or which instructions or sequences are faster on one CPU, slower on another.
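As a small illustration of that (the function name here is my own, but _mm_popcnt_u32 and the __POPCNT__ macro are the real GCC/Clang spellings):

#include <cstdint>
#include <cstdio>

#ifdef __POPCNT__
#  include <nmmintrin.h>                               // _mm_popcnt_u32
#endif

// Ordinary C++ source that compiles to a single POPCNT instruction when the
// build enables it (-mpopcnt or a suitable -march=), with a portable fallback.
unsigned popcount32(std::uint32_t x)
{
#ifdef __POPCNT__
    return static_cast<unsigned>(_mm_popcnt_u32(x));
#else
    unsigned n = 0;
    for (; x; x &= x - 1) ++n;                         // clear the lowest set bit each pass
    return n;
#endif
}

int main() { std::printf("%u\n", popcount32(0xF0F0u)); }   // prints 8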
If, instead, you're talking about other (non-x86) instruction sets for other families of processors, well, each assembler should support the instructions that the target processor can run. Not all the instructions of one assembly language have a direct replacement in others, and in general porting assembly code from one architecture to another is usually hard and tedious work.
How many assembly languages are there?
Theoretically, at least one dialect for each processor family. Keep in mind that there are also different notations for the same assembly language; for example, the following two instructions are the same x86 stuff written in AT&T and Intel notation:
mov $4, %eax    # AT&T notation
mov eax, 4      ; Intel notation
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
If you want to embed a routine in an application written in another language, you should use the tools that the language provides you, in C/C++ you'd use the asm blocks.
You can instead make stand-alone .s or .asm files using the same syntax a C compiler would output, for example gcc -O3 -S will compile to a .s file that you can assemble with gcc -c. Separate files are a good idea if you want to write whole functions in asm instead of wrapping one or a couple instructions. A few open source projects like x264 and x265 (video encoders) have extensive amounts of NASM source code for different versions of functions for different versions of SSE or AVX available.
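(For a feel of what an asm block looks like, here is a minimal GCC/Clang extended-asm sketch wrapping a single x86 instruction; the constraint string tells the compiler which registers the snippet reads and writes:)

#include <cstdint>
#include <cstdio>

// Reverse the byte order of a 32-bit value with the x86 BSWAP instruction.
std::uint32_t byte_swap(std::uint32_t v)
{
    asm("bswap %0"       // AT&T syntax; %0 is replaced by the register holding v
        : "+r"(v));      // "+r": v is both input and output, in any general register
    return v;
}

int main() { std::printf("%08x\n", byte_swap(0x12345678u)); }   // prints 78563412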
If you, instead, wanted to write a whole application in assembly, you'd have to write just in assembly, following the syntactic rules of the assembler you'd like to use.
How do we know the code we've written in assembly is the fastest it possibly can be?
In theory, because it is the nearest to the bare metal, you can make the machine do exactly what you want, without having the compiler account for language features that in some specific case do not matter. In practice, since the machine is often much more complicated than what the assembly language exposes, as I said, assembly language will often be slower than compiler-generated machine code, which takes into account many subtleties that the average programmer does not know about.
Addendum
I was forgetting: knowing how to read assembly, at least a little bit, can be very useful in debugging strange issues that come up when the optimizer is broken, or only appear in the release build, or when you have to deal with heisenbugs, or when source-level debugging is not available, or other situations like that; have a look at the comments here.
Intel and the x86 are big on backward compatibility, which certainly helped them out but at the same time hurts greatly. The internals of the 8088/8086 to 286 to 386 to 486, Pentium, Pentium Pro, etc. up to the present are somewhat of a redesign each time: early on, adding protection mechanisms so operating systems could protect apps from each other and the kernel, and then moving into performance by adding execution units, superscalar execution and all that comes with it, multi-core processors, etc. What used to be a real, single AX register in the original processor turns into who knows how many different things in a modern processor. Originally your program was executed in the order written; today it is diced and sliced and executed in parallel in such a way that the intent of the instructions as presented is honored, but the execution can be out of order and in parallel. Lots and lots of new tricks are buried behind what on the surface appears to be a very old instruction set.
The instruction set changed from its 8/16-bit roots to 32-bit, to 64-bit, so the assembly language had to change as well - adding EAX to AX, AH, and AL, for example. Occasionally other instructions were added. But the original load, store, add, subtract, and, or, etc. instructions are all there. I have not done x86 in a long time and was shocked to see that the syntax has changed and/or a particular assembler messed up the x86 syntax. There are a zillion tools out there, so if one doesn't match the book or web page you are using, there is one out there that will.
So thinking in terms of assembly language for this family is right and wrong; the assembly language may have changed syntax and is not necessarily backward compatible, but the instruction set or machine language or other similar terms (the opcodes/bits the assembly represents) would say that much of the original instruction set is still supported on modern x86 processors. 286-specific nuances may not work perhaps, as with other new features of specific generations, but the core instructions - load, store, add, subtract, push, pop, etc. - all still work and will continue to work. I feel it is better to "drive down the center of the lane": don't get into chip- or tool-specific gee-whiz features; use the basic, boring, been-working-since-the-beginning-of-time syntax of the language.
Because each generation in the family is trying for certain features, usually performance, the way individual instructions are handed out to the various execution units changes... on each generation... So hand-tuning assembler for performance, trying to outdo a compiler, can be difficult at best. You need detailed knowledge about the specific processor you are tuning for. From the early x86 days to the present, unfortunately, what made the code execute faster on one chip would often cause the next generation to run extra slow. Perhaps that was a marketing tool in disguise, not sure: "Buy the hot new processor that costs twice as much as the one you have now, advertises twice the clock speed, but runs your same copy of Windows 30% slower. In a few years, when the next version of Windows is compiled (and this chip is obsolete), it will then double in performance." Another side effect of this is that at this point in time you cannot take one C program and create one binary that runs fast on all x86 processors; for performance you need to tune for the specific processor, meaning you need to at least tell the compiler to optimize and which family to optimize for. And for something like Windows or Office, or anything you are distributing as a binary, you likely cannot or do not want to somehow bury several differently tuned copies of the same program in one package or one binary... drive down the center of the road.
As a result of all the hardware improvements, it may be in your best interest not to try to tune the compiler output or hand assembler to any one chip in particular. On average the hardware improvements will compensate for the lack of compiler tuning, and the same program hopefully just runs a little faster each generation. One of the chip vendors used to aim to make today's popular compiled binaries run faster tomorrow; the other vendor improved the internals such that if you recompiled today's source for the new internals you could run faster tomorrow. Those activities between vendors have not necessarily continued: each generation runs today's binaries slower, but tomorrow's recompiled source runs the same speed or slower. It will run tomorrow's re-written programs faster, sometimes with the same compiler, sometimes needing tomorrow's compiler. Isn't this fun!
So how do we know a particular compiled or hand-assembled program is as fast as it possibly can be? We don't; in fact, for x86 you can guarantee it isn't: run it on one chip in the family and it is slow, run it on another and it may be blazing fast. x86 or not, other than very short programs or very deterministic programs like you would find on a microcontroller, you cannot definitively say this is the fastest possible solution. Caches, for example, are very hard (if even possible) to tune for, as is the memory behind them, particularly on a PC, where the user can choose various sizes, speeds, ranks, banks, etc. and adjust BIOS settings to change even more; you really cannot tell a compiler to tune for that. So even on the same computer, same processor, same compiled binary, you have the ability to turn some of the knobs and make that program run a lot faster or a lot slower. Change processor families, change chipsets, motherboards, etc. There is no possible way to tune for so many variables. The nature of the x86 PC business has become too chaotic.
Other chip families are not nearly as problematic. Some perhaps, but not all. So these are not general statements, but specific to the x86 chip family. The x86 family is the exception, not the rule. It is probably the last assembler/instruction set you would want to bother learning.
There are tons of websites and books on the subject; I cannot say one is better than another. I learned from the original set of 8088/86 books from Intel and then the 386 and 486 books, and didn't look for Intel books after that (or any other books). You will want an instruction set reference, and an assembler like NASM or GAS (the GNU assembler, part of binutils, which comes with most GCC-based compiler toolchains). As far as the C to/from assembler interface, you can, if nothing else, figure that out by experimenting: write a small C program with a few small C functions, disassemble or compile to assembler, and look at what registers and/or how the stack is used to pass parameters between functions. Keep your functions simple and use only a few parameters and your assembler will likely work just fine. If not, look at the assembler of the function calling your code and figure out where your parameters are. It is all well documented somewhere, and these days probably much better than in the old days. In the early 8088/86 days you had tiny, small, medium, large and huge compiler models, and the calling conventions could vary from one to the other, as well as from one compiler to the next: Watcom (formerly Zortech, and perhaps other names) passed by register; Borland and Microsoft passed on the stack and were pretty close if not the same. Now, with 32- and 64-bit flat memory spaces and standards, you can use one model and not have to memorize all the nuances (just one set of nuances). Inline assembly is an option, but varies from C compiler to C compiler, and getting it to work properly and effectively is more difficult than just writing assembler in its own file. GCC, and perhaps other compilers, will allow you to put the assembler file on the C compiler command line as if it were just another C file, and it will figure out what you have given it and pass it to the assembler for you - that is, if you don't want to call the assembler program yourself and put the object on the C compiler command line.
If nothing else, disassemble a lot of simple functions: add a few parameters and return them, etc. Change compiler optimization settings and see how that changes the instructions used, often dramatically. Even if you cannot write assembler from scratch, being able to read it is very valuable, both from a debugging and a performance perspective.
Not all compilers for all processors are good. GCC, for example, is one-size-fits-all, and just like a sock or ball cap, one size doesn't really fit anyone well. It does pretty well for most of its targets but is not really great for any. So it is quite possible to do better than the compiler with hand-tuned assembler, but on average, for lots of code, you are not going to win. That applies to most processors, which are more deterministic, not just the x86 family. It is not about fewer instructions; fewer instructions do not necessarily equate to faster. To outperform even an average compiler in the long run you have to understand the caches, fetch, decode, the execution state machines, the memory interfaces, the memories themselves, etc. With compiler optimizations turned off it is very easy to produce faster code than the compiler, so you should just use the optimizer, but also understand that that increases the risk of the compiler making a mistake. You need to know the tool very well, which goes back to disassembling often to understand how your C code and the compiler you use today interact with each other. No compiler is completely standards compliant, because the standards themselves are fuzzy, leaving some features of the language up to the discretion of the compiler (drive down the middle of the road and don't use those parts of the language).
Bottom line, from the nature of your questions, I would recommend writing a bunch of small functions or programs with some small functions, compiling to assembler or compiling to an object and disassembling to see what the compiler does. Be sure to use different optimization settings on each program. Gain a working reading knowledge of the instruction set (granted, the asm output of the compiler or disassembler has a lot of extra fluff that gets in the way of readability; you have to look past that, and you need almost none of it if you want to write assembler). Give yourself 5-20 years of studying and experimenting before you can expect to outperform the compiler on a regular basis, if that is your goal. By then you will have learned that, particularly with this chip family, it is a futile effort; you win a few but mostly lose. It would be to your benefit to compile (to assembler) the same code for other chip families like ARM and MIPS, get a general feel for what C code compiles well and what C code doesn't, and make your C programming better instead of trying to make the assembler better. Also try other compilers like LLVM. GCC has a lot of quirks that many think are part of the C language standard but are instead nuances or problems with the specific compiler. Being able to read and analyze the assembly output of the compilers and their options will provide this knowledge. So I recommend you work on a reading knowledge of the instruction set, without necessarily having to learn to write it from scratch.
You need to look upon it from the hardware's point of view, the assembly language is created with regard to what the CPU can do. Every time a new feature in a CPU is created an appropriate assembly instruction is created so that it can be used.
Assembly is thus very dependent on the CPU. High-level languages like C++ provide abstractions from this to allow us not to have to think about details like CPU instructions, and the compiler generates optimized assembly code.
EDIT:
How many assembly languages are there? How many work well with other languages?
As many as there are different types of CPU. The second question I didn't understand: assembly per se does not interact with any other language; the output, the machine code, does.
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
The principle is similar to writing in any other compiled language: you create a text file with the assembly instructions and use an assembler to compile it to machine code. Then you link it with any runtime libraries.
How would someone then reference the functions/routines within that assembly code from a language like C or C++?
C++ and C provide inline assembly, so there is no need to link, but if you want to link you need to create the assembly object following the same (or similar) calling conventions as the host language. For instance, some languages, when calling a function, push the arguments to the function onto the stack in a certain order, so you would have to do the same.
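(The C/C++ side then only needs a declaration. In the sketch below, the routine name asm_add and its separate assembly file are hypothetical; the point is the extern "C" declaration and the fact that the assembly side must follow the platform's C calling convention.)

#include <cstdio>

// Declared here, defined in a separate assembly file (e.g. asm_add.s / asm_add.asm)
// that follows the platform's C calling convention.
extern "C" int asm_add(int a, int b);

int main()
{
    std::printf("%d\n", asm_add(2, 3));   // the linker resolves asm_add like any C function
}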
How do we know the code we've written in assembly is the fastest it possibly can be?
Because it is closest to the actual hardware. When you are dealing with higher-level languages you don't know what the compiler will do with your for loop. However, more often than not, compilers do a better job of optimizing the code than a human can (of course, in very special circumstances you can probably get a better result).
There are many many different assembly languages out there. Usually there is at least one for every processor instruction set, which means one for every processor type. One thing that you should also keep in mind is that even for a single processor there may be several different assembler programs that may use a different syntax, which from a formal view constitutes a different language. (for x86 there are masm, nasm, yasm, AT&T (what *nix assemblers like the GNU assembler use by default), and probably many more)
For x86 there are lots of different instruction sets because there have been so many changes to the architecture over the years. Some of these changes could be viewed mostly as additional instructions, so they are a super set of the previous assembly. Other changes may actually remove instructions (none are coming to mind for x86, but I've heard of some on other processors). And other changes add modes of operation to processors that make things even more complicated.
There are also other processors with completely different instructions.
To learn assembly you will need to start by picking a target processor and an assembler that you want to use. I'm going to assume that you are going to use x86, so you would need to decide if you want to start with 16 bit segmented, 32 bit, or 64 bit. Many books and online tutorials go the 16 bit route where you write DOS programs. If you are wanting to write parts of C programs in assembly then you will probably want to go the 32 or 64 bit route.
Most of the assembly programming I do is inline in C to either optimize something, to make use of instructions that the compiler doesn't know about, or when I otherwise need to control the instructions used. Writing large amounts of code in assembly is difficult, so I let the C compiler do most of the work.
There are lots of places where assembly is still written by people. This is particularly common in embedded, boot loaders (bios, u-boot, ...), and operating system code, though many developers in these never directly write any assembly. This code may be start up code that has to run before the stack pointer is set to a usable value (or RAM isn't usable yet for some other reason), because they need to fit within small spaces, and/or because they need to talk to hardware in ways that aren't directly supported in C or other higher level languages. Other places where assembly is used in OSes is writing locks (spinlocks, critical sections, mutexes, and semaphores) and context switching (switching from one thread of execution to another).
Other places where assembly is commonly written is in the implementation of some library code. Functions like strcpy are often implemented in assembly for different architectures because there are often several ways that they may be optimized using processor specific operations, while a C implementation might use a more general loop. These functions are also reused so often that optimizing them by hand is often worth the effort in the long run.
Another, related, place where lots of assembly is written is within compilers. Compilers have to know how to implement things and many of them produce assembly, so they have assembly templates (or something similar) built into them for use in generating output code.
Even if you never write any assembly knowing the instructions and registers of your target system are often useful. They can aid in debugging, but they can also aid in writing code. Knowing the target processor can help you write better (smaller and/or faster) code for it (even in a higher level language), and being familiar with a few different processors will help you to write code that will be good for many processors because you will know generally how CPUs work.
We do a fair bit of it in our Real-Time work (more than we should really). A wee bit of assembly can also be quite useful when you are talking to hardware, and need specific machine instructions executed (eg: All writes must be 16-bit writes, or you'll hose nearby registers).
What I tend to see today is assembly insertions in higher-level language code. How exactly this is done depends on your language and sometimes compiler.
I know processors have different instruction sets above the basic x86 instruction set. Do all assembly languages support all instruction sets?
"Assembly language" is a kind of misnomer, at least in the way you are using it. Assemblers are less of a language (CS graduates may object) and more of a converter tool which takes textual representation and generates a binary image from it, with a close to 1:1 relationship between text elements (memnonics, labels and numbers) and binary elements. There is no deeper logic behind the elements of an assembler language because their possibilities to be quoted and redirected ends mostly at level 1; you can, for example, use EAX only in one instruction at a time - the next use of EAX in the next instruction bears no relationship with its previous use EXCEPT for the unwritten logical connection which the programmer had in mind - this is the reason why it is so easy to create bugs in assembler.
How would someone go about writing a routine in assembly, and then compiling it in to object/binary code?
One would need to pin down the lowest common denominator of instruction sets and code the function once for each of the architectures the code is intended to run on. IOW, if you are not coding for a certain hardware platform which is defined at the time of writing (e.g. a game console, an embedded board), you no longer do this.
How would someone then reference the functions/routines within that assembly code from a language like C or C++?
You need to declare them in your HLL - see your compiler's handbook.
How do we know the code we've written in assembly is the fastest it possibly can be?
There is no way to know. Be happy about that and code on.

Is there a (Linux) g++ equivalent to the /fp:precise and /fp:fast flags used in Visual Studio?

Background:
Many years ago, I inherited a codebase that was using the Visual Studio (VC++) flag '/fp:fast' to produce faster code in a particular calculation-heavy library. Unfortunately, '/fp:fast' produced results that were slightly different from the same library built under a different compiler (Borland C++). As we needed to produce exactly the same results, I switched to '/fp:precise', which worked fine, and everything has been peachy ever since. However, now I'm compiling the same library with g++ on Ubuntu Linux 10.04 and I'm seeing similar behavior, and I wonder if it might have a similar root cause. The numerical results from my g++ build are slightly different from the numerical results from my VC++ build. This brings me to my question:
Question:
Does g++ have equivalent or similar parameters to the 'fp:fast' and 'fp:precise' options in VC++? (and what are they? I want to activate the 'fp:precise' equivalent.)
More Verbose Information:
I compile using 'make', which calls g++. So far as I can tell (the make files are a little cryptic, and weren't written by me) the only parameters added to the g++ call are the "normal" ones (include folders and the files to compile) and -fPIC (I'm not sure what this switch does, I don't see it on the 'man' page).
The only relevant parameters in 'man g++' seem to be for turning optimization options ON. (e.g. -funsafe-math-optimizations). However, I don't think I'm turning anything ON, I just want to turn the relevant optimization OFF.
I've tried Release and Debug builds, VC++ gives the same results for release and debug, and g++ gives the same results for release and debug, but I can't get the g++ version to give the same results as the VC++ version.
From the GCC manual:
-ffloat-store
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.
This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
To expand a bit, most of these discrepancies come from the use of the x86 80-bit floating point registers for calculations (vs. the 64-bits used to store double values). If intermediate results are kept in the registers without writing back to memory, you effectively get 16 bits of extra precision in your calculations, making them more precise but possibly divergent from results generated with write/read of intermediate values to memory (or from calculations on architectures that only have 64-bit FP registers).
These flags (both in GCC and MSVC) generally force truncation of each intermediate result to 64-bits, thereby making calculations insensitive to the vagaries of code generation and optimization and platform differences. This consistency generally comes with a slight runtime cost in addition to the cost in terms of accuracy/precision.
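(A hedged illustration of what that can look like in practice; whether the message below actually appears depends entirely on the compiler, the optimization level and whether x87 or SSE code is generated, which is exactly the point:)

#include <cstdio>

double third(double x) { return x / 3.0; }

int main()
{
    double stored = third(1.0);         // result rounded to 64 bits when written to memory
    if (stored != third(1.0))           // a fresh result may still carry 80-bit x87 precision
        std::puts("excess precision detected");
    else
        std::puts("results match");
}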
Excess register precision is an issue only on FPU registers, which compilers (with the right enabling switches) tend to avoid anyway. When floating point computations are carried out in SSE registers, the register precision equals the memory one.
In my experience most of the /fp:fast impact (and potential discrepancy) comes from the compiler taking the liberty to perform algebraic transforms. This can be as simple as changing summands order:
(a + b) + c  -->  a + (b + c)
it can distribute multiplications like a*(b+c) at will, and it can get to some rather complex transforms - all intended to reuse previous calculations.
In infinite precision such transforms are benign, of course - but in finite precision they actually change the result. As a toy example, try the summand-order-example with a=b=2^(-23), c = 1. MS's Eric Fleegal describes it in much more detail.
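(A variant of that toy example in code, with single-precision values I picked to make the effect visible:)

#include <cstdio>

int main()
{
    float a = 1.0f;
    float b = 5.0e-8f;                  // a bit less than half of FLT_EPSILON
    float c = 5.0e-8f;

    float left  = (a + b) + c;          // each tiny addend is absorbed by 1.0f separately
    float right = a + (b + c);          // the addends combine first and survive the rounding

    std::printf("(a+b)+c = %.9g\n", left);    // 1
    std::printf("a+(b+c) = %.9g\n", right);   // 1.00000012
}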
In this respect, the gcc switch nearest to /fp:precise is -fno-unsafe-math-optimizations. I think it's on by default - perhaps you can try setting it explicitly and see if it makes a difference. Similarly, you can try explicitly turning off all -ffast-math optimizations: -fno-finite-math-only, -fmath-errno, -ftrapping-math, -frounding-math and -fsignaling-nans (the last 2 options are non default!)
I don't think there's an exact equivalent. You might try -mfpmath=sse instead of the default -mfpmath=387 to see if that helps.
This is definitely not related to optimization flags, assuming by "Debug" you mean "with optimizations off." If g++ gives the same results in debug as in release, that means it's not an optimization-related issue.
Debug builds should always store each intermediate result in memory, thereby guaranteeing the same results as /fp:precise does for MSVC.
This likely means there is (a) a compiler bug in one of the compilers, or more likely (b) a math library bug. I would drill into individual functions in your calculation and narrow down where the discrepancy lies. You'll likely find a workaround at that point, and if you do find a bug, I'm sure the relevant team would love to hear about it.
-mpc32 or -mpc64?
But you may need to recompile C and math libraries with the switch to see the difference... This may apply to options others suggested as well.