Getting hardware floating point with android NDK

Getting hardware floating point with android NDK - c++

I've begun playing with the android NDK. One of the things I've just learnt is about creating an application.mk file to specify the armv7 abi.
I'm building the san-angeles example with the following parameters.
APP_MODULES := sanangeles
APP_PROJECT_PATH := $(call my-dir)/../
APP_OPTIM := release
APP_ABI := armeabi-v7a
However this seems to run at exactly the same speed as it did before (ie badly). Am I just GL limited and not CPU limited or is something wrong here?
I have noticed when I compile that I get the following command line options emitted:
-march=armv7-a -mfloat-abi=softfp -mfpu=vfp -mthumb
The thing that worries me there is the "softfp". There IS mention of the v7 abi, the VFP fpu stuff and I'm guessing the "thumb" refers to the "thumb-2" instructions (Though I don't know what exactly these are). However that "softfp" does concern me. Shouldn't it be "hardfp"?
Anyone got any ideas on these questions? I think I'm probably about ready to start implementing some GL ES 2.0 code for my HTC Desire but I'd like to make sure I'm getting the best possible speed out of it :)
Cheers in advance!

The options you supply to the NDK will only affect the way your code is compiled. It won't change the GL libs or anything else that's part of the platform, which are always generated in an appropriate fashion. If you're just throwing geometry at the GL hardware, you're not going to see a difference.
If you want to see if your options are having an effect, download (or create) a simple benchmark that does a bunch of operations with double-precision floating point values, and time how long it takes to execute before and after.
The -mfloat-abi=softp argument determines how floating-point values are passed between functions. softfp means they're always passed in integer registers or on the stack. If Android didn't specify softfp, the ARMv7-A version of the library would expect floats to show up in hardware registers, and any code built for ARMv5TE would break.
"softfp" adds a little overhead to some functions, but the instructions for moving values in and out of fp registers are cheap on ARM, and the ABI compatibility provided makes it worthwhile.
The "-mthumb" enables generation of Thumb/Thumb2 code. Thumb code tends to be a bit slower but a bit smaller than equivalent ARM; sometimes smaller means you'll fit better in the CPU i-cache and will actually run faster. Size is always a concern on these devices, so Thumb is enabled by default.
When in doubt, "arm-eabi-objdump -d whatever.o" will show you a disassembly of your code.
Update: NDK r9b added support for -mhard-float. This allows you to build NDK libraries with hard-float API conventions for armeabi-v7a targets.

Related

Optimize for a specific machine / processor architecture

In this highly voted answer to a question on the performance differences between C++ and Java I learn that the JIT compiler is sometimes able to optimize better because it can determine the exact specifics of the machine (processor, cache sizes, etc.):
Generally, C# and Java can be just as fast or faster because the JIT
compiler -- a compiler that compiles your IL the first time it's
executed -- can make optimizations that a C++ compiled program cannot
because it can query the machine. It can determine if the machine is
Intel or AMD; Pentium 4, Core Solo, or Core Duo; or if supports SSE4,
etc.
A C++ program has to be compiled beforehand usually with mixed
optimizations so that it runs decently well on all machines, but is
not optimized as much as it could be for a single configuration (i.e.
processor, instruction set, other hardware).
Question: Is there a way to tell the compiler to optimize specifically for my current machine? Is there a compiler which is able to do this?

For GCC, you can use the flag -march=native. Be aware that the generated code may not run on other CPUs because
GCC uses this name to determine what kind of instructions it can emit
when generating assembly code.
So CPU specific assembly can be generated.
If you want your code to run on other CPU types, but tune it for better performance on your CPU, then you should use -mtune=native:
Specify the name of the processor to tune the performance for. The
code will be tuned as if the target processor were of the type
specified in this option, but still using instructions compatible with
the target processor specified by a -mcpu= option.

Certainly a compiler could be instructed to optimize for a specific architecture. This is true of gcc, if you look at the multitude of architecture flags that you can pass in. The same is true to a lesser extent on Visual Studio, as it has the -MACHINE option and /arch options.
However, unlike in Java, this likely means that the generated code is only (safe) to run on that hardware that is being targeted. The assertion that Java can be just as fast or faster only likely holds in the case of generically compiled C++ code. Given the target architecture, C++ code compiled for that specific architecture will likely be as fast or faster than equivalent Java code. Of course, it's much more work to support multiple architectures in this way.

Will statically linked c++ binary work on every system with same architecture?

I'm making a very simple program with c++ for linux usage, and I'd like to know if it is possible to make just one big binary containing all the dependencies that would work on any linux system.
If my understanding is correct, any compiler turns source code into machine instructions, but since there are often common parts of code that can be reused with different programs, most programs depend on another libraries.
However if I have the source code for all my dependencies, I should be able to compile a binary in a way that would not require anything from the system? Will I be able to run stuff compiled on 64bit system on a 32bit system?

In short: Maybe.
The longer answer is:
It depends. You can't, for example, run a 64-bit binary on a 32-bit system, that's just not even nearly possible. Yes, it's the same processor family, but there are twice as many registers in the 64-bit system, which also has twice as long registers. What's the 32-bit processor going to "give back" for the value of those bits and registers that doesn't exist in the hardware in the processor? It just plain won't work. Some of the instructions also completely change meaning, so the system really needs to be "right" for the compiled code, or it won't work - fortunately, Linux will check this and plain refuse if it's not right.
You can BUILD a 32-bit binary on a 64-bit system (assuming you have all the right libraries, etc, installed for both 64- and 32-bit, etc).
Similarly, if you try to run ARM code on an x86 processor, or MIPS code on an ARM processor, it simply has no chance of working, because the actual instructions are completely different (or they would be in breach of some patent/copyright or similar, because processor instruction sets contain portions that are "protected intellectual property" in nearly all cases - so designers have to make sure they do NOT do "the same as someone else's design"). Like for 32-bit and 64-bit, you simply won't get a chance to run the wrong binary here, it just won't work.
Sometimes, there are subtle differences, for example ARM code can be compiled with "hard" or "soft" floating point. If the code is compiled for hard float, and there isn't the right support in the OS, then it won't run the binary. Worse yet, if you compile on x86 for SSE instructions, and try to run on a non-SSE processor, the code will simply crash [unless you specifically build code to "check for SSE, and display error if not present"].
So, if you have a binary that passes the above criteria, the Linux system tends to change a tiny bit between releases, and different distributions have subtle "fixes" that change things. Most of the time, these are completely benign (they fix some obscure corner-case that someone found during testing, but the general, non-corner case behaviour is "normal"). However, if you go from Linux version 2.2 to Linux version 3.15, there will be some substantial differences between the two versions, and the binary from the old one may very well be incompatible with the newer (and almost certainly the other way around) - it's hard to know exactly which versions are and aren't compatible. Within releases that are close, then it should work OK as long as you are not specifically relying on some feature that is present in only one (after all, new things ARE added to the Linux kernel from time to time). Here the answer is "maybe".
Note that in the above is also your implementation of the C and C++ runtime, so if you have a "new" C or C++ runtime library that uses Linux kernel feature X, and try to run it on an older kernel, before feature X was implemented (or working correctly for the case the C or C++ runtime is trying to use it).
Static linking is indeed a good way to REDUCE the dependency of different releases. And a good way to make your binary huge, which may be preventing people from downloading it.
Making the code open source is a much better way to solve this problem, then you just distribute your source code and a list of "minimum requirements", and let other people deal with it needing to be recompiled.

In practice, it depends on "sufficiently simple". If you're using C++11, you'll quickly find that the C++11 libraries have dependencies on modern libc releases. In turn, those only ship with modern Linux distributions. I'm not aware of any "Long Term Support" Linux distribution which today (June 2014) ships with libc support for GCC 4.8

The short answer is no, at least without serious hack.
Different linux distribution may have different glue code between user-space and kernel. For instant, an hello world seemingly without dependency built from ubuntu cannot be executed under CentOS.
EDIT: Thanks for the comment. I re-verify this and the cause is im using 32-bit VM. Sorry for causing confusion. However, as noted above, the rule of thumb is that even same linux distribution may sometime breaks compatibility in order to deploy bugfix, so the conclusion stands.

Cross Platform Floating Point Consistency

I'm developing a cross-platform game which plays over a network using a lockstep model. As a brief overview, this means that only inputs are communicated, and all game logic is simulated on each client's computer. Therefore, consistency and determinism is very important.
I'm compiling the Windows version on MinGW32, which uses GCC 4.8.1, and on Linux I'm compiling using GCC 4.8.2.
What struck me recently was that, when my Linux version connected to my Windows version, the program would diverge, or de-sync, instantly, even though the same code was compiled on both machines! Turns out the problem was that the Linux build was being compiled via 64 bit, whereas the Windows version was 32 bit.
After compiling a Linux 32 bit version, I was thankfully relieved that the problem was resolved. However, it got me thinking and researching on floating point determinism.
This is what I've gathered:
A program will be generally consistent if it's:
ran on the same architecture
compiled using the same compiler
So if I assume, targeting a PC market, that everyone has a x86 processor, then that solves requirement one. However, the second requirement seems a little silly.
MinGW, GCC, and Clang (Windows, Linux, Mac, respectively) are all different compilers based/compatible with/on GCC. Does this mean it's impossible to achieve cross-platform determinism? or is it only applicable to Visual C++ vs GCC?
As well, do the optimization flags -O1 or -O2 affect this determinism? Would it be safer to leave them off?
In the end, I have three questions to ask:
1) Is cross-platform determinism possible when using MinGW, GCC, and Clang for compilers?
2) What flags should be set across these compilers to ensure the most consistency between operating systems / CPUs?
3) Floating point accuracy isn't that important for me -- what's important is that they are consistent. Is there any method to reducing floating point numbers to a lower precision (like 3-4 decimal places) to ensure that the little rounding errors across systems become non-existent? (Every implementation I've tried to write so far has failed)
Edit: I've done some cross-platform experiments.
Using floatation points for velocity and position, I kept a Linux Intel Laptop and a Windows AMD Desktop computer in sync for up to 15 decimal places of the float values. Both systems are, however, x86_64. The test was simple though -- it was just moving entities around over a network, trying to determine any visible error.
Would it make sense to assume that the same results would hold if a x86 computer were to connect to a x86_64 computer? (32 bit vs 64 bit Operating System)

Cross-platform and cross-compiler consistency is of course possible. Anything is possible given enough knowledge and time! But it might be very hard, or very time-consuming, or indeed impractical.
Here are the problems I can foresee, in no particular order:
Remember that even an extremely small error of plus-or-minus 1/10^15 can blow up to become significant (you multiply that number with that error margin with one billion, and now you have a plus-or-minus 0.000001 error which might be significant.) These errors can accumulate over time, over many frames, until you have a desynchronized simulation. Or they can manifest when you compare values (even naively using "epsilons" in floating-point comparisons might not help; only displace or delay the manifestation.)
The above problem is not unique to distributed deterministic simulations (like yours.) The touch on the issue of "numerical stability", which is a difficult and often neglected subject.
Different compiler optimization switches, and different floating-point behavior determination switches might lead to the compiler generate slightly different sequences of CPU instructions for the same statements. Obviously these must be the same across compilations, using the same exact compilers, or the generated code must be rigorously compared and verified.
32-bit and 64-bit programs (note: I'm saying programs and not CPUs) will probably exhibit slightly different floating-point behaviors. By default, 32-bit programs cannot rely on anything more advanced than x87 instruction set from the CPU (no SSE, SSE2, AVX, etc.) unless you specify this on the compiler command line (or use the intrinsics/inline assembly instructions in your code.) On the other hand, a 64-bit program is guaranteed to run on a CPU with SSE2 support, so the compiler will use those instructions by default (again, unless overridden by the user.) While x87 and SSE2 float datatypes and operations on them are similar, they are - AFAIK - not identical. Which will lead to inconsistencies in the simulation if one program uses one instruction set and another program uses another.
The x87 instruction set includes a "control word" register, which contain flags that control some aspects of floating-point operations (e.g. exact rounding behavior, etc.) This is a runtime thing, and your program can do one set of calculations, then change this register, and after that do the exact same calculations and get a different result. Obviously, this register must be checked and handled and kept identical on the different machines. It is possible for the compiler (or the libraries you use in your program) to generate code that changes these flags at runtime inconsistently across the programs.
Again, in case of the x87 instruction set, Intel and AMD have historically implemented things a little differently. For example, one vendor's CPU might internally do some calculations using more bits (and therefore arrive at a more accurate result) that the other, which means that if you happen to run on two different CPUs (both x86) from two different vendors, the results of simple calculations might not be the same. I don't know how and under what circumstances these higher accuracy calculations are enabled and whether they happen under normal operating conditions or you have to ask for them specifically, but I do know these discrepancies exist.
Random numbers and generating them consistently and deterministically across programs has nothing to do with floating-point consistency. It's important and source of many bugs, but in the end it's just a few more bits of state that you have to keep synched.
And here are a couple of techniques that might help:
Some projects use "fixed-point" numbers and fixed-point arithmetic to avoid rounding errors and general unpredictability of floating-point numbers. Read the Wikipedia article for more information and external links.
In one of my own projects, during development, I used to hash all the relevant state (including a lot of floating-point numbers) in all the instances of the game and send the hash across the network each frame to make sure even one bit of that state wasn't different on different machines. This also helped with debugging, where instead of trusting my eyes to see when and where inconsistencies existed (which wouldn't tell me where they originated, anyways) I would know the instant some part of the state of the game on one machine started diverging from the others, and know exactly what it was (if the hash check failed, I would stop the simulation and start comparing the whole state.)
This feature was implemented in that codebase from the beginning, and was used only during the development process to help with debugging (because it had performance and memory costs.)
Update (in answer to first comment below): As I said in point 1, and others have said in other answers, that doesn't guarantee anything. If you do that, you might decrease the probability and frequency of an inconsistency occurring, but the likelihood doesn't become zero. If you don't analyze what's happening in your code and the possible sources of problems carefully and systematically, it is still possible to run into errors no matter how much you "round off" your numbers.
For example, if you have two numbers (e.g. as results of two calculations that were supposed to produce identical results) that are 1.111499999 and 1.111500001 and you round them to three decimal places, they become 1.111 and 1.112 respectively. The original numbers' difference was only 2E-9, but it has now become 1E-3. In fact, you have increased your error 500'000 times. And still they are not equal even with the rounding. You've exacerbated the problem.
True, this doesn't happen much, and the examples I gave are two unlucky numbers to get in this situation, but it is still possible to find yourself with these kinds of numbers. And when you do, you're in trouble. The only sure-fire solution, even if you use fixed-point arithmetic or whatever, is to do rigorous and systematic mathematical analysis of all your possible problem areas and prove that they will remain consistent across programs.
Short of that, for us mere mortals, you need to have a water-tight way to monitor the situation and find exactly when and how the slightest discrepancies occur, to be able to solve the problem after the fact (instead of relying on your eyes to see problems in game animation or object movement or physical behavior.)

No, not in practice. For example, sin() might come from a library or from a compiler intrinsic, and differ in rounding. Sure, that's only one bit, but that's already out of sync. And that one bit error may add up over time, so even an imprecise comparison may not be sufficient.
N/A
You can't reduce FP precision for a given type, and I don't even see how it would help you. You'd turn the occasional 1E-6 difference into an occasional 1E-4 difference.

Next to your concerns on determinism, I have another remark: if you are worried about calculation consistency on a distributed system, you may have a design issue.
You could think about your application as a bunch of nodes, each responsible for their own calculations. If information about another node is needed, it should sent to you by that node.

1.)
In principle cross platform, OS, hardware compatibility is possible but in practice it's a pain.
In general your results will depend on which OS you use, which compiler, and which hardware you use. Change any one of those and your results might change. You have to test all changes.
I use Qt Creator and qmake (cmake is probably better but qmake works for me) and test my code in MSVC on Windows, GCC on Linux, and MinGW-w64 on Windows. I test both 32-bit and 64-bit. This has to be done whenever code changes.
2.) and 3.)
In terms of floating point some compilers will use x87 instead of SSE in 32-bit mode. See this as an example of the consequences of when that happens Why a number crunching program starts running much slower when diverges into NaNs? All 64-bit systems have SSE so I think most use SSE/AVX in 64-bit otherwise, e.g. in 32 bit mode, you might need to force SSE with something like -mfpmath=sse and -msse2.
But if you want a more compatible version of GCC on windows then I would used MingGW-w64 for 32-bit (aka MinGW-w32) or MinGW-w64 in 64bit . This is not the same thing as MinGW (aka mingw32). The projects have diverged. MinGW depends on MSVCRT (the MSVC C runtime library) and MinGW-w64 does not. The Qt project has a pretty good description of MinGW-w64 and installiation. http://qt-project.org/wiki/MinGW-64-bit
You might also want to consider writing a CPU dispatcher cpu dispatcher for visual studio for AVX and SSE.

Filesize difference when cross compiling

I am writing a small application in c++ that runs on my host machine (linux x86) and on a a target machine(arm).
The problem I have is that on the host machine my binary is about 700kb of size but on the target machine it is about 7mb.
I am using the same compile switches for both platforms. My first though was that a library on the arget machine got linked statically but I checked both binaries with objdump and both use the same dynamically link libraries.
So can anyone give me hint how I can figure out why there is such a huge difference in size?

While different computer architectures can theoretically require completely different amounts of executable code for the same program, a factor of 10 is not really expected among modern architectures. ARM and x86 may be different, but they are still designed in the same universe where memory and bandwidth is not something to waste, leading CPU designers to try to keep the executable code as tight as possible.
I would, therefore, look at the following possibilities, in order of probability:
Symbol stripping: if one of the two binaries has been stripped from its symbols, then it would be significantly smaller, especially if compiled with debugging information. You might want to try to strip both binaries and see what happens.
Static linking: I have occasionally encountered build systems for embedded targets that would prefer static linking over using shared libraries. Examining the library dependencies of each binary would probably detect this.
Additional enabled code: The larger binary may have additional code enabled because e.g. the build system found an additional optional library or because the target platform requires specific handle.
Still, a factor of 10 is probably too much for this, unless the smaller binary is missing a lot of functionality or the larger one has linked in some optional library statically.
Different compiler configuration: You should not only look at the compiler options that you supply, but also at the defaults the compiler uses for each target. For example if the compiler has significantly higher inlining or loop unrolling limits in one architecture, the resulting executable could baloon-out noticeably.

first there is no reason to expect the same code compiled for different architectures to have any kind of relationship in size to each other. You can easily have A be larger than B then change one line of code and then B is larger than A.
Second the "binaries" you are talking about are I am guessing elf, which is a little bit of binary and some to a lot of overhead. The overhead can vary between architectures and other such things.
Bottom line if you are compiling the same code for two architectures/platforms or with different compilers or compile options for the same architecture there is no reason to expect the file sizes to have any relationship to each other.

Different architectures can have completely different ways to handle the same thing. For example loading immediate value on CISC (e.g. x86) architecture is usually one instruction, while on RISC (e.g. ppc, arm) it usually is more than one instruction, the actual number needed being dependent on the value. For example if the instruction set only allows 16bit immediate values, you may need up to 7 instructions to load a 64bit value (loading by 16bits and shifting in between the loads). Hence the code is inherently different.

One reason not mentioned so far, but relevant to ARM/x86 comparisons is Floating Point emulation. All x86 chips today come with native FP support (and x86-64 even with SIMD FP support via SSE), but ARM CPU's often lack a FP unit. That in turn means even a simple FP addition has to be turned into a long sequence of integer operations on exponents and mantissa's.

Is there a (Linux) g++ equivalent to the /fp:precise and /fp:fast flags used in Visual Studio?

Background:
Many years ago, I inherited a codebase that was using the Visual Studio (VC++) flag '/fp:fast' to produce faster code in a particular calculation-heavy library. Unfortunately, '/fp:fast' produced results that were slightly different to the same library under a different compiler (Borland C++). As we needed to produce exactly the same results, I switched to '/fp:precise', which worked fine, and everything has been peachy ever since. However, now I'm compiling the same library with g++ on uBuntu Linux 10.04 and I'm seeing similar behavior, and I wonder if it might have a similar root cause. The numerical results from my g++ build are slightly different from the numerical results from my VC++ build. This brings me to my question:
Question:
Does g++ have equivalent or similar parameters to the 'fp:fast' and 'fp:precise' options in VC++? (and what are they? I want to activate the 'fp:precise' equivalent.)
More Verbose Information:
I compile using 'make', which calls g++. So far as I can tell (the make files are a little cryptic, and weren't written by me) the only parameters added to the g++ call are the "normal" ones (include folders and the files to compile) and -fPIC (I'm not sure what this switch does, I don't see it on the 'man' page).
The only relevant parameters in 'man g++' seem to be for turning optimization options ON. (e.g. -funsafe-math-optimizations). However, I don't think I'm turning anything ON, I just want to turn the relevant optimization OFF.
I've tried Release and Debug builds, VC++ gives the same results for release and debug, and g++ gives the same results for release and debug, but I can't get the g++ version to give the same results as the VC++ version.

From the GCC manual:
-ffloat-store
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.
This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
To expand a bit, most of these discrepancies come from the use of the x86 80-bit floating point registers for calculations (vs. the 64-bits used to store double values). If intermediate results are kept in the registers without writing back to memory, you effectively get 16 bits of extra precision in your calculations, making them more precise but possibly divergent from results generated with write/read of intermediate values to memory (or from calculations on architectures that only have 64-bit FP registers).
These flags (both in GCC and MSVC) generally force truncation of each intermediate result to 64-bits, thereby making calculations insensitive to the vagaries of code generation and optimization and platform differences. This consistency generally comes with a slight runtime cost in addition to the cost in terms of accuracy/precision.

Excess register precision is an issue only on FPU registers, which compilers (with the right enabling switches) tend to avoid anyway. When floating point computations are carried out in SSE registers, the register precision equals the memory one.
In my experience most of the /fp:fast impact (and potential discrepancy) comes from the compiler taking the liberty to perform algebraic transforms. This can be as simple as changing summands order:
( a + b ) + c --> a + ( b + c)
can be - distributing multiplications like a*(b+c) at will, and can get to some rather complex transforms - all intended to reuse previous calculations.
In infinite precision such transforms are benign, of course - but in finite precision they actually change the result. As a toy example, try the summand-order-example with a=b=2^(-23), c = 1. MS's Eric Fleegal describes it in much more detail.
In this respect, the gcc switch nearest to /fp:precise is -fno-unsafe-math-optimizations. I think it's on by default - perhaps you can try setting it explicitly and see if it makes a difference. Similarly, you can try explicitly turning off all -ffast-math optimizations: -fno-finite-math-only, -fmath-errno, -ftrapping-math, -frounding-math and -fsignaling-nans (the last 2 options are non default!)

I don't think there's an exact equivalent. You might try -mfpmath=sse instead of the default -mfpmath=387 to see if that helps.

This is definitely not related to optimization flags, assuming by "Debug" you mean "with optimizations off." If g++ gives the same results in debug as in release, that means it's not an optimization-related issue.
Debug builds should always store each intermediate result in memory, thereby guaranteeing the same results as /fp:precise does for MSVC.
This likely means there is (a) a compiler bug in one of the compilers, or more likely (b) a math library bug. I would drill into individual functions in your calculation and narrow down where the discrepancy lies. You'll likely find a workaround at that point, and if you do find a bug, I'm sure the relevant team would love to hear about it.

-mpc32 or -mpc64?
But you may need to recompile C and math libraries with the switch to see the difference... This may apply to options others suggested as well.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js