Why is there no optimization for uint8? - c++

So I have been researching how the variable uint8 works and I have realized that it is actually not faster than int! In order to multiply, divide, add, or subtract, the program must turn uint8 into an int which will make it about the same speed or slightly slower.
Why did C++ not implement multiplying, dividing, adding, or subtracting directly to uint8?

Why did C++ not implement multiplying, dividing, adding, or subtracting directly to uint8?
Because the optimal way of doing that is platform-specific.
Most CPUs provide these operations as assembler instructions that work on integer values of a specific default size (e.g. 32 bits or 64 bits, like shown here for 16-bit instructions); they may or may not have such instructions for uint8 values.
The bit size is usually optimized for the CPU's cache lining mechanisms.
So the optimal implementation depends on the instructions available on the target CPU and cannot be covered by the C++ standard.

I'm not sure whether or not a compiler will produce 8-bit arithmetic operations for uint8_t when appropriate (quite unlikely, since it is unlikely to be faster).
As @harold mentioned, what I said before is not so modern now... The partial register update problem is no longer so serious for 8-bit operations. So it is just that most 8-bit operations are not faster. 8-bit division is a little faster, though, and I'm trying to figure out why MS's compiler won't use it. (Not so sure: as the partial update problem is mostly reduced rather than completely removed, and is even kept by AMD, the one-cycle benefit of 8-bit division is just not worth chasing.)
Original:
On modern x86 processors, 8-bit operations face a problem called partial register update: you change only part of the full register, which results in a false dependency that seriously impacts performance.
And FYI, at the language level there is no arithmetic for integral types smaller than int in C++. There is the usual arithmetic promotion to lift the type.
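The promotion rule can be checked at compile time. The following is a minimal sketch, not part of the original answer; the helper names add_u8 and add_u8_wrapped are made up for illustration, and it assumes a platform that provides std::uint8_t.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper: both operands are promoted to int before the addition.
constexpr auto add_u8(std::uint8_t a, std::uint8_t b) -> decltype(a + b) {
    return a + b;  // the expression has type int, not std::uint8_t
}

// The result type of uint8_t + uint8_t is int.
static_assert(std::is_same<decltype(add_u8(1, 2)), int>::value,
              "uint8_t operands are promoted to int");

// Hypothetical helper: the sum only wraps at 256 when narrowed back explicitly.
constexpr std::uint8_t add_u8_wrapped(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}

static_assert(add_u8(200, 100) == 300, "no wrap-around in int");
static_assert(add_u8_wrapped(200, 100) == 44, "300 mod 256 after narrowing");
```

So an 8-bit result always involves a conversion back from int, whether or not the compiler ends up emitting 8-bit instructions for it.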

Related

Performance of custom bitwise vs native CPU operations

everyone! I have been trying to create my own big integer class for RSA implementation in C++ (for practice purposes only). The only way I see such thing to be implemented well in terms of performance is by using C++'s built-in bitwise operations (&|^), meaning implementing custom full-adders for addition, binary multipliers for multiplication, etc. The thing I am interested in can be formulated as follows: would custom-made emulators of hardware circuits for number arithmetic (like full-adders, multipliers) using bitwise C++ operations be slower in terms of performance. In other words, if I make my own unsigned integer class of the size of 64 bits, and make it "ideal" in terms of number of bitwise operations needed to perform addition, multiplication, division on it, can it have the same performance as built-in unsigned long long? Can it be implemented to be this fast with any programming language at all or you never will be able to make any operation faster than those in CPUs' intrinsic instruction set?
Please note that I am not interested in answers regarding implementations of RSA, but only in performance comparison of native and hand-made arithmetic.
Thank you in advance!
Software implementations cannot even come close to arithmetic circuits that are typically used in hardware. It's not even a question of benchmarking or of different results depending on the system (assuming we're talking about hardware that isn't prehistoric), it's a hands-down win for hardware circuits, guaranteed, every time, by a huge factor. A big difference between hardware and software is this: hardware has almost unlimited parallelism, so the number of bit-operations doesn't matter as much, speed depends primarily on the "depth" of a circuit. Software also has parallelism, but it's very limited.
Consider a typical fast multiplier, as used in modern hardware. They're based on some parallel reduction scheme, such as a Dadda multiplier (the actual circuit doesn't necessarily follow Dadda's algorithm to the letter, but it's going to use a similar parallel reduction). Hardware has almost unlimited parallelism to do that parallel reduction with. As a result, 64bit multiplication takes 3 cycles on many modern machines (not all of them, but for example both Apple M1 and all current Intel and AMD x64 processors). Granted, in 3 cycles you could squeeze more than 3 bitwise operations, but it's just not even a contest - you cannot implement multiplication in just a handful of bitwise operations.
Even just addition is already unbeatable. It's already as fast as bitwise operations are, or perhaps more accurately, bitwise operations are as slow as addition is. The time a bitwise operation takes at the software level has little to do with the latency of the corresponding gate, it's more a property of how the processor was designed in general. By the way you may also be interested in this other question: Why is addition as fast as bit-wise operations in modern processors?
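To make the gap concrete, here is a rough sketch of what "addition out of bitwise operations" looks like in software. The function name add_bitwise is hypothetical, and this carry-propagation loop is only one of several possible formulations, not necessarily the one the asker had in mind.

```cpp
#include <cstdint>

// Hypothetical helper: adds two 64-bit values using only AND, XOR and shifts.
// Each iteration pushes the carries one position up, so the loop can run
// up to 64 times, while the native + is a single instruction.
constexpr std::uint64_t add_bitwise(std::uint64_t a, std::uint64_t b) {
    while (b != 0) {
        std::uint64_t carry = a & b;  // positions where both bits are set
        a ^= b;                       // sum of the bits, ignoring carries
        b = carry << 1;               // carries feed the next position
    }
    return a;  // wraps modulo 2^64, like the built-in unsigned addition
}

static_assert(add_bitwise(123456789u, 987654321u) == 1111111110u,
              "matches the native addition");
```

Even in the best case this costs several instructions per iteration, so it cannot match an operation the hardware already completes in one cycle.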
As said by P Kramer in his comment, it totally depends on your system, instruction set and compiler. Modern compilers are pretty good at finding optimizations for your specific CPU when you ask them to, but it's impossible to know from theory alone whether they'll do as well as or better than the native instruction.
As usual in this case, I suggest A/B testing (don't forget to use -march and -mtune if using gcc/clang) to check which implementation is the fastest on your machine and by how much.

GPU HLSL compute shader warnings int and uint division

I keep having warnings from compute shader compilation in that I'm recommended to use uints instead of ints with dividing.
By default from the data type I assume uints are faster; however various tests online seem to point to the contrary; perhaps this contradiction is on the CPU side only and GPU parallelisation has some unknown advantage?
(Or is it just bad advice?)
I know that this is an extremely late answer, but this is a question that has come up for me as well, and I wanted to provide some information for anyone who sees this in the future.
I recently found this resource - https://arxiv.org/pdf/1905.08778.pdf
The table at the bottom lists the latency of basic operations on several graphics cards. There is a small but consistent savings to be found by using uints on all measured hardware. However, what the warning doesn't state is that the greater optimization is to be found by replacing division with multiplication if at all possible.
https://www.slideshare.net/DevCentralAMD/lowlevel-shader-optimization-for-nextgen-and-dx11-by-emil-persson states that type conversion is a full-rate operation like int/float subtraction, addition, and multiplication, whereas division is very slow.
I've seen it suggested that to improve performance, one should convert to float, divide, then convert back to int, but as shown in the first source, this will at best give you small gains and at worst actually decrease performance.
You are correct that this differs from the performance of the same operations on the CPU, although I'm not entirely certain why.
Looking at https://www.agner.org/optimize/instruction_tables.pdf it appears that which operation is faster (MUL vs IMUL) varies from CPU to CPU - in a few at the top of the list IMUL is actually faster, despite a higher instruction count. Other CPUs don't provide a distinction between MUL and IMUL at all.
TL;DR uint division is faster on the GPU, but on the CPU YMMV
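One plausible reason unsigned division can be cheaper (an assumption here, the sources above do not spell it out) is that signed division has to round toward zero, which needs extra fix-up work. The sketch below shows this in C++ rather than HLSL, for the simple case of dividing by 8; the function names are hypothetical, and the signed version assumes an arithmetic right shift for negative values (guaranteed from C++20, and what mainstream compilers do before that).

```cpp
#include <cstdint>

// Unsigned division by a power of two is a single shift.
constexpr std::uint32_t div8_unsigned(std::uint32_t x) {
    return x >> 3;
}

// Signed division truncates toward zero, so negative inputs need a bias
// before the shift; otherwise -7 / 8 would come out as -1 instead of 0.
constexpr std::int32_t div8_signed(std::int32_t x) {
    return (x + (x < 0 ? 7 : 0)) >> 3;
}

static_assert(div8_unsigned(100u) == 100u / 8u, "shift matches unsigned division");
static_assert(div8_signed(-7) == -7 / 8, "bias restores truncation toward zero");
```

The same idea extends to division by arbitrary constants, where the shift is replaced by a multiplication with a precomputed reciprocal.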

Built-in type efficiency

Under "The most efficient types" section here
...and when defining an object to store a floating point number, use the double type, ... The double type is two to three times less efficient than the float type...
Seems like it's contradicting itself?
And I read elsewhere (can't remember where) that computations involving ints are faster than shorts on many machines because they are converted to ints to perform the operations? Is this true? Any links on this?
One can always argue about the quality of the contents on the site you link to. But the two quotes you refer to:
...and when defining an object to store a floating point number, use the double type, ...
and
... The double type is two to three times less efficient than the float type...
Refer to two different things: the first hints that using doubles will cause far fewer problems thanks to the increased precision, while the other talks about performance. But honestly I wouldn't pay too much attention to that; chances are that if your code performs suboptimally, it is due to an incorrect choice of algorithm rather than the wrong choice of primitive data type.
Here is a quote about performance comparison of single and double precision floats from one of my old teachers: Agner Fog, who has a lot of interesting reads over at his website: http://www.agner.org about software optimizations, if you are really interested in micro optimizations go take a look at it:
In most cases, double precision calculations take no more time than single precision. When the floating point registers are used, there is simply no difference in speed between single and double precision. Long double precision takes only slightly more time. Single precision division, square root and mathematical functions are calculated faster than double precision when the XMM registers are used, while the speed of addition, subtraction, multiplication, etc. is still the same regardless of precision on most processors (when vector operations are not used).
source: http://agner.org/optimize/optimizing_cpp.pdf
While there might be different variations for different compilers, and different processors, the lesson one should learn from it, is that most likely you do not need to worry about optimizations at this level, look at choice of algorithm, even data container, not the primitive data type.
These optimizations are negligible unless you are writing software for space shuttle launches (which recently have not been doing too well). Correct code is far more important than fast code. If you require the precision, using doubles will barely affect the run time.
Things that affect execution time way more than type definitions:
Complexity - The more work there is to do, the more slowly the code will run. Reduce the amount of work needed, or break it up into smaller, faster tasks.
Repetition - Repetition can often be avoided and will inevitably ruin code performance. It comes in many guises-- for example, failing to cache the results of expensive calculations or of remote procedure calls. Every time you recompute, you waste efficiency. They also extend the executable size.
Bad Design - Self explanatory. Think before you code!
I/O - A program whose execution is blocked waiting for input or output (to and from the user, the disk, or a network connection) is bound to perform badly.
There are many more reasons, but these are the biggest. Personally, bad design is where I've seen most of it happen. State machines that could have been stateless, dynamic allocation where static would have been fine, etc. are the real problems.
Depending on the hardware, the actual CPU (or FPU if you like) performance of double is somewhere between half the speed and the same speed as float on modern CPUs [for example add or subtract is probably the same speed, multiply or divide may be different for the larger type].
On top of that, there are "fewer per cache-line", so when there is a large number of them, it gets slower still because memory speed is slower. Per cache-line, there are half as many double values -> about half the performance if the application is fully memory bound. It will be much less of a factor in a CPU-bound application.
Similarly, if you use SSE or similar SIMD technologies, the double will take up twice as much space, so the number of actual calculations will be half as many "per instruction", and typically, the CPU will allow the same number of instructions per cycle for both float and double - except for some operations that take longer for double. Again, leading to about half the performance.
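The "half as many" arithmetic in the last two paragraphs can be written down directly. This is only a sketch: the 64-byte cache line and the 16-byte (128-bit) SSE register are assumed sizes, and the names are made up for illustration.

```cpp
#include <cstddef>

// Assumed hardware sizes, not taken from the text above.
constexpr std::size_t kCacheLineBytes = 64;    // common cache line size
constexpr std::size_t kSseRegisterBytes = 16;  // one 128-bit SSE register

// How many values of type T fit in one cache line.
template <typename T>
constexpr std::size_t per_cache_line() {
    return kCacheLineBytes / sizeof(T);
}

// How many values of type T one SSE instruction can process at once.
template <typename T>
constexpr std::size_t simd_lanes() {
    return kSseRegisterBytes / sizeof(T);
}

static_assert(sizeof(float) == 4 && sizeof(double) == 8,
              "assumes the usual 32-bit float and 64-bit double");
static_assert(per_cache_line<float>() == 2 * per_cache_line<double>(),
              "twice as many floats per cache line");
static_assert(simd_lanes<float>() == 2 * simd_lanes<double>(),
              "twice as many floats per SIMD instruction");
```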
So, yes, I think the page in the link is confusing and mixes up the ideal performance setup between double and float. That is, from a pure performance perspective. It is often much easier to get noticeable calculation errors when using float, which can be a pain to track down, so I would start with double and switch to float only if it's deemed necessary because you have identified it as a performance issue (either from experience or measurements).
And yes, there are several architectures where only one size integer exists - or only two sizes, such as 8-bit char and 32-bit int, and 16-bit short would be simulated by performing the 32-bit math, and then dropping the top part of the value. For example MIPS has only got 32-bit operations, but can store and load 16-bit values to memory. It doesn't necessarily make it slower, but it certainly means that it's "not faster".
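What "performing the 32-bit math and then dropping the top part" means can be sketched in C++ as follows. The function add16 is a hypothetical illustration of the idea, not what any particular compiler emits.

```cpp
#include <cstdint>

// Hypothetical helper: a 16-bit addition simulated with 32-bit arithmetic.
constexpr std::uint16_t add16(std::uint16_t a, std::uint16_t b) {
    std::uint32_t wide = static_cast<std::uint32_t>(a) + b;  // math done in 32 bits
    return static_cast<std::uint16_t>(wide & 0xFFFFu);       // drop the top part
}

static_assert(add16(65535, 1) == 0, "wraps exactly like a native 16-bit add");
```

The extra masking step is why the smaller type is "not faster" on such a machine, even if it is not necessarily slower either.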

using 64 bits integers in 64 bits compilers and OSes

I have a question about when to use 64-bit integers when targeting 64-bit OSes.
Has anyone done conclusive studies focused on the speed of the generated code?
Is it better to use 64-bit integers as params for funcs or methods? (Ex: uint64 myFunc(uint64 myVar))
If we use 64-bit integers as params it takes more memory, but maybe it will be more efficient.
What if we know that some value should always be less than, for example, 10? Do we still use 64-bit integers for this param?
Is it better to use 64-bit integers as return types?
Is there some penalty for using 32-bit as return value?
Is it better to use 64-bit integers for loops? (for(size_t i=0; i<...)) In this case, I suppose so.
Is there some penalty for using 32-bit variables for loops?
Is it better to use 64-bit integers as indexes for pointers? (Ex: myMemory[index]) In this case, I suppose so.
Is there some penalty for using 32-bit variables for indexes?
Is it better to use 64-bit integers to store data in classes or structs? (data that we won't want to save to disk or something like this)
Is it better to use 64 bits for a bool type?
What about conversions between 64-bit integers and floats? Would it be better to use doubles now?
Until now, doubles have been slower than floats.
Is there some penalty every time we access a 32-bit variable?
Regards!
I agree with @MarkB but want to provide more detail on some topics.
On x64, there are more registers available (twice as many). The standard calling conventions have therefore been designed to take more parameters in registers by default. So as long as the number of parameters is not excessive (typically 4 or fewer), their types will make no difference. They will be promoted to 64 bit and passed in registers anyway.
Space will be allocated on the stack for those 64-bit registers even though they are passed in registers. This is by design to make their storage locations simple and contiguous with those of surplus parameters. The surplus parameters will be placed on the stack regardless, so size may matter in those cases.
This issue is particularly important for memory data structures. Using 64 bit where 32 bit is sufficient will waste memory, and more importantly, occupy space in cache lines. The cache impact is not simple though. If your data access pattern is sequential, that's when you will pay for it by essentially making half of your cache unusable. (Assuming you only needed half of each 64 bit quantity.)
If your access pattern is random, there is no impact on cache performance. This is because every access occupies a full cache line anyway.
There can be a small impact in accessing integers that are smaller than word size. However, pipelining and multiple issue of instructions will make it so that the extra instruction (zero or sign extend) will almost always become completely hidden and go unobserved.
The upshot of all this is simple: choose the integer size that matters for your problem. For parameters, the compiler can promote them as needed. For memory structure, smaller is typically better.
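For the memory-structure point, a small sketch shows how field width changes the number of records per cache line. The struct names and the 64-byte line size are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>

// Two hypothetical record layouts holding the same logical data.
struct SampleWide   { std::uint64_t id; std::uint64_t count; };
struct SampleNarrow { std::uint32_t id; std::uint32_t count; };

constexpr std::size_t kAssumedLineBytes = 64;  // assumed cache line size

// Number of whole records of type T that fit in one cache line.
template <typename T>
constexpr std::size_t records_per_line() {
    return kAssumedLineBytes / sizeof(T);
}

static_assert(sizeof(SampleWide) == 2 * sizeof(SampleNarrow),
              "the 64-bit layout is twice as large");
static_assert(records_per_line<SampleNarrow>() == 2 * records_per_line<SampleWide>(),
              "twice as many narrow records per cache line");
```

With sequential access, the narrow layout therefore pulls in half as many cache lines for the same number of records.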
You have managed to cram a ton of questions into one question here. It looks to me like all your questions basically concern micro-optimizations. As such I'm going to make a two-part answer:
Don't worry about size from a performance perspective but instead use types that are indicative of the data that they will contain and trust the compiler's optimizer to sort it out.
If performance becomes a concern at some point during development, profile your code. Then you can make algorithmic adjustments as appropriate, and if the profiler shows that integer operations are causing a problem you can compare different sizes side by side.
Use int and trust that the platform and compiler authors have done their job and chosen the most efficient representation for it. On most 64-bit platforms it is 32 bits, which means that it's no less efficient than 64-bit types.

x86 4byte floats vs. 8byte doubles (vs. long long)?

We have a measurement data processing application and currently all data is held as C++ float which means 32bit/4byte on our x86/Windows platform. (32bit Windows Application).
Since precision is becoming an issue, there have been discussions to move to another datatype. The options currently discussed are switching to double (8byte) or implementing a fixed decimal type on top of __int64 (8byte).
The reason the fixed-decimal solution using __int64 as underlying type is even discussed is that someone claimed that double performance is (still) significantly worse than processing floats and that we might see significant performance benefits using a native integer type to store our numbers. (Note that we really would be fine with fixed decimal precision, although the code would obviously become more complex.)
Obviously we need to benchmark in the end, but I would like to ask whether the statement that doubles are worse holds any truth looking at modern processors? I guess for large arrays doubles may mess up cache hits more than floats, but otherwise I really fail to see how they could differ in performance?
It depends on what you do. Additions, subtractions and multiplies on double are just as fast as on float on current x86 and POWER architecture processors. Divisions, square roots and transcendental functions (exp, log, sin, cos, etc.) are usually notably slower with double arguments, since their runtime is dependent on the desired accuracy.
If you go fixed point, multiplies and divisions need to be implemented with long integer multiply / divide instructions which are usually slower than arithmetic on doubles (since processors aren't optimized as much for it). Even more so if you're running in 32 bit mode where a long 64 bit multiply with 128 bit results needs to be synthesized from several 32-bit long multiplies!
Cache utilization is a red herring here. 64-bit integers and doubles are the same size - if you need more than 32 bits, you're gonna eat that penalty no matter what.
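The need for a wider intermediate in fixed-point multiplication can be seen in a scaled-down sketch. Everything here is an assumption for illustration: it uses a binary Q16.16 format on 32-bit integers with a 64-bit intermediate, rather than the 64-bit fixed-decimal type on __int64 discussed in the question, where the intermediate would need 128 bits. The names fixed32, to_fixed and fixed_mul are hypothetical.

```cpp
#include <cstdint>

using fixed32 = std::int32_t;      // hypothetical Q16.16 value: 16 integer bits, 16 fraction bits
constexpr int kFractionBits = 16;

// Converts a real number to the fixed-point representation (truncating).
constexpr fixed32 to_fixed(double x) {
    return static_cast<fixed32>(x * (1 << kFractionBits));
}

// The raw product carries 32 fraction bits, so it has to be computed in a
// type twice as wide and then shifted back down. Negative products rely on
// an arithmetic right shift (guaranteed from C++20, usual in practice before).
constexpr fixed32 fixed_mul(fixed32 a, fixed32 b) {
    std::int64_t wide = static_cast<std::int64_t>(a) * b;
    return static_cast<fixed32>(wide >> kFractionBits);
}

static_assert(fixed_mul(to_fixed(1.5), to_fixed(2.0)) == to_fixed(3.0),
              "1.5 * 2.0 == 3.0 in Q16.16");
```

Every multiply pays for the widening and the shift, which is part of why a hand-rolled fixed-point type is not automatically faster than double.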
Look it up. Both AMD and Intel publish the instruction latencies for their CPUs in freely available PDF documents on their websites.
However, for the most part, performance won't be significantly different, for a couple of reasons:
when using the x87 FPU instead of SSE, all floating point operations are calculated at 80 bits precision internally, and then rounded off, which means that the actual computation is equally expensive for all floating-point types. The only cost is really memory-related then (in terms of CPU cache and memory bandwidth usage, and that's only an issue in float vs double, but irrelevant if you're comparing to int64)
with or without SSE, nearly all floating-point operations are pipelined. When using SSE, the double instructions may (I haven't looked this up) have a higher latency than their float equivalents, but the throughput is the same, so it should be possible to achieve similar performance with doubles.
It's also not a given that a fixed-point datatype would actually be faster either. It might, but the overhead of keeping this datatype consistent after some operations might outweigh the savings. Floating-point operations are fairly cheap on a modern CPU. They have a bit of latency, but as mentioned before, they're generally pipelined, potentially hiding this cost.
So my advice:
Write some quick tests. It shouldn't be that hard to write a program that performs a number of floating-point ops, and then measure how much slower the double version is relative to the float one.
Look it up in the manuals, and see for yourself if there's any significant performance difference between float and double computations
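A quick test of the kind suggested in the first point could look roughly like this. It is a sketch only: the function names are made up, the workload (a sum of divisions) is an arbitrary choice, and a serious benchmark would need repeated runs, warm-up and the compiler flags used for the real application.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Workload under test: one division and one addition per element.
template <typename T>
T sum_of_quotients(const std::vector<T>& values, T divisor) {
    T sum = 0;
    for (T v : values) sum += v / divisor;
    return sum;
}

// Returns the wall-clock seconds taken to process `count` elements of type T.
template <typename T>
double seconds_for(std::size_t count) {
    const std::vector<T> values(count, T(3));
    const auto start = std::chrono::steady_clock::now();
    volatile T sink = sum_of_quotients(values, T(1.5));  // volatile keeps the result from being optimized away
    (void)sink;
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Sanity check of the workload itself: 4 * (3.0 / 1.5) is exactly 8.
inline void check_workload() {
    assert(sum_of_quotients<double>(std::vector<double>(4, 3.0), 1.5) == 8.0);
}
```

Comparing seconds_for<float>(n) with seconds_for<double>(n) for a large n, on the target machine and build settings, gives a first impression; the numbers will vary from system to system.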
I have trouble understanding the rationale "as double is slower than float, we'll use 64-bit ints". Guessing performance has always been a black art requiring a lot of experience, and on today's hardware it is even worse considering the number of factors to take into account. Even measuring is difficult. I know of several cases where micro-benchmarks led to one solution but in-context measurement showed that another was better.
First note that two of the factors which have been given to explain the claimed slower performance of double compared to float are not pertinent here: the bandwidth needed will be the same for double as for 64-bit ints, and SSE2 vectorization would give an advantage to double...
Then consider that using integer computation will increase the pressure on the integer registers and computation units while the floating-point ones will apparently stay idle. (I've already seen cases where doing integer computation in double was a win attributed to the added computation units available.)
So I doubt that rolling your own fixed-point arithmetic would be advantageous over using double (but I could be shown wrong by measurements).
Implementing 64-bit fixed point isn't really fun, especially for more complex functions like Sqrt or logarithm. Integers will probably still be a bit faster for simple operations like additions. And you'll need to deal with integer overflows. And you need to be careful when implementing rounding, else errors can easily accumulate.
We're implementing fixed point in a C# project because we need determinism, which floating point on .NET doesn't guarantee. And it's relatively painful. Some formula contained x^3 and, bang, int overflow. Unless you have really compelling reasons not to, use float or double instead of fixed point.
SIMD instructions from SSE2 complicate the comparison further, since they allow operating on several floating-point numbers (4 floats or 2 doubles) at the same time. I'd use double and try to take advantage of these instructions. So double will probably be significantly slower than float, but comparing with ints is difficult and I'd prefer float/double over fixed point in most scenarios.
It's always best to measure instead of guess. Yes, on many architectures, calculations on doubles process twice the data as calculations on floats (and long doubles are slower still). However, as other answers, and comments on this answer, have pointed out, the x86 architecture doesn't follow the same rules as, say, ARM processors, SPARC processors, etc. On x86 floats, doubles and long doubles are all converted to long doubles for computation. I should have known this, because the conversion causes x86 results to be more accurate than SPARC and Sun went through a lot of trouble to get the less accurate results for Java, sparking some debate (note, that page is from 1998, things have since changed).
Additionally, calculations on doubles are built in to the CPU where calculations on a fixed decimal datatype would be written in software and potentially slower.
You should be able to find a decent fixed sized decimal library and compare.
With various SIMD instruction sets you can perform 4 single precision floating point operations at the same cost as one, essentially you pack 4 floats into a single 128 bit register. When switching to doubles you can only pack 2 doubles into these registers and hence you can only do two operations at the same time.
As many people have said, a 64bit int is probably not worth it if double is an option. At least when SSE is available. This might be different on micro controllers of various kinds but I guess that is not your application. If you need additional precision in long sums of floats, you should keep in mind that this operation is sometimes problematic with floats and doubles and would be more exact on integers.
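The remark about long sums can be illustrated with a small sketch. It assumes IEEE 754 single precision with the default round-to-nearest mode, and the function names are hypothetical. Above 2^24 = 16777216 a float can no longer represent every integer, so adding 1.0f stops changing the sum, while an integer accumulator stays exact.

```cpp
#include <cassert>
#include <cstdint>

// Adds 1.0f to `start` `steps` times. The volatile accumulator forces every
// partial sum to be rounded to float, as it would be when stored in an array.
inline float count_up_float(float start, int steps) {
    volatile float sum = start;
    for (int i = 0; i < steps; ++i) sum = sum + 1.0f;
    return sum;
}

// The same accumulation on a 64-bit integer is exact.
inline std::int64_t count_up_int(std::int64_t start, int steps) {
    for (int i = 0; i < steps; ++i) start += 1;
    return start;
}

inline void check_accumulation() {
    // Each 16777216 + 1 rounds back to 16777216, so the float sum is stuck.
    assert(count_up_float(16777216.0f, 10) == 16777216.0f);
    assert(count_up_int(16777216, 10) == 16777226);
}
```

A double accumulator pushes the same limit out to 2^53, which is one reason switching from float to double often resolves this kind of precision issue.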