What is faster (performance-wise):

__int64 x, y;
x = y;

or

int x, y, a, b;
x = a;
y = b;

Or are they equal?
__int64 is a non-standard compiler extension, so whilst it may or may not be faster, you don't want to use it if you want cross-platform code. Instead, you should consider #include <cstdint> and use int64_t, uint64_t, etc. These derive from the C99 standard, which provides stdint.h and inttypes.h for fixed-width integer arithmetic.
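For illustration, a minimal sketch of the two variants from the question written with the fixed-width types from <cstdint> (the helper function names are purely illustrative):

#include <cstdint>

std::int64_t copy64(std::int64_t y)
{
    std::int64_t x;
    x = y;          // one 64-bit assignment
    return x;
}

void copy32(std::int32_t a, std::int32_t b, std::int32_t& x, std::int32_t& y)
{
    x = a;          // two 32-bit assignments
    y = b;
}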
In terms of performance, it depends on the system you are on - x86_64, for example, should not see any performance difference between adding 32-bit and 64-bit integers, since the add instruction can work on 32- or 64-bit registers.
However, if you're running code on a 32-bit platform or compiling for a 32-bit architecture, adding two 64-bit integers actually requires two 32-bit additions (an add followed by an add-with-carry), as opposed to one, and the value occupies two registers instead of one. So if you don't need the extra space, it would be wasteful to allocate it.
I have no idea if compilers can or do optimise the types down to a smaller size if necessary. I expect not, but I'm no compiler engineer.
I hate these sorts of questions.
1) If you don't know how to measure for yourself then you almost certainly don't need to know the answer.
2) On modern processors it's very hard to predict how fast something will be based on single instructions. It's far more important to understand your program's cache usage and overall performance than to worry about optimising silly little snippets of code that do an assignment. Let the compiler worry about that and spend your time improving the algorithm used or other things that will have a much bigger impact.
So in short, you probably can't tell and it probably doesn't matter. It's a silly question.
The compiler's optimizer will remove all the code in your example, so in that sense there is no difference. I think what you really want to know is whether it is faster to move data 32 bits at a time or 64 bits at a time. If your data is aligned to 8 bytes and you are on a 64-bit machine, then it should be faster to move data 8 bytes at a time. There are several caveats to that, however. You may find that your compiler is already doing this optimization for you (you would have to look at the emitted assembly code to be sure), in which case you would see no difference. Also, consider using memcpy instead of rolling your own if you are moving a lot of data. If you are considering casting an array of 32-bit ints to 64-bit in order to copy faster or do some other operation in half the number of instructions, be sure to Google the strict aliasing rule.
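A minimal sketch of the memcpy approach mentioned above (copying a buffer of 32-bit ints without casting to a 64-bit type, so the strict aliasing rule is never in play; the function name is illustrative):

#include <cstdint>
#include <cstring>

// Lets the compiler/runtime pick the widest moves the platform supports.
void copy_ints(std::int32_t* dst, const std::int32_t* src, std::size_t count)
{
    std::memcpy(dst, src, count * sizeof(std::int32_t));
}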
__int64 should be faster on most platforms, but be careful - some architectures require 8-byte alignment for this to take effect, and some would even crash your app if it's not aligned.
Related
I have to multiply a vector of integers with another vector of integers, and then add the result (also a vector of integers) to a vector of floating-point values.
Should I use MMX or SSE4 for the integers, or can I just use SSE for all of these values (even the integers), putting the integers in __m128 registers?
Indeed, I often put integers in __m128 registers, and I don't know if I am wasting time (on implicit conversion of the values) or if it makes no difference.
I am compiling with the -O3 option.
You should probably just use SSE for everything (MMX is just a very outdated precursor to SSE). If you're going to be targeting mainly newer CPUs then you might even consider AVX/AVX2.
Start by implementing everything cleanly and robustly in scalar code, then benchmark it. It's possible that a scalar implementation will be fast enough, and you won't need to do anything else. Furthermore, gcc and other compilers (e.g. clang, ICC, even Visual Studio) are getting reasonably good at auto-vectorization, so you may get SIMD-vectorized code "for free" that meets your performance needs. However if you still need better performance at this point then you can start to convert your scalar code to SSE. Keep the original scalar implementation for validation and benchmarking purposes though - it's very easy to introduce bugs when optimising code, and it's useful to know how much faster your optimised code is than the baseline code (you're probably looking for somewhere between 2x and 4x faster for SSE versus scalar code).
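If you do reach the point of hand-written SSE, here is a minimal sketch of the operation described in the question (integer multiply, then add to floats), assuming SSE4.1 is available and n is a multiple of 4; the function name and the unaligned loads are illustrative. Note that the integers live in __m128i registers and the conversion to float is explicit:

#include <smmintrin.h>  // SSE4.1 (also pulls in SSE/SSE2/SSE3)

// result[i] = float(a[i] * b[i]) + c[i]
void mul_int_add_float(const int* a, const int* b, const float* c,
                       float* result, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128i va   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        __m128i prod = _mm_mullo_epi32(va, vb);   // 32-bit integer multiply (SSE4.1)
        __m128  fp   = _mm_cvtepi32_ps(prod);     // explicit int -> float conversion
        __m128  vc   = _mm_loadu_ps(c + i);
        _mm_storeu_ps(result + i, _mm_add_ps(fp, vc));
    }
}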
While the previous answer is reasonable, there is one significant difference: data organization. For direct SSE use, data is better organized as Structure-of-Arrays (SoA). Typically, your scalar code might have data laid out as Array-of-Structures (AoS). If that is the case, converting from scalar to vectorized form will be difficult (see the sketch below the link).
More reading: https://software.intel.com/en-us/articles/creating-a-particle-system-with-streaming-simd-extensions
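A minimal sketch of the AoS versus SoA distinction (the particle layout here is purely an illustration, not taken from the question):

// Array-of-Structures: natural for scalar code, but the x values of four
// consecutive particles are not contiguous, so SSE loads are awkward.
struct ParticleAoS { float x, y, z, w; };
// ParticleAoS particles[N];

// Structure-of-Arrays: each component is contiguous, so four particles'
// worth of x can be loaded into one __m128 with a single load.
struct ParticlesSoA
{
    float* x;   // N floats
    float* y;   // N floats
    float* z;   // N floats
    float* w;   // N floats
};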
This is purely a theoretical problem, nothing I have really found myself in, but it has piqued my curiosity and wanted to see if anyone has a better solution for it:
How do you portably guarantee that a specific file format / network protocol or whatever conforms to a specific bit pattern?
Say we have a file format that uses a 64 bit header struct immediately followed by a variable length array of 32 bit structures:
Header: magic : 32 bits
        count : 32 bits
Field:  id    : 16 bits
        data  : 16 bits
My first instinct would be to write something like:
struct Field
{
    uint16_t id;
    uint16_t data;
};
Except that our compiler may decide that padding is advisable and we end up with a 64 bit structure. So our next bet is:
using Field = uint16_t[2];
and work on that.
That is, unless someone has carefully read the standard and noticed that uint16_t is optional. At this point our next best friend is uint_least16_t, which is guaranteed to be at least 16 bits long, but for all we know could be 20 bits long on a processor with 10-bit chars.
At this point, the only real solution I can come up with is some sort of bit stream, capable of reading and writing specific amounts of bits, and adaptable by std::numeric_limits.
So, is there someone out there who has very carefully read the standard and found the point I'm missing? Or is this the only real way of having a portable guarantee?
Notes:
- I've just realized that endianness would probably add another layer of complexity.
- I'm using the current working draft of the ISO standard (N3797).
How do you portably guarantee that a specific file format / network protocol or whatever conforms to a specific bit pattern?
You can't. Not in C++, which was standardized against an abstract platform where little more than the existence of a "byte" made up of bits can be assumed. We can't even say for certain, looking only at the Standard, how many bits are in a char. You can use bitfields for everything, as bits are indivisible, but then you'll have padding to contend with at the least.
Sometimes it is best to give up on the idea of absolute Standards conformance for the sake of conformance, and look to other means to get the job done efficiently and effectively. In this case, platform specifics in combination with almost absolute Standards conformance (aka, good programming practices) will set you free.
Every platform I work on regularly (Linux and Windows) provides a means to regulate the padding the compiler will actually apply. For network communications, under Linux and Windows I use:
#pragma pack (push, 1)
as a preface to all the data structures I'm going to send over the wire. Endianness is indeed another challenge, but one more or less easily dealt with using other resources provided by every platform: ntohl and the like.
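As a sketch, here is what that looks like for the header/field layout from the question (the #pragma is compiler-specific but is understood by MSVC, GCC, and Clang; the static_asserts are just a cheap compile-time check of the layout):

#include <cstdint>

#pragma pack(push, 1)   // no padding inserted between members
struct Header
{
    std::uint32_t magic;
    std::uint32_t count;
};

struct Field
{
    std::uint16_t id;
    std::uint16_t data;
};
#pragma pack(pop)

static_assert(sizeof(Header) == 8, "Header must be exactly 64 bits");
static_assert(sizeof(Field)  == 4, "Field must be exactly 32 bits");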
Standards conformance is a laudable goal, and indeed in a code review I would reject most code that is non-conformant. The lack of conformance is really just a moniker for the rejection however; not the reason itself. The actual reason for the rejection is in large part difficulty in maintaining and porting non-conformant code when moving to another platform, or indeed even just upgrading the compiler on the same platform. Non-conformant code might compile and even appear to work, but it will very often fail in subtle and miserable ways when you least expect it, even after thorough testing.
The moral of the story is:
You should always write Standards-conformant code, except when you
shouldn't.
This really is just a re-imagining of Einstein's articulation of Occam's Razor:
Make everything as simple as possible, but no simpler.
If you want to ensure portability to everything standard-conforming, including platforms for which CHAR_BIT isn't 8, well, you've got your work cut out for you.
If you are comfortable limiting yourself to 98% of the computers you'll ever program, I recommend writing explicit serialization for anything that has to adhere to a particular wire-format. That includes breaking integers into bytes, etc.
Write appropriate abstractions around things and the code won't be too bad. Don't put shifts and masks everywhere. Encapsulate it.
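A minimal sketch of that kind of explicit serialization, assuming big-endian (network) byte order on the wire; the function names are illustrative:

#include <cstdint>
#include <vector>

// Append a 32-bit value to a byte buffer, most significant byte first.
// Works regardless of the host's endianness or struct padding rules.
void put_u32(std::vector<std::uint8_t>& out, std::uint32_t v)
{
    out.push_back(static_cast<std::uint8_t>(v >> 24));
    out.push_back(static_cast<std::uint8_t>(v >> 16));
    out.push_back(static_cast<std::uint8_t>(v >> 8));
    out.push_back(static_cast<std::uint8_t>(v));
}

void put_u16(std::vector<std::uint8_t>& out, std::uint16_t v)
{
    out.push_back(static_cast<std::uint8_t>(v >> 8));
    out.push_back(static_cast<std::uint8_t>(v));
}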
I would use network types and network byte order. See this link: http://www.beej.us/guide/bgnet/output/html/multipage/htonsman.html. The example uses uint16_t. You can write the values one field at a time to prevent padding.
Or, if you want to read and write the entire structure at once, see this link: C++ struct alignment question
Make the structure easy for the program to use.
Provide input methods that extract data from the input and write to the data members. This removes the issue of padding, alignment boundaries and endianness. Similarly with output.
For example, if your input data is 16 bits wide but your platform is 32 bits wide, declare the structure using 32-bit fields and copy the 16 bits from the input into the 32-bit fields (see the sketch below).
Most programs read into a structure fewer times than they access the data members. Your program is not reading the input 100% of the time.
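A minimal sketch of that approach for the 16-bit fields from the question, assuming big-endian fields in the input buffer; the struct and function names are illustrative:

#include <cstdint>

// In-memory representation: convenient 32-bit members, regardless of the
// 16-bit wire layout. The reader hides padding, alignment and endianness.
struct FieldRec
{
    std::uint32_t id;
    std::uint32_t data;
};

FieldRec read_field(const std::uint8_t* buf)   // buf points at 4 input bytes
{
    FieldRec f;
    f.id   = (std::uint32_t(buf[0]) << 8) | buf[1];
    f.data = (std::uint32_t(buf[2]) << 8) | buf[3];
    return f;
}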
I was wondering whether double might be faster than float on some machines.
However, the operations I am performing really only require the precision of float. They are image-processing operations, and I would like to use the fastest type possible.
Can I use float everywhere and trust that the optimizing VC++ 2008 compiler will convert it to double if it deems that more appropriate? I don't see how this would break code.
Thanks in advance!
No, the compiler will not change a fundamental type like float to a double for optimization.
If you think this is likely, use a typedef for your floating-point type in a common header, e.g. typedef float FASTFLOAT; and use FASTFLOAT (or whatever you name it) throughout your code. You can then change the one central typedef to change the type throughout your code.
My own experience is that float and double are basically comparable in performance on x86/x64 platforms now for math operations, and I tend to prefer double. If you are processing a lot of data (and hitting memory bandwidth issues, instead of computationally bound), you may get some performance benefit from the fact that floats are half the size of doubles.
You will also want to explore the effects of the various optimization flags. Depending on your target platform requirements, you may be able to optimize more aggressively.
Firstly, the compiler doesn't change float types unless it has to, and never in storage declarations.
float will be no slower than double, but if you really want fast processing, you need to look into either using a compiler that can generate SSE2 or SSE3 code, or writing your heavy-processing routines using those instructions yourself. IIRC, there are tools that can help you micromanage the processor's pipeline if necessary. Last I messed with this (years ago), Intel had a library called IPP that could help as well by vectorizing your math.
I have never heard of an architecture where float was slower than double, if only for the fact that memory bandwidth requirements double if you use double. Any FPU that can do a single-cycle double operation can do a single-cycle float operation with a little modification at most.
Mark's got a good idea, though: profile your code if you think it's slow. You might find the real problem is somewhere else, like hidden typecasts or function-call overhead from something you thought was inlined not getting inlined.
When the code needs to store the variable in memory, chances are that on most architectures it will take 32 bits for a float and 64 bits for a double. Having to convert between the two sizes on every store would get in the way of full optimization.
Are you sure that the floating point math is the bottleneck in your application? Perhaps profiling would reveal another possible source of improvement.
SSE and/or 3DNow! have vector instructions, but what do they optimize in practice? Are 8-bit characters processed 4 at a time instead of 1 at a time, for example? Are there optimizations for some arithmetic operations? Does the word size (16 bits, 32 bits, 64 bits) have any effect?
Do all compilers use them when they are available?
Does one really have to understand assembly to use SSE instructions? Does knowing about electronics and gate logic help in understanding this?
Background: SSE has both vector and scalar instructions. 3DNow! is dead.
It is uncommon for any compiler to extract a meaningful benefit from vectorization without the programmer's help. With programming effort and experimentation, one can often approach the speed of pure assembly, without actually mentioning any specific vector instructions. See your compiler's vector programming guide for details.
There are a couple portability tradeoffs involved. If you code for GCC's vectorizer, you might be able to work with non-Intel architectures such as PowerPC and ARM, but not other compilers. If you use Intel intrinsics to make your C code more like assembly, then you can use other compilers but not other architectures.
Electronics knowledge will not help you. Learning the available instructions will.
In the general case, you can't rely on compilers to use vectorized instructions at all. Some do (Intel's C++ compiler does a reasonable job of it in many simple cases, and GCC attempts to do so too, with mixed success).
But the idea is simply to apply the same operation to four 32-bit words (or two 64-bit values in some cases).
So instead of the traditional add instruction, which adds together the values from two different 32-bit registers, you can use a vectorized add, which uses special 128-bit wide registers containing four 32-bit values each and adds them together as a single operation.
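For example, a minimal sketch with SSE2 intrinsics (the single call below compiles to one paddd instruction; the function name is illustrative):

#include <emmintrin.h>  // SSE2

// Adds four pairs of 32-bit integers in a single vector operation.
__m128i add4(__m128i a, __m128i b)
{
    return _mm_add_epi32(a, b);
}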
Duplicate of other questions:
Using SSE instructions
In short, SSE is short for Streaming SIMD Extensions, where SIMD = Single Instruction, Multiple Data. This is useful for performing a single mathematical or logical operation on many values at once, as is typically done for matrix or vector math operations.
The compiler can target this instruction set as part of its optimizations (research your /O options), however you typically have to restructure code and either code SSE manually, or use a library like Intel Performance Primitives to really take advantage of it.
If you know what you are doing, you might get a huge performance boost. See for example here, where this guy improved the performance of his algorithm six-fold.
I need to pack some bits in a byte in this fashion:
struct
{
    char bit0 : 1;
    char bit1 : 1;
} a;
if( a.bit1 ) /* etc */
or:
if( a & 0x2 ) /* etc */
From the source code clarity it's pretty obvious to me that bitfields are neater. But which option is faster? I know the speed difference won't be too much if any, but as I can use any of them, if one's faster, better.
On the other hand, I've read that bitfields are not guaranteed to arrange bits in the same order across platforms, and I want my code to be portable.
Notes: If you plan to answer 'Profile', OK, I will, but as I'm lazy, if someone already has the answer, so much the better.
The code may be wrong; you can correct me if you want, but remember what the point of this question is and please try to answer it too.
Bitfields make the code much clearer if they are used appropriately. I would use bitfields as a space saving device only. One common place I've seen them used is in compilers: Often type or symbol information consists of a bunch of true/false flags. Bitfields are ideal here since a typical program will have many thousands of these nodes created when it is compiled.
I wouldn't use bitfields to do a common embedded programming job: reading and writing device registers. I prefer using shifts and masks here because you get exactly the bits the documentation tells you you need and you don't have to worry about differences in various compilers' implementations of bitfields.
As for speed, a good compiler will give the same code for bitfields that masking will.
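A minimal sketch of the shift-and-mask style mentioned above (the register layout here is made up purely for illustration):

#include <cstdint>

// Status register layout from a hypothetical datasheet:
//   bit 0: READY, bit 1: ERROR, bits 4-7: CHANNEL
constexpr std::uint8_t READY_MASK    = 0x01;
constexpr std::uint8_t ERROR_MASK    = 0x02;
constexpr std::uint8_t CHANNEL_MASK  = 0xF0;
constexpr int          CHANNEL_SHIFT = 4;

bool is_ready(std::uint8_t reg)  { return (reg & READY_MASK) != 0; }
bool has_error(std::uint8_t reg) { return (reg & ERROR_MASK) != 0; }
int  channel(std::uint8_t reg)   { return (reg & CHANNEL_MASK) >> CHANNEL_SHIFT; }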
I would rather use the second example, for maximum portability. As Neil Butterworth pointed out, bitfield layout is specific to the native processor. Think about it: if Intel's x86 went out of business tomorrow, the code would be stuck, which means having to re-implement the bitfields for another processor, say, a RISC chip.
You have to look at the bigger picture and ask how OpenBSD managed to port their BSD systems to a lot of platforms using one codebase. OK, I'll admit that is a bit over the top, and debatable and subjective, but realistically speaking, if you want to port the code to another platform, the way to do it is by using the second example from your question.
Not only that, compilers for different platforms have their own ways of padding and aligning bitfields for the processor they target. And furthermore, what about the endianness of the processor?
Never rely on bitfields as a magic bullet. If you want speed for the processor and will be fixed on it, i.e. no intention of porting, then feel free to use bitfields. You cannot have both!
C bitfields were stillborn from the moment they were invented - for unknown reasons. People didn't like them and used bitwise operators instead. You have to expect other developers not to understand C bitfield code.
With regard to which is faster: Irrelevant. Any optimizing compiler (which means practically all) will make the code do the same thing in whatever notation. It's a common misconception of C programmers that the compiler will just search-and-replace keywords into assembly. Modern compilers use the source code as a blueprint to what shall be achieved and then emit code that often looks very different but achieves the intended result.
The first one is explicit, and whatever the speed, the second expression is error-prone because any change to your struct might make the second expression wrong.
So use the first.
If you want portability, avoid bitfields. And if you are interested in performance of specific code, there is no alternative to writing your own tests. Remember, bitfields will be using the processor's bitwise instructions under the hood.
I think a C programmer would tend toward the second option, using bit masks and logic operations to deduce the value of each bit. Rather than having the code littered with hex values, enumerations would be set up, or, usually when more complex operations are involved, macros to get/set specific bits. I've heard on the grapevine that struct-implemented bitfields are slower.
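A minimal sketch of that style, using an enumeration for the flags (the names are illustrative):

enum Flags
{
    FLAG_READY = 1 << 0,
    FLAG_ERROR = 1 << 1
};

inline bool     get_flag(unsigned value, Flags f)   { return (value & f) != 0; }
inline unsigned set_flag(unsigned value, Flags f)   { return value | f; }
inline unsigned clear_flag(unsigned value, Flags f) { return value & ~static_cast<unsigned>(f); }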
Don't read too much into the "non-portable bit-fields" claim. There are two aspects of bit fields which are implementation-defined: the signedness and the layout; and one unspecified one: the alignment of the allocation unit in which they are packed. If you don't need anything other than the packing effect, using them is as portable (provided you explicitly specify the signed keyword where needed) as function calls, which also have unspecified properties.
Concerning performance, profiling is the best answer you can get. In a perfect world, there would be no difference between the two forms. In practice, there can be some, but I can think of as many reasons in one direction as in the other. And it can be very sensitive to context (a logically meaningless difference between unsigned and signed, for instance), so measure in context...
To sum up, the difference is thus mostly a style difference in cases where you truly have the choice (i.e. not when the precise layout is important). In those cases it is an optimization (in size, not in speed), so I'd tend to write the code without it first and add it afterwards if needed. Bit fields are then the obvious choice: the modifications needed to achieve the result are the smallest ones, and they are contained in the single place of definition instead of being spread over all the places of use.