Do bit operations cause programs to run slower? - c++

I'm dealing with a problem that involves a lot of data. Currently the values are represented as unsigned int, but I know that the real values never exceed 1000.
Questions
I can use unsigned short to store it. An upside to this is that it will use less storage space per value. Will performance suffer?
If I decided to store data as short but all the calling functions use int, I realize that I need to convert between these data types when storing or extracting values. Will performance suffer? Will the loss in performance be dramatic?
What if I decided not to use short at all, but instead packed the values as 10 bits each into an array of unsigned int? How would this case compare with the previous ones?

This all depends on the architecture. Bit-fields are generally slower, but if you are able to significantly cut down memory usage with them, you can even gain performance due to better CPU caching and similar effects. Likewise with short (though the difference is not dramatic in any case).
The best way is to make your source code able to switch representation easily (at compilation time, of course). Then you will be able to test and profile different implementations in your specific circumstances just by, say, changing one #define.
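For instance, a minimal sketch of such a compile-time switch (the alias name and array size here are just placeholders):

    #include <cstdint>

    // Flip this #define (or pass -DUSE_SHORT_STORAGE to the compiler) to
    // switch representations, then profile each variant under a real workload.
    //#define USE_SHORT_STORAGE

    #ifdef USE_SHORT_STORAGE
    using value_type = unsigned short;   // compact storage
    #else
    using value_type = unsigned int;     // "natural" storage
    #endif

    value_type data[100000];             // placeholder array of measurements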
Also, don't forget the premature-optimization rule: make it work first. Only if it turns out to be slow or not fast enough should you try to speed it up.

I can use unsigned short to store it.
Yes, you can use unsigned short (assuming (sizeof(unsigned short) * CHAR_BIT) >= 10).
An upside to this is that it will use less storage space per value.
Less than what? Less than int? That depends on what sizeof(int) is on your system.
Will performance suffer?
Depends. The type int is supposed to be the most efficient integer type for your system, so using short instead may affect your performance. Whether it does will depend on the system. Time it and find out.
If I decided to store data as short but all the calling functions use int, I realize that I need to convert between these data types when storing or extracting values.
Yes. But the compiler will do the conversion automatically. One thing you need to watch, though, is conversion between signed and unsigned types: if the value does not fit, the exact result may be implementation-defined.
Will performance suffer?
Maybe. If sizeof(unsigned int) == sizeof(unsigned short), then probably not. Time it and see.
Will the loss in performance be dramatic?
Time it and see.
What if I decided not to use short at all, but instead packed the values as 10 bits each into an array of unsigned int? How would this case compare with the previous ones?
Time it and see.
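For what it's worth, a minimal timing harness for such experiments might look like this (a sketch using <chrono>; the workload is a placeholder):

    #include <chrono>
    #include <iostream>

    int main()
    {
        auto start = std::chrono::steady_clock::now();
        // ... run the code under test here (placeholder workload) ...
        auto stop = std::chrono::steady_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                  << " ms\n";
    }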

A good compromise for you is probably packing three values into a 32-bit int (with two bits unused). Untangling 10 bits from a bit array is a lot more expensive, and doesn't save much space. You can either use bit-fields, or do it by hand yourself:
    (i & 0x3FF)                                 // Get i[0]
    (i >> 10) & 0x3FF                           // Get i[1]
    (i >> 20) & 0x3FF                           // Get i[2]
    i = (i & 0x3FFFFC00) | (j & 0x3FF)          // Set i[0] to j
    i = (i & 0x3FF003FF) | ((j & 0x3FF) << 10)  // Set i[1] to j
    i = (i & 0xFFFFF)    | ((j & 0x3FF) << 20)  // Set i[2] to j
You can see here how much extra expense it is: a bit operation and 2/3 of a shift (on average) for get, and three bit operations and 2/3 of a shift (on average) to set. Probably not too bad, especially if you're mostly getting the values not setting them.
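Wrapped into helpers, the same scheme might look like this (a sketch that computes the mask from the slot index k in [0, 2] instead of hard-coding it):

    #include <cstdint>

    // Extract the k-th 10-bit value from a 32-bit word.
    inline unsigned get10(std::uint32_t word, int k)
    {
        return (word >> (10 * k)) & 0x3FF;
    }

    // Return word with its k-th 10-bit slot replaced by v.
    inline std::uint32_t set10(std::uint32_t word, int k, unsigned v)
    {
        std::uint32_t mask = std::uint32_t(0x3FF) << (10 * k);
        return (word & ~mask) | ((std::uint32_t(v) & 0x3FF) << (10 * k));
    }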

Related

Advantages/Disadvantages of using __int16 (or int16_t) over int

As far as I understand, the number of bytes used for int is system dependent. Usually, 2 or 4 bytes are used for int.
As per Microsoft's documentation, __int8, __int16, __int32 and __int64 are Microsoft-specific keywords. Furthermore, __int16 uses 16 bits (i.e. 2 bytes).
Question: What are advantage/disadvantage of using __int16 (or int16_t)? For example, if I am sure that the value of my integer variable will never need more than 16 bits then, will it be beneficial to declare the variable as __int16 var (or int16_t var)?
UPDATE: I see that several comments/answers suggest using int16_t instead of __int16, which is a good suggestion but not really an advantage/disadvantage of using __int16. Basically, my question is: what is the advantage/disadvantage of saving 2 bytes by using a 16-bit integer instead of int?
Saving 2 bytes is almost never worth it. However, saving thousands of bytes is. If you have a large array of integers, using a small integer type can save quite a lot of memory. This leads to faster code, because the less memory you use, the fewer cache misses you incur (and cache misses are a major loss of performance).
TL;DR: this is beneficial for large arrays, but pointless for one-off variables.
The second use of these types is for dealing with binary files and messages. If you are reading a binary file that uses 16-bit integers, it's pretty convenient if you can represent that type exactly in your code.
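For example, a sketch of reading such a file (the function name and element count are assumptions here, and error handling is omitted):

    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <vector>

    std::vector<std::int16_t> readSamples(const char* path, std::size_t count)
    {
        std::vector<std::int16_t> samples(count);
        std::ifstream in(path, std::ios::binary);
        // Each element is guaranteed to be exactly 16 bits wide, matching
        // the on-disk format (endianness permitting).
        in.read(reinterpret_cast<char*>(samples.data()),
                static_cast<std::streamsize>(samples.size() * sizeof(std::int16_t)));
        return samples;
    }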
BTW, don't use Microsoft's versions. Use the standard ones (std::int16_t).
It depends.
On x86, primitive types are generally aligned on their size. So 2-byte types would be aligned on a 2-byte boundary. This is useful when you have more than one of these short variables, because you will be saving 50% of space. That directly translates to better memory and cache utilization and thus theoretically, better performance.
On the other hand, doing arithmetic on shorter-than-int types usually involves a widening conversion to int. So if you do a lot of arithmetic on these types, using int might result in better performance (contrived example).
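For instance, a small sketch of that promotion rule (the static_assert just documents what the compiler already does):

    #include <cstdint>
    #include <type_traits>

    std::int16_t a = 1000, b = 2000;
    // Both operands are promoted to int before the addition,
    // so the result type is int, not int16_t.
    static_assert(std::is_same<decltype(a + b), int>::value, "promoted to int");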
So if you care about performance of a critical section of code, profile it to find out for sure if using a certain data type is faster or slower.
A possible rule of thumb: if you're memory-bound (i.e. you have lots of variables, and especially arrays), use data types as short as possible. If not, don't worry about it and use int types.
If for some reason you just need a shorter integer type, the language already has one: short. Unless you know you need exactly 16 bits, there's really no good reason not to stick with the size-agnostic short and int types. The broad idea is that these types should align well with the target architecture (for example, its word size).
That being said, there's no need to use the platform-specific type (__int16); you can just use the standard one:
int16_t
See https://en.cppreference.com/w/cpp/types/integer for more information on the standard fixed-width types.
Even if you still insist on __int16, you probably want a type alias, something like:
using my_short = __int16;
Update
Your main question is:
What is the advantage/disadvantage of saving 2 bytes by using a 16-bit integer instead of int?
If you have a lot of data (in the ballpark of at least 100,000-1,000,000 elements, as a rule of thumb), then there could be an overall performance gain from using less CPU cache. Overall there's no disadvantage to using a smaller type, except for the obvious one, plus the possible conversions explained in this answer.
The main reason for using these types is to be sure about the size of your variables across different architectures and compilers. We call this "code reusability" and "portability".
In higher-level modern languages, all of this is handled by the compiler/interpreter/virtual machine/etc., so you don't need to worry about it, but that comes with performance and memory-usage costs.
When you have some kind of limitation, you may need to optimize everything. The best example is embedded systems, which have a very limited amount of memory and run at low frequencies. On the other hand, there are lots of compilers out there with different implementations: some of them interpret int as a 16-bit value and some as 32-bit.
For example, you receive a specific stream of values over a communication channel, you want to save them in a buffer or array, and you want to make sure the input data is always interpreted as 16 bits, nothing else.
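A sketch of that situation (the buffer name and size are arbitrary):

    #include <cstdint>

    // Each element is exactly 16 bits on every conforming platform and
    // compiler, so the received stream is interpreted the same way everywhere.
    std::uint16_t rxBuffer[256];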

Why does QVector::size return int?

std::vector::size() returns a size_type which is unsigned and usually the same as size_t, e.g. it is 8 bytes on 64bit platforms.
In contrast, QVector::size() returns an int, which is usually 4 bytes even on 64-bit platforms, and on top of that it is signed, which means it can only go halfway to 2^32.
Why is that? This seems quite illogical and also technically limiting. While it is not very likely that you will ever need more than 2^32 elements, the use of a signed int cuts that range in half for no apparent good reason. Perhaps to avoid compiler warnings for people too lazy to declare i as a uint rather than an int, someone decided that making all containers return a size type that makes no sense is a better solution? The reason could not possibly be that dumb?
This has been discussed several times since Qt 3 at least, and the QtCore maintainer stated a while ago that no change would happen until Qt 7, if it ever does.
When the discussion was going on back then, I thought that someone would bring it up on Stack Overflow sooner or later... and probably on several other forums and Q/A, too. Let us try to demystify the situation.
In general, you need to understand that there is no better or worse here, as QVector is not a replacement for std::vector. The latter does not do any copy-on-write (COW), and that comes with a price. QVector is meant for a different use case, basically: it is mostly used inside Qt applications and the framework itself, initially for QWidgets in the early days.
size_t has its own issues, too, which I will point out below.
Rather than interpreting the maintainer for you, I will just quote Thiago directly to convey the official stance:
For two reasons:
1) it's signed because we need negative values in several places in the API: indexOf() returns -1 to indicate a value not found; many of the "from" parameters can take negative values to indicate counting from the end. So even if we used 64-bit integers, we'd need the signed version of it. That's the POSIX ssize_t or the Qt qintptr.

This also avoids sign-change warnings when you implicitly convert unsigneds to signed:

    -1 + size_t_variable   => warning
    size_t_variable - 1    => no warning
2) it's simply "int" to avoid conversion warnings or ugly code related to the use of integers larger than int.

io/qfilesystemiterator_unix.cpp:

    size_t maxPathName = ::pathconf(nativePath.constData(), _PC_NAME_MAX);
    if (maxPathName == size_t(-1))

io/qfsfileengine.cpp:

    if (len < 0 || len != qint64(size_t(len))) {

io/qiodevice.cpp:

    qint64 QIODevice::bytesToWrite() const
    {
        return qint64(0);
    }

    return readSoFar ? readSoFar : qint64(-1);
That was one email from Thiago, and then there is another where you can find a more detailed answer:
Even today, software that has a core memory of more than 4 GB (or even 2 GB) is an exception, rather than the rule. Please be careful when looking at the memory sizes of some process tools, since they do not represent actual memory usage.

In any case, we're talking here about having one single container addressing more than 2 GB of memory. Because of the implicitly shared & copy-on-write nature of the Qt containers, that will probably be highly inefficient. You need to be very careful when writing such code to avoid triggering COW and thus doubling or worse your memory usage. Also, the Qt containers do not handle OOM situations, so if you're anywhere close to your memory limit, Qt containers are the wrong tool to use.

The largest process I have on my system is qtcreator and it's also the only one that crosses the 4 GB mark in VSZ (4791 MB). You could argue that it is an indication that 64-bit containers are required, but you'd be wrong:

- Qt Creator does not have any container requiring 64-bit sizes; it simply needs 64-bit pointers.
- It is not using 4 GB of memory. That's just VSZ (mapped memory). The total RAM currently accessible to Creator is merely 348.7 MB.
- And it is using more than 4 GB of virtual space because it is a 64-bit application. The cause-and-effect relationship is the opposite of what you'd expect. As proof of this, I checked how much virtual space is consumed by padding: 800 MB. A 32-bit application would never do that; that's 19.5% of the addressable space on 4 GB.

(Padding is virtual space allocated but not backed by anything; it's only there so that something else doesn't get mapped to those pages.)
Going even further into this topic with Thiago's responses, see this:
Personally, I'm VERY happy that Qt collection sizes are signed. It seems nuts to me that an integer value potentially used in an expression using subtraction be unsigned (e.g. size_t).

An integer being unsigned doesn't guarantee that an expression involving that integer will never be negative. It only guarantees that the result will be an absolute disaster.
On the other hand, the C and C++ standards define the behaviour of unsigned overflows and underflows.

Signed integers do not overflow or underflow. I mean, they do, because the types and CPU registers have a limited number of bits, but the standards say they don't. That means the compiler will always optimise assuming you don't over- or underflow them.
Example:

    for (int i = 1; i >= 1; ++i)

This is optimised to an infinite loop because signed integers do not overflow. If you change it to unsigned, then the compiler knows that it might overflow and come back to zero.
Some people didn't like that: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475
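A side-by-side sketch of the two behaviours described above (the loop bodies are placeholders):

    void demo()
    {
        // Signed: overflow is undefined behaviour, so the compiler may
        // assume i >= 1 always holds and emit an infinite loop.
        for (int i = 1; i >= 1; ++i) { /* ... */ }

        // Unsigned: wraparound is well defined, so this loop terminates
        // once i wraps past UINT_MAX back to 0.
        for (unsigned i = 1; i >= 1; ++i) { /* ... */ }
    }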
Unsigned numbers are values mod 2^n for some n; signed numbers are bounded integers. Using unsigned values as approximations for 'positive integers' runs into the problem that common values are near the edge of the domain, where unsigned values behave differently from plain integers. The advantage is that the unsigned approximation reaches higher positive integers, and under/overflow are well defined (if random when looked at as a model of Z).
But really, ptrdiff_t would be better than int.
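A tiny sketch of that edge-of-domain behaviour:

    #include <climits>

    // Well defined, but almost certainly not what a 'positive integer'
    // model intends: 0 - 1 wraps around to the largest unsigned value.
    static_assert(0u - 1 == UINT_MAX, "unsigned arithmetic is mod 2^n");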

What C++ type to use for the fastest "for" loops?

I think this has not been answered on this site yet.
I wrote some code which goes through many combinations of 4 numbers. The values are from 0 to 51, so they can be stored in 6 bits, i.e. in 1 byte, am I right? I use these 4 numbers in nested for loops and then use them in the innermost loop. So which C++ type, among those that can store at least 52 values, is the fastest for iterating through 4 nested for loops?
The code looks like:
    for (type first = 0; first != 49; ++first)
        for (type second = first + 1; second != 50; ++second)
            for (type third = second + 1; third != 51; ++third)
                for (type fourth = third + 1; fourth != 52; ++fourth) {
                    // using those values for about 1 billion bit operations
                    // performed in further nested loops
                }
That code is very simplified, and maybe there is also a better way to do this kind of iteration; you can help me with that too.
Use the typedef std::uint_fast8_t from the header <cstdint>. It is supposed to be the "fastest" unsigned integer type with at least 8 bits.
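Plugged into the question's loops, that might look like this (a sketch; the loop body is the placeholder from the question):

    #include <cstdint>

    using type = std::uint_fast8_t;   // "fastest" unsigned type with at least 8 bits

    void enumerate()
    {
        for (type first = 0; first != 49; ++first)
            for (type second = first + 1; second != 50; ++second)
                for (type third = second + 1; third != 51; ++third)
                    for (type fourth = third + 1; fourth != 52; ++fourth) {
                        // ~1 billion bit operations using the four values
                    }
    }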
The fastest is whatever the underlying processor ALU can natively work with. Now registers may be addressable in multiple formats. In that case all those formats are equally fast.
So this becomes very processor architecture specific rather than C++ specific.
If you are working on a modern PC processor, then an int is as fast as anything else for your for loops.
On an embedded system there are more things to consider, e.g. whether the variable is stored in an aligned location or not.
On most machines, int is the fastest integer type. On all of the computers I work with, int is faster than unsigned, significantly faster than signed char.
Another issue, perhaps a bigger one, is what you are doing with those numbers. You didn't show the code, so there's no way of telling. Use int if you expect first*second to produce the expected integral value.
Yet another issue is how widely portable you expect this code to be. There's a huge distinction between code that will be ported to a number of different architectures and compilers, versus code that will be used in a limited and controlled setting. If it's the latter, write some benchmarks, and use the type under which the benchmarks perform best. The problem is a bit tougher if you are writing something for wide consumption.

Difference between uint8_t and unspecified int for large matrices

I have a matrix that is over 17,000 x 14,000 that I'm storing in memory in C++. The values will never get over 255, so I'm thinking I should store this matrix as a uint8_t type instead of a regular int type. Will the regular int type assume the native word size (64 bits, so 8 bytes per cell) even with an optimizing compiler? I'm assuming I'll use 8x less memory if I store the array as uint8_t?
If you doubt this, you could have just tried it.
Of course it will be smaller.
However, it wholly depends on your usage patterns which will be faster. Profile! Profile! Profile!
Reasons for unexpected performance considerations:
alignment issues
elements sharing cache lines (could be positive on sequential access; negative in multicore scenarios)
increased need for locking on atomic reads/writes (in case of threading)
reduced applicability of certain optimized MIPS instructions (? - I'm not up-to-date with details here; also a very good optimizing compiler might simply register-allocate temporaries of the right size)
other, unrelated border conditions, originating from the surrounding code
The standard doesn't specify the exact size of int, other than that it's at least the size of short. On some 64-bit architectures (for example many Linux and Solaris x86 systems I work with) int is 32 bits and long is 64 bits. The exact size of each type will of course vary by compiler/hardware.
The best way to find out is to use sizeof(int) on your system and see how big it is. If you have enough RAM, using the native type may in fact be significantly faster than uint8_t.
Even the best optimizing compiler is not going to do an analysis of the values of the data that you put into your matrix and assume (anthropomorphizing here) "Hmmm. He said int but everything is between 0 and 255. I'm going to make that an array of uint8_t."
The compiler can interpret some keywords, such as register and inline, as suggestions rather than mandates. Types, on the other hand, are mandates. You told the compiler to use int, so the compiler must use int. Switching to a uint8_t matrix will therefore save you a considerable amount of memory here.
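As a rough sketch of the savings (assuming sizeof(int) == 4, as on most 64-bit desktop platforms):

    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    int main()
    {
        const std::size_t rows = 17000, cols = 14000;
        std::cout << "int matrix:     " << rows * cols * sizeof(int) << " bytes\n";
        std::cout << "uint8_t matrix: " << rows * cols * sizeof(std::uint8_t) << " bytes\n";
        // With a 4-byte int this is a 4x saving (not 8x), since int is
        // usually 32 bits even on 64-bit systems.
    }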

Which of these two methods of converting this array to an integer would you suggest?

Consider the following array of bytes that is intended to be converted into a single unsigned integer:
unsigned char arr[3] = {0x23, 0x45, 0x67};
Each byte represents the corresponding byte of the integer. Now, which one of the following methods would you suggest, especially performance-wise?
    unsigned int val1 = arr[2] << 16 | arr[1] << 8 | arr[0];
    // or
    unsigned int val2 = arr[0];
    *((char *)&val2 + 1) = arr[1];
    *((char *)&val2 + 2) = arr[2];
I prefer the first method because it is portable. The second isn't due to endianness issues.
This depends on your specific processor, a lot.
For example, on the PowerPC, the second form -- writing through the character pointers -- runs into a tricky implementation detail called a load-hit-store. This is a CPU stall that occurs when you store to a location in memory, then read it back again before the store has completed. The load op cannot complete until the store has finished (most PPCs do not have memory store-forwarding), and the store may take many cycles to make it from the CPU out to the memory cache.
Because of the way the store and arithmetic units are arranged in the pipeline, the CPU will have to flush the pipeline completely until the store completes: this can be a stall of twenty cycles or more during which the CPU has stopped dead. In general, writing to memory and then reading it back immediately is very bad on this platform. So on this case, the sequential bitshifts will be much faster, as they all occur on registers, and will not incur a pipeline stall.
On the Pentium series, the situation may be entirely reversed, because that chipset does have store forwarding and a fast stack architecture, and relatively few architectural registers. On the Core Duos and i7s, it may reverse yet again, because their pipelines are very deep.
Remember: it is not the case that every opcode takes one cycle. CPUs are not simple, and things like superscalar pipes and data hazards may cause instructions to take many cycles, or even many instructions to occur per cycle, depending on just how you arrange your code.
All of this just to underscore the point: this sort of optimization is extremely specific to a particular compiler and chipset. So you must compile, test and measure.
The first is faster when translated to x86 asm, though it depends on your architecture anyway. Usually compilers are able to optimize the first expression very well, and it's more portable too.
The performance depends on the compiler and the machine. For example, in my experiment with gcc 4.4.5 on x64 the second was marginally faster, while others report the first as being faster. Therefore I recommend sticking with the first one, because it is cleaner (no casts) and safer (no endianness issues).
I believe the bit-shift version will be the fastest solution. In my mind the CPU can just slide the values in, whereas going directly through the address, as in your second example, forces it to use many temporary stores.
I would suggest a solution with a union:
    union color {
        // first representation (member of union)
        struct s_color {
            unsigned char a, b, g, r;
        } uc_color;
        // second representation (member of union)
        unsigned int int_color;
    };

    int main()
    {
        color c;
        c.int_color = 0x23567899;
        // Read the individual bytes back; on a little-endian machine
        // a == 0x99, b == 0x78, g == 0x56, r == 0x23.
        unsigned char a = c.uc_color.a;
        unsigned char b = c.uc_color.b;
        unsigned char g = c.uc_color.g;
        unsigned char r = c.uc_color.r;
    }
Take care that this is platform-dependent (endianness). Note also that reading a union member other than the one most recently written is technically undefined behaviour in C++, although most compilers support it as an extension.