What C++ type to use for the fastest "for" loops?

I think this is not answered on this site yet.
I wrote code that goes through many combinations of 4 numbers. The values range from 0 to 51, so they fit in 6 bits, and therefore in 1 byte, am I right? I use these 4 numbers in nested for loops and then use them in the innermost loop. So, of the C++ types that can store at least 52 values, which is the fastest for iterating through 4 nested for loops?
The code looks like:
for(type first = 0; first != 49; ++first)
    for(type second = first+1; second != 50; ++second)
        for(type third = second+1; third != 51; ++third)
            for(type fourth = third+1; fourth != 52; ++fourth) {
                // using those values for about 1 billion bit operations done in further nested loops
            }
That code is very simplified, and maybe there is also a better way to do this kind of iteration; you are welcome to help me with that as well.

Use the typedef std::uint_fast8_t from the header <cstdint>. It is supposed to be the "fastest" unsigned integer type with at least 8 bits.
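As a minimal sketch (using only the loop bounds from the question, nothing else assumed), the loops would look like this:

#include <cstdint>

void enumerate_combinations() {
    for (std::uint_fast8_t first = 0; first != 49; ++first)
        for (std::uint_fast8_t second = first + 1; second != 50; ++second)
            for (std::uint_fast8_t third = second + 1; third != 51; ++third)
                for (std::uint_fast8_t fourth = third + 1; fourth != 52; ++fourth) {
                    // use first..fourth for the bit operations here
                }
}

Note that on many desktop implementations uint_fast8_t is simply a typedef for unsigned char or int, so the generated code may be identical to what a plain int would give you.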

The fastest type is whatever the underlying processor's ALU can natively work with. Registers may be addressable in multiple widths; in that case, all of those widths are equally fast.
So this becomes very processor architecture specific rather than C++ specific.
If you are working on a modern day PC processor then an int is as fast as anything else for your for loops.
On an embedded system there are more things to consider, e.g. whether the variable is stored in an aligned location or not.

On most machines, int is the fastest integer type. On all of the computers I work with, int is faster than unsigned, and significantly faster than signed char.
Another issue, perhaps a bigger one, is what you are doing with those numbers. You didn't show the code, so there's no way of telling. Use int if you expect first*second to produce the expected integral value.
Yet another issue is how widely portable you expect this code to be. There's a huge distinction between code that will be ported to a number of different architectures and compilers, versus code that will be used in a limited and controlled setting. If it's the latter, write some benchmarks, and use the type under which the benchmarks perform best. The problem is a bit tougher if you are writing something for wide consumption.
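A rough benchmarking harness along those lines might look like the sketch below; the XOR in the inner loop is only a hypothetical stand-in for the real bit operations, so substitute your own kernel before drawing any conclusions:

#include <chrono>
#include <cstdint>
#include <iostream>

template <typename T>
long long run_once() {
    long long count = 0;
    for (T first = 0; first != 49; ++first)
        for (T second = first + 1; second != 50; ++second)
            for (T third = second + 1; third != 51; ++third)
                for (T fourth = third + 1; fourth != 52; ++fourth)
                    count += first ^ second ^ third ^ fourth;   // stand-in for the real work
    return count;
}

template <typename T>
void time_type(const char* name) {
    auto start = std::chrono::steady_clock::now();
    volatile long long sink = run_once<T>();   // volatile discourages the compiler from discarding the result
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    std::cout << name << ": "
              << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
              << " us\n";
}

int main() {
    time_type<int>("int");
    time_type<unsigned int>("unsigned int");
    time_type<std::uint_fast8_t>("std::uint_fast8_t");
}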

Related

Why QVector::size returns int?

std::vector::size() returns a size_type which is unsigned and usually the same as size_t, e.g. it is 8 bytes on 64bit platforms.
In contrast, QVector::size() returns an int, which is usually 4 bytes even on 64-bit platforms, and which is signed on top of that, so it can only go halfway to 2^32.
Why is that? It seems quite illogical and also technically limiting: while it is not very likely that you will ever need more than 2^32 elements, using a signed int cuts that range in half for no apparent good reason. Perhaps it is to avoid compiler warnings for people too lazy to declare i as a uint rather than an int, and someone decided that making all containers return a size type that makes no sense was the better solution? The reason could not possibly be that dumb?
This has been discussed several times since at least Qt 3, and a while ago the QtCore maintainer stated that no change would happen before Qt 7, if it ever does.
When the discussion was going on back then, I thought that someone would bring it up on Stack Overflow sooner or later... and probably on several other forums and Q/A, too. Let us try to demystify the situation.
In general, you need to understand that there is no better or worse here, as QVector is not a replacement for std::vector. The latter does not do any Copy-On-Write (COW), and that comes with a price. QVector is meant for a different use case, basically: it is mostly used inside Qt applications and the framework itself, initially for QWidgets in the early days.
size_t has its own issues, too, as I will indicate below.
Rather than interpreting the maintainer for you, I will quote Thiago directly to convey the official stance:
For two reasons:
1) it's signed because we need negative values in several places in the API:
indexOf() returns -1 to indicate a value not found; many of the "from"
parameters can take negative values to indicate counting from the end. So even
if we used 64-bit integers, we'd need the signed version of it. That's the
POSIX ssize_t or the Qt qintptr.
This also avoids sign-change warnings when you implicitly convert unsigneds to
signed:
-1 + size_t_variable => warning
size_t_variable - 1 => no warning
2) it's simply "int" to avoid conversion warnings or ugly code related to the
use of integers larger than int.
io/qfilesystemiterator_unix.cpp
size_t maxPathName = ::pathconf(nativePath.constData(), _PC_NAME_MAX);
if (maxPathName == size_t(-1))
io/qfsfileengine.cpp
if (len < 0 || len != qint64(size_t(len))) {
io/qiodevice.cpp
qint64 QIODevice::bytesToWrite() const
{
return qint64(0);
}
return readSoFar ? readSoFar : qint64(-1);
That was one email from Thiago; there is another with a more detailed answer:
Even today, software that has a core memory of more than 4 GB (or even 2 GB)
is an exception, rather than the rule. Please be careful when looking at the
memory sizes of some process tools, since they do not represent actual memory
usage.
In any case, we're talking here about having one single container addressing
more than 2 GB of memory. Because of the implicitly shared & copy-on-write
nature of the Qt containers, that will probably be highly inefficient. You need
to be very careful when writing such code to avoid triggering COW and thus
doubling or worse your memory usage. Also, the Qt containers do not handle OOM
situations, so if you're anywhere close to your memory limit, Qt containers
are the wrong tool to use.
The largest process I have on my system is qtcreator and it's also the only
one that crosses the 4 GB mark in VSZ (4791 MB). You could argue that it is an
indication that 64-bit containers are required, but you'd be wrong:
Qt Creator does not have any container requiring 64-bit sizes, it simply
needs 64-bit pointers
It is not using 4 GB of memory. That's just VSZ (mapped memory). The total
RAM currently accessible to Creator is merely 348.7 MB.
And it is using more than 4 GB of virtual space because it is a 64-bit
application. The cause-and-effect relationship is the opposite of what you'd
expect. As a proof of this, I checked how much virtual space is consumed by
padding: 800 MB. A 32-bit application would never do that, that's 19.5% of the
addressable space on 4 GB.
(padding is virtual space allocated but not backed by anything; it's only
there so that something else doesn't get mapped to those pages)
Going even further into this topic with Thiago's responses, see this:
Personally, I'm VERY happy that Qt collection sizes are signed. It seems
nuts to me that an integer value potentially used in an expression using
subtraction be unsigned (e.g. size_t).
An integer being unsigned doesn't guarantee that an expression involving
that integer will never be negative. It only guarantees that the result
will be an absolute disaster.
On the other hand, the C and C++ standards define the behaviour of unsigned
overflows and underflows.
Signed integers do not overflow or underflow. I mean, they do because the types
and CPU registers have a limited number of bits, but the standards say they
don't. That means the compiler will always optimise assuming you don't over-
or underflow them.
Example:
for (int i = 1; i >= 1; ++i)
This is optimised to an infinite loop because signed integers do not overflow.
If you change it to unsigned, then the compiler knows that it might overflow
and come back to zero.
Some people didn't like that: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475
unsigned numbers are values mod 2^n for some n.
Signed numbers are bounded integers.
Using unsigned values as approximations for 'positive integers' runs into the problem that common values are near the edge of the domain where unsigned values behave differently than plain integers.
The advantage is that unsigned approximation reaches higher positive integers, and under/overflow are well defined (if random when looked at as a model of Z).
But really, ptrdiff_t would be better than int.
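To illustrate the subtraction problem mentioned above, here is a small, self-contained example of how an unsigned size behaves "near the edge of the domain":

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;                       // empty, so v.size() == 0
    // Unsigned underflow: v.size() - 1 wraps around to SIZE_MAX instead of being -1,
    // so this condition is true even though the vector is empty.
    if (v.size() - 1 > 0)
        std::printf("looks like the vector has elements, but it is empty\n");
    // With a signed size (int / ptrdiff_t), size() - 1 would simply be -1
    // and the comparison would behave as expected.
}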

Library that helps with packing structures with good performance

A few days ago I heard about (and maybe even saw!) a library that helps with packing structures. Unfortunately, I can't recall its name.
In my program I have to keep a large amount of data, so I need to pack it and avoid losing bits to gaps. For example: I have to keep many numbers from the range 1...5. If I kept each one in a char, it would take 8 bits, but such a number fits in 3 bits. Moreover, if I kept these numbers in packs of 8 bits with a maximum value of 256, I could pack 51 numbers in there (instead of 1 or 2!).
Is there any library that helps with this, or do I have to do it on my own?
As Tomalak Garet'kal already mentioned, this is a feature of ANSI C, called bit-fields. The wikipedia article is quite useful. Typically you declare them as structs.
For your example: as you mentioned, you have a number in the range 0..5, so you can use 3 bits for it, which leaves 5 bits free for other use:
struct s
{
    unsigned int mynumber : 3;
    unsigned int myother : 5;
};
These can now be accessed simply like this:
struct s myinstance;
myinstance.mynumber = 3;
myinstance.myother = 1;
Be aware that bit-fields are slower than plain struct members/variables, since the generated code has to perform bit shifting/masking to access the individual bits.
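As mentioned in the question, though, you may want to pack a whole array of such small values rather than a handful of named fields. In that case a hand-rolled packed array is one option; here is a minimal sketch (the class and method names are made up for illustration) that stores 3-bit values in a byte buffer:

#include <cstddef>
#include <cstdint>
#include <vector>

// Stores values in the range 0..7 using 3 bits each.
class Packed3BitArray {
public:
    explicit Packed3BitArray(std::size_t count)
        : bits_((count * 3 + 7) / 8, 0) {}

    void set(std::size_t index, std::uint8_t value) {
        std::size_t bit = index * 3;
        for (int k = 0; k < 3; ++k, ++bit)       // write each of the 3 bits individually
            if (value & (1u << k))
                bits_[bit / 8] |= static_cast<std::uint8_t>(1u << (bit % 8));
            else
                bits_[bit / 8] &= static_cast<std::uint8_t>(~(1u << (bit % 8)));
    }

    std::uint8_t get(std::size_t index) const {
        std::uint8_t value = 0;
        std::size_t bit = index * 3;
        for (int k = 0; k < 3; ++k, ++bit)       // read the 3 bits back, possibly crossing a byte boundary
            if (bits_[bit / 8] & (1u << (bit % 8)))
                value |= static_cast<std::uint8_t>(1u << k);
        return value;
    }

private:
    std::vector<std::uint8_t> bits_;
};

As with bit-fields, every access pays for shifting and masking, so this is only worth it if the memory saving really matters.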

Any better alternatives for getting the digits of a number? (C++)

I know that you can get the digits of a number using modulus and division. The following is how I've done it in the past (pseudocode, so that students reading this still have to do some work for their homework assignment):
int pointer getDigits(int number)
    initialize int pointer to array of some size
    initialize int i to zero
    while number is greater than zero
        store result of number mod 10 in array at index i
        divide number by 10 and store result in number
        increment i
    return int pointer
Anyway, I was wondering if there is a better, more efficient way to accomplish this task. If not, are there any alternative methods that avoid the use of strings, C-style or otherwise?
Thanks. I ask because I'm going to be wanting to do this in a personal project of mine, and I would like to do it as efficiently as possible.
Any help and/or insight is greatly appreciated.
The time it takes to extract the digits will be dwarfed by the time required to dynamically allocate the array. Consider returning the result in a struct:
struct extracted_digits
{
    int number_of_digits;
    char digits[12];
};
You'll want to pick a suitable value for the maximum number of digits (12 here, which is enough for a 32-bit integer). Alternatively, you could return a std::array<char, 12> and mark the end with an invalid value (so, after the last digit, store a 10 or something else that isn't a digit).
Depending on whether you want to handle negative values, you'll also have to decide how to report the unary minus (-).
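A sketch of an extraction routine that fills the struct above (assuming a non-negative input; adapt it if you also need to report the minus sign):

extracted_digits get_digits(int number)
{
    extracted_digits result;
    result.number_of_digits = 0;
    do {
        // store the least-significant digit first, as in the pseudocode above
        result.digits[result.number_of_digits++] = static_cast<char>(number % 10);
        number /= 10;
    } while (number > 0);
    return result;
}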
Unless you want the representation of the number in a base that's a power of 2, that's about the only way to do it.
Smacks of premature optimisation. If profiling proves it matters, then be sure to compare your algo to itoa - internally it may use some CPU instructions that you don't have explicit access to from C++, and which your compiler's optimiser may not be clever enough to employ (e.g. AAM, which divs while saving the mod result). Experiment (and benchmark) coding the assembler yourself. You might dig around for assembly implementations of ITOA (which isn't identical to what you're asking for, but might suggest the optimal CPU instructions).
By "avoiding the use of strings", I'm going to assume you're doing this because a string-only representation is pretty inefficient if you want an integer value.
To that end, I'm going to suggest a slightly unorthodox approach which may be suitable. Don't store them in one form, store them in both. The code below is in C; it will work in C++, but you may want to consider using C++ equivalents. The idea behind it doesn't change, however.
By "storing both forms", I mean you can have a structure like:
typedef struct {
    int ival;
    char sval[sizeof("-2147483648")]; // enough for 32-bits
    int dirtyS;
} tIntStr;
and pass around this structure (or its address) rather than the integer itself.
By having macros or inline functions like:
inline void intstrSetI (tIntStr *is, int ival) {
    is->ival = ival;
    is->dirtyS = 1;
}

inline char *intstrGetS (tIntStr *is) {
    if (is->dirtyS) {
        sprintf (is->sval, "%d", is->ival);
        is->dirtyS = 0;
    }
    return is->sval;
}
Then, to set the value, you would use:
tIntStr is;
intstrSetI (&is, 42);
And whenever you wanted the string representation:
printf ("%s\n" intstrGetS(&is));
fprintf (logFile, "%s\n" intstrGetS(&is));
This has the advantage of calculating the string representation only when needed (the printf above recalculates it only if the value is dirty, and the fprintf then reuses it without recalculating).
This is similar to a trick I use in SQL with precomputed columns and triggers. The idea there is that you only perform calculations when needed. So an extra column holding the indexed lowercased last name, along with an insert/update trigger to calculate it, is usually a lot more efficient than select lower(non_lowercased_last_name). That's because it amortises the cost of the calculation (done at write time) across all reads.
In that sense, there's little advantage if your code profile is set-int/use-string/set-int/use-string.... But, if it's set-int/use-string/use-string/use-string/use-string..., you'll get a performance boost.
Granted this has a cost, at the bare minimum extra storage required, but most performance issues boil down to a space/time trade-off.
And, if you really want to avoid strings, you can still use the same method (calculate only when needed), it's just that the calculation (and structure) will be different.
As an aside: you may well want to use the library functions to do this rather than handcrafting your own code. Library functions will normally be heavily optimised, possibly more so than your compiler can make from your code (although that's not guaranteed of course).
It's also likely that an itoa, if you have one, will probably outperform sprintf("%d") as well, given its limited use case. You should, however, measure, not guess! Not just in terms of the library functions, but also this entire solution (and the others).
It's fairly trivial to see that a base-100 solution could work as well, using the "digits" 00-99. In each iteration you'd do a %100 to produce such a digit pair, thus halving the number of steps. The trade-off is that your digit table is now 200 bytes instead of 10. Still, it easily fits in L1 cache (obviously, this only matters if you're converting a lot of numbers; otherwise efficiency is moot anyway). Also, you might end up with a leading zero, as in "0128".
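A sketch of that base-100 idea, using a 200-byte table of character pairs (this produces printable characters; if you want raw digit values, store pairs of small integers instead):

#include <string>

// "00" "01" ... "99" laid out contiguously: 200 bytes, fits in L1 cache.
static const char digit_pairs[201] =
    "00010203040506070809101112131415161718192021222324"
    "25262728293031323334353637383940414243444546474849"
    "50515253545556575859606162636465666768697071727374"
    "75767778798081828384858687888990919293949596979899";

std::string to_decimal(unsigned n)
{
    char buf[16];
    char* p = buf + sizeof(buf);
    while (n >= 100) {                     // peel off two digits per iteration
        unsigned pair = (n % 100) * 2;
        n /= 100;
        *--p = digit_pairs[pair + 1];
        *--p = digit_pairs[pair];
    }
    if (n >= 10) {                         // one final pair
        unsigned pair = n * 2;
        *--p = digit_pairs[pair + 1];
        *--p = digit_pairs[pair];
    } else {                               // or a single leading digit
        *--p = static_cast<char>('0' + n);
    }
    return std::string(p, buf + sizeof(buf) - p);
}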
Yes, there is a more efficient way, but it is not portable. Intel's FPU has a special BCD format for numbers. So all you have to do is call the corresponding assembler instruction that converts ST(0) to BCD format and stores the result in memory. The instruction name is FBSTP.
Mathematically speaking, the number of decimal digits of a non-zero integer a is 1 + int(log10(abs(a))), plus one more character for the minus sign when a < 0 (and a == 0 still needs one digit).
You do not use strings but go through floating point and the log functions. If your platform has any kind of FP acceleration (every PC or similar does), that will not be a big deal, and it will beat whatever string-based algorithm you use (which is nothing more than an iterative divide-by-ten-and-count).
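A sketch of counting digits that way; note that floating-point rounding near exact powers of ten can be off by one, so treat this as an approximation unless you add a correction step:

#include <cmath>

// Number of characters needed to print v in base 10, including a '-' sign if needed.
int decimal_width(long v)
{
    if (v == 0)
        return 1;
    int digits = 1 + static_cast<int>(std::log10(std::fabs(static_cast<double>(v))));
    return digits + (v < 0 ? 1 : 0);
}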

Do bit operations cause programs to run slower?

I'm dealing with a problem which needs to work with a lot of data. Currently its values are represented as an unsigned int. I know that real values do not exceed a limit of 1000.
Questions
I can use unsigned short to store it. An upside to this is that it'll use less storage space to store the value. Will performance suffer?
If I decide to store the data as short but all the calling functions use int, I realize that I need to convert between these data types when storing or extracting values. Will performance suffer? Will the loss in performance be dramatic?
If I decide not to use short but to pack just 10 bits into an array of unsigned int, what will happen in this case compared with the previous ones?
This all depends on the architecture. Bit-fields are generally slower, but if you are able to significantly cut down memory usage with them, you can even gain performance due to better CPU caching and similar effects. Likewise with short (though the effect is not dramatic in any case).
The best way is to make your source code able to switch representation easily (at compilation time, of course). Then you will be able to test and profile different implementations in your specific circumstances just by, say, changing one #define.
Also, don't forget about the premature-optimization rule: make it work first. Only if it turns out to be too slow should you try to speed it up.
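A minimal sketch of that idea, where a single macro switches the storage type (all names here are just placeholders):

#include <cstddef>
#include <cstdint>
#include <vector>

// Flip this to 1 to store values as 16-bit integers instead of 32-bit.
#define USE_SHORT_STORAGE 0

#if USE_SHORT_STORAGE
typedef std::uint16_t value_storage_t;
#else
typedef std::uint32_t value_storage_t;
#endif

// All code works through this alias; only the #define changes the layout.
typedef std::vector<value_storage_t> value_array;

unsigned int sum(const value_array& values)
{
    unsigned int total = 0;
    for (std::size_t i = 0; i < values.size(); ++i)
        total += values[i];                 // widened to unsigned int automatically
    return total;
}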
I can use unsigned short to store it.
Yes, you can use unsigned short (assuming (sizeof(unsigned short) * CHAR_BIT) >= 10).
An upside to this is that it'll use less storage space to store the value.
Less than what? Less than int? That depends on what sizeof(int) is on your system.
Will performance suffer?
Depends. The type int is supposed to be the most efficient integer type for your system so potentially using short may affect your performance. Whether it does will depend on the system. Time it and find out.
If I decide to store the data as short but all the calling functions use int, I realize that I need to convert between these data types when storing or extracting values.
Yes. But the compiler will do the conversion automatically. One thing you need to watch, though, is conversion between signed and unsigned types. If the value does not fit, the exact result may be implementation-defined.
Will performance suffer?
Maybe. If sizeof(unsigned int) == sizeof(unsigned short), then probably not. Time it and see.
Will the loss in performance be dramatic?
Time it and see.
If I decide not to use short but to pack just 10 bits into an array of unsigned int, what will happen in this case compared with the previous ones?
Time it and see.
A good compromise for you is probably packing three values into a 32 bit int (with two bits unused). Untangling 10 bits from a bit array is a lot more expensive, and doesn't save much space. You can either use bit fields, or do it by hand yourself:
(i&0x3FF) // Get i[0]
(i>>10)&0x3FF // Get i[1]
(i>>20)&0x3FF // Get i[2]
i = (i&0x3FFFFC00) | (j&0x3FF) // Set i[0] to j
i = (i&0x3FF003FF) | ((j&0x3FF)<<10) // Set i[1] to j
i = (i&0xFFFFF) | ((j&0x3FF)<<20) // Set i[2] to j
You can see here how much extra expense it is: a bit operation and 2/3 of a shift (on average) for get, and three bit operations and 2/3 of a shift (on average) to set. Probably not too bad, especially if you're mostly getting the values not setting them.
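Wrapped into helper functions over an array of 32-bit words, that looks roughly like this (the function names are illustrative):

#include <cstddef>
#include <cstdint>

// Three 10-bit values per 32-bit word; `index` is the logical element index.
std::uint32_t get10(const std::uint32_t* words, std::size_t index)
{
    return (words[index / 3] >> (10 * (index % 3))) & 0x3FF;
}

void set10(std::uint32_t* words, std::size_t index, std::uint32_t value)
{
    unsigned shift = 10 * (index % 3);
    std::uint32_t& w = words[index / 3];
    // clear the target 10-bit slot, then merge in the new value
    w = (w & ~(std::uint32_t(0x3FF) << shift)) | ((value & 0x3FF) << shift);
}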

C++, using one byte to store two variables

I am working on a representation of the chess board, and I am planning to store it in a 32-byte array, where each byte will be used to store two pieces. (That way only 4 bits are needed per piece.)
Doing it that way results in overhead when accessing a particular index of the board.
Do you think this code can be optimised, or that a completely different method of accessing indexes can be used?
#include <cstdlib>
#include <iostream>
using namespace std;

char getPosition(unsigned char* c, int index){
    //moving pointer
    c += (index >> 1);
    //odd number
    if (index & 1){
        //taking right part
        return *c & 0xF;
    } else {
        //taking left part
        return *c >> 4;
    }
}

void setValue(unsigned char* board, char value, int index){
    //moving pointer
    board += (index >> 1);
    //odd number
    if (index & 1){
        //replace right part, keep the left 4 bits
        *board = (*board & 0xF0) + value;
    } else {
        //replacing left part
        *board = (*board & 0xF) + (value << 4);
    }
}

int main() {
    char* c = (char*)malloc(32);
    for (int i = 0; i < 64; i++){
        setValue((unsigned char*)c, i % 8, i);
    }
    for (int i = 0; i < 64; i++){
        cout << (int)getPosition((unsigned char*)c, i) << " ";
        if (((i + 1) % 8 == 0) && (i > 0)){
            cout << endl;
        }
    }
    free(c);
    return 0;
}
I am equally interested in your opinions regarding chess representations, and in the optimisation of the method above as a standalone problem.
Thanks a lot
EDIT
Thanks for your replies. A while ago I created a checkers game, where I used a 64-byte board representation. This time I am trying some different methods, just to see what I like. Memory is not such a big problem. Bitboards are definitely on my list to try. Thanks
That's the problem with premature optimization. Where your chess board would have taken 64 bytes to store, now it takes 32. What has this really bought you? Did you actually analyze the situation to see if you needed to save that memory?
Assuming that you use one of the least optimal search methods, a straight AB search to depth D with no heuristics, and that you generate all possible moves in a position before searching, then the absolute maximum memory required for your boards is going to be sizeof(board) * W * D. If we assume a rather large W = 100 and a large D = 30, then you're going to have 3000 boards in memory at depth D. Roughly 192 KB vs 96 KB... is it really worth it?
On the other hand, you've increased the number of operations necessary to access board[location], and this will be called many millions of times per search.
When building chess AIs, the main thing you'll end up short of is CPU cycles, not memory. This may vary a little if you're targeting a cell phone or something, but even then you're going to worry about speed long before you ever reach enough depth to cause any memory issues.
As to which representation I prefer... I like bitboards. I haven't done a lot of serious measurements, but I did compare two engines I made, one bitboard-based and one array-based, and the bitboard one was faster and could reach much greater depths than the other.
Let me be the first to point out a potential bug (depending on compilers and compiler settings); bugs like this are why premature optimization is evil:
//taking left part
return *c>>4;
If *c is negative, then >> may replicate the negative high bit, i.e. in binary:
0b10100000 >> 4 == 0b11111010
on some compilers (the C++ standard leaves it to the implementation to decide both whether to replicate the high bit and whether a plain char is signed or unsigned).
If you do want to go forward with your packed bits (and let me say that you probably shouldn't bother, but it is up to you), I would suggest wrapping the packed bits in a class and overloading operator[] such that
board[x][y]
gives you the unpacked bits. Then you can turn the packing on and off easily while keeping the same syntax in either case. If you inline the operator overloads, it should be as efficient as the code you have now.
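A sketch of such a wrapper, using a proxy object so that both reads and writes through operator[] go through the packing code; it assumes the getPosition/setValue functions from the question are visible:

class PackedBoard {
public:
    class Cell {                                       // read/write proxy for one square
    public:
        Cell(unsigned char* data, int index) : data_(data), index_(index) {}
        operator char() const { return getPosition(data_, index_); }
        Cell& operator=(char piece) { setValue(data_, piece, index_); return *this; }
    private:
        unsigned char* data_;
        int index_;
    };

    class Row {                                        // so that board[x][y] works
    public:
        Row(unsigned char* data, int row) : data_(data), row_(row) {}
        Cell operator[](int col) { return Cell(data_, row_ * 8 + col); }
    private:
        unsigned char* data_;
        int row_;
    };

    Row operator[](int row) { return Row(data_, row); }

private:
    unsigned char data_[32] = {};                      // two squares per byte
};

With this in place, board[3][4] = 5; and char piece = board[3][4]; both work, and switching to an unpacked char[64] only requires changing the Cell and Row proxies, not the calling code.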
Well, 64 bytes is a very small amount of RAM. You're better off just using a char[8][8]. That is, unless you plan on storing a ton of chess boards. Doing char[8][8] makes it easier (and faster) to access the board and do more complex operations on it.
If you're still interested in storing the board in a packed representation (either for practice or to store a lot of boards), I'd say you're "doing it right" regarding the bit operations. You may want to consider inlining your accessors with the inline keyword if you're going for speed.
Is space really such a consideration that you can't just use a full byte to represent a square? That would make accesses easier to follow in the program and most likely faster as well, since the bit manipulations are not required.
Otherwise, to make sure everything goes smoothly, I would make all your types unsigned: have getPosition return unsigned char, and qualify all your numeric literals with "U" (0xF0U, for example) to make sure they're always interpreted as unsigned. Most likely you won't have any problems with signedness, but why take chances on some architecture that behaves unexpectedly?
Nice code, but if you are really that deep into performance optimization, you should probably learn more about your particular CPU architecture.
AFAIK, you may find that storing a chess piece in as much as 8 bytes will be more efficient. Even if you recurse 15 moves deep, L2 cache size would hardly be a constraint, but RAM misalignment may be. I would guess that proper handling of a chess board would include Expand() and Reduce() functions to translate between board representations during different parts of the algorithm: some may be faster on the compact representation, and some vice versa. For example, caching, and algorithms involving hashing by the composition of two adjacent cells, might be good with the compact structure; everything else probably is not.
I would also consider developing some helper hardware, like an FPGA board, or some GPU code, if performance is that important.
As a chess player, I can tell you: There's more to a position than the mere placement of each piece. You have to take in to consideration some other things:
Which side has to move next?
Can white and/or black castle king and/or queenside?
Can a pawn be taken en passant?
How many moves have passed since the last pawn move and/or capturing move?
If the data structure you use to represent a position doesn't reflect this information, then you're in big trouble.