I am doing an algorithmic contest, and I'm trying to optimize my code. Maybe what I want to do is stupid and impossible but I was wondering.
I have these requirements:
An inventory which can contain 4 distinct types of item. This inventory can't contain more than 10 items (all types included). Example of a valid inventory: 1 / 1 / 1 / 0. Examples of invalid inventories: 11 / 0 / 0 / 0 or 5 / 5 / 5 / 0
I have some recipes which consume or add items to my inventory. A recipe can't add or consume more than 10 items since the inventory can't hold more than 10 items. Example of a valid recipe: -1 / -2 / 3 / 0. Example of an invalid recipe: -6 / -6 / +12 / 0
For now, I store the inventory and the recipe into 4 integers. Then I am able to perform some operations like:
ApplyRecipe: Inventory(1/1/1/0).Apply(Recipe(-1/1/0/0)) = Inventory(0/2/1/0)
CanAfford: Inventory(1/1/0/0).CanAfford(Recipe(-2/1/0/0)) = False
I would like to know if it is possible (and if yes, how) to store the 4 values of an inventory/recipe in one single integer and to perform the previous operations on it in a way that would be faster than comparing / adding the 4 integers as I'm doing now.
I thought of something like having the inventory like that:
int32: XXXX (number of items of the first type) - YYYY (number of items of the second type) - ZZZ (number of items of the third type) - WWW (number of item of the fourth type)
But I have two problems with that:
I don't know how to handle the possible negative values
It seems to me much slower than just adding the 4 integers since I have to bit shift the inventory and the recipe to get the value I want and then proceed with the addition.
Storing multiple int values into one variable
Here are two alternatives:
An array. The advantage of this is that you may iterate over the elements:
int variable[] {
1,
1,
1,
0,
};
Or a class. The advantage of this is the ability to name the members:
struct {
int X;
int Y;
int Z;
int W;
} variable {
1,
1,
1,
0,
};
Then I am able to perform some operations like:
Those look like SIMD vector operations (Single Instruction Multiple Data). The array is the way to go in this case. Since the number of operands appears to be constant and small in your description, an efficient way to perform them is with vector operations on the CPU¹.
There is no standard way to use SIMD operations directly in C++. To give the compiler optimal opportunity to use them, these steps need to be followed:
Make sure that the CPU you use supports the operations that you need. AVX-2 instruction set and its expansions have wide support for integer vector operations.
Make sure that you tell the compiler that the program should be optimised for that architecture.
Make sure to tell the compiler to perform vectorisation optimisations.
Make sure that the integers are sufficiently aligned as required by the operations. This can be achieved with alignas.
Make sure that the number of integers is known at compile time.
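A minimal sketch of what following these steps could look like, assuming four item types and a build with optimisation flags along the lines of -O3 -march=native on GCC or Clang (the flags are an assumption about your toolchain). The names Inventory, can_afford and apply are made up for illustration, and the capacity check of 10 comes from the question. Whether the loops really get vectorised has to be confirmed in the generated assembly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kTypes = 4;  // number of item types, known at compile time
constexpr int kCapacity = 10;      // maximum total number of items (from the question)

// Hypothetical wrapper: 16-byte alignment matches a 128-bit vector register.
struct alignas(16) Inventory {
    std::array<int, kTypes> items;
};

// True if no count would go negative and the total stays within capacity.
inline bool can_afford(const Inventory& inv, const Inventory& recipe) {
    int total = 0;
    bool ok = true;
    for (std::size_t i = 0; i < kTypes; ++i) {  // fixed trip count, no early exit
        const int after = inv.items[i] + recipe.items[i];
        ok &= (after >= 0);
        total += after;
    }
    return ok && total <= kCapacity;
}

// Element-wise addition; the caller is expected to check can_afford first.
inline Inventory apply(const Inventory& inv, const Inventory& recipe) {
    assert(can_afford(inv, recipe));
    Inventory out{};
    for (std::size_t i = 0; i < kTypes; ++i) {
        out.items[i] = inv.items[i] + recipe.items[i];
    }
    return out;
}
```

With the values from the question, apply(Inventory{{1, 1, 1, 0}}, Inventory{{-1, 1, 0, 0}}) yields 0 / 2 / 1 / 0, and can_afford(Inventory{{1, 1, 0, 0}}, Inventory{{-2, 1, 0, 0}}) is false.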
If the prospect of relying on the optimiser worries you, then you may instead prefer to use vector extensions that may be provided by your compiler. The use of language extensions would come at the cost of portability to other compilers naturally. Here is an example with GCC:
#include <iostream>

constexpr int count = 4;
// GCC vector extension: a vector of `count` ints
using v4si = int __attribute__ ((vector_size (sizeof(int) * count)));

int main()
{
    v4si inventory { 1, 1, 1, 0};
    v4si recipe {-1, 1, 0, 0};
    v4si applied = inventory + recipe;  // element-wise addition
    for (int i = 0; i < count; i++) {
        std::cout << applied[i] << ", ";
    }
}
¹ If the number of operands were large, then a specialised vector processor such as a GPU could be faster.
Especially if you're learning, it's not a bad opportunity to try implementing your own helper class for vectorization, and consequently deepen your understanding about data in C++, even if your use case might not warrant the technique.
The insight you want to exploit is that arithmetic operations seem invariant to bitshifts, if one considers the pesky carry-bit and the effects of signedness (e.g. two's complement). But precisely because of these latter factors, it's much better to use some standardized underlying type like an int8_t[], as @Botje suggests.
To begin, implement the following functions. (My C++ is rusty, consider this pseudocode.)
int8_t* add(int8_t[], int8_t[], size_t);
int8_t* multiply(int8_t[], int8_t[], size_t);
int8_t* zeroes(size_t); // additive identity
int8_t* ones(size_t); // multiplicative identity
Also consider:
How would you like to handle overflows and underflows? Let them be and ask the developer to be cautious? Or throw exceptions?
Maybe you'd like to pin down the size of the array and avoid having to deal with a dynamic size_t?
Maybe you'd like to go as far as overloading operators?
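One possible shape for this exercise, offered only as a sketch: the size is pinned at compile time, the operators are overloaded, and overflow is reported with an exception. The names Vec, checked_narrow, zeroes and ones are illustrative, and throwing on overflow is just one of the policies raised in the questions above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Converts back to int8_t, throwing if the value does not fit.
inline std::int8_t checked_narrow(int wide) {
    if (wide < std::numeric_limits<std::int8_t>::min() ||
        wide > std::numeric_limits<std::int8_t>::max()) {
        throw std::overflow_error("int8_t element out of range");
    }
    return static_cast<std::int8_t>(wide);
}

template <std::size_t N>
struct Vec {
    std::array<std::int8_t, N> v;

    // Element-wise sum; operands are promoted to int, so overflow is detectable.
    friend Vec operator+(const Vec& a, const Vec& b) {
        Vec out{};
        for (std::size_t i = 0; i < N; ++i) out.v[i] = checked_narrow(a.v[i] + b.v[i]);
        return out;
    }
    // Element-wise product.
    friend Vec operator*(const Vec& a, const Vec& b) {
        Vec out{};
        for (std::size_t i = 0; i < N; ++i) out.v[i] = checked_narrow(a.v[i] * b.v[i]);
        return out;
    }
    friend bool operator==(const Vec& a, const Vec& b) { return a.v == b.v; }
};

template <std::size_t N>
Vec<N> zeroes() { return Vec<N>{}; }  // additive identity

template <std::size_t N>
Vec<N> ones() {                       // multiplicative identity
    Vec<N> out{};
    out.v.fill(1);
    return out;
}

// Usage sketch mirroring the inventory example from the question.
inline void vec_example() {
    const Vec<4> inventory{{1, 1, 1, 0}};
    const Vec<4> recipe{{-1, 1, 0, 0}};
    assert(inventory + recipe == (Vec<4>{{0, 2, 1, 0}}));
    assert(inventory + zeroes<4>() == inventory);
    assert(inventory * ones<4>() == inventory);
}
```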
The end result of an exercise like this, but generalized and polished, is something like Armadillo. But you'll understand it on a whole different level by doing the exercise yourself first. Also, if all this makes sense so far, you can now take a look at How to vectorize my loop with g++?—even the compiler can vectorize for you in certain cases.
Bitpacking, as @Botje mentions, is another step beyond this. You won't even have the safety and convenience of an integer type like int8_t or int4_t, which additionally means the code you write might stop being platform-independent. I recommend at least finishing the vectorization exercise before delving into this.
This will be something of a non-answer, just intended to show what you're up against if you do bitpacking.
Suppose, for simplicity's sake, that recipes can only remove from inventory, and only contain positive values (you could represent negative numbers using two's complement, but it would take more bits, and add much complexity to working with the bit-packed numbers).
You then have 11 possible values for an item, so you need 4 bits for each item. Four items can then be represented in one uint16.
So, say you have an inventory with 10,4,6,9 items; this would be uint16_t inv = 0b1010'0100'0110'1001.
Then, a recipe with 2,2,2,2 items or uint16_t rec = 0b0010'0010'0010'0010.
inv - rec would give 0b1000'0010'0100'0111 for 8,2,4,7 items.
So far, so good. No need here to shift and mask to get at the individual values before doing the calculation. Yay.
Now, a recipe with 6,6,6,6 items which would be 0b0110'0110'0110'0110, giving inv - rec = 0b0011'1110'0000'0011 for 3,14,0,3 items.
Oops.
The arithmetic will work, but only if you check beforehand that the individual 4-bit results don't go out of bounds; in this example this would mean that you know beforehand that there are enough items in the inventory to fill a recipe.
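You could get at, say, the third item in the inventory by doing: (inv >> 4) & 0b1111 or (uint16_t)(inv << 8) >> 12 for doing your checks (the cast is needed because inv is promoted to int before the shift, so the high bits would otherwise survive).
For testing, you would then get expressions like the following (note the extra parentheses: >= binds tighter than &):
if (((inv >> 4) & 0b1111) >= ((rec >> 4) & 0b1111))
or, comparing the 4 bits "in place":
if ((inv & 0b0000000011110000) >= (rec & 0b0000000011110000))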
for each 4-bit part.
All these things are doable, but do you want to? It probably won't be faster than what is suggested in the other answers after the compiler has done its job, and it certainly won't be more readable.
It becomes even more horrible when you allow negative numbers (two's complement or otherwise) in recipes, especially if you want to bit-shift them.
So, bitpacking is nice for storage, and in some rare cases you can even do math without unpacking the bits, but I wouldn't try to go there (unless you are very performance and memory constrained).
Having said that, it could be fun to try to get it to work; there's always that.
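For anyone who does want to try, here is a sketch of one known trick that checks all four fields in a single subtraction. It is not taken from the answers above, and it assumes a different layout from the 16-bit one used so far: each item gets 5 bits in a 32-bit word, 4 value bits plus one spare "guard" bit that is always stored as zero. As above, recipes are assumed to only remove items. The names pack, field, can_afford and remove_items are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kFieldBits = 5;           // 4 value bits + 1 guard bit per item
constexpr std::uint32_t kGuards = 0x84210u;  // the guard bit (bit 4) of each of the 4 fields

// Packs four counts (each 0..15) with the first item in the highest field.
constexpr std::uint32_t pack(std::uint32_t a, std::uint32_t b,
                             std::uint32_t c, std::uint32_t d) {
    return (a << 15) | (b << 10) | (c << 5) | d;
}

// Reads one count back; index 0 is the first item.
constexpr std::uint32_t field(std::uint32_t packed, unsigned index) {
    return (packed >> (kFieldBits * (3 - index))) & 0xFu;
}

// Setting every guard bit adds 16 to each field, so no field can borrow from
// its neighbour during the subtraction. A guard bit survives exactly when
// that field of inv is >= the same field of cost.
constexpr bool can_afford(std::uint32_t inv, std::uint32_t cost) {
    return (((inv | kGuards) - cost) & kGuards) == kGuards;
}

// Plain subtraction is safe once can_afford has been checked.
inline std::uint32_t remove_items(std::uint32_t inv, std::uint32_t cost) {
    assert(can_afford(inv, cost));
    return inv - cost;
}
```

With the numbers used earlier, can_afford(pack(10, 4, 6, 9), pack(2, 2, 2, 2)) holds and the subtraction gives pack(8, 2, 4, 7), while the 6,6,6,6 recipe is rejected because the second field would underflow. Adding items and enforcing the total of 10 would need separate handling that this sketch does not cover.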
Related
I have structured data like below:
struct Leg
{
char type;
char side;
int qty;
int id;
} Legs[5];
where
type is O or E,
side is B or S;
qty is 1 to 9999, and the qty values across all Legs are relatively prime to each other, e.g. 1 2 3, not 2 4 6
id is an integer from 1 to 9999999 and all ids are unique in the group of Legs
To build unique signature of above data, currently I am building a string like below:
first sort Legs based on id;
then
signature=""
for i=1 to 5
signature+=id+type+qty+side of leg-i
and I insert it into an unordered_map so that, when matching structured data arrives, I can build its signature as above and look it up.
An unordered_map keyed on a string means a key comparison, which is a string compare, and also a hash function that has to traverse the string, which is usually around 25 chars.
For efficiency, if it is possible to build a unique integer out of the above data for each structure, the lookups/insertions in the unordered_map will be much faster.
Just wondering if there is any mathematical properties I can take advantage of.
Edit:
The map will contain key,value pairs like
<unique-signature=key, value=int-value needs to be located on looking up another repeating Leg group by constructing signature like above after sorting Legs based on id>
<123O2B234E3S456O3S567O2S789E2B, 989>
The goal is to build a unique signature from each such repeating group of legs. Legs can arrive in a different order and still match another group of legs, which is why I sort by id (which is unique) before building the signature.
My signature is string based, if there was a way to construct a unique number signature, then my lookups/insertions will be faster.
You can just create a unique 40-bit number from the fields you have. Why 40 bits? I'm glad you asked.
You have 9,999,999 possible id values, which means you can use 24 bits to represent all possibilities (log2(9999999) = a little over 23).
You have 9,999 possible qty values, which requires another 14 bits.
type and side require 1 bit each, which gives you a total of 40 bits of information. Store this number as a long long and you have a nice, fast key for your map.
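As a rough sketch of that packing for a single Leg: the field order (id, then qty, then type, then side) and the function name pack_leg are my own choices, not something the question prescribes.

```cpp
#include <cassert>
#include <cstdint>

struct Leg {
    char type;  // 'O' or 'E'
    char side;  // 'B' or 'S'
    int qty;    // 1..9999
    int id;     // 1..9999999
};

// Assumed layout, most significant first: [id:24][qty:14][type:1][side:1] = 40 bits.
inline std::uint64_t pack_leg(const Leg& leg) {
    assert(leg.type == 'O' || leg.type == 'E');
    assert(leg.side == 'B' || leg.side == 'S');
    assert(leg.qty >= 1 && leg.qty <= 9999);
    assert(leg.id >= 1 && leg.id <= 9999999);
    std::uint64_t key = static_cast<std::uint64_t>(leg.id);
    key = (key << 14) | static_cast<std::uint64_t>(leg.qty);
    key = (key << 1) | (leg.type == 'E' ? 1u : 0u);
    key = (key << 1) | (leg.side == 'S' ? 1u : 0u);
    return key;
}
```

Two legs produce the same key only if all four fields match, and every key stays below 2^40.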
If you really want a unique int key then you're probably out of luck because it's going to be pretty tricky to get rid of 8 bits of information. You might be able to take advantage of the co-primality of the qty field to represent it in fewer than 14 bits, however I doubt that you can get it down to 6 bits because that only gives you 64 possible values for qty.
That's a way to get what you asked for, but @David Schwartz's answer is probably what you actually need: hash collisions are generally not expensive unless you have a really bad hash function - see Application vulnerability due to Non Random Hash Functions for an example of how that can bite you - or a carefully crafted data set that happens to hit the worst-case.
In your case you should be fine with David's answer. It'll be fast enough unless you are extremely unfortunate with your set of data.
EDIT: Just noticed that you are computing your signature over the set of 5 Legs. The same math applies, you will just need 200 bits rather than 40. So it won't fit in a long long unless you have some information that can be shared amongst all 5 Leg objects; if each set of 5 shares the same id, for example.
Stick with David's answer.
It doesn't have to be unique. I would suggest something like:
std::size_t hash_value(const Leg& l)
{
    std::size_t ret = l.type;
    ret <<= 8;
    ret |= l.side;
    ret *= 2654435761;
    ret += l.qty;
    ret *= 2654435761;
    ret += l.id;
    return ret * 2654435761;
}
In order to create an order-independent hash function for groups of five legs, first choose a hash function for individual legs -- David's answer looks great. Compute the hashes for each of the five legs. Now choose an order-independent function to combine these five hash values. You could, for example, xor the hashes together, or add them all together, or multiply them all together.
The fact that multiplication distributes over addition, and multiplication was the last operation to happen, makes me a little bit wary of using that. I think xor might be the best option of the ones I give here; but before using this in production, you should definitely run a few tests to see if you can easily generate collisions with any of them.
Probably superfluous, but here is a simple implementation that calls hash_value from David's answer:
std::size_t hash_value(const Leg_Array& legs) {
    std::size_t ret = 0;
    for (int i = 0; i < 5; ++i) {
        ret ^= hash_value(legs[i]);
    }
    return ret;
}
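To use such a hash as an unordered_map key, the map also needs an equality test that ignores leg order. The following is only a sketch under my own assumptions: LegGroup, LegGroupHash, LegGroupEqual and SignatureMap are hypothetical names, the per-leg hash repeats the mixing from David's answer, and equality is decided by comparing copies sorted by the unique id.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <unordered_map>

struct Leg { char type; char side; int qty; int id; };
using LegGroup = std::array<Leg, 5>;

inline bool operator==(const Leg& a, const Leg& b) {
    return a.type == b.type && a.side == b.side && a.qty == b.qty && a.id == b.id;
}

// Per-leg hash, same mixing steps as in the earlier answer.
inline std::size_t hash_value(const Leg& l) {
    std::size_t ret = static_cast<unsigned char>(l.type);
    ret <<= 8;
    ret |= static_cast<unsigned char>(l.side);
    ret *= 2654435761u;
    ret += static_cast<std::size_t>(l.qty);
    ret *= 2654435761u;
    ret += static_cast<std::size_t>(l.id);
    return ret * 2654435761u;
}

// Order-independent: xor of the per-leg hashes.
struct LegGroupHash {
    std::size_t operator()(const LegGroup& legs) const {
        std::size_t ret = 0;
        for (const Leg& l : legs) ret ^= hash_value(l);
        return ret;
    }
};

// Order-independent equality: compare copies sorted by id.
struct LegGroupEqual {
    bool operator()(LegGroup a, LegGroup b) const {
        auto by_id = [](const Leg& x, const Leg& y) { return x.id < y.id; };
        std::sort(a.begin(), a.end(), by_id);
        std::sort(b.begin(), b.end(), by_id);
        return a == b;
    }
};

using SignatureMap = std::unordered_map<LegGroup, int, LegGroupHash, LegGroupEqual>;

// Stores the group from the question, then looks it up in reversed order.
inline int lookup_example() {
    SignatureMap map;
    LegGroup g = {{{'O', 'B', 2, 123}, {'E', 'S', 3, 234}, {'O', 'S', 3, 456},
                   {'O', 'S', 2, 567}, {'E', 'B', 2, 789}}};
    map[g] = 989;
    std::reverse(g.begin(), g.end());
    assert(map.count(g) == 1);
    return map.at(g);
}
```

Sorting inside the equality functor keeps the sketch short; sorting each group once before insertion or lookup would avoid repeating that work.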
I think this is not answered on this site yet.
I wrote some code which goes through many combinations of 4 numbers. The values range from 0 to 51, so they can be stored in 6 bits, so in 1 byte, am I right? I use these 4 numbers in nested for loops and then use them in the innermost loop. So which C++ type, among those that can store at least 52 values, is the fastest for iterating through 4 nested for loops?
The code looks like:
for(type first = 0; first != 49; ++first)
    for(type second = first+1; second != 50; ++second)
        for(type third = second+1; third != 51; ++third)
            for(type fourth = third+1; fourth != 52; ++fourth) {
                // using those values for about 1 billion bit operations made in other for loops
            }
That code is very simplified and maybe there is also a better way for this kind of iterating, you can help me also with that.
Use the typedef std::uint_fast8_t from the header <cstdint>. It is supposed to be the "fastest" unsigned integer type with at least 8 bits.
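A small sketch of the loop nest with that type, assuming the loop body is replaced by a simple counter (the function name count_combinations and the alias index_t are made up for illustration). Swapping the alias for another type is an easy way to benchmark the alternatives.

```cpp
#include <cassert>
#include <cstdint>

// Counts the 4-element combinations of 0..51 visited by the nested loops.
inline std::uint32_t count_combinations() {
    using index_t = std::uint_fast8_t;  // change this alias to compare types
    std::uint32_t count = 0;
    for (index_t first = 0; first != 49; ++first)
        for (index_t second = first + 1; second != 50; ++second)
            for (index_t third = second + 1; third != 51; ++third)
                for (index_t fourth = third + 1; fourth != 52; ++fourth) {
                    assert(first < second && second < third && third < fourth);
                    ++count;  // stand-in for the real work
                }
    return count;
}
```

The loops visit each of the 52 choose 4 = 270725 combinations exactly once.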
The fastest is whatever the underlying processor ALU can natively work with. Now registers may be addressable in multiple formats. In that case all those formats are equally fast.
So this becomes very processor architecture specific rather than C++ specific.
If you are working on a modern day PC processor then an int is as fast as anything else for your for loops.
On an embedded system there are more things to consider, e.g. whether the variable is stored in an aligned location or not.
On most machines, int is the fastest integer type. On all of the computers I work with, int is faster than unsigned, significantly faster than signed char.
Another issue, perhaps a bigger one, is what you are doing with those numbers. You didn't show the code, so there's no way of telling. Use int if you expect first*second to produce the expected integral value.
Yet another issue is how widely portable you expect this code to be. There's a huge distinction between code that will be ported to a number of different architectures, different compilers versus code that will be used in a limited and controlled setting. If it's the latter, write some benchmarks, and use the type under which the benchmarks perform best. The problem is a bit tougher if you are writing something for wide consumption.
A few days ago I heard about (maybe I've even seen!) a library that helps with packing structures. Unfortunately, I can't recall its name.
In my program I have to keep a large amount of data, so I need to pack it and avoid losing bits to gaps. For example: I have to keep many numbers from the range 1...5. If I kept each one in a char it would take 8 bits, but such a number fits in 3 bits. Moreover, if I kept these numbers in packs of 8 bits, with a maximum value of 256, I could pack 51 numbers in there (instead of 1 or 2!).
Is there any library that helps with this, or do I have to do it on my own?
As Tomalak Garet'kal already mentioned, this is a feature of ANSI C, called bit-fields. The wikipedia article is quite useful. Typically you declare them as structs.
For your example: as you mentioned you have one number in the range of 0..5 you can use 3 bits on this number, which leaves you 5 bits of use:
struct s
{
    unsigned int mynumber : 3;
    unsigned int myother : 5;
};
These can now be accessed simply like this:
struct s myinstance;
myinstance.mynumber = 3;
myinstance.myother = 1;
Be aware that bit-fields can be slower than ordinary struct members or variables, since the compiled code has to shift and mask to access the individual bits.
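For many values rather than one struct, the same shifting and masking can be written out by hand. The class below is a hypothetical sketch (PackedSmallInts is not a library type) that stores ten 3-bit values per 32-bit word, leaving 2 bits of each word unused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class PackedSmallInts {
public:
    explicit PackedSmallInts(std::size_t count)
        : words_((count + kPerWord - 1) / kPerWord, 0u) {}

    // Stores a value in 0..7 at position i.
    void set(std::size_t i, unsigned value) {
        assert(value <= kMask);
        const unsigned shift = static_cast<unsigned>(i % kPerWord) * kBits;
        std::uint32_t& word = words_[i / kPerWord];
        word = (word & ~(kMask << shift)) | (static_cast<std::uint32_t>(value) << shift);
    }

    // Reads the value at position i.
    unsigned get(std::size_t i) const {
        const unsigned shift = static_cast<unsigned>(i % kPerWord) * kBits;
        return (words_[i / kPerWord] >> shift) & kMask;
    }

    std::size_t bytes() const { return words_.size() * sizeof(std::uint32_t); }

private:
    static constexpr unsigned kBits = 3;          // bits per stored value
    static constexpr unsigned kPerWord = 10;      // 10 * 3 = 30 bits used per word
    static constexpr std::uint32_t kMask = 0x7u;  // lowest 3 bits
    std::vector<std::uint32_t> words_;
};

// Writes 25 values in the range 1..5 and reads them back.
inline bool round_trip_example() {
    PackedSmallInts packed(25);
    for (std::size_t i = 0; i < 25; ++i) packed.set(i, static_cast<unsigned>(1 + i % 5));
    for (std::size_t i = 0; i < 25; ++i) {
        if (packed.get(i) != 1 + i % 5) return false;
    }
    return packed.bytes() == 12;  // 3 words instead of 25 chars
}
```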
Sometimes I need to be sure that some integer is even. As such I could use the following code:
int number = /* magic initialization here */;
// make sure the number is even
if ( number % 2 != 0 ) {
number--;
}
but that does not seem to be the most efficient way to do it, so I could do the following:
int number = /* magic initialization here */;
// make sure the number is even
number &= ~1;
but (besides not being readable) I am not sure that solution is completely portable.
Which solution do you think is best?
Is the second solution completely portable?
Is the second solution considerably faster than the first?
What other solutions do you know for this problem?
What if I do this inside an inline method? It should (theoretically) be as fast as these solutions and readability should no longer be an issue, does that make the second solution more viable?
note: This code is supposed to only work with positive integers but having a solution that also works with negative numbers would be a plus.
Personally, I'd go with an inline helper function.
inline int make_even(int n)
{
return n - n % 2;
}
// ....
int m = make_even(n);
Before accepting an answer I will make my own that tries to summarize and
complete some of the information found here:
Four possible methods were described (and some small variations of these).
Method 1:
if (number % 2 != 0) {
number--;
}
Method 2:
number &= ~1;
Method 3:
number = number - (number % 2);
Method 4:
number = (number / 2) * 2;
Before proceeding any further let me clarify something:
The expected gain for using any of these methods is minimal, even if we could
prove that one method is 200% faster than the others the worst one is so fast
that the only way to have visible gain in speed would be if this method was
called many times in a CPU bound application. As such this is more of an
exercise for fun than a real optimization.
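To make the comparison concrete, here are the four methods as small functions (the names make_even_1 to make_even_4 and all_agree are mine). This is a sketch assuming C++11 or later, where integer division truncates toward zero, and a two's complement machine for method 2.

```cpp
#include <cassert>

inline int make_even_1(int n) { if (n % 2 != 0) { n--; } return n; }
inline int make_even_2(int n) { return n & ~1; }
inline int make_even_3(int n) { return n - (n % 2); }
inline int make_even_4(int n) { return (n / 2) * 2; }

// For non-negative input the four methods are interchangeable.
inline bool all_agree(int n) {
    assert(n >= 0);  // negative inputs are where the methods diverge
    const int expected = make_even_1(n);
    return make_even_2(n) == expected &&
           make_even_3(n) == expected &&
           make_even_4(n) == expected;
}
```

For 7 every method gives 6. For -3, methods 1 and 2 give -4 (rounding toward negative infinity) while methods 3 and 4 give -2 (rounding toward zero), so the choice matters once negative numbers are allowed.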
Analysis
Readability
As far as readability goes I would rank method 1 as the most readable,
method 4 as the second best and method 2 as the worst.
People are free to disagree but I ranked them like this because:
In method 1 it is as explicit as possible that if the number is odd you
want to subtract from it making it even.
Method 4 is also very much explicit but I ranked it second because at
first glance you might think it is doing nothing, and only a fraction of a
second later you're like "Oh... Integer division.".
Methods 2 and 3 are almost equivalent in terms of readability, but many
people are not used to bitwise operations and as such I ranked method 2 as
the worst.
With that in mind I would add that it is generally accepted that the best way
to implement this is using an inline function, and none of the options is
that unreadable, readability is not really an issue (direct uses in the code
are explicit and clear and reading the method will never be that hard).
If you don't want to use an inline method I would recommend that you only use
method 1 or method 4.
Compatibility issues
Underflow
It has been mentioned that method 1 may underflow, depending on the way the
processor represents integers. Just to be sure you can add the following
STATIC_ASSERT when using method 1.
STATIC_ASSERT(INT_MIN % 2 == 0, make_even_may_underflow);
As for method 3, even when INT_MIN is not even it may not underflow
depending on whether the result has the same sign of the divisor or the
dividend. Having the same sign of the divisor never underflows because
INT_MIN - (-1) is closer to 0.
Add the following STATIC_ASSERT just to be sure:
STATIC_ASSERT(INT_MIN % 2 == 0 || -1 % 2 < 0, make_even_may_underflow);
Of course you can still use these methods when the STATIC_ASSERT fails since
it would only be a problem when you pass INT_MIN to your make_even method,
but I would STRONGLY advise against it.
(Un)supported bit representations
When using method 2 you should make sure your compiler bit representation
behaves as expected:
STATIC_ASSERT( (1 & ~1) == 0, unsupported_bit_representation);
// two's complement OR sign-and-magnitude.
STATIC_ASSERT( (-3 & ~1) == -4 || (-3 & ~1) == -2 , unsupported_bit_representation);
Speed
I also did some naive speed tests using the Unix time utility. I ran every
different method (and its variations) 4 times and recorded the results,
since the results didn't vary much I didn't find it necessary to run more tests.
The obtained results show method 4 and method 2 as the fastest of them
all.
Conclusion
According to the provided information, I would recommend using method 4. It's
readable, I am not aware of any compatibility issues and it performs great.
I hope you enjoy this answer and use the information contained here to make
your own informed choice. :)
The source code is available if you want to check my results. Please note
that the tests were compiled using g++ and run on Mac OS X. Different
platforms and compilers may give different results.
int even_number = (number / 2) * 2;
This should work regardless of architecture, as long as the optimizer does not get in the way (it shouldn't, but who knows).
I would use the second solution. In any binary representation, regardless of the number of bits, big-endian vs. little-endian, or other architecture differences, that operation will have the effect of setting the lowest bit to zero. It's fast and completely portable. The intent of the code can be explained via comments, if you meet any poor C programmers who can't figure out what it means.
The &= solution looks best to me. If you want to make it more portable and more readable:
const int MakeEven = -2;
int number = /* magic initialization here */;
// Make sure number is even
number &= MakeEven;
The second solution should be considerably faster than the first. Is it completely portable? Most likely, although there's probably some computer somewhere that does math differently.
This should work for positive and negative integers.
Use your second solution as an inline function and put a static assert into its implementation to document and test that it works on the platform it is compiled on.
BOOST_STATIC_ASSERT( (1 & ~1) == 0 );
BOOST_STATIC_ASSERT( (-1 & ~1) == -2 );
Your second solution only works if your sign representation is "two's complement" or "sign and magnitude". To do it in place I'd go with suszterpatt's variant, which should (almost) always work
number -= (number % 2);
You don't know for sure in which direction this will "round" for negative values, so in extreme cases you might have an underflow.
even_integer = (any_integer >> 1) << 1;
This solution achieves the goal in the most performant way compared to the other suggested solutions.
In general, bitwise shift is the cheapest possible operation. Some compilers generate the same assembly for "number = (number / 2) * 2" as well but that is not guaranteed on all target platforms and programming languages.
The following approach is simple and requires no multiplication or division.
number = number & ~1;
or
number = (number + 1) & ~1;
I know that you can get the digits of a number using modulus and division. The following is how I've done it in the past (pseudocode, so as to make students reading this do some work for their homework assignment):
int pointer getDigits(int number)
initialize int pointer to array of some size
initialize int i to zero
while number is greater than zero
store result of number mod 10 in array at index i
divide number by 10 and store result in number
increment i
return int pointer
Anyway, I was wondering if there is a better, more efficient way to accomplish this task? If not, are there any alternative methods for this task that avoid the use of strings? C-style or otherwise?
Thanks. I ask because I'm going to be wanting to do this in a personal project of mine, and I would like to do it as efficiently as possible.
Any help and/or insight is greatly appreciated.
The time it takes to extract the digits will be dwarfed by the time required to dynamically allocate the array. Consider returning the result in a struct:
struct extracted_digits
{
int number_of_digits;
char digits[12];
};
You'll want to pick a suitable value for the maximum number of digits (12 here, which is enough for a 32-bit integer). Alternatively, you could return a std::array<char, 12> and encode the terminal by using an invalid value (so, after the last value, store a 10 or something else that isn't a digit).
Depending on whether you want to handle negative values, you'll also have to decide how to report the unary minus (-).
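A minimal sketch of the struct-returning version, assuming an unsigned 32-bit input so that the sign question does not arise. The function name get_digits is illustrative, and the digits are stored least significant first as numeric values 0..9, not as characters.

```cpp
#include <cassert>
#include <cstdint>

struct extracted_digits
{
    int number_of_digits;
    char digits[12];
};

// Extracts decimal digits without any dynamic allocation.
inline extracted_digits get_digits(std::uint32_t number) {
    extracted_digits result{};
    do {  // do-while so that 0 yields a single digit
        assert(result.number_of_digits < 12);
        result.digits[result.number_of_digits++] = static_cast<char>(number % 10);
        number /= 10;
    } while (number != 0);
    return result;
}
```

For 1234 this fills digits with 4, 3, 2, 1 and sets number_of_digits to 4.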
Unless you want the representation of the number in a base that's a power of 2, that's about the only way to do it.
Smacks of premature optimisation. If profiling proves it matters, then be sure to compare your algo to itoa - internally it may use some CPU instructions that you don't have explicit access to from C++, and which your compiler's optimiser may not be clever enough to employ (e.g. AAM, which divs while saving the mod result). Experiment (and benchmark) coding the assembler yourself. You might dig around for assembly implementations of ITOA (which isn't identical to what you're asking for, but might suggest the optimal CPU instructions).
By "avoiding the use of strings", I'm going to assume you're doing this because a string-only representation is pretty inefficient if you want an integer value.
To that end, I'm going to suggest a slightly unorthodox approach which may be suitable. Don't store them in one form, store them in both. The code below is in C - it will work in C++ but you may want to consider using c++ equivalents - the idea behind it doesn't change however.
By "storing both forms", I mean you can have a structure like:
typedef struct {
int ival;
char sval[sizeof("-2147483648")]; // enough for 32-bits
int dirtyS;
} tIntStr;
and pass around this structure (or its address) rather than the integer itself.
By having macros or inline functions like:
inline void intstrSetI (tIntStr *is, int ival) {
    is->ival = ival;
    is->dirtyS = 1;
}
inline char *intstrGetS (tIntStr *is) {
    if (is->dirtyS) {
        sprintf (is->sval, "%d", is->ival);
        is->dirtyS = 0;
    }
    return is->sval;
}
Then, to set the value, you would use:
tIntStr is;
intstrSetI (&is, 42);
And whenever you wanted the string representation:
printf ("%s\n", intstrGetS(&is));
fprintf (logFile, "%s\n", intstrGetS(&is));
This has the advantage of calculating the string representation only when needed (the fprintf above would not have to recalculate the string representation and the printf only if it was dirty).
This is a similar trick I use in SQL with using precomputed columns and triggers. The idea there is that you only perform calculations when needed. So an extra column to hold the indexed lowercased last name along with an insert/update trigger to calculate it, is usually a lot more efficient than select lower(non_lowercased_last_name). That's because it amortises the cost of the calculation (done at write time) across all reads.
In that sense, there's little advantage if your code profile is set-int/use-string/set-int/use-string.... But, if it's set-int/use-string/use-string/use-string/use-string..., you'll get a performance boost.
Granted this has a cost, at the bare minimum extra storage required, but most performance issues boil down to a space/time trade-off.
And, if you really want to avoid strings, you can still use the same method (calculate only when needed), it's just that the calculation (and structure) will be different.
As an aside: you may well want to use the library functions to do this rather than handcrafting your own code. Library functions will normally be heavily optimised, possibly more so than your compiler can make from your code (although that's not guaranteed of course).
It's also likely that an itoa, if you have one, will probably outperform sprintf("%d") as well, given its limited use case. You should, however, measure, not guess! Not just in terms of the library functions, but also this entire solution (and the others).
It's fairly trivial to see that a base-100 solution could work as well, using the "digits" 00-99. In each iteration, you'd do a %100 to produce such a digit pair, thus halving the number of steps. The tradeoff is that your digit table is now 200 bytes instead of 10. Still, it easily fits in L1 cache (obviously, this only applies if you're converting a lot of numbers, but otherwise efficiency is moot anyway). Also, you might end up with a leading zero, as in "0128".
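A sketch of the base-100 idea for unsigned 32-bit values, producing characters rather than raw digits. The function name to_decimal and the lazily built table are my own choices. The last step emits a single character when only one digit is left, which sidesteps the leading-zero issue.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

inline std::string to_decimal(std::uint32_t value) {
    // 200-byte table holding "00", "01", ..., "99", built once on first use.
    static const std::array<char, 200> pairs = []() -> std::array<char, 200> {
        std::array<char, 200> t{};
        for (std::size_t i = 0; i < 100; ++i) {
            t[2 * i] = static_cast<char>('0' + i / 10);
            t[2 * i + 1] = static_cast<char>('0' + i % 10);
        }
        return t;
    }();

    char buf[10];  // a 32-bit value has at most 10 decimal digits
    int pos = 10;  // the buffer is filled from the back
    while (value >= 100) {
        const std::uint32_t pair = value % 100;  // one division per two digits
        value /= 100;
        buf[--pos] = pairs[2 * pair + 1];
        buf[--pos] = pairs[2 * pair];
    }
    if (value >= 10) {
        buf[--pos] = pairs[2 * value + 1];
        buf[--pos] = pairs[2 * value];
    } else {
        buf[--pos] = static_cast<char>('0' + value);  // single leading digit
    }
    assert(pos >= 0);
    return std::string(buf + pos, buf + 10);
}
```

Whether this beats a plain %10 loop or the library routines on a given machine is something to measure, not assume.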
Yes, there is a more efficient way, though it is not portable. Intel's FPU has a special BCD number format, so all you have to do is execute the corresponding assembler instruction, which converts ST(0) to BCD format and stores the result in memory. The instruction name is FBSTP.
Mathematically speaking, the number of characters needed to write a nonzero integer a in decimal is 1+int(log10(abs(a)))+(a<0), where the last term accounts for the minus sign. Zero needs one digit and has to be handled separately, because log10(0) is undefined.
You will not use strings, but go through floating point and the log functions. If your platform has any kind of FP accelerator (every PC or similar has one) that will not be a big deal, and will beat any "string based" algorithm (which is nothing more than an iterative divide by ten and count).