Filling unordered_set is too slow

Filling unordered_set is too slow - c++

We have a given 3D-mesh and we are trying to eliminate identical vertexes. For this we are using a self defined struct containing the coordinates of a vertex and the corresponding normal.
struct vertice
{
float p1,p2,p3,n1,n2,n3;
bool operator == (const vertice& vert) const
{
return (p1 == vert.p1 && p2 == vert.p2 && p3 == vert.p3);
}
};
After filling the vertex with data, it is added to an unordered_set to remove the duplicates.
struct hashVertice
{
size_t operator () (const vertice& vert) const
{
return(7*vert.p1 + 13*vert.p2 + 11*vert.p3);
}
};
std::unordered_set<vertice,hashVertice> verticesSet;
vertice vert;
while(i<(scene->mMeshes[0]->mNumVertices)){
vert.p1 = (float)scene->mMeshes[0]->mVertices[i].x;
vert.p2 = (float)scene->mMeshes[0]->mVertices[i].y;
vert.p3 = (float)scene->mMeshes[0]->mVertices[i].z;
vert.n1 = (float)scene->mMeshes[0]->mNormals[i].x;
vert.n2 = (float)scene->mMeshes[0]->mNormals[i].y;
vert.n3 = (float)scene->mMeshes[0]->mNormals[i].z;
verticesSet.insert(vert);
i = i+1;
}
We discovered that it is too slow for data amounts like 3.000.000 vertexes. Even after 15 minutes of running the program wasn't finished. Is there a bottleneck we don't see or is another data structure better for such a task?

What happens if you just remove verticesSet.insert(vert); from the loop?
If it speeds-up dramatically (as I expect it would), your bottleneck is in the guts of the std::unordered_set, which is a hash-table, and the main potential performance problem with hash tables is when there are excessive hash collisions.
In your current implementation, if p1, p2 and p3 are small, the number of distinct hash codes will be small (since you "collapse" float to integer) and there will be lots of collisions.
If the above assumptions turn out to be true, I'd try to implement the hash function differently (e.g. multiply with much larger coefficients).
Other than that, profile your code, as others have already suggested.

Hashing floating point can be tricky. In particular, your hash
routine calculates the hash as a floating point value, then
converts it to an unsigned integral type. This has serious
problems if the vertices can be small: if all of the vertices
are in the range [0...1.0), for example, your hash function
will never return anything greater than 13. As an unsigned
integer, which means that there will be at most 13 different
hash codes.
The usual way to hash floating point is to hash the binary
image, checking for the special cases first. (0.0 and -0.0
have different binary images, but must hash the same. And it's
an open question what you do with NaNs.) For float this is
particularly simple, since it usually has the same size as
int, and you can reinterpret_cast:
size_t
hash( float f )
{
assert( /* not a NaN */ );
return f == 0.0 ? 0.0 : reinterpret_cast( unsigned& )( f );
}
I know, formally, this is undefined behavior. But if float and
int have the same size, and unsigned has no trapping
representations (the case on most general purpose machines
today), then a compiler which gets this wrong is being
intentionally obtuse.
You then use any combining algorithm to merge the three results;
the one you use is as good as any other (in this case—it's
not a good generic algorithm).
I might add that while some of the comments insist on profiling
(and this is generally good advice), if you're taking 15 minutes
for 3 million values, the problem can really only be a poor hash
function, which results in lots of collisions. Nothing else will
cause that bad of performance. And unless you're familiar with
the internal implementation of std::unordered_set, the usual
profiler output will probably not give you much information.
On the other hand, std::unordered_set does have functions
like bucket_count and bucket_size, which allow analysing
the quality of the hash function. In your case, if you cannot
create an unordered_set with 3 million entries, your first
step should be to create a much smaller one, and use these
functions to evaluate the quality of your hash code.

If there is a bottleneck, you are definitely not seeing it, because you don't include any kind of timing measures.
Measure the timing of your algorithm, either with a profiler or just manually. This will let you find the bottleneck - if there is one.
This is the correct way to proceed. Expecting yourself, or alternatively, StackOverflow users to spot bottlenecks by eye inspection instead of actually measuring time in your program is, from my experience, the most common cause of failed attempts at optimization.

Related

Which container is most efficient for multiple insertions / deletions in C++?

I was set a homework challenge as part of an application process (I was rejected, by the way; I wouldn't be writing this otherwise) in which I was to implement the following functions:
// Store a collection of integers
class IntegerCollection {
public:
// Insert one entry with value x
void Insert(int x);
// Erase one entry with value x, if one exists
void Erase(int x);
// Erase all entries, x, from <= x < to
void Erase(int from, int to);
// Return the count of all entries, x, from <= x < to
size_t Count(int from, int to) const;
The functions were then put through a bunch of tests, most of which were trivial. The final test was the real challenge as it performed 500,000 single insertions, 500,000 calls to count and 500,000 single deletions.
The member variables of IntegerCollection were not specified and so I had to choose how to store the integers. Naturally, an STL container seemed like a good idea and keeping it sorted seemed an easy way to keep things efficient.
Here is my code for the four functions using a vector:
// Previous bit of code shown goes here
private:
std::vector<int> integerCollection;
};
void IntegerCollection::Insert(int x) {
/* using lower_bound to find the right place for x to be inserted
keeps the vector sorted and makes life much easier */
auto it = std::lower_bound(integerCollection.begin(), integerCollection.end(), x);
integerCollection.insert(it, x);
}
void IntegerCollection::Erase(int x) {
// find the location of the first element containing x and delete if it exists
auto it = std::find(integerCollection.begin(), integerCollection.end(), x);
if (it != integerCollection.end()) {
integerCollection.erase(it);
}
}
void IntegerCollection::Erase(int from, int to) {
if (integerCollection.empty()) return;
// lower_bound points to the first element of integerCollection >= from/to
auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
/* std::vector::erase deletes entries between the two pointers
fromBound (included) and toBound (not indcluded) */
integerCollection.erase(fromBound, toBound);
}
size_t IntegerCollection::Count(int from, int to) const {
if (integerCollection.empty()) return 0;
int count = 0;
// lower_bound points to the first element of integerCollection >= from/to
auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
// increment pointer until fromBound == toBound (we don't count elements of value = to)
while (fromBound != toBound) {
++count; ++fromBound;
}
return count;
}
The company got back to me saying that they wouldn't be moving forward because my choice of container meant the runtime complexity was too high. I also tried using list and deque and compared the runtime. As I expected, I found that list was dreadful and that vector took the edge over deque. So as far as I was concerned I had made the best of a bad situation, but apparently not!
I would like to know what the correct container to use in this situation is? deque only makes sense if I can guarantee insertion or deletion to the ends of the container and list hogs memory. Is there something else that I'm completely overlooking?

We cannot know what would make the company happy. If they reject std::vector without concise reasoning I wouldn't want to work for them anyway. Moreover, we dont really know the precise requirements. Were you asked to provide one reasonably well performing implementation? Did they expect you to squeeze out the last percent of the provided benchmark by profiling a bunch of different implementations?
The latter is probably too much for a homework challenge as part of an application process. If it is the first you can either
roll your own. It is unlikely that the interface you were given can be implemented more efficiently than one of the std containers does... unless your requirements are so specific that you can write something that performs well under that specific benchmark.
std::vector for data locality. See eg here for Bjarne himself advocating std::vector rather than linked lists.
std::set for ease of implementation. It seems like you want the container sorted and the interface you have to implement fits that of std::set quite well.
Let's compare only isertion and erasure assuming the container needs to stay sorted:
operation std::set std::vector
insert log(N) N
erase log(N) N
Note that the log(N) for the binary_search to find the position to insert/erase in the vector can be neglected compared to the N.
Now you have to consider that the asymptotic complexity listed above completely neglects the non-linearity of memory access. In reality data can be far away in memory (std::set) leading to many cache misses or it can be local as with std::vector. The log(N) only wins for huge N. To get an idea of the difference 500000/log(500000) is roughly 26410 while 1000/log(1000) is only ~100.
I would expect std::vector to outperform std::set for considerably small container sizes, but at some point the log(N) wins over cache. The exact location of this turning point depends on many factors and can only reliably determined by profiling and measuring.

Nobody knows which container is MOST efficient for multiple insertions / deletions. That is like asking what is the most fuel-efficient design for a car engine possible. People are always innovating on the car engines. They make more efficient ones all the time. However, I would recommend a splay tree. The time required for a insertion or deletion is a splay tree is not constant. Some insertions take a long time and some take only a very a short time. However, the average time per insertion/deletion is always guaranteed to be be O(log n), where n is the number of items being stored in the splay tree. logarithmic time is extremely efficient. It should be good enough for your purposes.

The first thing that comes to mind is to hash the integer value so single look ups can be done in constant time.
The integer value can be hashed to compute an index in to an array of bools or bits, used to tell if the integer value is in the container or not.
Counting and and deleting large ranges could be sped up from there, by using multiple hash tables for specific integer ranges.
If you had 0x10000 hash tables, that each stored ints from 0 to 0xFFFF and were using 32 bit integers you could then mask and shift the upper half of the int value and use that as an index to find the correct hash table to insert / delete values from.
IntHashTable containers[0x10000];
u_int32 hashIndex = (u_int32)value / 0x10000;
u_int32int valueInTable = (u_int32)value - (hashIndex * 0x10000);
containers[hashIndex].insert(valueInTable);
Count for example could be implemented as so, if each hash table kept count of the number of elements it contained:
indexStart = startRange / 0x10000;
indexEnd = endRange / 0x10000;
int countTotal = 0;
for (int i = indexStart; i<=indexEnd; ++i) {
countTotal += containers[i].count();
}

Not sure if using sorting really is a requirement for removing the range. It might be based on position. Anyway, here is a link with some hints which STL container to use.
In which scenario do I use a particular STL container?
Just FYI.
Vector maybe a good choice, but it does a lot of re allocation, as you know. I prefer deque instead, as it doesn't require big chunk of memory to allocate all items. For such requirement as you had, list probably fit better.

Basic solution for this problem might be std::map<int, int>
where key is the integer you are storing and value is the number of occurences.
Problem with this is that you can not quickly remove/count ranges. In other words complexity is linear.
For quick count you would need to implement your own complete binary tree where you can know the number of nodes between 2 nodes(upper and lower bound node) because you know the size of tree, and you know how many left and right turns you took to upper and lower bound nodes. Note that we are talking about complete binary tree, in general binary tree you can not make this calculation fast.
For quick range remove I do not know how to make it faster than linear.

Any way to "factor out" common fields to save space?

I have a large array (> millions) of Items, where each Item has the form:
struct Item { void *a; size_t b; };
There are a handful of distinct a fields—meaning there are many items with the same a field.
I would like to "factor" this information out to save about 50% memory usage.
However, the trouble is that these Items have a significant ordering, and that may change over time. Therefore, I can't just go ahead make a separate Item[] for each distinct a, because that will lose the relative ordering of the items with respect to each other.
On the other hand, if I store the orderings of all the items in a size_t index; field, then I lose any memory savings from the removal of the void *a; field.
So is there a way for me to actually save memory here, or no?
(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way. That one will require me to either use unaligned memory or to split every Item[] into two, which isn't great for memory locality, so I'd prefer something else.)

(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way.)
This thinking is on the right track, but it's not that simple, since you will run into some nasty alignment/padding issues that will negate your memory gains.
At that point, when you start trying to scratch the last few bytes of a structure like this, you will probably want to use bit fields.
#define A_INDEX_BITS 3
struct Item {
size_t a_index : A_INDEX_BITS;
size_t b : (sizeof(size_t) * CHAR_BIT) - A_INDEX_BITS;
};
Note that this will limit how many bits are available for b, but on modern platforms, where sizeof(size_t) is 8, stripping 3-4 bits from it is rarely an issue.

Use a combination of lightweight compression schemes (see this for examples and some references) to represent the a* values. #Frank's answer employes DICT followed by NS, for example. If you have long runs of the same pointer, you could consider RLE (Run-Length Encoding) on top of that.

This is a bit of a hack, but I've used it in the past with some success. The extra overhead for object access was compensated for by the significant memory reduction.
A typical use case is an environment where (a) values are actually discriminated unions (that is, they include a type indicator) with a limited number of different types and (b) values are mostly kept in large contiguous vectors.
With that environment, it is quite likely that the payload part of (some kinds of) values uses up all the bits allocated for it. It is also possible that the datatype requires (or benefits from) being stored in aligned memory.
In practice, now that aligned access is not required by most mainstream CPUs, I would just used a packed struct instead of the following hack. If you don't pay for unaligned access, then storing a { one-byte type + eight-byte value } as nine contiguous bytes is probably optimal; the only cost is that you need to multiply by 9 instead of 8 for indexed access, and that is trivial since the 9 is a compile-time constant.
If you do have to pay for unaligned access, then the following is possible. Vectors of "augmented" values have the type:
// Assume that Payload has already been typedef'd. In my application,
// it would be a union of, eg., uint64_t, int64_t, double, pointer, etc.
// In your application, it would be b.
// Eight-byte payload version:
typedef struct Chunk8 { uint8_t kind[8]; Payload value[8]; }
// Four-byte payload version:
typedef struct Chunk4 { uint8_t kind[4]; Payload value[4]; }
Vectors are then vectors of Chunks. For the hack to work, they must be allocated on 8- (or 4-)byte aligned memory addresses, but we've already assumed that alignment is required for the Payload types.
The key to the hack is how we represent a pointer to an individual value, because the value is not contiguous in memory. We use a pointer to it's kind member as a proxy:
typedef uint8_t ValuePointer;
And then use the following low-but-not-zero-overhead functions:
#define P_SIZE 8U
#define P_MASK P_SIZE - 1U
// Internal function used to get the low-order bits of a ValuePointer.
static inline size_t vpMask(ValuePointer vp) {
return (uintptr_t)vp & P_MASK;
}
// Getters / setters. This version returns the address so it can be
// used both as a getter and a setter
static inline uint8_t* kindOf(ValuePointer vp) { return vp; }
static inline Payload* valueOf(ValuePointer vp) {
return (Payload*)(vp + 1 + (vpMask(vp) + 1) * (P_SIZE - 1));
}
// Increment / Decrement
static inline ValuePointer inc(ValuePointer vp) {
return vpMask(++vp) ? vp : vp + P_SIZE * P_SIZE;
}
static inline ValuePointer dec(ValuePointer vp) {
return vpMask(vp--) ? vp - P_SIZE * P_SIZE : vp;
}
// Simple indexed access from a Chunk pointer
static inline ValuePointer eltk(Chunk* ch, size_t k) {
return &ch[k / P_SIZE].kind[k % P_SIZE];
}
// Increment a value pointer by an arbitrary (non-negative) amount
static inline ValuePointer inck(ValuePointer vp, size_t k) {
size_t off = vpMask(vp);
return eltk((Chunk*)(vp - off), k + off);
}
I left out a bunch of the other hacks but I'm sure you can figure them out.
One cool thing about interleaving the pieces of the value is that it has moderately good locality of reference. For the 8-byte version, almost half of the time a random access to a kind and a value will only hit one 64-byte cacheline; the rest of the time two consecutive cachelines are hit, with the result that walking forwards (or backwards) through a vector is just as cache-friendly as walking through an ordinary vector, except that it uses fewer cachelines because the objects are half the size. The four byte version is even cache-friendlier.

I think I figured out the information-theoretically-optimal way to do this myself... it's not quite worth the gains in my case, but I'll explain it here in case it helps someone else.
However, it requires unaligned memory (in some sense).
And perhaps more importantly, you lose the ability easily add new values of a dynamically.
What really matters here is the number of distinct Items, i.e. the number of distinct (a,b) pairs. After all, it could be that for one a there are a billion different bs, but for the other ones there are only a handful, so you want to take advantage of that.
If we assume that there are N distinct items to choose from, then we need n = ceil(log2(N)) bits to represent each Item. So what we really want is an array of n-bit integers, with n computed at run time. Then, once you get the n-bit integer, you can do a binary search in log(n) time to figure out which a it corresponds to, based on your knowledge of the count of bs for each a. (This may be a bit of a performance hit, but it depends on the number of distinct as.)
You can't do this in a nice memory-aligned fashion, but that isn't too bad. What you would do is make a uint_vector data structure with the number of bits per element being a dynamically-specifiable quantity. Then, to randomly access into it, you'd do a few divisions or mod operations along with bit-shifts to extract the required integer.
The caveat here is that the dividing by a variable will probably severely damage your random-access performance (although it'll still be O(1)). The way to mitigate that would probably be to write a few different procedures for common values of n (C++ templates help here!) and then branch into them with various if (n == 33) { handle_case<33>(i); } or switch (n) { case 33: handle_case<33>(i); }, etc. so that the compiler sees the divisor as a constant and generates shifts/adds/multiplies as needed, rather than division.
This is information-theoretically optimal as long as you require a constant number of bits per element, which is what you would want for random-accessing. However, you could do better if you relax that constraint: you could pack multiple integers into k * n bits, then extract them with more math. This will probably kill performance too.
(Or, long story short: C and C++ really need a high-performance uint_vector data structure...)

An Array-of-Structures approach may be helpful. That is, have three vectors...
vector<A> vec_a;
vector<B> vec_b;
SomeType b_to_a_map;
You access your data as...
Item Get(int index)
{
Item retval;
retval.a = vec_a[b_to_a_map[index]];
retval.b = vec_b[index];
return retval;
}
Now all you need to do is choose something sensible for SomeType. For example, if vec_a.size() were 2, you could use vector<bool> or boost::dynamic_bitset. For more complex cases you could try bit-packing, for example to support 4-values of A, we simple change our function with...
int a_index = b_to_a_map[index*2]*2 + b_to_a_map[index*2+1];
retval.a = vec_a[a_index];
You can always beat bit-packing by using range-packing, using div/mod to store a fractional bit length per item, but the complexity grows quickly.
A good guide can be found here http://number-none.com/product/Packing%20Integers/index.html

Floating point keys in std:map

The following code is supposed to find the key 3.0in a std::map which exists. But due to floating point precision it won't be found.
map<double, double> mymap;
mymap[3.0] = 1.0;
double t = 0.0;
for(int i = 0; i < 31; i++)
{
t += 0.1;
bool contains = (mymap.count(t) > 0);
}
In the above example, contains will always be false.
My current workaround is just multiply t by 0.1 instead of adding 0.1, like this:
for(int i = 0; i < 31; i++)
{
t = 0.1 * i;
bool contains = (mymap.count(t) > 0);
}
Now the question:
Is there a way to introduce a fuzzyCompare to the std::map if I use double keys?
The common solution for floating point number comparison is usually something like a-b < epsilon. But I don't see a straightforward way to do this with std::map.
Do I really have to encapsulate the double type in a class and overwrite operator<(...) to implement this functionality?

So there are a few issues with using doubles as keys in a std::map.
First, NaN, which compares less than itself is a problem. If there is any chance of NaN being inserted, use this:
struct safe_double_less {
bool operator()(double left, double right) const {
bool leftNaN = std::isnan(left);
bool rightNaN = std::isnan(right);
if (leftNaN != rightNaN)
return leftNaN<rightNaN;
return left<right;
}
};
but that may be overly paranoid. Do not, I repeat do not, include an epsilon threshold in your comparison operator you pass to a std::set or the like: this will violate the ordering requirements of the container, and result in unpredictable undefined behavior.
(I placed NaN as greater than all doubles, including +inf, in my ordering, for no good reason. Less than all doubles would also work).
So either use the default operator<, or the above safe_double_less, or something similar.
Next, I would advise using a std::multimap or std::multiset, because you should be expecting multiple values for each lookup. You might as well make content management an everyday thing, instead of a corner case, to increase the test coverage of your code. (I would rarely recommend these containers) Plus this blocks operator[], which is not advised to be used when you are using floating point keys.
The point where you want to use an epsilon is when you query the container. Instead of using the direct interface, create a helper function like this:
// works on both `const` and non-`const` associative containers:
template<class Container>
auto my_equal_range( Container&& container, double target, double epsilon = 0.00001 )
-> decltype( container.equal_range(target) )
{
auto lower = container.lower_bound( target-epsilon );
auto upper = container.upper_bound( target+epsilon );
return std::make_pair(lower, upper);
}
which works on both std::map and std::set (and multi versions).
(In a more modern code base, I'd expect a range<?> object that is a better thing to return from an equal_range function. But for now, I'll make it compatible with equal_range).
This finds a range of things whose keys are "sufficiently close" to the one you are asking for, while the container maintains its ordering guarantees internally and doesn't execute undefined behavior.
To test for existence of a key, do this:
template<typename Container>
bool key_exists( Container const& container, double target, double epsilon = 0.00001 ) {
auto range = my_equal_range(container, target, epsilon);
return range.first != range.second;
}
and if you want to delete/replace entries, you should deal with the possibility that there might be more than one entry hit.
The shorter answer is "don't use floating point values as keys for std::set and std::map", because it is a bit of a hassle.
If you do use floating point keys for std::set or std::map, almost certainly never do a .find or a [] on them, as that is highly highly likely to be a source of bugs. You can use it for an automatically sorted collection of stuff, so long as exact order doesn't matter (ie, that one particular 1.0 is ahead or behind or exactly on the same spot as another 1.0). Even then, I'd go with a multimap/multiset, as relying on collisions or lack thereof is not something I'd rely upon.
Reasoning about the exact value of IEEE floating point values is difficult, and fragility of code relying on it is common.

Here's a simplified example of how using soft-compare (aka epsilon or almost equal) can lead to problems.
Let epsilon = 2 for simplicity. Put 1 and 4 into your map. It now might look like this:
1
\
4
So 1 is the tree root.
Now put in the numbers 2, 3, 4 in that order. Each will replace the root, because it compares equal to it. So then you have
4
\
4
which is already broken. (Assume no attempt to rebalance the tree is made.) We can keep going with 5, 6, 7:
7
\
4
and this is even more broken, because now if we ask whether 4 is in there, it will say "no", and if we ask for an iterator for values less than 7, it won't include 4.
Though I must say that I've used maps based on this flawed fuzzy compare operator numerous times in the past, and whenever I digged up a bug, it was never due to this. This is because datasets in my application areas never actually amount to stress-testing this problem.

As Naszta says, you can implement your own comparison function. What he leaves out is the key to making it work - you must make sure that the function always returns false for any values that are within your tolerance for equivalence.
return (abs(left - right) > epsilon) && (left < right);
Edit: as pointed out in many comments to this answer and others, there is a possibility for this to turn out badly if the values you feed it are arbitrarily distributed, because you can't guarantee that !(a<b) and !(b<c) results in !(a<c). This would not be a problem in the question as asked, because the numbers in question are clustered around 0.1 increments; as long as your epsilon is large enough to account for all possible rounding errors but is less than 0.05, it will be reliable. It is vitally important that the keys to the map are never closer than 2*epsilon apart.

You could implement own compare function.
#include <functional>
class own_double_less : public std::binary_function<double,double,bool>
{
public:
own_double_less( double arg_ = 1e-7 ) : epsilon(arg_) {}
bool operator()( const double &left, const double &right ) const
{
// you can choose other way to make decision
// (The original version is: return left < right;)
return (abs(left - right) > epsilon) && (left < right);
}
double epsilon;
};
// your map:
map<double,double,own_double_less> mymap;
Updated: see Item 40 in Effective STL!
Updated based on suggestions.

Using doubles as keys is not useful. As soon as you make any arithmetic on the keys you are not sure what exact values they have and hence cannot use them for indexing the map. The only sensible usage would be that the keys are constant.

Optimization of Point to Voxel mapping

I used a profiler to look over some code which does not yet run fast enough. It found that the following function took most of the time, and half of the time in this function was spent in floor. Now, there are two possibilities: optimizing this function or going one level above and reducing the calls to this function. I wonder, if the first one is possible.
int Sph::gridIndex (Vector3 position) const {
int mx = ((int)floor(position.x / _gridIntervalSize) % _gridSize);
int my = ((int)floor(position.y / _gridIntervalSize) % _gridSize);
int mz = ((int)floor(position.z / _gridIntervalSize) % _gridSize);
if (mx < 0) {
mx += _gridSize;
}
if (my < 0) {
my += _gridSize;
}
if (mz < 0) {
mz += _gridSize;
}
int x = mx * _gridSize * _gridSize;
int y = my * _gridSize;
int z = mz * 1;
return x + y + z;
}
Vector3 is just some simple class which stores three floats and provides some overloaded operators. _gridSize is of type int and _gridIntervalSize is a float. There are _gridSize ^ 3 buckets.
The purpose of the function is to provide hash table support. Every 3d-point is mapped to an index, and points which lie in the same voxel of size _gridIntervalSize ^ 3 should land in the same bucket.

First rule of optimization when there is math involved: Eliminate division, square roots, and trig functions.
inverse_size = 1 / _gridIntervalSize;
....that should be done only once, not once per call.
int mx = ((int)floor(position.x * inverse_size) % _gridSize);
int my = ((int)floor(position.y * inverse_size) % _gridSize);
int mz = ((int)floor(position.z * inverse_size) % _gridSize);
I would also recommend dropping the mod operation because that's another division - if your grid size is a power of 2 you can use & (gridsize-1) which will also allow you to delete the conditional code at the bottom which is another big savings.
On another note, using overloaded operators may be hurting you. This is a touchy subject here so I'll let you experiment with it and decide for yourself.

I assume you use floor because negative values are possible, and because you don't want an anomaly due to the default truncation when you cast to int (values rounding toward zero from both sides, making some oversized voxels).
If you can specify a safe most-negative value for each value in the vector, you could subtract that (negative) value, or rather the nearest more-negative multiple of _gridIntervalSize, before the cast, and drop the floor.
Using fmod may ensure you have a safe most-negative value, and replace the integer %, but it's probably an anti-optimisation. Still, as a quick change, it may be worth checking.
Also, check whether your platform supports vector instructions, and whether your compiler can easily be encouraged to use them. x86 chips certainly have integer vector instructions as well as float (the old Pentium 1 MMX instructions, for a start) and might be able to handle this much more efficiently than the "normal" CPU instruction set. This may even be a case for digging out the list of vector instruction intrinsics for your compiler and doing some hand-optimisation. Just check what the compiler can do for you first - I'm not sure how much of this kind of optimisation compilers will do for you already.
One probably trivial piece of micro-optimisation...
return (mx * _gridSize + my) * _gridSize + mz;
Saves one integer multiplication. Trivial, of course, and the compiler may catch it anyway, but this is an old habitual thing.
Oh - watch the leading underscores. Those are reserved identifiers. Not likely to cause a problem, but you can't complain if they do.
EDIT
Another way to avoid the floor is to handle positive and negative separately. If you are willing to accept that items bang-on-the-edge of a grid cell may be in the wrong cell (possible anyway since floats should be considered approximate). Just apply a -1 offset in the negative case, to pull it away from the zero by almost exactly right amount to compensate for the truncation. You might consider a bit-fiddling increment-the-mantissa afterwards (to get already integer values in the cell you'd expect) but this is probably unnecessary.
If you can impose power-of-two limitations to your sizes, there may be a bit-fiddling way to efficiently extract the grid position from a float, avoiding some or all of the multiply, floor and % for each of x, y and z, assuming a standard floating point representation (ie this is non-portable). Again, handle positive and negative separately. Extract the exponent, bit-shift the mantissa accordingly, then mask out unwanted bits.

I think you need to look higher up the hierarchy to get real speed improvements. That is, is storing points in a hash-map really the most efficent solution? I assume you have an array of Vector3 arrays, i.e:
Vector3 *points [size][size][size]
where each element in the 3D array is an array of Vector3.
The algorithm you're using doesn't guarantee uniform distribution of points in each Vector3 array, which may be a problem. A cluster of points within _gridIntervalSize will map to the same array.
An alternative method would be to use oct-trees, which are like binary trees but each node has eight child nodes. Each node requires the min/max x/y/z values to define the volume the node covers. To add values to the tree:
Recursive search tree to find smallest node that can contain point
Add point to node
If number of points in node > upper limit to number of points in a node
Create child nodes and move points to child nodes
You may want to use quad-trees if there is little variation in values along a particular axis. Another method is to use BSPs - divide the world into two halves and recurse to find the container to add your point to. Again, these can be dynamic.
Converting the floats to ints and having the division planes lie on integer values will speed up the process as well.
Googling the above terms will lead you to more in depth analysis of the algorithms.
Finally, using floats (or doubles) for co-ordinates in an infinite plane is a bad idea - the further you get from (0,0,0) the less precision you have (the gaps between floating point values increases as the value increases). You will need to 'reset' the floating point values to keep the precision. One method is to 'tile' the space and change the co-ordinates to use integer and floating point parts. The integer part defines the 'tile' and the floating point part defines the position in the tile. This method gets you a much simpler hashing method - just use the integer parts, no call to floor required and only integer calculations required. Another approach is to use fixed-point values rather than floating point values, but this would constrain your precision. This would make calculations accross tile boundaries much easier.
If you could expand on what the top-level requriements of your coordinate system is, there are probably better algorithms available to you.

If-else-if versus map

Suppose I have such an if/else-if chain:
if( x.GetId() == 1 )
{
}
else if( x.GetId() == 2 )
{
}
// ... 50 more else if statements
What I wonder is, if I keep a map, will it be any better in terms of performance? (assuming keys are integers)

Maps (usually) are implemented using red-black trees which gives O(log N) lookups as the tree is constantly kept in balance. Your linear list of if statements will be O(N) worst case. So, yes a map would be significantly faster for lookup.
Many people are recommending using a switch statement, which may not be faster for you, depending on your actual if statements. A compiler can sometimes optimize switch by using a jump table which would be O(1), but this is only possible for values that an undefined criteria; hence this behavior can be somewhat nondeterministic. Though there is a great article with a few tips on optimizing switch statements here Optimizing C and C++ Code.
You technically could even formulate a balanced tree manually, this works best for static data and I happened to just recently create a function to quickly find which bit was set in a byte (This was used in an embedded application on an I/O pin interrupt and had to be quick when 99% of the time only 1 bit would be set in the byte):
unsigned char single_bit_index(unsigned char bit) {
// Hard-coded balanced tree lookup
if(bit > 0x08)
if(bit > 0x20)
if(bit == 0x40)
return 6;
else
return 7;
else
if(bit == 0x10)
return 4;
else
return 5;
else
if(bit > 0x02)
if(bit == 0x04)
return 2;
else
return 3;
else
if(bit == 0x01)
return 0;
else
return 1;
}
This gives a constant lookup in 3 steps for any of the 8 values which gives me very deterministic performance, a linear search -- given random data -- would average 4 step lookups, with a best-case of 1 and worst-case of 8 steps.
This is a good example of a range that a compiler would probably not optimize to a jump table since the 8 values I am searching for are so far apart: 1, 2, 4, 8, 16, 32, 64, and 128. It would have to create a very sparse 128 position table with only 8 elements containing a target, which on a PC with a ton of RAM might not be a big deal, but on a microcontroller it'd be killer.

why dont you use a a switch ?
swich(x.GetId())
{
case 1: /* do work */ break; // From the most used case
case 2: /* do work */ break;
case ...: // To the less used case
}
EDIT:
Put the most frequently used case in the top of the switch (This can have some performance issue if x.GetId is generally equal to 50)

switch is the best thing I think

The better solution would be a switch statement. This will allow you to check the value of x.GetId() just once, rather than (on average) 25 times as your code is doing now.
If you want to get fancy, you can use a data structure containing pointers to functions that handle whatever it is that's in the braces. If your ID values are consecutive (i.e. numbers between 1 and 50) then an array of function pointers would be best. If they are spread out, then a map would be more appropriate.

The answer, as with most performance related questions, is maybe.
If the IDs are in a fortunate range, a switch might become a jump-table, providing constant time lookups to all IDs. You won't get much better than this, short of redesigning. Alternatively, if the IDs are consecutive but you don't get a jump-table out of the compiler, you can force the issue by filling an array with function pointers.
[from here on out, switch refers to a generic if/else chain]
A map provides worst-case logarithmic lookup for any given ID, while a switch can only guarantee linear. However, if the IDs are not random, sorting the switch cases by usage might ensure the worst-case scenario is sufficiently rare that this doesn't matter.
A map will incur some initial overhead when loading the IDs and associating them with the functions, and then incur a the overhead of calling a function pointer every time you access an ID. A switch incurs additional overhead when writing the routine, and possibly significant overhead when debugging it.
Redesigning might allow you to avoid the question all together. No matter how you implement it, this smells like trouble. I can't help but think there's a better way to handle this.

If I really had a potential switch of fifty possibilities, I'd definitely think about a vector of pointers to functions.
#include <cstdio>
#include <cstdlib>
#include <ctime>
const unsigned int Max = 4;
void f1();
void f2();
void f3();
void f4();
void (*vp[Max])();
int main()
{
vp[ 0 ] = f1;
vp[ 1 ] = f2;
vp[ 2 ] = f3;
vp[ 3 ] = f4;
srand( std::time( NULL ) );
vp[( rand() % Max )]();
}
void f1()
{
std::printf( "Hello from f1!\n" );
}
void f2()
{
std::printf( "Hello from f2!\n" );
}
void f3()
{
std::printf( "Hello from f3!\n" );
}
void f4()
{
std::printf( "Hello from f4!\n" );
}

There are a lot of suggestions involving switch-case. In terms of efficiency, this might be better, might be the same. Won't be worse.
But if you're just setting/returning a value or name based on the ID, then YES. A map is exactly what you need. STL containers are optimised, and if you think you can optimise better, then you are either incredibly smart or staggeringly dumb.
e.g A single call using a std::map called mymap,
thisvar = mymap[x.getID()];
is much better than 50 of these
if(x.getID() == ...){thisvar = ...;}
because it's more efficient as the number of IDs increases. If you're interested in why, search for a good primer on data structures.
But what I'd really look at here is maintenance/fixing time. If you need to change the name of the variable, or change from using getID() or getName(), or make any kind of minor change, you've got to do it FIFTY TIMES in your example. And you need a new line every time you add an ID.
The map reduces that to one code change NO MATTER HOW MANY IDs YOU HAVE.
That said, if you're actually carrying out different actions for each ID, a switch-case might be better. With switch-case rather than if statements, you can improve performance and readability. See here: Advantage of switch over if-else statement
I'd avoid pointers to functions unless you're very clear on how they'd improve your code, because if you're not 100% certain what you're doing, the syntax can be messed up, and it's overkill for anything you'd feasibly use a map for.
Basically, I'd be interested in the problem you're trying to solve. You might be better off with a map or a switch-case, but if you think you can use a map, that is ABSOLUTELY what you should be using instead.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js