Fastest way to find if a 3D coordinate is already used - c++

Using C++ (and Qt), I need to process a big amount of 3D coordinates.
Specifically, when I receive a 3D coordinate (made of 3 doubles), I need to check in a list if this coordinate has already been processed.
If not, then I process it and add it to the list (or container).
The amount of coordinates can become very big, so I need to store the processed coordinates in a container which will ensure that checking if a 3D coordinate is already contained in the container is fast.
I was thinking of using a map of a map of a map, storing the x coordinate, then the y coordinate then the z coordinate, but this makes it quite tedious to use, so I'm actually hoping there is a much better way to do it that I cannot think of.

Probably the simplest way to speed up such processing is to store the already-processed points in Octree. Checking for duplication will become close to logarithmic.
Also, make sure you tolerate round-off errors by checking the distance between the points, not the equality of the coordinates.

Divide your space into discrete bins. Could be infinitely deep squares, or could be cubes. Store your processed coordinates in a simple linked list, sorted if you like in each bin. When you get a new coordinate, jump to the enclosing bin, and walk the list looking for the new point.
Be wary of floating point comparisons. You need to either turn values into integers (say multiply by 1000 and truncate), or decide how close 2 values are to be considered equal.

You can easily use a set as follows:
#include <set>
#include <cassert>
const double epsilon(1e-8);
class Coordinate {
public:
Coordinate(double x, double y, double z) :
x_(x), y_(y), z_(z) {}
private:
double x_;
double y_;
double z_;
friend bool operator<(const Coordinate& cl, const Coordinate& cr);
};
bool operator<(const Coordinate& cl, const Coordinate& cr) {
if (cl.x_ < cr.x_ - epsilon) return true;
if (cl.x_ > cr.x_ + epsilon) return false;
if (cl.y_ < cr.y_ - epsilon) return true;
if (cl.y_ > cr.y_ + epsilon) return false;
if (cl.z_ < cr.z_ - epsilon) return true;
return false;
}
typedef std::set<Coordinate> Coordinates;
// Not thread safe!
// Return true if real processing is done
bool Process(const Coordinate& coordinate) {
static Coordinates usedCoordinates;
// Already processed?
if (usedCoordinates.find(coordinate) != usedCoordinates.end()) {
return false;
}
usedCoordinates.insert(coordinate);
// Here goes your processing code
return true;
}
// Test it
int main() {
assert(Process(Coordinate(1, 2, 3)));
assert(Process(Coordinate(1, 3, 3)));
assert(!Process(Coordinate(1, 3, 3)));
assert(!Process(Coordinate(1+epsilon/2, 2, 3)));
}

Assuming you already have a Coordinate class, add a hash function and maintain a hash_set of the coordinates.
Would look something like:
struct coord_eq
{
bool operator()(const Coordinate &s1, const Coordinate &s2) const
{
return s1 == s2;
// or: return s1.x() == s2.x() && s1.y() == s2.y() && s1.z() == s2.z();
}
};
struct coord_hash
{
size_t operator()(const Coordinate &s) const
{
union {double d, unsigned long ul} c[3];
c[0].d = s.x();
c[1].d = s.y();
c[2].d = s.z();
return static_cast<size_t> ((3 * c[0].ul) ^ (5 * c[1].ul) ^ (7 * c[2].ul));
}
};
std::hash_map<Coordinate, coord_hash, coord_eq> existing_coords;

Well, it depends on what's most important... if a tripple map is too tedious to use, then is implementing other data structures not worth the effort?
If you want to get around the uglyness of the tripple map solution, just wrap it up in another container class with an access function with three parameter, and hide all the messing around with maps internally in that.
If you're more worried about the runtime performance of this thing, storing the coordinates in an Octree might be a good idea.
Also worth mentioning is that doing these sorts of things with floats or doubles you should be very careful about precision -- if (0, 0, 0.01) the same coordinate as (0, 0, 0.01000001)? If it is, you'll need to look at the comparison functions you use, regardless of the data structure. That also depends on the source of your coordinates I guess.

Are you expecting/requiring exact matches? These might be hard to enforce with doubles. For example, if you have processed (1.0, 1.0, 1.0) and you then receive (0.9999999999999, 1.0, 1.0) would you consider it the same? If so, you will need to either apply some kind of approximation or else define error bounds.
However, to answer the question itself: the first method that comes to mind is to create a single index (either a string or a bitstring, depending how readable you want things to be). For example, create the string "(1.0,1.0,1.0)" and use that as the key to your map. This will make it easy to look up the map, keeps the code readable (and also lets you easily dump the contents of the map for debugging purposes) and gives you reasonable performance. If you need much faster performance you could use a hashing algorithm to combine the three coordinates numerically without going via a string.

Use any unique transformation of your 3D coordinates and store only the list of the results.
Example:
md5('X, Y, Z') is unique and you can store only the resulting string.
The hash is not a performant idea but you get the concept. Find any methematic unique transformation and you have it.
/Vey

Use an std::set. Define a type for the 3d coordinate (or use a boost::tuple) that has operator< defined. When adding elements, you can add it to the set, and if it was added, do your processing. If it was not added (because it already exists in there), do not do your processing.
However, if you are using doubles, be aware that your algorithm can potentially lead to unpredictable behavior. IE, is (1.0, 1.0, 1.0) the same as (1.0, 1.0, 1.000000001)?

How about using a boost::tuple for the coordinates, and storing the tuple as the index for the map?
(You may also need to do the divide-by-epsilon idea from this answer.)

Pick a constant to scale the coordinates by so that 1 unit describes an acceptably small box and yet the integer part of the largest component by magnitude will fit into a 32-bit integer; convert the X, Y and Z components of the result to integers and hash them together. Use that as a hash function for a map or hashtable (NOT as an array index, you need to deal with collisions).
You may also want to consider using a fudge factor when comparing the coordinates, since you may get floating point values which are only slightly different, and it is usually preferable to weld those together to avoid cracks when rendering.

If you write a helper class with a simple public interface, that greatly reduces the practical tedium of implementation details like use of a map<map<map<>>>. The beauty of encapsulation!
That said, you might be able to rig a hashmap to do the trick nicely. Just hash the three doubles together to get the key for the point as a whole. If you're concerned about to many collisions between points with symmetric coordinates (e.g., (1, 2, 3) and (3, 2, 1) and so on), just make the hash key asymmetric with respect to the x, y, and z coordinates, using bit shift or some such.

You could use a hash_set of any hashable type - for example, turn each tuple into a string "(x, y, z)". hash_set does fast lookups but handles collisions well.

Whatever your storage method, I would suggest you decide on an epsilon (minimum floating point distance that differentiates two coordinates), then divide all coordinates by the epsilon, round and store them as integers.

Something in this direction maybe:
struct Coor {
Coor(double x, double y, double z)
: X(x), Y(y), Z(z) {}
double X, Y, Z;
}
struct coords_thesame
{
bool operator()(const Coor& c1, const Coor& c2) const {
return c1.X == c2.X && c1.Y == c2.Y && c1.Z == c2.Z;
}
};
std::hash_map<Coor, bool, hash<Coor>, coords_thesame> m_SeenCoordinates;
Untested, use at your own peril :)

You can easily define a comparator for a one-level std::map, so that lookup becomes way less cumbersome. There is no reason of being afraid of that. The comparator defines an ordering of the _Key template argument of the map. It can then also be used for the multimap and set collections.
An example:
#include <map>
#include <cassert>
struct Point {
double x, y, z;
};
struct PointResult {
};
PointResult point_function( const Point& p ) { return PointResult(); }
// helper: binary function for comparison of two points
struct point_compare {
bool operator()( const Point& p1, const Point& p2 ) const {
return p1.x < p2.x
|| ( p1.x == p2.x && ( p1.y < p2.y
|| ( p1.y == p2.y && p1.z < p2.z )
)
);
}
};
typedef std::map<Point, PointResult, point_compare> pointmap;
int _tmain(int argc, _TCHAR* argv[])
{
pointmap pm;
Point p1 = { 0.0, 0.0, 0.0 };
Point p2 = { 0.1, 1.0, 1.0 };
pm[ p1 ] = point_function( p1 );
pm[ p2 ] = point_function( p2 );
assert( pm.find( p2 ) != pm.end() );
return 0;
}

There are more than a few ways to do it, but you have to ask yourself first what are your assumptions and conditions.
So, assuming that your space is limited in size and you know what is the maximum accuracy, then you can form a function that given (x,y,z) will convert them to a unique number or string -this can be done only if you know that your accuracy is limited (for example - no two entities can occupy the same cubic centimeter).
Encoding the coordinate allows you to use a single map/hash with O(1).
If this is not tha case, you can always use 3 embedded maps as you suggested, or go towards space division algorithms (such as OcTree as mentioned) which although given O(logN) on a average search, they also give you additional information you might want (neighbors, population, etc..), but of course it is harder to implement.

You can either use a std::set of 3D coordinates, or a sorted std::vector. Both will give you logarithmic time lookup. In either case, you'll need to implement the less than comparison operator for your 3D coordinate class.

Why bother? What "processing" are you doing? Unless it's very complex, it's probably faster to just do the calculation again, rather then waste time looking things up in a huge map or hashtable.
This is one of the more counter-intuitive things about modern cpu's. Computation is fast, memory is slow.
I realize this isn't really an answer to your question, it's questioning your question.

Good question... it's one that has many solutions, because this type of problem comes
up many times in Graphical and Scientific applications.
Depending on the solution you require it may be rather complex under the hood, in this
case less code doesn't necessarily mean faster.
"but this makes it quite tedious to use" --- generally, you can get around this by
typedefs or wrapper classes (wrappers in this case would be highly recommended).
If you don't need to use the 3D co-ordinates in any kind of spacially significant way (
things like "give me all the points within X distance of point P") then I suggest you
just find a way to hash each point, and use a single hash map... O(n) creation, O(1)
access (checking to see if it's been processed), you can't do much better than that.
If you do need more spacial information you'll need a container that explicitly takes
it into account.
The type of container you choose will be dependant on your data set. If you have good
knowledge of the range of values that you recieve this will help.
If you are recieving well distributed data over a known range... go with octree.
If you have a distribution that tends to cluster, then go with k-d trees. You'll need
to rebuild a k-d tree after inputting new co-ordinates (not necessarily every time,
just when it becomes overly imbalanced). Put simply, Kd-trees are like Octrees, but with non uniform division.

Related

Data structure to represent multidimensional ranges with fast "open space" lookup?

So I'm looking to represent non-overlapping ranges in an N dimensional space.
I think CGAL has this functionality, and facilitates fast querying of points as the example shows below.
What I'm not sure of is how to extend this kind of query to find open windows.
So in this case I make 2 rectangles and it would be nice if there was a way find an opening of a certain size.
#include <CGAL/Cartesian.h>
#include <CGAL/Segment_tree_k.h>
#include <CGAL/Range_segment_tree_traits.h>
typedef CGAL::Cartesian<double> K;
typedef CGAL::Segment_tree_map_traits_2<K, char> Traits;
typedef CGAL::Segment_tree_2<Traits > Segment_tree_2_type;
int main()
{
typedef Traits::Interval Interval;
typedef Traits::Pure_interval Pure_interval;
typedef Traits::Key Key;
std::list<Interval> InputList, OutputList1, OutputList2;
InputList.push_back(Interval(Pure_interval(Key(1,2), Key(1,2)),'a'));
InputList.push_back(Interval(Pure_interval(Key(2,3), Key(2,3)),'b'));
Segment_tree_2_type Segment_tree_2(InputList.begin(),InputList.end());
// ??? probably has multiple solutions?
Interval find_me=Interval(Pure_interval(Key(0,3), Key(0,1)),'');
Interval opening = Segment_tree_2.find_opening(find_me);
return 0;
}
I don't think the Segment Tree from the CGAL library can help you to solve this problem, because this tree was designed to perform only two types of queries (window_query and enclosing_query). In both cases the search process returns a subset of the original set of D-dimensional intervals, which was used to build the tree - however you are interested in the open space "between" these intervals, which is not represented explicitly by this data structure.
Problems, similar to the problem you're asking about, were studied in Computational Geometry for a long time - the simplest case is finding a largest (by area or by perimeter) empty rectangle among a set of points on the plane. Generalizations of this problem have been studied as well - for more general obstacles (segments, rectangles, polygons) and higher dimensions. Please see this Wikipage for more information (including references).
However, finding the largest rectangle might be overabundant for you - any rectangle with size more or equal than the given size will suffice, and it will save you some time compared to the largest rectangle search. If your set of rectangles is static, but the size of the empty rectangle you wish to find varies, then it makes sense to preprocess this set into some data structure (as you mentioned above). There are some publications where they present algorithms to find all the maximal empty rectangles and save them in a list. Maximal empty rectangle is defined as a rectangle, which can’t be extended in any direction without intersecting with obstacles. Sorry to say, I couldn’t find any such publications, which can be accessed for free.
I’m suggesting a simple recursive algorithm, which can find an empty rectangle with width and height more or equal than the requested size. The idea is to start from the bounding box of the rectangle set and process rectangles from this set one by one. Each such rectangle is subtracted from the current empty rectangle, however result of this subtraction is represented as a set of maximal rectangles, which may overlap. For example, result of subtraction of the rectangle [0,2)x[0,2) from the rectangle [1,3)x[1,3) is a set of two rectangles [2,3)x[1,3) and [1,3)x[2,3). The algorithm returns an empty rectangle and a Boolean flag, indicating success or fail.
using RVec = std::vector<Rectangle>;
using Result = std::pair<Rectangle, bool>;
Result find(const RVec& V, double W, double H, const Rectangle& R, unsigned J)
{
if (R.sizeIsLess(W, H))
{
// ------ the empty rectangle R is too small
return {R, false};
}
else if (J < V.size())
{
// ------ process the obstacle rectangle with number J
for (const auto& r: subtract(R, V[J]))
{
const auto res = find(V, W, H, r, J + 1);
if (res.second) return {res.first, true};
}
return {R, false};
}
else
{
// ------ the empty rectangle R is big enough, and all the obstacles are processed
return {R, true};
}
}
auto find(const RVec& V, double W, double H)
{
return find(V, W, H, bbox(V), 0);
}
I can’t prove that this algorithm works correctly for any possible set of rectangular obstacles and for any requested width and height, however it worked well in all my tests. The algorithm is recursive, so the limited stack size might be a problem for really large rectangle sets.
The algorithm can be randomized by shuffling the rectangle set and/or the result of rectangles subtraction – then you’ll possibly get multiple solutions for the given rectangle set and given width and height. The algorithm can be extended to higher dimensions as well – then the function subtract will need to be modified. If you are interested I’ll add this function (for 2D case) into this answer.

Data structure to store a grid (that will have negative indices)

I'm studying robotics at the university and I have to implement on my own SLAM algorithm. To do it I will use ROS, Gazebo and C++.
I have a doubt about what data structure I have to use to store the map (and what I'm going to store it, but this is another story).
I have thought to represent the map as a 2D grid and robot's start location is (0,0). But I don't know where exactly is the robot on the world that I have to map. It could be at the top left corner, at the middle of the world, or in any other unknonw location inside the world.
Each cell of the grid will be 1x1 meters. I will use a laser to know where are the obstacles. Using current robot's location, I will set to 1 on all the cells that represent an obstacle. For example, it laser detects an obstacle at 2 meters in front of the robot, I will set to 1 the cell at (0,2).
Using a vector, or a 2D matrix, here is a problem, because, vector and matrices indices start at 0, and there could be more room behind the robot to map. And that room will have an obstacle at (-1,-3).
On this data structure, I will need to store the cells that have an obstacle and the cells that I know they are free.
Which kind of data structure will I have to use?
UPDATE:
The process to store the map will be the following:
Robot starts at (0,0) cell. It will detect the obstacles and store them in the map.
Robot moves to (1,0) cell. And again, detect and store the obstacles in the map.
Continue moving to free cells and storing the obstacles it founds.
The robot will detect the obstacles that are in front of it and to the sides, but never behind it.
My problem comes when the robot detects an obstacle on a negative cell (like (0,-1). I don't know how to store that obstacle if I have previously stored only the obstacle on "positive" cells. So, maybe the "offset", it is not a solution here (or maybe I'm wrong).
This is where you can write a class to help you:
class RoboArray
{
constexpr int width_ = ...
constexpr int height_ = ...
Cell grid_[width_ * 2][height_ * 2];
...
public:
...
Cell get(int x, int y) // can make this use [x][y] notation with a helper class
{
return grid_[x + width_][y + height];
}
...
}
The options you have:
Have an offset. Simple and dirty. Your grid is 100x100 but stores -50,-50 to 50x50.
Have multiple offset'ed grids. When you go out of the grid allocate a new one beside it, with a different offset. A list or map of grids.
Have sparse structure. A set or map of coordinates.
Have an hierarchical structure. Your whole, say 50x50, grid is one cell in a grid at a higher level. Implement it with a linked list or something so when you move you build a tree of nest grids. Very efficient for memory and compute time, but much more complex to implement.
You can use a std::set to represent a grid layout by using a position class you create. It contains a x and y variable and can therefore be used to intuitively be used to find points inside the grid. You can also use a std::map if you want to store information about a certain location inside the grid.
Please don't forget to fulfill the C++ named requirements for set/map such as Compare if you don't want to provide a comparison operator externally.
example:
position.h
/* this class is used to store the position of things
* it is made up by a horizontal and a vertical position.
*/
class position{
private:
int32_t horizontalPosition;
int32_t verticalPosition;
public:
position::position(const int hPos = 0,const int vPos = 0) : horizontalPosition{hPos}, verticalPosition{vPos}{}
position::position(position& inputPos) : position(inputPos.getHorPos(),inputPos.getVerPos()){}
position::position(const position& inputPos) : position((inputPos).getHorPos(),(inputPos).getVerPos()){}
//insertion operator, it enables the use of cout on this object: cout << position(0,0) << endl;
friend std::ostream& operator<<(std::ostream& os, const position& dt){
os << dt.getHorPos() << "," << dt.getVerPos();
return os;
}
//greater than operator
bool operator>(const position& rh) const noexcept{
uint64_t ans1 = static_cast<uint64_t>(getVerPos()) | static_cast<uint64_t>(getHorPos())<<32;
uint64_t ans2 = static_cast<uint64_t>(rh.getVerPos()) | static_cast<uint64_t>(rh.getHorPos())<<32;
return(ans1 < ans2);
}
//lesser than operator
bool operator<(const position& rh) const noexcept{
uint64_t ans1 = static_cast<uint64_t>(getVerPos()) | static_cast<uint64_t>(getHorPos())<<32;
uint64_t ans2 = static_cast<uint64_t>(rh.getVerPos()) | static_cast<uint64_t>(rh.getHorPos())<<32;
return(ans1 > ans2);
}
//equal comparison operator
bool operator==(const position& inputPos)const noexcept {
return((getHorPos() == inputPos.getHorPos()) && (getVerPos() == inputPos.getVerPos()));
}
//not equal comparison operator
bool operator!=(const position& inputPos)const noexcept {
return((getHorPos() != inputPos.getHorPos()) || (getVerPos() != inputPos.getVerPos()));
}
void movNorth(void) noexcept{
++verticalPosition;
}
void movEast(void) noexcept{
++horizontalPosition;
}
void movSouth(void) noexcept{
--verticalPosition;
}
void movWest(void) noexcept{
--horizontalPosition;
}
position getNorthPosition(void)const noexcept{
position aPosition(*this);
aPosition.movNorth();
return(aPosition);
}
position getEastPosition(void)const noexcept{
position aPosition(*this);
aPosition.movEast();
return(aPosition);
}
position getSouthPosition(void)const noexcept{
position aPosition(*this);
aPosition.movSouth();
return(aPosition);
}
position getWestPosition(void)const noexcept{
position aPosition(*this);
aPosition.movWest();
return(aPosition);
}
int32_t getVerPos(void) const noexcept {
return(verticalPosition);
}
int32_t getHorPos(void) const noexcept {
return(horizontalPosition);
}
};
std::set<position> gridNoData;
std::map<position, bool> gridWithData;
gridNoData.insert(point(1,1));
gridWithData.insert(point(1,1),true);
gridNoData.insert(point(0,0));
gridWithData.insert(point(0,0),true);
auto search = gridNoData.find(point(0,0));
if (search != gridNoData.end()) {
std::cout << "0,0 exists" << '\n';
} else {
std::cout << "0,0 doesn't exist\n";
}
auto search = gridWithData.find(point(0,0));
if (search != gridWithData.end()) {
std::cout << "0,0 exists with value" << search->second << '\n';
} else {
std::cout << "0,0 doesn't exist\n";
}
The above class was used by me in a similar setting and we used a std::map defined as:
std::map<position,directionalState> exploredMap;
To store if we had found any walls at a certain position.
By using this std::map based method you avoid having to do math to know what offset you have to have inside an 2D array (or some structure like that). It also allows you to move freely as there is no chance that you'll travel outside of the predefined bounds you set at construction. This structure is also more space efficient against a 2D array as this structure only saves the areas where the robot has been. This is also a C++ way of doing things: relying on the STL instead of creating your own 2D map using C constructs.
With offset solution (translation of values by fixed formula (we called it "mapping function" in math class), like doing "+50" to all coordinates, i.e. [-30,-29] will become [+20,+21] and [0,0] will become [+50,+50] ) you still need to have idea what is your maximum size.
In case you want to be dynamic like std::vector<> going from 0 to some N (as much as free memory allows), you can create more complex mapping function, for example map(x) = x*2 when (0 <= x) and x*(-2)-1 when (x < 0) ... this way you can use standard std::vector and let it grow as needed by reaching new maximum coordinates.
With 2D grid vs std::vector this is a bit more complicated as vector of vectors is sometimes not the best idea from performance point of view, but as long as your code can prefer shortness and simplicity over performance, maybe you can use the same mapping for both coordinates and use vector of vectors (using reserve(..) on all of them with some reasonable default to avoid resizing of vectors in common use cases, like if you know the 100m x 100m area will be usual maximum, you can reserve everything to capacity 201 initially to avoid vector resizing for common situations, but it can still grow infinitely (until heap memory is exhausted) in less common situations.
You can also add another mapping function converting 2D coordinates to 1D and use single vector only, and if you want really complicate things, you can for example map those 2D into 0,1,2,... sequence growing from area around center outward to save memory usage for small areas... you will probably easily spend 2-4 weeks on debugging it, if you are kinda fresh to C++ development, and you don't use unit testing and TDD approach (I.e. just go by simple vector of vectors for a start, this paragraph is JFYI, how things can get complicated if you are trying to be too smart :) ).
Class robotArray
{
Int* left,right;
}
RobotArray::RobotArray ()
{
Int* a=new int [50][50];
Int* b=new int[50][50];
//left for the -ve space and right for the positive space with
0,0 of the two arrays removed
Left=a+1;
Right=b+1;
}
I think I see what you are after here: you don't know how big the space is, or even what the coordinates may be.
This is very general, but I would create a class that holds all of the data using vectors (another option -- vector of pairs, or vector of Eigen (the library) vectors). As you discover new regions, you'll add the coordinates and occupancy information to the Map (via AddObservation(), or something similar).
Later, you can determine the minimum and maximum x and y coordinates, and create the appropriate grid, if you like.
class RoboMap{
public:
vector<int> map_x_coord;
vector<int> map_y_coord;
vector<bool> occupancy;
RoboMap();
void AddObservation(int x, int y, bool in_out){
map_x_coord.push_back(x);
map_y_coord.push_back(y);
occupancy.push_back(in_out);
}
};

C++ Hash Table - How is collision for unordered_map with custom data type as keys resolved?

I have defined a class called Point which is to be used as a key inside an unordered_map. So, I have provided an operator== function inside the class and I have also provided a template specialization for std::hash. Based on my research, these are the two things I found necessary. The relevant code is as shown:
class Point
{
int x_cord = {0};
int y_cord = {0};
public:
Point()
{
}
Point(int x, int y):x_cord{x}, y_cord{y}
{
}
int x() const
{
return x_cord;
}
int y() const
{
return y_cord;
}
bool operator==(const Point& pt) const
{
return (x_cord == pt.x() && y_cord == pt.y());
}
};
namespace std
{
template<>
class hash<Point>
{
public:
size_t operator()(const Point& pt) const
{
return (std::hash<int>{}(pt.x()) ^ std::hash<int>{}(pt.y()));
}
};
}
// Inside some function
std::unordered_map<Point, bool> visited;
The program compiled and gave the correct results in the cases that I tested. However, I am not convinced if this is enough when using a user-defined class as key. How does the unordered_map know how to resolve collision in this case? Do I need to add anything to resolve collision?
That's a terrible hash function. But it is legal, so your implementation will work.
The rule (and really the only rule) for Hash and Equals is:
if a == b, then std::hash<value_type>(a) == std::hash<value_type>(b).
(It's also important that both Hash and Equals always produce the same value for the same arguments. I used to think that went without saying, but I've seen several SO questions where unordered_map produced unexpected results precisely because one or both of these functions depended on some external value.)
That would be satisfied by a hash function which always returned 42, in which case the map would get pretty slow as it filled up. But other than the speed issue, the code would work.
std::unordered_map uses a chained hash, not an open-addressed hash. All entries with the same hash value are placed in the same bucket, which is a linked list. So low-quality hashes do not distribute entries very well among the buckets.
It's clear that your hash gives {x, y} and {y, x} the same hash value. More seriously, any collection of points in a small rectangle will share the same small number of different hash values, because the high-order bits of the hash values will all be the same.
Knowing that Point is intended to store coordinates within an image, the best hash function here is:
pt.x() + pt.y() * width
where width is the width of the image.
Considering that x is a value in the range [0, width-1], the above hash function produces a unique number for any valid value of pt. No collisions are possible.
Note that this hash value corresponds to the linear index for the point pt if you store the image as a single memory block. That is, given y is also in a limited range ([0, height-1]), all hash values generated are within the range [0, width* height-1], and all integers in that range can be generated. Thus, consider replacing your hash table with a simple array (i.e. an image). An image is the best data structure to map a pixel location to a value.

Filling unordered_set is too slow

We have a given 3D-mesh and we are trying to eliminate identical vertexes. For this we are using a self defined struct containing the coordinates of a vertex and the corresponding normal.
struct vertice
{
float p1,p2,p3,n1,n2,n3;
bool operator == (const vertice& vert) const
{
return (p1 == vert.p1 && p2 == vert.p2 && p3 == vert.p3);
}
};
After filling the vertex with data, it is added to an unordered_set to remove the duplicates.
struct hashVertice
{
size_t operator () (const vertice& vert) const
{
return(7*vert.p1 + 13*vert.p2 + 11*vert.p3);
}
};
std::unordered_set<vertice,hashVertice> verticesSet;
vertice vert;
while(i<(scene->mMeshes[0]->mNumVertices)){
vert.p1 = (float)scene->mMeshes[0]->mVertices[i].x;
vert.p2 = (float)scene->mMeshes[0]->mVertices[i].y;
vert.p3 = (float)scene->mMeshes[0]->mVertices[i].z;
vert.n1 = (float)scene->mMeshes[0]->mNormals[i].x;
vert.n2 = (float)scene->mMeshes[0]->mNormals[i].y;
vert.n3 = (float)scene->mMeshes[0]->mNormals[i].z;
verticesSet.insert(vert);
i = i+1;
}
We discovered that it is too slow for data amounts like 3.000.000 vertexes. Even after 15 minutes of running the program wasn't finished. Is there a bottleneck we don't see or is another data structure better for such a task?
What happens if you just remove verticesSet.insert(vert); from the loop?
If it speeds-up dramatically (as I expect it would), your bottleneck is in the guts of the std::unordered_set, which is a hash-table, and the main potential performance problem with hash tables is when there are excessive hash collisions.
In your current implementation, if p1, p2 and p3 are small, the number of distinct hash codes will be small (since you "collapse" float to integer) and there will be lots of collisions.
If the above assumptions turn out to be true, I'd try to implement the hash function differently (e.g. multiply with much larger coefficients).
Other than that, profile your code, as others have already suggested.
Hashing floating point can be tricky. In particular, your hash
routine calculates the hash as a floating point value, then
converts it to an unsigned integral type. This has serious
problems if the vertices can be small: if all of the vertices
are in the range [0...1.0), for example, your hash function
will never return anything greater than 13. As an unsigned
integer, which means that there will be at most 13 different
hash codes.
The usual way to hash floating point is to hash the binary
image, checking for the special cases first. (0.0 and -0.0
have different binary images, but must hash the same. And it's
an open question what you do with NaNs.) For float this is
particularly simple, since it usually has the same size as
int, and you can reinterpret_cast:
size_t
hash( float f )
{
assert( /* not a NaN */ );
return f == 0.0 ? 0.0 : reinterpret_cast( unsigned& )( f );
}
I know, formally, this is undefined behavior. But if float and
int have the same size, and unsigned has no trapping
representations (the case on most general purpose machines
today), then a compiler which gets this wrong is being
intentionally obtuse.
You then use any combining algorithm to merge the three results;
the one you use is as good as any other (in this case—it's
not a good generic algorithm).
I might add that while some of the comments insist on profiling
(and this is generally good advice), if you're taking 15 minutes
for 3 million values, the problem can really only be a poor hash
function, which results in lots of collisions. Nothing else will
cause that bad of performance. And unless you're familiar with
the internal implementation of std::unordered_set, the usual
profiler output will probably not give you much information.
On the other hand, std::unordered_set does have functions
like bucket_count and bucket_size, which allow analysing
the quality of the hash function. In your case, if you cannot
create an unordered_set with 3 million entries, your first
step should be to create a much smaller one, and use these
functions to evaluate the quality of your hash code.
If there is a bottleneck, you are definitely not seeing it, because you don't include any kind of timing measures.
Measure the timing of your algorithm, either with a profiler or just manually. This will let you find the bottleneck - if there is one.
This is the correct way to proceed. Expecting yourself, or alternatively, StackOverflow users to spot bottlenecks by eye inspection instead of actually measuring time in your program is, from my experience, the most common cause of failed attempts at optimization.

Floating point keys in std:map

The following code is supposed to find the key 3.0in a std::map which exists. But due to floating point precision it won't be found.
map<double, double> mymap;
mymap[3.0] = 1.0;
double t = 0.0;
for(int i = 0; i < 31; i++)
{
t += 0.1;
bool contains = (mymap.count(t) > 0);
}
In the above example, contains will always be false.
My current workaround is just multiply t by 0.1 instead of adding 0.1, like this:
for(int i = 0; i < 31; i++)
{
t = 0.1 * i;
bool contains = (mymap.count(t) > 0);
}
Now the question:
Is there a way to introduce a fuzzyCompare to the std::map if I use double keys?
The common solution for floating point number comparison is usually something like a-b < epsilon. But I don't see a straightforward way to do this with std::map.
Do I really have to encapsulate the double type in a class and overwrite operator<(...) to implement this functionality?
So there are a few issues with using doubles as keys in a std::map.
First, NaN, which compares less than itself is a problem. If there is any chance of NaN being inserted, use this:
struct safe_double_less {
bool operator()(double left, double right) const {
bool leftNaN = std::isnan(left);
bool rightNaN = std::isnan(right);
if (leftNaN != rightNaN)
return leftNaN<rightNaN;
return left<right;
}
};
but that may be overly paranoid. Do not, I repeat do not, include an epsilon threshold in your comparison operator you pass to a std::set or the like: this will violate the ordering requirements of the container, and result in unpredictable undefined behavior.
(I placed NaN as greater than all doubles, including +inf, in my ordering, for no good reason. Less than all doubles would also work).
So either use the default operator<, or the above safe_double_less, or something similar.
Next, I would advise using a std::multimap or std::multiset, because you should be expecting multiple values for each lookup. You might as well make content management an everyday thing, instead of a corner case, to increase the test coverage of your code. (I would rarely recommend these containers) Plus this blocks operator[], which is not advised to be used when you are using floating point keys.
The point where you want to use an epsilon is when you query the container. Instead of using the direct interface, create a helper function like this:
// works on both `const` and non-`const` associative containers:
template<class Container>
auto my_equal_range( Container&& container, double target, double epsilon = 0.00001 )
-> decltype( container.equal_range(target) )
{
auto lower = container.lower_bound( target-epsilon );
auto upper = container.upper_bound( target+epsilon );
return std::make_pair(lower, upper);
}
which works on both std::map and std::set (and multi versions).
(In a more modern code base, I'd expect a range<?> object that is a better thing to return from an equal_range function. But for now, I'll make it compatible with equal_range).
This finds a range of things whose keys are "sufficiently close" to the one you are asking for, while the container maintains its ordering guarantees internally and doesn't execute undefined behavior.
To test for existence of a key, do this:
template<typename Container>
bool key_exists( Container const& container, double target, double epsilon = 0.00001 ) {
auto range = my_equal_range(container, target, epsilon);
return range.first != range.second;
}
and if you want to delete/replace entries, you should deal with the possibility that there might be more than one entry hit.
The shorter answer is "don't use floating point values as keys for std::set and std::map", because it is a bit of a hassle.
If you do use floating point keys for std::set or std::map, almost certainly never do a .find or a [] on them, as that is highly highly likely to be a source of bugs. You can use it for an automatically sorted collection of stuff, so long as exact order doesn't matter (ie, that one particular 1.0 is ahead or behind or exactly on the same spot as another 1.0). Even then, I'd go with a multimap/multiset, as relying on collisions or lack thereof is not something I'd rely upon.
Reasoning about the exact value of IEEE floating point values is difficult, and fragility of code relying on it is common.
Here's a simplified example of how using soft-compare (aka epsilon or almost equal) can lead to problems.
Let epsilon = 2 for simplicity. Put 1 and 4 into your map. It now might look like this:
1
\
4
So 1 is the tree root.
Now put in the numbers 2, 3, 4 in that order. Each will replace the root, because it compares equal to it. So then you have
4
\
4
which is already broken. (Assume no attempt to rebalance the tree is made.) We can keep going with 5, 6, 7:
7
\
4
and this is even more broken, because now if we ask whether 4 is in there, it will say "no", and if we ask for an iterator for values less than 7, it won't include 4.
Though I must say that I've used maps based on this flawed fuzzy compare operator numerous times in the past, and whenever I digged up a bug, it was never due to this. This is because datasets in my application areas never actually amount to stress-testing this problem.
As Naszta says, you can implement your own comparison function. What he leaves out is the key to making it work - you must make sure that the function always returns false for any values that are within your tolerance for equivalence.
return (abs(left - right) > epsilon) && (left < right);
Edit: as pointed out in many comments to this answer and others, there is a possibility for this to turn out badly if the values you feed it are arbitrarily distributed, because you can't guarantee that !(a<b) and !(b<c) results in !(a<c). This would not be a problem in the question as asked, because the numbers in question are clustered around 0.1 increments; as long as your epsilon is large enough to account for all possible rounding errors but is less than 0.05, it will be reliable. It is vitally important that the keys to the map are never closer than 2*epsilon apart.
You could implement own compare function.
#include <functional>
class own_double_less : public std::binary_function<double,double,bool>
{
public:
own_double_less( double arg_ = 1e-7 ) : epsilon(arg_) {}
bool operator()( const double &left, const double &right ) const
{
// you can choose other way to make decision
// (The original version is: return left < right;)
return (abs(left - right) > epsilon) && (left < right);
}
double epsilon;
};
// your map:
map<double,double,own_double_less> mymap;
Updated: see Item 40 in Effective STL!
Updated based on suggestions.
Using doubles as keys is not useful. As soon as you make any arithmetic on the keys you are not sure what exact values they have and hence cannot use them for indexing the map. The only sensible usage would be that the keys are constant.