Data structure that maps unique id to an object - c++

I'm looking for a C++ data structure that will let me associate objects with a unique numeric value (a key), and that will re-use these keys after the corresponding objects have been removed from the container. So it's basically somewhat of a hybrid map/queue structure. My current implementation is a
std::map<size_t, TObject>
and I insert objects like this:
size_t key = (m_Container.end()--)->first + 1;
m_Container[key] = some_object;
which works fine for my purposes (I will never allocate more than size_t objects); yet I still keep wondering if there is a more specialized container available, preferably already in the STL or Boost, or whether there is a way to use another container to achieve this goal.
(Of course I could, rather than taking the highest key in my map and adding one, go through the map every time and search for the first available key, but that would worsen the complexity from O(1) to O(n). Also, it would be nice if the API were a simple
size_t new_key = m_Container.insert(object);
).
Any ideas?

If you're never going to allocate more than size_t keys then I recommend you simply use a static counter:
size_t assign_id()
{
static size_t next_id;
return next_id++;
}
And if you want a nice API:
template<class Container>
size_t insert(Container & container, TObject const & obj)
{
container.insert(obj);
return assign_id();
}
std::set<TObject> m_Container;
size_t new_key = insert(m_Container, object);

I'm not certain what you exactly want from your ID. As it happens, each object already has a unique ID: its address! There are no two distinct objects with the same address, and the address of an object doesn't change over its lifetime.
std::set<T> typically stores its T values as members of larger nodes, not independent objects. Still, the T subobjects are never moved, and thus their addresses too are stable, unique identifiers.

Keep a std::set<key_type> removed_keys; that holds the keys of removed objects. On insertion, if removed_keys is not empty, take a key from removed_keys; otherwise create a new key.
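A minimal sketch of that idea, assuming the objects live in a std::map and that handing out the smallest free key first is acceptable. The class name RecyclingMap and its member names are made up for illustration; they are not from the question.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>

// Hypothetical wrapper: a std::map plus a set of keys freed by erase().
template <typename T>
class RecyclingMap {
public:
    // Stores value under a recycled key if one is available, else under a new key.
    std::size_t insert(T value) {
        std::size_t key;
        if (!removed_keys_.empty()) {
            key = *removed_keys_.begin();               // smallest free key
            removed_keys_.erase(removed_keys_.begin());
        } else {
            key = next_key_++;                          // never handed out before
        }
        items_.emplace(key, std::move(value));
        return key;
    }

    // Removes the object and remembers its key for reuse.
    // Returns false if the key is not currently in use.
    bool erase(std::size_t key) {
        if (items_.erase(key) == 0) return false;
        removed_keys_.insert(key);
        return true;
    }

    // Returns a pointer to the stored object, or nullptr if the key is unused.
    T* find(std::size_t key) {
        auto it = items_.find(key);
        return it == items_.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::map<std::size_t, T> items_;
    std::set<std::size_t> removed_keys_;
    std::size_t next_key_ = 0;
};
```

Insertion costs O(log n) here because of the map and the set; the point of the sketch is only the key recycling, not the choice of underlying container.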

Why not just use a vector?
std::vector<TObject> m_Container;
size_t key = m_Container.size();
m_Container.push_back(some_object);
Of course this could be completely useless if you have other usage characteristics. But since you only describe insert and the need for a key (so extracting) it is hard to give any other clear answer. But from these two requirements a std::vector<> should work just fine.
If you have some other usage characteristics like: (Elements can be removed), (we insert elements in large blocks), (we insert elements infrequently) etc these would be interesting factoids that may change the recommendations people give.
You mention that you want to search for unused element IDs. This suggests that you may be deleting elements, but I don't see any explicit requirement or usage where elements are being deleted.
Looking at your code above:
size_t key = (m_Container.end()--)->first + 1;
This is not doing what you think it is doing.
It is equivalent to:
size_t key = m_Container.end()->first + 1;
m_Container.end()--;
The post-decrement operator modifies its operand, but the result of the expression is the operand's original value. Thus you are applying operator -> to the value returned by end(). Dereferencing end() is undefined behavior.
See the standard:
Section:5.2.6 Increment and decrement
The value of a postfix ++ expression is the value of its operand.
m_Container.end()-- // This sub-expression returns m_Container.end()
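If the intent was "highest key plus one", a sketch along these lines avoids dereferencing end(). It relies on std::map keeping its keys sorted, so rbegin() refers to the largest key. The function name next_key is made up here, and returning 0 for an empty map is an assumption about what the question wants.

```cpp
#include <cstddef>
#include <map>

// Returns a key one past the current highest key, or 0 for an empty map.
template <typename T>
std::size_t next_key(const std::map<std::size_t, T>& container) {
    if (container.empty()) return 0;        // rbegin() must not be dereferenced on an empty map
    return container.rbegin()->first + 1;   // largest key + 1
}
```

Usage would then be `m_Container[next_key(m_Container)] = some_object;`. Note that this never reuses keys freed in the middle of the range.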
Alternative:
#include <cstddef>
#include <vector>
template<typename T>
class X
{
public:
T const& operator[](std::size_t index) const {return const_cast<X&>(*this)[index];}
T& operator[](std::size_t index) {return data[index];}
void remove(std::size_t index) {unused.push_back(index);}
std::size_t insert(T const& value);
private:
std::vector<T> data;
std::vector<std::size_t> unused;
};
template<typename T>
std::size_t X<T>::insert(T const& value)
{
if (unused.empty())
{
data.push_back(value);
return data.size() - 1;
}
std::size_t result = unused.back();
unused.pop_back();
data[result] = value;
return result;
}

Is there a reason that you need std::map to not remove a key, value pair?
This sounds like an attempt at premature optimization.
One method is to replace the value part with a dummy or placeholder value. The problem in the long run is that the dummy value can be extracted from the std::map as long as the key exists, so you will need to add a check for the dummy value every time the std::map is accessed.
Because you want to maintain a key without a value, you will most likely have to write your own container. In particular, you will have to design code to handle the case where a client accesses a key that has no value.
It looks like there is no performance gain from using standard containers with a key that has no value. However, there may be a gain as far as memory is concerned: your approach would reduce fragmentation of dynamic memory, since memory would not have to be re-allocated for the same key. You'll have to decide on the trade-off.

Related

C++ - how to buffer calc results faster than using unordered_map

I read a lot about unordered_map not being very fast but I wonder what's the best alternative to do this:
I need to buffer calculation results for a function of an integer argument. I don't know ahead of time what range or interval will be requested. Storing in a vector with maximal resolution would cost way too much memory.
So I'm using
unordered_map<unsigned long, pair<T, long>>
Where the key is the argument of the function to be computed, the first of the pair the result of the computation of type T, and the second of the pair a version information for that computation.
Only if the unordered_map does not contain the element or it contains it but the version is outdated, the computation is carried out and then added to the unordered_map. The lookup function looks something like this:
template<typename T> class BufferClass{
long MyVersion;
unordered_map<unsigned long, pair<T,long>> Buffer;
public:
BufferClass(): MyVersion{1} {};
T* GetIfValid(unsigned long index)
{
if (!Buffer.count(index)) return nullptr;
pair <T,long> &x{Buffer.at(index)};
if (x.second!=MyVersion) return nullptr;
return &x.first;
}
/* ...Functions to set elements...*/
};
As you can see, I combined element validity check and retrieval in one function, so that I only need one lookup for both.
The profiler shows most of the computation time is used up in the hash function __constrain_hash related to unordered_map.
What would be the fastest way to store and retrieve values like that? The list of stored indices is expected to be non-continuous (there will be a lot of "holes") and first and last index are also mostly unknown.
T will generally be a "small" data type (like double or complex).
Thanks!
Martin
In your code, a single query can perform two hash lookups: one in count() and another in at(). That is redundant; use unordered_map::find instead.
Sample code:
const auto iter = Buffer.find(index);
if (iter != Buffer.end() && iter->second.second == MyVersion) // found, and the version is current
{
    // iter->first is the key; the cached result is iter->second.first
    return &(iter->second.first);
}
else return nullptr;
In my opinion, unordered_map is slow, but not that slow; for 99.9% of uses it is fast enough. You may want to check whether you are calling this function more often than necessary. Switching to another, faster implementation is not free: it could bloat your code base, harm your application's portability across host systems, and so on. If you think std::unordered_map is unreasonably slow, it is almost always because something is wrong in your own work (either your estimate or your implementation).
BTW, one more thing: you said T is a small data type, right? Then return its value instead of a pointer to it; that is faster and safer.
One thing that strikes me as odd about your implementation is the following two lines:
if (!Buffer.count(index)) return nullptr;
pair <T,long> &x{Buffer.at(index)};
This code is checking if the key exists, then throws away the result and searches for the same key again with bounds checking to boot. I think you'll find searching once with std::unordered_map<unsigned long, std::pair<T, long>>::find and reusing the result to be preferable:
auto it = Buffer.find(index);
if (it == Buffer.end()) return nullptr;
auto& x = it->second;
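Put together, a single-lookup version of the class might look like the sketch below. Set and Invalidate are hypothetical stand-ins for the elided "functions to set elements" in the question; only GetIfValid mirrors the original interface.

```cpp
#include <unordered_map>
#include <utility>

template <typename T>
class BufferClass {
    long MyVersion;
    std::unordered_map<unsigned long, std::pair<T, long>> Buffer;
public:
    BufferClass() : MyVersion{1} {}

    // Hypothetical setter: stores a result tagged with the current version.
    void Set(unsigned long index, const T& value) {
        Buffer[index] = std::make_pair(value, MyVersion);
    }

    // Hypothetical helper: bumping the version makes every cached entry stale.
    void Invalidate() { ++MyVersion; }

    // One hash lookup: find() locates the entry and the iterator is reused.
    T* GetIfValid(unsigned long index) {
        auto it = Buffer.find(index);
        if (it == Buffer.end()) return nullptr;              // never computed
        if (it->second.second != MyVersion) return nullptr;  // outdated
        return &it->second.first;
    }
};
```

Whether this removes the hot spot in the profile depends on how often the function is called; it only halves the number of hash computations per query.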

Usefulness of KeyEqual in std::unordered_set/std::unordered_map

I understand that this may be a vague question, but I wonder what the real-world cases are in which a custom comparator is useful for the hash containers in std.
I understand its usefulness in ordered containers, but for hash containers it seems a bit weird.
The reason is that the hash values of elements that are equal according to the comparator need to be the same, and I believe that in most cases that actually means converting the lookup/insert element to some common representation (which is faster and easier to implement).
For example:
set of case insensitive strings: if you want to hash properly you need to uppercase/lowercase the entire string anyway.
set of fractions(where 2/3 == 42/63): you need to convert 42/63 to 2/3 and then hash that...
So I wonder if someone can provide some real world examples of usefulness of customizing std::unordered_ template parameters so I can recognize those patterns in future code I write.
Note 1: the "symmetry argument" (std::map enables customization of a comparator, so std::unordered_ containers should be customizable too) is something I considered, and I do not think it is convincing.
Note 2: I mixed 2 kind of comparators (< and ==) in the post for brevity, I know that std::map uses < and std::unordered_map uses ==.
As per https://en.cppreference.com/w/cpp/container/unordered_set
Internally, the elements are not sorted in any particular order, but
organized into buckets. Which bucket an element is placed into depends
entirely on the hash of its value. This allows fast access to
individual elements, since once a hash is computed, it refers to the
exact bucket the element is placed into.
So the hash function defines the bucket your element will end up in, but once the bucket is decided, in order to find the element, the operator == will be used.
Basically operator == is used to resolve hash collisions, and hence you need your hash function and your operator == to be consistent. Furthermore, if your operator == says that two elements are equal, the set will not allow a duplicate.
As for customization, I think that the idea of a case-insensitive set of strings is a good one: given two strings, you will need to provide a case-insensitive hash function to allow the set to determine the bucket it has to store the string in. Then you will need to provide a custom KeyEqual to allow the set to actually retrieve the element.
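A rough sketch of such a set, assuming ASCII-only case folding is good enough. All names below (to_lower_copy, CaseInsensitiveHash, CaseInsensitiveEqual, CaseInsensitiveSet) are made up for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Lowercases a copy of s (ASCII only; locale-aware folding is out of scope here).
inline std::string to_lower_copy(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Hashes the lowercased form, so "Hello" and "HELLO" land in the same bucket.
struct CaseInsensitiveHash {
    std::size_t operator()(const std::string& s) const {
        return std::hash<std::string>()(to_lower_copy(s));
    }
};

// Consistent with the hash: equal exactly when the lowercased forms are equal.
struct CaseInsensitiveEqual {
    bool operator()(const std::string& a, const std::string& b) const {
        return to_lower_copy(a) == to_lower_copy(b);
    }
};

using CaseInsensitiveSet =
    std::unordered_set<std::string, CaseInsensitiveHash, CaseInsensitiveEqual>;
```

One difference from normalizing the strings before insertion: the set keeps the spelling that was inserted first, so looking up "hELLo" returns the stored "Hello".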
A case I had to deal with, in the past, was a way to allow users to insert strings, keeping track of their order of insertion but avoiding duplicates. So, given a struct like:
struct keyword{
std::string value;
int sequenceCounter;
};
You want to detect duplicates according only to value. One of the solutions I came up with was an unordered_set with a custom comparator/hash function, that used only value. This allowed me to check for the existence of a key before allowing insertion.
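For illustration, the setup could look roughly like this. KeywordHash, KeywordEqual, KeywordSet and add_keyword are hypothetical names, and using the current set size as the next sequence number is an assumption that only holds if nothing is ever erased.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct keyword {
    std::string value;
    int sequenceCounter;
};

// Hash and equality look at value only, so sequenceCounter is ignored for identity.
struct KeywordHash {
    std::size_t operator()(const keyword& k) const {
        return std::hash<std::string>()(k.value);
    }
};
struct KeywordEqual {
    bool operator()(const keyword& a, const keyword& b) const {
        return a.value == b.value;
    }
};

using KeywordSet = std::unordered_set<keyword, KeywordHash, KeywordEqual>;

// Hypothetical helper: inserts value with the next sequence number unless it
// is already present. Returns true if the keyword was new.
inline bool add_keyword(KeywordSet& set, const std::string& value) {
    const int next = static_cast<int>(set.size());
    return set.insert(keyword{value, next}).second;
}
```

A lookup with any sequenceCounter in the probe finds the stored element, whose counter then gives the original insertion position.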
One interesting usage is to define memory efficient indexes (database sense of the term) for a given set of objects.
Example
Let's say we have a program that has a collection of N objects of this class:
struct Person {
// each object has a unique firstName/lastName pair
std::string firstName;
std::string lastName;
// each object has a unique ssn value
std::string socialSecurityNumber;
// each object has a unique email value
std::string email;
};
And we need to retrieve efficiently objects by the value of any unique property.
Implementations comparison
Time complexities are given assuming string comparisons are constant time (strings have limited length).
1) Single unordered_map
With a single map indexed by a single key (ex: email):
std::unordered_map<std::string,Person> indexedByEmail;
Time complexity: lookup by any unique property other than email requires a traversal of the map: average O(N).
Memory usage: the email value is duplicated. This could be avoided by using a single set with custom hash & compare (see 3).
2) Multiple unordered_map, no custom hash & compare
With a map for each unique property, with default hash & comparisons:
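// note: std::hash has no specialization for std::pair, so this map compiles only with a user-supplied hash for the key
std::unordered_map<std::pair<std::string,std::string>, Person> byName;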
std::unordered_map<std::string, const Person*> byEmail;
std::unordered_map<std::string, const Person*> bySSN;
Time complexity: by using the appropriate map, a lookup by any unique property is average O(1).
Memory usage: inefficient, because of all the string duplications.
3) Multiple unordered_set, custom hash & comparison:
With custom hash & comparison, we define different unordered_sets which hash & compare only specific fields of the objects. These sets can be used to perform lookups as if the items were stored in a map as in 2), but without duplicating any field.
using StrHash = std::hash<std::string>;
// --------------------
struct PersonNameHash {
std::size_t operator()(const Person& p) const {
// not the best hashing function in the world, but good enough for demo purposes.
return StrHash()(p.firstName) + StrHash()(p.lastName);
}
};
struct PersonNameEqual {
bool operator()(const Person& p1, const Person& p2) const {
return (p1.firstName == p2.firstName) && (p1.lastName == p2.lastName);
}
};
std::unordered_set<Person, PersonNameHash, PersonNameEqual> byName;
// --------------------
struct PersonSsnHash {
std::size_t operator()(const Person* p) const {
return StrHash()(p->socialSecurityNumber);
}
};
struct PersonSsnEqual {
bool operator()(const Person* p1, const Person* p2) const {
return p1->socialSecurityNumber == p2->socialSecurityNumber;
}
};
std::unordered_set<const Person*, PersonSsnHash, PersonSsnEqual> bySSN;
// --------------------
struct PersonEmailHash {
std::size_t operator()(const Person* p) const {
return StrHash()(p->email);
}
};
struct PersonEmailEqual {
bool operator()(const Person* p1, const Person* p2) const {
return p1->email == p2->email;
}
};
std::unordered_set<const Person*,PersonEmailHash,PersonEmailEqual> byEmail;
Time complexity: a lookup by any unique property is still O(1) average.
Memory usage: much better than 2): no string duplication.
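One thing the listing leaves open is how to query such a set when all you have is the field value. A possible approach, sketched here under the assumption that building a temporary Person is acceptable, is to fill a probe object and search with its address. The helper find_by_email and the alias EmailIndex are made-up names.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct Person {
    std::string firstName;
    std::string lastName;
    std::string socialSecurityNumber;
    std::string email;
};

struct PersonEmailHash {
    std::size_t operator()(const Person* p) const {
        return std::hash<std::string>()(p->email);
    }
};
struct PersonEmailEqual {
    bool operator()(const Person* p1, const Person* p2) const {
        return p1->email == p2->email;
    }
};
using EmailIndex = std::unordered_set<const Person*, PersonEmailHash, PersonEmailEqual>;

// Hypothetical helper: looks a person up by email via a temporary probe object.
// Only the probe's email field is read by the hash and the comparison.
inline const Person* find_by_email(const EmailIndex& index, const std::string& email) {
    Person probe;
    probe.email = email;                 // copies the string once per lookup
    auto it = index.find(&probe);
    return it == index.end() ? nullptr : *it;
}
```

The set stores raw pointers, so the Person objects must outlive the index and must not have their email changed while they are in it.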
The hash function extracts the features of an element that matter for identity, and the comparator's job is to decide whether those features are the same or not.
If you wrap the data in a "shell" type that exposes only those features, you may not need to modify the comparator at all.
Briefly: put a feature shell around the data, and let the features be what is compared.
To be honest, I don't quite understand your problem description, so my reasoning may be somewhat muddled. Please bear with me.
:)

Map, pair-vector or two vectors...?

I read through some posts and "wikis" but still cannot decide what approach is suitable for my problem.
I create a class called Sample which contains a certain number of compounds (lets say this is another class Nuclide) at a certain relative quantity (double).
Thus, something like (pseudo):
class Sample {
map<Nuclide, double>;
}
If I had the nuclides Ba-133, Co-60 and Cs-137 in the sample, I would have to use exactly those names in code to access those nuclides in the map. However, the only thing I need to do is iterate through the map to perform calculations (which nuclides they are is of no interest), so I will use a for loop. I want to iterate without paying any attention to the key names, so I would need to use an iterator for the map, am I right?
An alternative would be a vector<pair<Nuclide, double> >
class Sample {
vector<pair<Nuclide, double> >;
}
or simply two independent vectors
class Sample {
vector<Nuclide>;
vector<double>;
}
while in the last option the link between a nuclide and its quantity would be "meta-information", given by the position in the respective vector only.
Due to my lack of profound experience, I'd ask kindly for suggestions of what approach to choose. I want to have the iteration through all available compounds to be fast and easy and at the same time keep the logical structure of the corresponding keys and values.
PS.: It's possible that the number of compounds in a sample is very low (1 to 5)!
PPS.: Could the last option be modified by some const statements to prevent changes and thus keep the correct order?
If iteration needs to be fast, you don't want std::map<...>: its iteration is a tree-walk which quickly gets bad. std::map<...> is really only reasonable if you have many mutations to the sequence and you need the sequence ordered by the key. If you have mutations but you don't care about the order std::unordered_map<...> is generally a better alternative. Both kinds of maps assume you are looking things up by key, though. From your description I don't really see that to be the case.
std::vector<...> is fast to iterate. It isn't ideal for look-ups, though. If you keep it ordered you can use std::lower_bound() to do a std::map<...>-like look-up (i.e., the complexity is also O(log n)), but the effort of keeping it sorted may make that option too expensive. However, it is an ideal container for keeping a bunch of objects together which are iterated.
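As a sketch of the sorted-vector option: the code below keeps a std::vector of pairs ordered by key and uses std::lower_bound for both insertion and look-up. A std::string name stands in for the Nuclide class, and Entry, put and lookup are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the question's data: nuclide name -> relative quantity.
using Entry = std::pair<std::string, double>;

// Inserts or updates an entry while keeping the vector sorted by name.
inline void put(std::vector<Entry>& entries, const std::string& name, double quantity) {
    auto it = std::lower_bound(
        entries.begin(), entries.end(), name,
        [](const Entry& e, const std::string& key) { return e.first < key; });
    if (it != entries.end() && it->first == name) {
        it->second = quantity;                       // already present: update
    } else {
        entries.insert(it, Entry(name, quantity));   // O(n) shift keeps the order
    }
}

// Binary search, O(log n). Returns nullptr if the name is absent.
inline const double* lookup(const std::vector<Entry>& entries, const std::string& name) {
    auto it = std::lower_bound(
        entries.begin(), entries.end(), name,
        [](const Entry& e, const std::string& key) { return e.first < key; });
    return (it != entries.end() && it->first == name) ? &it->second : nullptr;
}
```

With only 1 to 5 entries per sample, as the question mentions, a plain linear scan would probably do just as well; the sketch only illustrates the technique.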
Whether you want one std::vector<std::pair<...>> or rather two std::vector<...>s depends on how the elements are accessed: if both parts of an element are bound to be accessed together, you want a std::vector<std::pair<...>>, as that keeps data which is accessed together adjacent in memory. On the other hand, if you normally only access one of the two components, using two separate std::vector<...>s will make the iteration faster, as more iteration elements fit into a cache line, especially if they are reasonably small like doubles.
In any case, I'd recommend not exposing the internal structure to the outside world and instead providing an interface which lets you change the underlying representation later. That is, to achieve maximum flexibility you don't want to bake the representation into all your code. For example, if you use accessor function objects (property maps in terms of BGL, or projections in terms of Eric Niebler's Range Proposal) to access the elements based on an iterator, rather than accessing the elements directly, you can change the internal layout without having to touch any of the algorithms (you'll need to recompile the code, though):
// version using std::vector<std::pair<Nuclide, double> >
// - it would just use std::vector<std::pair<Nuclide, double> >::iterator as iterator
auto nuclide_projection = [](Sample::key& key) -> Nuclide& {
    return key.first;
};
auto value_projection = [](Sample::key& key) -> double {
    return key.second;
};
// version using two std::vectors:
// - it would use an iterator interface to an integer, yielding a std::size_t for *it
// - the projectors hold references, so each one is bound to its vector when constructed
struct nuclide_projector {
    std::vector<Nuclide>& nuclides;
    auto operator()(std::size_t index) const -> Nuclide& { return nuclides[index]; }
};
nuclide_projector nuclide_projection{nuclides}; // nuclides: the sample's std::vector<Nuclide> (name assumed)
struct value_projector {
    std::vector<double>& values;
    auto operator()(std::size_t index) const -> double& { return values[index]; }
};
value_projector value_projection{values}; // values: the sample's std::vector<double> (name assumed)
With one pair of these in place, an algorithm that simply runs over the elements and prints them could look like this:
template <typename Iterator>
void print(std::ostream& out, Iterator begin, Iterator end) {
for (; begin != end; ++begin) {
out << "nuclide=" << nuclide_projection(*begin) << ' '
<< "value=" << value_projection(*begin) << '\n';
}
}
Both representations are entirely different but the algorithm accessing them is entirely independent. This way it is also easy to try different representations: only the representation and the glue to the algorithms accessing it need to be changed.

How to achieve better efficiency re-inserting into sets in C++

I need to modify an object that has already been inserted into a set. This isn't trivial because the iterator in the pair returned from an insertion of a single object is a const iterator and does not allow modifications. So, my plan was that if an insert failed I could copy that object into a temporary variable, erase it from the set, modify it locally and then insert my modified version.
insertResult = mySet.insert(newPep);
if( insertResult.second == false )
modifySet(insertResult.first, newPep);
void modifySet(set<Peptide>::iterator someIter, Peptide::Peptide newPep) {
Peptide tempPep = (*someIter);
someSet.erase(someIter);
// Modify tempPep - this does not modify the key
someSet.insert(tempPep);
}
This works, but I want to make my insert more efficient. I tried making another iterator and setting it equal to someIter in modifySet. Then after deleting someIter I would still have an iterator to that location in the set and I could use that as the insertion location.
void modifySet(set<Peptide>::iterator someIter, Peptide::Peptide newPep) {
Peptide tempPep = (*someIter);
anotherIter = someIter;
someSet.erase(someIter);
// Modify tempPep - this does not modify the key
someSet.insert(anotherIter, tempPep);
}
However, this results in a seg fault. I am hoping that someone can tell me why this insertion fails or suggest another way to modify an object that has already been inserted into a set.
The full source code can be viewed at github.
I agree with Peter that a map is probably a better model of what you are doing, specifically something like map<pep_key, Peptide::Peptide>, would let you do something like:
insertResult = myMap.insert(std::make_pair(newPep.keyField(), newPep));
if( insertResult.second == false )
insertResult.first->second = newPep;
To answer your question, the insert segfaults because erase invalidates an iterator, so inserting with it (or a copy of it) is analogous to dereferencing an invalid pointer. The only way I see to do what you want is with a const_cast
insertResult = mySet.insert(newPep);
if( insertResult.second == false )
const_cast<Peptide::Peptide&>(*(insertResult.first)) = newPep;
The const_cast approach looks like it will work for what you are doing, but it is generally a bad idea.
I hope it isn't bad form to answer my own question, but I would like it to be here in case someone else ever has this problem. The answer to why my attempt seg faulted was given by academicRobot, but here is the solution to make this work with a set. While I do appreciate the other answers and plan to learn about maps, this question was about efficiently re-inserting into a set.
void modifySet(set<Peptide>::iterator someIter, Peptide::Peptide newPep) {
if( someIter == someSet.begin() ) {
Peptide tempPep = (*someIter);
someSet.erase(someIter);
// Modify tempPep - this does not modify the key
someSet.insert(tempPep);
}
else {
Peptide tempPep = (*someIter);
anotherIter = someIter;
--anotherIter;
someSet.erase(someIter);
// Modify tempPep - this does not modify the key
someSet.insert(anotherIter, tempPep);
}
}
In my program this change dropped my run time by about 15%, from 32 seconds down to 27 seconds. My larger data set is currently running and I have my fingers crossed that the 15% improvement scales.
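On a C++11 or later compiler, a simpler variant may be available: set::erase returns the iterator following the erased element, and that iterator can serve directly as the insertion hint, with no special case for begin(). The sketch below uses a made-up Pep struct in place of Peptide and a made-up function name bump_count; it has not been measured against the timings above.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for Peptide: ordered by sequence only, count is payload.
struct Pep {
    std::string sequence;
    int count;
    bool operator<(const Pep& other) const { return sequence < other.sequence; }
};

// Re-inserts a modified copy of *it and returns an iterator to the new element.
inline std::set<Pep>::iterator bump_count(std::set<Pep>& peps,
                                          std::set<Pep>::iterator it) {
    Pep copy = *it;
    ++copy.count;                     // does not affect the ordering key
    auto next = peps.erase(it);       // 'it' is invalid from here on
    return peps.insert(next, copy);   // the element belongs right before 'next'
}
```

Because the copy goes back into the position just before the hint, the hinted insert can skip the usual tree search.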
std::set::insert returns a pair<iterator, bool> as far as I know. In any case, directly modifying an element in any sort of set is risky. What if your modification causes the item to compare equal to another existing item? What if it changes the item's position in the total order of items in the set? Depending on the implementation, this will cause undefined behaviour.
If the item's key remains the same and only its properties change, then I think what you really want is a map or an unordered_map instead of a set.
As you realized, sets are a bit messy to deal with because you have no way to indicate which part of the object should be considered for the key and which part you can modify safely.
The usual answer is to use a map or an unordered_map (if you have access to C++0x) and cut your object in two halves: the key and the satellite data.
Beware of the typical answer: std::map<key_type, Peptide>. While it seems easy, it means you need to guarantee that the key part of the Peptide object always matches the key it's associated with; the compiler won't help.
So you have 2 alternatives:
Cut Peptide in two: Peptide::Key and Peptide::Data, then you can use the map safely.
Don't provide any method to alter the part of Peptide which defines the key, then you can use the typical answer.
Finally, note that there are two ways to insert in a map-like object.
insert: inserts, but fails if the key already exists
operator[]: insert or update (which requires creating an empty object)
So, a solution would be:
class Peptide
{
public:
Peptide(int const id): mId(id) {}
int GetId() const;
void setWeight(float w);
void setLength(float l);
private:
int const mId;
float mWeight;
float mLength;
};
typedef std::unordered_map<int, Peptide> peptide_map;
Note that in case of update, it means creating a new object (default constructor) and then assigning to it. This is not possible here, because assignment means potentially changing the key part of the object.
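To make the consequence concrete, here is one way an insert-or-update could be written without operator[], sketched under the assumption that updates go through the setters. The getter GetWeight and the helper set_weight are additions for the example, not part of the class above.

```cpp
#include <unordered_map>
#include <utility>

class Peptide {
public:
    explicit Peptide(int id) : mId(id), mWeight(0.0f), mLength(0.0f) {}
    int GetId() const { return mId; }
    float GetWeight() const { return mWeight; }   // added for the example
    void setWeight(float w) { mWeight = w; }
    void setLength(float l) { mLength = l; }
private:
    int const mId;      // const key part: the class has no copy assignment
    float mWeight;
    float mLength;
};

typedef std::unordered_map<int, Peptide> peptide_map;

// Hypothetical helper: inserts a Peptide for id if absent, then updates its
// weight through the setter. operator[] is avoided because it would need a
// default constructor, which Peptide does not have.
inline bool set_weight(peptide_map& peptides, int id, float weight) {
    std::pair<peptide_map::iterator, bool> result =
        peptides.insert(std::make_pair(id, Peptide(id)));
    result.first->second.setWeight(weight);
    return result.second;   // true if a new entry was created
}
```

insert leaves an existing entry untouched and hands back an iterator to it, so the same code path covers both the "new" and the "already there" case.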
std::map will make your life a lot easier and I wouldn't be surprised if it outperforms std::set for this particular case. The storage of the key might seem redundant but can be trivially cheap (ex: pointer to immutable data in Peptide with your own comparison predicate to compare the pointee correctly). With that you don't have to fuss about with the constness of the value associated with a key.
If you can change Peptide's implementation, you can avoid redundancy completely by making Peptide into two separate classes: one for the key part and one for the value associated with the key.

std::map iteration - order differences between Debug and Release builds

Here's a common code pattern I have to work with:
class foo {
public:
void InitMap();
void InvokeMethodsInMap();
static void abcMethod();
static void defMethod();
private:
typedef std::map<const char*, pMethod> TMyMap;
TMyMap m_MyMap;
};
void
foo::InitMap()
{
m_MyMap["abc"] = &foo::abcMethod;
m_MyMap["def"] = &foo::defMethod;
}
void
foo::InvokeMethodsInMap()
{
for (TMyMap::const_iterator it = m_MyMap.begin();
it != m_MyMap.end(); it++)
{
(*it->second)(it->first);
}
}
However, I have found that the order that the map is processed in (within the for loop) can differ based upon whether the build configuration is Release or Debug. It seems that the compiler optimisation that occurs in Release builds affects this order.
I thought that by using begin() in the loop above, and incrementing the iterator after each method call, it would process the map in order of initialisation. However, I also remember reading that a map is implemented as a hash table, and order cannot be guaranteed.
This is particularly annoying, as most of the unit tests are run on a Debug build, and often strange order dependency bugs aren't found until the external QA team start testing (because they use a Release build).
Can anyone explain this strange behaviour?
Don't use const char* as the key for maps. That means the map is ordered by the addresses of the strings, not the contents of the strings. Use a std::string as the key type, instead.
std::map is not a hash table; it's usually implemented as a red-black tree, and elements are guaranteed to be ordered by some criterion (by default, < comparison between keys).
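A small sketch of what that ordering means in practice, with int standing in for the method pointers and a made-up helper name keys_in_iteration_order. With std::string keys the iteration order is determined by the key contents, so it is the same in Debug and Release. Note that it is sorted order, not insertion order; if the order of initialisation is what matters, a sequence container such as a std::vector of pairs would be needed instead.

```cpp
#include <map>
#include <string>
#include <vector>

// Collects the keys in the order in which the map iterates over them.
inline std::vector<std::string> keys_in_iteration_order(
        const std::map<std::string, int>& m) {
    std::vector<std::string> keys;
    for (std::map<std::string, int>::const_iterator it = m.begin();
         it != m.end(); ++it) {
        keys.push_back(it->first);
    }
    return keys;
}
```

For example, inserting "def" before "abc" still yields "abc" first when iterating.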
The definition of map is:
map<Key, Data, Compare, Alloc>
Where the last two template parameters default to:
Compare: less<Key>
Alloc: allocator<value_type>
When inserting a new value into a map, the new value (valueToInsert) is compared against the existing values in order (N.B. this is not a sequential search; the standard guarantees a maximum insert complexity of O(log(N))) until Compare(value, valueToInsert) returns true. Because you are using 'const char*' as the key, the Compare object is less<const char*>, and this class just does a < on the two values. So in effect you are comparing the pointer values (not the strings), and therefore the order is effectively random (as you don't know where the compiler will put the strings).
There are two possible solutions:
Change the type of the key so that it compares the string values.
Define another Compare Type that does what you need.
Personally I (like Chris) would just use a std::string, because the < operator used on strings returns a comparison based on the string content. But for argument's sake we can just define a Compare type.
#include <cstring>
struct StringLess
{
    bool operator()(const char* const& left, const char* const& right) const
    {
        // compare the characters, not the addresses
        return std::strcmp(left, right) < 0;
    }
};
///
typedef std::map<const char*, int,StringLess> TMyMap;
If you want to use const char * as the key for your map, also set a key comparison function that uses strcmp (or similar) to compare the keys. That way your map will be ordered by the string's contents, rather than the string's pointer value (i.e. location in memory).