An integer hashing problem

An integer hashing problem - c++

I have a (C++) std::map<int, MyObject*> that contains a couple of millions of objects of type MyObject*. The maximum number of objects that I can have, is around 100 millions. The key is the object's id. During a certain process, these objects must be somehow marked( with a 0 or 1) as fast as possible. The marking cannot happen on the objects themselves (so I cannot introduce a member variable and use that for the marking process). Since I know the minimum and maximum id (1 to 100_000_000), the first thought that occured to me, was to use a std::bit_set<100000000> and perform my marking there. This solves my problem and also makes it easier when marking processes run in parallel, since these use their own bit_set to mark things, but I was wondering what the solution could be, if I had to use something else instead of a 0-1 marking, e.g what could I use if I had to mark all objects with an integer number ?
Is there some form of a data structure that can deal with this kind of problem in a compact (memory-wise) manner, and also be fast ? The main queries of interest are whether an object is marked, and with what was marked with.
Thank you.
Note: std::map<int, MyObject*> cannot be changed. Whatever data structure I use, must not deal with the map itself.

How about making the value_type of your map a std::pair<bool, MyObject*> instead of MyObject*?

If you're not concerned with memory, then a std::vector<int> (or whatever suits your need in place of an int) should work.
If you don't like that, and you can't modify your map, then why not create a parallel map for the markers?
std::map<id,T> my_object_map;
std::map<id,int> my_marker_map;
If you cannot modify the objects directly, have you considered wrapping the objects before you place them in the map? e.g.:
struct
{
int marker;
T *p_x;
} T_wrapper;
std::map<int,T_wrapper> my_map;
If you're going to need to do lookups anyway, then this will be no slower.
EDIT: As #tenfour suggests in his/her answer, a std::pair may be a cleaner solution here, as it saves the struct definition. Personally, I'm not a big fan of std::pairs, because you have to refer to everything as first and second, rather than by meaningful names. But that's just me...

The most important question to ask yourself is "How many of these 100,000,000 objects might be marked (or remain unmarked)?" If the answer is smaller than roughly 100,000,000/(2*sizeof(int)), then just use another std::set or std::tr1::unordered_set (hash_set previous to tr1) to track which ones are so marked (or remained unmarked).
Where does 2*sizeof(int) come from? It's an estimate of the amount of memory overhead to maintain a heap structure in a deque of the list of items that will be marked.
If it is larger, then use std::bitset as you were about to use. It's overhead is effectively 0% for the scale of quantity you need. You'll need about 13 megabytes of contiguous ram to hold the bitset.
If you need to store a marking as well as presence, then use std::tr1::unordered_map using the key of Object* and value of marker_type. And again, if the percentage of marked nodes is higher than the aforementioned comparison, then you'll want to use some sort of bitset to hold the number of bits needed, with suitable adjustments in size, at 12.5 megabytes per bit.
A purpose-built object holding the bitset might be your best choice, given the clarification of the requirements.
Edit: this assumes that you've done proper time-complexity computations for what are acceptable solutions to you, since changing the base std::map structure is no longer permitted.

If you don't mind using hacks, take a look at the memory optimization used in Boost.MultiIndex. It can store one bit in the LSB of a stored pointer.

Related

Data structure for sparse insertion

I am asking this question mostly for confirmation, because I am not an expert in data structures, but I think the structure that suits my need is a hashmap.
Here is my problem (which I guess is typical?):
We are looking at pairwise interactions between a large number of objects (say N=90k), so think about the storage as a sparse matrix;
There is a process, say (P), which randomly starts from one object, and computes a model which may lead to another object: I cannot predict the pairs in advance, so I need to be able to "create" entries dynamically (arguably the performance is not very critical here);
The process (P) may "revisit" existing pairs and update the corresponding element in the matrix: this happens a lot, and therefore I need to be able to find and update as fast as possible.
Finally, the process (P) is repeated millions of times, but only requires write access to the data structure, it does not need to know about the latest "state" of that storage. This feels intuitively like a detail that might be exploited to improve performance, but I don't think hashmaps do.
This last point is actually the main reason for my question here: is there a data-structure which satisfies the first three points (I'm thinking hash-map, correct?), and which would also exploit the last feature for improved performance (I'm thinking something like buffering operations and execute them in bulk asynchronously)?
EDIT: I am working with C++, and would prefer it if there was an existing library implementing that data structure. In addition, I am limited by the system requirements; I cannot use C++11 features.

I would use something like:
#include <boost/unordered_map.hpp>
class Data
{
boost::unordered_map<std::pair<int,int>,double> map;
public:
void update(int i, int j, double v)
{
map[std::pair<int,int>(i,j)] += v;
}
void output(); // Prints data somewhere.
};
That will get you going (you may need to declare a suitable hash function). You might be able to speed things up by making the key type be a 64-bit integer, and using ((int64_t)i << 32) | j to make the index.
If nearly all the updates go to a small fraction of the pairs, you could have two maps (small and large), and directly update the small map. Every time the size of small passed a threshold, you could update large and clear small. You would need to do some carefully testing to see if this helped or not. The only reason I think it might help, is by improving cache locality.
Even if you end up using a different data structure, you can keep this class interface, and the rest of the code will be undisturbed. In particular, dropping sparsehash into the same structure will be very easy.

How fast is the code

I'm developing game. I store my game-objects in this map:
std::map<std::string, Object*> mObjects;
std::string is a key/name of object to find further in code. It's very easy to point some objects, like: mObjects["Player"] = .... But I'm afraid it's to slow due to allocation of std::string in each searching in that map. So I decided to use int as key in that map.
The first question: is that really would be faster?
And the second, I don't want to remove my current type of objects accesing, so I found the way: store crc string calculating as key. For example:
Object *getObject(std::string &key)
{
int checksum = GetCrc32(key);
return mObjects[checksum];
}
Object *temp = getOject("Player");
Or this is bad idea? For calculating crc I would use boost::crc. Or this is bad idea and calculating of checksum is much slower than searching in map with key type std::string?

Calculating a CRC is sure to be slower than any single comparison of strings, but you can expect to do about log2N comparisons before finding the key (e.g. 10 comparisons for 1000 keys), so it depends on the size of your map. CRC can also result in collisions, so it's error prone (you could detect collisions relatively easily detect, and possibly even handle them to get correct results anyway, but you'd have to be very careful to get it right).
You could try an unordered_map<> (possibly called hash_map) if your C++ environment provides one - it may or may not be faster but won't be sorted if you iterate. Hash maps are yet another compromise:
the time to hash is probably similar to the time for your CRC, but
afterwards they can often seek directly to the value instead of having to do the binary-tree walk in a normal map
they prebundle a bit of logic to handle collisions.
(Silly point, but if you can continue to use ints and they can be contiguous, then do remember that you can replace the lookup with an array which is much faster. If the integers aren't actually contiguous, but aren't particularly sparse, you could use a sparse index e.g. array of 10000 short ints that are indices into 1000 packed records).
Bottom line is if you care enough to ask, you should implement these alternatives and benchmark them to see which really works best with your particular application, and if they really make any tangible difference. Any of them can be best in certain circumstances, and if you don't care enough to seriously compare them then it clearly means any of them will do.

For the actual performance you need to profile the code and see it. But I would be tempted to use hash_map. Although its not part of the C++ standard library most of the popular implentations provide it. It provides very fast lookup.

The first question: is that really would be faster?
yes - you're comparing an int several times, vs comparing a potentially large map of strings of arbitrary length several times.
checksum: Or this is bad idea?
it's definitely not guaranteed to be unique. it's a bug waiting to bite.
what i'd do:
use multiple collections and embrace type safety:
// perhaps this simplifies things enough that t_player_id can be an int?
std::map<t_player_id, t_player> d_players;
std::map<t_ghoul_id, t_ghoul> d_ghouls;
std::map<t_carrot_id, t_carrot> d_carrots;
faster searches, more type safety. smaller collections. smaller allocations/resizes.... and on and on... if your app is very trivial, then this won't matter. use this approach going forward, and adjust after profiling/as needed for existing programs.
good luck

If you really want to know you have to profile your code and see how long does the function getObject take. Personally I use valgrind and KCachegrind to profile and render data on UNIX system.
I think using id would be faster. It's faster to compare int than string so...

C++ map really slow?

i've created a dll for gamemaker. dll's arrays where really slow so after asking around a bit i learnt i could use maps in c++ and make a dll.
anyway, ill represent what i need to store in a 3d array:
information[id][number][number]
the id corresponds to an objects id. the first number field ranges from 0 - 3 and each number represents a different setting. the 2nd number field represents the value for the setting in number field 1.
so..
information[101][1][4];
information[101][2][4];
information[101][3][4];
this would translate to "object with id 101 has a value of 4 for settings 1, 2 and 3".
i did this to try and copy it with maps:
//declared as a class member
map<double, map<int, double>*> objIdMap;
///// lower down the page, in some function
map<int, double> objSettingsMap;
objSettingsMap[1] = 4;
objSettingsMap[2] = 4;
objSettingsMap[3] = 4;
map<int, double>* temp = &objSettingsMap;
objIdMap[id] = temp;
so the first map, objIdMap stores the id as the key, and a pointer to another map which stores the number representing the setting as the key, and the value of the setting as the value.
however, this is for a game, so new objects with their own id's and settings might need to be stored (sometimes a hundred or so new ones every few seconds), and the existing ones constantly need to retrieve the values for every step of the game. are maps not able to handle this? i has a very similar thing going with game maker's array's and it worked fine.

Do not use double's as a the key of a map.
Try to use a floating point comparison function if you want to compare two doubles.

1) Your code is buggy: You store a pointer to a local object objSettingsMap which will be destroyed as soon as it goes out of scope. You must store a map obj, not a pointer to it, so the local map will be copied into this object.
2) Maps can become arbitrarily large (i have maps with millions of entrys). If you need speed try hash_maps (part of C++0x, but also available from other sources), which are considerably faster. But adding some hundred entries each second shouldn't be a problem. But befre worring about execution speed you should always use a profiler.
3) I am not really sure if your nested structures MUST be maps. Depending of what number of setting you have, and what values they may have, a structure or bitfield or a vector might be more accurate.

If you need really fast associative containers, try to learn about hashes. Maps are 'fast enough' but not brilliant for some cases.
Try to analyze what is the structure of objects you need to store. If the fields are fixed I'd recommend not to use nested maps. At all. Maps are usually intended for 'average' number of indexes. For low number simple lists are more effective because of insert / erase operations lower complexity. For great number of indexes you really need to think about hashing.
Don't forget about memory. std::map is highly dynamic template so on small objects stored you loose tons of memory because of dynamic allocation. Is it what you are really expecting? Once I was involved in std::map usage removal which lowered memory requirements in about 2 times.
If you only need to fill the map at startup and only search for elements (don't need to change structure) I'd recommend simple std::vector with sort applied after all the elems inserted. And then you can just use binary search (as you have sorted vector). Why? std::vector is much more predictable thing. The biggest advantage is continuous memory area.

Using Hash Maps to represent an extremely large data source

I have a very large possible data set that I am trying to visualize at once. The set itself consists of hundreds of thousands of segments, each of which is mapped to an id.
I have received a second data source that gives more real-time information for each segment, but the id's do not correspond to the id's I have.
I have a 1:1 mapping of the data id's (9-character strings) to the current id's (long integers). The problem is that there are a lot of id's, and the data that is coming in is in no specific order.
The solution I came up with is to have a hash-map that maps the strings to the road id's. The problem is that I don't know if the hash-map will be efficient enough to have all 166k data entries.
Does anyone have any suggestions and/or hashing algorithms that I can use for this?

If you're only dealing with hundreds of thousands of datapoints, it will likely not be a problem to go with the naive way and just stick with a hash-map.
Even if you have 500,000 9-character strings and an equal number of longs, that still only 16ish bytes per item, or 8,000,000 bytes total. Even if you double that for overhead, 16 MB is hardly too big to have in memory at one time.
Basically, try the easy way first, and only worry about it when your profiling tells you it's taking too long.

Judy Arrays are designed for this sort of thing: "Judy's key benefits are scalability, high performance, and memory efficiency. [...] Judy can replace many common data structures, such as arrays, sparse arrays, hash tables, B-trees, binary trees, linear lists, skiplists, other sort and search algorithms, and counting functions."

Since the comments on the question indicate the primary concern may be memory usage:
Use a pooling or other small-object-optimized allocator; assuming you have access to boost you can probably find a drop-in replacement in Pool. Using a better small-object allocator is probably the single biggest memory win you'll find.
If you know your strings are fixed-width, you may want to make sure you're allocating only enough space to store them. For example, a struct wrapped around a fixed-length char[] with a custom comparison operator may work better than a std::string. std::string comes with an additional dynamic allocation (and uses space for the corresponding pointer) and some extra size and capacity tracking overhead. (Generally, try to reduce the number of allocations that stick around; it reduces overhead.)
(Assuming STL) Look at the overhead difference between std::map and std::unordered_map (the latter may or may not be available to you at the moment); an RBtree-based std::map may be close enough to the lookup performance characteristics of your "hashmap" and may (or may not) be more memory efficient depending on your standard library implementation.
What route you take should be influenced by info you can gather -- try to get a picture of number of allocs and alloc size/alignment overhead.
You can either instrument your allocator or insert a few elements and see how you're doing compared to how you think you should be doing in terms of memory usage.

Since your strings are known up front and have a fixed length, theoretically and practically the best solution is a perfect hash. You could use cmph to generate it.
According to Wikipedia, your keys woud take 2.5 bits/key, or about 50KB. That's negligable compared to the 664KB for the values.

Although 166k data entries is rather small IMO you can have a look at google-sparsehash

Need to store string as id for objects in some fast data structure

I'm implementing a session store for a web-server. Keys are string
and stored objects are pointers. I tried using map but need something
faster. I will look up an object 5-20 times
as frequent than insert.
I tried using hash-map but failed. I felt like I got more constraints than more free time.
I'm coding c/c++ under Linux.
I don't want to commit to boost, since my web server is going to outlive boost. :)
This is a highly relevant question since the hardware (ssd disk) is
changing rapidly. What was the right solution will not be in 2 years.

I was going to suggest a map, but I see you have already ruled this out.
I tried using map but need something
faster.
These are the std::map performance bounds courtesy of the Wikipedia page:
Searching for an element takes O(log n) time
Inserting a new element takes O(log n) time
Incrementing/decrementing an iterator takes O(log n) time
Iterating through every element of a map takes O(n) time
Removing a single map element takes O(log n) time
Copying an entire map takes O(n log n) time.
How have you measured and determined that a map is not optimised sufficiently for you? It's quite possible that any bottlenecks you are seeing are in other parts of the code, and a map is perfectly adequate.
The above bounds seem like they would fit within all but the most stringent scalability requirements.

The type of data structure that will be used will be determined by the data you want to access. Some questions you should ask:
How many items will be in the session store? 50? 100000? 10000000000?
How large is each item in the store (byte size)?
What kind of string input is used for the key? ASCII-7? UTF-8? UCS2?
...
Hash tables generally perform very well for look ups. You can optimize them heavily for speed by writing them yourself (and yes, you can resize the table). Suggestions to improve performance with hash tables:
Choose a good hash function! this will have preferably even distribution among your hash table and will not be time intensive to compute (this will depend on the format of the key input).
Make sure that if you are using buckets to not exceed a length of 6. If you do exceed 6 buckets then your hash function probably isn't distributing evenly enough. A bucket length of < 3 is preferable.
Watch out for how you allocate your objects. If at all possible, try to allocate them near each other in memory to take advantage of locality of reference. If you need to, write your own sub-allocator/heap manager. Also keep to aligned boundaries for better access speeds (aligned is processor/bus dependent so you'll have to determine if you want to target a particular processor type).
BTrees are also very good and in general perform well. (Someone can insert info about btrees here).
I'd recommend looking at the data you are storing and making sure that the data is as small as possible. use shorts, unsigned char, bit fields as necessary. There are other additional ways to squeeze out improved performance as well such as allocating your string data at the end of your struct at the same time that you allocate the struct. i.e.
struct foo {
int a;
char my_string[0]; // allocate an instance of foo to be
// sizeof(int) + sizeof(your string data) etc
}
You may also find that implementing your own string compare routine can actually boost performance dramatically, however this will depend upon your input data.

It is possible to make your own. But you shouldn't have any problems with boost or std::tr1::unordered_map.
A ternary trie may be faster than a hash map for a smaller number of elements.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js