Storing filepath and size in C++

Storing filepath and size in C++ - c++

I'm processing a large number of image files (tens of millions) and I need to return the number of pixels for each file.
I have a function that uses an std::map<string, unsigned int> to keep track of files already processed. If a path is found in the map, then the value is returned, otherwise the file is processed and inserted into the map. I do not delete entries from the map.
The problem is as the number of entries grow, the time for lookup is killing the performance. This portion of my application is single threaded.
I wanted to know if unordered_map is the solution to this, or the fact that I'm using std::string as keys going to affects the hashing and require too many rehashings as the number of keys increases, thus once again killing the performance.
One other item to note is that the paths for the string are expected (but not guaranteed) to have the same prefix, for example: /common/until/here/now_different/. So all strings will likely have the same first N characters. I could potentially store these as relative to the common directory. How likely is that to help performance?

unordered_map will probably be better in this case. It will typically be implemented as a hash table, with amortized O(1) lookup time, while map is usually a binary tree with O(log n) lookups. It doesn't sound like your application would care about the order of items in the map, it's just a simple lookup table.
In both cases, removing the common prefix should be helpful, as it means less time has to be spent needlessly iterating over that part of the strings. For unordered_map it will have to traverse it twice: once to hash and then to compare against the keys in the table. Some hash functions also limit the amount of a string they hash, to prevent O(n) hash performance -- if the common prefix is longer than this limit, you'll end up with a worst-case hash table (everything is in one bucket).

I really like Galik's suggestion of using inodes if you can, but if not...
Will emphasise a point already made in comments: if you've reason to care, always implement the alternatives and measure. The more reason, the more effort it's worth expending on that....
So /- another option is to use a 128-bit cryptographic strength hash function on your filepaths, then trust that statistically it's extremely unlikely to produce a collision. A rule of thumb is that if you have 2^n distinct keys, you want significantly more than a 2n-bit hash. For ~100m keys, that's about 2*27 bits, so you could probably get away with a 64 bit hash but it's a little too close for comfort and headroom if the number of images grows over the years. Then use a vector to back a hash table of just the hashes and file sizes, with say quadratic probing. Your caller would ideally pre-calculate the hash of an incoming file path in a different thread, passing your lookup API only the hash.
The above avoids the dynamic memory allocation, indirection, and of course memory usage when storing variable-length strings in the hash table and utilises the cache much better. Relying on hashes not colliding may make you uncomfortable, but past a point the odds of a meteor destroying the computer, or lightning frying it, will be higher than the odds of a collision in the hash space (i.e. before mapping to hash table bucket), so there's really no point fixating on that. Cryptographic hashing is relatively slow, hence the suggestion to let clients do it in other threads.
(I have worked with a proprietary distributed database based on exactly this principle for path-like keys.)
Aside: beware Visual C++'s string hashing - they pick 10 characters spaced along your string to incorporate in the hash value, which would be extremely collision prone for you, especially if several of those were taken from the common prefix. The C++ Standard leaves implementations the freedom to provide whatever hashes they like, so re-measure such things if you ever need to port your system.

Related

Efficient data structure to map integer-to-integer with find & insert, no allocations and fixed upper bound

I am looking for input on an associative data structure that might take advantage of the specific criteria of my use case.
Currently I am using a red/black tree to implement a dictionary that maps keys to values (in my case integers to addresses).
In my use case, the maximum number of elements is known up front (1024), and I will only ever be inserting and searching. Searching happens twenty times more often than inserting. At the end of the process I clear the structure and repeat again. There can be no allocations during use - only the initial up front one. Unfortunately, the STL and recent versions of C++ are not available.
Any insight?

I ended up implementing a simple linear-probe HashTable from an example here. I used the MurmurHash3 hash function since my data is randomized.
I modified the hash table in the following ways:
The size is a template parameter. Internally, the size is doubled. The implementation requires power of 2 sizes, and traditionally resizes at 75% occupation. Since I know I am going to be filling up the hash table, I pre-emptively double it's capacity to keep it sparse enough. This might be less efficient when adding small number of objects, but it is more efficient once the capacity starts to fill up. Since I cannot resize it I chose to start it doubled in size.
I do not allow keys with a value of zero to be stored. This is okay for my application and it keeps the code simple.
All resizing and deleting is removed, replaced by a single clear operation which performs a memset.
I chose to inline the insert and lookup functions since they are quite small.
It is faster than my red/black tree implementation before. The only change I might make is to revisit the hashing scheme to see if there is something in the source keys that would help make a cheaper hash.
Billy ONeal suggested, given a small number of elements (1024) that a simple linear search in a fixed array would be faster. I followed his advice and implemented one for side by side comparison. On my target hardware (roughly first generation iPhone) the hash table outperformed a linear search by a factor of two to one. At smaller sizes (256 elements) the hash table was still superior. Of course these values are hardware dependant. Cache line sizes and memory access speed are terrible in my environment. However, others looking for a solution to this problem would be smart to follow his advice and try and profile it first.

How to make a fast dictionary that contains another dictionary?

I have a map<size_t, set<size_t>>, which, for better performance, I'm actually representing as a lexicographically-sorted vector<pair<size_t, vector<size_t>>>.
What I need is a set<T> with fast insertion times (removal doesn't matter), where T is the data type above, so that I can check for duplicates (my program runs until there are no more unique T's being generated.).
So far, switching from set to unordered_set has turned out to be quite beneficial (it makes my program run > 25% faster), but even now, inserting T still seems to be one of the main bottlenecks.
The maximum number of integers in a given T is around ~1000, and each integer is also <= ~1000, so the numbers are quite small (but there are thousands of these T's being generated).
What I have already tried:
Using unsigned short. It actually decreases performance slightly.
Using Google's btree::btree_map.
It's actually much slower because I have to work around the iterator invalidation.
(I have to copy the keys, and I think that's why it turned out slow. It was at least twice as slow.)
Using a different hash function. I haven't found any measurable difference as long as I use something reasonable, so it seems like this can't be improved.
What I have not tried:
Storing "fingerprints"/hashes instead of the actual sets.
This sounds like the perfect solution, except that the fingerprinting function needs to be fast, and I need to be extremely confident that collisions won't happen, or they'll screw up my program.
(It's a deterministic program that needs exact results; collisions render it useless.)
Storing the data in some other compact, CPU-friendly way.
I'm not sure how beneficial this would be, because it might involve copying around data, and most of the performance I've gained so far is by (cleverly) avoiding copying data in many situations.
What else can I do to improve the speed, if anything?

I am under the impression that you have 3 different problems here:
you need the T itself to be relatively compact and easy to move around
you need to quickly check whether a T is a possible duplicate of an already existing one
you finally need to quickly insert the new T in whatever data structure you have to check for duplicates
Regarding T itself, it is not yet as compact as it could be. You could probably use a single std::vector<size_t> to represent it:
N pairs
N Indexes
N "Ids" of I elements each
all that can be linearized:
[N, I(0), ..., I(N-1),
R(0) = Size(Id(0)), Id(0, 0), ... , Id(0, R(0)-1),
R(1) = ... ]
and this way you have a single chunk of memory.
Note: depending on the access pattern you may have to tweak it, specifically if you need random access to any ID.
Regarding the possibility of duplicates, a hash-map seems indeed quite appropriate. You will need a good hash function, but with a single array of size_t (or unsigned short if you can, it is smaller), you can just pick MurmurHash or CityHash or SipHash. They all are blazing fast and do their damnest to produce good quality hash (not cryptographic ones, emphasis is on speed).
Now, the question is when is it slow when checking for duplicates.
If you spend too much time checking for non-existing duplicates because the hash-map is too big, you might want to invest in a Bloom Filter in front of it.
Otherwise, check your hash function to make sure that it is really fast and has a low collision rate and your hash-map implementation to make sure it only ever computes the hash once.
Regarding insertion speed. Normally a hash-map, specifically if well-balanced and pre-sized, should have one of the quickest insertion. Make sure you move data into it and do not copy it; if you cannot move, it might be worth using a shared_ptr to limit the cost of copying.

Don't be afraid of collisions, use cryptographic hash. But choose a fast one. 256 bit collision is MUCH less probable than hardware error. Sun used it to deduplicate blocks in ZFS. ZFS uses SHA256. Probably you can use less secure hash. If it takes $1000000 to find one collision hash isn't secure but one collision don't seem to drop your performance. Many collisions would cost many $1000000. You can use something like (unordered) multimap<SHA, T> to deal with collisions. By the way, ANY hash table suffer from collisions (or takes too many memory), so ordered map (rbtree in gcc) or btree_map has better guaranteed time. Also hash table can be DOSed via hash collisions. Probably a secret salt can solve this problem. It is due to table size is much less than number of possible hashes.

You can also:
1) use short ints
2) reinterpret your array as an array of something like uint64_t for fast comparison (+some trailing elements), or even reinterpret it as an array of 128-bit values (or 256-bit, depending on your CPU) and compare via SSE. This should push your performance to memory speed limit.
From my experience SSE works fast with aligned memory access only. uint64_t comparison probably needs alignment for speed too, so you have to allocate memory manually with proper alignment (allocate more and skip first bytes). tcmalloc is 16 byte-aligned, uint64_t-ready. It is strange that you have to copy keys in btree, you can avoid it using shared_ptr. With fast comparisons and slow hash btree or std::map may turn out to be faster than hash table. I guess any hash is slower than memory. You can also calculate hash via SSE and probably find a library that does it.
PS I strongly recommend you to use profiler if you don't yet. Please tell % of time your program spend to insert, compare in insert and calculate hash.

Hash map optimised for lookup

I am looking for some map which has fixed keys (fixed during initialization) and that does faster look-up. It may not support adding/updating elements later. Is there some algorithm which looks the list of keys and formulates a function so that it is faster to look-up later. In my case, keys are strings.
Update:
Keys are not known at compile time. But during initialization time of the application. There wont be any further insertions later but there will be lots of look-ups. So I want look-ups to be optimized.

CMPH may be what you're looking for. Basically this is gperf without requiring the set at compile-time.
Though of course std::unordered_map as by C++11 might just do too, though possibly with a few collisions.
Since you lookup strings, for strings, a trie (any of the different trie flavours, crit-bit or whatever funky names they have) may also be worthwhile to look into, especially if you have many of them. There are a lot of free trie implementations freely available.
The advantage of tries is that they can index-compress strings, so they use less memory, which has a higher likelihood of having data in cache. Also the access pattern is less random, which is also cache-friendly. A hash table must store the value plus the hash, and indexes more or less randomly (not randomly, but unpredictably) into memory. A trie/trie-like structure ideally only needs one extra bit that distinguishes a key from its common prefix in each node.
(Note by the way that O(log(N)) may quite possibly be faster than O(1) in such a case, because big-O does not consider things like that.)

Note that these are distinct things: do you need an upper limit, do you need a fast typical rate, or do you need the fastest lookup ever, no questions asked? The last one will cost you, the first two ones may be conflicting goals.
You could attempt to create a perfect hash function based on the input (i.e. one that does not have collisions of the input set). This is a somehow-solved problem (e.g. this, this). However, they usually generate source code and may spend significant time generating the hash function.
A modification of this would be using a generic hash function (e.g. shift-multiply-add) and do a brute force search over suitable parameters.
This has to be traded off with the cost of a few string comparisons (which aren't that terribly expensive if you don't have to collate).
Another option is to use two distinct hash functions - this increases the cost of a single lookup but makes degradation slightly less likely than aliens stealing your clock cylces. It is rather unlikely that this would be a problem with typical strings and a decent hash function.

Try google-sparsehash: http://code.google.com/p/google-sparsehash/
An extremely memory-efficient hash_map implementation. 2 bits/entry overhead!
The SparseHash library contains several hash-map implementations, including
implementations that optimize for space or speed.

In a similar topic ((number of) items known at compile time) , I produced this one: Lookups on known set of integer keys. Low overhead, no need for perfect hash. Fortunately, it is in C ;-)

data structure for storing array of strings in a memory

I'm considering of data structure for storing a large array of strings in a memory. Strings will be inserted at the beginning of the programm and will not be added or deleted while programm is running. The crucial point is that search procedure should be as fast as it can be. Saving of memory is not important. I incline to standard structure hash_set from standard library, that allows to search elements in the structure with about constant time. But it's not guaranteed that this time will be short. Will anyone suggest a better standard desicion?
Many thanks!

Try a Prefix Tree
A Trie is better than a Binary Search Tree for searching elements. Compared against a hash table, you could see this question

If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.

There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the
beginning of the programm and will not
be added or deleted while programm is running.
If the strings are immutable - so insertion/deletion is "infrequent", so to speak -, another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph that might* be faster than a hash table and has a better worst case guarantee.
**Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).*

A hash_set with a suitable number of buckets would be ideal, alternatively a vector with the strings in dictionary order, searched used binary search, would be great too.

The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).

Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (Will search in a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated, and slower than they appear (especially if you have long strings).
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!

Well, assuming you truly want an array and not an associative contaner as you've mentioned, the allocation strategy mentioned in Raymond Chen's Blog would be efficient.

Fast container for setting bits in a sparse domain, and iterating (C++)?

I need a fast container with only two operations. Inserting keys on from a very sparse domain (all 32bit integers, and approx. 100 are set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions which hit the same entries (like, 500k, but only 100 different ones).
Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, same for the Google Hash Maps. I wonder what data structure is optimized for this case?

Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted, and don't bother doing the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.

I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array, and for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values checking each in turn.
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.

For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 if it gets more than 70% full. Cuckoo hashing has been discussed on Stackoverflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and lookup in assembler - on linear data structures this will be very simple so the coding and maintenance effort for an assembler implementation shouldn't be unduly hard to maintain.

You might want to consider implementing a HashTree using a base 10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10) or adjust your bucket size based on your expected distribution so that you only have a couple of keys/bucket.

A randomized data structure might be perfect for your job. Take a look at the skip list – though I don't know any decend C++ implementation of it. I intended to submit one to Boost but never got around to do it.

Maybe a set with a b-tree (instead of binary tree) as internal data structure. I found this article on codeproject which implements this.

Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?

How much memory do you have? 32-bits take "only" 4GB/8 bytes, which comes to 512MB, not much for a high-end server. That would make your insertions O(1). But that could make the iteration slow. Although skipping all words with only zeroes would optimize away most iterations. If your 100 numbers are in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.

Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure with an algorithm for insertion that causes a page fault will do you no good. In fact a data structure with an insert that merely causes a cache miss would likely be really bad for perf.
Have you made sure a naive unordered set of elements packed in a fixed array with a simple swap to front when an insert collisides is too slow? Its a simple experiment that might show you have memory locality issues rather than algorithmic issues.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js