I have a problem that I am looking for some guidance on solving in the most efficient way. I have 200 million strings of data ranging in size from 3 to 70 characters. The strings consist of letters, numbers, and a few special characters such as dashes and underscores. I need to be able to quickly search for an entire string or for any substring within a string (minimum substring size is 3). Quickly is defined here as less than 1 second.
As my first cut at this I did the following:
Created 38 index files. An index contains all the substrings that start with a particular letter. The first 4 MB of each file contains 1 million hash buckets (the starts of the hash chains). The rest of the index contains the linked-list chains from the hash buckets. My hashing is very evenly distributed. The 1 million hash buckets are kept in RAM and mirrored to disk.
When a string is added to the index it is broken down into its non-duplicate (within itself) substrings of 3 to n characters (where n is the length of the string minus 1). So, for example, "apples" is stored in the "A" index as pples,pple,ppl,pp (substrings are also stored in the "L" and "P" indexes).
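For illustration only, here is a rough sketch of that kind of substring enumeration (the function name and the choice to return full substrings are mine, not part of the original implementation, which appears to drop the implied leading letter):

#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: collect the distinct substrings of s, of length
// 3 .. s.size()-1, that begin with index_letter (assumed uppercase).
std::set<std::string> substrings_for_index(const std::string &s, char index_letter)
{
    std::set<std::string> out;   // std::set de-duplicates substrings within the string
    for (std::size_t len = 3; len < s.size(); ++len)              // lengths 3 .. n-1
        for (std::size_t pos = 0; pos + len <= s.size(); ++pos)
            if (std::toupper(static_cast<unsigned char>(s[pos])) == index_letter)
                out.insert(s.substr(pos, len));
    return out;
}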
The search/add server runs as a daemon (in C++) and works like a champ. Typical search times are less than 1/2 second.
The problem is on the front end of the process. I typically add 30,000 keys at a time. This part of the process takes forever. By way of benchmark, the load time into an empty index of 180,000 variable length keys is approximately 3 1/2 hours.
This scheme works except for the very long load times.
Before I go nuts optimizing (or trying to), I'm wondering whether or not there is a better way to solve this problem. Front-and-back wildcard searches in a DBMS (e.g., WHERE string LIKE '%ppl%') are amazingly slow (on the order of hours in MySQL, for example) for datasets this large, so it would seem that DBMS solutions are out of the question. I can't use full-text searches because we are not dealing with normal words, but with strings that may or may not be composed of real words.
From your description, loading the data takes all that time because you're dealing with I/O, mirroring the inflated strings to the hard disk. This will definitely be a bottleneck, depending mainly on the way you read and write data to the disk.
A possible improvement in execution time may be achieved by using mmap with some LRU policy. I'm quite sure the idea of replicating the data is to make searching faster, but since you're using -- as it seems -- only one machine, your bottleneck shifts from in-memory searching to I/O requests.
Another solution, which you may not be interested in -- it's sickly funny and disturbing as well (: -- is to split the data among multiple machines. Considering the way you've structured the data, the implementation itself may take a bit of time, but it would be very straightforward. You'd have:
each machine becomes responsible for a set of buckets, chosen using something close to hash_id(bucket) % num_machines;
insertions are performed locally, on each machine;
searches may either be routed through some kind of query application, or simply batched into sets of queries -- if the application is not interactive;
searches may even have a distributed interface, since you can start a request on one node and forward requests to other nodes (again batching requests, to avoid excessive I/O overhead).
Another good point is that, as you said, your data is already evenly distributed \o/; this is usually one of the trickiest parts of a distributed implementation. Besides, this would be highly scalable, as you could add another machine whenever the data grows in size.
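For illustration, a minimal sketch of the machine-selection rule from the list above (names are hypothetical):

#include <cstdint>

// Hypothetical routing rule: every node applies the same function, so
// inserts and queries for a given bucket always land on the same machine.
std::uint32_t machine_for_bucket(std::uint32_t bucket_id, std::uint32_t num_machines)
{
    return bucket_id % num_machines;   // i.e. something close to hash_id(bucket) % num_machines
}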
Instead of doing everything in one pass, solve the problem in 38 passes.
Read each of the 180,000 strings. Find the "A"s in each string, and write out entries only to the "A" hash table. After you are done, write the entire finished "A" hash table out to disk. (Have enough RAM to store the entire "A" hash table in memory -- if you don't, make smaller hash tables. That is, have 38^2 hash tables keyed on pairs of starting letters, giving 1444 different tables. You could even dynamically change how many letters the hash tables are keyed on, based on how common a prefix is, so they are all of modest size. Keeping track of how long such prefixes are isn't expensive.)
Then read each of the 180,000 strings, looking for "B". Etc.
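A rough outline of that pass-per-letter build, as a sketch (all names are hypothetical; 'alphabet' is assumed to hold the 38 index symbols, and serialization is left as a stub):

#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// One pass per index letter, so only one hash table has to fit in RAM at a time.
void build_all_indexes(const std::vector<std::string> &keys, const std::string &alphabet)
{
    for (char letter : alphabet) {
        std::unordered_map<std::string, std::vector<std::size_t>> table;
        for (std::size_t id = 0; id < keys.size(); ++id) {
            const std::string &key = keys[id];
            for (std::size_t len = 3; len < key.size(); ++len)            // lengths 3 .. n-1
                for (std::size_t pos = 0; pos + len <= key.size(); ++pos)
                    if (key[pos] == letter)
                        table[key.substr(pos, len)].push_back(id);
        }
        // write the entire finished table for this letter out to disk in one go
        std::ofstream out(std::string("index_") + letter + ".bin", std::ios::binary);
        // ... serialize 'table' into 'out' here ...
    }
}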
My theory is that you are going slower than you could be because your massive tables are thrashing your cache.
The next thing that might help is to limit how long the substrings you hash are, in order to shrink the size of your tables.
Instead of indexing all 2,346 substrings of length 3 to 70 of a 70-character string, if you limited the hashed length to 10 characters there would be only 516 substrings of length 3 to 10. And there may not be that many collisions on strings longer than 10. You could, again, make the length of the hashes dynamic -- the length-X hash might carry a flag saying "try a length X+Y hash if your string is longer than X; this one is too common", and otherwise simply terminate the hashing. That could reduce the amount of data in your tables, at the cost of slower lookups in some cases.
Related
For my project, I need to de-dupe very large sets of strings very efficiently. I.e., given a list of strings that may contain duplicates, I want to produce a list of all the strings in that list, but without any duplicates.
Here's the simplified pseudocode:
set = new Set()          # empty set
deduped = []
for string in strings:
    if !set.contains(string):
        set.add(string)
        deduped.add(string)
Here's the simplified C++ for it (roughly):
std::unordered_set<const char *> set;  // content-based hash/equality functors assumed, omitted in this simplification
for (auto &string : strings) {
    // do some non-trivial work here that is difficult to parallelize
    set.emplace(string);  // std::unordered_set has emplace/insert; try_emplace exists only for maps
}
// afterwards, iterate over set and dump strings into vector
However, this is not fast enough for my needs (I've benchmarked it carefully). Here are some ideas to make it faster:
Use a different C++ set implementation (e.g., abseil's)
Insert into the set concurrently (however, per the comment in the C++ implementation, this is hard. Also, there will be performance overhead to parallelizing)
Because the set of strings changes very little across runs, perhaps cache whether the hash function produces any collisions. If it produces none (after accounting for the changes), then strings can be compared by their hash during lookup rather than by actual string equality (strcmp).
Storing the de-duped strings in a file across runs (however, although that may seem simple, there are lots of complexities here)
All of these solutions, I've found, are either prohibitively tricky or don't provide that big of a speedup. Any ideas for fast de-duping? Ideally, something that doesn't require parallelization or file caching.
You can try various algorithms and data structures to solve your problem:
Try using a prefix tree (trie), a suffix automaton, or a hash table. A hash table is one of the fastest ways to find duplicates. Try different hash tables.
Use various data attributes to reduce unnecessary calculations. For example, you can only process subsets of strings with the same length.
Try to apply a "divide and conquer" approach to parallelize the computation. For example, divide the set of strings into a number of subsets equal to the number of hardware threads, de-duplicate each subset, then merge the subsets into one. Since the subsets shrink in the process (if the number of duplicates is large enough), merging them should not be too expensive.
Unfortunately, there is no general approach to this problem. To a large extent, the decision depends on the nature of the data being processed. The second item on my list seems to me the most promising. Always try to reduce the computations to work with a smaller data set.
You can significantly parallelize your task by implementing a simplified version of std::unordered_set manually:
Create an arbitrary number of buckets (probably proportional or equal to the number of threads in your thread pool).
Using the thread pool, calculate the hashes of your strings in parallel and split the strings, together with their hashes, between the buckets. You may need to lock individual buckets when adding strings to them, but the operation should be short, and/or you could use a lock-free structure.
Process each bucket individually using your thread pool -- compare hashes, and if they are equal, compare the strings themselves.
You may need to experiment with the bucket size and check how it affects performance. Logically it should be neither too big nor too small, to prevent congestion.
By the way, from your description it sounds like you load all the strings into memory and then eliminate duplicates. You could try reading your data directly into a std::unordered_set instead; then you would save memory and might gain speed as well.
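Here is a minimal sketch of the bucket-splitting approach described above, using plain std::thread instead of a thread pool and assuming the strings are already in memory (the per-shard std::unordered_set does the hash-then-string comparison internally):

#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

// Shard the strings by hash so each shard can be de-duplicated independently
// on its own thread; no locking is needed because equal strings always hash
// to the same shard.
std::vector<std::string> dedupe(const std::vector<std::string> &strings, std::size_t num_shards)
{
    std::vector<std::vector<std::string>> shards(num_shards);
    std::hash<std::string> hasher;
    for (const auto &s : strings)
        shards[hasher(s) % num_shards].push_back(s);   // single-threaded split, for brevity

    std::vector<std::unordered_set<std::string>> sets(num_shards);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_shards; ++i)
        workers.emplace_back([&, i] { sets[i].insert(shards[i].begin(), shards[i].end()); });
    for (auto &t : workers)
        t.join();

    std::vector<std::string> deduped;
    for (const auto &set : sets)
        deduped.insert(deduped.end(), set.begin(), set.end());
    return deduped;
}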
I'm processing a large number of image files (tens of millions) and I need to return the number of pixels for each file.
I have a function that uses an std::map<string, unsigned int> to keep track of files already processed. If a path is found in the map, then the value is returned, otherwise the file is processed and inserted into the map. I do not delete entries from the map.
The problem is as the number of entries grow, the time for lookup is killing the performance. This portion of my application is single threaded.
I wanted to know whether unordered_map is the solution to this, or whether the fact that I'm using std::string as keys is going to hurt the hashing and require too many rehashes as the number of keys increases, once again killing performance.
One other thing to note is that the paths are expected (but not guaranteed) to share a common prefix, for example: /common/until/here/now_different/. So all strings will likely have the same first N characters. I could potentially store them relative to the common directory. How likely is that to help performance?
unordered_map will probably be better in this case. It will typically be implemented as a hash table, with amortized O(1) lookup time, while map is usually a binary tree with O(log n) lookups. It doesn't sound like your application would care about the order of items in the map, it's just a simple lookup table.
In both cases, removing the common prefix should help, as it means less time spent needlessly iterating over that part of the strings. With unordered_map the string has to be traversed twice: once to compute the hash and again to compare against the keys in the bucket. Some hash functions also limit the amount of a string they hash, to prevent O(n) hash performance -- if the common prefix is longer than this limit, you'll end up with a worst-case hash table (everything in one bucket).
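For illustration, a rough sketch of that change, assuming the common prefix is known up front (process_file is a hypothetical stand-in for the actual image-processing work):

#include <string>
#include <unordered_map>

unsigned int process_file(const std::string &path);   // hypothetical: opens the image and counts pixels

// Hypothetical cache: keys are paths relative to the common prefix, so
// neither the hash nor the key comparison has to walk the shared leading
// characters again and again.
unsigned int pixel_count(const std::string &full_path,
                         const std::string &common_prefix,
                         std::unordered_map<std::string, unsigned int> &cache)
{
    std::string key = full_path.substr(common_prefix.size());   // strip the shared prefix
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;                                      // already processed
    unsigned int pixels = process_file(full_path);              // expensive work happens once
    cache.emplace(std::move(key), pixels);
    return pixels;
}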
I really like Galik's suggestion of using inodes if you can, but if not...
I'll emphasise a point already made in the comments: if you have reason to care, always implement the alternatives and measure. The more reason, the more effort it's worth expending on that.
So, another option is to use a 128-bit cryptographic-strength hash function on your file paths, then trust that statistically it's extremely unlikely to produce a collision. A rule of thumb is that if you have 2^n distinct keys, you want significantly more than a 2n-bit hash. For ~100m keys, n is about 27, so 2n is roughly 54 bits: you could probably get away with a 64-bit hash, but it's a little too close for comfort and headroom if the number of images grows over the years. Then use a vector to back a hash table of just the hashes and file sizes, with, say, quadratic probing. Your caller would ideally pre-calculate the hash of an incoming file path in a different thread, passing your lookup API only the hash.
The above avoids the dynamic memory allocation, indirection, and of course memory usage when storing variable-length strings in the hash table and utilises the cache much better. Relying on hashes not colliding may make you uncomfortable, but past a point the odds of a meteor destroying the computer, or lightning frying it, will be higher than the odds of a collision in the hash space (i.e. before mapping to hash table bucket), so there's really no point fixating on that. Cryptographic hashing is relatively slow, hence the suggestion to let clients do it in other threads.
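For illustration, a rough sketch of such a hash-only table (the 128-bit hashing itself is left to the caller; an all-zero hash is treated as "empty slot", the capacity is assumed to be a power of two, and the table is assumed to be kept well below full, purely to keep the sketch short):

#include <cstdint>
#include <vector>

struct Slot {
    std::uint64_t hash_hi, hash_lo;   // 0/0 == empty
    unsigned int value;
};

class HashOnlyTable {
public:
    explicit HashOnlyTable(std::size_t capacity_pow2) : slots_(capacity_pow2) {}

    // Returns a reference to the value for this hash, inserting a zeroed
    // entry if the hash was not present.
    unsigned int &find_or_insert(std::uint64_t hi, std::uint64_t lo) {
        const std::size_t mask = slots_.size() - 1;
        std::size_t i = static_cast<std::size_t>(lo) & mask;
        for (std::size_t step = 1;; ++step) {
            Slot &s = slots_[i];
            if (s.hash_hi == 0 && s.hash_lo == 0) {   // empty slot: claim it
                s.hash_hi = hi;
                s.hash_lo = lo;
                return s.value;
            }
            if (s.hash_hi == hi && s.hash_lo == lo)   // already present
                return s.value;
            i = (i + step) & mask;                    // triangular probe offsets 1, 3, 6, 10, ...
        }
    }

private:
    std::vector<Slot> slots_;
};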
(I have worked with a proprietary distributed database based on exactly this principle for path-like keys.)
Aside: beware Visual C++'s string hashing - they pick 10 characters spaced along your string to incorporate in the hash value, which would be extremely collision prone for you, especially if several of those were taken from the common prefix. The C++ Standard leaves implementations the freedom to provide whatever hashes they like, so re-measure such things if you ever need to port your system.
I am looking for input on an associative data structure that might take advantage of the specific criteria of my use case.
Currently I am using a red/black tree to implement a dictionary that maps keys to values (in my case integers to addresses).
In my use case, the maximum number of elements is known up front (1024), and I will only ever be inserting and searching. Searching happens twenty times more often than inserting. At the end of the process I clear the structure and repeat again. There can be no allocations during use - only the initial up front one. Unfortunately, the STL and recent versions of C++ are not available.
Any insight?
I ended up implementing a simple linear-probe HashTable from an example here. I used the MurmurHash3 hash function since my data is randomized.
I modified the hash table in the following ways:
The size is a template parameter. Internally, the size is doubled. The implementation requires power-of-2 sizes and traditionally resizes at 75% occupancy. Since I know I am going to fill the hash table up, I pre-emptively double its capacity to keep it sparse enough. This might be less efficient when adding a small number of objects, but it is more efficient once the capacity starts to fill up. Since I cannot resize it, I chose to start it doubled in size.
I do not allow keys with a value of zero to be stored. This is okay for my application and it keeps the code simple.
All resizing and deleting is removed, replaced by a single clear operation which performs a memset.
I chose to inline the insert and lookup functions since they are quite small.
It is faster than my previous red/black tree implementation. The only change I might make is to revisit the hashing scheme to see if there is something in the source keys that would allow a cheaper hash.
Billy ONeal suggested that, given the small number of elements (1024), a simple linear search in a fixed array would be faster. I followed his advice and implemented one for a side-by-side comparison. On my target hardware (roughly a first-generation iPhone) the hash table outperformed the linear search by a factor of two to one. At smaller sizes (256 elements) the hash table was still superior. Of course these values are hardware dependent; cache line sizes and memory access speed are terrible in my environment. However, others looking for a solution to this problem would be smart to follow his advice, try it, and profile first.
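For reference, a condensed sketch of the kind of table described above (this is not the poster's actual code; the integer mixer stands in for MurmurHash3, key 0 is reserved as the empty marker, and Capacity is assumed to be a power of two):

#include <cstring>

template <unsigned Capacity>
class FixedHashTable {
public:
    void clear() { std::memset(slots_, 0, sizeof(slots_)); }   // single cheap reset; call before first use

    void insert(unsigned key, void *value) {
        unsigned i = hash(key) & (TableSize - 1);
        while (slots_[i].key != 0 && slots_[i].key != key)
            i = (i + 1) & (TableSize - 1);                     // linear probing
        slots_[i].key = key;
        slots_[i].value = value;
    }

    void *find(unsigned key) const {
        unsigned i = hash(key) & (TableSize - 1);
        while (slots_[i].key != 0) {
            if (slots_[i].key == key)
                return slots_[i].value;
            i = (i + 1) & (TableSize - 1);
        }
        return 0;                                              // not found
    }

private:
    // Doubling the capacity keeps occupancy at or below 50%.
    static const unsigned TableSize = Capacity * 2;

    static unsigned hash(unsigned k) {                         // stand-in mixer, not MurmurHash3
        k ^= k >> 16;
        k *= 0x45d9f3bu;
        k ^= k >> 16;
        return k;
    }

    struct Slot { unsigned key; void *value; };
    Slot slots_[TableSize];
};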
If it were just ASCII characters, I would just use an array of bool of size 256. But Unicode has so many characters.
1. Wikipedia says Unicode has more than 110,000 characters. So bool[110000] might not be a good idea?
2. Let's say the characters are coming in a stream and I just want to stop whenever a duplicate is detected. How do I do this?
3. Since the set is so big, I was thinking of a hash table. But how do I tell when a collision happens, because I do not want to continue once a collision is detected? Is there a way to do this with the STL implementation of a hash table?
Efficient in terms of speed and memory utilization.
Your Options
There are a couple of possible solutions:
1. bool[0x110000] (note that this is a hex constant, not a decimal constant as in the question)
2. vector<bool> with a size of 0x110000
3. Sorted vector<uint32_t> or list<uint32_t> containing every encountered codepoint
4. map<uint32_t, bool> or unordered_map<uint32_t, bool> containing a mapping of codepoints to whether they have been encountered
5. set<uint32_t> or unordered_set<uint32_t> containing every encountered codepoint
6. A custom container, e.g. a bloom filter, which provides high-density probabilistic storage for exactly this kind of problem
Analysis
Now, let's perform a basic analysis of the 6 variants:
1. bool[0x110000]: Requires exactly 0x110000 bytes = 1.0625 MiB, plus whatever overhead a single allocation incurs. Both setting and testing are extremely fast.
2. vector<bool>: While this may seem to be pretty much the same solution, it only requires roughly 1/8 of the memory, since it stores the bools as one bit each instead of one byte each. Both setting and testing are extremely fast; performance relative to the first solution may be better or worse, depending on things like CPU cache size, memory performance, and of course your test data.
3. Sorted vector<uint32_t> or list<uint32_t>: While potentially taking the least memory (4 bytes per encountered codepoint, so it requires less memory only as long as the input stream contains at most 0x110000 / 8 / 4 = 34816 distinct codepoints), performance for this solution will be abysmal: testing takes O(log(n)) for the vector (binary search) and O(n) for the list (binary search requires random access), while inserting takes O(n) for the vector (all following elements must be moved) and O(1) for the list (assuming that you kept the result of your failed test). This means that a test + insert cycle takes O(n), and therefore your total runtime will be O(n^2)...
4. map<uint32_t, bool> or unordered_map<uint32_t, bool>: Without much discussion, it should be obvious that we do not need the bool at all, but would rather just test for existence, leading to solution 5.
5. set<uint32_t> or unordered_set<uint32_t>: Both sets are fairly similar in performance; set is usually implemented as a binary tree and unordered_set as a hash map. This means both are fairly inefficient memory-wise: both carry additional overhead (non-leaf nodes in trees, and the actual tables containing the hashes in hash maps), meaning they will probably take 8-16 bytes per entry. Testing and inserting are O(log(n)) for the set and amortized O(1) for the unordered_set.
To answer the comment: testing whether uint32_t const x is contained in unordered_set<uint32_t> data is done like so: if(data.count(x)) or if(data.find(x) != data.end()).
6. A custom container (e.g. a bloom filter): The major drawback here is the significant amount of work the developer has to invest. Also, the bloom filter given as an example is a probabilistic data structure, meaning false positives are possible (false negatives are not, in this specific case).
Conclusion
Since your test data is not actual textual data, using a set of either kind is actually very memory inefficient (with high probability). Since this is also the only possibility to achieve better performance than the naive solutions 1 and 2, it will in all probability be significantly slower as well.
Taking into consideration the fact that it is easier to handle and that you seem to be fairly conscious of memory consumption, the final verdict is that vector<bool> seems the most appropriate solution.
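For instance, a minimal sketch of the vector<bool> approach over a stream of already-decoded codepoints (the function name and the sentinel are my own choices):

#include <cstdint>
#include <vector>

// Returns the first repeated codepoint, or 0x110000 (not a valid codepoint)
// if no duplicate is found. One bit is used per possible codepoint.
std::uint32_t first_duplicate(const std::vector<std::uint32_t> &codepoints)
{
    std::vector<bool> seen(0x110000, false);
    for (std::uint32_t cp : codepoints) {
        if (seen[cp])
            return cp;        // duplicate detected: stop immediately
        seen[cp] = true;
    }
    return 0x110000;          // sentinel: no duplicate
}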
Notes
Unicode is intended to represent text. This means that your test case and any analysis following from this is probably highly flawed.
Also, while it is simple to detect duplicate codepoints, the idea of a character is far more ambiguous, and may require some kind of normalization (e.g. "ä" can be either one codepoint or an "a" plus a diacritical mark, or even be based on the Cyrillic "a").
I am not at all an expert in database design, so I will put my need in plain words before I try to translate it into CS terms: I am trying to find the right way to iterate quickly over large subsets (say ~100 MB of doubles) of data, within a potentially very large dataset (say several GB).
I have objects that basically consist of 4 integers (the keys) and the value, a simple struct (1 double, 1 short).
Since my keys can take only a small number of values (a couple of hundred each), I thought it would make sense to store my data as a tree (one depth level per key, with the values as leaves -- much like XML/XPath, in my naive view at least).
I want to be able to iterate through subsets of leaves based on the key values / a function of those key values. Which key combination to filter on will vary. I think this is called a transversal search?
So, to avoid comparing the same keys n times, ideally I would need the data structure to be indexed by each permutation of the keys (12 possibilities: 4!/2!). This seems to be what boost::multi_index is for, but, unless I'm overlooking something, the way this would be done is by actually constructing those 12 tree structures, storing pointers to my value nodes as leaves. I guess this would be extremely space inefficient considering the small size of my values compared to the keys.
Any suggestions regarding the design / data structure I should use, or pointers to concise educational materials regarding these topics would be very appreciated.
With Boost.MultiIndex, you don't need as many as 12 indices (BTW, the number of permutations of 4 elements is 4!=24, not 12) to cover all queries comprising a particular subset of 4 keys: thanks to the use of composite keys, and with a little ingenuity, 6 indices suffice.
By some happy coincidence, I provided in my blog some years ago an example showing how to do this in a manner that almost exactly matches your particular scenario:
Multiattribute querying with Boost.MultiIndex
Source code is provided that you can hopefully use with little modification to suit your needs. The theoretical justification of the construct is also provided in a series of articles in the same blog:
A combinatory theorem
Generating permutation covers: part I
Generating permutation covers: part II
Multicolumn querying
The maths behind this is not trivial and you might want to safely ignore it: if you need assistance understanding it, though, do not hesitate to comment on the blog articles.
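For reference, a minimal sketch of one such composite-key index (field names are hypothetical; the full scheme uses six indices like this, each with a different ordering of the four keys):

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/tuple/tuple.hpp>

struct Record {          // hypothetical layout of the 4-key object
    int k1, k2, k3, k4;
    double value;
    short flag;
};

namespace bmi = boost::multi_index;

typedef bmi::multi_index_container<
    Record,
    bmi::indexed_by<
        bmi::ordered_non_unique<
            bmi::composite_key<
                Record,
                bmi::member<Record, int, &Record::k1>,
                bmi::member<Record, int, &Record::k2>,
                bmi::member<Record, int, &Record::k3>,
                bmi::member<Record, int, &Record::k4>
            >
        >
    >
> RecordSet;

// Querying on a prefix of the composite key, e.g. all records with k1==3 and
// k2==7 (index #0 is the container's default interface):
//   RecordSet rs;
//   auto range = rs.equal_range(boost::make_tuple(3, 7));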
How much memory does this container use? In a typical 32-bit computer, the size of your objects is 4*sizeof(int)+sizeof(double)+sizeof(short)+padding, which typically yields 32 bytes (checked with Visual Studio on Win32). To this Boost.MultiIndex adds an overhead of 3 words (12 bytes) per index, so for each element of the container you've got
32+6*12 = 104 bytes + padding.
Again, I checked with Visual Studio on Win32 and the size obtained was 128 bytes per element. If you have 1 billion (10^9) elements, then 32 bits is not enough: going to a 64-bit OS will most likely double the size of the objects, so the memory needed would amount to 256 GB, which is quite a powerful beast (I don't know whether you are using something as huge as this).
B-tree indexes and bitmap indexes are two of the major index types used, but they aren't the only ones. You should explore them. Something to get you started:
Article evaluating when to use B-Tree and when to use Bitmap
It depends on the algorithm accessing it, honestly. If this structure needs to be resident, and you can afford the memory consumption, then just do it. multi_index is fine, though it will destroy your compile times if it's in a header.
If you just need a one-time traversal, then building the structure will be kind of a waste. Something like next_permutation may be a good place to start.