What is the best hash function for Rabin-Karp algorithm? - c++

I'm looking for an efficient hash function for Rabin-Karp algorithm. Here is my actual code (C programming language).
static bool f2(char const *const s1, size_t const n1,
char const *const s2, size_t const n2)
{
uintmax_t hsub = hash(s2, n2);
uintmax_t hs = hash(s1, n1);
size_t nmax = n2 - n1;
for (size_t i = 0; i < nmax; ++i) {
if (hs == hsub) {
if (strncmp(&s1[i], s2, i + n2 - 1) == 0)
return true;
}
hs = hash(&s1[i + 1], i + n2);
}
return false;
}
I considered some Rabin-Karp C implementations, but there are differences between all the codes. So my question is: what are the characteristics that a Rabin-Karp hash function should have?

An extremely well-performing hash is the Bernstein hash. It even outperforms
many popular hashing algorithms.
unsigned bernstein_hash ( void *key, int len )
{
unsigned char *p = key;
unsigned h = 0;
int i;
for ( i = 0; i < len; i++ )
h = 33 * h + p[i];
return h;
}
Of course, you can try out other hashing algorithms, as described here:
Hash function on NIST
Note: It has never been adequately explained why the constant 33 performs so much better
than other, "more logical" constants.
For your interest: Here is a good comparison of different hash algorithms:
strchr comparison of hash algorithms

what are the characteristics that a Rabin-Karp hash function should have?
Rabin-Karp needs a rolling hash. The easiest rolling hash is a moving sum. Adler-32 and Buzhash are pretty simple too and perform better than a moving sum.
Any of these rolling hash techniques should work for Rabin-Karp:
Moving sum
    remove the oldest byte with subtraction
    add a new byte with addition
Polynomial rolling hash
    remove the oldest byte with subtraction
    add a new byte with multiplication and addition
Rabin fingerprint
    a polynomial rolling hash whose polynomial is irreducible over GF(2)
Tabulation hashing
    remove the oldest byte with a table lookup and an xor
    add a new byte with a table lookup and an xor
Cyclic polynomial, aka Buzhash
    tabulation hashing based on circular shifts
Adler-32 checksum
    not a rolling checksum by default but easily adjusted to "roll"
    remove the oldest byte with two subtractions
    add a new byte with two additions
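As an illustration of the "table lookup and xor" style of update, here is a minimal sketch of a cyclic polynomial (Buzhash-like) rolling hash. The names rotl64, byte_code, buzhash and buzhash_roll are hypothetical, and byte_code is an assumed mixing function that stands in for a 256-entry table of random 64-bit values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Rotate a 64-bit value left by r bits (r is reduced modulo 64).
inline std::uint64_t rotl64(std::uint64_t x, unsigned r) {
    r &= 63u;
    return r == 0 ? x : (x << r) | (x >> (64u - r));
}

// Assumption: stands in for a table of 256 random 64-bit values.
// The mixing steps are the splitmix64 finalizer.
inline std::uint64_t byte_code(unsigned char c) {
    std::uint64_t z = (static_cast<std::uint64_t>(c) + 1) * 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Hash of the window s[pos, pos + len), computed from scratch in O(len).
std::uint64_t buzhash(const std::string& s, std::size_t pos, std::size_t len) {
    assert(pos + len <= s.size());
    std::uint64_t h = 0;
    for (std::size_t i = 0; i < len; ++i)
        h = rotl64(h, 1) ^ byte_code(static_cast<unsigned char>(s[pos + i]));
    return h;
}

// Slide the window one byte to the right in O(1): rotate the old hash,
// cancel the outgoing byte (rotated len times in total), xor in the new byte.
std::uint64_t buzhash_roll(std::uint64_t h, std::size_t len,
                           unsigned char out, unsigned char in) {
    return rotl64(h, 1)
         ^ rotl64(byte_code(out), static_cast<unsigned>(len % 64))
         ^ byte_code(in);
}
```

For Rabin-Karp you would compute buzhash once for the pattern and once for the first window of the text, then call buzhash_roll for every following position and compare the bytes only when the two hashes are equal.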

For problems with small alphabets, such as nucleic acid sequence searching (e.g. alphabet = {A, T, C, G, U}), nt-Hash may be a good hash function.
It is built from bitwise operations (rotations and XORs), which are fast, it supports rolling updates, and it gives uniformly distributed hash values.

Considering that the implementors of Java's JDK would have given this some thought, I looked up which function is used there.
As of ~ Java 19, https://github.com/openjdk/jdk/blob/jdk-19+23/src/java.base/share/classes/java/lang/String.java#L2326
The update function is:
h' = 31 * h + c
initial value is 0.
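A minimal C++ sketch of that update rule, for illustration only. The name hash31 is hypothetical, and the function works on bytes with unsigned 32-bit wraparound, whereas Java works on UTF-16 chars and signed int, so the values only coincide for ASCII input whose hash stays non-negative.

```cpp
#include <cstdint>
#include <string>

// Polynomial hash with multiplier 31 and initial value 0, i.e. h' = 31 * h + c.
// Unsigned 32-bit arithmetic wraps modulo 2^32.
std::uint32_t hash31(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s)
        h = 31u * h + c;
    return h;
}
```

Since this is a plain polynomial hash, it can be made to roll over a window of length L by subtracting the oldest character multiplied by 31^(L-1) before applying the update.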

Related

how to find distinct substrings?

Given a string, and a fixed length l, how can I count the number of distinct substrings whose length is l?
The size of character set is also known. (denote it as s)
For example, given a string "PccjcjcZ", s = 4, l = 3,
then there are 5 distinct substrings:
“Pcc”; “ccj”; “cjc”; “jcj”; “jcZ”
I tried to use a hash table, but it is still slow.
In fact I don't know how to make use of the character set size.
I have done things like this
int diffPatterns(const string& src, int len, int setSize) {
int cnt = 0;
node* table[1 << 15];
int tableSize = 1 << 15;
for (int i = 0; i < tableSize; ++i) {
table[i] = NULL;
}
unsigned int hashValue = 0;
int end = (int)src.size() - len;
for (int i = 0; i <= end; ++i) {
hashValue = hashF(src, i, len);
if (table[hashValue] == NULL) {
table[hashValue] = new node(i);
cnt ++;
} else {
if (!compList(src, i, table[hashValue], len)) {
cnt ++;
};
}
}
for (int i = 0; i < tableSize; ++i) {
deleteList(table[i]);
}
return cnt;
}
Hash tables are fine and practical, but keep in mind that if the length of the substrings is L, and the whole string length is N, then the algorithm is Theta((N+1-L)*L), which is Theta(NL) for most L. Remember, just computing the hash takes Theta(L) time. Plus there might be collisions.
Suffix trees can be used, and provide a guaranteed O(N) time algorithm (count number of paths at depth L or greater), but the implementation is complicated. Saving grace is you can probably find off the shelf implementations in the language of your choice.
The idea of using a hashtable is good. It should work well.
The idea of implementing your own hashtable as an array of length 2^15 is bad. See Hashtable in C++? instead.
You can use a std::unordered_set, insert the substrings into the set and then get the size of the set. Since the values in a set are unique, it will take care of not including substrings that are the same as ones previously found. This takes close to O(StringSize - SubstringSize) insertions, although each insertion still has to copy and hash a substring, which costs time proportional to SubstringSize.
#include <iostream>
#include <string>
#include <unordered_set>
int main()
{
std::string test = "PccjcjcZ";
std::unordered_set<std::string> counter;
size_t substringSize = 3;
for (size_t i = 0; i + substringSize <= test.size(); ++i) // no unsigned underflow when the string is shorter than substringSize
{
counter.insert(test.substr(i, substringSize));
}
std::cout << counter.size();
std::cin.get();
return 0;
}
Veronica Kham's answer to the question is good, but we can improve this method to expected O(n) and still use a simple hash table rather than a suffix tree or any other advanced data structure.
Hash function
Let X and Y be two adjacent substrings of A of length L, more precisely:
X = A[i, i + L - 1]
Y = A[i + 1, i + 1 + L - 1]
Let us assign to each letter of our alphabet a single non-negative integer, for example a := 1, b := 2 and so on.
Let's define a hash function h now:
h(A[i, j]) := (P^(L-1) * A[i] + P^(L-2) * A[i + 1] + ... + A[j]) % M
where P is a prime number ideally greater than the alphabet size and M is a very big number denoting the number of different possible hashes, for example you can set M to maximum available unsigned long long int in your system.
Algorithm
The crucial observation is the following:
If you have a hash computed for X, you can compute a hash for Y in
O(1) time.
Let us assume that we have computed h(X), which can obviously be done in O(L) time. We want to compute h(Y). Notice that X and Y differ by only 2 characters (A[i] leaves the window and A[j + 1] enters it, where j = i + L - 1), so we can do that easily using subtraction, multiplication and addition:
h(Y) = ((h(X) - P^(L-1) * A[i]) * P + A[j + 1]) % M
Basically, we are subtracting letter A[i] multiplied by its coefficient in h(X), multiplying the result by P in order to get proper coefficients for the rest of letters and at the end, we are adding the last letter A[j + 1].
Notice that we can precompute powers of P at the beginning and we can do it modulo M.
Since our hashing functions returns integers, we can use any hash table to store them. Remember to make all computations modulo M and avoid integer overflow.
Collisions
Of course, a collision might occur, but since P is prime and M is really huge, it is a rare situation.
If you want to lower the probability of a collision, you can use two different hashing functions, for example by using a different modulus in each of them. If the probability of a collision is p using one such function, then for two functions it is roughly p^2 (assuming they behave independently), and we can make it arbitrarily small by this trick.
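The scheme described above can be sketched as follows. This is a minimal illustration, not the answerer's code: count_distinct is a hypothetical name, and P = 257 and M = 1000000007 are assumed values chosen so that every intermediate product fits in 64 bits. Windows with equal hashes are treated as equal, so a collision would make the count too small.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>

// Counts distinct substrings of length L by hashing every window with a
// polynomial rolling hash and storing the hashes in a hash set.
std::size_t count_distinct(const std::string& s, std::size_t L) {
    if (L == 0 || L > s.size()) return 0;
    const std::uint64_t M = 1000000007ULL;  // assumed prime modulus
    const std::uint64_t P = 257;            // assumed base, larger than any byte value

    std::uint64_t top = 1;                  // P^(L-1) mod M
    for (std::size_t i = 1; i < L; ++i) top = top * P % M;

    std::uint64_t h = 0;                    // hash of the first window, O(L)
    for (std::size_t i = 0; i < L; ++i)
        h = (h * P + static_cast<unsigned char>(s[i])) % M;

    std::unordered_set<std::uint64_t> seen;
    seen.insert(h);
    for (std::size_t i = L; i < s.size(); ++i) {
        std::uint64_t out = static_cast<unsigned char>(s[i - L]);
        std::uint64_t in = static_cast<unsigned char>(s[i]);
        h = (h + M - out * top % M) % M;    // drop the oldest character
        h = (h * P + in) % M;               // shift and append the new one
        seen.insert(h);
    }
    return seen.size();
}
```

For the example string, count_distinct("PccjcjcZ", 3) gives 5. With long inputs a modulus of this size makes collisions likely, so a larger modulus or a second hash function, as suggested above, would be the safer choice.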
Use Rolling hashes.
This will make the runtime expected O(n).
This might be repeating pkacprzak's answer, except, it gives a name for easier remembrance etc.
A suffix automaton can also solve it in O(N).
It's easy to code, but hard to understand.
Here are papers about it http://dl.acm.org/citation.cfm?doid=375360.375365
http://www.sciencedirect.com/science/article/pii/S0304397509002370

Simulate random iteration of array

I have an array of given size. I want to traverse it in pseudorandom order, keeping array intact and visiting each element once. It will be best if current state can be stored in a few integers.
I know you can't have full randomness without storing full array, but I don't need the order to be really random. I need it to be perceived as random by user. The solution should use sub-linear space.
One possible suggestion - using a large prime number - is given here. The problem with this solution is that there is an obvious fixed step (taken modulo the array size). I would prefer a solution which is not so obviously non-random. Is there a better solution?
How about this algorithm?
To pseudo-pseudo randomly traverse an array of size n:
1. Create a small array of size k.
2. Use the large prime number method to fill the small array, i = 0.
3. Randomly remove a position from the small array using an RNG, i += 1.
4. If i < n - k, add a new position using the large prime number method.
5. If i < n, go to step 3.
The higher k is, the more randomness you get. This approach will allow you to delay generating numbers from the prime number method.
A similar approach can be done to generate a number earlier than expected in the sequence by creating another array, "skip-list". Randomly pick items later in the sequence, use them to traverse the next position, and then add them to the skip-list. When they naturally arrive they are searched for in the skip-list and suppressed and then removed from the skip-list at which point you can randomly add another item to the skip-list.
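A minimal sketch of the buffered variant (without the skip-list), under stated assumptions: buffered_order is a hypothetical name, the "large prime number method" is taken to be a walk with a fixed step that is coprime with n, and std::mt19937 is used as the RNG. The function returns the whole order only to keep the example easy to check; a real iterator would hand out one position at a time and keep just the buffer, the walk position and the RNG state.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Visits 0..n-1 exactly once. Candidate positions come from a fixed-step walk
// (step must be coprime with n); a k-slot buffer releases them in random order.
std::vector<std::size_t> buffered_order(std::size_t n, std::size_t k,
                                        std::size_t step, unsigned seed) {
    assert(k > 0);
    std::mt19937 rng(seed);
    std::vector<std::size_t> buffer, order;
    std::size_t pos = 0, generated = 0;

    // Next position of the fixed-step walk; only called while generated < n.
    auto next_position = [&]() {
        std::size_t p = pos;
        pos = (pos + step % n) % n;
        ++generated;
        return p;
    };

    while (generated < n && buffer.size() < k) buffer.push_back(next_position());

    while (!buffer.empty()) {
        std::uniform_int_distribution<std::size_t> pick(0, buffer.size() - 1);
        std::size_t slot = pick(rng);
        order.push_back(buffer[slot]);
        if (generated < n) {
            buffer[slot] = next_position();  // refill the freed slot
        } else {
            buffer[slot] = buffer.back();    // walk exhausted: shrink the buffer
            buffer.pop_back();
        }
    }
    return order;
}
```

A position can be emitted at most k - 1 places later than in the plain fixed-step walk but arbitrarily many places earlier is impossible, so a small k still leaves the step visible.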
The idea of a random generator that simulates a shuffle is good if you can get one whose maximum period you can control.
A Linear Congruential Generator calculates a random number with the formula:
x[i + 1] = (a * x[i] + c) % m;
The maximum period is m and it is achieved when the following properties hold:
The parameters c and m are relatively prime.
For every prime number r dividing m, a - 1 is a multiple of r.
If m is a multiple of 4 then also a - 1 is multiple of 4.
My first draft involved making m the next multiple of 4 after the array length and then finding suitable a and c values. This was (a) a lot of work and (b) yielded very obvious results sometimes.
I've rethought this approach. We can make m the smallest power of two that the array length will fit in. The only prime factor of m is then 2, which will make every odd number relatively prime to it. With the exception of 1 and 2, m will be divisible by 4, which means that we must make a - 1 a multiple of 4.
Having a greater m than the array length means that we must discard all values that are illegal array indices. This will happen at most every other turn and should be negligible.
The following code yields pseudo random numbers with a period of exactly m. I've avoided trivial values for a and c, and on my (not too numerous) spot checks the results looked okay. At least there was no obvious cycling pattern.
So:
class RandomIndexer
{
public:
RandomIndexer(size_t length) : len(length)
{
m = 8;
while (m < length) m <<= 1;
c = m / 6 + uniform(5 * m / 6);
c |= 1;
a = m / 12 * uniform(m / 6);
a = 4*a + 1;
x = uniform(m);
}
size_t next()
{
do { x = (a*x + c) % m; } while (x >= len);
return x;
}
private:
static size_t uniform(size_t m)
{
// scale std::rand() into the range [0, m)
double p = std::rand() / (1.0 + RAND_MAX);
return static_cast<size_t>(m * p);
}
size_t len;
size_t x;
size_t a;
size_t c;
size_t m;
};
You can then use the generator like this:
std::vector<int> list;
for (size_t i = 0; i < 3; i++) list.push_back(i);
RandomIndexer ix(list.size());
for (size_t i = 0; i < list.size(); i++) {
std::cout << list[ix.next()]<< std::endl;
}
I am aware that this still isn't a great random number generator, but it is reasonably fast, doesn't require a copy of the array and seems to work okay.
If the approach of picking a and c randomly yields bad results, it might be a good idea to restrict the generator to some powers of two and to hard-code literature values that have proven to be good.
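A minimal sketch of the hard-coded variant, for illustration. The name lcg_order is hypothetical, and the constants a = 1664525 and c = 1013904223 are an assumed choice (a widely published pair); they satisfy the conditions above for every power-of-two m because c is odd and a - 1 is a multiple of 4. The function collects one full period so that the result can be checked, which a real iterator would not do.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the indices 0..len-1 in the order produced by a full-period LCG
// modulo the smallest power of two m >= len; values >= len are skipped.
std::vector<std::size_t> lcg_order(std::size_t len, std::uint64_t seed) {
    const std::uint64_t a = 1664525ULL;     // a - 1 is a multiple of 4
    const std::uint64_t c = 1013904223ULL;  // odd, hence coprime with m
    std::uint64_t m = 1;
    while (m < len) m <<= 1;

    std::vector<std::size_t> order;
    std::uint64_t x = seed & (m - 1);
    for (std::uint64_t i = 0; i < m; ++i) {  // exactly one full period
        // Same as % m for a power of two; 64-bit wraparound is harmless
        // because m divides 2^64.
        x = (a * x + c) & (m - 1);
        if (x < len) order.push_back(static_cast<std::size_t>(x));
    }
    return order;
}
```

Because the period is exactly m, every value below m appears once per period, so every index below len appears exactly once regardless of the seed.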
As pointed out by others, you can create a sort of "flight plan" upfront by shuffling an array of array indices and then follow it. This violates the "it will be best if current state can be stored in a few integers" constraint, but does it really matter? Are there tight performance constraints? After all, I believe that if you don't accept repetitions, then you need to store the items you already visited somewhere or somehow.
Alternatively, you can opt for an intrusive solution and store a bool inside each element of the array, telling you whether the element was already selected or not. This can be done in an almost clean way by employing inheritance (multiple as needed).
Many problems come with this solution, e.g. thread safety, and of course it violates the "keep the array intact" constraint.
Quadratic residues, which you have mentioned ("using a large prime"), are well-known, will work, and guarantee iterating each and every element exactly once (if that is required, but it seems that's not strictly the case?). Unfortunately they are not "very random looking", and there are a few other requirements on the modulus in addition to being prime for it to work.
There is a page on Jeff Preshing's site which describes the technique in detail and suggests to feed the output of the residue generator into the generator again with a fixed offset.
However, since you said that you merely need "perceived as random by user", it seems that you might be able to do with feeding a hash function (say, cityhash or siphash) with consecutive integers. The output will be a "random" integer, and at least so far there will be a strict 1:1 mapping (since there are a lot more possible hash values than there are inputs).
Now the problem is that your array is most likely not that large, so you need to somehow reduce the range of these generated indices without generating duplicates (which is tough).
The obvious solution (taking the modulo) will not work, as it pretty much guarantees that you get a lot of duplicates.
Using a bitmask to limit the range to the next greater power of two should work without introducing bias, and discarding indices that are out of bounds (generating a new index) should work as well. Note that this needs non-deterministic time -- but the combination of these two should work reasonably well (a couple of tries at most) on the average.
Otherwise, the only solution that "really works" is shuffling an array of indices as pointed out by Kamil Kilolajczyk (though you don't want that).
Here is a Java solution, which can be easily converted to C++ and is similar to M Oehm's solution above, albeit with a different way of choosing the LCG parameters.
import java.util.Enumeration;
import java.util.Random;
public class RandomPermuteIterator implements Enumeration<Long> {
int c = 1013904223, a = 1664525;
long seed, N, m, next;
boolean hasNext = true;
public RandomPermuteIterator(long N) throws Exception {
if (N <= 0 || N > Math.pow(2, 62)) throw new Exception("Unsupported size: " + N);
this.N = N;
m = (long) Math.pow(2, Math.ceil(Math.log(N) / Math.log(2)));
next = seed = new Random().nextInt((int) Math.min(N, Integer.MAX_VALUE));
}
public static void main(String[] args) throws Exception {
RandomPermuteIterator r = new RandomPermuteIterator(100);
while (r.hasMoreElements()) System.out.print(r.nextElement() + " ");
//output:50 52 3 6 45 40 26 49 92 11 80 2 4 19 86 61 65 44 27 62 5 32 82 9 84 35 38 77 72 7 ...
}
@Override
public boolean hasMoreElements() {
return hasNext;
}
@Override
public Long nextElement() {
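// m is a power of two, so masking equals % m and stays non-negative even if the product overflows
next = (a * next + c) & (m - 1);
while (next >= N) next = (a * next + c) & (m - 1);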
if (next == seed) hasNext = false;
return next;
}
}
maybe you could use this one: http://www.cplusplus.com/reference/algorithm/random_shuffle/ ?

I need some direction on writing a Hash Function to sort ~160,000 strings

My instructor dumped this on us, and told us we just needed to google how to write a hash function. I am quite directionless on this. We wrote a basic Hash Table template for class, but I have a project due that requires ~160,000 strings to be sorted into a table with at least 500 buckets (I am wanting to do more for speed).
I just have no idea where to look to get concise, easily digestible information on this.
Any help would be greatly appreciated.
I suggest a universal hash function. This kind of function guarantees a small number of collisions in expectation, even if the data is chosen by an adversary. There are plenty of universal hash functions.
In case of strings, you can adopt the following hash function.
For a character c we define #(c) as the numeric value of c (i.e. its ASCII code). For a string x = c1c2...cn we define its value as
#(x) = #(c1) + #(c2) * 256 + ... + #(cn) * 256^(n-1)
If HSize is an integer and k is a big prime number (you choose it), then for parameters 0 < a, b < k*HSize let the hash function be:
h(x) = ((a * #(x) + b) mod (k*HSize)) div k
This function produces output in the range [0, HSize-1].
The output value is calculated by Horner's rule, taking the remainder modulo k*HSize after every operation.
So, create a function HashFunction and pass the string you want to hash as a parameter.
Here is the code:
#define k 7919
#define Hsize 1009
#define a 321
#define b 43112
long long HashFunction(string text)
{
int i;
long long res = 0;
long long M = (Hsize * k);
int s=text.size();
// Horner's rule: value of the string in base 256, reduced modulo M after every step
for(i = s-1; i >= 0; i--)
{
res = (res * 256 + (unsigned char)text[i]) % M;
}
// apply a and b once, reduce modulo M again, then divide by k to land in [0, Hsize-1]
long long res1 = ((a * res + b) % M) / k;
return res1;
}

Rabin-Karp Algorithm

I am interested in implementing the Rabin-Karp algorithm to search for sub strings as stated on wiki: http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm. Not for homework, but for self-interest. I have implemented the Rabin-Karp algorithm (shown below) and it works. However, the performance is really, really bad!!! I understand that my hash function is basic. However, it seems that a simple call to strstr() will always outperform my function rabin_karp(). I can understand why - the hash function is doing more work than a simple char-by-char compare each loop. What am I missing here? Should the Rabin-Karp algorithm be faster than a call to strstr()? When is the Rabin-Karp algorithm best used? Hence my self-interest. Have I even implemented the algorithm right?
size_t hash(char* str, size_t i)
{
size_t h = 0;
size_t magic_exp = 1;
// if (str != NULL)
{
while (i-- != 0)
{
magic_exp *= 101;
h += magic_exp + *str;
++str;
}
}
return h;
}
char* rabin_karp(char* s, char* find)
{
char* p = NULL;
if (s != NULL && find != NULL)
{
size_t n = strlen(s);
size_t m = strlen(find);
if (n > m)
{
size_t hfind = hash(find, m);
char* end = s + (n - m + 1);
for (char* i = s; i < end; ++i)
{
size_t hs = hash(i, m);
if (hs == hfind)
{
if (strncmp(i, find, m) == 0)
{
p = i;
break;
}
}
}
}
}
return p;
}
You haven't implemented the hash correctly. The key to Rabin-Karp is to incrementally update the hash as the potential match moves along the string to be searched. As you've determined, if you recalculate the entire hash for each potential match position, things will be really slow.
For every case except for the first comparison, your hash function should take an existing hash, one new character, and one old character, and return an updated hash.
Rabin-Karp is a rolling hash algorithm - the idea is to be able to move the substring one position in either direction (left or right) and recompute the hash with a constant number of operations. As you have implemented it, the search has complexity O(N * L), where N is the length of the big string and L is the length of the string you are searching for. This is the complexity of the most naive approach and is, in my opinion, even a slight pessimization of it.
To improve your algorithm, precompute the powers of magic_exp and use them to 'roll' your hash - basically, just as with polynomials, you need to subtract the highest-degree term, multiply by magic_exp and add the hash of the symbol to the right (for moving the hash to the right).
Hope this helps.
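For illustration, a minimal sketch of such a rolling search. It is not the asker's code: it works on std::string rather than char*, returns an index, keeps 101 as the base, and adds an assumed prime modulus M = 1000000007 so that the arithmetic stays well defined.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the index of the first occurrence of pat in text,
// or std::string::npos if there is none.
std::size_t rabin_karp(const std::string& text, const std::string& pat) {
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0) return 0;
    if (m > n) return std::string::npos;

    const std::uint64_t B = 101;            // base, as in the question
    const std::uint64_t M = 1000000007ULL;  // assumed prime modulus
    std::uint64_t top = 1;                  // B^(m-1) mod M, precomputed once
    for (std::size_t i = 1; i < m; ++i) top = top * B % M;

    std::uint64_t hp = 0, ht = 0;           // pattern hash, current window hash
    for (std::size_t i = 0; i < m; ++i) {
        hp = (hp * B + static_cast<unsigned char>(pat[i])) % M;
        ht = (ht * B + static_cast<unsigned char>(text[i])) % M;
    }
    for (std::size_t i = 0; ; ++i) {
        // Compare the characters only when the hashes agree.
        if (hp == ht && text.compare(i, m, pat) == 0) return i;
        if (i + m >= n) break;
        std::uint64_t out = static_cast<unsigned char>(text[i]);
        std::uint64_t in = static_cast<unsigned char>(text[i + m]);
        ht = (ht + M - out * top % M) % M;  // subtract the highest-degree term
        ht = (ht * B + in) % M;             // multiply by the base, add the new symbol
    }
    return std::string::npos;
}
```

Each step of the main loop does a constant amount of hash work, so the expected running time is O(N + L) as long as hash collisions are rare.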
strstr is typically implemented with an algorithm that is also linear in nature, such as KMP (glibc, for example, uses the Two-Way algorithm). This means that the complexity of the two algorithms is approximately the same. From then on the constant factor is what matters. Especially in the case where you have a bad hash function with a lot of collisions, KMP will be a lot faster.
EDIT: One more thing. It is very important for the Rabin Karp algorithm to have all the hash codes of the prefixes precalculated. Now you are not implementing proper Rabin Karp, because the calls to your function will be linear, not constant in complexity. (Which by the way means that wikipedia is not very good source to learn Rabin Karp from).

looking for an efficient data structure to do a quick searches

I have a list of around 1000 elements. Each element (objects that I read from a file, hence I can arrange them efficiently at the beginning) contains 4 variables. So now I am doing the following, which is very inefficient in the grand scheme of things:
void func(double value1, double value2, double value3)
{
fooArr[1000];
for(int i=0;i<1000; ++i)
{
//they are all numeric! ranges are < 1000
if(fooArr[i].a== value1
&& fooArr[i].b >= value2
&& fooArr[i].c <= value2 //yes again value2
&& fooArr[i].d <= value3
)
{
/* yay found now do something!*/
}
}
}
Space is not too important!
MODIFIED per REQUEST
If space isn't too important, the easiest thing to do is to create a hash table keyed on "a". Depending on how many conflicts you get on "a", it may make sense to make each node in the hash table point to a binary tree based on "b". If b has a lot of conflicts, do the same for c.
That first index into the hash, depending on how many conflicts, will save you a lot of time for very little coding or data structures work.
First, sort the list on increasing a and decreasing b. Then build an index on a (values are integers from 0 to 999). So, we've got
int a_index[1001]; // contains starting subscript for each value
a_index[1000] = 1000;
for (i = a_index[value1]; i < a_index[value1 + 1] && fooArr[i].b >= value2; ++i)
{
if (fooArr[i].c <= value2 && fooArr[i].d <= value3) /* do stuff */
}
Assuming I haven't made a mistake here, this limits the search to the subscripts where a and b are valid, which is likely to cut your search times drastically.
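A minimal sketch of how the sorted array and the index could be built and queried, for illustration. Foo and SortedIndex are hypothetical names, the fields are assumed to be ints, and a is assumed to lie in [0, maxA].

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Foo { int a, b, c, d; };  // hypothetical element type

struct SortedIndex {
    std::vector<Foo> items;            // sorted by a ascending, then b descending
    std::vector<std::size_t> a_index;  // a_index[v] = first subscript with a >= v

    SortedIndex(std::vector<Foo> input, int maxA)
        : items(std::move(input)), a_index(maxA + 2, 0) {
        std::sort(items.begin(), items.end(), [](const Foo& x, const Foo& y) {
            return x.a != y.a ? x.a < y.a : x.b > y.b;
        });
        // Count the items per value of a, then turn the counts into start offsets.
        for (const Foo& f : items) ++a_index[f.a + 1];
        for (std::size_t v = 1; v < a_index.size(); ++v) a_index[v] += a_index[v - 1];
    }

    // Collects the elements with a == v1, b >= v2, c <= v2 and d <= v3.
    std::vector<Foo> query(int v1, int v2, int v3) const {
        assert(v1 >= 0 && static_cast<std::size_t>(v1) + 1 < a_index.size());
        std::vector<Foo> out;
        // b is descending inside one a-group, so stop at the first b < v2.
        for (std::size_t i = a_index[v1]; i < a_index[v1 + 1] && items[i].b >= v2; ++i)
            if (items[i].c <= v2 && items[i].d <= v3) out.push_back(items[i]);
        return out;
    }
};
```

The index is built once after reading the file; each query then touches only the elements whose a matches and whose b is large enough.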
Since you have only three properties to match, you could use a hash table. When performing a search, you use the hash table (which indexes the a-property) to find all entries where a matches SomeConstant. After that you check if b and c also match your constants. This way you can reduce the number of comparisons. I think this would speed the search up quite a bit.
Other than that you could build three binary search trees. One sorted by each property. After searching all three of them you perform your action for those which match your values in each tree.
Based on what you've said (in both the question and the comments) there are only a very few values for a (something like 10).
That being the case, I'd build an index on the values of a where each one points directly to all the elements in the fooArr with that value of a:
std::vector<std::vector<foo *> > index(num_a_values);
for (int i=0; i<1000; i++)
index[fooArr[i].a].push_back(&fooArr[i]);
Then when you get a value to look up an item, you go directly to those for which fooArr[i].a==value1:
std::vector<foo *> const &values = index[value1];
for (int i=0; i<values.size(); i++) {
if (value2 <= values[i]->b
&& value2 >= values[i]->c
&& value3 >= values[i]->d) {
// yay, found something
}
}
This way, instead of looking at 1000 items in fooArray each time, you look at an average of 100 each time. If you want still more speed, the next step would be to sort the items in each vector in the index based on the value of b. This will let you find the lower bound for value2 using a binary search instead of a linear search, reducing ~50 comparisons to ~10. Since you've sorted it by b, from that point onward you don't have to compare value2 to b -- you know exactly where the rest of the numbers that satisfy the inequality are, so you only have to compare to c and d.
You might also consider another approach based on the limited range of the numbers: 0 to 1000 can be represented in 10 bits. Using some bit-twiddling, you could combine three fields into a single 32-bit number, which would let the compiler compare all three at once, instead of in three separate operations. Getting this right is a little tricky, but once you do, it could roughly triple the speed again.
I think using kd-tree would be appropriate.
If there aren't many conflicts with a then hashing/indexing a might resolve your problem.
Anyway if that doesn't work I suggest using kd-tree.
First do a table of multiple kd-trees. Index them with a.
Then implement a kd-tree for each a value with 3-dimensions in directions b, c, d.
Then when searching - first index to appropriate kd-tree with a, and then search from kd-tree with your limits. Basically you'll do a range search.
Kd-tree
You'll get your answer in O(L^(2/3)+m), where L is the number of elements in appropriate kd-tree and m is the number of matching points.
Something better that I found is Range Tree. This might be what you are looking for.
It's fast. It'll answer your query in O(log^3(L)+m). (Unfortunately don't know about Range Tree much.)
Well, let's have a go.
First of all, the == operator calls for a pigeon-hole approach. Since we are talking about int values in the [0,1000] range, a simple table is good.
std::vector<Bucket1> myTable(1001, /*MAGIC_1*/); // suspense
The idea of course is that you will find YourObject instance in the bucket defined for its a attribute value... nothing magic so far.
Now on the new stuff.
&& fooArr[i].b >= value2
&& fooArr[i].c <= value2 //yes again value2
&& fooArr[i].d <= value3
The use of value2 is tricky, but you said you did not care for space right ;) ?
typedef std::vector<Bucket2> Bucket1;
/*MAGIC_1*/ <-- Bucket1(1001, /*MAGIC_2*/) // suspense ?
A Bucket1 instance will have in its ith position all instances of YourObject for which yourObject.c <= i <= yourObject.b
And now, same approach with the d.
typedef std::vector< std::vector<YourObject*> > Bucket2;
/*MAGIC_2*/ <-- Bucket2(1001)
The idea is that the std::vector<YourObject*> at index i contains a pointer to all instances of YourObject for which yourObject.d <= i
Putting it altogether!
class Collection
{
public:
Collection(size_t aMaxValue, size_t bMaxValue, size_t dMaxValue);
// prefer to use unsigned type for unsigned values
void Add(const YourObject& i);
// Pred is a unary operator taking a YourObject& and returning void
template <class Pred>
void Apply(int value1, int value2, int value3, Pred pred);
// Pred is a unary operator taking a const YourObject& and returning void
template <class Pred>
void Apply(int value1, int value2, int value3, Pred pred) const;
private:
// List behaves nicely with removal,
// if you don't plan to remove, use a vector
// and store the position within the vector
// (NOT an iterator because of reallocations)
typedef std::list<YourObject> value_list;
typedef std::vector<value_list::iterator> iterator_vector;
typedef std::vector<iterator_vector> bc_buckets;
typedef std::vector<bc_buckets> a_buckets;
typedef std::vector<a_buckets> buckets_t;
value_list m_values;
buckets_t m_buckets;
}; // class Collection
Collection::Collection(size_t aMaxValue, size_t bMaxValue, size_t dMaxValue) :
m_values(),
m_buckets(aMaxValue+1,
a_buckets(bMaxValue+1, bc_buckets(dMaxValue+1))
)
{
}
void Collection::Add(const YourObject& object)
{
value_list::iterator iter = m_values.insert(m_values.end(), object);
a_buckets& a_bucket = m_buckets[object.a];
for (int i = object.c; i <= object.b; ++i)
{
bc_buckets& bc_bucket = a_bucket[i];
for (int j = 0; j <= object.d; ++j)
{
bc_bucket[j].push_back(iter);
}
}
} // Collection::Add
template <class Pred>
void Collection::Apply(int value1, int value2, int value3, Pred pred)
{
iterator_vector const& indexes = m_buckets[value1][value2][value3];
BOOST_FOREACH(value_list::iterator it, indexes)
{
pred(*it);
}
} // Collection::Apply<Pred>
template <class Pred>
void Collection::Apply(int value1, int value2, int value3, Pred pred) const
{
iterator_vector const& indexes = m_buckets[value1][value2][value3];
// Promotion from value_list::iterator to value_list::const_iterator is ok
// The reverse is not, which is why we keep iterators
BOOST_FOREACH(value_list::const_iterator it, indexes)
{
pred(*it);
}
} // Collection::Apply<Pred>
So, admittedly, adding and removing items to that collection will cost.
Furthermore, you have (aMaxValue + 1) * (bMaxValue + 1) * (dMaxValue + 1) std::vector<value_list::iterator> stored, which is a lot.
However, Collection::Apply complexity is roughly k applications of Pred where k is the number of items which match the parameters.
I am looking for a review there, not sure I got all the indexes right oO
If your app is already using a database then just put them in a table and use a query to find it. I use mysql in a few of my apps and would recommend it.
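First, build a separate table for each value of a, holding the elements that share that a.
Then build two index tables per a, each with 1000 rows (one row per possible query value).
Each row of an index table is an integer used as a bit mask, with one bit per element, marking which elements
will be involved for that row.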
For example let's say you have values in the array
(ignoring a because we have a table for each a value)
b = 96 46 47 27 40 82 9 67 1 15
c = 76 23 91 18 24 20 15 43 17 10
d = 44 30 61 33 21 52 36 70 98 16
then the index table values for the row 50, 20 are:
idx[a].bc[50] = 0000010100
idx[a].d[50] = 1101101001
idx[a].bc[20] = 0001010000
idx[a].d[20] = 0000000001
so let's say you do func(a, 20, 50).
Then to get which numbers are involved you do:
g = idx[a].bc[20] & idx[a].d[50];
Then g has 1-s for each number you have to deal with. If you don't
need the array values then you can just do a populationCount on g. And
do the inner thing popCount(g) times.
You can do
unsigned tg = g;
int n = 0;
while (tg != 0) {
if (tg & 1) {
// do your stuff with element n
}
tg >>= 1;
n++;
}
Maybe the tg >>= 1; n++; part can be improved by skipping over runs of zeros, but I have no idea if that's possible. It should be considerably faster than your current approach because all variables for the loop are in registers.
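The index tables from this answer can be sketched with std::bitset, for illustration. BitIndex, N and MAX_VALUE are hypothetical names, and the bound of 100 on the values is an assumption that fits the example data; bit i of each mask refers to element i.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t N = 10;   // number of elements sharing one value of a
const int MAX_VALUE = 100;  // assumed upper bound of b, c, d and the query values

struct BitIndex {
    std::vector<std::bitset<N>> bc;  // bc[v]: bit i set if b[i] >= v && c[i] <= v
    std::vector<std::bitset<N>> d;   // d[v]:  bit i set if d[i] <= v

    BitIndex(const int (&b)[N], const int (&c)[N], const int (&dv)[N])
        : bc(MAX_VALUE + 1), d(MAX_VALUE + 1) {
        for (int v = 0; v <= MAX_VALUE; ++v)
            for (std::size_t i = 0; i < N; ++i) {
                bc[v][i] = b[i] >= v && c[i] <= v;
                d[v][i] = dv[i] <= v;
            }
    }

    // Mask of the elements with b >= value2, c <= value2 and d <= value3.
    std::bitset<N> match(int value2, int value3) const {
        assert(value2 >= 0 && value2 <= MAX_VALUE);
        assert(value3 >= 0 && value3 <= MAX_VALUE);
        return bc[value2] & d[value3];
    }
};
```

With the b, c and d arrays from the example, match(20, 50) has a single bit set, the one for the element with b = 27, c = 18, d = 33. The count() member of the resulting mask plays the role of the population count mentioned above.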
As pmg said, the idea is to eliminate as many comparisons as possible. Obviously you won't have 4000 comparisons. That would require that all 1000 elements pass the first test, which would then be redundant. Apparently there are only 10 values of a, hence 10% passes that check. So, you'd do 1000 + 100 + ? + ? checks. Let's assume +50+25, for a total of 1175.
You'd need to know how a,b,c,d and value1, 2 and 3 are distributed to decide exactly what's fastest. We only know that a can have 10 values, and we presume that value1 has the same domain. In that case, binning by a can reduce it to an O(1) operation to get the right bin, plus the same 175 checks further on. But if b,c and value2 effectively form 50 buckets, you could find the right bucket again in O(1). Yet each bucket would now have an average of 20 elements, so you'd only need 35 tests (80% reduction). So, data distribution matters here. Once you understand your data, the algorithm will be clear.
Look, this is just a linear search. It would be nice if you could do a search that scales up better, but your complex matching requirements make it unclear to me whether it's even possible to, say, keep it sorted and use a binary search.
Having said this, perhaps one possibility is to generate some indexes. The main index might be a dictionary keyed on the a property, associating it with a list of elements with the same value for that property. Assuming the values for this property are well-distributed, it would immediately eliminate the overwhelming majority of comparisons.
If the property has a limited number of values, then you could consider adding an additional index which sorts items by b and maybe even another that sorts by c (but in the opposite order).
You can use hash_set from Standard Template Library(STL), this will give you very efficient implementation. complexity of your search would be O(1)
here is link: http://www.sgi.com/tech/stl/hash_set.html
--EDIT--
declare new Struct which will hold your variables, overload comparison operators and make the hash_set of this new struct. every time you want to search, create new object with your variables and pass it to hash_set method "find".
It seems that hash_set is not mandatory for STL, therefore you can use set and it will give you O(LogN) complexity for searching.
here is example:
#include <cstdlib>
#include <iostream>
#include <set>
using namespace std;
struct Obj{
public:
Obj(double a, double b, double c, double d){
this->a = a;
this->b = b;
this->c = c;
this->d = d;
}
double a;
double b;
double c;
double d;
friend bool operator < ( const Obj &l, const Obj &r ) {
if(l.a != r.a) return l.a < r.a;
if(l.b != r.b) return l.b < r.b;
if(l.c != r.c) return l.c < r.c;
if(l.d != r.d) return l.d < r.d;
return false;
}
};
int main(int argc, char *argv[])
{
set<Obj> A;
A.insert( Obj(1,2,3,4));
A.insert( Obj(16,23,36,47));
A.insert(Obj(15,25,35,43));
Obj c(1,2,3,4);
A.find(c);
cout << A.count(c);
system("PAUSE");
return EXIT_SUCCESS;
}