algorithm to identify string matches with hashing and no false positives - c++

I need to find out whether a given string matches any string in a list, without having the strings in the list present; basically I need to hash the strings and match only against the list of hashes. The problem is being sure that there are no false positives, so that only exact matches are found and any other set of characters is not. This is of course easy with an actual list of strings, where even a simple binary search will work, but I want an algorithm that works without the actual characters present (i.e. precalculated). A Bloom filter doesn't guarantee that some arbitrary set of characters won't be matched.
Update: this is similar to storing only password hashes and then hashing an entered password, which is then compared against the hash list to see if the password is one of them (admittedly not the usual use of a password). The reason for this requirement is to not have to ship the actual text, just the hashes.
Update 2: Is there another way to do this without a perfect hash function? I have hundreds of thousands of entries; finding a perfect hash is hard. Maybe something like a Bloom filter but with a better guarantee?

A good cryptographic hash function (with sufficient bits) will make the probability of a false match extremely small; sufficiently small that brute force attacks are essentially impossible. Most security systems feel that such mechanisms are adequate.
If you want an absolute guarantee that no false positive is possible, then you'll actually need to include enough data to validate the input; that cannot be any shorter than the target strings (but it doesn't have to be any larger). In effect, you need to encrypt the target strings. Since the encryption key and the encrypted strings will both be visible, in order to avoid someone simply decrypting the encrypted strings you need to use an asymmetric cipher. Those are computationally expensive, but that might not be a problem for your environment.
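For illustration, here is a minimal sketch of the hash-and-compare approach, assuming OpenSSL is available for SHA-256; the names digestOf, matches and shippedHashes are my own placeholders, not anything from the question:

#include <openssl/sha.h>
#include <algorithm>
#include <array>
#include <string>
#include <vector>

using Digest = std::array<unsigned char, SHA256_DIGEST_LENGTH>;

Digest digestOf(const std::string &s)
{
    Digest d;
    SHA256(reinterpret_cast<const unsigned char *>(s.data()), s.size(), d.data());
    return d;
}

// shippedHashes is the precalculated, sorted list of digests you ship
// instead of the plain strings; a binary search then works on hashes
// exactly as it would on the strings themselves.
bool matches(const std::string &input, const std::vector<Digest> &shippedHashes)
{
    return std::binary_search(shippedHashes.begin(), shippedHashes.end(), digestOf(input));
}

With 256-bit digests an accidental false positive is astronomically unlikely, though as noted above it is never strictly impossible.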

Any perfect hash will do. Follow it up with a string compare to verify it is not a false positive.

Here's a "almost perfect" hashing algorithm:
MPQ Hash
It is used in the StarCraft save files. This algorithm is very efficient, and has a very low collision possibility (about 1:18889465931478580854784 on average).
Here is how this algorithm works.
1. Compute the three hashes (the offset hash and the two check hashes) and store them in variables.
2. Move to the entry of the offset hash.
3. Is the entry unused? If so, stop the search and return 'file not found'.
4. Do the two check hashes match the check hashes of the file we're looking for? If so, stop the search and return the current entry.
5. Move to the next entry in the list, wrapping around to the beginning if we were on the last entry.
6. Is the entry we just moved to the same entry the offset hash pointed at originally (did we look through the whole hash table)? If so, stop the search and return 'file not found'.
7. Go back to step 3.
And here are the hashing function and the hash table lookup:
unsigned long HashString(char *lpszFileName, unsigned long dwHashType)
{
    //lpszFileName is the string to be hashed.
    //dwHashType selects the hashing mode (position vs. the two check hashes);
    //you can see how it's used at the beginning of GetHashTablePos().
    //cryptTable is the precomputed 0x500-entry MPQ crypt table (not shown here).
    unsigned char *key = (unsigned char *)lpszFileName;
    unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;
    int ch;

    while (*key != 0)
    {
        ch = toupper(*key++); //Convert every character to upper case.
        seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    return seed1;
}
int GetHashTablePos(char *lpszString, MPQHASHTABLE *lpTable, int nTableSize)
{
    const int HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;

    //nHash controls where the hash value of the string should be stored in the hash table.
    //nHashA and nHashB are used for verifying the match.
    int nHash = HashString(lpszString, HASH_OFFSET),
        nHashA = HashString(lpszString, HASH_A),
        nHashB = HashString(lpszString, HASH_B),
        nHashStart = nHash % nTableSize, nHashPos = nHashStart;

    while (lpTable[nHashPos].bExists)
    {
        if (lpTable[nHashPos].nHashA == nHashA && lpTable[nHashPos].nHashB == nHashB)
            return nHashPos;                        //If found, return the entry index.
        else
            nHashPos = (nHashPos + 1) % nTableSize; //Not found, move to the next position.

        if (nHashPos == nHashStart)
            break;                                  //Wrapped around to the start: stop searching.
    }
    return -1; //Error value: not found.
}

You might consider putting together a Bloom filter. That is basically a set of hash functions feeding one bit array which, taken as a group, can give a very high probability of a correct match. It's not 100%, but you can get as close to it as you like.
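To give a feel for the data structure, here is a toy sketch; the bit count, the choice of k = 4 hash functions, and the seeded std::hash mix are arbitrary illustrative choices, not a tuned design:

#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

class BloomFilter {
    static const std::size_t NBITS = 1 << 20;   // ~1M bits; size to taste

    std::bitset<NBITS> bits;

    static std::size_t hashWithSeed(const std::string &s, std::size_t seed) {
        // Cheap seeded mix on top of std::hash; fine for a demo.
        return static_cast<std::size_t>(
            std::hash<std::string>()(s) ^ (seed * 0x9E3779B97F4A7C15ULL));
    }

public:
    void add(const std::string &s) {
        for (std::size_t i = 1; i <= 4; ++i)    // k = 4 hash functions
            bits.set(hashWithSeed(s, i) % NBITS);
    }

    bool possiblyContains(const std::string &s) const {
        for (std::size_t i = 1; i <= 4; ++i)
            if (!bits.test(hashWithSeed(s, i) % NBITS))
                return false;                   // definitely not in the set
        return true;                            // in the set, or a false positive
    }
};

A "no" from possiblyContains() is always exact; only the "yes" side carries the tunable false-positive probability, which is why a Bloom filter alone cannot meet the question's zero-false-positive requirement.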

Related

Caesar Cipher w/Frequency Analysis how to proceed next?

I understand this has been asked before, and I somewhat have a grasp on how to compare frequency tables between the ciphertext and English (the language I'm assuming it's in for my program), but I'm unsure how to get this into code.
void frequencyUpdate(std::vector< std::vector<std::string> > &file, std::vector<int> &freqArg) {
    for (int itr_1 = 0; itr_1 < file.size(); ++itr_1) {
        for (int itr_2 = 0; itr_2 < file.at(itr_1).size(); ++itr_2) {
            for (int itr_3 = 0; itr_3 < file.at(itr_1).at(itr_2).length(); ++itr_3) {
                char &c = file.at(itr_1).at(itr_2).at(itr_3);
                c = toupper((unsigned char)c);
                if (c >= 'A' && c <= 'Z') {
                    freqArg.at(c - 'A') += 1;   // index 0 = 'A', ..., 25 = 'Z'
                }
            }
        }
    }
}
This is how I get the frequency of a given file that has its contents split into lines and then into words, hence the double vector of strings, using the ASCII value of each char minus 65 ('A') as the index. The resulting vector of ints that holds the frequencies is saved.
Now is where I don't know how to proceed. Should I hardcode a const std::vector<int> for the English letter frequencies and then somehow do the comparison? How would I compare efficiently, rather than simply comparing the vectors element by element, which is probably not an efficient method?
This comparison is for getting an appropriate shift value for Caesar cipher shifting to decrypt a text. I don't want to use brute force and shift one at a time until the text is readable. Any advice on how to approach this? Thanks.
Take your frequency vector and the frequency vector for "typical" English text, and find the cross-correlation.
The highest values of the cross-correlation correspond to the most likely shift values. At that point you'll need to use each one to decrypt, and see whether the output is sensible (i.e. forms real words and coherent sentences).
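As a sketch of that idea, assuming the frequency vector holds 26 counts indexed from 'A' (matching the code above); the English percentages are approximate values commonly quoted:

#include <vector>

// Returns the most likely encryption shift; subtract it (mod 26) to decrypt.
int bestCaesarShift(const std::vector<int> &observed /* 26 counts, 0 = 'A' */)
{
    static const double englishFreq[26] = {
        8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15,
        0.77, 4.03, 2.41, 6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06,
        2.76, 0.98, 2.36, 0.15, 1.97, 0.07 };

    int bestShift = 0;
    double bestScore = -1.0;
    for (int shift = 0; shift < 26; ++shift) {
        double score = 0.0;
        for (int i = 0; i < 26; ++i)
            // Under this shift, plaintext letter i encrypts to (i + shift) % 26.
            score += observed[(i + shift) % 26] * englishFreq[i];
        if (score > bestScore) {
            bestScore = score;
            bestShift = shift;
        }
    }
    return bestShift;
}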
In English, 'e' has the highest frequency. So whatever the most frequent letter in your ciphertext is, it most likely maps to 'e'.
Since e --> X, the key should be the difference between 'e' and your most frequent letter X.
If this is not the right key (a too-short ciphertext can distort the statistics), try matching your most frequent ciphertext letter with the next most frequent English letter, i.e. 't'.
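In code, that first guess is a one-liner (mostFrequent is assumed to be an uppercase letter):

// Shift such that mostFrequent decrypts to 'e'; on failure, retry
// against the next letters in English frequency order ('t', 'a', ...).
int guessShift(char mostFrequent /* 'A'..'Z' */)
{
    return ((mostFrequent - 'A') - ('E' - 'A') + 26) % 26;
}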
I would suggest a graph traversal algorithm. Your starting node has no substitutions assigned and has 26 connected nodes, one for each possible letter substitution for the most frequently occurring ciphertext letter. The next node has another 25 connected nodes for the possible letters for the second most frequent ciphertext letter (one less, since you've already used one possible letter). Which destination node you choose should be based on which letters are most likely given a normal frequency distribution for the target language.
At each node, you can test for success by doing your substitutions into the ciphertext, and finding all the resulting words that now match entries in a dictionary file. The more matches you've found, the more likely you've got the correct substitution key.

Increase string overlap matrix building efficiency

I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to an array
create an empty N x N matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
write the longest overlap to the matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates a "wild card" (the probe gave a below-threshold quality score), and every other character is a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but that method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
//part that compares them all to each other
for (int j = 0; j < counter; j++)        //counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++)
    {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        //boring stuff
    }

//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (int j = 0; j < str2.length(); j++)
    {
        int counter = 0, offset = 0;
        //bounds checks so we never read past the end of either string
        while (offset < str1.length() && j + offset < str2.length() &&
               str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means roughly n^2/2 comparisons, and you will need half a terabyte of memory even if you store just one byte per string pair.
However, I assume what you are really interested in is long overlaps, those that have more than, say, 20 or 30 or even more than 80 characters in common; you probably don't really want to know if two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you - still without knowing if it fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken up by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application, which I don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't really have to, store the indexes of the original strings together with each substring. You'll get something like (example):
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list (a simplified sketch follows after this list):
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurrence counter by one, else enter it into the table.
e) Repeat 3d with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
Whether this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random / you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
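Here is a loose C++ sketch of steps 2) and 3d)/3e) using a hash map; it skips the length ordering, the X-tolerant comparison of step 3c), and the index bookkeeping, and it inherits the scheme's caveat that a substring reachable by trimming both ends of the same original string gets counted twice. minLen (assumed >= 1) plays the role of "too short to be interesting":

#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string> findShared(const std::vector<std::string> &extracted,
                                    std::size_t minLen)
{
    std::unordered_map<std::string, int> count;
    std::vector<std::string> work, matches;

    auto enter = [&](const std::string &s) {
        if (++count[s] == 1) work.push_back(s);       // queue each distinct substring once
    };
    for (const auto &s : extracted) enter(s);

    while (!work.empty()) {
        std::string cur = std::move(work.back());
        work.pop_back();
        if (cur.size() < minLen) continue;            // step 3g: too short to be interesting
        if (count[cur] > 1) matches.push_back(cur);   // step 3b: substring seen more than once
        enter(cur.substr(1));                         // step 3d: drop the first character
        enter(cur.substr(0, cur.size() - 1));         // step 3e: drop the last character
    }
    return matches;
}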
I don't see many ways around the fact that you need to compare each string with each other, including shifting them, and that is by itself extremely slow; a computing cluster seems like the best approach.
The only thing I see how to improve is the string comparison itself: replace A, C, T, G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (this might not be a good idea, though, but is still an option to investigate), and then compare positions quickly with an AND operation, so that you 'just' have to count how many consecutive non-zero values you have. That's just a way to handle the wildcard; sorry, I don't have a better idea for reducing the complexity of the overall comparison.
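A sketch of that encoding, keeping one 4-bit code per byte for clarity rather than packing two per byte; the outer alignment-shifting loop stays as in the question, this only replaces the inner per-position test:

#include <cstdint>
#include <string>
#include <vector>

static uint8_t baseCode(char c)
{
    switch (c) {
        case 'A': return 0x1;
        case 'C': return 0x2;
        case 'T': return 0x4;
        case 'G': return 0x8;
        default:  return 0xF;   // 'X' wildcard: overlaps every base
    }
}

std::vector<uint8_t> encode(const std::string &dna)
{
    std::vector<uint8_t> out(dna.size());
    for (std::size_t i = 0; i < dna.size(); ++i)
        out[i] = baseCode(dna[i]);
    return out;
}

// Longest run of compatible positions at one fixed alignment:
// (a[i] & b[i]) != 0 exactly when the two positions are compatible.
int longestRun(const std::vector<uint8_t> &a, const std::vector<uint8_t> &b)
{
    int best = 0, run = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        run = (a[i] & b[i]) ? run + 1 : 0;
        if (run > best) best = run;
    }
    return best;
}

Note that under this scheme an X counts as matching anything, which is the opposite of the question's code (where X breaks a run); use 0x0 instead of 0xF for X if a wildcard should never match.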

Topic mining algorithm in c/c++

I am working on a subject extraction from articles algorithm, using C++.
First I have written code to remove words like articles, prepositions, etc.
Then the rest of the words get stored in one char array: char *excluded_string[50] = { 0 };
while ((NULL != word) && (50 > i)) {
    ch[i] = strdup(word);
    excluded_string[j] = strdup(word);
    word = strtok(NULL, " ");
    skp = BoyerMoore_skip(ch[i], strlen(ch[i]));
    if (skp != NULL)   // the word is an article or similar - skip it
    {
        i++;
        continue;
    }
    j++;
}
skp is NULL when ch[i] is not an article or a word of a similar category.
This function checks whether a word is an article, preposition, etc.
Now at the end excluded_string[] contains the set of required words. Now I want the occurrence count of each word in this array, and after that the word which has the maximum occurrences (all of them, if more than one).
What logic should I use?
What I thought is:
Take a two-dimensional array. The first column will have the word, and the 2nd column I can use for storing count values.
Then for each word, scan the array, and for each occurrence of that word increment the count value stored for it in the 2nd column.
But this is costly and also complex.
Any other idea?
If you wish to count the occurrences of each word in an array then you can do no better than O(n) (i.e. one pass over the array). However, if you try to store the word counts in a two dimensional array then you must also do a lookup each time to see if the word is already there, and this can quickly become O(n^2).
The trick is to use a hash table to do your lookup. As you step through your word list you increment the right entry in the hash table. Each lookup should be O(1), so it ought to be efficient as long as there are sufficiently many words to offset the complexity of the hashing algorithm and memory usage (i.e. don't bother if you're dealing with less than 10 words, say).
Then, when you're done, you just iterate over the entries in the hash table to find the maximum. In fact, I would probably keep track of that while counting the words so there's no need to do it after ("if thisWordCount is greater than currentMaximumCount then currentMaximum = thisWord").
I believe the standard C++ std::unordered_map type should do what you need; a sketch follows.
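For instance, a sketch of the counting pass with std::unordered_map, tracking the maximum while counting (std::string stands in for the question's char* entries):

#include <string>
#include <unordered_map>
#include <vector>

std::string mostFrequent(const std::vector<std::string> &words)
{
    std::unordered_map<std::string, int> count;
    std::string best;
    int bestCount = 0;
    for (const auto &w : words) {
        int c = ++count[w];          // O(1) expected per lookup/update
        if (c > bestCount) {
            bestCount = c;
            best = w;                // ties: the word that reached the max first wins
        }
    }
    return best;
}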

simple hash map with vectors in C++

I'm in my first semester of studies and as a part of my comp. science assignment I have to implement a simple hash map using vectors, but I have some problems understanding the concept.
First of all I have to implement a hash function. To avoid collisions I thought it would be better to use double hashing, as follows:
do {
    h = (k % m + j * (1 + (k % (m - 2)))) % m;   // the final % m keeps h inside the table
    j++;
} while (j % m != 0);
where h is the hash to be returned, k is the key and m is the size of hash_map (and a prime number; they are all of type int).
This was easy, but then I need to be able to insert or remove a pair of key and the corresponding value in the map.
The signature of the two functions should return bool, so I have to return either true or false, and I'm guessing that I should return true when there is no element at position h in the vector. (But I have no idea why remove should be bool as well.)
My problem is what to do when the insert function returns false (i.e. when there is already a key-value pair saved at position h - I implemented this as a function named find). I could obviously move it to the next free place by simply increasing j, but then the hash calculated by my hash function would no longer tell us at which place a certain key is saved, causing wrong behaviour of the remove function.
Is there any good example online that doesn't use the predefined std methods? (My Google has behaved weirdly in the past few days and only returns unhelpful hits in the local language.)
I've been told to move my comment to an answer, so here it is. I am presuming your get method takes the value you are looking for as an argument.
So what we are going to do is a process called linear probing.
When we insert the value, we hash it as normal; let's say our hash value is 4:
[x,x,x,,,x,x]
As we can see, we can simply insert it:
[x,x,x,x,,x,x]
However, if slot 4 is taken when we insert, we simply move to the next slot that is empty:
[x,x,x,x,(x),,x,x]   (the (x) marks the newly inserted value)
In linear probing, if we reach the end we loop back round to the beginning until we find a slot. You shouldn't run out of space, as you are using a vector, which can allocate extra space when it starts getting near full capacity.
This will cause problems when you are searching, because the value that hashed to 4 may not be at 4 anymore (in this case it's at 5). To solve this we do a little bit of a hack. Note that we still get O(1) expected run time for insertion and retrieval as long as the load factor stays below 1.
In our get method, instead of returning the value in the array at 4, we instead start looking for our value at 4; if it's there, we can return it. If not, we look at the value at 5, and so on until we find the value.
In pseudocode, the new stuff looks like this:
bool insert(value) {
    h = hash(value);
    while (node[h] != null) {
        h++;
        if (h == node.length) {
            h = 0;               // wrap around to the start
        }
    }
    node[h] = value;
    return true;
}
get
get(value) {
    h = hash(value);
    roundTrip = 0;   // used to see if we keep going round the hash map
    while (true) {
        if (node[h] == value)
            return node[h];
        h++;
        if (h == node.length) {
            h = 0;               // wrap around to the start
            roundTrip++;
        }
        if (roundTrip > 1) {     // we can't find it after going round the list once
            return -1;
        }
    }
}
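The same idea as compilable C++, with a std::vector of (occupied, value) slots; the fixed capacity and plain int values are simplifications of mine, and note that supporting remove would additionally require tombstone markers so that contains() doesn't stop early at a freed slot:

#include <cstddef>
#include <utility>
#include <vector>

class ProbingMap {
    std::vector<std::pair<bool, int>> slots;   // (occupied, value)

    std::size_t hash(int value) const {
        return static_cast<std::size_t>(value) % slots.size();
    }

public:
    explicit ProbingMap(std::size_t capacity) : slots(capacity, {false, 0}) {}

    bool insert(int value) {
        std::size_t h = hash(value);
        for (std::size_t tries = 0; tries < slots.size(); ++tries) {
            if (!slots[h].first) {              // free slot: take it
                slots[h] = {true, value};
                return true;
            }
            h = (h + 1) % slots.size();         // probe the next slot, wrapping around
        }
        return false;                           // table is full
    }

    bool contains(int value) const {
        std::size_t h = hash(value);
        for (std::size_t tries = 0; tries < slots.size(); ++tries) {
            if (!slots[h].first)
                return false;                   // empty slot: value cannot be further on
            if (slots[h].second == value)
                return true;
            h = (h + 1) % slots.size();
        }
        return false;
    }
};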

Given an array of integers, find the first integer that is unique

Given an array of integers, find the first integer that is unique.
my solution: use std::map
put the integers into it one by one (number as key, its index as value) (O(n^2 lg n)); if there is a duplicate, remove the entry from the map (O(lg n)); after putting all numbers into the map, iterate over the map and find the key with the smallest index (O(n)).
O(n^2 lg n) because the map needs to do sorting.
It is not efficient.
Any better solutions?
I believe that the following would be the optimal solution, at least based on time / space complexity:
Step 1:
Store the integers in a hash map, which holds the integer as a key and the count of the number of times it appears as the value. This is generally an O(n) operation and the insertion / updating of elements in the hash table should be constant time, on the average. If an integer is found to appear more than twice, you really don't have to increment the usage count further (if you don't want to).
Step 2:
Perform a second pass over the integers. Look each up in the hash map and the first one with an appearance count of one is the one you were looking for (i.e., the first single appearing integer). This is also O(n), making the entire process O(n).
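As a C++ sketch, the two passes with std::unordered_map (the function name and out-parameter are mine):

#include <unordered_map>
#include <vector>

bool firstUnique(const std::vector<int> &a, int &result)
{
    std::unordered_map<int, int> count;
    for (int v : a) ++count[v];      // pass 1: count occurrences, O(n) expected
    for (int v : a)                  // pass 2: first value that appears exactly once
        if (count[v] == 1) {
            result = v;
            return true;
        }
    return false;                    // no unique element
}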
Some possible optimizations for special cases:
Optimization A: It may be possible to use a simple array instead of a hash table. This guarantees O(1) even in the worst case for counting the number of occurrences of a particular integer as well as the lookup of its appearance count. Also, this enhances real time performance, since the hash algorithm does not need to be executed. There may be a hit due to potentially poorer locality of reference (i.e., a larger sparse table vs. the hash table implementation with a reasonable load factor). However, this would be for very special cases of integer orderings and may be mitigated by the hash table's hash function producing pseudorandom bucket placements based on the incoming integers (i.e., poor locality of reference to begin with).
Each byte in the array would represent the count (up to 255) for the integer represented by the index of that byte. This would only be possible if the difference between the lowest integer and the highest (i.e., the cardinality of the domain of valid integers) was small enough such that this array would fit into memory. The index in the array of a particular integer would be its value minus the smallest integer present in the data set.
For example on modern hardware with a 64-bit OS, it is quite conceivable that a 4GB array can be allocated which can handle the entire domain of 32-bit integers. Even larger arrays are conceivable with sufficient memory.
The smallest and largest integers would have to be known before processing, or another linear pass through the data using the minmax algorithm to find out this information would be required.
Optimization B: You could optimize Optimization A further, by using at most 2 bits per integer (One bit indicates presence and the other indicates multiplicity). This would allow for the representation of four integers per byte, extending the array implementation to handle a larger domain of integers for a given amount of available memory. More bit games could be played here to compress the representation further, but they would only support special cases of data coming in and therefore cannot be recommended for the still mostly general case.
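A sketch of the two-bit packing; the class name, and the assumption that the smallest value lo and the domain size are known up front (e.g. from a min/max pass), are mine:

#include <cstdint>
#include <vector>

class TwoBitCounts {
    std::vector<uint8_t> bytes;   // four 2-bit counters per byte
    int lo;                       // smallest integer in the domain

public:
    TwoBitCounts(int lowest, std::size_t domainSize)
        : bytes((domainSize + 3) / 4, 0), lo(lowest) {}

    void record(int v) {
        std::size_t idx = static_cast<std::size_t>(v - lo);
        uint8_t &b = bytes[idx / 4];
        int shift = (idx % 4) * 2;
        if (b & (1u << shift))        // already seen at least once?
            b |= (2u << shift);       //   then mark multiplicity
        else
            b |= (1u << shift);       //   else mark presence
    }

    bool seenExactlyOnce(int v) const {
        std::size_t idx = static_cast<std::size_t>(v - lo);
        int shift = (idx % 4) * 2;
        return ((bytes[idx / 4] >> shift) & 3u) == 1u;   // presence bit only
    }
};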
All this for no reason - just using two for-loops and a variable would give you a simple O(n^2) algorithm.
If you are taking all the trouble of using a hash map, then it might as well be what @Michael Goldshteyn suggests.
UPDATE: I know this question is a year old, but I was looking through the questions I answered and came across this one. I think there is a better solution than using a hashtable.
When we say unique, we will have a pattern. E.g.: [5, 5, 66, 66, 7, 1, 1, 77]. In this, let's have a moving window of 3. First consider (5, 5, 66). We can easily establish that there is a duplicate here. So move the window by 1 element: we get (5, 66, 66). Same here. Move to the next: (66, 66, 7). Again dups here. Next: (66, 7, 1). No dups here! Take the middle element, as this has to be the first unique in the set: the left element belongs to the previous dups, and the right one (the 1) could still turn out to be a dup. Hence 7 is the first unique element.
space: O(1)
time: O(n) * O(m^2) = O(n) * 9 ≈ O(n)
Inserting into a map is O(log n), not O(n log n), so inserting n keys will be O(n log n). Also, it's better to use a std::set here.
Although it's O(n^2), the following has small coefficients, isn't too bad on the cache, and uses memmem() (a GNU extension), which is fast. Note that memmem() searches bytes, so in principle it can match a value straddling two adjacent ints; for a quick-and-dirty scan that is usually acceptable.
for (int x = 0; x < len; x++)
    if (memmem(&array[x+1], sizeof(int) * (len - (x+1)), &array[x], sizeof(int)) == NULL &&
        memmem(array, sizeof(int) * x, &array[x], sizeof(int)) == NULL)
        return array[x];
public static string firstUnique(int[] input)
{
    int size = input.Length;
    bool[] dupIndex = new bool[size];
    for (int i = 0; i < size; ++i)
    {
        if (dupIndex[i])
        {
            continue;
        }
        else if (i == size - 1)
        {
            return input[i].ToString();
        }
        for (int j = i + 1; j < size; ++j)
        {
            if (input[i] == input[j])
            {
                dupIndex[j] = true;
                break;
            }
            else if (j == size - 1)
            {
                return input[i].ToString();
            }
        }
    }
    return "No unique element";
}
@user3612419
The solution you gave is good, somewhere close to O(N^2), but further optimization of the same code is possible; I just added the two or three lines that you missed.
public static string firstUnique(int[] input)
{
    int size = input.Length;
    bool[] dupIndex = new bool[size];
    for (int i = 0; i < size; ++i)
    {
        if (dupIndex[i])
        {
            continue;
        }
        else if (i == size - 1)
        {
            return input[i].ToString();
        }
        for (int j = i + 1; j < size; ++j)
        {
            if (dupIndex[j] == true)
            {
                continue;
            }
            if (input[i] == input[j])
            {
                dupIndex[j] = true;
                dupIndex[i] = true;
                break;
            }
            else if (j == size - 1)
            {
                return input[i].ToString();
            }
        }
    }
    return "No unique element";
}