Rabin-Karp Algorithm - C++

I am interested in implementing the Rabin-Karp algorithm to search for substrings, as described on Wikipedia: http://en.wikipedia.org/wiki/Rabin-Karp_string_search_algorithm. Not for homework, but out of self-interest. I have implemented the Rabin-Karp algorithm (shown below) and it works. However, the performance is really bad. I understand that my hash function is basic, but it seems that a simple call to strstr() will always outperform my rabin_karp(). I can understand why: the hash function is doing more work than a simple char-by-char compare in each loop. What am I missing here? Should the Rabin-Karp algorithm be faster than a call to strstr()? When is the Rabin-Karp algorithm best used? Hence my self-interest. Have I even implemented the algorithm correctly?
size_t hash(char* str, size_t i)
{
    size_t h = 0;
    size_t magic_exp = 1;
    // if (str != NULL)
    {
        while (i-- != 0)
        {
            magic_exp *= 101;
            h += magic_exp + *str;
            ++str;
        }
    }
    return h;
}
char* rabin_karp(char* s, char* find)
{
    char* p = NULL;
    if (s != NULL && find != NULL)
    {
        size_t n = strlen(s);
        size_t m = strlen(find);
        if (n > m)
        {
            size_t hfind = hash(find, m);
            char* end = s + (n - m + 1);
            for (char* i = s; i < end; ++i)
            {
                size_t hs = hash(i, m);
                if (hs == hfind)
                {
                    if (strncmp(i, find, m) == 0)
                    {
                        p = i;
                        break;
                    }
                }
            }
        }
    }
    return p;
}

You haven't implemented the hash correctly. The key to Rabin-Karp is to incrementally update the hash as the potential match moves along the string to be searched. As you've determined, if you recalculate the entire hash for each potential match position, things will be really slow.
For every case except for the first comparison, your hash function should take an existing hash, one new character, and one old character, and return an updated hash.

Rabin-Karp is a rolling-hash algorithm: the idea is to be able to move the window one position in either direction (left or right) and recompute the hash with a constant number of operations. As you have implemented it, the search has complexity O(N * L), where N is the length of the big string and L is the length of the string you are searching for. This is the complexity of the most naive approach, and in fact is a slight pessimization of it, in my opinion.
To improve your algorithm, precompute the exponents of magic_exp and use them to 'roll' your hash: basically, just as with polynomials, you subtract the highest-degree term, multiply by magic_exp, and add the hash of the symbol to the right (to move the hash one position to the right).
Hope this helps.
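For illustration, here is a minimal sketch of the rolling update both answers above describe. This is my own code, not the poster's: it uses the standard polynomial hash (rather than the additive one in the question), with an arbitrary base of 101 and arithmetic implicitly modulo 2^64 via unsigned wraparound. Precompute B^(m-1) once; then each shift drops the outgoing character, multiplies by the base, and adds the incoming one.

#include <cstring>
#include <cstdint>

// Rolling Rabin-Karp sketch (illustrative). The hash of s[0..m) is
// s[0]*B^(m-1) + s[1]*B^(m-2) + ... + s[m-1], modulo 2^64.
const char* rabin_karp_rolling(const char* s, const char* find)
{
    const uint64_t B = 101;
    size_t n = strlen(s), m = strlen(find);
    if (m == 0 || n < m) return nullptr;

    uint64_t high = 1;                    // B^(m-1), precomputed once
    for (size_t i = 1; i < m; ++i) high *= B;

    uint64_t hfind = 0, hs = 0;
    for (size_t i = 0; i < m; ++i) {      // initial hashes: O(m)
        hfind = hfind * B + (unsigned char)find[i];
        hs    = hs    * B + (unsigned char)s[i];
    }

    for (size_t i = 0; ; ++i) {
        if (hs == hfind && strncmp(s + i, find, m) == 0)
            return s + i;                 // hash match, confirmed char-by-char
        if (i + m >= n) break;
        // roll: drop s[i], shift, add s[i+m] -- O(1) per position
        hs = (hs - (unsigned char)s[i] * high) * B + (unsigned char)s[i + m];
    }
    return nullptr;
}

With this update, each loop iteration does a constant amount of hashing work, so the expected running time becomes O(n + m) instead of O(n * m).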

strstr typically uses a linear-time algorithm (KMP or similar), which means the complexity of the two approaches is roughly the same. From there, the constant factor is what matters. Especially if you have a bad hash function with a lot of collisions, KMP will be a lot faster.
EDIT: One more thing. It is very important for the Rabin-Karp algorithm to have all the hash codes of the prefixes precalculated. As it stands you are not implementing proper Rabin-Karp, because the calls to your hash function are linear, not constant, in complexity. (Which, by the way, means that Wikipedia is not a very good source to learn Rabin-Karp from.)
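As a rough illustration of the precomputed-prefix-hash idea mentioned above (my own sketch, not the answerer's code): store h[i] = hash of s[0..i) together with the powers of the base, and the hash of any window then comes out in O(1). The base 101 and the implicit 2^64 modulus are arbitrary choices.

#include <string>
#include <vector>
#include <cstdint>

// Prefix-hash sketch: the hash of s[l..r) is h[r] - h[l] * B^(r-l), all mod 2^64.
struct PrefixHash {
    std::vector<uint64_t> h, pw;
    explicit PrefixHash(const std::string& s) : h(s.size() + 1, 0), pw(s.size() + 1, 1) {
        const uint64_t B = 101;
        for (size_t i = 0; i < s.size(); ++i) {
            h[i + 1]  = h[i] * B + (unsigned char)s[i];   // hash of s[0..i+1)
            pw[i + 1] = pw[i] * B;                        // B^(i+1)
        }
    }
    // hash of s[l..r), computed in O(1)
    uint64_t window(size_t l, size_t r) const {
        return h[r] - h[l] * pw[r - l];
    }
};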

Related

understanding algorithmic complexity

I'm looking at some online algorithm solutions for coding interviews, and I don't understand why this algorithm is claimed to be O(n^3).
Caveat: I understand that big-O notation is abused in industry; when I refer to O(n), I'm using that notation to mean the upper bound of an algorithm's runtime, as is common outside of academia.
Finding the longest palindromic substring. A simple solution might be:
bool isPalindrome(std::string s) {
    if (s.length() <= 1) {
        return true;
    }
    if (s[0] == s[s.length() - 1]) {
        return isPalindrome(s.substr(1, s.length() - 2));
    } else {
        return false;
    }
}
std::string longestPalindrome(std::string s) {
    std::string max_pal = "";
    for (size_t i = 0; i < s.length(); ++i) {
        for (size_t len = 1; len <= s.length() - i; ++len) {
            std::string sub = s.substr(i, len);
            if (isPalindrome(sub)) {
                if (max_pal.size() < sub.size()) max_pal = sub;
            }
        }
    }
    return max_pal;
}
Isn't this algorithm O(n^2)? Very simply, it takes O(n^2) time to generate all substrings, and O(n) time to determine if it's a palindrome. Where n is the number of characters in the initial string.
Isn't this algorithm O(n^2)? Very simply, it takes O(n^2) time to
generate all substrings, and O(n) time to determine if it's a
palindrome.
What you are describing is exactly O(n^3): for each substring you are doing an operation that costs O(n), so the total number of operations is O(n^2 * n), which is O(n^3).
However, the code as written is actually O(n^4), because isPalindrome() is itself O(n^2):
it creates O(n) substrings, of total size 1 + 3 + 5 + ... + (n-2), which is O(n^2) time in total.
Doing this O(n^2) times in longestPalindrome() totals O(n^4).
(This assumes O(n) substr() complexity. That isn't formally specified, but it's usually the case.)
You are almost right:
it takes O(n^2) and O(n) operations, respectively, to generate the strings and to check them.
Thus, you need O(n^2) (the number of strings) times O(n) checks.
Since n^2 * n = n^3, the total run time is in O(n^3).
An O(n^2) check (substr turns out to be O(n) itself) is executed inside a double loop (O(n^2)). That gives us O(n^4).
Actually this would even be O(n^4), because of how wasteful the implementation is.
isPalindrome is implemented in such a way that every recursive invocation allocates a new string, which is essentially the source string with the first and last characters removed. So every such call is already O(n).
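To make the per-substring check genuinely O(n), the palindrome test can work on indices into the original string instead of allocating copies. A minimal sketch (names are mine, not from the question):

#include <string>

// Checks whether s[lo..hi] (inclusive) is a palindrome without creating any
// temporary strings: O(hi - lo) time, O(1) extra space.
bool isPalindromeRange(const std::string& s, size_t lo, size_t hi) {
    while (lo < hi) {
        if (s[lo] != s[hi]) return false;
        ++lo;
        --hi;
    }
    return true;
}

With this check inside the double loop, the overall algorithm is O(n^3), matching the claimed bound.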

Is String a Permutation of list of strings

Given a list of words and a bigger string, how can we find whether the string is a permutation of the smaller strings?
e.g. s = badactorgoodacting, dict[] = {'actor','bad','act','good'} -> FALSE
e.g. s = badactorgoodacting, dict[] = {'actor','bad','acting','good'} -> TRUE
The smaller words themselves don't need to be permuted. The question is whether we can find an ordering of the smaller strings such that, if concatenated in that order, they give the larger string.
One more constraint: some words from dict[] may be left over unused.
The following gives a complexity of O(n^2). Are there other ways to do this, so as to improve the complexity or increase efficiency in general? Mostly in Java. Thanks in advance!
bool strPermuteFrom(unordered_set<string> &dict, string &str, int pos)
{
    if (pos >= str.size())
        return true;
    if (dict.size() == 0)
        return false;
    for (int j = pos; j < str.size(); j++) {
        string w = str.substr(pos, j - pos + 1);
        if (dict.find(w) != dict.end()) {
            dict.erase(w);
            bool status = strPermuteFrom(dict, str, j + 1);
            dict.insert(w);
            if (status) {
                if (pos > 0) str.insert(pos, " ");
                return status;
            }
        }
    }
    return false;
}
bool strPermute(unordered_set<string> &dict, string &str)
{
    return strPermuteFrom(dict, str, 0);
}
The code sample you give doesn't take much advantage of unordered_set (equivalent to Java's HashSet); each lookup is O(1), but it has to perform many lookups (one for each possible prefix, at every position in the string). A std::set (Java TreeSet), being ordered, would let you find all candidate prefixes at a given point starting from a single O(log n) lookup (followed by a scan from that point while you are still looking at possible prefixes), rather than doing string-length O(1) lookups at each recursive step.
So where this code is doing O(stringlength * dictsize^2) work, using a sorted set structure should reduce it to roughly O(dictsize log dictsize) work. The string length matters less, because you no longer look up every prefix; you look up the remaining string once at each recursive step, and because the set is ordered, a matching prefix will sort just before the whole remaining string, with no need to check individual substrings. Technically, backtracking would still be necessary (to handle a case where one dict word is a prefix of another, e.g. act and acting), but aside from that case you'd never need to backtrack; you'd only ever have a single hit at each recursive step, so you'd just be performing dictsize lookups, each costing log dictsize.
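As a rough illustration of that lookup (my own sketch, with hypothetical names, assuming the dictionary is kept in a std::set<std::string>): because the set is sorted, every word that could match at the current position shares the remaining text's first character and sorts at or just below the remaining text, so the candidates can be gathered by walking backwards from lower_bound.

#include <set>
#include <string>
#include <vector>

// Hypothetical helper: all dictionary words that are prefixes of str[pos..].
// Uses the set's ordering to restrict the scan to words that are <= the
// remaining string and share its first character.
std::vector<std::string> prefixCandidates(const std::set<std::string>& dict,
                                          const std::string& str, size_t pos) {
    std::vector<std::string> out;
    if (pos >= str.size()) return out;
    std::string rest = str.substr(pos);
    auto it = dict.lower_bound(rest);
    if (it != dict.end() && *it == rest) out.push_back(*it);   // exact match
    while (it != dict.begin()) {
        --it;
        if ((*it)[0] != rest[0]) break;             // left the shared-first-letter block
        if (rest.compare(0, it->size(), *it) == 0)  // *it is a prefix of rest
            out.push_back(*it);
    }
    return out;
}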

Can you do Top-K frequent Element better than O(nlogn) ? (code attached) [duplicate]

Input: A positive integer K and a big text. The text can actually be viewed as word sequence. So we don't have to worry about how to break down it into word sequence.
Output: The most frequent K words in the text.
My thinking is like this.
Use a hash table to record every word's frequency while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.
Sort the (word, word-frequency) pairs, with "word-frequency" as the key. This takes O(n*lg(n)) time with a normal sorting algorithm.
After sorting, we just take the first K words. This takes O(K) time.
To summarize, the total time is O(n + n*lg(n) + K). Since K is surely smaller than N, it is actually O(n*lg(n)).
We can improve this. Actually, we only want the top K words; the other words' frequencies are of no concern to us. So we can use "partial heap sorting". For steps 2) and 3), we don't do a full sort. Instead, we change it to:
2') Build a heap of (word, word-frequency) pairs with "word-frequency" as the key. It takes O(n) time to build the heap.
3') Extract the top K words from the heap. Each extraction is O(lg(n)), so the total time is O(k*lg(n)).
To summarize, this solution costs O(n + k*lg(n)) time.
This is just my thought. I haven't found a way to improve step 1).
I hope some Information Retrieval experts can shed more light on this question.
This can be done in O(n) time
Solution 1:
Steps:
Count words and hash it, which will end up in the structure like this
var hash = {
    "I" : 13,
    "like" : 3,
    "meow" : 3,
    "geek" : 3,
    "burger" : 2,
    "cat" : 1,
    "foo" : 100,
    ...
    ...
};
Traverse the hash and find the most frequently used word (in this case "foo", 100), then create an array of that size.
Then traverse the hash again and use each word's number of occurrences as an array index; if there is nothing at that index, create a list, otherwise append the word to it. We end up with an array like:
0 1 2 3 100
[[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
Then just traverse the array from the end, and collect the k words.
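A minimal C++ sketch of this bucket idea (my own illustration, assuming the counts are already in an unordered_map): index an array of buckets by frequency, then walk it from the highest frequency down and collect k words.

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Bucket approach: bucket[f] holds all words that occur exactly f times.
// Building the buckets and scanning them from the top is O(n) overall.
std::vector<std::string> topK(const std::unordered_map<std::string, int>& freq, int k) {
    int maxFreq = 0;
    for (const auto& kv : freq) maxFreq = std::max(maxFreq, kv.second);

    std::vector<std::vector<std::string>> bucket(maxFreq + 1);
    for (const auto& kv : freq) bucket[kv.second].push_back(kv.first);

    std::vector<std::string> result;
    for (int f = maxFreq; f >= 1 && (int)result.size() < k; --f)
        for (const auto& w : bucket[f]) {
            result.push_back(w);
            if ((int)result.size() == k) break;
        }
    return result;
}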
Solution 2:
Steps:
Same as above
Use a min-heap and keep its size at k. For each word in the hash, compare its count with the heap's minimum: if it is greater than the minimum (and the heap already holds k entries), remove the minimum and insert the new entry; otherwise, if the heap holds fewer than k entries, just insert it.
After traversing the hash, convert the min-heap to an array and return it.
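Solution 2 in C++ might look like this (my own sketch): a min-heap keyed by count, capped at k entries, so the heap's minimum is always the weakest of the current top k.

#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Keep a size-k min-heap of (count, word); total cost O(n log k) after counting.
std::vector<std::string> topKHeap(const std::unordered_map<std::string, int>& freq, size_t k) {
    using Entry = std::pair<int, std::string>;   // (count, word)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;   // min-heap
    for (const auto& kv : freq) {
        heap.emplace(kv.second, kv.first);
        if (heap.size() > k) heap.pop();         // evict the current minimum
    }
    std::vector<std::string> result;
    while (!heap.empty()) { result.push_back(heap.top().second); heap.pop(); }
    return result;                               // least frequent of the top k comes first
}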
You're not going to get generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.
If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and an O(n + k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.
The optimization again comes after we run through the list, inserting into the hash table. We can use the median-of-medians algorithm to select the Kth largest element (by count) in the list. This algorithm is provably O(n).
After selecting that element, we partition the list around it, just as in quicksort. This is obviously also O(n). Everything on the high-count side of the pivot is in our group of K elements, so we're done (we can simply throw away everything else as we go along).
So the strategy is:
Go through each word and insert it into a hash table: O(n)
Select the Kth largest element (by count): O(n)
Partition around that element: O(n)
If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).
The O(n) time bound is optimal within a constant factor because we must examine each word at least once.
The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time.
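In C++ the select-and-partition step could be sketched with std::nth_element, which selects the boundary element and partitions around it (typically via introselect, O(n) on average; the median-of-medians variant the answer describes gives the same bound deterministically). This is my own illustration, not the answer's exact algorithm.

#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Select the top-k (word, count) pairs without fully sorting: O(n) selection,
// then an optional O(k log k) sort if a ranked order is wanted.
std::vector<std::pair<std::string, int>>
topKSelect(const std::unordered_map<std::string, int>& freq, size_t k, bool ranked) {
    std::vector<std::pair<std::string, int>> v(freq.begin(), freq.end());
    if (k > v.size()) k = v.size();
    auto byCountDesc = [](const std::pair<std::string, int>& a,
                          const std::pair<std::string, int>& b) { return a.second > b.second; };
    // After this call, the k most frequent entries occupy v[0..k).
    std::nth_element(v.begin(), v.begin() + k, v.end(), byCountDesc);
    v.resize(k);
    if (ranked) std::sort(v.begin(), v.end(), byCountDesc);   // O(k log k)
    return v;
}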
If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.
Edit:
By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.
This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.
You can cut the time down further by partitioning using the first letter of the words, then partitioning the largest multi-word set using the next character, until you have k single-word sets. You would use a sort of 256-way tree with lists of partial/complete words at the leaves. You would need to be very careful not to cause string copies everywhere.
This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].
If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.
You have a bug in your description: Counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so probably should just optimize how the hash is built.
Your problem is the same as this one:
http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
Use a trie and a min-heap to solve it efficiently.
If what you're after is the list of the k most frequent words in your text, for any practical k and any natural language, then the complexity of your algorithm is not relevant.
Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.
As a side note, the complexity of the dummy algorithm (1. count all, 2. sort the counts, 3. take the best) is O(n + m*log(m)), where m is the number of different words in your text. log(m) is much smaller than n/m, so it remains O(n).
Practically, the long step is the counting.
Utilize a memory-efficient data structure to store the words.
Use a max-heap to find the top K frequent words.
Here is the code:
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import com.nadeem.app.dsa.adt.Trie;
import com.nadeem.app.dsa.adt.Trie.TrieEntry;
import com.nadeem.app.dsa.adt.impl.TrieImpl;
public class TopKFrequentItems {
private int maxSize;
private Trie trie = new TrieImpl();
private PriorityQueue<TrieEntry> maxHeap;
public TopKFrequentItems(int k) {
this.maxSize = k;
this.maxHeap = new PriorityQueue<TrieEntry>(k, maxHeapComparator());
}
private Comparator<TrieEntry> maxHeapComparator() {
return new Comparator<TrieEntry>() {
@Override
public int compare(TrieEntry o1, TrieEntry o2) {
return o1.frequency - o2.frequency;
}
};
}
public void add(String word) {
this.trie.insert(word);
}
public List<TopK> getItems() {
for (TrieEntry trieEntry : this.trie.getAll()) {
if (this.maxHeap.size() < this.maxSize) {
this.maxHeap.add(trieEntry);
} else if (this.maxHeap.peek().frequency < trieEntry.frequency) {
this.maxHeap.remove();
this.maxHeap.add(trieEntry);
}
}
List<TopK> result = new ArrayList<TopK>();
for (TrieEntry entry : this.maxHeap) {
result.add(new TopK(entry));
}
return result;
}
public static class TopK {
public String item;
public int frequency;
public TopK(String item, int frequency) {
this.item = item;
this.frequency = frequency;
}
public TopK(TrieEntry entry) {
this(entry.word, entry.frequency);
}
@Override
public String toString() {
return String.format("TopK [item=%s, frequency=%s]", item, frequency);
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + frequency;
result = prime * result + ((item == null) ? 0 : item.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
TopK other = (TopK) obj;
if (frequency != other.frequency)
return false;
if (item == null) {
if (other.item != null)
return false;
} else if (!item.equals(other.item))
return false;
return true;
}
}
}
Here are the unit tests:
@Test
public void test() {
TopKFrequentItems stream = new TopKFrequentItems(2);
stream.add("hell");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("hero");
stream.add("hero");
stream.add("hero");
stream.add("hello");
stream.add("hello");
stream.add("hello");
stream.add("home");
stream.add("go");
stream.add("go");
assertThat(stream.getItems()).hasSize(2).contains(new TopK("hero", 3), new TopK("hello", 8));
}
For more details, refer to this test case.
Use a hash table to record every word's frequency while traversing the whole word sequence. In this phase the key is "word" and the value is "word-frequency". This takes O(n) time, the same as everyone explained above.
While inserting into the hashmap, keep a TreeSet (specific to Java; there are implementations in every language) of size 10 (k = 10) holding the top 10 frequent words. While its size is less than 10, keep adding to it. Once the size equals 10, if the inserted element is greater than the minimum element (i.e. the first element), remove the minimum and insert the new element.
To restrict the size of the TreeSet, see this link.
Suppose we have a word sequence "ad" "ad" "boy" "big" "bad" "com" "come" "cold". And K=2.
as you mentioned "partitioning using the first letter of words", we got
("ad", "ad") ("boy", "big", "bad") ("com" "come" "cold")
"then partitioning the largest multi-word set using the next character until you have k single-word sets."
it will partition ("boy", "big", "bad") and ("com", "come", "cold"), but the first partition ("ad", "ad") is missed, even though "ad" is actually the most frequent word.
Perhaps I misunderstand your point. Can you please explain your partitioning process in more detail?
I believe this problem can be solved by an O(n) algorithm. We could make the sorting on the fly. In other words, the sorting in that case is a sub-problem of the traditional sorting problem since only one counter gets incremented by one every time we access the hash table. Initially, the list is sorted since all counters are zero. As we keep incrementing counters in the hash table, we bookkeep another array of hash values ordered by frequency as follows. Every time we increment a counter, we check its index in the ranked array and check if its count exceeds its predecessor in the list. If so, we swap these two elements. As such we obtain a solution that is at most O(n) where n is the number of words in the original text.
I was struggling with this as well and got inspired by @aly. Instead of sorting afterwards, we can just maintain a presorted list of words (List<Set<String>>), where a word sits in the set at position X, X being the current count of the word. In general, here's how it works:
For each word, store it as part of a map of its occurrences: Map<String, Integer>.
Then, based on the count, remove it from the previous count's set and add it to the new count's set.
The drawback of this is that the list may be big; it can be optimized by using a TreeMap<Integer, Set<String>>, but this adds some overhead. Ultimately we can use a mix of a HashMap or our own data structure.
The code
public class WordFrequencyCounter {
private static final int WORD_SEPARATOR_MAX = 32; // UNICODE 0000-001F: control chars
Map<String, MutableCounter> counters = new HashMap<String, MutableCounter>();
List<Set<String>> reverseCounters = new ArrayList<Set<String>>();
private static class MutableCounter {
int i = 1;
}
public List<String> countMostFrequentWords(String text, int max) {
int lastPosition = 0;
int length = text.length();
for (int i = 0; i < length; i++) {
char c = text.charAt(i);
if (c <= WORD_SEPARATOR_MAX) {
if (i != lastPosition) {
String word = text.substring(lastPosition, i);
MutableCounter counter = counters.get(word);
if (counter == null) {
counter = new MutableCounter();
counters.put(word, counter);
} else {
Set<String> strings = reverseCounters.get(counter.i);
strings.remove(word);
counter.i ++;
}
addToReverseLookup(counter.i, word);
}
lastPosition = i + 1;
}
}
List<String> ret = new ArrayList<String>();
int count = 0;
for (int i = reverseCounters.size() - 1; i >= 0; i--) {
Set<String> strings = reverseCounters.get(i);
for (String s : strings) {
ret.add(s);
System.out.print(s + ":" + i);
count++;
if (count == max) break;
}
if (count == max) break;
}
return ret;
}
private void addToReverseLookup(int count, String word) {
while (count >= reverseCounters.size()) {
reverseCounters.add(new HashSet<String>());
}
Set<String> strings = reverseCounters.get(count);
strings.add(word);
}
}
I just found another solution for this problem, but I am not sure it is right.
Solution:
Use a hash table to record every word's frequency: T(n) = O(n)
Choose the first k elements of the hash table and store them in a buffer (of size k): T(n) = O(k)
Each time, we first find the current minimum element of the buffer and compare it with the remaining (n - k) elements of the hash table one by one. If an element of the hash table is greater than this minimum, drop the buffer's current minimum and add that element. Each time, finding the minimum in the buffer needs T(n) = O(k), and traversing the whole hash table needs T(n) = O(n - k). So the whole time complexity of this process is T(n) = O((n - k) * k).
After traversing the whole hash table, the result is in this buffer.
The whole time complexity: T(n) = O(n) + O(k) + O(kn - k^2) = O(kn + n - k^2 + k). Since k is generally much smaller than n, the time complexity of this solution is T(n) = O(kn). That is linear time when k is really small. Is it right? I am really not sure.
Try to think of special data structures to approach this kind of problem. In this case, a special kind of tree such as a trie can store strings in a specific way, very efficiently. Alternatively, build your own solution based on counting words. I guess these TB of data would be in English, and there are only around 600,000 words in general use, so it would be possible to store only those words and count how often each is repeated; this solution would also need a regex to eliminate some special characters. The first solution will be faster, I'm pretty sure.
http://en.wikipedia.org/wiki/Trie
This is an interesting idea to search and I could find this paper related to Top-K https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
Also there is an implementation of it here.
Simplest code to get the occurrences of the most frequently used word:
function strOccurence(str) {
    var arr = str.split(" ");
    var length = arr.length, temp = {};
    while (length--) {
        if (temp[arr[length]] == undefined && arr[length].trim().length > 0) {
            temp[arr[length]] = 1;
        } else if (arr[length].trim().length > 0) {
            temp[arr[length]] = temp[arr[length]] + 1;
        }
    }
    console.log(temp);
    var max = [];
    for (var i in temp) {
        max[temp[i]] = i;
    }
    console.log(max[max.length - 1]);
    // if you want the second highest (note that max may be sparse)
    console.log(max[max.length - 2]);
}
In these situations, I recommend using Java's built-in features, since they are already well tested and stable. In this problem, I find the repetitions of the words using a HashMap. Then I push the results into an array of objects, sort the array with Arrays.sort(), and print the top k words and their repetitions.
import java.io.*;
import java.lang.reflect.Array;
import java.util.*;
public class TopKWordsTextFile {
static class SortObject implements Comparable<SortObject>{
private String key;
private int value;
public SortObject(String key, int value) {
super();
this.key = key;
this.value = value;
}
@Override
public int compareTo(SortObject o) {
//descending order
return o.value - this.value;
}
}
public static void main(String[] args) {
HashMap<String,Integer> hm = new HashMap<>();
int k = 1;
try {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("words.in")));
String line;
while ((line = br.readLine()) != null) {
// process the line.
//System.out.println(line);
String[] tokens = line.split(" ");
for(int i=0; i<tokens.length; i++){
if(hm.containsKey(tokens[i])){
//If the key already exists
Integer prev = hm.get(tokens[i]);
hm.put(tokens[i],prev+1);
}else{
//If the key doesn't exist
hm.put(tokens[i],1);
}
}
}
//Close the input
br.close();
//Print all words with their repetitions. You can use 3 for printing top 3 words.
k = hm.size();
// Get a set of the entries
Set set = hm.entrySet();
// Get an iterator
Iterator i = set.iterator();
int index = 0;
// Display elements
SortObject[] objects = new SortObject[hm.size()];
while(i.hasNext()) {
Map.Entry e = (Map.Entry)i.next();
//System.out.print("Key: "+e.getKey() + ": ");
//System.out.println(" Value: "+e.getValue());
String tempS = (String) e.getKey();
int tempI = (int) e.getValue();
objects[index] = new SortObject(tempS,tempI);
index++;
}
System.out.println();
//Sort the array
Arrays.sort(objects);
//Print top k
for(int j=0; j<k; j++){
System.out.println(objects[j].key+":"+objects[j].value);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
For more information, please visit https://github.com/m-vahidalizadeh/foundations/blob/master/src/algorithms/TopKWordsTextFile.java. I hope it helps.
C++11 implementation of the above idea:
class Solution {
public:
    vector<int> topKFrequent(vector<int>& nums, int k) {
        unordered_map<int, int> map;
        for (int num : nums) {
            map[num]++;
        }
        vector<int> res;
        // we use a priority queue (max-heap); we will keep the (size - k) smallest elements in the queue
        // pair<first, second>: first is frequency, second is number
        priority_queue<pair<int, int>> pq;
        for (auto it = map.begin(); it != map.end(); it++) {
            pq.push(make_pair(it->second, it->first));
            // once the size is bigger than (size - k), we pop the value, which is one of the top k frequent elements
            if ((int)pq.size() > (int)map.size() - k) {
                res.push_back(pq.top().second);
                pq.pop();
            }
        }
        return res;
    }
};

Implementation of suffix array in C++

#include <iostream>
#include <string>
#include <utility>
#include <algorithm>
using namespace std;

struct xx
{
    string x;
    short int d;
    int lcp;
};

bool compare(const xx a, const xx b)
{
    return a.x < b.x;
}

int findlcp(string a, string b)
{
    int i = 0, j = 0, k = 0;
    while (i < a.length() && j < b.length())
    {
        if (a[i] == b[j])
        {
            k++;
            i++;
            j++;
        }
        else
        {
            break;
        }
    }
    return k;
}

int main()
{
    string a = "banana";
    xx b[100];
    a = a + '$';
    int len = a.length();
    for (int i = 0; i < len; i++)
    {
        b[i].x = a.substr(i);
        b[i].d = i + 1;
    }
    sort(b, b + len, compare);
    for (int i = 0; i < len; i++)
        cout << b[i].x << " " << b[i].d << endl;
    b[0].lcp = 0;
    b[1].lcp = 0;
    for (int i = 2; i < len; i++)
    {
        b[i].lcp = findlcp(b[i].x, b[i - 1].x);
    }
    for (int i = 0; i < len; i++)
        cout << b[i].d << " " << b[i].lcp << endl;
}
This is an implementation of a suffix array. My question is that, in the Wikipedia article, construction is given as O(n) in the worst case.
So, about my construction:
1) I am sorting all the suffixes of the string using std::sort. This is at least O(n log n) in the worst case, so here I am violating the O(n) construction.
2) The longest common prefix (LCP) array construction is given as O(n), but I think my implementation is O(n^2).
So, for the first one, i.e. the sorting:
If I use counting sort I may decrease it to O(n). Is using counting sort correct here? Is my understanding correct? Let me know if my understanding is wrong.
And is there any way to find the LCP in O(n) time?
First, regarding your two statements:
1) I am sorting all the suffixes of the string using std::sort. This is at least O(n log n) in the worst case, so here I am violating the O(n) construction.
The complexity of std::sort here is worse than O(n log n). The reason is that O(n log n) assumes there are O(n log n) individual comparisons, and that each comparison is performed in O(1) time. The latter assumption is wrong, because you are sorting strings, not atomic items (like characters or integers).
Since the length of the string items, being suffixes of the main string, is O(n), it would be safe to say that the worst-case complexity of your sorting step is O(n^2 log n).
2) The longest common prefix (LCP) array construction is given as O(n), but I think my implementation is O(n^2).
Yes, your construction of the LCP array is O(n^2), because you run your lcp function n == len times, and your lcp function requires O(min(len(x), len(y))) time for a pair of strings x, y.
Next, regarding your questions:
If I use counting sort I may decrease it to O(n). Is using counting sort correct here? Is my understanding correct? Let me know if my understanding is wrong.
Unfortunately, your understanding is incorrect. Counting sort is only linear if you can, in O(1) time, get access to an atomic key for each item you want to sort. Again, the items are strings O(n) characters in length, so this won't work.
And is there any way to find LCP in O(n) time?
Yes. Recent algorithms for suffix array computation, including the DC algorithm (aka Skew algorithm), provide for methods to calculate the LCP array along with the suffix array, and do so in O(n) time.
The reference for the DC algorithm is: Juha Kärkkäinen, Peter Sanders, "Simple linear work suffix array construction", Automata, Languages and Programming, Lecture Notes in Computer Science, Volume 2719, 2003, pp. 943-955 (DOI 10.1007/3-540-45061-0_73). (But this is not the only algorithm that allows you to do this in linear time.)
You may also want to take a look at the open-source implementations mentioned in this SO post: What's the current state-of-the-art suffix array construction algorithm?. Many of the algorithms used there enable linear-time LCP array construction in addition to the suffix-array construction (but not all of the implementations there may actually include an implementation of that; I am not sure).
If you are ok with examples in Java, you may also want to look at the code for jSuffixArrays. It includes, among other algorithms, an implementation of the DC algorithm along with LCP array construction in linear time.
jogojapan has comprehensively answered your question. Just to mention an optimized C++ implementation, you might want to take a look here.
Posting the code here in case GitHub goes down.
const int N = 1000 * 100 + 5; // max string length
namespace Suffix {
    int sa[N], rank[N], lcp[N], gap, S;
    string s; // kept so that Lcp_build can see the input string
    bool cmp(int x, int y) {
        if (rank[x] != rank[y])
            return rank[x] < rank[y];
        x += gap, y += gap;
        return (x < S && y < S) ? rank[x] < rank[y] : x > y;
    }
    void Sa_build(const string &str) {
        s = str;
        S = s.size();
        int tmp[N] = {0};
        for (int i = 0; i < S; ++i)
            rank[i] = s[i],
            sa[i] = i;
        for (gap = 1;; gap <<= 1) {
            sort(sa, sa + S, cmp);
            for (int i = 1; i < S; ++i)
                tmp[i] = tmp[i - 1] + cmp(sa[i - 1], sa[i]);
            for (int i = 0; i < S; ++i)
                rank[sa[i]] = tmp[i];
            if (tmp[S - 1] == S - 1)
                break;
        }
    }
    void Lcp_build() {
        for (int i = 0, k = 0; i < S; ++i, --k)
            if (rank[i] != S - 1) {
                k = max(k, 0);
                while (s[i + k] == s[sa[rank[i] + 1] + k])
                    ++k;
                lcp[rank[i]] = k;
            }
            else
                k = 0;
    }
}

What is the best hash function for Rabin-Karp algorithm?

I'm looking for an efficient hash function for Rabin-Karp algorithm. Here is my actual code (C programming language).
static bool f2(char const *const s1, size_t const n1,
               char const *const s2, size_t const n2)
{
    uintmax_t hsub = hash(s2, n2);
    uintmax_t hs = hash(s1, n1);
    size_t nmax = n2 - n1;
    for (size_t i = 0; i < nmax; ++i) {
        if (hs == hsub) {
            if (strncmp(&s1[i], s2, i + n2 - 1) == 0)
                return true;
        }
        hs = hash(&s1[i + 1], i + n2);
    }
    return false;
}
I considered some Rabin-Karp C implementations, but there are differences between all the codes. So my question is: what are the characteristics that a Rabin-Karp hash function should have?
An extremely well-performing hash is the Bernstein hash. It even outruns many popular hashing algorithms.
unsigned bernstein_hash(void *key, int len)
{
    unsigned char *p = key;
    unsigned h = 0;
    int i;
    for (i = 0; i < len; i++)
        h = 33 * h + p[i];
    return h;
}
Of course, you can try out other hashing algorithms, as described here:
Hash function on NIST
Note: it has never been explained why 33 performs so much better than other, seemingly more logical, constants.
For your interest: Here is a good comparison of different hash algorithms:
strchr comparison of hash algorithms
what are the characteristics that a Rabin-Karp hash function should have?
Rabin-Karp needs a rolling hash. The easiest rolling hash is a moving sum. Adler-32 and Buzhash are pretty simple too and perform better than a moving sum.
Any of these rolling hash techniques should work for Rabin-Karp:
Moving sum
    remove the oldest byte with subtraction
    add a new byte with addition
Polynomial rolling hash
    remove the oldest byte with subtraction
    add a new byte with multiplication and addition
Rabin fingerprint
    a polynomial rolling hash whose polynomial is irreducible over GF(2)
Tabulation hashing
    remove the oldest byte with a table lookup and an xor
    add a new byte with a table lookup and an xor
Cyclic polynomial, aka Buzhash
    tabulation hashing based on circular shifts
Adler-32 checksum
    not a rolling checksum by default but easily adjusted to "roll"
    remove the oldest byte with two subtractions
    add a new byte with two additions
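As a tiny illustration of the simplest option above, the moving sum (my own sketch, not from the answer): the window hash is just the sum of its bytes, so rolling is one subtraction and one addition.

#include <cstddef>
#include <cstdint>

// Moving-sum rolling hash over a fixed-size window: O(window) to initialize,
// O(1) to slide by one byte. Weak as a hash, but it shows the rolling mechanics.
struct MovingSum {
    uint64_t h = 0;
    void init(const unsigned char* p, size_t window) {
        h = 0;
        for (size_t i = 0; i < window; ++i) h += p[i];
    }
    void roll(unsigned char outgoing, unsigned char incoming) {
        h -= outgoing;   // drop the byte leaving the window
        h += incoming;   // add the byte entering the window
    }
};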
For problems with small alphabets, such as nucleic acid sequence searching (e.g. alphabet = {A, T, C, G, U}), ntHash may be a good hash function.
It uses binary operations, which are fast, and a rolling hash update, and it also gives uniformly distributed hash values.
Considering that the implementors of Java's JDK would have given this some thought, I looked up what function is used there.
As of ~Java 19: https://github.com/openjdk/jdk/blob/jdk-19+23/src/java.base/share/classes/java/lang/String.java#L2326
The update function is:
h' = 31 * h + c
with an initial value of 0.