Implementing Longest Common Substring using Suffix Array - c++

I am using this program for computing the suffix array and the Longest Common Prefix.
I am required to calculate the longest common substring between two strings.
For that, I concatenate strings, A#B and then use this algorithm.
I have Suffix Array sa[] and the LCP[] array.
The length of the longest common substring is then the max value of the LCP[] array.
In order to find the substring, the only condition is that among substrings of common lengths, the one occurring the first time in string B should be the answer.
For that, I maintain max of the LCP[]. If LCP[curr_index] == max, then I make sure that the left_index of the substring B is smaller than the previous value of left_index.
However, this approach is not giving a right answer. Where is the fault?
max=-1;
for(int i=1;i<strlen(S)-1;++i)
{
    //checking whether sa[i] lies in A and sa[i+1] lies in B
    if(lcp[i] >= max && sa[i] < l1 && sa[i+1] >= l1+1 )
    {
        if( max == lcp[i] && sa[i+1] < left_index ) left_index=sa[i+1];
        else if (lcp[i] > max )
        {
            left_index=sa[i+1];
            max=lcp[i];
        }
    }
    //checking whether sa[i] lies in B and sa[i+1] lies in A
    else if (lcp[i] >= max && sa[i] >= l1+1 && sa[i+1] < l1 )
    {
        if( max == lcp[i] && sa[i] < left_index) left_index=sa[i];
        else if (lcp[i] > max)
        {
            left_index=sa[i];
            max=lcp[i];
        }
    }
}
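For checking a suffix-array solution on small inputs, a quadratic dynamic-programming reference can help. This is only a minimal sketch and not the contest's intended solution; the name referenceLcs is made up for illustration. It returns the longest common substring and, among equally long candidates, the one that starts earliest in B.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: O(|a|*|b|) reference answer, meant for small test inputs only.
std::string referenceLcs(const std::string& a, const std::string& b) {
    // prev[i] / cur[i]: length of the longest common suffix of a[0..i) and b[0..j)
    std::vector<std::size_t> prev(a.size() + 1, 0), cur(a.size() + 1, 0);
    std::size_t bestLen = 0, bestStart = 0;
    for (std::size_t j = 1; j <= b.size(); ++j) {
        for (std::size_t i = 1; i <= a.size(); ++i) {
            cur[i] = (a[i - 1] == b[j - 1]) ? prev[i - 1] + 1 : 0;
            // j only grows, so the first match of a given length is also
            // the one that starts earliest in b; a strict '>' keeps it.
            if (cur[i] > bestLen) {
                bestLen = cur[i];
                bestStart = j - cur[i];
            }
        }
        std::swap(prev, cur);
    }
    return b.substr(bestStart, bestLen);
}
```

Comparing its output with the suffix-array version on random short strings is a quick way to find a failing case.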

AFAIK, this problem is from a programming contest, and discussing problems from an ongoing contest before the editorials have been released shouldn't be .... Still, I will give you some insight, as I got Wrong Answer with a suffix array. I then used a suffix automaton, which got Accepted.
A suffix array built the straightforward way takes O(n log^2 n), whereas a suffix automaton takes O(n). So my advice is to go with a suffix automaton, and you will surely get Accepted.
And if you can code a solution for that problem, you will surely be able to code this.
I also found this on the CodeChef forum:
Try this case
babaazzzzyy
badyybac
The suffix array will contain baa... (from the first string), baba... (from the first string), bac (from the second string), and bad... (from the second string).
So if you only examine consecutive entries of the SA, you will find a match at "baba" and "bac" and take the index of "ba" as 7 in the second string, even though it also occurs at index 1.
It's likely that you will output "yy" instead of "ba".
Also, handling the constraint ...the first longest common substring to be found on the second string, should be written to output... would be very easy with a suffix automaton. Best of luck!
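If you want to stay with the suffix array, one way around the adjacency problem in the test case above is to compute, for every suffix of B, its longest match against any suffix of A (the nearest A-suffix above and below it in the SA, taking the minimum LCP over the gap), and then scan B from left to right. The sketch below is my own illustration under some assumptions: the name lcsEarliestInB is made up, '#' must not occur in either string, and the suffix array is built naively by sorting, which is fine for testing but too slow for contest limits.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper; assumes '#' occurs in neither a nor b.
std::string lcsEarliestInB(const std::string& a, const std::string& b) {
    const std::string s = a + '#' + b;
    const int n = static_cast<int>(s.size());
    const int l1 = static_cast<int>(a.size());

    // Naive suffix array: sort start positions by comparing the suffixes directly.
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int x, int y) {
        return s.compare(x, std::string::npos, s, y, std::string::npos) < 0;
    });

    // Kasai: lcp[i] is the common prefix length of suffixes sa[i-1] and sa[i].
    std::vector<int> rankOf(n), lcp(n, 0);
    for (int i = 0; i < n; ++i) rankOf[sa[i]] = i;
    for (int i = 0, h = 0; i < n; ++i) {
        if (rankOf[i] == 0) { h = 0; continue; }
        const int j = sa[rankOf[i] - 1];
        while (i + h < n && j + h < n && s[i + h] == s[j + h]) ++h;
        lcp[rankOf[i]] = h;
        if (h > 0) --h;
    }

    // best[i]: longest prefix that suffix sa[i] shares with ANY suffix starting in a.
    // carry holds the minimum LCP since the nearest A-suffix (0 if none seen yet).
    std::vector<int> best(n, 0);
    int carry = 0;
    for (int i = 0; i < n; ++i) {            // nearest A-suffix above
        if (i > 0) carry = std::min(carry, lcp[i]);
        if (sa[i] < l1) carry = n; else best[i] = std::max(best[i], carry);
    }
    carry = 0;
    for (int i = n - 1; i >= 0; --i) {       // nearest A-suffix below
        if (i + 1 < n) carry = std::min(carry, lcp[i + 1]);
        if (sa[i] < l1) carry = n; else best[i] = std::max(best[i], carry);
    }

    // Scan B left to right; a strict '>' keeps the earliest start among ties.
    int bestLen = 0, bestPos = 0;
    for (int pos = l1 + 1; pos < n; ++pos) {
        if (best[rankOf[pos]] > bestLen) {
            bestLen = best[rankOf[pos]];
            bestPos = pos - l1 - 1;
        }
    }
    return b.substr(bestPos, bestLen);
}
```

On the CodeChef test case this picks the "ba" at the start of the second string, because the suffix bad... still inherits a match of length 2 from baba... through bac.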

Related

How to convert one string to another by successive substitutions of characters?

I'm currently trying to design an algorithm that does the following:
I got two strings A and B which consist of lowercase characters 'a'-'z'
and I can modify string A using the following operations:
1. Select two characters 'c1' and 'c2' from the character set ['a'-'z'].
2. Replace all characters 'c1' in string A with character 'c2'.
I need to find the minimum number of operations needed to convert string A to string B when possible.
I have 2 ideas that didn't work:
1. A simple range-based for loop that changes string B and compares it with A.
2. The same idea using map<char, int>.
Right now I'm stuck on unit tests for this situation: 'ab' can be transformed into 'ba' in 3 operations and 'abc' into 'bca' in 4 operations.
My algorithm is wrong and I need some fresh ideas or a working solution.
Can anyone help with this?
Here is a minimal reproducible example:
int Transform(string& A, string& B)
{
    int count = 0;
    if(A.size() != B.size()){
        return -1;
    }
    for(int i = A.size() - 1; i >= 0; i--){
        if(A[i]!=B[i]){
            char rep_elem = A[i];
            ++count;
            replace(A.begin(),A.end(),rep_elem,B[i]);
        }
    }
    if(A != B){
        return -1;
    }
    return count;
}
How can I improve this, or should I look for other ideas?
First of all, don't worry about string operations. Your problem is algorithmic, not textual. You should somehow analyze your data, and only afterwards print your solution.
Start by building a data structure that tells, for each letter, which letter it should be replaced with. Use an array (or std::map<char, char>; conceptually it is the same thing with different syntax).
If you discover that a letter must be converted to two different letters, that is an error and the conversion is impossible. Otherwise, count the number of non-trivial cycles in the conversion graph.
The length of your solution will be the number of letters that shouldn't be replaced by themselves, plus the number of cycles. For example, 'ab' to 'ba' has two such letters and one cycle (a -> b -> a), which gives 3.
I think the code to implement this would be too long to be helpful.
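For what it's worth, the counting itself fits in a short function. The sketch below follows the recipe above, with two refinements that are my own assumptions rather than part of the answer: a cycle only costs an extra step when no letter outside the cycle feeds into it (otherwise that letter can serve as the temporary), and the conversion is treated as impossible when B uses all 26 letters and differs from A, since no spare letter is left. The name minReplaceOps is made up, and the input is assumed to contain only 'a'-'z'.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper; assumes both strings contain only 'a'..'z'.
int minReplaceOps(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return -1;
    if (a == b) return 0;

    std::array<int, 26> to;        // to[c]: letter that c must become, -1 if c is absent from a
    to.fill(-1);
    std::array<bool, 26> inB{};
    for (std::size_t i = 0; i < a.size(); ++i) {
        const int x = a[i] - 'a', y = b[i] - 'a';
        if (to[x] != -1 && to[x] != y) return -1;   // one letter, two different targets
        to[x] = y;
        inB[y] = true;
    }
    int distinctInB = 0;
    for (bool used : inB) distinctInB += used ? 1 : 0;
    if (distinctInB == 26) return -1;               // assumption: no spare letter to break cycles

    int ops = 0;
    std::array<int, 26> indeg{};                    // incoming non-trivial edges per letter
    for (int c = 0; c < 26; ++c)
        if (to[c] != -1 && to[c] != c) { ++ops; ++indeg[to[c]]; }

    // Walk the mapping to find cycles; state: 0 = unvisited, 1 = on current walk, 2 = done.
    std::array<int, 26> state{};
    for (int s = 0; s < 26; ++s) {
        std::vector<int> path;
        int v = s;
        while (to[v] != -1 && to[v] != v && state[v] == 0) {
            state[v] = 1;
            path.push_back(v);
            v = to[v];
        }
        if (state[v] == 1) {                        // walked back into the current path: a cycle
            bool pure = true;
            int u = v;
            do { pure = pure && indeg[u] == 1; u = to[u]; } while (u != v);
            if (pure) ++ops;                        // needs one extra step through a spare letter
        }
        for (int p : path) state[p] = 2;
    }
    return ops;
}
```

With this, 'ab' to 'ba' gives 3 and 'abc' to 'bca' gives 4, matching the unit tests in the question.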

Checking if two patterns match one another?

This Leetcode problem is about how to match a pattern string against a text string as efficiently as possible. The pattern string can consist of letters, dots, and stars, where a letter only matches itself, a dot matches any individual character, and a star matches any number of copies of the preceding character. For example, the pattern
ab*c.
would match ace and abbbbcc. I know that it's possible to solve this original problem using dynamic programming.
My question is whether it's possible to see whether two patterns match one another. For example, the pattern
bdbaa.*
can match
bdb.*daa
Is there a nice algorithm for solving this pattern-on-pattern matching problem?
Here's one approach that works in polynomial time. It's slightly heavyweight and there may be a more efficient solution, though.
The first observation that I think helps here is to reframe the problem. Rather than asking whether these patterns match each other, let's ask this equivalent question:
Given patterns P1 and P2, is there a string w where P1 and P2 each match w?
In other words, rather than trying to get the two patterns to match one another, we'll search for a string that each pattern matches.
You may have noticed that the sorts of patterns you're allowed to work with are a subset of the regular expressions. This is helpful, since there's a pretty elaborate theory of what you can do with regular expressions and their properties. So rather than taking aim at your original problem, let's solve this even more general one:
Given two regular expressions R1 and R2, is there a string w that both R1 and R2 match?
The reason for solving this more general problem is that it enables us to use the theory that's been developed around regular expressions. For example, in formal language theory we can talk about the language of a regular expression, which is the set of all strings that the regex matches. We can denote this L(R). If there's a string that's matched by two regexes R1 and R2, then that string belongs to both L(R1) and L(R2), so our question is equivalent to
Given two regexes R1 and R2, is there a string w in L(R1) ∩ L(R2)?
So far all we've done is reframe the problem we want to solve. Now let's go solve it.
The key step here is that it's possible to convert any regular expression into an NFA (a nondeterministic finite automaton) so that every string matched by the regex is accepted by the NFA and vice-versa. Even better, the resulting NFA can be constructed in polynomial time. So let's begin by constructing NFAs for each input regex.
Now that we have those NFAs, we want to answer this question: is there a string that both NFAs accept? And fortunately, there's a quick way to answer this. There's a common construction on NFAs called the product construction that, given two NFAs N1 and N2, constructs a new NFA N' that accepts all the strings accepted by both N1 and N2 and no other strings. Again, this construction runs in polynomial time.
Once we have N', we're basically done! All we have to do is run a breadth-first or depth-first search through the states of N' to see if we find an accepting state. If so, great! That means there's a string accepted by N', which means that there's a string accepted by both N1 and N2, which means that there's a string matched by both R1 and R2, so the answer to the original question is "yes!" And conversely, if we can't reach an accepting state, then the answer is "no, it's not possible."
I'm certain that there's a way to do all of this implicitly by doing some sort of implicit BFS over the automaton N' without actually constructing it, and it should be possible to do this in something like time O(n^2). If I have some more time, I'll revisit this answer and expand on how to do that.
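To make the implicit search concrete for the restricted patterns in the question (letters, dots, and stars), here is a minimal sketch. It is an illustration under assumptions, not a general regex engine: the names Token, tokenize, and patternsIntersect are made up, and the patterns are assumed to be well formed (no leading * and no two adjacent stars). A state (i, j) means "the first i tokens of P1 and the first j tokens of P2 have been processed", so there are O(n^2) states for patterns of length n.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Token { char ch; bool starred; };   // one pattern character plus an optional '*'

std::vector<Token> tokenize(const std::string& p) {
    std::vector<Token> out;
    for (std::size_t i = 0; i < p.size(); ++i) {
        const bool starred = i + 1 < p.size() && p[i + 1] == '*';
        out.push_back({p[i], starred});
        if (starred) ++i;                  // skip the '*'
    }
    return out;
}

// True if some string is matched by both patterns (BFS over the product states).
bool patternsIntersect(const std::string& p1, const std::string& p2) {
    const std::vector<Token> a = tokenize(p1), b = tokenize(p2);
    const std::size_t n = a.size(), m = b.size();
    std::vector<std::vector<bool>> seen(n + 1, std::vector<bool>(m + 1, false));
    std::queue<std::pair<std::size_t, std::size_t>> q;
    auto visit = [&](std::size_t i, std::size_t j) {
        if (!seen[i][j]) { seen[i][j] = true; q.push({i, j}); }
    };
    visit(0, 0);
    while (!q.empty()) {
        const std::size_t i = q.front().first, j = q.front().second;
        q.pop();
        if (i == n && j == m) return true;             // both patterns fully consumed
        if (i < n && a[i].starred) visit(i + 1, j);    // starred token matches zero characters
        if (j < m && b[j].starred) visit(i, j + 1);
        if (i < n && j < m) {
            const bool compatible = a[i].ch == '.' || b[j].ch == '.' || a[i].ch == b[j].ch;
            // Both patterns consume one common character; a starred token stays in place.
            if (compatible) visit(a[i].starred ? i : i + 1, b[j].starred ? j : j + 1);
        }
    }
    return false;
}
```

Calling patternsIntersect("bdbaa.*", "bdb.*daa") returns true; one string both patterns accept is bdbaadaa.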
I have worked on my DP idea and came up with the implementation below. Please feel free to edit the code if you find a failing test case. I tried a few test cases and it passed all of them; I list them below as well.
Please note that I have extended the idea which is used to solve the regex pattern matching with a string using DP. To refer to that idea, please refer to the LeetCode link provided in the OP and look out for discussion part. They have given the explanation for regex matching and the string.
The idea is to create a dynamic memoization table, entries of which will follow the below rules:
If pattern1[i] == pattern2[j], dp[i][j] = dp[i-1][j-1]
If pattern1[i] == '.' or pattern2[j] == '.', then dp[i][j] = dp[i-1][j-1]
The trick lies here: If pattern1[i] == '*', then if dp[i-2][j] exists, then
dp[i][j] = dp[i-2][j] || dp[i][j-1] else dp[i][j] = dp[i][j-1].
If pattern2[j] == '*', then if pattern1[i] == pattern2[j-1], then
dp[i][j] = dp[i][j-2] || dp[i-1][j]
else dp[i][j] = dp[i][j-2]
pattern1 goes row-wise and pattern2 goes column-wise. Also, please note that this code should also work for normal regex pattern matching with any given string. I have verified it by running it on LeetCode and it passed all the available test cases there!
Below is the complete working implementation of the above logic:
boolean matchRegex(String pattern1, String pattern2){
    boolean dp[][] = new boolean[pattern1.length()+1][pattern2.length()+1];
    dp[0][0] = true;
    //fill up for the starting row
    for(int j=1;j<=pattern2.length();j++){
        if(pattern2.charAt(j-1) == '*')
            dp[0][j] = dp[0][j-2];
    }
    //fill up for the starting column
    for(int j=1;j<=pattern1.length();j++){
        if(pattern1.charAt(j-1) == '*')
            dp[j][0] = dp[j-2][0];
    }
    //fill for rest table
    for(int i=1;i<=pattern1.length();i++){
        for(int j=1;j<=pattern2.length();j++){
            //if second character of pattern1 is *, it will be equal to
            //value in top row of current cell
            if(pattern1.charAt(i-1) == '*'){
                dp[i][j] = dp[i-2][j] || dp[i][j-1];
            }
            else if(pattern1.charAt(i-1)!='*' && pattern2.charAt(j-1)!='*'
                    && (pattern1.charAt(i-1) == pattern2.charAt(j-1)
                    || pattern1.charAt(i-1)=='.' || pattern2.charAt(j-1)=='.'))
                dp[i][j] = dp[i-1][j-1];
            else if(pattern2.charAt(j-1) == '*'){
                boolean temp = false;
                if(pattern2.charAt(j-2) == pattern1.charAt(i-1)
                        || pattern1.charAt(i-1)=='.'
                        || pattern1.charAt(i-1)=='*'
                        || pattern2.charAt(j-2)=='.')
                    temp = dp[i-1][j];
                dp[i][j] = dp[i][j-2] || temp;
            }
        }
    }
    //comment this portion if you don't want to see entire dp table
    for(int i=0;i<=pattern1.length();i++){
        for(int j=0;j<=pattern2.length();j++)
            System.out.print(dp[i][j]+" ");
        System.out.println("");
    }
    return dp[pattern1.length()][pattern2.length()];
}
Driver method:
System.out.println(e.matchRegex("bdbaa.*", "bdb.*daa"));
Input1: bdbaa.* and bdb.*daa
Output1: true
Input2: .*acd and .*bce
Output2: false
Input3: acd.* and .*bce
Output3: true
Time complexity: O(mn) where m and n are lengths of two regex patterns given. Same will be the space complexity.
You can use an approach tailored to this subset of a Thompson NFA style regex, implementing only . and *.
You can do that either with dynamic programming (here in Ruby):
def is_match(s, p)
  return true if s==p
  len_s, len_p=s.length, p.length
  dp=Array.new(len_s+1) { |row| [false] * (len_p+1) }
  dp[0][0]=true
  (2..len_p).each { |j| dp[0][j]=dp[0][j-2] && p[j-1]=='*' }
  (1..len_s).each do |i|
    (1..len_p).each do |j|
      if p[j-1]=='*'
        a=dp[i][j - 2]
        b=[s[i - 1], '.'].include?(p[j-2])
        c=dp[i - 1][j]
        dp[i][j]= a || (b && c)
      else
        a=dp[i - 1][j - 1]
        b=['.', s[i - 1]].include?(p[j - 1])
        dp[i][j]=a && b
      end
    end
  end
  dp[len_s][len_p]
end
# 139 ms on Leetcode
Or recursively:
def is_match(s,p,memo={["",""]=>true})
  if p=="" && s!="" then return false end
  if s=="" && p!="" then return p.scan(/.(.)/).uniq==[['*']] && p.length.even? end
  if memo[[s,p]]!=nil then return memo[[s,p]] end
  ch, exp, prev=s[-1],p[-1], p.length<2 ? 0 : p[-2]
  a=(exp=='*' && (
    ([ch,'.'].include?(prev) && is_match(s[0...-1], p, memo) ||
    is_match(s, p[0...-2], memo))))
  b=([ch,'.'].include?(exp) && is_match(s[0...-1], p[0...-1], memo))
  memo[[s,p]]=(a || b)
end
# 92 ms on Leetcode
In each case:
The operative starting point in the string and pattern is at the second character looking for * and matches one character back for as long as s matches the character in p prior to the *
The meta character . is being used as a fill in for the actual character. This allows any character in s to match . in p
You can solve this with backtracking too. It is not very efficient, because the match of the same substrings may be recalculated many times. That could be improved by introducing a lookup table in which all non-matching pairs of strings are saved, so that the calculation only happens when a pair cannot be found in the table. Still, it seems to work. The code is in JS, and the algorithm assumes the simple regexes are valid, which means they do not begin with * and contain no two adjacent *:
function canBeEmpty(s) {
    if (s.length % 2 == 1)
        return false;
    for (let i = 1; i < s.length; i += 2)
        if (s[i] != "*")
            return false;
    return true;
}
function match(a, b) {
    if (a.length == 0 || b.length == 0)
        return canBeEmpty(a) && canBeEmpty(b);
    let x = 0, y = 0;
    // process characters up to the next star
    while ((x + 1 == a.length || a[x + 1] != "*") &&
           (y + 1 == b.length || b[y + 1] != "*")) {
        if (a[x] != b[y] && a[x] != "." && b[y] != ".")
            return false;
        x++; y++;
        if (x == a.length || y == b.length)
            return canBeEmpty(a.substr(x)) && canBeEmpty(b.substr(y));
    }
    if (x + 1 < a.length && y + 1 < b.length && a[x + 1] == "*" && b[y + 1] == "*")
        // star coming in both strings
        return match(a.substr(x + 2), b.substr(y)) || // try skip in a
               match(a.substr(x), b.substr(y + 2)); // try skip in b
    else if (x + 1 < a.length && a[x + 1] == "*") // star coming in a, but not in b
        return match(a.substr(x + 2), b.substr(y)) || // try skip * in a
               ((a[x] == "." || b[y] == "." || a[x] == b[y]) && // if chars matching
                match(a.substr(x), b.substr(y + 1))); // try skip char in b
    else // star coming in b, but not in a
        return match(a.substr(x), b.substr(y + 2)) || // try skip * in b
               ((a[x] == "." || b[y] == "." || a[x] == b[y]) && // if chars matching
                match(a.substr(x + 1), b.substr(y))); // try skip char in a
}
For a little optimization you could normalize the strings first:
function normalize(s) {
    while (/([^*])\*\1([^*]|$)/.test(s) || /([^*])\*\1\*/.test(s)) {
        s = s.replace(/([^*])\*\1([^*]|$)/, "$1$1*$2"); // move stars right
        s = s.replace(/([^*])\*\1\*/, "$1*"); // reduce
    }
    return s;
}
// example: normalize("aa*aa*aa*bb*b*cc*cd*dd") => "aaaa*bb*ccc*ddd*"
There is a further reduction of the input possible: x*.* and .*x* can both be replaced by .*, so to get the maximal reduction you would have to try to move as many stars as possible next to .* (so moving some stars to the left can be better than moving all to the right).
IIUC, you are asking: "Can a regex pattern match another regex pattern?"
Yes, it can. Specifically, . matches "any character" which of course includes . and *. So if you have a string like this:
bdbaa.*
How could you match it? Well, you could match it like this:
bdbaa..
Or like this:
b.*
Or like:
.*ba*.*

Finding Lexicographically smallest arrangement of some string

The title might make this look like a very common question, but please bear with me.
Basically, let's say that for each index of the string you know which letters could be at that index, and you want to find the lexicographically smallest arrangement.
So for example:
Index | Options
-------|----------
1 | 'b'
2 | 'c', 'a'
3 | 'd', 'c'
4 | 'c', 'a'
So the output should be badc. And yes, by the way, characters cannot repeat, so a plain greedy algorithm does not work.
I think we could use some sort of breadth-first search by keeping a queue of partial strings and popping one whenever we find that it cannot be extended to another permutation. I doubt this is optimal though; there must be something in O(N). Any ideas?
And no, I don't think C is bad, but I would prefer code snippets in C/C++ :p
Thanks!
This can be solved by matching algorithm. You can use a network flow solution to solve this. This can be broken down into a bi-partite graph problem.
To be precise maximum weight assignment problem or maximum cost maximum matching would be a solution.
Below is the bipartite set of vertices -
LEVEL    Alphabets
1        a
2        b
3        c
4        d
         e
         .
         .
         .
         z
Now add an edge from the Level set to the Alphabet set if and only if that letter is an option for that level. So the edges here are: {1,b}, {2,a}, {2,c}, {3,c}, {3,d}, {4,a}, {4,c}
Now, to get the lexicographically least result you need to assign weight to the edges in this fashion -
Edge Wt. = 26^(N-Level) * ('z' - Alphabet)
So for example edge weight for edge {2,c} would be
26^(4-2) * (26-3) = 26^2*23
Now you can use a standard maximum cost maximum matching solution, which runs in polynomial time. This is the best approach I can think of right now. The naive solution is exponential (26^N), so I think you would be happy with a polynomial solution.
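A lighter variant of the same bipartite idea avoids the large edge weights. To be clear, this is a substitution on my part and not the weighted matching described above: fix the positions from left to right, and at each position take the smallest letter for which the remaining positions can still be matched to distinct unused letters. The feasibility test is a plain augmenting-path matching (Kuhn's algorithm). The function names are made up, each position's options are passed as a string of lowercase letters, and an empty result means no valid arrangement exists.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Tries to give position pos a letter, re-routing earlier positions if needed.
bool augment(int pos, const std::vector<std::string>& options, unsigned used,
             std::vector<int>& owner, std::vector<bool>& seen) {
    for (char c : options[pos]) {
        const int k = c - 'a';
        if (((used >> k) & 1u) || seen[k]) continue;   // letter already fixed or already tried
        seen[k] = true;
        if (owner[k] == -1 || augment(owner[k], options, used, owner, seen)) {
            owner[k] = pos;
            return true;
        }
    }
    return false;
}

// Can positions from..end all receive distinct letters outside the 'used' bitmask?
bool canComplete(const std::vector<std::string>& options, std::size_t from, unsigned used) {
    std::vector<int> owner(26, -1);                    // owner[k]: position holding letter k
    for (std::size_t pos = from; pos < options.size(); ++pos) {
        std::vector<bool> seen(26, false);
        if (!augment(static_cast<int>(pos), options, used, owner, seen)) return false;
    }
    return true;
}

// Hypothetical helper: lexicographically smallest arrangement, or "" if none exists.
std::string smallestArrangement(const std::vector<std::string>& options) {
    unsigned used = 0;
    std::string result;
    for (std::size_t pos = 0; pos < options.size(); ++pos) {
        std::string sorted = options[pos];
        std::sort(sorted.begin(), sorted.end());       // try the smallest letters first
        bool placed = false;
        for (char c : sorted) {
            const unsigned bit = 1u << (c - 'a');
            if ((used & bit) || !canComplete(options, pos + 1, used | bit)) continue;
            used |= bit;
            result += c;
            placed = true;
            break;
        }
        if (!placed) return "";
    }
    return result;
}
```

For the example in the question, smallestArrangement({"b", "ca", "dc", "ca"}) returns "badc". Since letters cannot repeat, there are at most 26 positions, so the repeated matching is cheap.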
The naive approach is to use backtracking and try every possible solution, however it won't be efficient enough(26!). Then you can improve this backtrack solution by using dynamic programming with the help of bitmask. A bitmask can help you store which characters you have used so far.
Write a recursive function that takes two inputs: the index that should be assigned a character, and a bitmask that indicates which characters we have used so far. Initially the bitmask contains 26 zeros, which means we haven't used any characters. After assigning a character to some index we change the bitmask accordingly. For example, if we use character a, we set the first bit of the bitmask to 1. This way you avoid re-solving a lot of overlapping sub-problems.
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>
using namespace std;
// named "choices" rather than "data" so it cannot clash with std::data in C++17
vector<vector<char> > choices;
map< pair<int,int>, string > dp;
// smallest string for positions index..end given the used-letter bitmask,
// or "" if no valid assignment exists
string func( int index, int bitmask ){
    pair<int,int> p = make_pair(index,bitmask);
    if ( dp.count( p ) )
        return dp[p];
    string min_str = "";
    for ( size_t i=0; i<choices[index].size(); ++i ){
        if ( (bitmask&(1<<(choices[index][i]-'a'))) == 0 ){
            string cur_str = "";
            cur_str += choices[index][i];
            if ( index+1 != (int)choices.size() ){
                int mask = bitmask;
                mask |= 1<<(choices[index][i]-'a');
                string sub = func(index+1, mask);
                if (sub == "")
                    continue;
                cur_str += sub;
            }
            if ( min_str == "" || cur_str < min_str ){
                min_str = cur_str;
            }
        }
    }
    dp[p] = min_str;
    return min_str;
}
int main()
{
    choices.resize(4);
    choices[0].push_back('b');
    choices[1].push_back('c');
    choices[1].push_back('a');
    choices[2].push_back('d');
    choices[2].push_back('c');
    choices[3].push_back('c');
    choices[3].push_back('a');
    cout << func(0,0) << endl;
}

Range checking for my remove function in c++

So I have a method in C++ that takes an array and removes a certain number of values from it. The method removes the range of values from the start value up to, but not including, the end value.
void dynamic_array::remove(int start, int end) {
The only problem I'm having is with the range checking. So I've set up a way to check to make sure the start and end values are not in the incorrect places however whenever I test the code, it appears that it doesn't catch the range exception. Here's the code that's supposed to check the exception:
if (not (0 <= ((start <= (end < size))))){
    throw exception(SUBSCRIPT_RANGE_EXCEPTION);
}
You cannot use the notation 1 < x < 2 in C++ (or most languages) with its mathematical meaning. You have to do each comparison separately, i.e. (1<x) && (x<2) (the brackets are not really necessary here).
If you are interested, the notation does compile, but it means something different from what you might think: first 1<x is evaluated, giving either true (1) or false (0), and then that 1 or 0 is compared with 2.
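To see why the exception in the question never fires, it helps to evaluate the nested comparison one step at a time. The sketch below does that and puts a conventional check next to it. The function names are made up for illustration, and the bound end <= size is an assumption based on the question's statement that the end value is not included in the removed range.

```cpp
#include <cassert>

// The asker's condition, evaluated step by step the way C++ parses it.
bool originalCheckThrows(int start, int end, int size) {
    const int inner = (end < size);        // 0 or 1
    const int middle = (start <= inner);   // start compared against 0 or 1, again 0 or 1
    return !(0 <= middle);                 // 0 <= (0 or 1) is always true, so this is always false
}

// A conventional check for the half-open range [start, end).
// Assumption: end may equal size because the element at end is not removed.
bool rangeIsValid(int start, int end, int size) {
    return 0 <= start && start <= end && end <= size;
}
```

Whatever arguments are passed, originalCheckThrows returns false, which matches the observation that the range exception is never caught.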
It should be written
if(!(0 <= start && start <= end && end < size)){
    throw exception(SUBSCRIPT_RANGE_EXCEPTION);
}
As far as I know, C++ does not interpret the chained comparison the way you wrote it.
C++ does not work this way. The result of a single logical comparison is a boolean value. For example, the first comparison:
end < size
If this comparison is true, the result is a true value, which for all practical purposes is 1. So your expression now becomes, for all practical purposes:
if (not (0 <= ((start <= 1)))){
This is already pretty much nonsensical. (The not keyword itself is fine: it is a standard alternative spelling of ! in C++, just an uncommon one.) Things pretty much roll downhill from that point on.
You just need to make two logical comparisons: start < end, and end <= size. If you spend a few moments to think about it, you would realize this is all you need:
if (!(start < end && end <= size))

Wildcard String Search Algorithm

In my program I need to search in a quite big string (~1 mb) for a relatively small substring (< 1 kb).
The problem is that the substring contains simple wildcards in the sense of "a?c", which means I want to search for strings like "abc" or "apc", and so on (I am only interested in the first occurrence).
Until now I use the trivial approach (here in pseudocode)
algorithm "search", input: haystack(string), needle(string)
for(i = 0, i < length(haystack), ++i)
    if(!CompareMemory(haystack+i,needle,length(needle)))
        return i;
return -1; (Not found)
Where "CompareMemory" returns 0 iff the first and second argument are identical (also concerning wildcards) only regarding the amount of bytes the third argument gives.
My question is now if there is a fast algorithm for this (you don't have to give it, but if you do I would prefer c++, c or pseudocode). I started here
but I think most of the fast algorithms don't allow wildcards (by the way they exploit the nature of strings).
I hope the format of the question is ok because I am new here, thank you in advance!
A fast way, which is kind of the same thing as using a regexp, (which I would recommend anyway), is to find something that is fixed in needle, "a", but not "?", and search for it, then see if you've got a complete match.
j = firstNonWildcardPos(needle)
for(i = j, i <= length(haystack)-length(needle)+j, ++i)
    if(haystack[i] == needle[j])
        if(!CompareMemory(haystack+i-j,needle,length(needle)))
            return i-j;
return -1; (Not found)
A regexp would generate code similar to this (I believe).
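In C++ the same idea might look like the sketch below. It is a minimal version under assumptions: the names matchesAt and findWildcard are made up, '?' is taken as the only wildcard, and std::string::find does the scan for the fixed character in place of a hand-written loop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if needle matches hay at offset pos, treating wild as "any one character".
bool matchesAt(const std::string& hay, std::size_t pos, const std::string& needle, char wild) {
    for (std::size_t k = 0; k < needle.size(); ++k)
        if (needle[k] != wild && needle[k] != hay[pos + k]) return false;
    return true;
}

// Hypothetical helper: index of the first match of needle in hay, or -1.
long findWildcard(const std::string& hay, const std::string& needle, char wild = '?') {
    if (needle.size() > hay.size()) return -1;
    const std::size_t j = needle.find_first_not_of(wild);   // first fixed character
    if (j == std::string::npos) return 0;                   // only wildcards: matches at 0
    const std::size_t last = hay.size() - needle.size() + j; // last place the anchor may sit
    for (std::size_t i = hay.find(needle[j], j);
         i != std::string::npos && i <= last;
         i = hay.find(needle[j], i + 1)) {
        if (matchesAt(hay, i - j, needle, wild)) return static_cast<long>(i - j);
    }
    return -1;
}
```

For example, findWildcard("xxapcx", "a?c") returns 2. Picking a rare fixed character as the anchor, rather than simply the first one, should cut down on the number of full comparisons.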
Among strings over an alphabet of c characters, let S have length s and let T_1 ... T_k have average length b. S will be searched for each of the k target strings. (The problem statement doesn't mention multiple searches of a given string; I mention it below because in that paradigm my program does well.)
The program uses O(s+c) time and space for setup, and (if S and the T_i are random strings) O(k*u*s/c) + O(k*b + k*b*s/c^u) total time for searching, with u=3 in program as shown. For longer targets, u should be increased, and rare, widely-separated key characters chosen.
In step 1, the program creates an array L of s+TsizMax integers (in program, TsizMax = allowed target length) and uses it for c lists of locations of next occurrences of characters, with list heads in H[] and tails in T[]. This is the O(s+c) time and space step.
In step 2, the program repeatedly reads and processes target strings. Step 2A chooses u = 3 different non-wild key characters (in current target). As shown, the program just uses the first three such characters; with a tiny bit more work, it could instead use the rarest characters in the target, to improve performance. Note, it doesn't cope with targets with fewer than three such characters.
The line "L[T[r]] = L[g+i] = g+i;" within Step 2A sets up a guard cell in L with proper delta offset so that Step 2G will automatically execute at end of search, without needing any extra testing during the search. T[r] indexes the tail cell of the list for character r, so cell L[g+i] becomes a new, self-referencing, end-of-list for character r. (This technique allows the loops to run with a minimum of extraneous condition testing.)
Step 2B sets vars a,b,c to head-of-list locations, and sets deltas dab, dac, and dbc corresponding to distances between the chosen key characters in target.
Step 2C checks if key characters appear in S. This step is necessary because otherwise a while loop in Step 2E will hang. We don't want more checks within those while loops because they are the inner loops of search.
Step 2D repeats steps 2E to 2I until var c points past the end of S, at which point it is impossible to make any more matches.
Step 2E consists of u = 3 while loops, that "enforce delta distances", that is, crawl indexes a,b,c along over each other as long as they are not pattern-compatible. The while loops are fairly fast, each being in essence (with ++si instrumentation removed) "while (v+d < w) v = L[v]" for various v, d, w. Replicating the three while loops a few times may increase performance a little and will not change net results.
In Step 2G, we know that the u key characters match, so we do a complete compare of target to match point, with wild-character handling. Step 2H reports result of compare. Program as given also reports non-matches in this section; remove that in production.
Step 2I advances all the key-character indexes, because none of the currently-indexed characters can be the key part of another match.
You can run the program to see a few operation-count statistics. For example, the output
Target 5=<de?ga>
012345678901234567890123456789012345678901
abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg
# 17, de?ga and de3ga match
# 24, de?ga and defg4 differ
# 31, de?ga and defga match
Advances: 'd' 0+3 'e' 3+3 'g' 3+3 = 6+9 = 15
shows that Step 2G was entered 3 times (ie, the key characters matched 3 times); the full compare succeeded twice; step 2E while loops advanced indexes 6 times; step 2I advanced indexes 9 times; there were 15 advances in all, to search the 42-character string for the de?ga target.
/* jiw
$Id: stringsearch.c,v 1.2 2011/08/19 08:53:44 j-waldby Exp j-waldby $
Re: Concept-code for searching a long string for short targets,
where targets may contain wildcard characters.
The user can enter any number of targets as command line parameters.
This code has 2 long strings available for testing; if the first
character of the first parameter is '1' the jay[42] string is used,
else kay[321].
Eg, for tests with *hay = jay use command like
./stringsearch 1e?g a?cd bc?e?g c?efg de?ga ddee? ddee?f
or with *hay = kay,
./stringsearch bc?e? jih? pa?j ?av??j
to exercise program.
Copyright 2011 James Waldby. Offered without warranty
under GPL v3 terms as at http://www.gnu.org/licenses/gpl.html
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
//================================================
int main(int argc, char *argv[]) {
  char jay[]="abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg";
  char kay[]="ludehkhtdiokihtmaihitoia1htkjkkchajajavpajkihtijkhijhipaja"
    "etpajamhkajajacpajihiatokajavtoia2pkjpajjhiifakacpajjhiatkpajfojii"
    "etkajamhpajajakpajihiatoiakavtoia3pakpajjhiifakacpajjhkatvpajfojii"
    "ihiifojjjjhijpjkhtfdoiajadijpkoia4jihtfjavpapakjhiifjpajihiifkjach"
    "ihikfkjjjjhijpjkhtfdoiajakijptoik4jihtfjakpapajjkiifjpajkhiifajkch";
  char *hay = (argc>1 && argv[1][0]=='1')? jay:kay;
  enum { chars=1<<CHAR_BIT, TsizMax=40, Lsiz=TsizMax+sizeof kay, L1, L2 };
  int L[L2], H[chars], T[chars], g, k, par;
  // Step 1. Make arrays L, H, T.
  for (k=0; k<chars; ++k) H[k] = T[k] = L1; // Init H and T
  for (g=0; hay[g]; ++g) { // Make linked character lists for hay.
    k = hay[g]; // In same loop, could count char freqs.
    if (T[k]==L1) H[k] = T[k] = g;
    T[k] = L[T[k]] = g;
  }
  // Step 2. Read and process target strings.
  for (par=1; par<argc; ++par) {
    int alpha[3], at[3], a=g, b=g, c=g, da, dab, dbc, dac, i, j, r;
    char * targ = argv[par];
    enum { wild = '?' };
    int sa=0, sb=0, sc=0, ta=0, tb=0, tc=0;
    printf ("Target %d=<%s>\n", par, targ);
    // Step 2A. Choose 3 non-wild characters to follow.
    // As is, chooses first 3 non-wilds for a,b,c.
    // Could instead choose 3 rarest characters.
    for (j=0; j<3; ++j) alpha[j] = -j;
    for (i=j=0; targ[i] && j<3; ++i)
      if (targ[i] != wild) {
        r = alpha[j] = targ[i];
        if (alpha[0]==alpha[1] || alpha[1]==alpha[2]
            || alpha[0]==alpha[2]) continue;
        at[j] = i;
        L[T[r]] = L[g+i] = g+i;
        ++j;
      }
    if (j != 3) {
      printf (" Too few target chars\n");
      continue;
    }
    // Step 2B. Set a,b,c to head-of-list locations, set deltas.
    da = at[0];
    a = H[alpha[0]]; dab = at[1]-at[0];
    b = H[alpha[1]]; dbc = at[2]-at[1];
    c = H[alpha[2]]; dac = at[2]-at[0];
    // Step 2C. See if key characters appear in haystack
    if (a >= g || b >= g || c >= g) {
      printf (" No match on some character\n");
      continue;
    }
    for (g=0; hay[g]; ++g) printf ("%d", g%10);
    printf ("\n%s\n", hay); // Show haystack, for user aid
    // Step 2D. Search for match
    while (c < g) {
      // Step 2E. Enforce delta distances
      while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
      while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many
      while (a+dac > c) {c = L[c]; ++sc; } // times as you like.
      while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
      while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many
      while (a+dac > c) {c = L[c]; ++sc; } // times as you like.
      // Step 2F. See if delta distances were met
      if (a+dab==b && b+dbc==c && c<g) {
        // Step 2G. Yes, so we have 3-letter-match and need to test whole match.
        r = a-da;
        for (k=0; targ[k]; ++k)
          if ((hay[r+k] != targ[k]) && (targ[k] != wild))
            break;
        printf ("# %3d, %s and ", r, targ);
        for (i=0; targ[i]; ++i) putchar(hay[r++]);
        // Step 2H. Report match, if found
        puts (targ[k]? " differ" : " match");
        // Step 2I. Advance all of a,b,c, to go on looking
        a = L[a]; ++ta;
        b = L[b]; ++tb;
        c = L[c]; ++tc;
      }
    }
    printf ("Advances: '%c' %d+%d '%c' %d+%d '%c' %d+%d = %d+%d = %d\n",
            alpha[0], sa,ta, alpha[1], sb,tb, alpha[2], sc,tc,
            sa+sb+sc, ta+tb+tc, sa+sb+sc+ta+tb+tc);
  }
  return 0;
}
Note, if you like this answer better than current preferred answer, unmark that one and mark this one. :)
Regular expressions usually use a finite state automaton-based search, I think. Try implementing that.
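If writing the automaton by hand is more than you need, one shortcut is to translate the wildcard needle into a regular expression and let the standard library do the search. This is a substitution for a hand-built automaton, not an implementation of one, and std::regex is not known for speed, so treat the sketch below as a baseline to measure against. The name regexWildcardFind is made up, and '?' is assumed to be the only wildcard.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: first match position of a '?'-wildcard needle, or -1.
long regexWildcardFind(const std::string& hay, const std::string& needle, char wild = '?') {
    static const std::string special = R"(\^$.|?*+()[]{})";  // characters to escape
    std::string pattern;
    for (char c : needle) {
        if (c == wild) {
            pattern += "[\\s\\S]";          // any single character, including newlines
        } else {
            if (special.find(c) != std::string::npos) pattern += '\\';
            pattern += c;
        }
    }
    std::smatch m;
    if (std::regex_search(hay, m, std::regex(pattern)))
        return static_cast<long>(m.position(0));
    return -1;
}
```

The class [\s\S] is used instead of a plain dot because the dot in the default ECMAScript grammar does not match line terminators.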