How to get the shortest palindrome of a string - c++

For example:
String: abcd
The shortest palindrome is abcdcba.
A longer palindrome would be abcddcba.
Another example:
String: aaaab
The shortest palindrome is aaaabaaaa.
A longer palindrome would be aaaabbaaaa.
Restriction: you can only add characters at the end.

Just append the reverse of initial substrings of the string, from shortest to longest, to the string until you have a palindrome. e.g., for "acbab", try appending "a" which yields "acbaba", which is not a palindrome, then try appending "ac" reversed, yielding "acbabca" which is a palindrome.
Update: Note that you don't have to actually do the append. You know that the substring matches since you just reversed it. So all you have to do is check whether the remainder of the string is a palindrome, and if so append the reverse of the substring. Which is what Ptival wrote symbolically, so he should probably get the credit for the answer. Example: for "acbab", find the longest suffix that is a palindrome; that is "bab". Then append the remainder, "ac", in reverse: ac bab ca.
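The suffix check described above could be sketched in C++ roughly as follows. This is a minimal sketch, not a tuned implementation: the function names are made up for illustration, and the worst-case running time is O(n^2).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// True if the suffix s[from..] reads the same forwards and backwards.
bool isPalindromeFrom(const std::string& s, std::size_t from) {
    if (s.empty()) return true;
    std::size_t lo = from, hi = s.size() - 1;
    while (lo < hi) {
        if (s[lo] != s[hi]) return false;
        ++lo;
        --hi;
    }
    return true;
}

// Appends the fewest characters to the end of s that make it a palindrome.
std::string shortestPalindromeByAppending(const std::string& s) {
    std::size_t i = 0;
    // Advance i until s[i..] is a palindrome (the longest palindromic suffix).
    while (i < s.size() && !isPalindromeFrom(s, i)) ++i;
    std::string prefix = s.substr(0, i);
    std::reverse(prefix.begin(), prefix.end());
    return s + prefix;
}
```

For "acbab" the loop stops at i = 2 (suffix "bab"), so "ac" is reversed and appended.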

My guess for the logic:
Say your string is [a1...an] (a list of characters a1 to an)
Find the smallest i such that [ai...an] is a palindrome.
The smallest palindrome is [a1 ... a(i-1)] ++ [ai ... an] ++ [a(i-1) ... a1]
where ++ denotes string concatenation.

Some pseudocode, to leave at least a bit of work for you:
def shortPalindrome(s):
    for i in range(len(s)):
        pal = s + reverse(s[0:i])
        if isPalindrome(pal):
            return pal
    error()

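Python code, should be easy to convert to C:
# assumes a is non-empty; start at 0 so a string that is already a palindrome is left unchanged
for i in range(len(a)):
    # a[i:] is the suffix starting at i; stop at the longest palindromic one
    if a[i:] == a[i:][::-1]:
        break
print(a + a[:i][::-1])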

I was also asked the same question recently, and here is what I wrote for my interview:
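#include <stdio.h>
#include <string.h>
#include <stdlib.h>
/* Returns 1 if str is a palindrome, 0 otherwise. */
int isPalin ( char *str ) {
    int i, len=strlen(str);
    for (i=0; i<len/2; ++i)
        if (str[i]!=str[len-i-1])
            break;
    return i==len/2;
}
int main(int argc, char *argv[]) {
    if (argc!=2)
        puts("Usage: small_palin <string>");
    else {
        char *str = argv[1];
        int i=0, j, len=strlen(str);
        /* Skip leading characters until the remaining suffix is a palindrome. */
        while ( !isPalin(str+i) )
            ++i;
        /* len original characters + i appended characters + terminator */
        char *palin = malloc(len+i+1);
        if (palin == NULL)
            return 1;
        strcpy(palin,str);
        /* Append the first i characters in reverse order. */
        for (i=i-1, j=0; i>=0; --i, ++j)
            *(palin+len+j) = str[i];
        *(palin+len+j) = '\0';
        puts(palin);
        free(palin);
    }
    return 0;
}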
I feel the program would have been better structured had I written an strrev() function and checked for a palindrome using strcmp(). That would let me reverse the leading characters of the source string and copy them directly using strcpy().
The reason I went with this solution is that, just before this question, I had been asked to check for a palindrome and already had isPalin() on paper. It felt like reusing existing code would be better!

From the examples you've shown, it looks like the longest palindrome is the original string concatenated with its reverse, and the shortest is the original string concatenated with its reverse minus the reverse's first character. But I'm pretty sure you want something more complex. Perhaps you can give better examples?

if string is made of k chars, I think you should add to this string the reversed (k-1) chars...

Below is my answer for a different case: the shortest palindrome obtained by attaching characters to the front. Your task is then to understand the algorithm and modify it appropriately.
The problem: given a string s, find the shortest palindrome that can be made by adding characters to the front of s.
If you have never tried to solve this problem, I suggest that you do; it will help improve your problem-solving skills.
After solving it, I kept looking for better solutions and stumbled upon another programmer's solution. It is in Python and really neat. It is really interesting, but I later found out it was wrong.
class Solution:
    # @param {string} s
    # @return {string}
    def shortestPalindrome(self, s):
        A=s+s[::-1]
        cont=[0]
        for i in range(1,len(A)):
            index=cont[i-1]
            while(index>0 and A[index]!=A[i]):
                index=cont[index-1]
            cont.append(index+(1 if A[index]==A[i] else 0))
        print(cont[-1])
        return s[cont[-1]:][::-1]+s
I looked at the solution and found the idea interesting. The algorithm first concatenates the string with its reverse. The following steps then resemble building the KMP table (or failure function) used in the KMP algorithm. Why does this procedure work?
If you know the KMP text-searching algorithm, you know its "lookup table" and the steps to build it. For now, I'll just show one important use of the table: it gives the longest prefix of a string s that is also a suffix of s (but not s itself). For example, "abcdabc" has "abc" as its longest prefix that is also a suffix (not "abcdabc", since that is the entire string!). To make it fun, call this prefix the "happy substring" of s. So the happy substring of "aaaaaaaaaa" (10 a's) is "aaaaaaaaa" (9 a's).
Now let's go back and see how finding the happy substring can help solve the shortest palindrome problem.
Suppose q' is the shortest string that, added to the front of s, makes q's a palindrome. Clearly length(q') < length(s), because prepending the reverse of s minus its first character already gives a palindrome. Since q's is a palindrome, it must end with q, the reverse of q', so s = p + q for some prefix p of s. It is easy to see that p is also a palindrome. Therefore, to make q's as short as possible, q must be as short as possible, which means p is the longest palindromic prefix of s.
Let s' and q' denote the reverses of s and q respectively. Then s = pq and s' = q'p, since p is a palindrome, so ss' = pqq'p. Now we need to find the longest p. Eureka! This suggests that p is the happy substring of the string ss'. That's how the above algorithm works!
However, after some thought, the above algorithm has a loophole. p is not necessarily the happy substring of ss'! In fact, p is the longest prefix of ss' that is also a suffix, with the added condition that the prefix and the suffix must not overlap. To make it more fun, call the "extremely happy substring" of a string the longest prefix that is also a suffix and does not overlap that suffix. In other words, the extremely happy substring of a string has length at most half the length of the string.
So it turns out that the happy substring of ss' is not always the extremely happy substring of ss'. We can easily construct an example: s = "aabba", ss' = "aabbaabbaa". The happy substring of "aabbaabbaa" is "aabbaa", while the extremely happy substring of "aabbaabbaa" is "aa". Bang!
Hence the correct solution is as follows, based on the observation that length(p) <= length(ss')/2.
class Solution:
    # @param {string} s
    # @return {string}
    def shortestPalindrome(self, s):
        A=s+s[::-1]
        cont=[0]
        for i in range(1,len(A)):
            index=cont[i-1]
            while(index>0):
                if(A[index]==A[i]):
                    # only accept a border that stays within s
                    if index < len(s):
                        break
                index=cont[index-1]
            cont.append(index+(1 if A[index]==A[i] else 0))
        print(cont[-1])
        return s[cont[-1]:][::-1]+s
Hooray!
As you can see, algorithms are interesting!
The link to the article I wrote here
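Since the question asks for C++, here is a rough C++ rendering of the same capped failure-function idea for the prepend variant. It is only a sketch: the function name is hypothetical, and the cap `k >= n` plays the role of the `index < len(s)` test in the Python version.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Shortest palindrome obtained by adding characters to the FRONT of s.
std::string shortestPalindromeByPrepending(const std::string& s) {
    const std::size_t n = s.size();
    if (n == 0) return s;
    const std::string a = s + std::string(s.rbegin(), s.rend());
    std::vector<std::size_t> border(a.size(), 0);
    for (std::size_t i = 1; i < a.size(); ++i) {
        std::size_t k = border[i - 1];
        // Fall back while the next character mismatches, or while extending
        // would make the border longer than s itself.
        while (k > 0 && (a[k] != a[i] || k >= n)) k = border[k - 1];
        if (a[k] == a[i]) ++k;
        border[i] = k;
    }
    const std::size_t p = border.back();  // length of the longest palindromic prefix
    std::string rest = s.substr(p);
    std::reverse(rest.begin(), rest.end());
    return rest + s;
}
```

For s = "aabba" the capped table ends in 2 (the prefix "aa"), so "bba" is reversed and placed in front.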

It looks like the solutions outlined here are O(N^2) (for each suffix X of the reversed string S, check whether S + X is a palindrome).
I believe there is a linear, i.e. O(N), solution for this problem. Consider the following: the only time you would append fewer than S.Length - 1 characters is when the string already ends with a palindrome, so it has the form NNNNNPPPPPP, where PPPPPP represents a palindrome. This means that if we can find the longest trailing palindrome, we can solve the problem in linear time by appending the reverse of NNNNN to the end.
Finally, there exists a famous algorithm (Manacher, 1975) that finds the longest (and in fact, all) of the palindromes contained in a string (there is a good explanation here). It can be easily modified to return the longest trailing palindrome, thus giving a linear solution for this problem.
If anyone is interested, here is the full code for a mirror problem (append characters at the beginning):
using System; // needed for Math.Min
using System.Text;
// Via http://articles.leetcode.com/2011/11/longest-palindromic-substring-part-ii.html
class Manacher
{
// Transform S into T.
// For example, S = "abba", T = "^#a#b#b#a#$".
// ^ and $ signs are sentinels appended to each end to avoid bounds checking
private static string PreProcess(string s)
{
StringBuilder builder = new StringBuilder();
int n = s.Length;
if (n == 0) return "^$";
builder.Append('^');
for (int i = 0; i < n; i++)
{
builder.Append('#');
builder.Append(s[i]);
}
builder.Append('#');
builder.Append('$');
return builder.ToString();
}
// Modified to return only the longest palindrome that *starts* the string
public static string LongestPalindrome(string s)
{
string T = PreProcess(s);
int n = T.Length;
int[] P = new int[n];
int C = 0, R = 0;
for (int i = 1; i < n - 1; i++)
{
int i_mirror = 2 * C - i; // equals to i' = C - (i-C)
P[i] = (R > i) ? Math.Min(R - i, P[i_mirror]) : 0;
// Attempt to expand palindrome centered at i
while (T[i + 1 + P[i]] == T[i - 1 - P[i]])
P[i]++;
// If palindrome centered at i expand past R,
// adjust center based on expanded palindrome.
if (i + P[i] > R)
{
C = i;
R = i + P[i];
}
}
// Find the maximum element in P.
int maxLen = 0;
int centerIndex = 0;
for (int i = 1; i < n - 1; i++)
{
if (P[i] > maxLen
&& i - 1 == P[i] /* the && part forces to only consider palindromes that start at the beginning*/)
{
maxLen = P[i];
centerIndex = i;
}
}
return s.Substring((centerIndex - 1 - maxLen) / 2, maxLen);
}
}
public class Solution {
public string Reverse(string s)
{
StringBuilder result = new StringBuilder();
for (int i = s.Length - 1; i >= 0; i--)
{
result.Append(s[i]);
}
return result.ToString();
}
public string ShortestPalindrome(string s)
{
string palindrome = Manacher.LongestPalindrome(s);
string part = s.Substring(palindrome.Length);
return Reverse(part) + palindrome + part;
}
}
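For the original problem (appending at the end), the same idea might look like this in C++. This is a sketch under my own naming, not a translation of the C# above: it uses bounds checks instead of sentinel characters and tracks the longest palindrome that touches the end of the string.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest palindromic suffix of s, using Manacher's algorithm.
std::size_t longestPalindromicSuffix(const std::string& s) {
    // Interleave '#' so every palindrome has odd length in t: "ab" -> "#a#b#".
    std::string t = "#";
    for (char c : s) { t += c; t += '#'; }
    const int m = static_cast<int>(t.size());
    std::vector<int> radius(m, 0);
    int center = 0, right = 0;  // palindrome reaching furthest right so far
    std::size_t best = 0;
    for (int i = 0; i < m; ++i) {
        if (i < right) radius[i] = std::min(right - i, radius[2 * center - i]);
        while (i - radius[i] - 1 >= 0 && i + radius[i] + 1 < m &&
               t[i - radius[i] - 1] == t[i + radius[i] + 1]) ++radius[i];
        if (i + radius[i] > right) { center = i; right = i + radius[i]; }
        // A palindrome touching the end of t is a suffix of s;
        // its radius in t equals its length in s.
        if (i + radius[i] == m - 1)
            best = std::max(best, static_cast<std::size_t>(radius[i]));
    }
    return best;
}

// Appends the reverse of everything before the longest palindromic suffix.
std::string shortestPalindromeLinear(const std::string& s) {
    std::string head = s.substr(0, s.size() - longestPalindromicSuffix(s));
    std::reverse(head.begin(), head.end());
    return s + head;
}
```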

using System;
using System.Collections.Generic;
using System.Linq;
public class Test
{
public static void shortPalindrome(string [] words){
List<string> container = new List<string>(); //List of Palindromes
foreach (string word in words )
{
char[] chararray = word.ToCharArray();
Array.Reverse(chararray);
string newText = new string(chararray);
if (word == newText) container.Add(word);
}
string shortPal=container.ElementAt(0);
for(int i=0; i<container.Count; i++)
{
if(container[i].Length < shortPal.Length){
shortPal = container[i];
}
}
Console.WriteLine(" The Shortest Palindrome is {0}",shortPal);
}
public static void Main()
{
string[] word = new string[5] {"noon", "racecar","redivider", "sun", "loss"};
shortPalindrome(word);
}
}

Shortest palindrome:
Iterate in reverse from the second-to-last character to the beginning,
and push_back each element.
#include <iostream>
#include <string>
using namespace std ;
int main()
{
string str = "abcd" ;
string shortStr = str ;
for( string::reverse_iterator it = str.rbegin()+1; it != str.rend() ; ++it )
{
shortStr.push_back(*it) ;
}
cout << shortStr << "\n" ;
}
And longer palindrome can be any longer.
Ex: abcd
Longer Palindrome - abcddcba, abcdddcba, ...

Related

Print out each character randomly

I am creating a small game where the user will have hints(Characters of a string) to guess the word of a string. I have the code to see each individual character of the string, but is it possible that I can see those characters printed out randomly?
string str("TEST");
for (int i = 0; i < str.size(); i++){
    cout <<" "<< str[i];
}
Output: T E S T
Desired sample output: E T S T
Use random_shuffle on the string:
random_shuffle(str.begin(), str.end());
Edit: random_shuffle was deprecated in C++14 and removed in C++17.
From C++11 onwards use:
auto engine = std::default_random_engine{};
shuffle ( begin(str), end(str), engine );
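A default-constructed engine starts from the same state every run, so the hints would come out in the same order each time. A small sketch that seeds the engine instead (the helper name `shuffledHint` is made up, and `std::random_device` is assumed to provide a usable seed on your platform):

```cpp
#include <algorithm>
#include <random>
#include <string>

// Returns a randomly permuted copy of `word`; the original is left untouched.
std::string shuffledHint(const std::string& word) {
    // Seed once; reuse the same engine for every call.
    static std::mt19937 engine{std::random_device{}()};
    std::string hint = word;
    std::shuffle(hint.begin(), hint.end(), engine);
    return hint;
}
```

Calling `shuffledHint("TEST")` returns some arrangement of the letters T, E, S, T; print it character by character as in the question's loop.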
Use the following code to generate the letters randomly.
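const int stl = str.size();
int stl2 = stl; // number of characters still to print
while (stl2 > 0) // stop once every character has been printed
{
    int r = rand() % stl;
    if (str[r] != '0')
    {
        cout<<" "<<str[r];
        str[r] = '0'; // mark this position as used
        stl2--;
    }
}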
This code basically generates the random number based on the size of the String and then prints the character placed at that particular position of the string.
To avoid the reprinting of already printed character, I have converted the character printed to "0", so next time same position number is generated, it will check if the character is "0" or not.
If you need to preserve the original string, then you may copy the string to another variable and use it in the code.
Note: It is assumed that string will contain only alphabetic characters and so to prevent repetition, "0" is used. If your string may contain numbers, you may use a different character for comparison purpose

Given a word and a text, return the count of the occurrences of anagrams of the word in the text [duplicate]

For example, if the word is "for" and the text is "forxxorfxdofr", the anagrams of "for" are "ofr", "orf", "fro", etc. So the answer would be 3 for this particular example.
Here is what I came up with.
#include<iostream>
#include<cstring>
using namespace std;
int countAnagram (char *pattern, char *text)
{
int patternLength = strlen(pattern);
int textLength = strlen(text);
if (textLength < patternLength) return 0; // no window of the text can fit the pattern
int dp1[256] = {0}, dp2[256] = {0}, i, j;
for (i = 0; i < patternLength; i++)
{
dp1[pattern[i]]++;
dp2[text[i]]++;
}
int found = 0, temp = 0;
for (i = 0; i < 256; i++)
{
if (dp1[i]!=dp2[i])
{
temp = 1;
break;
}
}
if (temp == 0)
found++;
for (i = 0; i < textLength - patternLength; i++)
{
temp = 0;
dp2[text[i]]--;
dp2[text[i+patternLength]]++;
for (j = 0; j < 256; j++)
{
if (dp1[j]!=dp2[j])
{
temp = 1;
break;
}
}
if (temp == 0)
found++;
}
return found;
}
int main()
{
char pattern[] = "for";
char text[] = "ofrghofrof";
cout << countAnagram(pattern, text);
}
Does there exist a faster algorithm for the said problem?
Most of the time will be spent searching, so to make the algorithm more time efficient, the objective is to reduce the quantities of searches or optimize the search.
Method 1: A table of search starting positions.
Create a vector of lists, one vector slot for each letter of the alphabet. This can be space-optimized later.
Each slot will contain a list of indices into the text.
Example text: forxxorfxdofr
Slot List
'f' 0 --> 7 --> 11
'o' 1 --> 5 --> 10
'r' 2 --> 6 --> 12
For each word, look up the letter in the vector to get a list of indexes into the text. For each index in the list, compare the text string position from the list item to the word.
So with the above table and the word "ofr", the first compare occurs at index 1, second compare at index 5 and last compare at index 10.
You could eliminate near-end of text indices where (index + word length > text length).
You can use the commutativity of multiplication, along with the uniqueness of prime factorization. This relies on my previous answer here
Create a mapping from each character to a prime number (as small as possible), e.g. a-->2, b-->3, c-->5, etc. This can be kept in a simple array.
Now convert the given word into the product of the primes matching each of its characters. This result equals the corresponding product for any anagram of that word.
Now sweep over the text, and at each step maintain the product of the primes matching the last L characters (where L is the length of your word). So every time you advance you do
mul = mul * char2prime(text[i]) / char2prime(text[i-L])
Whenever this product equals that of your word, increment the overall counter. Once the sweep reaches the end of the text, you're done.
Note that this method works well on short words, but the product of primes can overflow a 64-bit variable pretty fast (by ~9-10 letters), so you'll have to use a big-number library to support longer words.
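A sketch of the prime-product sweep in C++, under two assumptions that are mine rather than the answer's: the input contains only lowercase 'a'..'z', and the word has at most 9 letters so the product fits in 64 bits. The function name is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Counts the windows of `text` that are anagrams of `word`.
// Assumes lowercase 'a'..'z' only and word.size() <= 9 (101^9 < 2^64).
int countAnagramsByPrimes(const std::string& word, const std::string& text) {
    static const std::uint64_t primes[26] = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101};
    const std::size_t len = word.size();
    if (len == 0 || len > 9 || text.size() < len) return 0;
    std::uint64_t target = 1, window = 1;
    for (std::size_t i = 0; i < len; ++i) {
        target *= primes[word[i] - 'a'];
        window *= primes[text[i] - 'a'];
    }
    int found = (window == target) ? 1 : 0;
    for (std::size_t i = len; i < text.size(); ++i) {
        // Divide first (exact, because the outgoing prime divides the product),
        // then multiply, so the intermediate value never exceeds 101^9.
        window = window / primes[text[i - len] - 'a'] * primes[text[i] - 'a'];
        if (window == target) ++found;
    }
    return found;
}
```

Dividing before multiplying differs slightly from the formula above; it gives the same value but keeps the intermediate product smaller.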
This algorithm is reasonably efficient if the pattern to be anagrammed is so short that the best way to search it is to simply scan it. To allow longer patterns, the scans represented here by the 'for jj' and 'for mm' loops could be replaced by more sophisticated search techniques.
// sLine -- string to be searched
// sWord -- pattern to be anagrammed
// (in this pseudo-language, the index of the first character in a string is 0)
// iAnagrams -- count of anagrams found
iLineLim = length(sLine)-1
iWordLim = length(sWord)-1
// we need a 'deleted' marker char that will never appear in the input strings
chNil = chr(0)
iAnagrams = 0 // well we haven't found any yet have we
// examine every posn in sLine where an anagram could possibly start
for ii from 0 to iLineLim-iWordLim do {
chK = sLine[ii]
// does the char at this position in sLine also appear in sWord
for jj from 0 to iWordLim do {
if sWord[jj]=chK then {
// yes -- we have a candidate starting posn in sLine
// is there an anagram of sWord at this position in sLine
sCopy = sWord // make a temp copy that we will delete one char at a time
sCopy[jj] = chNil // delete the char we already found in sLine
// the rest of the anagram would have to be in the next iWordLim positions
for kk from ii+1 to ii+iWordLim do {
chK = sLine[kk]
cc = false
for mm from 0 to iWordLim do { // look for anagram char
if sCopy[mm]=chK then { // found one
cc = true
sCopy[mm] = chNil // delete it from copy
break // out of 'for mm'
}
}
if not cc then break // out of 'for kk' -- no anagram char here
}
if cc then { iAnagrams = iAnagrams+1 }
break // out of 'for jj'
}
}
}
-Al.

Compare part of the string

Okay, so here is what I'm trying to accomplish.
First of all, the table below is just an example I created; in my assignment I'm not supposed to know any of these values. That means I don't know what will be passed in or how long each string is.
The task I'm trying to accomplish is to compare part of a string.
//In Array `phrase` // in array `word`
"Backdoor", 0 "mark" 3 (matches "Market")
"DVD", 1 "of" 2 (matches "Get off")
"Get off", 2 "" -1 (no match)
"Market", 3 "VD" 1 (matches "DVD")
As you can see above, the left-hand side is the array that I store in my class; it holds up to 10 phrases.
Here is the class definition.
class data
{
char phrase[10][40];
public:
int match(const char word[ ]);
};
so I'm using member function to access this private data.
int data::match(const char word[ ])
{
int n;
const int wordLength = strlen(word);
for (n=0 ; n <= 10; n++)
{
if (strncmp (phrase[n],word,wordLength) == 0)
{
return n;
}
}
return -1;
}
The code above should return the index n if it finds a match, and should always return -1 if no match is found.
What happens now is that it always returns 10.
You're almost there, but your code is incomplete, so I'm shooting in the dark on a few things.
You may have one too many variables representing an index. Unless n and i are different, you should only use one. Also try to use more descriptive names; pos seems to represent the length of the text you are searching.
for (n=0 ; n <= searchLength ; n++)
Since the length of word never changes you don't need to call strlen every time. Create a variable to store the length in before the for loop.
const int wordLength = strlen(word);
I'm assuming the text you are searching is stored in a char array. This means you'll need to pass a pointer to the first element stored at n.
if (strncmp (&phrase[n],word,wordLength) == 0)
In the end you have something that looks like the following:
char word[256] = "there";
char phrase[256] = "hello there hippie!";
const int wordLength = strlen(word);
const int searchLength = strlen(phrase);
for (int n = 0; n <= searchLength; n++)
{
// or phrase + n
if (strncmp(&phrase[n], word, wordLength) == 0)
{
return n;
}
}
return -1;
Note: The final example is now complete to the point of returning a match.
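Going by the table in the question, a match seems to mean "the phrase contains the word somewhere, ignoring case", and an empty word never matches. Under that reading (an assumption on my part), and using `std::string` rather than the class's char arrays, a sketch could look like this; `matchPhrase` is a hypothetical name:

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Index of the first phrase containing `word` (ignoring case), or -1.
int matchPhrase(const std::vector<std::string>& phrases, const std::string& word) {
    if (word.empty()) return -1;  // assumption: an empty word means "no match"
    auto sameLetter = [](char a, char b) {
        return std::tolower(static_cast<unsigned char>(a)) ==
               std::tolower(static_cast<unsigned char>(b));
    };
    for (std::size_t n = 0; n < phrases.size(); ++n) {
        const std::string& p = phrases[n];
        // std::search finds `word` anywhere inside p, not only at the start.
        if (std::search(p.begin(), p.end(), word.begin(), word.end(), sameLetter) != p.end())
            return static_cast<int>(n);
    }
    return -1;
}
```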
I'm puzzled by your problem. Some cases are unclear. For example, for abcdefg --- abcde, does it match "abcde"? How many words match? Other examples: for abcdefg --- dcb, does it match "c"? And for abcdefg --- aoodeoofoo, does it match "a" or "adef"? If you want to find the first matched word, it's OK and very simple. But if you need to find the longest, discontinuous match, it is a much bigger problem. I think you should look into the LCS problem (Longest Common Subsequence).

Solving "Welcome to Code Jam" from Google Code Jam 2009

I am trying to solve the following Code Jam question. I've made some progress, but for a few cases my code gives wrong output.
Welcome to Code jam
So I stumbled on a solution by dev "rem" from Russia.
I've no idea how his/her solution works correctly. The code:
const string target = "welcome to code jam";
char buf[1<<20];
int main() {
freopen("input.txt", "rt", stdin);
freopen("output.txt", "wt", stdout);
gets(buf);
FOR(test, 1, atoi(buf)) {
gets(buf);
string s(buf);
int n = size(s);
int k = size(target);
vector<vector<int> > dp(n+1, vector<int>(k+1));
dp[0][0] = 1;
const int mod = 10000;
assert(k == 19);
REP(i, n) REP(j, k+1) {// Whats happening here
dp[i+1][j] = (dp[i+1][j]+dp[i][j])%mod;
if (j < k && s[i] == target[j])
dp[i+1][j+1] = (dp[i+1][j+1]+dp[i][j])%mod;
}
printf("Case #%d: %04d\n", test, dp[n][k]);
}
exit(0);
}//credit rem
Can somebody explain what's happening in the two loops?
Thanks.
What he is doing is dynamic programming; that much you can see too.
He has a 2D array, and you need to understand its semantics.
dp[i][j] counts the number of ways to obtain the first j letters of "welcome to code jam" as a subsequence of the first i letters of the input string. Both indexes are 1-based, so that index 0 covers the case of taking no letters from the strings.
For example if the input is:
welcome to code jjam
The values of dp in different situations are going to be:
dp[1][1] = 1; // first letter is w. perfect just the goal
dp[1][2] = 0; // no way to have two letters in just one-letter string
dp[2][2] = 1; // again: perfect
dp[2][1] = 1; // here we ignore the e. We just need the w.
dp[7][2] = 2; // two ways to construct we: [we]lcome and [w]elcom[e].
The loop you are specifically asking about calculates new dynamic values based on the already calculated ones.
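To make the recurrence concrete, here is the same table written as a standalone function, without the contest macros. It is a sketch: the name `countSubsequences` and the default argument are mine, while the update rules follow rem's two lines.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of ways (mod 10000) to pick `target` as a subsequence of `text`.
int countSubsequences(const std::string& text,
                      const std::string& target = "welcome to code jam") {
    const int mod = 10000;
    const std::size_t n = text.size(), k = target.size();
    std::vector<std::vector<int>> dp(n + 1, std::vector<int>(k + 1, 0));
    dp[0][0] = 1;  // one way to build the empty prefix from no letters
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j <= k; ++j) {
            // Option 1: skip text[i]; everything counted so far carries over.
            dp[i + 1][j] = (dp[i + 1][j] + dp[i][j]) % mod;
            // Option 2: use text[i] as the next letter of the target.
            if (j < k && text[i] == target[j])
                dp[i + 1][j + 1] = (dp[i + 1][j + 1] + dp[i][j]) % mod;
        }
    }
    return dp[n][k];
}
```

Each cell pushes its count forward to the next row, once for skipping the current input letter and once more if that letter can extend the matched prefix.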
Whoa, I was practicing this problem a few days ago and stumbled across this question.
I suspect that saying "he's doing dynamic programming" won't explain much if you haven't studied DP.
I can give a clearer implementation and an easier explanation:
string phrase = "welcome to code jam"; // S
string text; getline(cin, text); // T
vector<int> ob(text.size(), 1);
int ans = 0;
for (int p = 0; p < phrase.size(); ++p) {
ans = 0;
for (int i = 0; i < text.size(); ++i) {
if (text[i] == phrase[p]) ans = (ans + ob[i]) % 10000;
ob[i] = ans;
}
}
cout << setfill('0') << setw(4) << ans << endl;
To solve the problem if S had only one character S[0] we could just count number of its occurrences.
If it had only two characters S[0..1] we see that each occurrence T[i]==S[1] increases answer by the number of occurrences of S[0] before index i.
For three characters S[0..2] each occurrence T[i]==S[2] similarly increases answer by number of occurrences of S[0..1] before index i. This number is the same as the answer value at the moment the previous paragraph had processed T[i].
If there were four characters, the answer would be increasing by number of occurrences of the previous three before each index at which fourth character is found, and so on.
As every other step uses values from the previous ones, this can be solved incrementally. On each step p we need to know number of occurrences of previous substring S[0..p-1] before any index i, which can be kept in array of integers ob of the same length as T. Then the answer goes up by ob[i] whenever we encounter S[p] at i. And to prepare ob for the next step, we also update each ob[i] to be the number of occurrences of S[0..p] instead — i.e. to the current answer value.
By the end the latest answer value (and the last element of ob) contain the number of occurrences of whole S in whole T, and that is the final answer.
Notice that it starts with ob filled with ones. The first step is different from the rest; but counting number of occurrences of S[0] means increasing answer by 1 on each occurrence, which is what all other steps do, except that they increase by ob[i]. So when every ob[i] is initially 1, the first step will run just like all others, using the same code.

Wildcard String Search Algorithm

In my program I need to search in a quite big string (~1 mb) for a relatively small substring (< 1 kb).
The problem is that the string contains simple wildcards in the sense of "a?c", which means I want to search for strings like "abc" or "apc", and so on (I am only interested in the first occurrence).
Until now I use the trivial approach (here in pseudocode)
algorithm "search", input: haystack(string), needle(string)
for(i = 0, i <= length(haystack)-length(needle), ++i)
    if(!CompareMemory(haystack+i,needle,length(needle)))
        return i;
return -1; (Not found)
Where "CompareMemory" returns 0 iff the first and second argument are identical (also concerning wildcards) only regarding the amount of bytes the third argument gives.
My question is now if there is a fast algorithm for this (you don't have to give it, but if you do I would prefer C++, C or pseudocode). I started here, but I think most of the fast algorithms don't allow wildcards (by the way they exploit the nature of strings).
I hope the format of the question is ok because I am new here, thank you in advance!
A fast way, which is kind of the same thing as using a regexp (which I would recommend anyway), is to find something in needle that is fixed, such as "a" rather than "?", search for it, and then check whether you've got a complete match.
j = firstNonWildcardPos(needle)
for(i = j, i <= length(haystack)-length(needle)+j, ++i)
    if(haystack[i] == needle[j])
        if(!CompareMemory(haystack+i-j,needle,length(needle)))
            return i-j;
return -1; (Not found)
A regexp would generate code similar to this (I believe).
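In C++ the anchor-character filter could be sketched as below. The function name is hypothetical, and '?' is assumed to be the only wildcard, matching exactly one character.

```cpp
#include <cstddef>
#include <string>

// Position of the first match of `needle` in `haystack`, where '?' in the
// needle matches any single character; returns -1 if there is no match.
long findWithWildcards(const std::string& haystack, const std::string& needle) {
    const std::size_t h = haystack.size(), n = needle.size();
    if (n > h) return -1;
    // Anchor: the first needle position that is not a wildcard.
    const std::size_t anchor = needle.find_first_not_of('?');
    if (anchor == std::string::npos) return 0;  // only wildcards: matches at 0
    for (std::size_t start = 0; start + n <= h; ++start) {
        if (haystack[start + anchor] != needle[anchor]) continue;  // cheap filter
        std::size_t k = 0;
        while (k < n && (needle[k] == '?' || needle[k] == haystack[start + k])) ++k;
        if (k == n) return static_cast<long>(start);
    }
    return -1;
}
```

The returned value is the start of the match, not the position of the anchor character.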
Among strings over an alphabet of c characters, let S have length s and let T_1 ... T_k have average length b. S will be searched for each of the k target strings. (The problem statement doesn't mention multiple searches of a given string; I mention it below because in that paradigm my program does well.)
The program uses O(s+c) time and space for setup, and (if S and the T_i are random strings) O(k*u*s/c) + O(k*b + k*b*s/c^u) total time for searching, with u=3 in program as shown. For longer targets, u should be increased, and rare, widely-separated key characters chosen.
In step 1, the program creates an array L of s+TsizMax integers (in program, TsizMax = allowed target length) and uses it for c lists of locations of next occurrences of characters, with list heads in H[] and tails in T[]. This is the O(s+c) time and space step.
In step 2, the program repeatedly reads and processes target strings. Step 2A chooses u = 3 different non-wild key characters (in current target). As shown, the program just uses the first three such characters; with a tiny bit more work, it could instead use the rarest characters in the target, to improve performance. Note, it doesn't cope with targets with fewer than three such characters.
The line "L[T[r]] = L[g+i] = g+i;" within Step 2A sets up a guard cell in L with proper delta offset so that Step 2G will automatically execute at end of search, without needing any extra testing during the search. T[r] indexes the tail cell of the list for character r, so cell L[g+i] becomes a new, self-referencing, end-of-list for character r. (This technique allows the loops to run with a minimum of extraneous condition testing.)
Step 2B sets vars a,b,c to head-of-list locations, and sets deltas dab, dac, and dbc corresponding to distances between the chosen key characters in target.
Step 2C checks if key characters appear in S. This step is necessary because otherwise a while loop in Step 2E will hang. We don't want more checks within those while loops because they are the inner loops of search.
Step 2D repeats steps 2E to 2I until var c points past the end of S, at which point it is impossible to make any more matches.
Step 2E consists of u = 3 while loops, that "enforce delta distances", that is, crawl indexes a,b,c along over each other as long as they are not pattern-compatible. The while loops are fairly fast, each being in essence (with ++si instrumentation removed) "while (v+d < w) v = L[v]" for various v, d, w. Replicating the three while loops a few times may increase performance a little and will not change net results.
In Step 2G, we know that the u key characters match, so we do a complete compare of target to match point, with wild-character handling. Step 2H reports result of compare. Program as given also reports non-matches in this section; remove that in production.
Step 2I advances all the key-character indexes, because none of the currently-indexed characters can be the key part of another match.
You can run the program to see a few operation-count statistics. For example, the output
Target 5=<de?ga>
012345678901234567890123456789012345678901
abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg
# 17, de?ga and de3ga match
# 24, de?ga and defg4 differ
# 31, de?ga and defga match
Advances: 'd' 0+3 'e' 3+3 'g' 3+3 = 6+9 = 15
shows that Step 2G was entered 3 times (ie, the key characters matched 3 times); the full compare succeeded twice; step 2E while loops advanced indexes 6 times; step 2I advanced indexes 9 times; there were 15 advances in all, to search the 42-character string for the de?ga target.
/* jiw
$Id: stringsearch.c,v 1.2 2011/08/19 08:53:44 j-waldby Exp j-waldby $
Re: Concept-code for searching a long string for short targets,
where targets may contain wildcard characters.
The user can enter any number of targets as command line parameters.
This code has 2 long strings available for testing; if the first
character of the first parameter is '1' the jay[42] string is used,
else kay[321].
Eg, for tests with *hay = jay use command like
./stringsearch 1e?g a?cd bc?e?g c?efg de?ga ddee? ddee?f
or with *hay = kay,
./stringsearch bc?e? jih? pa?j ?av??j
to exercise program.
Copyright 2011 James Waldby. Offered without warranty
under GPL v3 terms as at http://www.gnu.org/licenses/gpl.html
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
//================================================
int main(int argc, char *argv[]) {
char jay[]="abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg";
char kay[]="ludehkhtdiokihtmaihitoia1htkjkkchajajavpajkihtijkhijhipaja"
"etpajamhkajajacpajihiatokajavtoia2pkjpajjhiifakacpajjhiatkpajfojii"
"etkajamhpajajakpajihiatoiakavtoia3pakpajjhiifakacpajjhkatvpajfojii"
"ihiifojjjjhijpjkhtfdoiajadijpkoia4jihtfjavpapakjhiifjpajihiifkjach"
"ihikfkjjjjhijpjkhtfdoiajakijptoik4jihtfjakpapajjkiifjpajkhiifajkch";
char *hay = (argc>1 && argv[1][0]=='1')? jay:kay;
enum { chars=1<<CHAR_BIT, TsizMax=40, Lsiz=TsizMax+sizeof kay, L1, L2 };
int L[L2], H[chars], T[chars], g, k, par;
// Step 1. Make arrays L, H, T.
for (k=0; k<chars; ++k) H[k] = T[k] = L1; // Init H and T
for (g=0; hay[g]; ++g) { // Make linked character lists for hay.
k = hay[g]; // In same loop, could count char freqs.
if (T[k]==L1) H[k] = T[k] = g;
T[k] = L[T[k]] = g;
}
// Step 2. Read and process target strings.
for (par=1; par<argc; ++par) {
int alpha[3], at[3], a=g, b=g, c=g, da, dab, dbc, dac, i, j, r;
char * targ = argv[par];
enum { wild = '?' };
int sa=0, sb=0, sc=0, ta=0, tb=0, tc=0;
printf ("Target %d=<%s>\n", par, targ);
// Step 2A. Choose 3 non-wild characters to follow.
// As is, chooses first 3 non-wilds for a,b,c.
// Could instead choose 3 rarest characters.
for (j=0; j<3; ++j) alpha[j] = -j;
for (i=j=0; targ[i] && j<3; ++i)
if (targ[i] != wild) {
r = alpha[j] = targ[i];
if (alpha[0]==alpha[1] || alpha[1]==alpha[2]
|| alpha[0]==alpha[2]) continue;
at[j] = i;
L[T[r]] = L[g+i] = g+i;
++j;
}
if (j != 3) {
printf (" Too few target chars\n");
continue;
}
// Step 2B. Set a,b,c to head-of-list locations, set deltas.
da = at[0];
a = H[alpha[0]]; dab = at[1]-at[0];
b = H[alpha[1]]; dbc = at[2]-at[1];
c = H[alpha[2]]; dac = at[2]-at[0];
// Step 2C. See if key characters appear in haystack
if (a >= g || b >= g || c >= g) {
printf (" No match on some character\n");
continue;
}
for (g=0; hay[g]; ++g) printf ("%d", g%10);
printf ("\n%s\n", hay); // Show haystack, for user aid
// Step 2D. Search for match
while (c < g) {
// Step 2E. Enforce delta distances
while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many
while (a+dac > c) {c = L[c]; ++sc; } // times as you like.
while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
while (b+dbc < c) {b = L[b]; ++sb; } // 3 abc lines as many
while (a+dac > c) {c = L[c]; ++sc; } // times as you like.
// Step 2F. See if delta distances were met
if (a+dab==b && b+dbc==c && c<g) {
// Step 2G. Yes, so we have 3-letter-match and need to test whole match.
r = a-da;
for (k=0; targ[k]; ++k)
if ((hay[r+k] != targ[k]) && (targ[k] != wild))
break;
printf ("# %3d, %s and ", r, targ);
for (i=0; targ[i]; ++i) putchar(hay[r++]);
// Step 2H. Report match, if found
puts (targ[k]? " differ" : " match");
// Step 2I. Advance all of a,b,c, to go on looking
a = L[a]; ++ta;
b = L[b]; ++tb;
c = L[c]; ++tc;
}
}
printf ("Advances: '%c' %d+%d '%c' %d+%d '%c' %d+%d = %d+%d = %d\n",
alpha[0], sa,ta, alpha[1], sb,tb, alpha[2], sc,tc,
sa+sb+sc, ta+tb+tc, sa+sb+sc+ta+tb+tc);
}
return 0;
}
Note, if you like this answer better than current preferred answer, unmark that one and mark this one. :)
Regular expressions usually use a finite-state-automaton-based search, I think. Try implementing that.