Search For Subtext In Text With A Defined Algorithm - c++

I want to create a program to search for a subtext in a text.
For example, I have this text: abcdeabbdfeg
And in that text I want to find: cd
But I want to use this algorithm:
start = 1
end = string length of the text
middle = (start + end) / 2
if (pattern < text[middle]) end = mid - 1;
if (pattern > text[middle]) start = mid + 1;
...and continue until the pattern is found in the text
I already have a simple program that works without any problems, but it does not use the algorithm above. Now I want to implement that algorithm in my program. I have tried many ways, but once I add the algorithm, the program no longer prints anything in any case.
This is the code that I have and works:
void search(char *pat, char *txt)
{
int M = strlen(pat);
int N = strlen(txt);
for (int i = 0; i <= N - M; i++)
{
int j;
for (j = 0; j < M; j++)
{
if (txt[i+j] != pat[j])
break;
}
if (j == M)
{
printf("Pattern found at index %d \n", i);
}
}
}
And this is the code above with the implementation of the algorithm:
int _tmain(int argc, _TCHAR* argv[])
{
char t[32];
cout << "Please enter your text (t):";
cin >> t;
char p[32];
cout << "Please enter the pattern (p) you wish to look for in that text (t):";
cin >> p;
int start, end = 0;
double middle = 0;
start = 1;
end = strlen(t);
while (start <= end)
{
int M = strlen(p);
int N = strlen(t);
middle = std::ceil((start + end) / 2.0);
int mid = (int)middle;
for (int i = mid; i <= M; i++)
{
int j;
for (j = 0; j < M; j++)
{
if (t[mid] != p[j]) break;
if (p[j] < t[mid]) { end = mid - 1; }
else if (p[j] > t[mid]) { start = mid + 1; }
}
if (j == M)
{
printf("Pattern found at index %d \n", i);
}
}
}
if (start > end) cout << "Search has ended: pattern p does not occur in the text." << endl;
return 0;
}

Your algorithm is still a binary search. You split the array into two partitions, then select a partition, based on the value of a letter.
The requirement of a partitioned search is an ordered collection.
Let's use your example.
0 1 2 3 4 5 6 7 8 9 10 11
+---+---+---+---+---+---+---+---+---+---+---+---+
| a | b | c | d | e | a | b | b | d | f | e | g |
+---+---+---+---+---+---+---+---+---+---+---+---+
If you choose the midpoint at index 5, this yields the letter a. Since you are searching for the letter c first, then d, the algorithm says that the letter c must lie in the partition 6..11. This is the basis of your issue.
The algorithm will not find cd because there is no c in the partition 6..11.
The algorithm assumes that the array is sorted and that for any given index, there will be one partition containing values less than array[index] and one partition containing values greater than array[index].
This assumption is demonstrated by the following code of yours:
if (p[j] < t[mid]) { end = mid - 1; }
else if (p[j] > t[mid]) { start = mid + 1; }
No matter how you name your algorithm, if it assumes an ordering on the array (e.g. p[j] < t[mid]), the array must be ordered.
Your data is not ordered, so it violates the assumption, and thus the algorithm fails.
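To make the failure concrete, here is a minimal sketch of the halving scheme reduced to a single character. The helper names `halving_search` and `halving_search_sorted_copy` are made up for this illustration and are not part of the question's code.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: the halving scheme from the question, applied to a
// single character. It is only correct when `text` is sorted.
bool halving_search(const std::string& text, char target)
{
    int start = 0;
    int end = static_cast<int>(text.size()) - 1;
    while (start <= end)
    {
        const int mid = start + (end - start) / 2;
        if (text[mid] == target) return true;
        if (target < text[mid]) end = mid - 1;   // assumes target lies to the left
        else                    start = mid + 1; // assumes target lies to the right
    }
    return false;
}

// The same search on a sorted copy. This satisfies the ordering assumption,
// but sorting destroys the original positions, so it cannot locate "cd".
bool halving_search_sorted_copy(std::string text, char target)
{
    std::sort(text.begin(), text.end());
    return halving_search(text, target);
}
```

With the text abcdeabbdfeg, `halving_search(text, 'c')` probes indices 5, 8, 6 and 7 and returns false, although c sits at index 2. The sorted copy finds it, but only because the letters have been rearranged.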
Edit 1:
Using Partitions
If you really must use a partitioning algorithm, you will need to build a set of partitions.
For example one partition starts at index 0 and proceeds until array[i] > array[i+1], this ends up at index 4. The other partition is 5..11.
(By the way, by determining the partitions, you have used more operations than a linear search.)
At this point, how do you know which partition to choose?
You don't. The letter c, that you are searching for, lies between a and e in the first partition; and a through g in the second partition. Pick a partition. If not found in the partition, you will have to search the other partition.
By performing a binary search on either partition, you have used more operations than a linear search.

Related

Why is my brute force substring search returning extra counts?

I am timing different algorithms. However, my brute-force implementation, which I have found numerous times on different sites, sometimes returns more results than, say, the Notepad++ or VS Code search. I am not sure what I am doing wrong.
The program opens a txt file with a DNA strand string of length 10000000 and searches and counts the number of occurrences of the string passed in via command line.
Algorithm:
int main(int argc, char *argv[]) {
// read in dna strand
ifstream file("dna.txt");
string dna((istreambuf_iterator<char>(file)), istreambuf_iterator<char>());
dna.c_str();
int dnaLength = dna.length();
cout << "DNA Strand Length: " << dnaLength << endl;
string pat = argv[1];
cout << "Pattern: " << pat << endl;
// algorithm
int M = pat.length();
int N = dnaLength;
int localCount = 0;
for (int i = 0; i <= N - M; i++) {
int j;
for (j = 0; j < M; j++) {
if (dna.at(i + j) != pat.at(j)) {
break;
}
}
if (j == M) {
localCount++;
}
}
The difference might be because your algorithm also counts overlapping results, while a quick check with Notepad++ shows that it does not.
Example:
Let dna be "FooFooFooFoo"
And your pattern "FooFoo"
What result do you expect? Notepad++ shows 2 (one starts at position 1, the second at position 7, after the first).
Your algorithm will find 3 (position 1, 4 and 7)
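The two counts differ only in how far the search advances after a match. A minimal sketch, assuming a hypothetical helper `count_matches` whose `step` parameter is introduced here for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts matches of `pat` in `text`. After each match the search resumes
// `step` characters later: step = 1 counts overlapping matches,
// step = pat.size() counts only non-overlapping ones.
std::size_t count_matches(const std::string& text, const std::string& pat,
                          std::size_t step)
{
    assert(!pat.empty() && step > 0);
    std::size_t count = 0;
    std::size_t pos = text.find(pat);
    while (pos != std::string::npos)
    {
        ++count;
        pos = text.find(pat, pos + step);
    }
    return count;
}
```

For "FooFooFooFoo" and "FooFoo", `count_matches(text, pat, 1)` gives 3 and `count_matches(text, pat, pat.size())` gives 2, matching the two results described above.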
In your algorithm, the index i increases by 1 on every iteration. This may cause double counting for some search patterns. For example, search for ABAB in the text ... ABABABABABAB .... Your method may count 5 matches, whereas the count would be 3 if no character may be counted twice. Which answer do you want?
To avoid double counting, you may rewrite the loop over i as a while loop:
int i = 0, j;
while (i <= N - M) // same bound as the original for loop
{
    for (j = 0; j < M; j++) {
        if (dna.at(i + j) != pat.at(j)) {
            break;
        }
    }
    if (j == M) {
        localCount++;
        i += M; // skip past the match so its characters are not reused
    }
    else ++i;
}
Or, you can use the function std::string::find(const std::string& str, size_t pos = 0). The first argument is the pattern to look for, and the second is the position at which to start searching:
std::size_t pos = dna.find(pat); // initial search starts from pos = 0
int count = 0;
while (pos != std::string::npos) { // while a match was found
    ++count;
    pos = dna.find(pat, pos + M); // resume the search M characters later
}
Running both methods and comparing their counts is a useful cross-check.

Computing all distinct pair-combinations from N elements

Working on a USACO programming problem, I got stuck when using a brute-force approach.
From a list of N elements, I need to compute all distinct pair-configurations.
My problem is twofold.
How do I express such a configuration in, let's say, an array?
How do I go about computing all distinct combinations?
I only resorted to the brute-force approach after I gave up solving it analytically. Although this is context-specific, I got as far as noting that one can quickly rule out the rows where there is only a single so-called "wormhole", since it is effectively not in an infinite cycle.
Update
I'll express them with a tree structure. Set N = 6; {A,B,C,D,E,F}.
By constructing the following trees chronologically, all combinations are listed.
A --> B,C,D,E,F;
B --> C,D,E,F;
C --> D,E,F;
D --> E,F;
E --> F.
Check: in total there are 6 over 2 = 6!/(2!*4!) = 15 combinations.
Note. Once a lower node is selected, it should be discarded as a top node; it can only exist in one single pair.
Next, selecting them and looping over all configurations.
Here is a sample code (in C/C++):
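const int N = 6; // number of elements, as in the example above
int c[N]; // c[i] is the partner of element i, or -1 if i is unpaired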
void LoopOverAll(int n)
{
if (n == N)
{
// output, the array c now contains a configuration
// do anything you want here
return;
}
if (c[n] != - 1)
{
// this wormhole is already paired with someone
LoopOverAll(n + 1);
return;
}
for (int i = n + 1; i < N; i ++)
{
if (c[i] != - 1)
{
// this wormhole is already paired with someone
continue;
}
c[i] = n; c[n] = i; LoopOverAll(n + 1);
c[i] = - 1;
}
c[n] = - 1;
}
int main()
{
for (int i = 0; i < N; i ++)
c[i] = - 1;
LoopOverAll(0);
return 0;
}
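The same recursion can also collect the configurations instead of handling them in place. The following is a sketch only: `all_pairings` and `pair_up` are hypothetical names, and the count of elements is passed as a parameter rather than fixed as N.

```cpp
#include <cassert>
#include <vector>

// partner[i] is the element paired with i, or -1 while i is still free.
static void pair_up(std::vector<int>& partner, int n,
                    std::vector<std::vector<int>>& out)
{
    const int count = static_cast<int>(partner.size());
    while (n < count && partner[n] != -1) ++n; // skip elements already paired
    if (n == count)
    {
        out.push_back(partner); // one complete configuration
        return;
    }
    for (int i = n + 1; i < count; ++i)
    {
        if (partner[i] != -1) continue;
        partner[n] = i;
        partner[i] = n;
        pair_up(partner, n + 1, out);
        partner[n] = -1; // undo the choice before trying the next partner
        partner[i] = -1;
    }
}

// Returns every way to split `count` elements into disjoint pairs.
std::vector<std::vector<int>> all_pairings(int count)
{
    assert(count >= 0 && count % 2 == 0);
    std::vector<int> partner(count, -1);
    std::vector<std::vector<int>> out;
    pair_up(partner, 0, out);
    return out;
}
```

For 6 elements this yields 5 * 3 * 1 = 15 configurations, because the lowest free element always has an odd number of candidates left. Each returned vector is in the same format as the array c above.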

Searching a string of ints for a repeating pattern [duplicate]

My problem is to find the repeating sequence of characters in a given array, or simply, to identify the pattern in which the characters appear.
.---.---.---.---.---.---.---.---.---.---.---.---.---.---.
1: | J | A | M | E | S | O | N | J | A | M | E | S | O | N |
'---'---'---'---'---'---'---'---'---'---'---'---'---'---'
.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.
2: | R | O | N | R | O | N | R | O | N | R | O | N | R | O | N |
'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'
.---.---.---.---.---.---.---.---.---.---.---.---.
3: | S | H | A | M | I | L | S | H | A | M | I | L |
'---'---'---'---'---'---'---'---'---'---'---'---'
.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.
4: | C | A | R | P | E | N | T | E | R | C | A | R | P | E | N | T | E | R |
'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'
Example
Given the previous data, the result should be:
"JAMESON"
"RON"
"SHAMIL"
"CARPENTER"
Question
How to deal with this problem efficiently?
Tongue-in-cheek O(NlogN) solution
Perform an FFT on your string (treating characters as numeric values). Every peak in the resulting graph corresponds to a substring periodicity.
For your examples, my first approach would be to
get the first character of the array (for your last example, that would be C)
get the index of the next appearance of that character in the array (e.g. 9)
if it is found, search for the next appearance of the substring between the two appearances of the character (in this case CARPENTER)
if it is found, you're done (and the result is this substring).
Of course, this works only for a very limited subset of possible arrays, where the same word is repeated over and over again, starting from the beginning, without stray characters in between, and its first character is not repeated within the word. But all your examples fall into this category - and I prefer the simplest solution which could possibly work :-)
If the repeated word contains the first character multiple times (e.g. CACTUS), the algorithm can be extended to look for subsequent occurrences of that character too, not only the first one (so that it finds the whole repeated word, not only a substring of it).
Note that this extended algorithm would give a different result for your second example, namely RONRON instead of RON.
In Python, you can leverage regexes thus:
import re

def recurrence(text):
    # Integer division so that range() also works on Python 3.
    for i in range(1, len(text) // 2 + 1):
        m = re.match(r'^(.{%d})\1+$' % i, text)
        if m:
            return m.group(1)

recurrence('abcabc')  # Returns 'abc'
I'm not sure how this would translate to Java or C. (That's one of the reasons I like Python, I guess. :-)
First write a method that finds a repeating substring sub in the container string, as below.
boolean findSubRepeating(String sub, String container);
Now keep calling this method with increasingly long substrings of the container: first try a 1-character substring, then 2 characters, and so on, up to container.length/2.
Pseudocode
len = str.length
for (i in 1..len) {
if (len%i==0) {
if (str==str.substr(0,i).repeat(len/i)) {
return str.substr(0,i)
}
}
}
Note: For brevity, I'm using a "repeat" method for strings; "abc".repeat(2)="abcabc". Java's String only provides repeat from Java 11 onward.
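The pseudocode translates to C++ without a repeat method by comparing each character with the one a unit length earlier. This is a sketch under the same assumption as the pseudocode, namely that the whole string is made of complete copies of the unit; `smallest_repeating_unit` is a name chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the shortest prefix whose repetition yields `s`. If no shorter
// prefix works, the loop reaches len == s.size() and returns `s` itself.
std::string smallest_repeating_unit(const std::string& s)
{
    const std::size_t n = s.size();
    for (std::size_t len = 1; len <= n; ++len)
    {
        if (n % len != 0) continue; // the unit length must divide the total length
        bool ok = true;
        for (std::size_t k = len; k < n && ok; ++k)
            ok = (s[k] == s[k - len]); // each character repeats one unit later
        if (ok) return s.substr(0, len);
    }
    return s; // only reached for the empty string
}
```

Because lengths are tried in increasing order, RONRONRONRONRON gives RON rather than a longer multiple of it.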
Using C++:
//Splits the string into fragments of the given size
//Returns the set of distinct fragments
set<string> split(string s, int frag)
{
set<string> uni;
int len = s.length();
for(int i = 0; i < len; i+= frag)
{
uni.insert(s.substr(i, frag));
}
return uni;
}
int main()
{
string out;
string s = "carpentercarpenter";
int len = s.length();
//Optimistic approach..hope there are only 2 repeated strings
//If that fails, then try to break the strings with lesser number of
//characters
for(int i = len/2; i>=1;--i) // i>=1 so that single-character units such as "aaaa" are found
{
set<string> uni = split(s,i);
if(uni.size() == 1)
{
out = *uni.begin();
break;
}
}
cout<<out;
return 0;
}
The first idea that comes to my mind is trying all repeating sequences of lengths that divide length(S) = N. There is a maximum of N/2 such lengths, so this results in an O(N^2) algorithm.
But I'm sure it can be improved...
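One known way to improve on this is the prefix function from the Knuth-Morris-Pratt algorithm, which gives the smallest period in O(N). This is a different technique from the divisor scan above, shown as a sketch; `repeating_unit_linear` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Uses the KMP prefix function: pi[i] is the length of the longest proper
// prefix of s[0..i] that is also a suffix of it.
std::string repeating_unit_linear(const std::string& s)
{
    const std::size_t n = s.size();
    if (n == 0) return s;
    std::vector<std::size_t> pi(n, 0);
    for (std::size_t i = 1; i < n; ++i)
    {
        std::size_t k = pi[i - 1];
        while (k > 0 && s[i] != s[k]) k = pi[k - 1];
        if (s[i] == s[k]) ++k;
        pi[i] = k;
    }
    // n - pi[n - 1] is the smallest shift that maps s onto itself. It is a
    // true repeating unit only if it divides the length evenly.
    const std::size_t period = n - pi[n - 1];
    return (n % period == 0) ? s.substr(0, period) : s;
}
```

For strings that are not a whole number of repetitions, such as abcab, the function returns the input unchanged.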
Here is a more general solution to the problem that will find repeating subsequences within a sequence (of anything), where the subsequences do not have to start at the beginning, nor immediately follow each other.
Given a sequence b[0..n] containing the data in question, and a threshold t being the minimum subsequence length to find:
l_max = 0, i_max = 0, j_max = 0;
for (i=0; i<n-(t*2);i++) {
for (j=i+t;j<n-t; j++) {
l=0;
while (i+l<j && j+l<n && b[i+l] == b[j+l])
l++;
if (l>t) {
print "Sequence of length " + l + " found at " + i + " and " + j;
if (l>l_max) {
l_max = l;
i_max = i;
j_max = j;
}
}
}
}
if (l_max>t) {
print "longest common subsequence found at " + i_max + " and " + j_max + " (" + l_max + " long)";
}
Basically:
Start at the beginning of the data, iterate until within 2*t of the end (no possible way to have two distinct subsequences of length t in less than 2*t of space!)
For the second subsequence, start at least t bytes beyond where the first sequence begins.
Then, reset the length of the discovered subsequence to 0, and check to see if you have a common character at i+l and j+l. As long as you do, increment l.
When you no longer have a common character, you have reached the end of your common subsequence.
If the subsequence is longer than your threshold, print the result.
Just figured this out myself and wrote some code for this (written in C#) with a lot of comments. Hope this helps someone:
// Check whether the string contains a repeating sequence.
public static bool ContainsRepeatingSequence(string str)
{
if (string.IsNullOrEmpty(str)) return false;
for (int i=0; i<str.Length; i++)
{
// Every iteration, cut down the string from i to the end.
string toCheck = str.Substring(i);
// Set N equal to half the length of the substring. At most, we have to compare half the string to half the string. If the string length is odd, the last character will not be checked against, but it will be checked in the next iteration.
int N = toCheck.Length / 2;
// Check strings of all lengths from 1 to N against the subsequent string of length 1 to N.
for (int j=1; j<=N; j++)
{
// Check from beginning to j-1, compare against j to j+j.
if (toCheck.Substring(0, j) == toCheck.Substring(j, j)) return true;
}
}
return false;
}
Feel free to ask any questions if it's unclear why it works.
And here is a concrete working example:
/* find greatest repeated substring */
const char *fgrs(const char *s,size_t *l)
{
const char *r=0,*a=s; /* const: the input string is never modified */
*l=0;
while( *a )
{
const char *e=strrchr(a+1,*a);
if( !e )
break;
do {
size_t t=1;
for(;&a[t]!=e && a[t]==e[t];++t);
if( t>*l )
*l=t,r=a;
while( --e!=a && *e!=*a );
} while( e!=a && *e==*a );
++a;
}
return r;
}
size_t t;
const char *p;
p=fgrs("BARBARABARBARABARBARA",&t);
while( t-- ) putchar(*p++);
p=fgrs("0123456789",&t);
while( t-- ) putchar(*p++);
p=fgrs("1111",&t);
while( t-- ) putchar(*p++);
p=fgrs("11111",&t);
while( t-- ) putchar(*p++);
Not sure how you define "efficiently". For easy/fast implementation you could do this in Java:
private static String findSequence(String text) {
Pattern pattern = Pattern.compile("(.+?)\\1+");
Matcher matcher = pattern.matcher(text);
return matcher.matches() ? matcher.group(1) : null;
}
It tries to find the shortest string (.+?) that must be repeated at least once (\1+) to match the entire input text.
This is a solution I came up with using a queue. It passed all the test cases of a similar problem on Codeforces (problem 745A).
#include<bits/stdc++.h>
using namespace std;
typedef long long ll;
int main()
{
ios_base::sync_with_stdio(false);
cin.tie(NULL);
string s, s1, s2; cin >> s; queue<char> qu; qu.push(s[0]); bool flag = true; int ind = -1;
s1 = s.substr(0, s.size() / 2);
s2 = s.substr(s.size() / 2);
if(s1 == s2)
{
for(int i=0; i<s1.size(); i++)
{
s += s1[i];
}
}
//cout << s1 << " " << s2 << " " << s << "\n";
for(int i=1; i<s.size(); i++)
{
if(qu.front() == s[i]) {qu.pop();}
qu.push(s[i]);
}
int cycle = qu.size();
/*queue<char> qu2 = qu; string str = "";
while(!qu2.empty())
{
cout << qu2.front() << " ";
str += qu2.front();
qu2.pop();
}*/
while(!qu.empty())
{
if(s[++ind] != qu.front()) {flag = false; break;}
qu.pop();
}
flag == true ? cout << cycle : cout << s.size();
return 0;
}
I'd convert the array to a String object and use regex
Put all your characters in an array, e.g. a[]
i=0; j=0;
for( 0 < i < count )
{
if (a[i] == a[i+j+1])
{++i;}
else
{++j;i=0;}
}
Then the ratio of (i/j) = repeat count in your array.
You must pay attention to limits of i and j, but it is the simple solution.

Find whether a 2d matrix is subset of another 2d matrix

Recently I took part in a hackathon and came across a problem that asks you to find a grid-shaped pattern in a 2D matrix. A pattern can be U, H, or T and is represented by a 3*3 matrix.
Suppose I want to represent H and U:
+--+--+--+ +--+--+--+
|1 |0 |1 | |1 |0 |1 |
+--+--+--+ +--+--+--+
|1 |1 |1 | --> H |1 |0 |1 | -> U
+--+--+--+ +--+--+--+
|1 |0 |1 | |1 |1 |1 |
+--+--+--+ +--+--+--+
Now I need to search for this in a 10*10 matrix containing 0s and 1s. The closest and only solution I could come up with is a brute-force algorithm of O(n^4). In languages like MATLAB and R there are very concise ways to do this, but not in C or C++. I searched a lot for a solution on Google and on SO, but the closest I could get is this SO post, which discusses implementing the Rabin-Karp string-search algorithm. There is no pseudocode or any post explaining this, though. Could anyone help or provide a link, a PDF, or some logic to simplify this?
EDIT
As Eugene Sh. commented, if N is the size of the large matrix (NxN) and k that of the small one (kxk), the brute-force algorithm should take O((Nk)^2). Since k is fixed, this reduces to O(N^2). Yes, absolutely right.
But is there any generalised way if N and K are big?
Alright, here is then the 2D Rabin-Karp approach.
For the following discussion, assume we want to find a (m, m) sub-matrix inside a (n, n) matrix. (The concept works for rectangular matrices just as well but I ran out of indices.)
The idea is that for each possible sub-matrix, we compute a hash. Only if that hash matches the hash of the matrix we want to find, we will compare element-wise.
To make this efficient, we must avoid re-computing the entire hash of the sub-matrix each time. Because I got little sleep tonight, the only hash function for which I could figure out how to do this easily is the sum of 1s in the respective sub-matrix. I leave it as an exercise to someone smarter than me to figure out a better rolling hash function.
Now, if we have just checked the sub-matrix from (i, j) to (i + m – 1, j + m – 1) and know it has x 1s inside, we can compute the number of 1s in the sub-matrix one to the right – that is, from (i, j + 1) to (i + m – 1, j + m) – by subtracting the number of 1s in the sub-vector from (i, j) to (i + m – 1, j) and adding the number of 1s in the sub-vector from (i, j + m) to (i + m – 1, j + m).
If we hit the right margin of the large matrix, we shift the window down by one and then back to the left margin and then again down by one and then again to the right and so forth.
Note that this requires O(m) operations, not O(m^2), for each candidate. If we do this for every pair of indices, we get O(mn^2) work. Thus, by cleverly shifting a window of the size of the potential sub-matrix through the large matrix, we can reduce the amount of work by a factor of m. That is, if we don't get too many hash collisions.
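[Figure: the current window inside the large matrix, with a red column vector on its left edge and a green column vector just past its right edge.]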
As we shift the current window one to the right, we subtract the number of 1s in the red column vector on the left side and add the number of 1s in the green column vector on the right side to obtain the number of 1s in the new window.
I have implemented a quick demo of this idea using the great Eigen C++ template library. The example also uses some stuff from Boost but only for argument parsing and output formatting so you can easily get rid of it if you don't have Boost but want to try out the code. The index fiddling is a bit messy but I'll leave it without further explanation here. The above prose should cover it sufficiently.
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <random>
#include <type_traits>
#include <utility>
#include <boost/format.hpp>
#include <boost/lexical_cast.hpp>
#include <Eigen/Dense>
#define PROGRAM "submatrix"
#define SEED_CSTDLIB_RAND 1
using BitMatrix = Eigen::Matrix<bool, Eigen::Dynamic, Eigen::Dynamic>;
using Index1D = BitMatrix::Index;
using Index2D = std::pair<Index1D, Index1D>;
std::ostream&
operator<<(std::ostream& out, const Index2D& idx)
{
out << "(" << idx.first << ", " << idx.second << ")";
return out;
}
BitMatrix
get_random_bit_matrix(const Index1D rows, const Index1D cols)
{
auto matrix = BitMatrix {rows, cols};
matrix.setRandom();
return matrix;
}
Index2D
findSubMatrix(const BitMatrix& haystack,
const BitMatrix& needle,
Index1D *const collisions_ptr = nullptr) noexcept
{
static_assert(std::is_signed<Index1D>::value, "unsigned index type");
const auto end = Index2D {haystack.rows(), haystack.cols()};
const auto hr = haystack.rows();
const auto hc = haystack.cols();
const auto nr = needle.rows();
const auto nc = needle.cols();
if (nr > hr || nc > hc)
return end;
const auto target = needle.count();
auto current = haystack.block(0, 0, nr - 1, nc).count();
auto j = Index1D {0};
for (auto i = Index1D {0}; i <= hr - nr; ++i)
{
if (j == 0) // at left margin
current += haystack.block(i + nr - 1, 0, 1, nc).count();
else if (j == hc - nc) // at right margin
current += haystack.block(i + nr - 1, hc - nc, 1, nc).count();
else
assert(!"this should never happen");
while (true)
{
if (i % 2 == 0) // moving right
{
if (j > 0)
current += haystack.block(i, j + nc - 1, nr, 1).count();
}
else // moving left
{
if (j < hc - nc)
current += haystack.block(i, j, nr, 1).count();
}
assert(haystack.block(i, j, nr, nc).count() == current);
if (current == target)
{
// TODO: There must be a better way than using cwiseEqual().
if (haystack.block(i, j, nr, nc).cwiseEqual(needle).all())
return Index2D {i, j};
else if (collisions_ptr)
*collisions_ptr += 1;
}
if (i % 2 == 0) // moving right
{
if (j < hc - nc)
{
current -= haystack.block(i, j, nr, 1).count();
++j;
}
else break;
}
else // moving left
{
if (j > 0)
{
current -= haystack.block(i, j + nc - 1, nr, 1).count();
--j;
}
else break;
}
}
if (i % 2 == 0) // at right margin
current -= haystack.block(i, hc - nc, 1, nc).count();
else // at left margin
current -= haystack.block(i, 0, 1, nc).count();
}
return end;
}
int
main(int argc, char * * argv)
{
if (SEED_CSTDLIB_RAND)
{
std::random_device rnddev {};
srand(rnddev());
}
if (argc != 5)
{
std::cerr << "usage: " << PROGRAM
<< " ROWS_HAYSTACK COLUMNS_HAYSTACK"
<< " ROWS_NEEDLE COLUMNS_NEEDLE"
<< std::endl;
return EXIT_FAILURE;
}
auto hr = boost::lexical_cast<Index1D>(argv[1]);
auto hc = boost::lexical_cast<Index1D>(argv[2]);
auto nr = boost::lexical_cast<Index1D>(argv[3]);
auto nc = boost::lexical_cast<Index1D>(argv[4]);
const auto haystack = get_random_bit_matrix(hr, hc);
const auto needle = get_random_bit_matrix(nr, nc);
auto collisions = Index1D {};
const auto idx = findSubMatrix(haystack, needle, &collisions);
const auto end = Index2D {haystack.rows(), haystack.cols()};
std::cout << "This is the haystack:\n\n" << haystack << "\n\n";
std::cout << "This is the needle:\n\n" << needle << "\n\n";
if (idx != end)
std::cout << "Found as sub-matrix at " << idx << ".\n";
else
std::cout << "Not found as sub-matrix.\n";
std::cout << boost::format("There were %d (%.2f %%) hash collisions.\n")
% collisions
% (100.0 * collisions / ((hr - nr + 1) * (hc - nc + 1))); // number of candidate windows
return (idx != end) ? EXIT_SUCCESS : EXIT_FAILURE;
}
While it compiles and runs, please consider the above as pseudo-code. I have made almost no attempt at optimizing it. It was just a proof-of-concept for myself.
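For readers without Eigen or Boost, the count-of-1s hash can be sketched with the standard library alone. The sketch below assumes rectangular matrices stored as rows of '0'/'1' characters, and the names `Grid`, `ones_in_column`, `equal_at` and `find_submatrix` are invented for it. It is not the implementation above: it re-seeds the count at the start of every row instead of snaking left and right, which costs O(m^2) per row but keeps the indexing simple.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Grid = std::vector<std::string>; // each string is one row of '0'/'1'

// Number of '1' cells in column `col`, rows [top, top + height).
static int ones_in_column(const Grid& g, int top, int col, int height)
{
    int ones = 0;
    for (int r = top; r < top + height; ++r)
        ones += (g[r][col] == '1');
    return ones;
}

// Element-wise comparison, used only when the counts agree.
static bool equal_at(const Grid& hay, const Grid& needle, int top, int left)
{
    for (std::size_t r = 0; r < needle.size(); ++r)
        if (hay[top + r].compare(left, needle[r].size(), needle[r]) != 0)
            return false;
    return true;
}

// Returns the top-left (row, column) of the first match, or (-1, -1).
std::pair<int, int> find_submatrix(const Grid& hay, const Grid& needle)
{
    const std::pair<int, int> not_found(-1, -1);
    if (hay.empty() || needle.empty()) return not_found;
    const int hr = static_cast<int>(hay.size());
    const int hc = static_cast<int>(hay[0].size());
    const int nr = static_cast<int>(needle.size());
    const int nc = static_cast<int>(needle[0].size());
    if (nc == 0 || nr > hr || nc > hc) return not_found;

    int target = 0; // the "hash" of the needle: its number of 1s
    for (int c = 0; c < nc; ++c) target += ones_in_column(needle, 0, c, nr);

    for (int i = 0; i + nr <= hr; ++i)
    {
        int current = 0; // number of 1s in the window at (i, 0)
        for (int c = 0; c < nc; ++c) current += ones_in_column(hay, i, c, nr);
        for (int j = 0; ; ++j)
        {
            if (current == target && equal_at(hay, needle, i, j))
                return std::make_pair(i, j);
            if (j + nc == hc) break;                       // right margin reached
            current -= ones_in_column(hay, i, j, nr);      // column leaving the window
            current += ones_in_column(hay, i, j + nc, nr); // column entering it
        }
    }
    return not_found;
}
```

Each shift to the right touches only two columns, which is the O(m) update described in the prose.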
I'm going to present an algorithm that takes O(n*n) time in the worst case whenever k = O(sqrt(n)) and O(n*n + n*k*k) in general. This is an extension of Aho-Corasick to 2D. Recall that Aho-Corasick locates all occurrences of a set of patterns in a target string T, and it does so in time linear in pattern lengths, length of T, and number of occurrences.
Let's introduce some terminology. The haystack is the large matrix we are searching in and the needle is the pattern-matrix. The haystack is a nxn matrix and the needle is a kxk matrix. The set of patterns that we are going to use in Aho-Corasick is the set of rows of the needle. This set contains at most k rows and will have fewer if there are duplicate rows.
We are going to build the Aho-Corasick automaton (which is a Trie augmented with failure links) and then run the search algorithm on each row of the haystack. So we take every row of the needle and search for it in every row of the haystack. We can use a linear time 1D matching algorithm to do this but that would still be inefficient. The advantage of Aho-Corasick is that it searches for all patterns at once.
During the search we are going to populate a matrix A which we are going to use later. When we search in the first row of the haystack, the first row of A is filled with the occurrences of rows of the needle in the first row of the haystack. So we'll end up with a first row of A that looks like 2 - 0 - - 1 for example. This means that row number 0 of the needle appears at position 2 in the first row of the haystack; row number 1 appears at position 5; row number 2 appears at position 0. The - entries are positions that did not get matched. Keep doing this for every row.
Let's for now assume that there are no duplicate rows in the needle. Assign 0 to the first row of the needle, 1 to the second, and so on. Now we are going to search for the pattern [0 1 2 ... k-1] in every column of the matrix A using a linear time 1D search algorithm (KMP for example). Recall that every row of A stores positions at which rows of the needle appear. So if a column contains the pattern [0 1 2 ... k-1], this means that row number 0 of the needle appears at some row of the haystack, row number 1 of the needle is just below it, and so on. This is exactly what we want. If there are duplicate rows, just assign a unique number to each unique row.
Searching a column takes O(n) with a linear-time algorithm, so searching all columns takes O(n*n). We populate the matrix during the search: we search every row of the haystack (there are n rows), and the search in a row takes O(n+k*k). So it is O(n(n+k*k)) overall.
So the idea was to find that matrix and then reduce the problem to 1D pattern matching. Aho-Corasick is just there for efficiency, I don't know if there is another efficient way to find the matrix.
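The reduction itself can be sketched without Aho-Corasick. The version below swaps in a hash-map lookup of substrings to fill the matrix A and a naive scan down each column instead of a linear-time 1D search, so it is slower (roughly O(n*n*k)) and only meant to show the structure of A and of the pattern p. The name `match2d_simple` and the rectangular-input assumption are mine.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns the top-left (row, column) of every match, in row-major order.
std::vector<std::pair<int, int>> match2d_simple(const std::vector<std::string>& hay,
                                                const std::vector<std::string>& needle)
{
    std::vector<std::pair<int, int>> found;
    if (hay.empty() || needle.empty() || needle[0].empty()) return found;
    const int n = static_cast<int>(hay.size());
    const int width = static_cast<int>(hay[0].size());
    const int k = static_cast<int>(needle.size());
    const int w = static_cast<int>(needle[0].size());
    if (k > n || w > width) return found;

    // Give each distinct needle row an id; p is the column pattern to find.
    // Duplicate rows share an id, as described above.
    std::unordered_map<std::string, int> id;
    std::vector<int> p;
    for (const std::string& row : needle)
        p.push_back(id.emplace(row, static_cast<int>(id.size())).first->second);

    // A[r][c] = id of the needle row that starts at (r, c), or -1 if none.
    std::vector<std::vector<int>> A(n, std::vector<int>(width, -1));
    for (int r = 0; r < n; ++r)
        for (int c = 0; c + w <= width; ++c)
        {
            const auto it = id.find(hay[r].substr(c, w));
            if (it != id.end()) A[r][c] = it->second;
        }

    // Look for p running down each column of A (naive scan, not KMP).
    for (int r = 0; r + k <= n; ++r)
        for (int c = 0; c + w <= width; ++c)
        {
            int t = 0;
            while (t < k && A[r + t][c] == p[t]) ++t;
            if (t == k) found.emplace_back(r, c);
        }
    return found;
}
```

A match at (r, c) means that p appears in column c of A starting at row r, which is exactly the 1D problem the answer reduces to.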
EDIT: added implementation.
Here is my C++ implementation. The max value of n is set to 100 but you can change it.
The program starts by reading two integers n k (the dimensions of the matrices). Then it reads n lines each containing a string of 0's and 1's of length n. Then it reads k lines each containing a string of 0's and 1's of length k. The output is the upper-left coordinate of all matches. For example, for the following input:
12 2
101110111011
111010111011
110110111011
101110111010
101110111010
101110111010
101110111010
111010111011
111010111011
111010111011
111010111011
111010111011
11
10
The program will output:
match at (2,0)
match at (1,1)
match at (0,2)
match at (6,2)
match at (2,10)
#include <cstdio>
#include <cstring>
#include <string>
#include <queue>
#include <iostream>
using namespace std;
const int N = 100;
const int M = N;
int n, m;
string haystack[N], needle[M];
int A[N][N]; /* filled by successive calls to match */
int p[N]; /* pattern to search for in columns of A */
struct Node
{ Node *a[2]; /* alphabet is binary */
Node *suff; /* pointer to node whose prefix = longest proper suffix of this node */
int flag;
Node()
{ a[0] = a[1] = 0;
suff = 0;
flag = -1;
}
};
void insert(Node *x, string s)
{ static int id = 0;
static int p_size = 0;
for(int i = 0; i < s.size(); i++)
{ char c = s[i];
if(x->a[c - '0'] == 0)
x->a[c - '0'] = new Node;
x = x->a[c - '0'];
}
if(x->flag == -1)
x->flag = id++;
/* update pattern */
p[p_size++] = x->flag;
}
Node *longest_suffix(Node *x, int c)
{ while(x->a[c] == 0)
x = x->suff;
return x->a[c];
}
Node *mk_automaton(void)
{ Node *trie = new Node;
for(int i = 0; i < m; i++)
{ insert(trie, needle[i]);
}
queue<Node*> q;
/* level 1 */
for(int i = 0; i < 2; i++)
{ if(trie->a[i])
{ trie->a[i]->suff = trie;
q.push(trie->a[i]);
}
else trie->a[i] = trie;
}
/* level > 1 */
while(q.empty() == false)
{ Node *x = q.front(); q.pop();
for(int i = 0; i < 2; i++)
{ if(x->a[i] == 0) continue;
x->a[i]->suff = longest_suffix(x->suff, i);
q.push(x->a[i]);
}
}
return trie;
}
/* search for patterns in haystack[j] */
void match(Node *x, int j)
{ for(int i = 0; i < n; i++)
{ x = longest_suffix(x, haystack[j][i] - '0');
if(x->flag != -1)
{ A[j][i-m+1] = x->flag;
}
}
}
int match2d(Node *x)
{ int matches = 0;
static int z[M+N];
static int z_str[M+N+1];
/* init */
memset(A, -1, sizeof(A));
/* fill the A matrix */
for(int i = 0; i < n; i++)
{ match(x, i);
}
/* build string for z algorithm */
z_str[n+m] = -2; /* acts like `\0` for strings */
for(int i = 0; i < m; i++)
{ z_str[i] = p[i];
}
for(int i = 0; i < n; i++)
{ /* search for pattern in column i */
for(int j = 0; j < n; j++)
{ z_str[j + m] = A[j][i];
}
/* run z algorithm */
int l, r;
l = r = 0;
z[0] = n + m;
for(int j = 1; j < n + m; j++)
{ if(j > r)
{ l = r = j;
while(z_str[r] == z_str[r - l]) r++;
z[j] = r - l;
r--;
}
else
{ if(z[j - l] < r - j + 1)
{ z[j] = z[j - l];
}
else
{ l = j;
while(z_str[r] == z_str[r - l]) r++;
z[j] = r - l;
r--;
}
}
}
/* locate matches */
for(int j = m; j < n + m; j++)
{ if(z[j] >= m)
{ printf("match at (%d,%d)\n", j - m, i);
matches++;
}
}
}
return matches;
}
int main(void)
{ cin >> n >> m;
for(int i = 0; i < n; i++)
{ cin >> haystack[i];
}
for(int i = 0; i < m; i++)
{ cin >> needle[i];
}
Node *trie = mk_automaton();
match2d(trie);
return 0;
}
Let's start with an O(N * N * K) solution. I will use the following notation: A is a pattern matrix, B is a big matrix(the one we will search for occurrences of the pattern in).
We can fix a top row of the B matrix (that is, we will search for all occurrences that start at a position (this row, any column)). Let's call this row topRow. Now we can take a slice of this matrix that contains the rows [topRow; topRow + K) and all columns.
Let's create a new matrix as the concatenation A + column + the slice, where column is a column of K elements that are not present in A or B (if A and B consist of 0s and 1s, we can use -1, for instance). Now we can treat the columns of this new matrix as letters and run the Knuth-Morris-Pratt algorithm. Comparing two letters requires O(K) time, thus the time complexity of this step is O(N * K).
There are O(N) ways to fix the top row, so the total time complexity is O(N * N * K). It is already better than a brute-force solution, but we are not done yet. The theoretical lower bound is O(N * N)(I assume that N >= K), and I want to achieve it.
Let's take a look at what can be improved here. If we could compare two columns of a matrix in O(1) time instead of O(K), we would achieve the desired time complexity. Let's concatenate all columns of both A and B, inserting a separator after each column. Now we have a string and we need to compare its substrings (because columns and their parts are substrings now). Let's construct a suffix tree in linear time (using Ukkonen's algorithm). Comparing two substrings now comes down to finding the string depth of the lowest common ancestor (LCA) of two nodes in this tree. There is an algorithm that allows us to do this with linear preprocessing time and O(1) time per LCA query. It means that we can compare two substrings (or columns) in constant time! Thus, the total time complexity is O(N * N). There is another way to achieve this time complexity: we can build a suffix array in linear time and answer longest common prefix queries in constant time (with linear-time preprocessing). However, both of these O(N * N) solutions look pretty hard to implement, and they will have a big constant.
P.S. If we have a polynomial hash function that we can fully trust (or we are fine with a few false positives), we can get a much simpler O(N * N) solution using 2-D polynomial hashes.
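As a rough illustration of the hashing idea, here is a sketch of the building block only: per-column polynomial prefix hashes that let us compare two column segments in O(1), up to hash collisions. It is not the full 2-D hash solution. The names `ColumnHasher` and `kBase`, the choice of base, and the use of wrap-around arithmetic modulo 2^64 are all my assumptions for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Arbitrary odd base; arithmetic wraps modulo 2^64 (an assumption of this sketch).
constexpr std::uint64_t kBase = 1000003ULL;

struct ColumnHasher {
    // prefix[c][r] is the hash of rows [0, r) of column c.
    std::vector<std::vector<std::uint64_t>> prefix;
    // power[i] is kBase to the power i.
    std::vector<std::uint64_t> power;

    explicit ColumnHasher(const std::vector<std::string>& m) {
        assert(!m.empty());
        const std::size_t rows = m.size();
        const std::size_t cols = m[0].size();
        power.assign(rows + 1, 1);
        for (std::size_t r = 1; r <= rows; ++r) power[r] = power[r - 1] * kBase;
        prefix.assign(cols, std::vector<std::uint64_t>(rows + 1, 0));
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                prefix[c][r + 1] = prefix[c][r] * kBase
                                 + static_cast<unsigned char>(m[r][c]) + 1;
    }

    // Hash of rows [top, top + len) of column c, computed in O(1).
    std::uint64_t hash(std::size_t c, std::size_t top, std::size_t len) const {
        return prefix[c][top + len] - prefix[c][top] * power[len];
    }
};
```

Two hashers built with the same base produce comparable values, so a column of A can be compared with a column segment of B by comparing `hash(...)` results. Equal hashes only suggest equality; whether that is acceptable is the "fully trust" caveat above.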

What Ruzzle board contains the most unique words?

For smart phones, there is this game called Ruzzle.
It's a word finding game.
Quick Explanation:
The game board is a 4x4 grid of letters.
You start from any cell and try to spell a word by dragging up, down, left, right, or diagonal.
The board doesn't wrap, and you can't reuse letters you've already selected.
On average, my friend and I find about 40 words, and at the end of the round, the game informs you of how many possible words you could have gotten. This number is usually about 250 - 350.
We are wondering what board would yield the highest number of possible words.
How would I go about finding the optimal board?
I've written a program in C that takes 16 characters and outputs all the appropriate words.
Testing over 80,000 words, it takes about a second to process.
The Problem:
The number of game board permutations is 26^16.
That's 43608742899428874059776 (43 sextillion).
I need some kind of heuristic.
Should I skip all boards that contain z, q, x, etc., because they are expected to not have as many words? I wouldn't want to exclude a letter without being certain.
There are also 4 duplicates of every board, because rotating the board will still give the same results.
But even with these restrictions, I don't think I have enough time in my life to find the answer.
Maybe board generation isn't the answer.
Is there a quicker way to find the answer looking at the list of words?
tl;dr:
S E R O
P I T S
L A N E
S E R G
or any of its reflections.
This board contains 1212 words (and as it turns out, you can exclude 'z', 'q' and 'x').
First things first: it turns out you're using the wrong dictionary. After not getting exact matches with Ruzzle's word count, I looked into it, and it seems Ruzzle uses a dictionary called TWL06, which has around 180,000 words. Don't ask me what it stands for, but it's freely available as a txt file.
I also wrote code to find all possible words on a given 16-character board, as follows. It builds the dictionary into a tree structure, and then pretty much just walks the board recursively for as long as there are words to be found. It prints them in order of length. Uniqueness is maintained by the STL set structure.
#include <cstdlib>
#include <ctime>
#include <map>
#include <string>
#include <set>
#include <algorithm>
#include <fstream>
#include <iostream>
using namespace std;
struct TreeDict {
bool existing;
map<char, TreeDict> sub;
TreeDict() {
existing = false;
}
TreeDict& operator=(const TreeDict &a) {
existing = a.existing;
sub = a.sub;
return *this;
}
void insert(string s) {
if(s.size() == 0) {
existing = true;
return;
}
sub[s[0]].insert(s.substr(1));
}
bool exists(string s = "") {
if(s.size() == 0)
return existing;
if(sub.find(s[0]) == sub.end())
return false;
return sub[s[0]].exists(s.substr(1));
}
TreeDict* operator[](char alpha) {
if(sub.find(alpha) == sub.end())
return NULL;
return &sub[alpha];
}
};
TreeDict DICTIONARY;
set<string> boggle_h(const string board, string word, int index, int mask, TreeDict *dict) {
if(index < 0 || index >= 16 || (mask & (1 << index)))
return set<string>();
word += board[index];
mask |= 1 << index;
dict = (*dict)[board[index]];
if(dict == NULL)
return set<string>();
set<string> rt;
if((*dict).exists())
rt.insert(word);
if((*dict).sub.empty())
return rt;
if(index % 4 != 0) {
set<string> a = boggle_h(board, word, index - 4 - 1, mask, dict);
set<string> b = boggle_h(board, word, index - 1, mask, dict);
set<string> c = boggle_h(board, word, index + 4 - 1, mask, dict);
rt.insert(a.begin(), a.end());
rt.insert(b.begin(), b.end());
rt.insert(c.begin(), c.end());
}
if(index % 4 != 3) {
set<string> a = boggle_h(board, word, index - 4 + 1, mask, dict);
set<string> b = boggle_h(board, word, index + 1, mask, dict);
set<string> c = boggle_h(board, word, index + 4 + 1, mask, dict);
rt.insert(a.begin(), a.end());
rt.insert(b.begin(), b.end());
rt.insert(c.begin(), c.end());
}
set<string> a = boggle_h(board, word, index + 4, mask, dict);
set<string> b = boggle_h(board, word, index - 4, mask, dict);
rt.insert(a.begin(), a.end());
rt.insert(b.begin(), b.end());
return rt;
}
set<string> boggle(string board) {
set<string> words;
for(int i = 0; i < 16; i++) {
set<string> a = boggle_h(board, "", i, 0, &DICTIONARY);
words.insert(a.begin(), a.end());
}
return words;
}
void buildDict(string file, TreeDict &dict = DICTIONARY) {
ifstream fstr(file.c_str());
string s;
if(fstr.is_open()) {
// Testing the extraction itself avoids processing the last word twice at EOF.
while(fstr >> s) {
dict.insert(s);
}
fstr.close();
}
}
struct lencmp {
bool operator()(const string &a, const string &b) const {
if(a.size() != b.size())
return a.size() > b.size();
return a < b;
}
};
int main() {
srand(time(NULL));
buildDict("/Users/XXX/Desktop/TWL06.txt");
set<string> a = boggle("SEROPITSLANESERG");
set<string, lencmp> words;
words.insert(a.begin(), a.end());
set<string>::iterator it;
for(it = words.begin(); it != words.end(); it++)
cout << *it << endl;
cout << words.size() << " words." << endl;
}
Randomly generating boards and testing them didn't turn out to be too effective, as expected. I didn't really bother running that for long, but I'd be surprised if those boards crossed 200 words. Instead, I changed the board generation to produce boards with letters distributed in proportion to their frequency in TWL06, implemented with a quick cumulative frequency table (the frequencies were reduced by a factor of 100), below.
string randomBoard() {
string board = "";
for(int i = 0; i < 16; i++)
board += (char)('A' + rand() % 26);
return board;
}
char distLetter() {
int x = rand() % 15833;
if(x < 1209) return 'A';
if(x < 1510) return 'B';
if(x < 2151) return 'C';
if(x < 2699) return 'D';
if(x < 4526) return 'E';
if(x < 4726) return 'F';
if(x < 5161) return 'G';
if(x < 5528) return 'H';
if(x < 6931) return 'I';
if(x < 6957) return 'J';
if(x < 7101) return 'K';
if(x < 7947) return 'L';
if(x < 8395) return 'M';
if(x < 9462) return 'N';
if(x < 10496) return 'O';
if(x < 10962) return 'P';
if(x < 10987) return 'Q';
if(x < 12111) return 'R';
if(x < 13613) return 'S';
if(x < 14653) return 'T';
if(x < 15174) return 'U';
if(x < 15328) return 'V';
if(x < 15452) return 'W';
if(x < 15499) return 'X';
if(x < 15757) return 'Y';
return 'Z'; // x < 15833 always holds here, so every path returns a value
}
string distBoard() {
string board = "";
for(int i = 0; i < 16; i++)
board += distLetter();
return board;
}
This was significantly more effective, very easily achieving 400+ word boards. I left it running (for longer than I intended), and after checking over a million boards, the highest found was around 650 words. This was still essentially random generation, and that has its limits.
Instead, I opted for a greedy maximisation strategy, wherein I'd take a board and make a small change to it, and then commit the change only if it increased the word count.
string changeLetter(string x) {
int y = rand() % 16;
x[y] = distLetter();
return x;
}
string swapLetter(string x) {
int y = rand() % 16;
int z = rand() % 16;
char w = x[y];
x[y] = x[z];
x[z] = w;
return x;
}
string change(string x) {
if(rand() % 2)
return changeLetter(x);
return swapLetter(x);
}
int main() {
srand(time(NULL));
buildDict("/Users/XXX/Desktop/TWL06.txt");
string board = "SEROPITSLANESERG";
int locmax = boggle(board).size();
for(int j = 0; j < 5000; j++) {
int changes = 1;
string board2 = board;
for(int k = 0; k < changes; k++)
board2 = change(board2); // apply each change on top of the previous one
int loc = boggle(board2).size();
if(loc >= locmax && board != board2) {
j = 0;
board = board2;
locmax = loc;
}
}
}
This very rapidly got me 1000+ word boards, with generally similar letter patterns, despite randomised starting points. What leads me to believe that the board given is the best possible board is how it, or one of its various reflections, turned up repeatedly within the first 100-odd attempts at maximising a random board.
The biggest reason for skepticism is the greediness of this algorithm, which could somehow lead to it missing out on better boards. The small changes made are quite flexible in their outcomes; that is, they have the power to completely transform a grid from its (randomised) start position. The numbers of possible changes, 26*16 = 416 for the fresh letter and 16*15 = 240 for the letter swap, are both significantly less than 5000, the number of consecutive discarded changes allowed.
The fact that the program was able to repeat this board output within the first 100-odd runs implies that the number of local maxima is relatively small, and that the probability of an undiscovered maximum is low.
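Since the same board can turn up in any of its rotations or reflections, it can help to reduce each result to a canonical form before comparing boards. Here is a small sketch of one way to do that; the helper names `rotate90`, `mirrorBoard` and `canonicalBoard` are mine, and boards are assumed to be 16-character row-major strings, as in the code above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Rotate a 4x4 row-major board 90 degrees clockwise.
std::string rotate90(const std::string& b) {
    assert(b.size() == 16);
    std::string r(16, ' ');
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            r[col * 4 + (3 - row)] = b[row * 4 + col];
    return r;
}

// Mirror a 4x4 row-major board left to right.
std::string mirrorBoard(const std::string& b) {
    assert(b.size() == 16);
    std::string r(16, ' ');
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            r[row * 4 + (3 - col)] = b[row * 4 + col];
    return r;
}

// Lexicographically smallest of the 8 rotations/reflections of the board.
std::string canonicalBoard(std::string b) {
    std::string best = b;
    for (int flip = 0; flip < 2; ++flip) {
        for (int rot = 0; rot < 4; ++rot) {
            best = std::min(best, b);
            b = rotate90(b);  // after 4 rotations b is back where it started
        }
        b = mirrorBoard(b);
    }
    return best;
}
```

Two boards have the same word list under this game's adjacency rules whenever their `canonicalBoard` values are equal, so counting distinct canonical forms gives a rough picture of how many different local maxima the search actually reaches.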
Although the greedy approach seemed intuitively right (it shouldn't really be less possible to reach a given grid with the delta changes from a random board), and the two possible changes, a swap and a fresh letter, do seem to encapsulate all possible improvements, I changed the program to allow it to make more changes before checking for an increase. This again returned the same board, repeatedly.
int main() {
srand(time(NULL));
buildDict("/Users/XXX/Desktop/TWL06.txt");
int glomax = 0;
int i = 0;
while(true) {
string board = distBoard();
int locmax = boggle(board).size();
for(int j = 0; j < 500; j++) {
string board2 = board;
for(int k = 0; k < 2; k++)
board2 = change(board2); // apply the second change on top of the first
int loc = boggle(board2).size();
if(loc >= locmax && board != board2) {
j = 0;
board = board2;
locmax = loc;
}
}
if(glomax <= locmax) {
glomax = locmax;
cout << board << " " << glomax << " words." << endl;
}
if(++i % 10 == 0)
cout << i << endl;
}
}
Having iterated over this loop around 1000 times, with this particular board configuration showing up ~10 times, I'm pretty confident that this is, for now, the Ruzzle board with the most unique words, until the English language changes.
Interesting problem. I see (at least, but mainly) two approaches:
One is to try the hard way and pack in as many word-forming letters (in all directions) as possible, based on a dictionary. As you said, there are many possible combinations, and that route requires a well-elaborated and complex algorithm to reach something tangible.
There is another, "loose" solution based on probabilities, which I like more. You suggested removing some low-frequency letters to maximize the board yield. An extension of this could be to use more of the letters that appear most often in the dictionary.
A further step could be:
Based on the 80k dictionary D, you find, for each letter l1 of our set L of 26 letters, the probability that a letter l2 precedes or follows l1. This is an L x L array of probabilities and is pretty small, so you could even extend it to L x L x L, i.e. given l1 and l2, the probability that l3 fits. This is a bit more complex if the algorithm is to estimate accurate probabilities, as the sum of the probabilities depends on the relative position of the 3 letters; for instance, in a 'triangle' configuration (e.g. positions (3,3), (3,4) and (3,5)) the result probably yields less than when the letters are aligned [just a supposition]. Why not go up to L x L x L x L, which will require some optimizations...
Then you distribute a few high-frequency letters (say 4~6) randomly on the board (each having at least 1 blank cell around it in at least 5 of the 8 possible directions) and use your L x L [x L] probability arrays to complete it. That is, based on the existing letters, the next cell is filled with a letter whose probability is high given the configuration [again, letters sorted by descending probability, with randomness used if two letters are in a close tie].
For instance, taking only the horizontal configuration, suppose the following letters are in place and we want to find the best 2 letters between ER and TO:
...ER??TO...
Using L x L, a loop like the one below finds the single best pair of letters (l1 and l2 are our two missing letters). bestchoice and bestproba could instead be arrays that keep, say, the 10 best choices.
Note: there is no need to keep the proba in the range [0,1] in this case. We can simply sum the probas; the sum is not itself a probability, but its magnitude is what matters. A mathematical probability could be something like p = ( p(l0,l1) + p(l2,l3) ) / 2, where l0 and l3 are the R and T in our L x L example.
bestproba = 0
bestchoice = (none, none)
for letter l1 in L
for letter l2 in L
p = proba('R',l1) + proba(l2,'T')
if ( p > bestproba )
bestproba = p
bestchoice = (l1, l2)
fi
rof
rof
the algorithm can take more factors into account, and needs to take the vertical and diagonals into account as well. With L x L x L, more letters in more directions are taken into account, like ER?,R??,??T,?TO - this requires to think more through the algorithm - maybe starting with L x L can give an idea about the relevancy of this algorithm.
Note that a lot of this may be pre-calculated, and the L x L array is of course one of them.
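To make the L x L idea a bit more concrete, here is a minimal C++ sketch of the precalculated table and of the loop above. It uses raw bigram counts rather than normalised probabilities (as noted, only the relative size matters here), and the names `BigramTable`, `countBigrams` and `bestPair` are invented for this example. Like the pseudocode, it scores only R followed by l1 and l2 followed by T; how well l1 and l2 go together is ignored.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// table[a][b] = number of times letter b directly follows letter a in the word list.
using BigramTable = std::array<std::array<long, 26>, 26>;

BigramTable countBigrams(const std::vector<std::string>& words) {
    BigramTable table{};  // zero-initialised
    for (const std::string& w : words) {
        for (std::size_t i = 0; i + 1 < w.size(); ++i) {
            int a = std::toupper(static_cast<unsigned char>(w[i])) - 'A';
            int b = std::toupper(static_cast<unsigned char>(w[i + 1])) - 'A';
            if (a >= 0 && a < 26 && b >= 0 && b < 26) ++table[a][b];
        }
    }
    return table;
}

// Best (l1, l2) to place between `left` and `right`, e.g. between R and T in ER??TO.
// Returns ('?', '?') if no pair scores above zero.
std::pair<char, char> bestPair(const BigramTable& table, char left, char right) {
    assert(left >= 'A' && left <= 'Z' && right >= 'A' && right <= 'Z');
    long bestScore = 0;
    std::pair<char, char> best('?', '?');
    for (int l1 = 0; l1 < 26; ++l1) {
        for (int l2 = 0; l2 < 26; ++l2) {
            long score = table[left - 'A'][l1] + table[l2][right - 'A'];
            if (score > bestScore) {
                bestScore = score;
                best = std::make_pair(static_cast<char>('A' + l1),
                                      static_cast<char>('A' + l2));
            }
        }
    }
    return best;
}
```

In practice `countBigrams` would be fed the whole dictionary once, and `bestPair` (or a variant that keeps the top 10 candidates) would be called for each gap on the board, for each direction.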