Find the longest blob prefix in a blob - C++

I am trying to produce/get an efficient implementation in C++ for the following problem:
I have two blobs (const char *data, size_t length); I call them "blob1" and "blob2". Now I would like to get the longest prefix of "blob2" in "blob1". If the longest prefix occurs multiple times in "blob1", I would like to get the occurrence with the biggest index.
Here is an example (the blobs here are just ASCII strings, so the example is easier to read):
blob1 = HELLO LOOO HELOO LOOO LOO JU
blob2 = LOOO TUS
The following are all the valid prefixes of blob2 (note the fifth ends with a space):
{ "L", "LO", "LOO", "LOOO", "LOOO ", "LOOO T", "LOOO TU", "LOOO TUS" }
The longest prefix of blob2 in blob1 is LOOO. It occurs twice:
HELLO [LOOO] HELOO [LOOO] LOO JU
So I would like to get the index of the second one, which would be 17, and the length of the prefix, which would be 4.
Unfortunately, blob1 and blob2 change many times, so it is probably too slow to build a tree or some other complex structure.
Do you know a good algorithm to solve this problem?
Thank you.
Cheers
Kevin

I don't know if this is the best algorithm to solve this (I'm sure it is not), but I guess it is a good one. The idea is simple: start by searching for the shortest token from blob2 in blob1. When you find a match, try to see if you can match bigger tokens at this position. If you can, update your token length.
Continue your search from where you last stopped, but this time searching for a token of the updated length from blob2. When you find a match, again try to match bigger tokens at this position and update the token length. Repeat this procedure until the end of your buffer.
Below is a simple, complete program showing an implementation.
#include <algorithm>
#include <cstring>
#include <iostream>
#include <vector>
/////////////////////0123456789012345678901234567
const char str1[] = "HELLO LOOO HELOO LOOO LOO JU";
const char str2[] = "LOOO TUS";
int main()
{
// construct the vectors directly from the character ranges
std::vector<char> blob1(str1, str1 + strlen(str1));
std::vector<char> blob2(str2, str2 + strlen(str2));
auto next = blob1.begin();
auto tokenLength = 1;
auto position = -1;
while ( std::distance(next, blob1.end()) >= tokenLength ) {
auto current = std::search(next,
blob1.end(),
blob2.begin(),
std::next(blob2.begin(), tokenLength));
if (current == blob1.end() )
break;
position = std::distance(blob1.begin(), current);
next = std::next(current, 1);
// try to extend the match one character at a time,
// staying inside both buffers
for (auto i = tokenLength;
     std::next(blob2.begin(), i) < blob2.end() &&
     std::next(current, i) < blob1.end(); ++i) {
    if (*std::next(current, i) != *std::next(blob2.begin(), i))
        break;
    ++tokenLength;
}
}
std::cout << "Index: " << position << ", length: " << tokenLength << std::endl;
}

Related

Searching for an exact string match in an (arbitrarily large) stream - C++

I am building a simple multi-server for string matching. I handle multiple clients at the same time by using sockets and select. The only job the server does is this: a client connects to the server and sends a needle (of size less than 10 GB) and a haystack (of arbitrary size) as a stream through a network socket. Needle and haystack are arbitrary binary data.
The server needs to search the haystack for all occurrences of the needle (as an exact string match) and send the number of needle matches back to the client. The server needs to process clients on the fly and be able to handle any input in a reasonable time (that is, the search algorithm has to have linear time complexity).
To do this I obviously need to split the haystack into small parts (possibly smaller than the needle) in order to process them as they come through the network socket. That is, I need a search algorithm that can handle a haystack that is split into parts and search in it the same way strstr(...) does.
I could not find any standard C or C++ library function, nor a Boost library object, that could handle a string by parts. If I am not mistaken, the algorithms in strstr(), string.find() and Boost's searching/knuth_morris_pratt.hpp can only handle the search when the whole haystack is in a continuous block of memory. Or is there some trick I could use to search a string by parts that I am missing? Do you know of any C/C++ library that can cope with such large needles and haystacks, i.e. that can handle haystack streams or search a haystack by parts?
I did not find any useful library by googling, and hence I was forced to create my own variation of the Knuth-Morris-Pratt algorithm that is able to remember its own state (shown below). However, I do not find it to be an optimal solution, as a well-tuned string searching algorithm would surely perform better in my opinion, and it would be one less worry for debugging later.
So my question is:
Is there some more elegant way to search a large haystack stream by parts, other than creating my own search algorithm? Is there any trick for using the standard C string library for this? Is there some C/C++ library that is specialized for this kind of task?
Here is (part of) the code of my modified KMP algorithm:
#include <cstdlib>
#include <cstring>
#include <cstdio>
class knuth_morris_pratt {
const char* const needle;
const size_t needle_len;
const int* const lps; // the longest proper suffix table (skip table)
// suffix_len is an offset of the longest haystack_part suffix matching
// some prefix of the needle. suffix_len must be shorter than needle_len.
// The offset is defined by the last matching character in the needle.
size_t suffix_len;
size_t match_count; // a number of needles found in haystack
public:
inline knuth_morris_pratt(const char* needle, size_t len) :
needle(needle), needle_len(len),
lps( build_lps_array() ), suffix_len(0),
match_count(len == 0 ? 1 : 0) { }
inline ~knuth_morris_pratt() { free((void*)lps); }
void search_part(const char* haystack_part, size_t hp_len); // processes a given part of the haystack stream
inline size_t get_match_count() { return match_count; }
private:
const int* build_lps_array();
};
// Worst case complexity: linear space, linear time
// see: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
// see article: KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R., 1977, Fast pattern matching in strings
void knuth_morris_pratt::search_part(const char* haystack_part, size_t hp_len) {
if(needle_len == 0) {
match_count += hp_len;
return;
}
const char* hs = haystack_part;
size_t i = 0; // index for hs[]
size_t j = suffix_len; // index for needle[]
while (i < hp_len) {
if (needle[j] == hs[i]) {
j++;
i++;
}
if (j == needle_len) {
// a needle found
match_count++;
j = lps[j - 1];
}
else if (i < hp_len && needle[j] != hs[i]) {
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j - 1];
else
i = i + 1;
}
}
suffix_len = j;
}
const int* knuth_morris_pratt::build_lps_array() {
if (needle_len == 0)
return NULL;
int* const new_lps = (int*)malloc(needle_len * sizeof(int));
// check_cond_fatal(new_lps != NULL, "Unable to allocate memory in knuth_morris_pratt(..)");
// length of the previous longest prefix suffix
size_t len = 0;
new_lps[0] = 0; // lps[0] is always 0
// the loop calculates lps[i] for i = 1 to M-1
size_t i = 1;
while (i < needle_len) {
if (needle[i] == needle[len]) {
len++;
new_lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
// This is tricky. Consider the example.
// AAACAAAA and i = 7. The idea is similar
// to search step.
if (len != 0) {
len = new_lps[len - 1];
// Also, note that we do not increment
// i here
}
else // if (len == 0)
{
new_lps[i] = 0;
i++;
}
}
}
return new_lps;
}
int main()
{
const char* needle = "lorem";
const char* p1 = "sit voluptatem accusantium doloremque laudantium qui dolo";
const char* p2 = "rem ipsum quia dolor sit amet";
const char* p3 = "dolorem eum fugiat quo voluptas nulla pariatur?";
knuth_morris_pratt searcher(needle, strlen(needle));
searcher.search_part(p1, strlen(p1));
searcher.search_part(p2, strlen(p2));
searcher.search_part(p3, strlen(p3));
printf("%d \n", (int)searcher.get_match_count());
return 0;
}
You can have a look at BNDM, which has the same performance as KMP:
O(m) for preprocessing
O(n) for matching.
It is used by nrgrep, whose sources can be found here (they include C sources).
C sources for the BNDM algorithm are here.
See here for more information.
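For reference, here is a minimal BNDM sketch (a bit-parallel backward search; my own illustration, not taken from the nrgrep sources). It assumes the needle is at most 64 bytes so its mask fits in one machine word, and it searches a single in-memory buffer; adapting it to a stream would still require carrying state across part boundaries.
#include <cstdint>
#include <vector>

// BNDM: find all occurrences of pat[0..m) in txt[0..n), m <= 64
std::vector<size_t> bndm_search(const char* pat, size_t m,
                                const char* txt, size_t n)
{
    std::vector<size_t> hits;
    if (m == 0 || m > 64 || m > n)
        return hits;
    // B[c]: bit (m-1-i) is set iff pat[i] == c
    uint64_t B[256] = {0};
    for (size_t i = 0; i < m; ++i)
        B[(unsigned char)pat[i]] |= 1ULL << (m - 1 - i);
    const uint64_t all = (m == 64) ? ~0ULL : ((1ULL << m) - 1);
    size_t pos = 0;
    while (pos + m <= n) {
        size_t j = m, last = m;
        uint64_t D = all;
        while (D != 0) {
            // read the window right to left
            D &= B[(unsigned char)txt[pos + j - 1]];
            --j;
            if (D & (1ULL << (m - 1))) {   // window suffix == pattern prefix
                if (j > 0)
                    last = j;              // remember shift candidate
                else {
                    hits.push_back(pos);   // full match
                    break;
                }
            }
            if (j == 0)
                break;
            D <<= 1;
        }
        pos += last;                       // safe shift
    }
    return hits;
}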
If I have understood your problem correctly, you want to check whether a large std::string received part by part contains a substring.
If that is the case, I think you can store, at each iteration, the overlapping section between two contiguous received packets. Then you just have to check, at each iteration, whether either the overlap or the packet contains the desired pattern.
In the example below, I use the following contains() function to search for a pattern in a std::string:
bool contains(const std::string & str, const std::string & pattern)
{
bool found(false);
if(!pattern.empty() && (pattern.length() <= str.length()))
{
for(size_t i = 0; !found && (i <= str.length()-pattern.length()); ++i)
{
if((str[i] == pattern[0]) && (str.substr(i, pattern.length()) == pattern))
{
found = true;
}
}
}
return found;
}
Example:
std::string pattern("something"); // The pattern we want to find
std::string end_of_previous_packet(""); // The first part of overlapping section
std::string beginning_of_current_packet(""); // The second part of overlapping section
std::string overlap; // The string to store the overlap at each iteration
bool found(false);
while(!found && !all_data_received()) // stop condition
{
// Get the current packet
std::string packet = receive_part();
// Set the beginning of the current packet
beginning_of_current_packet = packet.substr(0, pattern.length());
// Build the overlap
overlap = end_of_previous_packet + beginning_of_current_packet;
// If the overlap or the packet contains the pattern, we found a match
if(contains(overlap, pattern) || contains(packet, pattern))
found = true;
// Set the end of the previous packet for the next iteration
// (guard: the packet may be shorter than the pattern)
end_of_previous_packet = (packet.length() >= pattern.length())
    ? packet.substr(packet.length() - pattern.length())
    : packet;
}
Of course, in this example I assumed that the method receive_part() already exists, and the same goes for the all_data_received() function. It is just an example to illustrate the idea.
I hope it will help you find a solution.

How to cut off the part of a string which every string in a collection has

My current problem is the following:
I have a std::vector of full path names to files.
Now I want to cut off the common prefix of all the strings.
Example
If I have these 3 strings in the vector:
/home/user/foo.txt
/home/user/bar.txt
/home/baz.txt
I would like to cut off /home/ from every string in the vector.
Question
Is there any method to achieve this in general?
I want an algorithm that drops the common prefix of all the strings.
I currently only have an idea which solves this problem in O(n·m), with n strings and m the longest string length, by just going through every string with every other string, character by character.
Is there a faster or more elegant way of solving this?
This can be done entirely with std:: algorithms.
Synopsis:
1. Sort the input range if not already sorted. The first and last paths in the sorted range will be the most dissimilar. Best case is O(N), worst case O(N log N).
2. Use std::mismatch to determine the largest common sequence between the two most dissimilar paths [insignificant].
3. Run through each path, erasing the first COUNT characters, where COUNT is the number of characters in the longest common sequence. O(N).
Best-case time complexity: O(N); worst case O(N log N) (can someone check that?)
#include <iostream>
#include <algorithm>
#include <string>
#include <vector>
std::string common_substring(const std::string& l, const std::string& r)
{
return std::string(l.begin(),
std::mismatch(l.begin(), l.end(),
r.begin(), r.end()).first);
}
std::string mutating_common_substring(std::vector<std::string>& range)
{
if (range.empty())
return std::string();
else
{
if (not std::is_sorted(range.begin(), range.end()))
std::sort(range.begin(), range.end());
return common_substring(range.front(), range.back());
}
}
std::vector<std::string> chop(std::vector<std::string> samples)
{
auto str = mutating_common_substring(samples);
for (auto& s : samples)
{
s.erase(s.begin(), std::next(s.begin(), str.size()));
}
return samples;
}
int main()
{
std::vector<std::string> samples = {
"/home/user/foo.txt",
"/home/user/bar.txt",
"/home/baz.txt"
};
samples = chop(std::move(samples));
for (auto& s : samples)
{
std::cout << s << std::endl;
}
}
expected:
baz.txt
user/bar.txt
user/foo.txt
Here's an alternate common_substring which does not require a sort (it needs #include <numeric> for std::accumulate). Time complexity is in theory O(N), but whether it's faster in practice you'd have to check:
std::string common_substring(const std::vector<std::string>& range)
{
if (range.empty())
{
return {};
}
return std::accumulate(std::next(range.begin(), 1), range.end(), range.front(),
[](auto const& best, const auto& sample)
{
return common_substring(best, sample);
});
}
update:
Elegance aside, this is probably the fastest way since it avoids any memory allocations, performing all transformations in-place. For most architectures and sample sizes, this will matter more than any other performance consideration.
#include <iostream>
#include <vector>
#include <string>
void reduce_to_common(std::string& best, const std::string& sample)
{
best.erase(std::mismatch(best.begin(), best.end(),
sample.begin(), sample.end()).first,
best.end());
}
void remove_common_prefix(std::vector<std::string>& range)
{
if (range.size())
{
auto iter = range.begin();
auto best = *iter;
for ( ; ++iter != range.end() ; )
{
reduce_to_common(best, *iter);
}
auto prefix_length = best.size();
for (auto& s : range)
{
s.erase(s.begin(), std::next(s.begin(), prefix_length));
}
}
}
int main()
{
std::vector<std::string> samples = {
"/home/user/foo.txt",
"/home/user/bar.txt",
"/home/baz.txt"
};
remove_common_prefix(samples);
for (auto& s : samples)
{
std::cout << s << std::endl;
}
}
You have to look at every string in the list. However, you don't need to compare all the characters in every string. The common prefix can only get shorter, so you only need to compare with "the common prefix so far". I don't think this changes the big-O complexity, but it will make quite a difference to the actual speed.
Also, these look like file names. Are they sorted (bearing in mind that many filesystems tend to return things in sorted order)? If so, you only need to consider the first and last elements. If they are probably or mostly ordered, then compute the common prefix of the first and last, and then iterate through all the other strings, shortening the prefix further as necessary. A sketch of the sorted case follows.
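A minimal sketch of the sorted case, assuming the vector is already in lexicographic order:
#include <algorithm>
#include <string>
#include <vector>

// common prefix of a sorted range: compare only the first and last elements
std::size_t sorted_common_prefix_length(const std::vector<std::string>& v)
{
    if (v.empty())
        return 0;
    const std::string& first = v.front();
    const std::string& last  = v.back();
    auto mm = std::mismatch(first.begin(), first.end(),
                            last.begin(), last.end());
    return static_cast<std::size_t>(mm.first - first.begin());
}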
You just have to iterate over every string. You can avoid needlessly iterating over the full length of each string by exploiting the fact that the prefix can only get shorter:
#include <iostream>
#include <string>
#include <vector>
std::string common_prefix(const std::vector<std::string> &ss) {
if (ss.empty())
// no prefix
return "";
std::string prefix = ss[0];
for (size_t i = 1; i < ss.size(); i++) {
size_t c = 0; // index of the first character at which the strings differ
for (; c < prefix.length(); c++) {
if (prefix[c] != ss[i][c]) {
// strings differ from character c on
break;
}
}
if (c == 0)
// no common prefix
return "";
// the prefix is only up to character c-1, so resize prefix
prefix.resize(c);
}
return prefix;
}
void strip_common_prefix(std::vector<std::string> &ss) {
std::string prefix = common_prefix(ss);
if (prefix.empty())
// no common prefix, nothing to do
return;
// drop the common part, which are always the first prefix.length() characters
for (std::string &s: ss) {
s = s.substr(prefix.length());
}
}
int main()
{
std::vector<std::string> ss { "/home/user/foo.txt", "/home/user/bar.txt", "/home/baz.txt"};
strip_common_prefix(ss);
for (std::string &s: ss)
std::cout << s << "\n";
}
Drawing on the hints in Martin Bonner's answer, you can implement a more efficient algorithm if you have more prior knowledge of your input.
In particular, if you know your input is sorted, it suffices to compare the first and last strings (see Richard's answer).
i - Find the file which has the least folder depth (e.g. baz.txt); its root path is /home.
ii - Then go through the other strings to see if they start with that root.
iii - If so, then remove the root from all the strings (a sketch follows below).
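A minimal sketch of this heuristic (my own illustration; note it only strips the root when every path actually starts with it, and it is not guaranteed to find the longest common prefix in general):
#include <algorithm>
#include <string>
#include <vector>

void strip_root(std::vector<std::string>& paths)
{
    if (paths.empty()) return;
    // i - path with the fewest '/' separators
    auto shallowest = *std::min_element(paths.begin(), paths.end(),
        [](const std::string& a, const std::string& b) {
            return std::count(a.begin(), a.end(), '/')
                 < std::count(b.begin(), b.end(), '/');
        });
    std::string root = shallowest.substr(0, shallowest.rfind('/') + 1);
    // ii - do all paths start with that root?
    bool all = std::all_of(paths.begin(), paths.end(),
        [&](const std::string& p) {
            return p.compare(0, root.size(), root) == 0;
        });
    // iii - if so, remove it everywhere
    if (all)
        for (auto& p : paths)
            p.erase(0, root.size());
}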
Start with std::size_t index=0;. Scan the list to see if the characters at that index all match (note: past-the-end does not match). If they do, advance index and repeat.
When done, index will have the value of the length of the prefix.
At this point, I'd advise you to write or find a string_view type. If you do, simply create a string_view for each of your strings str spanning index to str.size().
Overall cost: O(|prefix|*N+N), which is also the cost to confirm that your answer is correct.
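A minimal sketch of this column-by-column scan, using C++17's std::string_view (the answer predates it and suggests writing your own):
#include <string>
#include <string_view>
#include <vector>

// scan column by column until some string ends or differs
std::size_t common_prefix_length(const std::vector<std::string>& v)
{
    if (v.empty())
        return 0;
    for (std::size_t index = 0; ; ++index) {
        if (index >= v.front().size())
            return index;
        char c = v.front()[index];
        for (const auto& s : v)
            if (index >= s.size() || s[index] != c)
                return index;
    }
}

// cheap, non-mutating views of the tails
std::vector<std::string_view> tails(const std::vector<std::string>& v)
{
    std::size_t index = common_prefix_length(v);
    std::vector<std::string_view> out;
    for (const auto& s : v)
        out.emplace_back(std::string_view(s).substr(index));
    return out;
}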
If you don't want to write a string_view, simply call str.erase(str.begin(), str.begin()+index) on each str in your vector.
Overall cost is O(|total string length|+N). The prefix has to be visited in order to confirm it, then the tail of the string has to be rewritten.
Now the cost of this breadth-first scan is locality: you are touching memory all over the place. It will probably be more efficient in practice to do it in chunks: scan the first K strings up to length Q and find their common prefix, then combine that common prefix with the next block. This won't change the O-notation, but will improve locality of memory reference.
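// hardcoded for this example: erase the leading "/home/" (6 characters)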
for(vector<string>::iterator itr=V.begin(); itr!=V.end(); ++itr)
itr->erase(0,6);

Algorithm for making an optimised string from substrings

Suppose I have a collection of substrings, for example:
std::vector<std::string> a = {"cat", "sensitive", "ate", "energy", "tense"};
Then the output for this should be as follows:
catensesensitivenergy
How can I do this?
This problem is known as the shortest common superstring problem and it is NP-hard, so if you need an exact solution you cannot do much better than trying all possibilities and choosing the best one.
One possible exponential solution is to generate all permutations of the input strings, find the shortest common superstring greedily for each permutation (a permutation specifies the order of the strings, and it is possible to prove that for a fixed order the greedy algorithm always works correctly), and choose the best one.
Using user2040251's suggestion:
#include <string>
#include <vector>
#include <iostream>
#include <algorithm>
std::string merge_strings( const std::vector< std::string > & pool )
{
std::string retval;
for( const auto & s : pool )
if( retval.empty() )
retval.append( s );
else if( std::search( retval.begin(), retval.end(), s.begin(), s.end() ) == retval.end() )
{
size_t len = std::min( retval.size(), s.size() );
for( ; len; --len )
if( retval.substr( retval.size() - len ) == s.substr( 0, len ) )
{
retval.append( s.substr( len ) );
break;
}
if( !len )
retval.append( s );
}
return retval;
}
std::string shortest_common_supersequence( std::vector< std::string > & pool )
{
std::sort( pool.begin(), pool.end() );
std::string buffer;
std::string best_reduction = merge_strings( pool );
while( std::next_permutation( pool.begin(), pool.end() ) )
{
buffer = merge_strings( pool );
if( buffer.size() < best_reduction.size() )
best_reduction = buffer;
}
return best_reduction;
}
int main( int argc, char ** argv )
{
std::vector< std::string > a{"cat","sensitive","ate","energy","tense"};
std::vector< std::string > b{"cat","sensitive","ate","energy","tense","sit"};
std::vector< std::string > c{"personal","ate","energy","tense","gyroscope"};
std::cout << "best a --> \"" << shortest_common_supersequence( a ) << "\"\n";
std::cout << "best b --> \"" << shortest_common_supersequence( b ) << "\"\n";
std::cout << "best c --> \"" << shortest_common_supersequence( c ) << "\"\n";
return 0;
}
Output:
best a --> "catensensitivenergy"
best b --> "catensensitivenergy"
best c --> "atensenergyroscopersonal"
Break the problem down and see what we get. Start with only two strings: we must check which suffix of one string is the longest prefix of the other. This gives us the order for the best concatenation. A naive overlap check is sketched below.
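A minimal sketch of that pairwise overlap check (my own illustration; quadratic but simple):
#include <algorithm>
#include <string>

// length of the longest suffix of a that is also a prefix of b
std::size_t overlap(const std::string& a, const std::string& b)
{
    std::size_t max_len = std::min(a.size(), b.size());
    for (std::size_t len = max_len; len > 0; --len)
        if (a.compare(a.size() - len, len, b, 0, len) == 0)
            return len;
    return 0;
}
// e.g. overlap("cat", "tense") == 1, so "cat" + "tense" -> "catense"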
Now, with a set of n words, how do we proceed? We start by building a trie containing every word (a key for each word). If a word is a duplicate of another, we can easily flag it as such while building the prefix tree.
I made a quick implementation of a regular Trie. You can find it here.
We now have the tools to build a directed graph linking the different words according to whether a suffix of the first is a prefix of the second; the weight of an edge is the number of characters the second word adds.
To do so, for each word w of the input set, we must see which words we can reach with a suffix of w:
We walk down the trie using the suffix. We will end up in a node (or not).
From this node, provided it exists, we scan the remaining subtree to see which words are available.
If a given suffix of length l yields a match with a prefix of word w', then we add an edge w → w', with weight length(w') - l.
If such an edge already exists, we just update the weight to keep the lowest.
From there, the graph is set and we must find the shortest path that runs through every vertex (i.e. word) exactly once. If the graph is complete, this is the Traveling Salesman Problem. Most of the time, the graph won't be complete.
Still, it remains an NP-hard problem. In more "technical" terms, the problem at hand is to find the shortest Hamiltonian path of a digraph.
Note: given a Hamiltonian path (if it exists) with cost C and starting vertex (word) W, the superstring length is given by:
L_super = L_W + C
Note: if two words have no suffix linking them to another word, then the graph is not connected and there is no Hamiltonian path.

How to find the first occurrence of any string given a particular group of strings?

For example, I have 3 strings I am interested in finding in a much larger string. How would I use std algorithms to find the first occurrence of any one of these strings?
std::string str = ...//Some large string to search in.
std::array<std::string, 3> tokens{ "abc", "qwe", "zxc" };
...//Find the next occurrence of either "abc", "qwe" or "zxc" in str, whichever comes first.
...// Process based on the result.
If you are restricted to the Standard Library, or wish for something simple, the simplest alternative is to search for them one at a time and keep the "best" position:
size_t bestIndex = 0;
size_t bestPosition = str.size();
for (size_t i = 0, max = tokens.size(); i < max; ++i) {
size_t const pos = str.find(tokens[i]);
if (pos >= bestPosition) { continue; }
bestPosition = pos;
bestIndex = i;
}
If you have more resources, or wish for more performance, then Aho-Corasick is a good algorithm for searching multiple needles in a single pass. It is definitely more complicated, though; a sketch follows.
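For illustration, a compact Aho-Corasick sketch (my own, not a library API; it reports a match as soon as any token ends, which for equal-length tokens like the ones above is also the earliest starting match):
#include <array>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct AhoCorasick {
    struct Node {
        std::array<int, 256> next;   // transition table
        int fail = 0;                // failure link
        int out = -1;                // index of a pattern ending here
        Node() { next.fill(-1); }
    };
    std::vector<Node> t{1};
    std::vector<std::string> pats;

    void add(std::string p) {
        int v = 0;
        for (unsigned char c : p) {
            if (t[v].next[c] < 0) { t[v].next[c] = (int)t.size(); t.emplace_back(); }
            v = t[v].next[c];
        }
        t[v].out = (int)pats.size();
        pats.push_back(std::move(p));
    }

    void build() {   // BFS: fill failure links, complete all transitions
        std::queue<int> q;
        for (int c = 0; c < 256; ++c) {
            int u = t[0].next[c];
            if (u < 0) t[0].next[c] = 0;
            else { t[u].fail = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            if (t[v].out < 0) t[v].out = t[t[v].fail].out;  // inherit outputs
            for (int c = 0; c < 256; ++c) {
                int u = t[v].next[c];
                if (u < 0) t[v].next[c] = t[t[v].fail].next[c];
                else { t[u].fail = t[t[v].fail].next[c]; q.push(u); }
            }
        }
    }

    // first (position, pattern index) where any pattern ends, or (npos, -1)
    std::pair<size_t, int> first_match(const std::string& s) const {
        int v = 0;
        for (size_t i = 0; i < s.size(); ++i) {
            v = t[v].next[(unsigned char)s[i]];
            if (t[v].out >= 0)
                return {i + 1 - pats[t[v].out].size(), t[v].out};
        }
        return {std::string::npos, -1};
    }
};

int main() {
    AhoCorasick ac;
    for (auto* p : {"abc", "qwe", "zxc"}) ac.add(p);
    ac.build();
    auto [pos, id] = ac.first_match("xxzxcabc");
    std::cout << "token " << id << " at " << pos << "\n";  // token 2 at 2
}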

Efficiently check string for one of several hundred possible suffixes

I need to write a C/C++ function that quickly checks whether a string ends with one of ~1000 predefined suffixes. Specifically, the string is a hostname and I need to check if it belongs to one of several hundred predefined second-level domains.
This function will be called a lot, so it needs to be written as efficiently as possible. Bitwise hacks, etc.: anything goes as long as it turns out fast.
The set of suffixes is predetermined at compile time and doesn't change.
I am thinking of either implementing a variation of Rabin-Karp, or writing a tool that would generate a function with nested ifs and switches custom-tailored to the specific set of suffixes. Since the application in question is 64-bit, to speed up comparisons I could store suffixes of up to 8 bytes in length as a const sorted array and do binary search within it.
Are there any other reasonable options?
If the suffixes don't contain any expansions/rules (like a regex), you could build a trie of the suffixes in reverse order, and then match the string against that. For instance:
suffixes:
foo
bar
bao
reverse order suffix trie:
o -a-b  (matches bao)
  -o-f  (matches foo)
r -a-b  (matches bar)
These can then be used to match your string:
"mystringfoo" -> reverse -> "oofgnirtsym" -> trie match -> foo suffix
You mention that you're looking at second-level domain names only, so even without knowing the precise set of matching domains, you could extract the relevant portion of the input string.
Then simply use a hashtable. Dimension it so that there are no collisions; then you don't need buckets, and lookups are exactly O(1). For small hash types (e.g. 32 bits), you'd want to check that the strings really match. For a 64-bit hash, the probability of another domain colliding with one of the hashes in your table is already so low (on the order of 10^-17) that you can probably live with it. A lookup sketch follows.
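A minimal sketch of the extraction-plus-lookup idea (my own illustration; it uses std::unordered_set with full string keys for simplicity rather than the custom collision-free hash table the answer describes):
#include <string>
#include <unordered_set>

// keep only the last two labels of the hostname,
// e.g. "mail.example.co.uk" -> "co.uk"
std::string last_two_labels(const std::string& host)
{
    size_t last = host.rfind('.');
    if (last == std::string::npos || last == 0)
        return host;
    size_t second = host.rfind('.', last - 1);
    return second == std::string::npos ? host : host.substr(second + 1);
}

bool is_known_sld(const std::string& host,
                  const std::unordered_set<std::string>& domains)
{
    return domains.count(last_two_labels(host)) != 0;
}
// usage: static const std::unordered_set<std::string> domains = {"co.uk", "com.au"};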
I would reverse all of the suffix strings, build a prefix tree of them, and then test the reverse of your input string against that.
I think that building your own automaton would be the most efficient way. It is a sort of your second solution: starting from the finite set of suffixes, it generates an automaton fitted to those suffixes.
I think you can easily use flex to do this, taking care of reversing the input or handling specially the fact that you are looking just for suffixes (purely for efficiency).
By the way, using a Rabin-Karp approach would be efficient too, since your suffixes will be short. You can fill a hashset with all the suffixes needed, and then:
take a string
take the suffix
calculate the hash of the suffix
check if the suffix is in the table
Just create a 26×26 array of sets of domains, e.g. thisArray[0][0] holds the domains that end in 'aa', thisArray[0][1] all the domains that end in 'ab', and so on...
Once you have that, just look up thisArray[second-to-last char of hostname][last char of hostname] to get the candidate domains. If there's more than one at that stage, just brute-force the rest. A sketch of the bucketing follows.
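A minimal sketch of that bucketing scheme (my own illustration; it assumes domains and hostnames end in at least two lowercase 'a'..'z' characters):
#include <array>
#include <string>
#include <vector>

using Buckets = std::array<std::array<std::vector<std::string>, 26>, 26>;

// index domains by their last two characters
void add_domain(Buckets& b, const std::string& d)
{
    if (d.size() >= 2)
        b[d[d.size() - 2] - 'a'][d[d.size() - 1] - 'a'].push_back(d);
}

// candidate domains for a hostname: at most one bucket to brute-force
const std::vector<std::string>& candidates(const Buckets& b,
                                           const std::string& host)
{
    return b[host[host.size() - 2] - 'a'][host[host.size() - 1] - 'a'];
}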
I think the solution should be very different depending on the type of the input strings. If the strings are some kind of string class that can be iterated from the end (such as STL strings), it is a lot easier than if they are NUL-terminated C strings.
String Class
Iterate the string backwards (don't make a reversed copy; use some kind of backward iterator). Build a trie where each node consists of two 64-bit words, one pattern and one bitmask, and check 8 characters at a time at each level. The mask is used if you want to match fewer than 8 characters; e.g. denying "*.org" would give a mask with 32 bits set. The mask is also used as the termination criterion.
C strings
Construct an NDFA that matches the strings in a single pass over them. That way you don't have to iterate to the end first, but can instead consume the string in one pass. An NDFA can be converted to a DFA, which will probably make the implementation more efficient. Both the construction of the NDFA and the conversion to a DFA will probably be complex enough that you will have to write tools for it.
After some research and deliberation I've decided to go with the trie/finite-state-machine approach.
The string is parsed starting from the last character, going backwards through a trie, as long as the portion of the suffix parsed so far can correspond to multiple suffixes. At some point we either hit the first character of one of the possible suffixes, which means we have a match; hit a dead end, which means there are no more possible matches; or get into a situation where there is only one suffix candidate left, in which case we just compare the remainder of the suffix.
Since trie lookups are constant time, the worst-case complexity is O(maximum suffix length). The function turned out to be pretty fast: on a 2.8 GHz Core i5 it can check 33,000,000 strings per second against 2K possible suffixes. The 2K suffixes, totaling 18 kilobytes, expanded to a 320 KB trie/state-machine table. I guess I could have stored it more efficiently, but this solution seems to work well enough for the time being.
Since the suffix list was so large, I didn't want to code it all by hand, so I ended up writing a C# application that generates the C code for the suffix-checking function:
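(The generator below is a fragment of a larger C# console program; to compile, it needs the usual using directives — System, System.Collections.Generic, System.IO, System.Linq, System.Text — and must sit inside a class.)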
public static uint GetFourBytes(string s, int index)
{
byte[] bytes = new byte[4] { 0, 0, 0, 0};
int len = Math.Min(s.Length - index, 4);
Encoding.ASCII.GetBytes(s, index, len, bytes, 0);
return BitConverter.ToUInt32(bytes, 0);
}
public static string ReverseString(string s)
{
char[] chars = s.ToCharArray();
Array.Reverse(chars);
return new string(chars);
}
static StringBuilder trieArray = new StringBuilder();
static int trieArraySize = 0;
static void Main(string[] args)
{
// read all non-empty lines from input file
var suffixes = File
.ReadAllLines(@"suffixes.txt")
.Where(l => !string.IsNullOrEmpty(l));
var reversedSuffixes = suffixes
.Select(s => ReverseString(s));
int start = CreateTrieNode(reversedSuffixes, "");
string outFName = @"checkStringSuffix.debug.h";
if (args.Length != 0 && args[0] == "--release")
{
outFName = @"checkStringSuffix.h";
}
using (StreamWriter wrt = new StreamWriter(outFName))
{
wrt.WriteLine(
"#pragma once\n\n" +
"#define TRIE_NONE -1000000\n"+
"#define TRIE_DONE -2000000\n\n"
);
wrt.WriteLine("const int trieArray[] = {{{0}\n}};", trieArray);
wrt.WriteLine(
"inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {\n"+
" int len = trie[0];\n"+
" if (curr - str < len) return false;\n"+
" const char* cmp = (const char*)(trie + 1);\n"+
" while (len-- > 0) {\n"+
" if (*--curr != *cmp++) return false;\n"+
" }\n"+
" return true;\n"+
"}\n\n"+
"bool checkStringSuffix(const char* str, int len) {\n" +
" if (len < " + suffixes.Select(s => s.Length).Min().ToString() + ") return false;\n" +
" const char* curr = (str + len - 1);\n"+
" int currTrie = " + start.ToString() + ";\n"+
" while (curr >= str) {\n" +
" assert(*curr >= 0x20 && *curr <= 0x7f);\n" +
" currTrie = trieArray[currTrie + *curr - 0x20];\n" +
" if (currTrie < 0) {\n" +
" if (currTrie == TRIE_NONE) return false;\n" +
" if (currTrie == TRIE_DONE) return true;\n" +
" return checkSingleSuffix(str, curr, trieArray - currTrie - 1);\n" +
" }\n"+
" --curr;\n"+
" }\n" +
" return false;\n"+
"}\n"
);
}
}
private static int CreateTrieNode(IEnumerable<string> suffixes, string prefix)
{
int retVal = trieArraySize;
if (suffixes.Count() == 1)
{
string theSuffix = suffixes.Single();
trieArray.AppendFormat("\n\t/* {1} - {2} */ {0}, ", theSuffix.Length, trieArraySize, prefix);
++trieArraySize;
for (int i = 0; i < theSuffix.Length; i += 4)
{
trieArray.AppendFormat("0x{0:X}, ", GetFourBytes(theSuffix, i));
++trieArraySize;
}
retVal = -(retVal + 1);
}
else
{
var groupByFirstChar =
from s in suffixes
let first = s[0]
let remainder = s.Substring(1)
group remainder by first;
string[] trieIndexes = new string[0x60];
for (int i = 0; i < trieIndexes.Length; ++i)
{
trieIndexes[i] = "TRIE_NONE";
}
foreach (var g in groupByFirstChar)
{
if (g.Any(s => s == string.Empty))
{
trieIndexes[g.Key - 0x20] = "TRIE_DONE";
continue;
}
trieIndexes[g.Key - 0x20] = CreateTrieNode(g, g.Key + prefix).ToString();
}
trieArray.AppendFormat("\n\t/* {1} - {2} */ {0},", string.Join(", ", trieIndexes), trieArraySize, prefix);
retVal = trieArraySize;
trieArraySize += 0x60;
}
return retVal;
}
So it generates code like this:
inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {
int len = trie[0];
if (curr - str < len) return false;
const char* cmp = (const char*)(trie + 1);
while (len-- > 0) {
if (*--curr != *cmp++) return false;
}
return true;
}
bool checkStringSuffix(const char* str, int len) {
if (len < 5) return false;
const char* curr = (str + len - 1);
int currTrie = 81921;
while (curr >= str) {
assert(*curr >= 0x20 && *curr <= 0x7f);
currTrie = trieArray[currTrie + *curr - 0x20];
if (currTrie < 0) {
if (currTrie == TRIE_NONE) return false;
if (currTrie == TRIE_DONE) return true;
return checkSingleSuffix(str, curr, trieArray - currTrie - 1);
}
--curr;
}
return false;
}
Since, for my particular set of data, len in checkSingleSuffix was never more than 9, I tried replacing the comparison loop with a switch (len) and hardcoded comparison routines that compared up to 8 bytes of data at a time, but it didn't affect overall performance either way.
Thanks for everyone who contributed their ideas!