Efficiently check string for one of several hundred possible suffixes - c++

I need to write a C/C++ function that quickly checks whether a string ends with one of ~1000 predefined suffixes. Specifically, the string is a hostname and I need to check whether it belongs to one of several hundred predefined second-level domains.
This function will be called a lot, so it needs to be as efficient as possible. Bitwise hacks and the like are fine; anything goes as long as it turns out fast.
The set of suffixes is predetermined at compile time and doesn't change.
I am thinking of either implementing a variation of Rabin-Karp or writing a tool that generates a function with nested ifs and switches, custom-tailored to the specific set of suffixes. Since the application in question is 64-bit, to speed up comparisons I could store suffixes of up to 8 bytes in length in a sorted const array and do a binary search within it.
Are there any other reasonable options?

If the suffixes don't contain any expansions/rules (like a regex), you could build a Trie of the suffixes in reverse order, and then match the string based on that. For instance
Reverse-order suffix trie (each path is read from the root):
o-a-b (matches bao)
o-o-f (matches foo)
r-a-b (matches bar)
These can then be used to match your string:
"mystringfoo" -> reverse -> "oofgnirtsym" -> trie match -> foo suffix
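The reversed trie above can be sketched roughly as follows. This is only an illustration, assuming the suffixes are available as a std::vector<std::string>; the class name SuffixTrie and its members are made up for the example, and a node-per-character map is far from the fastest layout.

```cpp
#include <memory>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Trie keyed on characters read from the END of each string.
class SuffixTrie {
    struct Node {
        std::unordered_map<char, std::unique_ptr<Node>> next;
        bool terminal = false;  // a complete suffix ends at this node
    };
    Node root_;

public:
    explicit SuffixTrie(const std::vector<std::string>& suffixes) {
        for (const std::string& s : suffixes) {
            Node* node = &root_;
            // Insert the suffix back to front, i.e. in reversed order.
            for (auto it = s.rbegin(); it != s.rend(); ++it) {
                auto& child = node->next[*it];
                if (!child) child = std::make_unique<Node>();
                node = child.get();
            }
            node->terminal = true;
        }
    }

    // Returns true if `text` ends with any stored (non-empty) suffix.
    bool matches(std::string_view text) const {
        const Node* node = &root_;
        // Walk the text backwards; no reversed copy is needed.
        for (auto it = text.rbegin(); it != text.rend(); ++it) {
            auto found = node->next.find(*it);
            if (found == node->next.end()) return false;
            node = found->second.get();
            if (node->terminal) return true;
        }
        return false;
    }
};
```

Walking the input with reverse iterators avoids building the reversed copy shown in the example line above.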

You mention that you're looking at second-level domain names only, so even without knowing the precise set of matching domains, you could extract the relevant portion of the input string.
Then simply use a hashtable. Dimension it in such a way that there are no collisions, so you don't need buckets; lookups will be exactly O(1). For small hash types (e.g. 32 bits), you'd want to check if the strings really match. For a 64-bit hash, the probability of another domain colliding with one of the hashes in your table is already so low (order 10^-17) that you can probably live with it.
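A minimal sketch of the "extract, then look up" idea, assuming the domains of interest always consist of the last two dot-separated labels. It uses std::unordered_set, which does have buckets, rather than the collision-free table described above; lastTwoLabels and isKnownDomain are names invented for the example.

```cpp
#include <cstddef>
#include <string_view>
#include <unordered_set>

// Returns the last two dot-separated labels of `host`
// ("www.example.com" -> "example.com"), or the whole string if it has fewer.
std::string_view lastTwoLabels(std::string_view host) {
    const std::size_t last = host.rfind('.');
    if (last == std::string_view::npos || last == 0) return host;
    const std::size_t prev = host.rfind('.', last - 1);
    return prev == std::string_view::npos ? host : host.substr(prev + 1);
}

// The set stores views, so the strings it refers to must outlive it
// (string literals, as in a compile-time table, satisfy that).
bool isKnownDomain(std::string_view host,
                   const std::unordered_set<std::string_view>& domains) {
    return domains.count(lastTwoLabels(host)) != 0;
}
```

Since the set is fixed at compile time, a generated perfect hash could replace the standard container later without changing the callers.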

I would reverse all of the suffix strings, build a prefix tree of them and then test the reverse of your hostname string against that.

I think that building your own automaton would be the most efficient way. It is a variant of your second solution: starting from a finite set of suffixes, you generate an automaton fitted to those suffixes.
I think you can easily use flex to do it, taking care to reverse the input or to handle in some special way the fact that you are looking only for suffixes (purely for efficiency).
By the way, a Rabin-Karp approach would be efficient too, since your suffixes are short. You can fill a hash set with all the suffixes needed and then:
take a string
take the suffix
calculate the hash of the suffix
check if suffix is in the table

Just create a 26x26 array of sets of domains, e.g. thisArray[0][0] holds the domains that end in 'aa', thisArray[0][1] holds all the domains that end in 'ab', and so on.
Once you have that, just search your array for thisArray[2nd last char of hostname][last char of hostname] to get the possible domains. If there's more than one at that stage, just brute force the rest.
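A rough sketch of that bucketing, assuming the last two characters of every suffix are lowercase ASCII letters (suffixes that do not satisfy this are silently skipped here). TwoCharIndex is an illustrative name, not something from the question.

```cpp
#include <array>
#include <string>
#include <string_view>
#include <vector>

class TwoCharIndex {
    // buckets_[a][b] holds every suffix whose last two characters are a, b.
    std::array<std::array<std::vector<std::string>, 26>, 26> buckets_;

    // Maps 'a'..'z' to 0..25, anything else to -1.
    static int slot(char c) { return (c >= 'a' && c <= 'z') ? c - 'a' : -1; }

public:
    explicit TwoCharIndex(const std::vector<std::string>& suffixes) {
        for (const std::string& s : suffixes) {
            if (s.size() < 2) continue;
            const int a = slot(s[s.size() - 2]);
            const int b = slot(s[s.size() - 1]);
            if (a >= 0 && b >= 0) buckets_[a][b].push_back(s);
        }
    }

    bool matches(std::string_view host) const {
        if (host.size() < 2) return false;
        const int a = slot(host[host.size() - 2]);
        const int b = slot(host[host.size() - 1]);
        if (a < 0 || b < 0) return false;
        // Brute-force the few candidates that share the last two characters.
        for (const std::string& s : buckets_[a][b]) {
            if (host.size() >= s.size() &&
                host.compare(host.size() - s.size(), s.size(), s) == 0)
                return true;
        }
        return false;
    }
};
```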

I think that the solution should be very different depending on the type of the input strings. If the strings are some kind of string class that can be iterated from the end (such as STL strings), it is a lot easier than if they are NUL-terminated C strings.
String Class
Iterate the string backwards (don't make a reverse copy - use some kind of backward iterator). Build a Trie where each node consists of two 64-bit words, one pattern and one bitmask. Then check 8 characters at a time in each level. The mask is used if you want to match less than 8 characters - e.g. deny "*.org" would give a mask with 32 bits set. The mask is also used as termination criteria.
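The pattern/mask comparison for a single trie level might look like the sketch below. It assumes suffixes of at most 8 bytes; longer ones would continue on the next level. MaskedSuffix, loadTail, makeMaskedSuffix and endsWith are hypothetical names. Bytes are placed by memory position, so the comparison does not depend on endianness.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

struct MaskedSuffix {
    std::uint64_t pattern = 0;  // suffix bytes, aligned to the end of the word
    std::uint64_t mask = 0;     // 0xFF for every byte that takes part in the comparison
    std::size_t length = 0;     // suffix length in bytes (assumed <= 8)
};

// Copies the last min(8, size) bytes of `text` to the end of a 64-bit word;
// the unused leading bytes stay zero.
inline std::uint64_t loadTail(std::string_view text) {
    std::uint64_t word = 0;
    const std::size_t n = text.size() < 8 ? text.size() : 8;
    if (n != 0) {
        std::memcpy(reinterpret_cast<unsigned char*>(&word) + (8 - n),
                    text.data() + (text.size() - n), n);
    }
    return word;
}

// Builds the pattern/mask pair. Assumption: suffix.size() <= 8.
inline MaskedSuffix makeMaskedSuffix(std::string_view suffix) {
    MaskedSuffix s;
    s.length = suffix.size();
    s.pattern = loadTail(suffix);
    std::memset(reinterpret_cast<unsigned char*>(&s.mask) + (8 - s.length),
                0xFF, s.length);
    return s;
}

// One masked 64-bit comparison instead of a byte-by-byte loop.
inline bool endsWith(std::string_view text, const MaskedSuffix& s) {
    return text.size() >= s.length && (loadTail(text) & s.mask) == s.pattern;
}
```

For ".org" the mask ends up with 32 bits set, matching the example in the paragraph above.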
C strings
Construct an NFA (nondeterministic finite automaton) that matches the strings in a single pass over them. That way you don't have to iterate to the end first; one forward pass is enough. An NFA can be converted to a DFA, which will probably make the implementation more efficient. Both the construction of the NFA and the conversion to a DFA will probably be so complex that you will have to write tools for them.

After some research and deliberation I've decided to go with the trie/finite-state-machine approach.
The string is parsed backwards, starting from the last character, using a trie for as long as the portion parsed so far can still correspond to multiple suffixes. At some point we either hit the first character of one of the possible suffixes, which means that we have a match; hit a dead end, which means that there are no more possible matches; or reach a state where only one suffix candidate is left. In that last case we simply compare the remainder of the suffix.
Since each trie lookup takes constant time, the worst-case complexity is O(maximum suffix length). The function turned out to be pretty fast. On a 2.8 GHz Core i5 it can check 33,000,000 strings per second against 2K possible suffixes. The 2K suffixes, totaling 18 kilobytes, expanded to a 320 KB trie/state-machine table. I guess that I could have stored it more efficiently, but this solution seems to work well enough for the time being.
Since the suffix list was so large, I didn't want to code it all by hand, so I ended up writing a C# application that generates the C code for the suffix-checking function:
public static uint GetFourBytes(string s, int index)
{
    byte[] bytes = new byte[4] { 0, 0, 0, 0 };
    int len = Math.Min(s.Length - index, 4);
    Encoding.ASCII.GetBytes(s, index, len, bytes, 0);
    return BitConverter.ToUInt32(bytes, 0);
}

public static string ReverseString(string s)
{
    char[] chars = s.ToCharArray();
    Array.Reverse(chars);
    return new string(chars);
}

static StringBuilder trieArray = new StringBuilder();
static int trieArraySize = 0;

static void Main(string[] args)
{
    // read all non-empty lines from input file
    // (the file name did not survive in this listing; "suffixes.txt" is a placeholder)
    var suffixes = File
        .ReadAllLines("suffixes.txt")
        .Where(l => !string.IsNullOrEmpty(l));
    var reversedSuffixes = suffixes
        .Select(s => ReverseString(s));
    int start = CreateTrieNode(reversedSuffixes, "");
    string outFName = @"checkStringSuffix.debug.h";
    if (args.Length != 0 && args[0] == "--release")
    {
        outFName = @"checkStringSuffix.h";
    }
    using (StreamWriter wrt = new StreamWriter(outFName))
    {
        wrt.WriteLine(
            "#pragma once\n\n" +
            "#define TRIE_NONE -1000000\n"+
            "#define TRIE_DONE -2000000\n\n"
            );
        wrt.WriteLine("const int trieArray[] = {{{0}\n}};", trieArray);
        wrt.WriteLine(
            "inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {\n"+
            "    int len = trie[0];\n"+
            "    if (curr - str < len) return false;\n"+
            "    const char* cmp = (const char*)(trie + 1);\n"+
            "    while (len-- > 0) {\n"+
            "        if (*--curr != *cmp++) return false;\n"+
            "    }\n"+
            "    return true;\n"+
            "}\n\n"+
            "bool checkStringSuffix(const char* str, int len) {\n" +
            "    if (len < " + suffixes.Select(s => s.Length).Min().ToString() + ") return false;\n" +
            "    const char* curr = (str + len - 1);\n"+
            "    int currTrie = " + start.ToString() + ";\n"+
            "    while (curr >= str) {\n" +
            "        assert(*curr >= 0x20 && *curr <= 0x7f);\n" +
            "        currTrie = trieArray[currTrie + *curr - 0x20];\n" +
            "        if (currTrie < 0) {\n" +
            "            if (currTrie == TRIE_NONE) return false;\n" +
            "            if (currTrie == TRIE_DONE) return true;\n" +
            "            return checkSingleSuffix(str, curr, trieArray - currTrie - 1);\n" +
            "        }\n"+
            "        --curr;\n"+
            "    }\n" +
            "    return false;\n"+
            "}\n"
            );
    }
}

private static int CreateTrieNode(IEnumerable<string> suffixes, string prefix)
{
    int retVal = trieArraySize;
    if (suffixes.Count() == 1)
    {
        // Only one candidate left: store its length followed by its bytes,
        // packed four per int, and return the index as a negative number.
        string theSuffix = suffixes.Single();
        trieArray.AppendFormat("\n\t/* {1} - {2} */ {0}, ", theSuffix.Length, trieArraySize, prefix);
        ++trieArraySize;
        for (int i = 0; i < theSuffix.Length; i += 4)
        {
            trieArray.AppendFormat("0x{0:X}, ", GetFourBytes(theSuffix, i));
            ++trieArraySize;
        }
        retVal = -(retVal + 1);
    }
    else
    {
        var groupByFirstChar =
            from s in suffixes
            let first = s[0]
            let remainder = s.Substring(1)
            group remainder by first;
        // one slot per character in the range 0x20..0x7f
        string[] trieIndexes = new string[0x60];
        for (int i = 0; i < trieIndexes.Length; ++i)
        {
            trieIndexes[i] = "TRIE_NONE";
        }
        foreach (var g in groupByFirstChar)
        {
            if (g.Any(s => s == string.Empty))
            {
                trieIndexes[g.Key - 0x20] = "TRIE_DONE";
                continue;
            }
            trieIndexes[g.Key - 0x20] = CreateTrieNode(g, g.Key + prefix).ToString();
        }
        trieArray.AppendFormat("\n\t/* {1} - {2} */ {0},", string.Join(", ", trieIndexes), trieArraySize, prefix);
        retVal = trieArraySize;
        trieArraySize += 0x60;
    }
    return retVal;
}
So it generates code like this:
inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {
    int len = trie[0];
    if (curr - str < len) return false;
    const char* cmp = (const char*)(trie + 1);
    while (len-- > 0) {
        if (*--curr != *cmp++) return false;
    }
    return true;
}

bool checkStringSuffix(const char* str, int len) {
    if (len < 5) return false;
    const char* curr = (str + len - 1);
    int currTrie = 81921;
    while (curr >= str) {
        assert(*curr >= 0x20 && *curr <= 0x7f);
        currTrie = trieArray[currTrie + *curr - 0x20];
        if (currTrie < 0) {
            if (currTrie == TRIE_NONE) return false;
            if (currTrie == TRIE_DONE) return true;
            return checkSingleSuffix(str, curr, trieArray - currTrie - 1);
        }
        --curr;
    }
    return false;
}
Since len in checkSingleSuffix was never more than 9 for my particular data set, I tried replacing the comparison loop with a switch (len) and hardcoded comparison routines that compared up to 8 bytes of data at a time, but it made no measurable difference to overall performance.

Generate string lexicographically larger than input

Given an input string A, is there a concise way to generate a string B that is lexicographically larger than A, i.e. A < B == true?
My naive solution would be to say:
B = A;
++B.back();
but in general this won't work because:
A might be empty.
The last character of A may be close to wraparound, in which case the resulting character will have a smaller value, i.e. B < A.
Adding an extra character every time is wasteful and will quickly result in unreasonably large strings.
So I was wondering whether there's a standard library function that can help me here, or if there's a strategy that scales nicely when I want to start from an arbitrary string.
You can duplicate A into B then look at the final character. If the final character isn't the final character in your range, then you can simply increment it by one.
Otherwise you can look at last-1, last-2, last-3. If you get to the front of the list of chars, then append to the length.
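One way to write that "increment with carry" idea is sketched below. It assumes every character of the input lies in a known range [minC, maxC] (printable ASCII by default); the function name nextGreater is made up. Dropping the maxed-out tail is a choice of this sketch, not part of the suggestion above: the result is still larger because it differs at an earlier position.

```cpp
#include <cstddef>
#include <string>

// Assumes every character of `input` lies in [minC, maxC].
std::string nextGreater(const std::string& input,
                        char minC = ' ', char maxC = '~') {
    // Scan from the back for the last character that can still be incremented.
    for (std::size_t i = input.size(); i-- > 0;) {
        if (input[i] != maxC) {
            // Keep everything up to and including position i, drop the
            // maxed-out tail, and bump the character at i.
            std::string result = input.substr(0, i + 1);
            ++result[i];
            return result;
        }
    }
    // Empty input, or every character is already maxC: append one character.
    return input + minC;
}
```

With this approach the string only grows when no character can be incremented, so repeated calls do not blow up the length.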
Here is my dummy solution:
std::string make_greater_string(std::string const &input)
{
    std::string ret{std::numeric_limits<
        std::string::value_type>::min()};
    if (!input.empty())
    {
        if (std::numeric_limits<std::string::value_type>::max()
            == input.back())
        {
            ret = input + ret;
        }
        else
        {
            ret = input;
            ++ret.back();
        }
    }
    return ret;
}
Ideally I'd hope to avoid the explicit handling of all special cases, and use some facility that can handle them more naturally. Already, looking at the answer by @JosephLarson, I see that I could increment more than the last character, which would extend the range achievable without adding more characters.
And here's the refinement after the suggestions in this post:
std::string make_greater_string(std::string const &input)
{
    constexpr char minC = ' ', maxC = '~';
    // Working with limits was a pain,
    // using ASCII typical limit values instead.
    std::string ret{minC};
    auto rit = input.rbegin();
    while (rit != input.rend())
    {
        if (maxC == *rit)
        {
            ++rit;
            if (rit == input.rend())
            {
                ret = input + ret;
                break;
            }
        }
        else
        {
            ret = input;
            ++(*(ret.rbegin() + std::distance(input.rbegin(), rit)));
            break;
        }
    }
    return ret;
}
You can copy the string and append some letters - this will produce a lexicographically larger result.
B = A + "a"

Searching for an exact string match in a (arbitrary large) stream - C++

I am building a simple multi-server for string matching. I handle multiple clients at the same time by using sockets and select. The only job the server does is this: a client connects to the server and sends a needle (of size less than 10 GB) and a haystack (of arbitrary size) as a stream through a network socket. The needle and the haystack are arbitrary binary data.
The server needs to search the haystack for all occurrences of the needle (as an exact string match) and send the number of matches back to the client. The server needs to process clients on the fly and be able to handle any input in reasonable time (that is, the search algorithm has to have linear time complexity).
To do this I obviously need to split the haystack into small parts (possibly smaller than the needle) in order to process them as they come through the network socket. That is, I need a search algorithm that can handle a string that is split into parts and search in it the same way strstr(...) does.
I could not find any standard C or C++ library function, nor a Boost library object, that could handle a string in parts. If I am not mistaken, the algorithms in strstr(), string.find() and Boost's searching/knuth_morris_pratt.hpp can only search when the whole haystack is in one contiguous block of memory. Or is there some trick I am missing that would let me search a string in parts? Do you know of any C/C++ library that can cope with such large needles and haystacks, i.e. one that can handle haystack streams or search a haystack in parts?
I did not find any useful library by googling, and hence I was forced to create my own variation of the Knuth-Morris-Pratt algorithm that is able to remember its own state (shown below). However, I do not consider it an optimal solution: a well-tuned string searching algorithm would surely perform better, in my opinion, and it would be one less thing to debug later.
So my question is:
Is there a more elegant way to search a large haystack stream in parts, other than creating my own search algorithm? Is there any trick for using the standard C string library for this? Is there some C/C++ library that is specialized for this kind of task?
Here is (part of) the code of my modified KMP algorithm:
#include <cstdlib>
#include <cstring>
#include <cstdio>

class knuth_morris_pratt {
    const char* const needle;
    const size_t needle_len;
    const int* const lps; // a longest proper suffix table (skip table)

    // suffix_len is an offset of a longest haystack_part suffix matching with
    // some prefix of the needle. suffix_len must be shorter than needle_len.
    // Offset is defined as a last matching character in a needle.
    size_t suffix_len;
    size_t match_count; // a number of needles found in haystack

public:
    inline knuth_morris_pratt(const char* needle, size_t len) :
        needle(needle), needle_len(len),
        lps( build_lps_array() ), suffix_len(0),
        match_count(len == 0 ? 1 : 0) { }
    inline ~knuth_morris_pratt() { free((void*)lps); }

    void search_part(const char* haystack_part, size_t hp_len); // processes a given part of the haystack stream
    inline size_t get_match_count() { return match_count; }

private:
    const int* build_lps_array();
};

// Worst case complexity: linear space, linear time
// see: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
// see article: KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R., 1977, Fast pattern matching in strings
void knuth_morris_pratt::search_part(const char* haystack_part, size_t hp_len) {
    if(needle_len == 0) {
        match_count += hp_len;
        return;
    }
    const char* hs = haystack_part;
    size_t i = 0; // index for txt[]
    size_t j = suffix_len; // index for pat[]
    while (i < hp_len) {
        if (needle[j] == hs[i]) {
            j++;
            i++;
        }
        if (j == needle_len) {
            // a needle found
            match_count++;
            j = lps[j - 1];
        }
        else if (i < hp_len && needle[j] != hs[i]) {
            // Do not match lps[0..lps[j-1]] characters,
            // they will match anyway
            if (j != 0)
                j = lps[j - 1];
            else
                i = i + 1;
        }
    }
    // remember how much of the needle is matched at the end of this part
    suffix_len = j;
}

const int* knuth_morris_pratt::build_lps_array() {
    // room for needle_len ints (not bytes); at least one so that lps[0] is valid
    int* const new_lps = (int*)malloc((needle_len > 0 ? needle_len : 1) * sizeof(int));
    // check_cond_fatal(new_lps != NULL, "Unable to allocate memory in knuth_morris_pratt(..)");
    // length of the previous longest prefix suffix
    size_t len = 0;
    new_lps[0] = 0; // lps[0] is always 0
    // the loop calculates lps[i] for i = 1 to M-1
    size_t i = 1;
    while (i < needle_len) {
        if (needle[i] == needle[len]) {
            len++;
            new_lps[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            // This is tricky. Consider the example.
            // AAACAAAA and i = 7. The idea is similar
            // to search step.
            if (len != 0) {
                len = new_lps[len - 1];
                // Also, note that we do not increment
                // i here
            }
            else // if (len == 0)
            {
                new_lps[i] = 0;
                i++;
            }
        }
    }
    return new_lps;
}

int main()
{
    const char* needle = "lorem";
    const char* p1 = "sit voluptatem accusantium doloremque laudantium qui dolo";
    const char* p2 = "rem ipsum quia dolor sit amet";
    const char* p3 = "dolorem eum fugiat quo voluptas nulla pariatur?";
    knuth_morris_pratt searcher(needle, strlen(needle));
    searcher.search_part(p1, strlen(p1));
    searcher.search_part(p2, strlen(p2));
    searcher.search_part(p3, strlen(p3));
    printf("%d \n", (int)searcher.get_match_count());
    return 0;
}
You can have a look at BNDM, which has the same performance as KMP:
O(m) for preprocessing
O(n) for matching.
It is used in nrgrep, the sources of which can be found here and include C sources.
The C source for the BNDM algorithm is here.
See here for more information.
If I have understood your problem correctly, you want to check whether a large std::string that is received part by part contains a substring.
If that is the case, I think you can store, for each iteration, the overlapping section between two contiguous received packets. Then, at each iteration, you just have to check whether either the overlap or the packet contains the pattern.
In the example below, I use the following contains() function to search for a pattern in a std::string:
bool contains(const std::string & str, const std::string & pattern)
{
    bool found(false);
    // <= so that a string exactly as long as the pattern can still match
    if(!pattern.empty() && (pattern.length() <= str.length()))
    {
        for(size_t i = 0; !found && (i <= str.length()-pattern.length()); ++i)
        {
            if((str[i] == pattern[0]) && (str.substr(i, pattern.length()) == pattern))
            {
                found = true;
            }
        }
    }
    return found;
}

std::string pattern("something"); // The pattern we want to find
std::string end_of_previous_packet(""); // The first part of overlapping section
std::string beginning_of_current_packet(""); // The second part of overlapping section
std::string overlap; // The string to store the overlap at each iteration
bool found(false);
while(!found && !all_data_received()) // stop condition
{
    // Get the current packet
    std::string packet = receive_part();
    // Set the beginning of the current packet
    beginning_of_current_packet = packet.substr(0, pattern.length());
    // Build the overlap
    overlap = end_of_previous_packet + beginning_of_current_packet;
    // If the overlap or the packet contains the pattern, we found a match
    if(contains(overlap, pattern) || contains(packet, pattern))
        found = true;
    // Set the end of previous packet for the next iteration
    // (guarded: substr would throw if the packet is shorter than the pattern)
    end_of_previous_packet = packet.length() > pattern.length()
        ? packet.substr(packet.length()-pattern.length())
        : packet;
}
Of course, in this example I made the assumption that the method receive_part() already exists. Same thing for the all_data_received() function. It is just an example to illustrate the idea.
I hope it will help you to find a solution.
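For counting all matches rather than stopping at the first one, the same overlap idea could be sketched as a small class that carries the last needle.size() - 1 bytes from one chunk to the next. StreamCounter is an invented name, and std::string::find is swapped in for the search step, so no linear-time guarantee is implied; the sketch also copies each chunk, which would matter for very large needles.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

class StreamCounter {
    std::string needle_;
    std::string tail_;       // last needle_.size() - 1 bytes seen so far
    std::size_t count_ = 0;  // matches found so far (overlapping ones included)

public:
    explicit StreamCounter(std::string needle) : needle_(std::move(needle)) {}

    // Feed the next chunk of the haystack; chunks may be shorter than the needle.
    void feed(std::string_view chunk) {
        if (needle_.empty()) return;  // an empty needle is not counted in this sketch
        std::string window = tail_;
        window.append(chunk);
        // The tail is shorter than the needle, so every match found here
        // ends inside the new chunk and is counted exactly once.
        std::size_t pos = 0;
        while ((pos = window.find(needle_, pos)) != std::string::npos) {
            ++count_;
            ++pos;
        }
        const std::size_t keep = std::min(needle_.size() - 1, window.size());
        tail_.assign(window, window.size() - keep, keep);
    }

    std::size_t count() const { return count_; }
};
```

Fed the three parts from the question's main(), this counts the match that straddles the first two parts as well as the two that lie inside a single part.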

Character pointers messed up in simple Boyer-Moore implementation

I am currently experimenting with a very simple Boyer-Moore variant.
In general my implementation works, but if I try to use it in a loop, the character pointer containing the haystack gets messed up. By that I mean that the characters in it are altered or mixed up.
The result is consistent, i.e. running the same test multiple times yields the same screw up.
This is the looping code:
string src("This haystack contains a needle! needless to say that only 2 matches need to be found!");
string pat("needle");
const char* res = src.c_str();
while((res = boyerMoore(res, pat)))
    ++res;
This is my implementation of the string search algorithm (the above code calls a convenience wrapper which pulls the character pointer and length of the string):
unsigned char*
boyerMoore(const unsigned char* src, size_t srcLgth, const unsigned char* pat, size_t patLgth)
{
    if(srcLgth < patLgth || !src || !pat)
        return nullptr;
    size_t skip[UCHAR_MAX]; //this is the skip table
    for(int i = 0; i < UCHAR_MAX; ++i)
        skip[i] = patLgth; //initialize it with default value
    for(size_t i = 0; i < patLgth; ++i)
        skip[(int)pat[i]] = patLgth - i - 1; //set skip value of chars in pattern
    std::cout<<src<<"\n"; //just to see what's going on here!
    size_t srcI = patLgth - 1; //our first character to check
    while(srcI < srcLgth)
    {
        size_t j = 0; //char match ct
        while(j < patLgth)
        {
            if(src[srcI - j] == pat[patLgth - j - 1])
                ++j;
            else
            {
                //since the number of characters to skip may be negative, I just increment in that case
                size_t t = skip[(int)src[srcI - j]];
                if(t > j)
                    srcI = srcI + t - j;
                else
                    ++srcI;
                break;
            }
        }
        if(j == patLgth)
            return (unsigned char*)&src[srcI + 1 - j];
    }
    return nullptr;
}
The loop produced this output (i.e. these are the haystacks the algorithm received):
This haystack contains a needle! needless to say that only 2 matches need to be found!
eedle! needless to say that only 2 matches need to be found!
eedless to say that eed 2 meed to beed to be found!
As you can see the input is completely messed up after the second run. What am I missing? I thought the contents could not be modified, since I'm passing const pointers.
Is the way of setting the pointer in the loop wrong, or is my string search screwing up?
Btw: This is the complete code, except for includes and the main function around the looping code.
The missing nullptr of the first return was due to a copy/paste error, in the source it is actually there.
For clarification, this is my wrapper function:
inline char* boyerMoore(const string &src, const string &pat)
{
    return (const char*) boyerMoore((const unsigned char*) src.c_str(), src.size(),
                                    (const unsigned char*) pat.c_str(), pat.size());
}
In your boyerMoore() function, the first return isn't returning a value (you have just return; rather than return nullptr;). GCC doesn't always warn about missing return values, and not returning anything is undefined behavior. That means that when you store the return value in res and call the function again, there's no telling what will print out. You can see a related discussion here.
Also, you have omitted your convenience function that calculates the length of the strings that you are passing in. I would recommend double checking that logic to make sure the sizes are correct - I'm assuming you are using strlen or similar.
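One more thing that may be worth checking, although it is not confirmed anywhere in this thread: the wrapper takes a const string&, so calling it with a const char* constructs a temporary std::string, and a pointer into that temporary dangles once the call returns. A loop that works with offsets into the original string sidesteps the issue. The sketch below uses the Horspool simplification of Boyer-Moore rather than the code from the question; horspoolFind and countMatches are hypothetical names.

```cpp
#include <array>
#include <climits>
#include <cstddef>
#include <string>

// Boyer-Moore-Horspool search starting at offset `from`.
// Returns the offset of the first match, or std::string::npos.
std::size_t horspoolFind(const std::string& src, const std::string& pat,
                         std::size_t from = 0) {
    const std::size_t n = src.size(), m = pat.size();
    if (m == 0 || from > n || n - from < m) return std::string::npos;

    // One entry per possible byte value, hence UCHAR_MAX + 1.
    std::array<std::size_t, UCHAR_MAX + 1> skip;
    skip.fill(m);
    // The last pattern character is excluded so every shift is at least 1.
    for (std::size_t i = 0; i + 1 < m; ++i)
        skip[static_cast<unsigned char>(pat[i])] = m - 1 - i;

    std::size_t pos = from;  // start of the current window
    while (pos + m <= n) {
        if (src.compare(pos, m, pat) == 0) return pos;
        // Shift by the skip value of the window's last character.
        pos += skip[static_cast<unsigned char>(src[pos + m - 1])];
    }
    return std::string::npos;
}

// Counts all (possibly overlapping) matches without ever holding a raw
// pointer into a temporary string.
std::size_t countMatches(const std::string& src, const std::string& pat) {
    std::size_t count = 0;
    for (std::size_t pos = horspoolFind(src, pat); pos != std::string::npos;
         pos = horspoolFind(src, pat, pos + 1))
        ++count;
    return count;
}
```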

Checking if a word is contained within an array

I want to check for a word contained within a bigger string, but not necessarily in the same order. Example: The program will check if the word "car" exists in "crqijfnsa". In this case, it does, because the second string contains c, a, and r.
You could build a map containing the letters of "car" with the values set to 0. Cycle through the array of letters and, whenever a letter belongs to the word "car", change its value to 1. If all the keys in the map end up with a value greater than 0, then the word can be constructed. Try implementing this.
An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once.
So what you are actually looking for is an algorithm that checks whether two words are anagrams or not.
The following thread provides pseudocode that might be helpful:
Finding anagrams for a given word
A very primitive code would be something like this:
for ( std::string::iterator it=str.begin(); it!=str.end(); ++it)
    for ( std::string::iterator it2=str2.begin(); it2!=str2.end(); ++it2) {
        if (*it == *it2) {
            // this letter of the word is accounted for: remove it and move on
            str2.erase(it2);
            break;
        }
    }
if (str2.empty())
    found = true;
You could build up a table of count of characters of each letter in the word you are searching for, then decrement those counts as you work through the search string.
bool IsWordInString(const char* word, const char* str)
{
    // build up table of characters in word to match
    std::array<int, 256> cword = {0};
    for(;*word;++word) {
        ++cword[(unsigned char)*word];
    }
    // work through str matching characters in word
    for(;*str; ++str) {
        if (cword[(unsigned char)*str] > 0) {
            --cword[(unsigned char)*str];
        }
    }
    // every count is back to zero only if all letters were found
    return std::accumulate(cword.begin(), cword.end(), 0) == 0;
}
It's also possible to return as soon as you find a match, but the code isn't as simple.
bool IsWordInString(const char* word, const char* str)
{
    // empty string
    if (*word == 0)
        return true;
    // build up table of characters in word to match
    int unmatched = 0;
    char cword[256] = {0};
    for(;*word;++word) {
        ++cword[(unsigned char)*word];
        ++unmatched;
    }
    // work through str matching characters in word
    for(;*str; ++str) {
        if (cword[(unsigned char)*str] > 0) {
            --cword[(unsigned char)*str];
            --unmatched;
            if (unmatched == 0)
                return true;
        }
    }
    return false;
}
Some test cases
"" in "crqijfnsa" => 1
"car" in "crqijfnsa" => 1
"ccar" in "crqijfnsa" => 0
"ccar" in "crqijfnsac" => 1
I think the easiest (and probably fastest, test that yourself :) ) implementation would be done with std::includes:
std::string testword {"car"};
std::string testarray {"crqijfnsa"};
// std::includes requires both ranges to be sorted
std::sort(testword.begin(), testword.end());
std::sort(testarray.begin(), testarray.end());
bool is_in_array = std::includes(testarray.begin(), testarray.end(),
                                 testword.begin(), testword.end());
This also handles all cases of duplicate letters correctly.
The complexity of this approach should be O(n * log n), where n is the length of testarray (the sort is O(n log n) and includes has linear complexity).
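Wrapped into a function, the std::includes idea might look like this. It is a sketch; the name isWordInString is only illustrative. Both arguments are taken by value because they have to be sorted before std::includes can be applied.

```cpp
#include <algorithm>
#include <string>

// Returns true if every letter of `word` (counting duplicates) occurs in `text`.
bool isWordInString(std::string word, std::string text) {
    // std::includes requires both ranges to be sorted.
    std::sort(word.begin(), word.end());
    std::sort(text.begin(), text.end());
    return std::includes(text.begin(), text.end(), word.begin(), word.end());
}
```

It gives the same results as the test cases listed earlier in this thread.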

Create longest possible string from vector<char>

I receive data as a vector<char>, from which I need to create a string. The vector may contain UTF-16 characters (i.e. null bytes) and has a fixed size. The actual data is padded with null bytes up to this fixed size. So, for example, I can have the following vector:
\0 a \0 b \0 c \0 d \0 \0 \0 \0
The fixed size is 12 and the vector contains the UTF-16 string "abcd" padded with 4 null chars to size.
From this, I need to extract the actual string. I already have the code for converting from UTF-16 to a string; where I got confused is in finding the number of characters (bytes) in the vector without the padding. In the example above, the number is 8.
I started by doing something like:
std::string CrmxFile::StringFromBytes(std::vector<char> data, int fixedsize) {
    std::vector<char>::reverse_iterator it = data.rbegin();
    while(it != data.rend() && *it == '\0') {
        ++it;
    }
    return std::string(&data[0], fixedsize - (it - data.rbegin()));
}
However, in the full context the vector contains a lot of data and I need to do the above manipulation on only a specified part of it. For example, the vector may contain 1000 elements and I need to get the string that starts at position 30 and goes on for a maximum of 12 chars. Of course, I can create another vector and copy the required 12 characters into it before applying the above logic, but I feel that I should be able to do something directly on the given vector. Yet I can't grasp which iterators I should be comparing with which. Any help is appreciated.
Now, this is embarrassing: vector<char>::iterator is obviously a random access iterator, therefore I can decrement it. Hence my method now looks like this:
std::string CrmxFile::StringFromBytes(std::vector<char> data, int start, int length) {
    std::vector<char>::iterator begin = data.begin() + start;
    // one past the last byte of the field; never decremented below begin
    std::vector<char>::iterator end = begin + length;
    while(end != begin && *(end - 1) == '\0') {
        --end;
    }
    if(end != begin) {
        int len = end - begin;
        if(IsUtf8Heuristic(begin, begin + len)) {
            return std::string(begin, begin + len);
        }
        else { //(heuristically this is utf-16)
            len = ((len + 1) >> 1) << 1;
            std::string res;
            ConvertUtf16To8(begin, begin + len, std::back_inserter(res));
            return res;
        }
    }
    else {
        return "";
    }
}
As I understand the question, you want to extract a part of max fixedsize from data, and erase all trailing zeroes. And from the comments you want the optimal solution.
For me, your code is overly complicated if the data will always be in array form. Use indices, they are more self describing.
std::vector<char> data = ...;
int fixedsize = ...;
int start = ...;
int i = start + fixedsize - 1; // last character that can be in the string
while(i >= start && data[i] == 0) i--; // 'remove' the trailing zeroes
std::string result(&data[start], i - start + 1);
This is the optimal algorithm: there is no asymptotically better one, since every trailing zero has to be inspected (there is a micro-optimization that consists in testing with ints rather than chars, i.e. 4 chars at a time).
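As a self-contained function, the index-based trimming could be sketched as follows. The name unpaddedLength and its bounds handling are assumptions added for the example. Note that for little-endian UTF-16 the final character's zero high byte would be trimmed too, which is what the rounding up to an even length in the question's code compensates for.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of bytes in data[start, start + fieldSize) once trailing NULs are dropped.
// The field is clamped to the end of the vector.
std::size_t unpaddedLength(const std::vector<char>& data,
                           std::size_t start, std::size_t fieldSize) {
    if (start >= data.size()) return 0;
    std::size_t end = std::min(data.size(), start + fieldSize);
    // Step back over the padding; `end` stays one past the last kept byte.
    while (end > start && data[end - 1] == '\0') --end;
    return end - start;
}
```

The result can be passed straight to the std::string(pointer, length) constructor used in the answer above.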