I am trying to optimize a text search, where I am searching for multiple words. I want to know the frequency of all the words, per line.
I have tried to make it as fast as I can, as I want to run the search many times, with multiple keywords, on the same data.
I still think there should be a more efficient way to solve this, so does anybody have good suggestions?
I have put up a simple demo to show the POC on gitlab:
My current search time is 410 ms for 6 keywords in a dataset of 408 MB.
Also, the source of the demo is this:
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
#include <chrono>
using namespace std;

unsigned int addWord(std::map<std::string, unsigned int>& wordLookup, std::string word)
{
    std::transform(word.begin(), word.end(), word.begin(), ::tolower);
    auto it = wordLookup.find(word);
    unsigned int id;
    if (it == wordLookup.end())
    {
        id = wordLookup.size(); //assign consecutive numbers using size()
        wordLookup[word] = id;
    }
    else
    {
        id = it->second;
    }
    return id;
}

void tokenizeWords(std::map<std::string, unsigned int>& wordLookup, std::vector<unsigned int>& wordList, std::string& line)
{
    static const char newsDelimiters[] = "., !?\"()'\n\r\t<>/\\";
    // strtok modifies its input, so work on a null-terminated copy of the line
    std::vector<char> str(line.begin(), line.end());
    str.push_back('\0');
    // Getting the first token
    char *token = strtok(str.data(), newsDelimiters);
    while (token != NULL)
    {
        //finding a word:
        unsigned int id = addWord(wordLookup, token);
        wordList.push_back(id);
        // Getting the next token
        // If there are no tokens left, NULL is returned
        token = strtok(NULL, newsDelimiters);
    }
}

int main()
{
    std::vector<std::vector<unsigned int>> textAsNumbers;
    std::map<std::string, unsigned int> wordLookup;
    std::vector<std::string> searchWords = {"this", "blog", "political", "debate", "climate", "iphone"};
    unsigned int searchLength = searchWords.size();
    std::vector<unsigned int> searchWordIds(searchLength);
    //convert searchWords
    unsigned int i = 0;
    for(const std::string& word : searchWords)
    {
        searchWordIds[i] = addWord(wordLookup, word);
        ++i;
    }

    //#### This part is not time critical ####
    //reading file and convert words to numbers
    fstream newsFile;
    newsFile.open("news.txt", ios::in); // file name assumed; adjust to the actual data file
    if (newsFile.is_open())
    {
        string line;
        while(getline(newsFile, line))
        {
            textAsNumbers.push_back(std::vector<unsigned int>());
            std::vector<unsigned int>& wordList = *textAsNumbers.rbegin();
            tokenizeWords(wordLookup, wordList, line);
        }
    }

    //#### This part should be fast ####
    auto start = std::chrono::system_clock::now();
    std::vector<unsigned int> counts; //end result
    for(std::vector<unsigned int>& line : textAsNumbers)
    {
        unsigned int count = 0;
        for(unsigned int word : line)
        {
            for(unsigned int s = 0; s < searchLength; ++s)
            {
                unsigned int searchWord = searchWordIds[s];
                if(word == searchWord)
                {
                    ++count;
                }
            }
        }
        counts.push_back(count);
    }
    auto end = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    cout << elapsed.count() << "ms" << endl;

    //#### Print for checking result, time insensitive :)
    int n = 0;
    for(unsigned int count : counts)
    {
        cout << "Count[" << n << "]: " << count << endl;
        ++n;
        if(n > 100)
        {
            break;
        }
    }
}
End results
I tried multiple approaches, and the scores are as follows:
Encoding words: 410 ms
Hash tables (Öö Tiib & Jérôme Richard): 135 ms
Ordered & encoded words: 13 ms
Hash tables & encoded words: 72 ms
I also committed the results to my gitlab, if you want to check for yourself.
Using hash tables to speed up the search is smart and does indeed reduce the search time. It is certainly better than my blunt approach. But it still works on strings, and string comparison, construction, and hashing are rather slow.
I think A M's approach of speeding up the encoded-word search is faster for that reason.
I have also tried combining the approaches, using hash tables and encoded words together, but that was still slower than A M's custom search.
So I think we learned that A M is pretty good at searching stuff.
Thanks everybody for your input!
If you just want to speed up the part that you marked, you can get a drastic improvement by sorting all vectors before you enter this loop.
With sorted vectors, each line can be matched against the sorted search IDs in a single linear pass, so the search becomes very fast.
The runtime of the loop is reduced from 490 ms to 10 ms.
Could you please check and report back?
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
#include <chrono>
#include <algorithm>
unsigned int addWord(std::map<std::string, unsigned int>& wordLookup, std::string word)
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
auto it = wordLookup.find(word);
unsigned int id;
if (it == wordLookup.end())
id = wordLookup.size(); //assign consecutive numbers using size()
wordLookup[word] = id;
id = it->second;
return id;
void tokenizeWords(std::map<std::string, unsigned int>& wordLookup, std::vector<unsigned int>& wordList, std::string line)
static const char newsDelimiters[] = "., !?\"()'\n\r\t<>/\\";
#pragma warning(suppress : 4996)
// Getting the first token
#pragma warning(suppress : 4996)
char* token = strtok(line.data(), newsDelimiters);
while (token != NULL)
//finding a word:
unsigned int id = addWord(wordLookup, token);
// Getting the next token
// If there are no tokens left, NULL is returned
#pragma warning(suppress : 4996)
token = strtok(NULL, newsDelimiters);
int main()
std::vector<std::vector<unsigned int>> textAsNumbers;
std::map<std::string, unsigned int> wordLookup;
std::vector<std::string> searchWords = { "this", "blog", "political", "debate", "climate", "iphone" };
unsigned int searchLength = searchWords.size();
std::vector<unsigned int> searchWordIds(searchLength);
//convert searchWords
unsigned int i = 0;
for (const std::string& word : searchWords)
searchWordIds[i] = addWord(wordLookup, word);
std::sort(searchWordIds.begin(), searchWordIds.end());
//#### This part is not time critical ####
//reading file and convert words to numbers
std::fstream newsFile;
newsFile.open("r:\\news.txt", std::ios::in);
if (newsFile.is_open())
std::string line;
while (std::getline(newsFile, line))
textAsNumbers.push_back(std::vector<unsigned int>());
std::vector<unsigned int>& wordList = *textAsNumbers.rbegin();
tokenizeWords(wordLookup, wordList, line);
std::sort(textAsNumbers.back().begin(), textAsNumbers.back().end());
#if 1
std::vector<unsigned int>::iterator last2 = searchWordIds.end();
//#### This part should be fast ####
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
for (std::vector<unsigned int>& line : textAsNumbers)
unsigned int count = 0;
std::vector<unsigned int>::iterator first1 = line.begin();
std::vector<unsigned int>::iterator last1 = line.end();
std::vector<unsigned int>::iterator first2 = searchWordIds.begin();
while (first1 != last1 && first2 != last2) {
if (*first1 < *first2) {
else {
if (!(*first2 < *first1)) {
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms\n";
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
for ( std::vector<unsigned int>& line : textAsNumbers)
unsigned int count = 0;
for (unsigned int word : line)
for (unsigned int s = 0; s < searchLength; ++s)
unsigned int searchWord = searchWordIds[s];
if (word == searchWord)
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms\n";
//#### Print for checking result, time insensitive :)
int n = 0;
for (unsigned int count : counts)
std::cout << "Count[" << n << "]: " << count << '\n';
if (n > 100)
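For readers who only want the core idea, here is a minimal sketch of counting matches between one sorted line and the sorted search IDs in a single pass. The function name countMatchesSorted is made up for this illustration. It assumes the search IDs contain no duplicates, and it counts every occurrence of a repeated word in the line, which is what the brute-force loop does; a loop modelled on std::set_intersection would count each search word at most once per line.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how many entries of sortedLine equal any ID in
// sortedSearchIds. Both vectors must be sorted ascending, and sortedSearchIds
// must not contain duplicates. Repeated words in the line are counted each time.
std::size_t countMatchesSorted(const std::vector<unsigned int>& sortedLine,
                               const std::vector<unsigned int>& sortedSearchIds)
{
    std::size_t count = 0;
    auto word = sortedLine.begin();
    auto search = sortedSearchIds.begin();
    while (word != sortedLine.end() && search != sortedSearchIds.end()) {
        if (*word < *search) {
            ++word;      // this word is below every remaining search ID
        } else if (*search < *word) {
            ++search;    // no remaining word can match this search ID
        } else {
            ++count;     // match: count it, stay on this search ID for repeats
            ++word;
        }
    }
    return count;
}
```

Each line is then handled in O(line length + number of search words) comparisons instead of their product.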
We can make the overall program much faster by optimizing the design:
increase the I/O buffer size
read the whole file in one shot (not line by line)
use a special encoding for the characters: convert all non-essential characters to a SPACE. This makes comparison really fast
use a special identifier for end of line, count it, and with that get the number of lines
store all words as std::string_view
also use std::string_view as the key of the hash map for the dictionary
build the hash map in the same loop in which words and end-of-line markers are identified. This reduces duplication of work
build rows with IDs for words, so that we can compare single integers instead of strings
sort all those rows of encoded words. This makes comparing very fast
use an optimized search-and-compare algorithm to count the matches per line
All this reduces the runtime of the whole program from the original roughly 40 s to about 4.5 s, so nearly ten times faster.
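To make the character-encoding step from the list more concrete, here is a small sketch of a 256-entry lookup table. The names makeConvertTable, kSpace and kEol are invented for this illustration, and the exact character classes are my assumption: ASCII letters and digits are kept (lowercased), '\n' becomes the end-of-line marker, and everything else, including bytes above 127, becomes the space marker. Both markers have the high bit set, so a single bit test tells "word character" from "separator".

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <cstdint>

constexpr std::uint8_t kSpace = 254;  // marker for "not part of a word"
constexpr std::uint8_t kEol   = 255;  // marker for end of line

// Hypothetical helper: builds the byte-to-byte conversion table.
std::array<std::uint8_t, 256> makeConvertTable()
{
    std::array<std::uint8_t, 256> table{};
    for (std::size_t c = 0; c < table.size(); ++c) {
        const auto ch = static_cast<unsigned char>(c);
        if (ch == '\n')
            table[c] = kEol;
        else if (ch < 128 && std::isalnum(ch))
            table[c] = static_cast<std::uint8_t>(std::tolower(ch));
        else
            table[c] = kSpace;  // assumption: all other bytes separate words
    }
    return table;
}
```

Applying the table is then one indexed load per byte, for example inside a std::transform over the whole text.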
We can see some astonishing results here:
Reading 430MB in 189 ms
And converting all this amount of data in 90 ms
Counting the number of lines in 80ms
Building a hash map with a size of 284k entries in 3.6 s
Sorting 5000 lines, each with many entries, in an unbelievable 367 ms
And doing the matching and counting in 13 ms
Please see an example of an output. I use an 11-year-old Windows 7 machine.
And the code:
#include <iostream>
#include <fstream>
#include <string>
#include <chrono>
#include <filesystem>
#include <cstdint>
#include <array>
#include <execution>
#include <unordered_map>
#include <string_view>
// Basic definitions for data types
using MyChar = uint8_t;
using EncoderType = unsigned int;
// Dependent data types
using String = std::basic_string<MyChar, std::char_traits<MyChar>, std::allocator<MyChar>>;
using StringView = std::basic_string_view<MyChar, std::char_traits<MyChar>>;
using IFStream = std::basic_ifstream<MyChar, std::char_traits<MyChar>>;
using Dictionary = std::unordered_map<StringView, EncoderType>;
using DictionaryIter = Dictionary::iterator;
using EncodedLine = std::vector<EncoderType>;
using EncodedLineIter = EncodedLine::iterator;
using EncodedLines = std::vector<EncodedLine>;
using SearchWords = std::vector<StringView>;
using SearchWordsEncoded = EncodedLine;
using CounterForMatchesInOneLine = std::size_t;
using CounterForMatchesForEachLineLine = std::vector<CounterForMatchesInOneLine>;
StringView operator"" _msv(const char* str, std::size_t len) { return StringView{ reinterpret_cast<const MyChar*>(str), len }; };
// Special encoding of values in text
constexpr MyChar SPACE = 254;
constexpr MyChar EOL = 255;
// Speed up reading of file by using larger input buffer
constexpr std::size_t IOBufSize = 5'000'000u;
static MyChar ioBuf[IOBufSize];
// For measuring durations
struct Timer {
std::chrono::time_point<std::chrono::high_resolution_clock> startTime{};
long long elapsedTime{};
void start() { startTime = std::chrono::high_resolution_clock::now(); }
void stop() { elapsedTime = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime).count(); }
friend std::ostream& operator << (std::ostream& os, const Timer& t) { return os << t.elapsedTime << " ms "; }
// Main program
int main() {
Timer t{}, tAll{}; tAll.start(); // Define Timers
Dictionary dictionary(300000); // The dictionary for words and their encoded IDs
EncoderType encodedWordIdentifier{}; // This is for encoding strings. It will be simply incremented for each new word
// The words that we want to search. We use string_views for more efficient processing
SearchWords searchWords{ "this"_msv, "blog"_msv, "political"_msv, "debate"_msv, "climate"_msv, "iphone"_msv };
// And here we will store the encoded search words
SearchWordsEncoded searchWordsEncoded{};
// Add words to dictionary
for (const StringView& searchWord : searchWords) {
dictionary[searchWord] = encodedWordIdentifier;
// Now read the complete text file and start all data processing
// Open file and check, if it could be opened
if (IFStream ifs{ "r:\\news.txt",std::ios::binary }; ifs) {
// To speed up reading of the file, we will set a bigger input buffer
ifs.rdbuf()->pubsetbuf(ioBuf, IOBufSize);
// Here we will store the complete file, all data
String text{};
// Get number of bytes in file
const std::uintmax_t size = std::filesystem::file_size("r:\\news.txt");
// Read the whole file with one statement. Will be ultrafast
text.resize(size); // Make room for the complete file before reading into it
ifs.read(text.data(), size);
t.stop(); std::cout << "Duration for reading complete file:\t\t\t\t" << t << "\tData read: " << ifs.gcount() << " bytes\n";
// Now convert characters. Set all non-essential characters to SPACE. Build lowercase text. Special mark for end of line
std::transform(std::execution::par, text.begin(), text.end(), text.begin(), [&](const MyChar c) {return Convert[c]; });
t.stop(); std::cout << "Duration for converting all text data:\t\t\t\t" << t << '\n';
// Count the number of lines. We need this to pre-allocate space for our vectors
std::size_t numberOfLines = std::count(std::execution::par, text.begin(), text.end(), EOL);
if (text.back() == EOL) ++numberOfLines;
t.stop(); std::cout << "Duration for counting number of lines:\t\t\t\t" << t << "\tNumber of lines identified: " <<numberOfLines << '\n';
// Now we can define the vector for the encoded lines with the exact needed size
EncodedLines encodedLines(numberOfLines);
// Start building the hash map. We will store string_views to optimize space
std::size_t wordLength{}; // Length of word that will be added to the hash map
MyChar* startWord{}; // Startposition (in the overall text) of the word to be added
bool waitForWord{ true }; // Mini state machine. Either we wait for start of word or its end
std::size_t index{}; // This will be used for addressing the current line
// Iterate over all characters from the text file
for (MyChar& c : text) {
if (waitForWord) { // If we are in state of waiting for the beginning of the next word
if (c & 0b1000'0000) { // If the character is either SPACE or EOL, continue to wait
if (c == EOL) ++index; // If we found an end of line, we address the next line from now on
else { // Else, we found a character, so the beginning of a new word
startWord = &c; // Remember start position (in complete text file) of word
wordLength = 1; // The word length is now already 1, because we have found the first character
waitForWord = false; // From now on we are "in" a word and wait for the end of the word, the next SPACE or EOL
else { // If we are in state of waiting for the end of the word
if (c & 0b1000'0000) { // If we have found a SPACE or EOL, then we found the end of a word
const StringView wordAsStringView{ startWord, wordLength }; // Build a string_view of the word
EncoderType currentEncodedWordIdentifier{ encodedWordIdentifier }; // Temporary for the next encoding ID
// Either add to the dictionary or use the existing encoding ID
if (DictionaryIter entry = dictionary.find(wordAsStringView); entry != dictionary.end())
currentEncodedWordIdentifier = entry->second; // Already existing ID found. use it
dictionary[wordAsStringView] = encodedWordIdentifier++; // Create new entry in the hash map
if (c == EOL) ++index; // If we have read an EOL, we will now address the next line
waitForWord = true; // We will change the state and from now on wait for the beginning of the next word again
++wordLength; // If we are in state of waiting for the end of the word and found a normal character, increment word length counter
t.stop(); std::cout << "Duration for building the dictionary and encode the lines:\t" << t << "Number of hashes : " << dictionary.size() << '\n';
// Sort all rows with line IDs. Will be very fast
std::for_each(std::execution::par, encodedLines.begin(), encodedLines.end(), [](std::vector<unsigned int>& encodedLine) { std::sort(encodedLine.begin(), encodedLine.end()); });
t.stop(); std::cout << "Duration for sorting all line id encodings:\t\t\t" << t << '\n';
// Now, we will count, how often a search word appears in a line
CounterForMatchesForEachLineLine counterForMatchesForEachLineLine{}; // Vector of match-counters for each lines
counterForMatchesForEachLineLine.reserve(numberOfLines); // Preallocate memory
const EncodedLineIter searchWordsEnd = searchWordsEncoded.end(); // Pointer to search word vector end
for (EncodedLine& encodedLine : encodedLines) // For all lines
CounterForMatchesInOneLine counterForMatchesInOneLine{}; // Counter for matches in current line
EncodedLineIter encodedLineCurrent = encodedLine.begin(); // Pointer to encoded value for current line
const EncodedLineIter encodedLineEnd = encodedLine.end(); // Pointer to last encoded value for current line
EncodedLineIter searchWordCurrent = searchWordsEncoded.begin(); // Pointer to beginning of search word IDs
// Compare and search. Take advantage of sorted IDs
while (encodedLineCurrent != encodedLineEnd && searchWordCurrent != searchWordsEnd) {
if (*encodedLineCurrent < *searchWordCurrent) {
else {
if (!(*searchWordCurrent < *encodedLineCurrent)) {
// Number of matches in this line has been detected. Store count for this line and continue with next line
t.stop(); std::cout << "Duration for searching, comparing and counting:\t\t\t" << t << '\n';
tAll.stop(); std::cout << "\n\nDuration Program processing overall: " << tAll << '\n';
// Debug output
std::cout << "\n\nDemo Result. First 100 counts of matches:\n";
int lineCounter{};
for (CounterForMatchesInOneLine counterForMatchesInOneLine : counterForMatchesForEachLineLine)
std::cout << "Count[" << lineCounter++ << "]: " << counterForMatchesInOneLine << '\n';
if (lineCounter > 100) break;
std::cerr << "\n***Error: Could not open file\n";
I'd try building a https://en.wikipedia.org/wiki/Radix_tree that contains all your search words. When processing each line of text you then only need to maintain one pointer into the radix tree for each character position, and advance all of them with every additionally consumed character (or remove a pointer if its character sequence can no longer reach a valid word). Whenever an advanced pointer points to the end of a word, you increment your counter.
This shouldn't require any tokenization.
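A rough sketch of that idea, using a plain (uncompressed) trie instead of a true radix tree to keep it short. TrieNode, insertWord and countOccurrences are names made up for this illustration. Note that without tokenization this also counts search words that appear inside longer words, and it is case-sensitive.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool endOfWord = false;
};

// Adds one search word to the trie, one node per character.
void insertWord(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->endOfWord = true;
}

// Counts occurrences of any inserted word in line, at every start position.
std::size_t countOccurrences(const TrieNode& root, const std::string& line)
{
    std::size_t count = 0;
    std::vector<const TrieNode*> active;  // one pointer per still-viable start
    std::vector<const TrieNode*> next;
    for (char c : line) {
        active.push_back(&root);          // a word may start at this character
        next.clear();
        for (const TrieNode* node : active) {
            auto it = node->children.find(c);
            if (it == node->children.end()) continue;  // dead end: drop pointer
            const TrieNode* child = it->second.get();
            if (child->endOfWord) ++count;
            next.push_back(child);
        }
        active.swap(next);
    }
    return count;
}
```

The number of active pointers is bounded by the length of the longest search word, so the work per input character stays small.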
You do not need to iterate over all the searchWordIds items. Assuming this array does not contain any duplicates, you can use a hash table so that the algorithm runs in O(n²) time rather than O(n³) time (thanks to an O(1) lookup in searchWordIds). More specifically, a std::unordered_set<int> can be used to check in constant time whether word is in searchWordIds. You need to convert searchWordIds to a std::unordered_set<int> first. If the array has duplicates, you can use a std::unordered_map<int, int> to store the number of duplicates associated with a given word. In that case, the two nested loops consist of doing count += searchWordIds[word].
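As an illustration of the std::unordered_set variant, here is a minimal sketch. The function name countPerLine is made up; the container types follow the question's code (unsigned int IDs), and the search IDs are assumed to be unique.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper: for each line, counts the words whose ID is in searchIds.
std::vector<std::size_t> countPerLine(
    const std::vector<std::vector<unsigned int>>& textAsNumbers,
    const std::unordered_set<unsigned int>& searchIds)
{
    std::vector<std::size_t> counts;
    counts.reserve(textAsNumbers.size());
    for (const auto& line : textAsNumbers) {
        std::size_t count = 0;
        for (unsigned int word : line)
            count += searchIds.count(word);  // 0 or 1, average O(1) lookup
        counts.push_back(count);
    }
    return counts;
}
```

The set can be built once per query with std::unordered_set<unsigned int>(ids.begin(), ids.end()) and reused for every line.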
If this is not enough, you can use a Bloom filter to speed up the lookup in searchWordIds. This probabilistic data structure can very quickly determine that word is not in searchWordIds (with certainty) or that it is probably in it (with good accuracy, assuming the Bloom filter is sufficiently large). This should be at least twice as fast, possibly even more (unordered_set and unordered_map are generally not very efficient, partly due to linked-list-based buckets and slow hash management).
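A toy sketch of what such a filter could look like for integer word IDs. Everything here (the class IdBloomFilter, the 65536-bit size, the two multiplicative hashes, the helper countWithFilter) is an assumption chosen for illustration and not tuned or benchmarked. The filter only rejects quickly; possible hits are still confirmed against the exact set.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

class IdBloomFilter {
public:
    void add(std::uint32_t id) {
        bits_.set(hash1(id));
        bits_.set(hash2(id));
    }
    // false: definitely not added; true: possibly added.
    bool mayContain(std::uint32_t id) const {
        return bits_.test(hash1(id)) && bits_.test(hash2(id));
    }
private:
    static constexpr std::size_t kBits = 1u << 16;
    // Two cheap multiplicative hashes; the top 16 bits index the bitset.
    static std::size_t hash1(std::uint32_t id) { return (id * 2654435761u) >> 16; }
    static std::size_t hash2(std::uint32_t id) { return ((id ^ (id >> 15)) * 2246822519u) >> 16; }
    std::bitset<kBits> bits_;
};

// Counts matching words in one line: filter first, exact set second.
std::size_t countWithFilter(const std::vector<unsigned int>& line,
                            const IdBloomFilter& filter,
                            const std::unordered_set<unsigned int>& searchIds)
{
    std::size_t count = 0;
    for (unsigned int word : line)
        if (filter.mayContain(word) && searchIds.count(word) != 0)
            ++count;
    return count;
}
```

Whether this beats a plain hash-set lookup depends on the data and has to be measured.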
If this is still not enough, you can parallelize the outermost loop. The idea is to compute a local count value for each section of the textAsNumbers array and then perform a final reduction. This assumes the sizes of the sub-arrays are relatively uniform (it will not scale well if one line is much bigger than all the others). You can flatten the vector<vector<int>> to better load-balance the work and probably even improve sequential performance (due to fewer indirections and likely fewer cache misses).
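The flattening part could look roughly like the sketch below: one contiguous vector of word IDs plus a vector of line start offsets. FlatText, flatten and countPerLineFlat are hypothetical names. The sketch is sequential; splitting the index range of the outer loop across threads would be the next step and is not shown.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

struct FlatText {
    std::vector<unsigned int> words;     // all word IDs, line after line
    std::vector<std::size_t> lineStart;  // line i is words[lineStart[i] .. lineStart[i + 1])
};

// Copies the nested vectors into one contiguous buffer plus offsets.
FlatText flatten(const std::vector<std::vector<unsigned int>>& lines)
{
    FlatText flat;
    flat.lineStart.reserve(lines.size() + 1);
    flat.lineStart.push_back(0);
    for (const auto& line : lines) {
        flat.words.insert(flat.words.end(), line.begin(), line.end());
        flat.lineStart.push_back(flat.words.size());
    }
    return flat;
}

// Counts matches per line by walking the flat buffer once.
std::vector<std::size_t> countPerLineFlat(const FlatText& flat,
                                          const std::unordered_set<unsigned int>& searchIds)
{
    const std::size_t lineCount = flat.lineStart.empty() ? 0 : flat.lineStart.size() - 1;
    std::vector<std::size_t> counts(lineCount, 0);
    for (std::size_t i = 0; i < lineCount; ++i)
        for (std::size_t w = flat.lineStart[i]; w < flat.lineStart[i + 1]; ++w)
            counts[i] += searchIds.count(flat.words[w]);
    return counts;
}
```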
In practice I would perhaps serialize the whole text into a std::unordered_map<std::string, int>, where the string is a word and the int is the count of that word in the text. That operation is about O(X), where X is the count of all words in the text, assuming individual words are short enough that hashing them does not matter. You said it is not time critical ... but just for the record.
After that, looking up a word is O(1), assuming again that a "word" is a relatively short string, and we already have the count of each word. If you have a list of words to search for, it is O(N), where N is the length of the list.
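A minimal sketch of building such a map, with a made-up function name countWords. The tokenization rule here (ASCII letters and digits form words, everything else separates them, words are lowercased) is my simplification and differs from the delimiter list in the question. It counts over the whole text; for per-line frequencies one map per line would be needed.

```cpp
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical helper: maps each lowercased word to its number of occurrences.
std::unordered_map<std::string, int> countWords(const std::string& text)
{
    std::unordered_map<std::string, int> counts;
    std::string word;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            ++counts[word];  // a separator ends the current word
            word.clear();
        }
    }
    if (!word.empty()) ++counts[word];  // last word without trailing separator
    return counts;
}
```

Looking up a keyword afterwards is a single counts.find(keyword) call.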
I have a vector containing strings that follow the format of text_number-number
Eg: Example_45-3
I only want the first number (45 in the example) and nothing else which I am able to do with my current code:
std::vector<std::string> imgNumStrVec;
for(size_t i = 0; i < StrVec.size(); i++){
    std::vector<std::string> seglist;
    std::stringstream ss(StrVec[i]);
    std::string seg, seg2;
    while(std::getline(ss, seg, '_')) seglist.push_back(seg);
    std::stringstream ss2(seglist[1]);
    std::getline(ss2, seg2, '-');
    imgNumStrVec.push_back(seg2); // keep only the first number
}
Are there more streamlined and simpler ways of doing this, and if so, what are they?
I ask purely out of a desire to learn how to code better; at the end of the day, the code above does successfully extract just the first number, but it seems long-winded and roundabout.
You can also use the built-in find_first_of and find_first_not_of to find the first "numberstring" in any string.
std::string first_numberstring(std::string const & str)
{
  char const* digits = "0123456789";
  std::size_t const n = str.find_first_of(digits);
  if (n != std::string::npos)
  {
    std::size_t const m = str.find_first_not_of(digits, n);
    return str.substr(n, m != std::string::npos ? m-n : m);
  }
  return std::string();
}
This should be more efficient than Ashot Khachatryan's solution. Note the use of '_' and '-' instead of "_" and "-". And also, the starting position of the search for '-'.
inline std::string mid_num_str(const std::string& s) {
    std::string::size_type p  = s.find('_');
    std::string::size_type pp = s.find('-', p + 2);
    return s.substr(p + 1, pp - p - 1);
}
If you need a number instead of a string, like what Alexandr Lapenkov's solution has done, you may also want to try the following:
inline long mid_num(const std::string& s) {
    return std::strtol(&s[s.find('_') + 1], nullptr, 10);
}
updated for C++11
(Important note on compiler regex support: for gcc, you need version 4.9 or later. I tested this on g++ versions 4.9[1] and 9.2. cppreference.com has an in-browser compiler that I used.)
Thanks to user #2b-t who found a bug in the c++11 code!
Here is the C++11 code:
#include <iostream>
#include <string>
#include <regex>
using std::cout;
using std::endl;
int main() {
std::string input = "Example_45-3";
std::string output = std::regex_replace(
cout << input << endl;
cout << output << endl;
boost solution that only requires C++98
Minimal implementation example that works on many strings (not just strings of the form "text_45-text"):
#include <iostream>
#include <string>
using namespace std;
#include <boost/regex.hpp>
int main() {
string input = "Example_45-3";
string output = boost::regex_replace(
cout << input << endl;
cout << output << endl;
console output:
Other example strings that this would work on:
"asdfasdf 45 sdfsdf"
"X = 45, sdfsdf"
For this example I used g++ on Linux with #include <boost/regex.hpp> and -lboost_regex. You could also use the C++11 <regex> library.
Feel free to edit my solution if you have a better regex.
If there aren't performance constraints, using Regex is ideal for this sort of thing because you aren't reinventing the wheel (by writing a bunch of string parsing code which takes time to write/test-fully).
Additionally if/when your strings become more complex or have more varied patterns regex easily accommodates the complexity. (The question's example pattern is easy enough. But often times a more complex pattern would take 10-100+ lines of code when a one line regex would do the same.)
Apparently full support for C++11 <regex> was implemented and released with g++ version 4.9.x, on Jun 26, 2015. Hat tip to SO questions #1 and #2 for figuring out the compiler version needing to be 4.9.x.
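For comparison, here is a small self-contained sketch that pulls out the first run of digits with std::regex_search. This is a simpler variant of the regex idea and not the regex_replace call from this answer; the function name firstNumberRegex and the pattern [0-9]+ are my own choices for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first run of digits in input, or "" if none.
std::string firstNumberRegex(const std::string& input)
{
    static const std::regex digits("[0-9]+");  // built once, reused on every call
    std::smatch match;
    if (std::regex_search(input, match, digits))
        return match.str(0);
    return std::string();
}
```

With this pattern, "Example_45-3", "asdfasdf 45 sdfsdf" and "X = 45, sdfsdf" all yield the string 45.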
Check this out
std::string ex = "Example_45-3";
int num;
sscanf( ex.c_str(), "%*[^_]_%d", &num );
I can think of two ways of doing it:
Use regular expressions
Use an iterator to step through the string, and copy each consecutive digit to a temporary buffer. Break when it reaches an unreasonable length or on the first non-digit after a string of consecutive digits. Then you have a string of digits that you can easily convert.
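The second option could be sketched like this. The function name firstDigitRun and the default limit of 18 digits (standing in for "an unreasonable length") are assumptions made for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: copies the first run of consecutive digits in s.
std::string firstDigitRun(const std::string& s, std::size_t maxDigits = 18)
{
    std::string digits;
    for (unsigned char c : s) {
        if (std::isdigit(c)) {
            digits += static_cast<char>(c);
            if (digits.size() >= maxDigits) break;  // unreasonable length: stop
        } else if (!digits.empty()) {
            break;  // first non-digit after a run of digits
        }
    }
    return digits;
}
```

The result can be passed to std::stoi or std::stoul once you have checked that it is not empty.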
std::string s = "Example_45-3";
std::size_t p1 = s.find("_");
std::size_t p2 = s.find("-");
std::string number = s.substr(p1 + 1, p2 - p1 - 1);
The 'best' way to do this in C++11 and later is probably using regular expressions, which combine high expressiveness and high performance when the test is repeated often enough.
The following code demonstrates the basics. You should #include <regex> for it to work.
// The example inputs
std::vector<std::string> inputs {
    "Example_0-0", "Example_0-1", "Example_0-2", "Example_0-3", "Example_0-4",
    "Example_1-0", "Example_1-1", "Example_1-2", "Example_1-3", "Example_1-4"
};
// The regular expression. A lot of the cost is incurred when building the
// std::regex object, but when it's reused a lot that cost is amortised.
std::regex imgNumRegex { "^[^_]+_([[:digit:]]+)-([[:digit:]]+)$" };
for (const auto &input: inputs){
    // This will contain the match results. Parts of the regular expression
    // enclosed in parentheses will be stored here, so in this case: both numbers
    std::smatch matchResults;
    if (!std::regex_match(input, matchResults, imgNumRegex)) {
        // Handle failure to match
        continue; // here: skip inputs that do not match
    }
    // Note that the first match is in str(1). str(0) contains the whole string
    std::string theFirstNumber = matchResults.str(1);
    std::string theSecondNumber = matchResults.str(2);
    std::cout << "The input had numbers " << theFirstNumber;
    std::cout << " and " << theSecondNumber << std::endl;
}
Using #Pixelchemist's answer and e.g. std::stoul:
bool getFirstNumber(std::string const & a_str, unsigned long & a_outVal)
{
    auto pos = a_str.find_first_of("0123456789");
    try
    {
        if (std::string::npos != pos)
        {
            a_outVal = std::stoul(a_str.substr(pos));
            return true;
        }
    }
    catch (...)
    {
        // handle conversion failure
        // ...
    }
    return false;
}