I am trying to optimize a text search, where I am searching for multiple words. I want to know the frequency of all the words, per line.
I have tried to make it as fast as I can, as I want to run the search many times, with multiple keywords, on the same data.
I still think, though, that there should be a more efficient way to solve this, so does anybody have good suggestions?
I have put up a simple demo to show the POC on gitlab:
https://gitlab.com/dkruithof/textfind
My current search time is 410 ms for 6 keywords in a dataset of 408 MB.
Also, the source of the demo is this:
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
#include <chrono>
using namespace std;
unsigned int addWord(std::map<std::string, unsigned int>& wordLookup, std::string word)
{
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
auto it = wordLookup.find(word);
unsigned int id;
if (it == wordLookup.end())
{
id = wordLookup.size(); //assign consecutive numbers using size()
wordLookup[word] = id;
}
else
{
id = it->second;
}
return id;
}
void tokenizeWords(std::map<std::string, unsigned int>& wordLookup, std::vector<unsigned int>& wordList, std::string& line)
{
static const char newsDelimiters[] = "., !?\"()'\n\r\t<>/\\";
// Copy into a mutable, null-terminated buffer for strtok
std::vector<char> str(line.size() + 1);
std::strcpy(str.data(), line.c_str());
// Getting the first token
char *token = strtok(str.data(), newsDelimiters);
while (token != NULL)
{
//finding a word:
unsigned int id = addWord(wordLookup, token);
wordList.push_back(id);
// Getting the next token
// If there are no tokens left, NULL is returned
token = strtok(NULL, newsDelimiters);
}
}
int main()
{
std::vector<std::vector<unsigned int>> textAsNumbers;
std::map<std::string, unsigned int> wordLookup;
std::vector<std::string> searchWords = {"this", "blog", "political", "debate", "climate", "iphone"};
unsigned int searchLength = searchWords.size();
std::vector<unsigned int> searchWordIds(searchLength);
//convert searchWords
unsigned int i = 0;
for(const std::string& word : searchWords)
{
searchWordIds[i] = addWord(wordLookup, word);
++i;
}
//#### This part is not time critical ####
//reading file and convert words to numbers
fstream newsFile;
newsFile.open("news.txt",ios::in);
if (newsFile.is_open())
{
string line;
while(getline(newsFile, line))
{
textAsNumbers.push_back(std::vector<unsigned int>());
std::vector<unsigned int>& wordList = *textAsNumbers.rbegin();
tokenizeWords(wordLookup, wordList, line);
}
newsFile.close();
}
//#### This part should be fast ####
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
counts.reserve(textAsNumbers.size());
for(std::vector<unsigned int>& line : textAsNumbers)
{
unsigned int count = 0;
for(unsigned int word : line)
{
for(unsigned int s = 0; s < searchLength; ++s)
{
unsigned int searchWord = searchWordIds[s];
if(word == searchWord)
{
++count;
}
}
}
counts.push_back(count);
}
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
cout << elapsed.count() << "ms" << endl;
//#### Print for checking result, time insensitive :)
int n = 0;
for(unsigned int count : counts)
{
cout << "Count[" << n << "]: " << count << endl;
++n;
if(n > 100)
{
break;
}
}
}
End results
I tried multiple approaches, and the scores are as follows:

Approach                    | User                     | Time
----------------------------|--------------------------|-------
Encoding words              | kcid42                   | 410 ms
Hash tables                 | Öö Tiib & Jérôme Richard | 135 ms
Ordered & encoded words     | A M                      | 13 ms
Hash tables & encoded words | Everybody                | 72 ms
I committed the results to my gitlab as well, if you want to check for yourself.
Analysis
Using hash tables to speed up the search is smart, and does indeed reduce the search time; better than my blunt approach, at least. But it still uses strings, and string comparison, construction, and hashing are rather slow.
A M's approach of speeding up the encoded word search is, I think, faster because of that.
I also tried combining the approaches, using hash tables and encoded words together, but that was still slower than A M's custom search.
So I think we learned that A M is pretty good at searching stuff.
Thanks everybody for your input!
If you just want to speed up the part that you marked, then you can get a drastic improvement by sorting all vectors before you enter this loop.
The searching will then be really fast.
The runtime of the loop is reduced from 490 ms to 10 ms.
Can you please check and give feedback?
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
#include <chrono>
unsigned int addWord(std::map<std::string, unsigned int>& wordLookup, std::string word)
{
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
auto it = wordLookup.find(word);
unsigned int id;
if (it == wordLookup.end())
{
id = wordLookup.size(); //assign consecutive numbers using size()
wordLookup[word] = id;
}
else
{
id = it->second;
}
return id;
}
void tokenizeWords(std::map<std::string, unsigned int>& wordLookup, std::vector<unsigned int>& wordList, std::string line)
{
static const char newsDelimiters[] = "., !?\"()'\n\r\t<>/\\";
// Getting the first token
#pragma warning(suppress : 4996)
char* token = strtok(line.data(), newsDelimiters);
while (token != NULL)
{
//finding a word:
unsigned int id = addWord(wordLookup, token);
wordList.push_back(id);
// Getting the next token
// If there are no tokens left, NULL is returned
#pragma warning(suppress : 4996)
token = strtok(NULL, newsDelimiters);
}
}
int main()
{
std::vector<std::vector<unsigned int>> textAsNumbers;
std::map<std::string, unsigned int> wordLookup;
std::vector<std::string> searchWords = { "this", "blog", "political", "debate", "climate", "iphone" };
unsigned int searchLength = searchWords.size();
std::vector<unsigned int> searchWordIds(searchLength);
//convert searchWords
unsigned int i = 0;
for (const std::string& word : searchWords)
{
searchWordIds[i] = addWord(wordLookup, word);
++i;
}
std::sort(searchWordIds.begin(), searchWordIds.end());
//#### This part is not time critical ####
//reading file and convert words to numbers
std::fstream newsFile;
newsFile.open("r:\\news.txt", std::ios::in);
if (newsFile.is_open())
{
std::string line;
while (std::getline(newsFile, line))
{
textAsNumbers.push_back(std::vector<unsigned int>());
std::vector<unsigned int>& wordList = *textAsNumbers.rbegin();
tokenizeWords(wordLookup, wordList, line);
std::sort(textAsNumbers.back().begin(), textAsNumbers.back().end());
}
newsFile.close();
}
#if 1
std::vector<unsigned int>::iterator last2 = searchWordIds.end();
//#### This part should be fast ####
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
counts.reserve(textAsNumbers.size());
for (std::vector<unsigned int>& line : textAsNumbers)
{
unsigned int count = 0;
std::vector<unsigned int>::iterator first1 = line.begin();
std::vector<unsigned int>::iterator last1 = line.end();
std::vector<unsigned int>::iterator first2 = searchWordIds.begin();
while (first1 != last1 && first2 != last2) {
if (*first1 < *first2) {
++first1;
}
else {
if (!(*first2 < *first1)) {
++count;
++first1;
}
else
++first2;
}
}
counts.push_back(count);
}
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms\n";
#else
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
counts.reserve(textAsNumbers.size());
for ( std::vector<unsigned int>& line : textAsNumbers)
{
unsigned int count = 0;
for (unsigned int word : line)
{
for (unsigned int s = 0; s < searchLength; ++s)
{
unsigned int searchWord = searchWordIds[s];
if (word == searchWord)
{
++count;
}
}
}
counts.push_back(count);
}
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms\n";
#endif
//#### Print for checking result, time insensitive :)
int n = 0;
for (unsigned int count : counts)
{
std::cout << "Count[" << n << "]: " << count << '\n';
++n;
if (n > 100)
{
break;
}
}
}
Edit:
We can make the overall program much faster by optimizing the design:
increase the IO buffer size
read the whole file in one shot (not line by line)
use a special encoding for the characters: convert all non-essential characters to a SPACE, which makes comparison really fast
use a special identifier for end-of-line, count it, and with that get the number of lines
store all words as std::string_view
use std::string_view also as the key for the dictionary's hash map
build the hash map in the same loop in which words and end-of-lines are identified, which avoids duplicating work
build rows of IDs for the words, so that we can compare single integers instead of strings
sort all those rows with all encoded words, which makes comparing very fast
use an optimized search-and-compare algorithm to count the matches per line
All this reduces the runtime of the whole program from the original roughly 40 s to ~4.5 s, so nearly ten times faster.
We can see some astonishing results here:
Reading 430 MB in 189 ms
Converting all this data in 90 ms
Counting the number of lines in 80 ms
Building a hash map with 284k entries in 3.6 s
Sorting 5000 lines, each with many entries, in an unbelievable 367 ms
Doing the matching and counting in 13 ms
Please see an example of the output. I use an 11-year-old Windows 7 machine.
And the code:
#include <iostream>
#include <fstream>
#include <string>
#include <chrono>
#include <filesystem>
#include <cstdint>
#include <array>
#include <execution>
#include <unordered_map>
#include <string_view>
// Basic definitions for data types
using MyChar = uint8_t;
using EncoderType = unsigned int;
// Dependent data types
using String = std::basic_string<MyChar, std::char_traits<MyChar>, std::allocator<MyChar>>;
using StringView = std::basic_string_view<MyChar, std::char_traits<MyChar>>;
using IFStream = std::basic_ifstream<MyChar, std::char_traits<MyChar>>;
using Dictionary = std::unordered_map<StringView, EncoderType>;
using DictionaryIter = Dictionary::iterator;
using EncodedLine = std::vector<EncoderType>;
using EncodedLineIter = EncodedLine::iterator;
using EncodedLines = std::vector<EncodedLine>;
using SearchWords = std::vector<StringView>;
using SearchWordsEncoded = EncodedLine;
using CounterForMatchesInOneLine = std::size_t;
using CounterForMatchesForEachLineLine = std::vector<CounterForMatchesInOneLine>;
StringView operator"" _msv(const char* str, std::size_t len) { return StringView{ reinterpret_cast<const MyChar*>(str), len }; };
// Special encoding of values in text
constexpr MyChar SPACE = 254;
constexpr MyChar EOL = 255;
constexpr std::array<MyChar, 256> Convert{ SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,EOL,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,48,49,50,51,52,53,54,55,56,57,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE
,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE
,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE
,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE };
// Speed up reading of file by using larger input buffer
constexpr std::size_t IOBufSize = 5'000'000u;
static MyChar ioBuf[IOBufSize];
// For measuring durations
struct Timer {
std::chrono::time_point<std::chrono::high_resolution_clock> startTime{};
long long elapsedTime{};
void start() { startTime = std::chrono::high_resolution_clock::now(); }
void stop() { elapsedTime = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime).count(); }
friend std::ostream& operator << (std::ostream& os, const Timer& t) { return os << t.elapsedTime << " ms "; }
};
// Main Programm
int main() {
Timer t{}, tAll{}; tAll.start(); // Define Timers
Dictionary dictionary(300000); // The dictionary for words and their encoded IDs
EncoderType encodedWordIdentifier{}; // This is for encoding strings. It will be simply incremented for each new word
// The words that we want to search. We use string_views for more efficient processing
SearchWords searchWords{ "this"_msv, "blog"_msv, "political"_msv, "debate"_msv, "climate"_msv, "iphone"_msv };
// And here we will store the encoded search words
SearchWordsEncoded searchWordsEncoded{};
// Add words to dictionary
for (const StringView& searchWord : searchWords) {
dictionary[searchWord] = encodedWordIdentifier;
searchWordsEncoded.push_back(encodedWordIdentifier++);
}
// Now read the complete text file and start all data processing
// Open file and check, if it could be opened
if (IFStream ifs{ "r:\\news.txt",std::ios::binary }; ifs) {
// To speed up reading of the file, we will set a bigger input buffer
ifs.rdbuf()->pubsetbuf(ioBuf, IOBufSize);
// Here we will store the complete file, all data
String text{};
// Get number of bytes in file
const std::uintmax_t size = std::filesystem::file_size("r:\\news.txt");
text.resize(size);
// Read the whole file with one statement. Will be ultrafast
t.start();
ifs.read(text.data(), size);
t.stop(); std::cout << "Duration for reading complete file:\t\t\t\t" << t << "\tData read: " << ifs.gcount() << " bytes\n";
// Now convert characters: set all non-essential characters to SPACE, build lowercase text, use a special mark for end of line
t.start();
std::transform(std::execution::par, text.begin(), text.end(), text.begin(), [&](const MyChar c) {return Convert[c]; });
t.stop(); std::cout << "Duration for converting all text data:\t\t\t\t" << t << '\n';
// Count the number of lines. We need this to pre-allocate space for our vectors
t.start();
std::size_t numberOfLines = std::count(std::execution::par, text.begin(), text.end(), EOL);
if (text.back() == EOL) ++numberOfLines;
t.stop(); std::cout << "Duration for counting number of lines:\t\t\t\t" << t << "\tNumber of lines identified: " <<numberOfLines << '\n';
// Now we can define the vector for the encoded lines with the exact needed size
EncodedLines encodedLines(numberOfLines);
// Start building the hash map. We will store string_views to optimize space
std::size_t wordLength{}; // Length of word that will be added to the hash map
MyChar* startWord{}; // Start position (in the overall text) of the word to be added
bool waitForWord{ true }; // Mini state machine. Either we wait for start of word or its end
std::size_t index{}; // This will be used for addressing the current line
t.start();
// Iterate over all characters from the text file
for (MyChar& c : text) {
if (waitForWord) { // If we are in state of waiting for the beginning of the next word
if (c & 0b1000'0000) { // if the character is either SPACE or EOL, continue to wait
if (c == EOL) ++index; // If we found an end of line, then we will address the next line from now on
}
else { // Else, we found a character, so the beginning of a new word
startWord = &c; // Remember start position (in complete text file) of word
wordLength = 1; // The word length is now already 1, because we have found the first character
waitForWord = false; // From now on we are "in" a word and wait for the end of the word, the next SPACE or EOL
}
}
else { // If we are in state of waiting for the end of the word
if (c & 0b1000'0000) { // If we have found a SPACE or EOL, then we found the end of a word
const StringView wordAsStringView{ startWord, wordLength }; // Build a string_view of the word
EncoderType currentEncodedWordIdentifier{ encodedWordIdentifier }; // Temporary; holds the ID that will be used for this word
// Either add to the dictionary or use the existing encoding ID
if (DictionaryIter entry = dictionary.find(wordAsStringView); entry != dictionary.end())
currentEncodedWordIdentifier = entry->second; // Already existing ID found. use it
else
dictionary[wordAsStringView] = encodedWordIdentifier++; // Create new entry in the hash map
encodedLines[index].push_back(currentEncodedWordIdentifier);
if (c == EOL) ++index; // If we have read an EOL, we will now address the next line
waitForWord = true; // We will change the state and from now on wait for the beginning of the next word again
}
else
++wordLength; // If we are in state of waiting for the end of the word and found a normal character, increment word length counter
}
}
t.stop(); std::cout << "Duration for building the dictionary and encode the lines:\t" << t << "Number of hashes : " << dictionary.size() << '\n';
// Sort all rows with encoded word IDs. Will be very fast
t.start();
std::for_each(std::execution::par, encodedLines.begin(), encodedLines.end(), [](std::vector<unsigned int>& encodedLine) { std::sort(encodedLine.begin(), encodedLine.end()); });
t.stop(); std::cout << "Duration for sorting all line id encodings:\t\t\t" << t << '\n';
// Now, we will count, how often a search word appears in a line
CounterForMatchesForEachLineLine counterForMatchesForEachLineLine{}; // Vector of match-counters, one per line
counterForMatchesForEachLineLine.reserve(numberOfLines); // Preallocate memory
const EncodedLineIter searchWordsEnd = searchWordsEncoded.end(); // Pointer to search word vector end
t.start();
for (EncodedLine& encodedLine : encodedLines) // For all lines
{
CounterForMatchesInOneLine counterForMatchesInOneLine{}; // Counter for matches in current line
EncodedLineIter encodedLineCurrent = encodedLine.begin(); // Pointer to encoded value for current line
const EncodedLineIter encodedLineEnd = encodedLine.end(); // Pointer to last encoded value for current line
EncodedLineIter searchWordCurrent = searchWordsEncoded.begin(); // Pointer to beginning of search word IDs
// Compare and search. Take advantage of sorted IDs
while (encodedLineCurrent != encodedLineEnd && searchWordCurrent != searchWordsEnd) {
if (*encodedLineCurrent < *searchWordCurrent) {
++encodedLineCurrent;
}
else {
if (!(*searchWordCurrent < *encodedLineCurrent)) {
++counterForMatchesInOneLine;
++encodedLineCurrent;
}
else
++searchWordCurrent;
}
}
// Number of matches in this line has been detected. Store count for this line and continue with next line
counterForMatchesForEachLineLine.push_back(counterForMatchesInOneLine);
}
t.stop(); std::cout << "Duration for searching, comparing and counting:\t\t\t" << t << '\n';
tAll.stop(); std::cout << "\n\nDuration Program processing overall: " << tAll << '\n';
// Debug output
std::cout << "\n\nDemo Result. First 100 counts of matches:\n";
int lineCounter{};
for (CounterForMatchesInOneLine counterForMatchesInOneLine : counterForMatchesForEachLineLine)
{
std::cout << "Count[" << lineCounter++ << "]: " << counterForMatchesInOneLine << '\n';
if (lineCounter > 100) break;
}
}
else
std::cerr << "\n***Error: Could not open file\n";
}
I'd try building a radix tree (https://en.wikipedia.org/wiki/Radix_tree) that contains all your search words. When processing each line of text you then only need to maintain one pointer into the radix tree per character position, and advance all of them with every additionally consumed character (or remove a pointer if its character sequence can no longer reach a valid word). Whenever an advanced pointer reaches the end of a word, you increment your counter.
This shouldn't require any tokenization.
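For illustration, here is a minimal sketch of that multi-cursor scan. It uses a plain trie (one node per character) rather than a compressed radix tree, assumes lowercase ASCII input, and counts substring matches (so "this" would also be counted inside "thistle"); all names are illustrative and not taken from the question's code:

#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::unique_ptr<TrieNode> child[26]{};
    bool isWord = false;
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& next = node->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    node->isWord = true;
}

// One cursor per possible match start; all cursors advance together and are
// dropped once their character sequence can no longer reach a valid word.
unsigned int countMatches(const TrieNode& root, const std::string& line) {
    unsigned int count = 0;
    std::vector<const TrieNode*> cursors;
    for (char c : line) {
        cursors.push_back(&root); // a new match may start at this position
        std::vector<const TrieNode*> alive;
        if (c >= 'a' && c <= 'z') {
            for (const TrieNode* node : cursors) {
                if (const TrieNode* next = node->child[c - 'a'].get()) {
                    if (next->isWord) ++count; // a search word ends here
                    alive.push_back(next);
                }
            }
        } // any other character drops all cursors, acting as a delimiter
        cursors.swap(alive);
    }
    return count;
}

int main() {
    TrieNode root;
    const std::vector<std::string> words{ "this", "blog", "political", "debate", "climate", "iphone" };
    for (const std::string& w : words) insert(root, w);
    std::cout << countMatches(root, "this blog covers the climate debate") << '\n'; // prints 4
}

For a large number of patterns, an Aho-Corasick automaton would avoid keeping one cursor per position altogether.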
You do not need to iterate over all the searchWordIds items. Assuming this array does not contain any duplicates, you can use a hash table so that the algorithm runs in O(n²) time rather than O(n³) time (thanks to an O(1) search in searchWordIds). More specifically, an std::unordered_set<int> can be used to check whether word is in searchWordIds in constant time. You need to convert searchWordIds to an std::unordered_set<int> first. If the array does have duplicates, then you can use an std::unordered_map<int, int> to store the number of duplicates associated with a given word; in that case the two nested loops reduce to doing count += duplicateCounts[word] with that map.
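As a concrete sketch (assuming the question's textAsNumbers and searchWordIds are already filled), the marked loop then becomes:

#include <unordered_set>
#include <vector>

// Build the set once, outside the timed loop; lookups are O(1) on average.
std::unordered_set<unsigned int> searchSet(searchWordIds.begin(), searchWordIds.end());

std::vector<unsigned int> counts; // end result, one count per line
counts.reserve(textAsNumbers.size());
for (const std::vector<unsigned int>& line : textAsNumbers)
{
    unsigned int count = 0;
    for (unsigned int word : line)
        count += searchSet.count(word); // 1 if word is a search word, 0 otherwise
    counts.push_back(count);
}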
If this is not enough, you can use a Bloom filter to speed up the lookup in searchWordIds. Indeed, this probabilistic data structure can very quickly tell when word is not in searchWordIds (100% sure) or say that it is probably in it (with good accuracy, assuming the Bloom filter is sufficiently large). This should be at least twice as fast, possibly even more (unordered_set and unordered_map are generally not very efficient, partially due to the use of linked-list-based buckets and slow hash management).
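A hedged sketch of such a filter over the integer word IDs; the bit count and the two mixing functions (splitmix64-style constants) are arbitrary choices for illustration, not something prescribed by this answer:

#include <bitset>
#include <cstdint>

struct BloomFilter {
    static constexpr std::size_t kBits = 1u << 16; // size/accuracy trade-off, tune as needed
    std::bitset<kBits> bits;

    // Two cheap integer mixers used as independent hash functions (illustrative)
    static std::size_t h1(std::uint64_t x) {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
        return static_cast<std::size_t>(x % kBits);
    }
    static std::size_t h2(std::uint64_t x) {
        x ^= x >> 29; x *= 0x94d049bb133111ebULL; x ^= x >> 32;
        return static_cast<std::size_t>(x % kBits);
    }

    void insert(unsigned int id) { bits.set(h1(id)); bits.set(h2(id)); }
    // false => definitely not present; true => possibly present
    bool maybeContains(unsigned int id) const { return bits.test(h1(id)) && bits.test(h2(id)); }
};

// Usage: reject most non-matching words cheaply, confirm the rest exactly.
// for (unsigned int word : line)
//     if (bloom.maybeContains(word) && searchSet.count(word) != 0)
//         ++count;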
If this is still not enough, you can parallelize the outermost loop. The idea is to compute a local count value for each section of the textAsNumbers array and then perform a final reduction. This assumes the size of the sub-arrays is relatively uniform (it will not scale well if one line is much, much bigger than all the others). You can flatten the vector<vector<int>> to better load-balance the work, and likely even improve sequential performance (due to fewer indirections and likely fewer cache misses).
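For instance, with C++17 parallel algorithms (a sketch reusing the searchSet from the snippet above; since each line's count goes into its own slot, no reduction over shared state is needed here):

#include <algorithm>
#include <execution>
#include <unordered_set>
#include <vector>

std::vector<unsigned int> counts(textAsNumbers.size()); // one independent slot per line
std::transform(std::execution::par,
               textAsNumbers.begin(), textAsNumbers.end(), counts.begin(),
               [&searchSet](const std::vector<unsigned int>& line) {
                   unsigned int count = 0;
                   for (unsigned int word : line)
                       count += searchSet.count(word); // concurrent reads of the set are safe
                   return count;
               });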
In practice I would perhaps serialize the whole text into an std::unordered_map<std::string, int>, where the string is a word and the int is the count of that word in the text. That operation is about O(X), where X is the count of all words in the text, assuming individual words are short enough that hashing them does not dominate. You said it is not time critical ... but just for the record.
After that, searching for a word in it is O(1), assuming again that a "word" means a relatively short string, and we already have the counts of those words. If you have a list of words to search, then it is O(N), where N is the length of the list.
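A minimal, self-contained sketch of that idea; the whitespace-only tokenization via std::istringstream is a simplification, not the question's delimiter set:

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const std::string text = "this blog debates climate this iphone blog this";

    // One pass over the text: word -> number of occurrences, roughly O(X).
    std::unordered_map<std::string, int> wordCounts;
    std::istringstream stream(text);
    std::string word;
    while (stream >> word)
        ++wordCounts[word];

    // Each query word is then an O(1) average-case lookup.
    const std::vector<std::string> queries{ "this", "blog", "iphone", "debate" };
    for (const std::string& q : queries) {
        auto it = wordCounts.find(q);
        std::cout << q << ": " << (it == wordCounts.end() ? 0 : it->second) << '\n';
    }
}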