Find digits in file names and cross reference them with others - c++

First, I will quickly describe my motivation for this and the actual problem:
I deal with large batches of files constantly and more specifically, I find myself having to rename them according to the following rule:
They may all contain words and digits, but only one set of digits is incrementing and not 'constant'. I need to extract those and only those digits and rename the files accordingly. For example:
Foo_1_Bar_2015.jpg
Foo_2_Bar_2015.jpg
Foo_03_Bar_2015.jpg
Foo_4_Bar_2015.jpg
Will be renamed:
1.jpg
2.jpg
3.jpg or 03.jpg (The leading zero can stay or go)
4.jpg
So what we start with is a vector with std::wstring objects for all the filenames in the specified directory. I urge you to stop reading for 3 minutes and think about how to approach this before I continue with my attempts and questions. I don't want my ideas to nudge you in one direction or another and I've always found fresh ideas are the best.
Now, here are two ways that I can think of:
1) Old style C string manipulation and comparisons:
In my mind, this entails parsing each filename and remembering each digit sequence position and length. This is easily stored in a vector or what-not for each file. This works well (basically uses string searches with increasing offsets):
while((offset = filename_.find_first_of(L"0123456789", offset)) != filename.npos)
{
size = filename.find_first_not_of(L"0123456789", offset) - offset;
digit_locations_vec.emplace_back(offset, size);
offset += size;
}
What I have after this is a vector of (Location, Size) pairs for all the digits in the filename, constant (by using the definition in the motivation) or not.
After this, chaos ensues, as you need to cross reference the strings and find out which digits are the ones that need to be extracted. This will grow exponentially with the number of files (which tends to be huge) not to mentioned multiplied by the number of digit sequences in each string. Also, not very readable, maintainable or elegant. No go.
2) Regular Expressions
If there was ever a use for regex's, it's this. Create a regex object out of the first filename and try to match it with what comes next. Success? Instant ability to extract the required number. Failure? Add the offending filename as a new regex object and try to match against the two existing regex's. Rinse and repeat. The regex's would look something like this:
Foo_(\d+)_Bar_(\d+).jpg
or create a regex for each digit sequence separately:
Foo_(\d+)_Bar_2015.jpg
Foo_1_Bar_(\d+).jpg
The rest is cake. Just keep on matching as you go, and in the best case, it might require only one pass! Question is...
What I need to know:
1) Can you think of any other superior way to achieve this? I've been banging my head against the wall for days.
2) Although the cost of string manipulation and vector constructing\destructing may be substantial in the first method, perhaps it pales in comparison to the cost of regex objects. Second method, worst case: as many regex objects as files. Would this be disastrous with potentially thousands of files?
3) The second method can be adjusted for one of two possibilities: Few std::regex object constructions, many regex_match calls or the other way around. Which is more expensive, the construction of the regex object or trying to match a string with it?

For me (gcc4.6.2 32-bit optimizations O3), manual string manipulation was about 2x faster than regular expressions. Not worth the cost.
Example runnable complete code (link with boost_system and boost_regex, or change include if you have regex in the compiler already):
#include <ctime>
#include <cctype>
#include <algorithm>
#include <string>
#include <iostream>
#include <vector>
#include <sstream>
using namespace std;
#include <boost/regex.hpp>
using namespace boost;
/*
Foo_1_Bar_2015.jpg
Foo_1_Bar_2016.jpg
Foo_2_Bar_2016.jpg
Foo_2_Bar_2015.jpg
...
*/
vector<string> generateNames(int lenPerYear, int yearStart, int years);
/*
Foo_1_Bar_2015.jpg -> 1_2015.jpg
Foo_7_Bar_2016.jpg -> 7_2016.jpg
*/
void rename_method_string(const vector<string> & names, vector<string> & renamed);
void rename_method_regex(const vector<string> & names, vector<string> & renamed);
typedef void rename_method_t(const vector<string> & names, vector<string> & renamed);
void testMethod(const vector<string> & names, const string & description, rename_method_t method);
int main()
{
vector<string> names = generateNames(10000, 2014, 100);
cout << "names.size() = " << names.size() << '\n';
cout << '\n';
testMethod(names, "method 1 - string manipulation: ", rename_method_string);
cout << '\n';
testMethod(names, "method 2 - regular expressions: ", rename_method_regex);
return 0;
}
void testMethod(const vector<string> & names, const string & description, rename_method_t method)
{
vector<string> renamed(names.size());
clock_t timeStart = clock();
method(names, renamed);
clock_t timeEnd = clock();
cout << "renamed examples:\n";
for (int i = 0; i < 10 && i < names.size(); ++i)
cout << names[i] << " -> " << renamed[i] << '\n';
cout << description << 1000 * (timeEnd - timeStart) / CLOCKS_PER_SEC << " ms\n";
}
vector<string> generateNames(int lenPerYear, int yearStart, int years)
{
vector<string> result;
for (int year = yearStart, yearEnd = yearStart + years; year < yearEnd; ++year)
{
for (int i = 0; i < lenPerYear; ++i)
{
ostringstream oss;
oss << "Foo_" << i << "_Bar_" << year << ".jpg";
result.push_back(oss.str());
}
}
return result;
}
template<typename T>
bool equal_safe(T itShort, T itShortEnd, T itLong, T itLongEnd)
{
if (itLongEnd - itLong < itShortEnd - itShort)
return false;
return equal(itShort, itShortEnd, itLong);
}
void rename_method_string(const vector<string> & names, vector<string> & renamed)
{
//manually: "Foo_(\\d+)_Bar_(\\d+).jpg" -> \1_\2.jpg
const string foo = "Foo_", bar = "_Bar_", jpg = ".jpg";
for (int i = 0; i < names.size(); ++i)
{
const string & name = names[i];
//starts with foo?
if (!equal_safe(foo.begin(), foo.end(), name.begin(), name.end()))
{
renamed[i] = "ERROR no foo";
continue;
}
//extract number
auto it = name.begin() + foo.size();
for (; it != name.end() && isdigit(*it); ++it) {}
string str_num1(name.begin() + foo.size(), it);
//continues with bar?
if (!equal_safe(bar.begin(), bar.end(), it, name.end()))
{
renamed[i] = "ERROR no bar";
continue;
}
//extract number
it += bar.size();
auto itStart = it;
for (; it != name.end() && isdigit(*it); ++it) {}
string str_num2(itStart, it);
//check *.jpg
if (!equal_safe(jpg.begin(), jpg.end(), it, name.end()))
{
renamed[i] = "ERROR no .jpg";
continue;
}
renamed[i] = str_num1 + "_" + str_num2 + ".jpg";
}
}
void rename_method_regex(const vector<string> & names, vector<string> & renamed)
{
regex searching("Foo_(\\d+)_Bar_(\\d+).jpg");
smatch found;
for (int i = 0; i < names.size(); ++i)
{
if (regex_search(names[i], found, searching))
{
if (3 != found.size())
renamed[i] = "ERROR weird match";
else
renamed[i] = found[1].str() + "_" + found[2].str() + ".jpg";
}
else renamed[i] = "ERROR no match";
}
}
It produces output for me:
names.size() = 1000000
renamed examples:
Foo_0_Bar_2014.jpg -> 0_2014.jpg
Foo_1_Bar_2014.jpg -> 1_2014.jpg
Foo_2_Bar_2014.jpg -> 2_2014.jpg
Foo_3_Bar_2014.jpg -> 3_2014.jpg
Foo_4_Bar_2014.jpg -> 4_2014.jpg
Foo_5_Bar_2014.jpg -> 5_2014.jpg
Foo_6_Bar_2014.jpg -> 6_2014.jpg
Foo_7_Bar_2014.jpg -> 7_2014.jpg
Foo_8_Bar_2014.jpg -> 8_2014.jpg
Foo_9_Bar_2014.jpg -> 9_2014.jpg
method 1 - string manipulation: 421 ms
renamed examples:
Foo_0_Bar_2014.jpg -> 0_2014.jpg
Foo_1_Bar_2014.jpg -> 1_2014.jpg
Foo_2_Bar_2014.jpg -> 2_2014.jpg
Foo_3_Bar_2014.jpg -> 3_2014.jpg
Foo_4_Bar_2014.jpg -> 4_2014.jpg
Foo_5_Bar_2014.jpg -> 5_2014.jpg
Foo_6_Bar_2014.jpg -> 6_2014.jpg
Foo_7_Bar_2014.jpg -> 7_2014.jpg
Foo_8_Bar_2014.jpg -> 8_2014.jpg
Foo_9_Bar_2014.jpg -> 9_2014.jpg
method 2 - regular expressions: 796 ms
Also, I think it's completely pointless, because actual I/O (getting file name, renaming file) will be much slower than any CPU string manipulation, in your example. So to answer your questions:
I don't see any superior way, I/O is what's slow, don't bother with superiority
regex object wasn't expensive in my experience, within 2x slow-down vs manual method, that's constant slow-down and negligible, compared to how much work it saves
How many std::regex objects for how many regex_match calls? Depends on the amount of regex_match calls: more matches there are, more it's worth to create specific std::regex object. However this will be very library-dependant. If there are lots of match calls, create separate, if you are not sure, do not bother.

Why dont you use split to split the string between letters and numbers:
Regex.Split(fileName, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");
then get whichever index you need for the numbers, maybe using a Where clause, to find the ones increasing in value, while the others indices are matching, then you can use .Last() to get the extension.

Related

Is there a better way than O(n³) to solve this text search?

I am trying to optimize a text search, where I am searching for multiple words. I want to know the frequency of all the words, per line.
I have tried to make it as fast as I can, as I want to run the search many times, with multiple keywords, on the same data.
I still am thinking though that there should be a more efficient way to solve this, so anybody has some good suggestions?
I have put up a simple demo to show the POC on gitlab:
https://gitlab.com/dkruithof/textfind
My current search time is 410ms on 6 keywords in a dataset of 408MB
Also, the source of the demo is this:
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
#include <chrono>
using namespace std;
unsigned int addWord(std::map<std::string, unsigned int>& wordLookup, std::string word)
{
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
auto it = wordLookup.find(word);
unsigned int id;
if (it == wordLookup.end())
{
id = wordLookup.size(); //assign consecutive numbers using size()
wordLookup[word] = id;
}
else
{
id = it->second;
}
return id;
}
void tokenizeWords(std::map<std::string, unsigned int>& wordLookup, std::vector<unsigned int>& wordList, std::string& line)
{
static const char newsDelimiters[] = "., !?\"()'\n\r\t<>/\\";
char str[line.size()];
strncpy(str, line.c_str(), line.size());
// Getting the first token
char *token = strtok(str, newsDelimiters);
while (token != NULL)
{
//finding a word:
unsigned int id = addWord(wordLookup, token);
wordList.push_back(id);
// Getting the next token
// If there are no tokens left, NULL is returned
token = strtok(NULL, newsDelimiters);
}
}
int main()
{
std::vector<std::vector<unsigned int>> textAsNumbers;
std::map<std::string, unsigned int> wordLookup;
std::vector<std::string> searchWords = {"this", "blog", "political", "debate", "climate", "iphone"};
unsigned int searchLength = searchWords.size();
unsigned int searchWordIds[searchLength];
//convert searchWords
unsigned int i = 0;
for(const std::string& word : searchWords)
{
searchWordIds[i] = addWord(wordLookup, word);
++i;
}
//#### This part is not time critical ####
//reading file and convert words to numbers
fstream newsFile;
newsFile.open("news.txt",ios::in);
if (newsFile.is_open())
{
string line;
while(getline(newsFile, line))
{
textAsNumbers.push_back(std::vector<unsigned int>());
std::vector<unsigned int>& wordList = *textAsNumbers.rbegin();
tokenizeWords(wordLookup, wordList, line);
}
newsFile.close();
}
//#### This part should be fast ####
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
counts.reserve(textAsNumbers.size());
for(std::vector<unsigned int>& line : textAsNumbers)
{
unsigned int count = 0;
for(unsigned int word : line)
{
for(unsigned int s = 0; s < searchLength; ++s)
{
unsigned int searchWord = searchWordIds[s];
if(word == searchWord)
{
++count;
}
}
}
counts.push_back(count);
}
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
cout << elapsed.count() << "ms" << endl;
//#### Print for checking result, time insensitive :)
int n = 0;
for(unsigned int count : counts)
{
cout << "Count[" << n << "]: " << count << endl;
++n;
if(n > 100)
{
break;
}
}
}
End results
I tried the multiple approaches, and the scores are as following:
Approach
User
Time
Encoding words
kcid42
410 ms
Hash tables
Öö Tiib & Jérôme Richard
135 ms
Ordered & encoded words
A M
13 ms
Hash tables & encoded words
Everybody
72 ms
The committed the results also to my gitlab, if you want to check for yourself.
Analysis
Using hash tables to speed up the search is smart, and does indeed reduce the search time. Better than my blunt approach at least. But it is still using strings, and string comparisons / construction / hashing is rather slow.
The approach of A M to speed up the encoded word search is I think faster because of that.
I have also tried to combine the approaches, to use the hash tables and encoded words together, but that was still slower than A M's custom search.
So I think we learned that A M is pretty good at searching stuff.
Thanks everybody for your input!
If you just want to speed up the part that you marked, then you can get a drastical improvement by sorting all vectors, before you enter this loop.
The searching will be really superfast.
The runtime of the loop will be reduced from 490ms to 10ms.
Can you please check and feed back.
#include <iostream>
#include <fstream>
#include <cstring>
#include <string>
#include <map>
#include <algorithm>
#include <vector>
#include <chrono>
#include <algorithm>
unsigned int addWord(std::map<std::string, unsigned int>& wordLookup, std::string word)
{
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
auto it = wordLookup.find(word);
unsigned int id;
if (it == wordLookup.end())
{
id = wordLookup.size(); //assign consecutive numbers using size()
wordLookup[word] = id;
}
else
{
id = it->second;
}
return id;
}
void tokenizeWords(std::map<std::string, unsigned int>& wordLookup, std::vector<unsigned int>& wordList, std::string line)
{
static const char newsDelimiters[] = "., !?\"()'\n\r\t<>/\\";
#pragma warning(suppress : 4996)
// Getting the first token
#pragma warning(suppress : 4996)
char* token = strtok(line.data(), newsDelimiters);
while (token != NULL)
{
//finding a word:
unsigned int id = addWord(wordLookup, token);
wordList.push_back(id);
// Getting the next token
// If there are no tokens left, NULL is returned
#pragma warning(suppress : 4996)
token = strtok(NULL, newsDelimiters);
}
}
int main()
{
std::vector<std::vector<unsigned int>> textAsNumbers;
std::map<std::string, unsigned int> wordLookup;
std::vector<std::string> searchWords = { "this", "blog", "political", "debate", "climate", "iphone" };
unsigned int searchLength = searchWords.size();
std::vector<unsigned int> searchWordIds(searchLength);
//convert searchWords
unsigned int i = 0;
for (const std::string& word : searchWords)
{
searchWordIds[i] = addWord(wordLookup, word);
++i;
}
std::sort(searchWordIds.begin(), searchWordIds.end());
//#### This part is not time critical ####
//reading file and convert words to numbers
std::fstream newsFile;
newsFile.open("r:\\news.txt", std::ios::in);
if (newsFile.is_open())
{
std::string line;
while (std::getline(newsFile, line))
{
textAsNumbers.push_back(std::vector<unsigned int>());
std::vector<unsigned int>& wordList = *textAsNumbers.rbegin();
tokenizeWords(wordLookup, wordList, line);
std::sort(textAsNumbers.back().begin(), textAsNumbers.back().end());
}
newsFile.close();
}
#if 1
std::vector<unsigned int>::iterator last2 = searchWordIds.end();
//#### This part should be fast ####
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
counts.reserve(textAsNumbers.size());
for (std::vector<unsigned int>& line : textAsNumbers)
{
unsigned int count = 0;
std::vector<unsigned int>::iterator first1 = line.begin();
std::vector<unsigned int>::iterator last1 = line.end();
std::vector<unsigned int>::iterator first2 = searchWordIds.begin();
while (first1 != last1 && first2 != last2) {
if (*first1 < *first2) {
++first1;
}
else {
if (!(*first2 < *first1)) {
++count;
++first1;
}
else
++first2;
}
}
counts.push_back(count);
}
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms\n";
#else
auto start = std::chrono::system_clock::now();
std::vector<unsigned int> counts; //end result
counts.reserve(textAsNumbers.size());
for ( std::vector<unsigned int>& line : textAsNumbers)
{
unsigned int count = 0;
for (unsigned int word : line)
{
for (unsigned int s = 0; s < searchLength; ++s)
{
unsigned int searchWord = searchWordIds[s];
if (word == searchWord)
{
++count;
}
}
}
counts.push_back(count);
}
auto end = std::chrono::system_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << elapsed.count() << "ms\n";
#endif
//#### Print for checking result, time insensitive :)
int n = 0;
for (unsigned int count : counts)
{
std::cout << "Count[" << n << "]: " << count << '\n';
++n;
if (n > 100)
{
break;
}
}
}
.
Edit:
We can make the overall program much more faster by optimizing the design:
increase the IO-Buffer size
read the whole file in one shot (not line by line)
use a special encryption for the characters. Convert all none-essential characters to a SPACE. This will make comparison really fast
use special identifier for End-Of-Line, count it and with that get the number of lines
store all words as std::string_view
also the key for the hash map for the dictionary will be a std::string_view
build the hash map in the same loop where words and End-Of_lines will be identified. This reduces duplication of work
Build rows with IDs for words, so that we can compare single integers instead of strings
Sort all those rows will all encoded words. This will make comparing very fast
Use optimized search and compare algorithm to count the matches per line
All this will reduce the runtime for the whole program from the original roughly 40s to ~4.5s. So, nearly ten times faster.
We can see some astonishing results here:
Reading 430MB in 189 ms
And converting all this amount of data in 90 ms
Counting the number of lines in 80ms
Building a hash map with a size of 284k entries in 3.6 s
Sorting 5000 lines with each many entries in unbelievable 367 ms
And doing the matching and counting in 13 ms
Please see an example of an output. I use a 11 years old Windows 7 machine.
And the code:
#include <iostream>
#include <fstream>
#include <string>
#include <chrono>
#include <filesystem>
#include <cstdint>
#include <array>
#include <execution>
#include <unordered_map>
#include <string_view>
// Basic definitions for data types
using MyChar = uint8_t;
using EncoderType = unsigned int;
// Dependent data types
using String = std::basic_string<MyChar, std::char_traits<MyChar>, std::allocator<MyChar>>;
using StringView = std::basic_string_view<MyChar, std::char_traits<MyChar>>;
using IFStream = std::basic_ifstream<MyChar, std::char_traits<MyChar>>;
using Dictionary = std::unordered_map<StringView, EncoderType>;
using DictionaryIter = Dictionary::iterator;
using EncodedLine = std::vector<EncoderType>;
using EncodedLineIter = EncodedLine::iterator;
using EncodedLines = std::vector<EncodedLine>;
using SearchWords = std::vector<StringView>;
using SearchWordsEncoded = EncodedLine;
using CounterForMatchesInOneLine = std::size_t;
using CounterForMatchesForEachLineLine = std::vector<CounterForMatchesInOneLine>;
StringView operator"" _msv(const char* str, std::size_t len) { return StringView{ reinterpret_cast<const MyChar*>(str), len }; };
// Special encoding of values in text
constexpr MyChar SPACE = 254;
constexpr MyChar EOL = 255;
constexpr std::array<MyChar, 256> Convert{ SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,EOL,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,48,49,50,51,52,53,54,55,56,57,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE
,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE
,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE
,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE,SPACE };
// Speed up reading of file by using larger input buffer
constexpr std::size_t IOBufSize = 5'000'000u;
static MyChar ioBuf[IOBufSize];
// For measuring durations
struct Timer {
std::chrono::time_point<std::chrono::high_resolution_clock> startTime{};
long long elapsedTime{};
void start() { startTime = std::chrono::high_resolution_clock::now(); }
void stop() { elapsedTime = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime).count(); }
friend std::ostream& operator << (std::ostream& os, const Timer& t) { return os << t.elapsedTime << " ms "; }
};
// Main Programm
int main() {
Timer t{}, tAll{}; tAll.start(); // Define Timers
Dictionary dictionary(300000); // The dictionory for words and their encoded IS
EncoderType encodedWordIdentifier{}; // This is for encoding strings. It will be simply incremented for each new word
// The words that we want to search. We use string_views for more efficient processing
SearchWords searchWords{ "this"_msv, "blog"_msv, "political"_msv, "debate"_msv, "climate"_msv, "iphone"_msv };
// And here we will store the encoded search words
SearchWordsEncoded searchWordsEncoded{};
// Add words to dictionary
for (const StringView& searchWord : searchWords) {
dictionary[searchWord] = encodedWordIdentifier;
searchWordsEncoded.push_back(encodedWordIdentifier++);
}
// Now read the complete text file and start all fata processing
// Open file and check, if it could be opened
if (IFStream ifs{ "r:\\news.txt",std::ios::binary }; ifs) {
// To speed up reading of the file, we will set a bigger input buffer
ifs.rdbuf()->pubsetbuf(ioBuf, IOBufSize);
// Here we will store the complete file, all data
String text{};
// Get number of bytes in file
const std::uintmax_t size = std::filesystem::file_size("r:\\news.txt");
text.resize(size);
// Read the whole file with one statement. Will be ultrafast
t.start();
ifs.read(text.data(), size);
t.stop(); std::cout << "Duration for reading complete file:\t\t\t\t" << t << "\tData read: " << ifs.gcount() << " bytes\n";
// No convert characters. Set all none essential characters to space. Build lowercase text. Special Mark for end of line
t.start();
std::transform(std::execution::par, text.begin(), text.end(), text.begin(), [&](const MyChar c) {return Convert[c]; });
t.stop(); std::cout << "Duration for converting all text data:\t\t\t\t" << t << '\n';
// Count the number of lines. We need this to pre-allocate space for our vectors
t.start();
std::size_t numberOfLines = std::count(std::execution::par, text.begin(), text.end(), EOL);
if (text.back() == EOL) ++numberOfLines;
t.stop(); std::cout << "Duration for counting number of lines:\t\t\t\t" << t << "\tNumber of lines identified: " <<numberOfLines << '\n';
// Now we can define the vector for the encoded lines with the exact needed size
EncodedLines encodedLines(numberOfLines);
// Start building the hash map. We will store string_views to optimize space
std::size_t wordLength{}; // Length of word that will be added to the hash map
MyChar* startWord{}; // Startposition (in the overall text) of the word to be added
bool waitForWord{ true }; // Mini state machine. Either we wait for start of word or its end
std::size_t index{}; // This will be used for addressing the current line
t.start();
// Iterate over all characters from the text file
for (MyChar& c : text) {
if (waitForWord) { // If we are in state of waiting for the beginning of the next word
if (c & 0b1000'0000) { // if the charcter is either space or end of line, continue to wait
if (c == EOL) ++index; // If we foound an end of line, then we will address the next line from now one
}
else { // Else, we found a character, so the beginning of a new word
startWord = &c; // Remember start position (in complete text file) of word
wordLength = 1; // The word length is now already 1, because we have foound the first character
waitForWord = false; // From now on we are "in" a word and wait for the end of the word, the next SPACE or EOL
}
}
else { // If we are in state of waiting for the end of the word
if (c & 0b1000'0000) { // If we have found a SPACE or EOL, then we found the end of a word
const StringView wordAsStringView{ startWord, wordLength }; // Build a string_view of the word
EncoderType currentEncodedWordIdentifier{ encodedWordIdentifier }; // Temporaray for the next encoding if
// Either add to dictioanry of use existing encoding ID
if (DictionaryIter entry = dictionary.find(wordAsStringView); entry != dictionary.end())
currentEncodedWordIdentifier = entry->second; // Already existing ID found. use it
else
dictionary[wordAsStringView] = encodedWordIdentifier++; // Create new entry in the hash map
encodedLines[index].push_back(currentEncodedWordIdentifier);
if (c == EOL) ++index; // If we have read an EOL, we will now address the next line
waitForWord = true; // We will change the state and from now on wait for the beginning of the next word again
}
else
++wordLength; // If we are in state of waiting for the end of the word and found a normal character, increment word length counter
}
}
t.stop(); std::cout << "Duration for building the dictionary and encode the lines:\t" << t << "Number of hashes : " << dictionary.size() << '\n';
// Sort all rows with line ideas. Will be very fast
t.start();
std::for_each(std::execution::par, encodedLines.begin(), encodedLines.end(), [](std::vector<unsigned int>& encodedLine) { std::sort(encodedLine.begin(), encodedLine.end()); });
t.stop(); std::cout << "Duration for sorting all line id encodings:\t\t\t" << t << '\n';
// Now, we will count, how often a search word appears in a line
CounterForMatchesForEachLineLine counterForMatchesForEachLineLine{}; // Vector of match-counters for each lines
counterForMatchesForEachLineLine.reserve(numberOfLines); // Preallocate memory
const EncodedLineIter searchWordsEnd = searchWordsEncoded.end(); // Pointer to search word vector end
t.start();
for (EncodedLine& encodedLine : encodedLines) // For all lines
{
CounterForMatchesInOneLine counterForMatchesInOneLine{}; // Counter for matches in current line
EncodedLineIter encodedLineCurrent = encodedLine.begin(); // Pointer to encoded value for current line
const EncodedLineIter encodedLineEnd = encodedLine.end(); // Pointer to last encoded value for current line
EncodedLineIter searchWordCurrent = searchWordsEncoded.begin(); // Pointer to beginning of search word IDs
// Compare and search. Take advantage of sorted IDs
while (encodedLineCurrent != encodedLineEnd && searchWordCurrent != searchWordsEnd) {
if (*encodedLineCurrent < *searchWordCurrent) {
++encodedLineCurrent;
}
else {
if (!(*searchWordCurrent < *encodedLineCurrent)) {
++counterForMatchesInOneLine;
++encodedLineCurrent;
}
else
++searchWordCurrent;
}
}
// Number of matches in this line has been detected. Store count for this line and continue with next line
counterForMatchesForEachLineLine.push_back(counterForMatchesInOneLine);
}
t.stop(); std::cout << "Duration for searching, comparing and counting:\t\t\t" << t << '\n';
tAll.stop(); std::cout << "\n\nDuration Program processing overall: " << tAll << '\n';
// Debug output
std::cout << "\n\nDemo Result. First 100 counts of matches:\n";
int lineCounter{};
for (CounterForMatchesInOneLine counterForMatchesInOneLine : counterForMatchesForEachLineLine)
{
std::cout << "Count[" << lineCounter++ << "]: " << counterForMatchesInOneLine << '\n';
if (lineCounter > 100) break;
}
}
else
std::cerr << "\n***Error: Could not open file\n";
}
I'd try building a https://en.wikipedia.org/wiki/Radix_tree that contains all your search words. When processing each line of text you then only need one maintain one pointer into the radix tree for each character position, and need to advance all of them with every additionally consumed character (or remove the pointer of the character sequence can no longer reach a valid word). Whenever an advanced pointer points to the end of a word, you increment your counter.
This shouldn't require any tokenization.
You do not need to iterate over all the searchWordIds items. Assuming this array do no contains any duplicates, you can use hash table for that so to make the algorithm runs in O(n²) time rather than O(n³) time (thanks to a O(1) search in searchWordIds). More specifically, an std::unordered_set<int> can be used so to check if word is in searchWordIds in constant time. You need to convert searchWordIds to a std::unordered_set<int> first. If the array has duplicates, then you can use a std::unordered_map<int, int> so to store the number of duplicates associated to a given word. The 2 nested loops consist in doing count += searchWordIds[word] in this last case.
If this is not enthough, you can use a Bloom filter so to speed up the lookup in searchWordIds. Indeed, this probabilistic data structure can very quickly find if word is not in searchWordIds (100% sure) or say if it is certainly in it (with a good accuracy assuming the bloom filter is sufficiently large). This should be at least twice faster. Possibly even more (the unordered_set and unordered_map are generally not very efficient, partially due to the use of linked-list-based buckets and a slow hash management).
If this is still not enough, you can parallelize the outermost loop. The idea is to compute a local count value for each section of the textAsNumbers array and then perform a final reduction. This assume the size of the sub arrays is relatively uniform (it will not scale well if one line is much much bigger than all others). You can flatten the vector<vector<int>> so to better load-balance the work and certainly even improve the performance in sequential (due to less indirections and likely less cache misses).
In practice I would perhaps serialize the whole text into std::unordered_map<std::string, int>. There string is word and int is count of that word in text. That operation is about O(X) where X is count of all words in text assuming that individual words are too short for hashing of those to matter. You said it is not time critical ... but just for the record.
After that searching a word in it is O(1) assuming again that the "word" means relatively short string and also we already have count of those words. If you have a list of words to search then it is O(N) where N is length of list.

Insert characters between a string

I am faced with a simple yet complex challenge today.
In my program, I wish to insert a - character every three characters of a string. How would this be accomplished? Thank you for your help.
#include <iostream>
int main()
{
std::string s = "thisisateststring";
// Desired output: thi-sis-ate-sts-tri-ng
std::cout << s << std::endl;
return 0;
}
There is no need to "build a new string".
Loop a position iteration, starting at 3, incrementing by 4 with each pass, inserting a - at the position indicated. Stop when the next insertion point would breach the string (which has been growing by one with each pass, thus the need for the 4 slot skip):
#include <iostream>
#include <string>
int main()
{
std::string s = "thisisateststring";
for (std::string::size_type i=3; i<s.size(); i+=4)
s.insert(i, 1, '-');
// Desired output: thi-sis-ate-sts-tri-ng
std::cout << s << std::endl;
return 0;
}
Output
thi-sis-ate-sts-tri-ng
just take an empty string and append "-" at every count divisible by 3
#include <iostream>
int main()
{
std::string s = "thisisateststring";
std::string res="";
int count=0;
for(int i=0;i<s.length();i++){
count++;
res+=s[i];
if(count%3==0){
res+="-";
}
}
std::cout << res << std::endl;
return 0;
}
output
thi-sis-ate-sts-tri-ng
A general (and efficient) approach is to build a new string by iterating character-by-character over the existing one, making any desired changes as you go. In this case, every third character you can insert a hyphen:
std::string result;
result.reserve(s.size() + s.size() / 3);
for (size_t i = 0; i != s.size(); ++i) {
if (i != 0 && i % 3 == 0)
result.push_back('-');
result.push_back(s[i]);
}
Simple. Iterate the string and build a new one
Copy each character from the old string to the new one and every time you've copied 3 characters add an extra '-' to the end of the new string and restart your count of copied characters.
Like 99% problems with text, this one can be solved with a regular expression one-liner:
std::regex_replace(input, std::regex{".{3}"}, "$&-")
However, it brings not one, but two new problems:
it is not a very performant solution
regex library is huge and bloats resulting binary
So think twice.
You could write a simple functor to add the hyphens, like this:
#include <iostream>
struct inserter
{
unsigned n = 0u;
void operator()(char c)
{
std::cout << c;
if (++n%3 == 0) std::cout << '-';
}
};
This can be passed to the standard for_each() algorithm:
#include <algorithm>
int main()
{
const std::string s = "thisisateststring";
std::for_each(s.begin(), s.end(), inserter());
std::cout << std::endl;
}
Exercise: extend this class to work with different intervals, output streams, replacement characters and string types (narrow or wide).

searching pattern in a string

I am really having a hard time getting recursions but i tried recursion to match a pattern inside a string.
Suppose i have a string geeks for geeks and i have a pattern eks to match.I could use many methods out there like regex, find method of string class but i really want to do this thing by recursions.
To achieve this i tried this code:
void recursion(int i,string str)
{
if(!str.compare("eks"))
cout<<"pattern at :"<<i<<'\n';
if(i<str.length() && str.length()-1!=0)
recursion(i,str.substr(i,str.length()-1));
}
int main()
{
string str("geeks for geeks");
for(int i=0;i<str.length();i++)
recursion(i,str.substr(i,str.length()));
}
Output :
Desired Ouput :
pattern at 2
pattern at 12
What could i be doing wrong here and what would be a good way to do this with recursions?
I understood a lot of topics in cpp but with recursions , i know how they work and even with that whenever i try to code something with recursions , it never works.Could there be any place that could help me with recursions as well?
You will never get pattern at 2, since compare doesn't work like that. Ask yourself, what will
std::string("eks for geeks").compare("eks")
return? Well, according to the documentation, you will get something positive, since "eks for geeks" is longer than "eks". So your first step is to fix this:
void recursion(int i, std::string str){
if(!str.substr(0,3).compare("eks")) {
std::cout << "pattern at: " << i << '\n';
}
Next, we have to recurse. But there's something off. i should be the current position of your "cursor". Therefore, you should advance it:
i = i + 1;
And if we reduce the length of the string in every iteration, we must not test i < str.length, otherwise we won't check the later half of the string:
if(str.length() - 1 > 0) {
recursion(i, str.substr(1));
}
}
Before we actually compile this code, let's reason about it:
we have a substring of the correct length for comparison with "eks"
we never use i except for the current position
we advance the position before we recurse
we "advance" the string by removing the first character
we will end up with an empty string at some point
Seems reasonable:
#include <iostream>
#include <string>
void recursion(int i, std::string str){
if(!str.substr(0,3).compare("eks")) {
std::cout << "pattern at: " << i << '\n';
}
i = i + 1;
if(str.length() - 1 > 0) {
recursion(i, str.substr(1));
}
}
int main () {
recursion(0, "geeks for geeks");
return 0;
}
Output:
pattern at: 2
pattern at: 12
However, that's not optimal. There are several optimizations that are possible. But that's left as an exercise.
Exercises
compare needs to use substr due to it's algorithm. Write your own comparison function that doesn't need substr.
There's a lot of copying going on. Can you get rid of that?
The for loop was wrong. Why?
Recursive function must not run into loop. And you have some mistakes.Try this code.
void recursion(string str, string subStr, int i){
if(str.find(subStr) != string::npos ) {
int pos = str.find(subStr);
str = str.substr(pos + subStr.length(), str.length()-1);
cout << "pattern at " << (pos + i) << endl;
recursion(str, subStr, pos+subStr.length() );
}
}
int main(int argc, char** argv) {
string str("geeks for geeks");
string subStr("eks");
recursion(str, subStr, 0);
return 0;
}

C++ Extract number from the middle of a string

I have a vector containing strings that follow the format of text_number-number
Eg: Example_45-3
I only want the first number (45 in the example) and nothing else which I am able to do with my current code:
std::vector<std::string> imgNumStrVec;
for(size_t i = 0; i < StrVec.size(); i++){
std::vector<std::string> seglist;
std::stringstream ss(StrVec[i]);
std::string seg, seg2;
while(std::getline(ss, seg, '_')) seglist.push_back(seg);
std::stringstream ss2(seglist[1]);
std::getline(ss2, seg2, '-');
imgNumStrVec.push_back(seg2);
}
Are there more streamlined and simpler ways of doing this? and if so what are they?
I ask purely out of desire to learn how to code better as at the end of the day, the code above does successfully extract just the first number, but it seems long winded and round-about.
You can also use the built in find_first_of and find_first_not_of to find the first "numberstring" in any string.
std::string first_numberstring(std::string const & str)
{
char const* digits = "0123456789";
std::size_t const n = str.find_first_of(digits);
if (n != std::string::npos)
{
std::size_t const m = str.find_first_not_of(digits, n);
return str.substr(n, m != std::string::npos ? m-n : m);
}
return std::string();
}
This should be more efficient than Ashot Khachatryan's solution. Note the use of '_' and '-' instead of "_" and "-". And also, the starting position of the search for '-'.
inline std::string mid_num_str(const std::string& s) {
std::string::size_type p = s.find('_');
std::string::size_type pp = s.find('-', p + 2);
return s.substr(p + 1, pp - p - 1);
}
If you need a number instead of a string, like what Alexandr Lapenkov's solution has done, you may also want to try the following:
inline long mid_num(const std::string& s) {
return std::strtol(&s[s.find('_') + 1], nullptr, 10);
}
updated for C++11
(important note for compiler regex support: for gcc. you need version 4.9 or later. i tested this on g++ version 4.9[1], and 9.2. cppreference.com has in browser compiler that i used.)
Thanks to user #2b-t who found a bug in the c++11 code!
Here is the C++11 code:
#include <iostream>
#include <string>
#include <regex>
using std::cout;
using std::endl;
int main() {
std::string input = "Example_45-3";
std::string output = std::regex_replace(
input,
std::regex("[^0-9]*([0-9]+).*"),
std::string("$1")
);
cout << input << endl;
cout << output << endl;
}
boost solution that only requires C++98
Minimal implementation example that works on many strings (not just strings of the form "text_45-text":
#include <iostream>
#include <string>
using namespace std;
#include <boost/regex.hpp>
int main() {
string input = "Example_45-3";
string output = boost::regex_replace(
input,
boost::regex("[^0-9]*([0-9]+).*"),
string("\\1")
);
cout << input << endl;
cout << output << endl;
}
console output:
Example_45-3
45
Other example strings that this would work on:
"asdfasdf 45 sdfsdf"
"X = 45, sdfsdf"
For this example I used g++ on Linux with #include <boost/regex.hpp> and -lboost_regex. You could also use C++11x regex.
Feel free to edit my solution if you have a better regex.
Commentary:
If there aren't performance constraints, using Regex is ideal for this sort of thing because you aren't reinventing the wheel (by writing a bunch of string parsing code which takes time to write/test-fully).
Additionally if/when your strings become more complex or have more varied patterns regex easily accommodates the complexity. (The question's example pattern is easy enough. But often times a more complex pattern would take 10-100+ lines of code when a one line regex would do the same.)
[1]
[1]
Apparently full support for C++11 <regex> was implemented and released for g++ version 4.9.x and on Jun 26, 2015. Hat tip to SO questions #1 and #2 for figuring out the compiler version needing to be 4.9.x.
Check this out
std::string ex = "Example_45-3";
int num;
sscanf( ex.c_str(), "%*[^_]_%d", &num );
I can think of two ways of doing it:
Use regular expressions
Use an iterator to step through the string, and copy each consecutive digit to a temporary buffer. Break when it reaches an unreasonable length or on the first non-digit after a string of consecutive digits. Then you have a string of digits that you can easily convert.
std::string s = "Example_45-3";
int p1 = s.find("_");
int p2 = s.find("-");
std::string number = s.substr(p1 + 1, p2 - p1 - 1)
The 'best' way to do this in C++11 and later is probably using regular expressions, which combine high expressiveness and high performance when the test is repeated often enough.
The following code demonstrates the basics. You should #include <regex> for it to work.
// The example inputs
std::vector<std::string> inputs {
"Example_0-0", "Example_0-1", "Example_0-2", "Example_0-3", "Example_0-4",
"Example_1-0", "Example_1-1", "Example_1-2", "Example_1-3", "Example_1-4"
};
// The regular expression. A lot of the cost is incurred when building the
// std::regex object, but when it's reused a lot that cost is amortised.
std::regex imgNumRegex { "^[^_]+_([[:digit:]]+)-([[:digit:]]+)$" };
for (const auto &input: inputs){
// This wil contain the match results. Parts of the regular expression
// enclosed in parentheses will be stored here, so in this case: both numbers
std::smatch matchResults;
if (!std::regex_match(input, matchResults, imgNumRegex)) {
// Handle failure to match
abort();
}
// Note that the first match is in str(1). str(0) contains the whole string
std::string theFirstNumber = matchResults.str(1);
std::string theSecondNumber = matchResults.str(2);
std::cout << "The input had numbers " << theFirstNumber;
std::cout << " and " << theSecondNumber << std::endl;
}
Using #Pixelchemist's answer and e.g. std::stoul:
bool getFirstNumber(std::string const & a_str, unsigned long & a_outVal)
{
auto pos = a_str.find_first_of("0123456789");
try
{
if (std::string::npos != pos)
{
a_outVal = std::stoul(a_str.substr(pos));
return true;
}
}
catch (...)
{
// handle conversion failure
// ...
}
return false;
}

Read integer from string

I have a string in form "blah-blah..obj_xx..blah-blah" where xx are digits. E.g. the string may be "root75/obj_43.dat".
I want to read "xx" (or 43 from the sample above) as an integer. How do I do it?
I tried to find "obj_" first:
std::string::size_type const cpos = name.find("obj_");
assert(std::string::npos != cpos);
but what's next?
My GCC doesn't support regexes fully, but I think this should work:
#include <iostream>
#include <string>
#include <regex>
#include <iterator>
int main ()
{
std::string input ("blah-blah..obj_42..blah-blah");
std::regex expr ("obj_([0-9]+)");
std::sregex_iterator i = std::sregex_iterator(input.begin(), input.end(), expr);
std::smatch match = *i;
int number = std::stoi(match.str());
std::cout << number << '\n';
}
With something this simple you can do
auto b = name.find_first_of("0123456789", cpos);
auto e = name.find_first_not_of("0123456789", b);
if (b != std::string::npos)
{
auto digits = name.substr(b, e);
int n = std::stoi(digits);
}
else
{
// Error handling
}
For anything more complicated I would use regex.
How about:
#include <iostream>
#include <string>
int main()
{
const std::string test("root75/obj_43.dat");
int number;
// validate input:
const auto index = test.find("obj_");
if(index != std::string::npos)
{
number = std::stoi(test.substr(index+4));
std::cout << "number: " << number << ".\n";
}
else
std::cout << "Input validation failed.\n";
}
Live demo here. Includes (very) basic input validation (e.g. it will fail if the string contains multiple obj_), variable length numbers at the end, or even more stuff following it (adjust the substr call accordingly) and you can add a second argument to std::stoi to make sure it didn't fail for some reason.
Here's another option
//your code:
std::string::size_type const cpos = name.find("obj_");
assert(std::string::npos != cpos);
//my code starts here:
int n;
std::stringstream sin(name.substr(cpos+4));
sin>>n;
Dirt simple method, though probably pretty inefficient, and doesn't take advantage of the STL:
(Note that I didn't try to compile this)
unsigned GetFileNumber(std::string &s)
{
const std::string extension = ".dat";
/// get starting position - first character to the left of the file extension
/// in a real implementation, you'd want to verify that the string actually contains
/// the correct extension.
int i = (int)(s.size() - extension.size() - 1);
unsigned sum = 0;
int tensMultiplier = 1;
while (i >= 0)
{
/// get the integer value of this digit - subtract (int)'0' rather than
/// using the ASCII code of `0` directly for clarity. Optimizer converts
/// it to a literal immediate at compile time, anyway.
int digit = s[i] - (int)'0';
/// if this is a valid numeric character
if (digit >= 0 && digit <= 9)
{
/// add the digit's value, adjusted for it's place within the numeric
/// substring, to the accumulator
sum += digit * tensMultiplier;
/// set the tens place multiplier for the next digit to the left.
tensMultiplier *= 10;
}
else
{
break;
}
i--;
}
return sum;
}
If you need it as a string, just append the found digits to a result string rather than accumulating their values in sum.
This also assumes that .dat is the last part of your string. If not, I'd start at the end, count left until you find a numeric character, and then start the above loop. This is nice because it's O(n), but may not be as clear as the regex or find approaches.