Boost Spirit parser crashes on input - c++

I have a Boost Spirit parser that uses the qi::double_ numeric parser. I have a case where the user's data contains a UUID string:
"00573e443ef1ec10b5a1f23ac8a69c43c415cedf"
And I am getting a crash inside the Spirit pow10_helper() function below. After more testing, it appears to happen for any string starting with a number followed by e and another number. For example, 1e999 also crashes. To reproduce the crash, try:
#include <boost/spirit/include/qi.hpp>
#include <string>
namespace qi = boost::spirit::qi;
int main()
{
double x;
std::string s = "1e999";
auto a = s.begin();
auto b = s.end();
qi::parse(a, b, qi::double_, x); // <--- crash/assert in debug mode
}
I'm using Spirit due to its raw performance (qi::double_ is roughly 2x faster than strtod()). My question is: is there a way to work around this limitation? Switching to a slower parser will be painful, but let me know if you have particular suggestions.
For reference, the relevant Boost code that crashes (boost/spirit/home/support/detail/pow10.hpp):
template <>
struct pow10_helper<double>
{
static double call(unsigned dim)
{
static double const exponents[] =
{
1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9,
...
1e300, 1e301, 1e302, 1e303, 1e304, 1e305, 1e306, 1e307, 1e308,
};
BOOST_ASSERT(dim < sizeof(exponents)/sizeof(double));
return exponents[dim]; // <--- crash here, dim is 999 which is >308
}
};
As a side note, this seems like a huge bug in the Spirit implementation. You should be able to easily crash any Spirit app that parses doubles by passing in a dummy input value like 1e999.

This is a known issue and has been fixed in 1_57_0 AFAIR
Here's the mailing list discussion about it:
http://boost.2283326.n4.nabble.com/bug-in-quot-double-quot-parser-causes-program-abort-minimum-example-provided-td4668308.html
On November 7th Joel de Guzman wrote:
This is now fixed in the develop branch along with a whole slew of
improvements in floating point precision parsing (the corner cases).
There are some backward incompatible changes, but it should only
affect those who are using the real parser policies, in particular,
those who specialize parse_frac_n. The changes will be documented in
due time.
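For readers who cannot upgrade right away, the failure mode in pow10_helper can be illustrated in isolation. The following is only a sketch of a bounds-checked table lookup, not Boost's actual fix: the name checked_pow10 and the shortened table are hypothetical, and saturating to infinity is an assumed policy chosen for illustration.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical stand-in for the lookup in pow10.hpp; the table is shortened
// to 1e0..1e8 here, whereas the real one runs up to 1e308.
inline double checked_pow10(unsigned dim)
{
    static double const exponents[] = {1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8};
    constexpr std::size_t count = sizeof(exponents) / sizeof(exponents[0]);

    // Assumed policy: saturate to infinity instead of reading past the table.
    if (dim >= count)
        return std::numeric_limits<double>::infinity();
    return exponents[dim];
}
```

With a check like this, an exponent such as 999 yields infinity rather than an out-of-bounds read.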

Related

perl regex faster than c++/boost

I wrote a CGI script for my website which reads through blocks of text and matches all occurrences of English words. I've been making some fundamental changes to the site's code recently which have necessitated rewriting most of it in C++. As I'd hoped, almost everything has become much faster in C++ than perl, with the exception of this function.
I know that regexes are a relatively recent addition to C++ and not necessarily its strongest suit. It may simply be the case that it is slower than perl in this instance. But I wanted to share my code in the hopes that someone might be able to find a way of speeding up what I am doing in C++.
Here is the perl code:
open(WORD, "</file/path/wordthree.txt") || die "opening";
while(<WORD>) {
chomp;
push @wordlist,$_;
}
close (WORD) || die "closing";
foreach (@wordlist) {
while ($bloc =~ m/$_/g) {
$location = pos($bloc) - length($_);
$match=$location.";".pos($bloc).";".$_;
push(@hits,$match);
}
}
wordthree.txt is a list of ~270,000 English words separated by new lines, and $bloc is 3200 characters of text. Perl performs these searches in about one second. You can see it in play here if you like: http://libraryofbabel.info/anglishize.cgi?05y-w1-s3-v20:1
With C++ I have tried the following:
typedef std::map<std::string::difference_type, std::string> hitmap;
hitmap hits;
void regres(const boost::match_results<std::string::const_iterator>& what) {
hits[what.position()]=what[0].str();
}
words.open ("/file/path/wordthree.txt");
std::string wordlist[274784];
unsigned i = 0;
while (words >> wordlist[i]) {i++;}
words.close();
for (unsigned i=0;i<274783;i++) {
boost::regex word(wordlist[i]);
boost::sregex_iterator lex(book.begin(),book.end(), word);
boost::sregex_iterator end;
std::for_each(lex, end, &regres);
}
The C++ version takes about 12 seconds to read the same amount of text the same number of times. Any advice on how to make it competitive with the perl script is greatly appreciated.
Firstly I'd cut down on the number of allocations:
use string_ref instead of std::string where possible
use mapped files instead of reading it all in memory ahead of time
use const char* instead of std::string::const_iterator to navigate the book
Here is a sample that uses Boost Spirit Qi to parse the wordlist (I don't have yours, so I assume line-separated words).
std::vector<sref> wordlist;
io::mapped_file_source mapped("/etc/dictionaries-common/words");
qi::parse(mapped.begin(), mapped.end(), qi::raw[+(qi::char_ - qi::eol)] % qi::eol, wordlist);
In full Live On Coliru¹
#include <boost/regex.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <iostream>
#include <map>
#include <vector>
namespace qi = boost::spirit::qi;
namespace io = boost::iostreams;
using sref = boost::string_ref;
using regex = boost::regex;
namespace boost { namespace spirit { namespace traits {
template <typename It>
struct assign_to_attribute_from_iterators<sref, It, void> {
static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f,l)) }; }
};
} } }
typedef std::map<std::string::difference_type, sref> hitmap;
hitmap hits;
void regres(const boost::match_results<const char*>& what) {
hits[what.position()] = sref(what[0].first, what[0].length());
}
int main() {
std::vector<sref> wordlist;
io::mapped_file_source mapped("/etc/dictionaries-common/words");
qi::parse(mapped.begin(), mapped.end(), qi::raw[+(qi::char_ - qi::eol)] % qi::eol, wordlist);
std::cout << "Wordlist contains " << wordlist.size() << " entries\n";
io::mapped_file_source book("/etc/dictionaries-common/words");
for (auto const& s: wordlist) {
regex word(s.to_string());
boost::cregex_iterator lex(book.begin(), book.end(), word), end;
std::for_each(lex, end, &regres);
}
}
Next step
This still creates a regex on each iteration. I have a suspicion it will be a lot more efficient if you combine it all into a single pattern. You'll spend more memory/CPU creating the regex, but you'll scan the text once instead of once per entry in the word list.
Because the regex library might not have been designed for this scale, you could have better results with a custom search and a trie implementation.
Here's a simple attempt (that is indeed much faster for my /etc/dictionaries-common/words file of 99171 lines):
#include <boost/regex.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
namespace io = boost::iostreams;
using sref = boost::string_ref;
using regex = boost::regex;
typedef std::map<std::string::difference_type, sref> hitmap;
hitmap hits;
void regres(const boost::match_results<const char*>& what) {
hits[what.position()] = sref(what[0].first, what[0].length());
}
int main() {
io::mapped_file_params params("/etc/dictionaries-common/words");
params.flags = io::mapped_file::mapmode::priv;
io::mapped_file mapped(params);
std::replace(mapped.data(), mapped.end(), '\n', '|');
regex const wordlist(mapped.begin(), mapped.end() - 1);
io::mapped_file_source book("/etc/dictionaries-common/words");
boost::cregex_iterator lex(book.begin(), book.end(), wordlist), end;
std::for_each(lex, end, &regres);
}
¹ of course coliru doesn't have a suitable wordlist
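The trie idea mentioned above can be sketched with the standard library alone. This is a minimal illustration, not a tuned implementation: the class name WordTrie and its members are hypothetical, and it restarts the walk at every text offset rather than using failure links as Aho-Corasick would.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical trie over the word list. find_all() reports every
// (offset, word) pair where a dictionary word occurs in the text.
class WordTrie {
public:
    WordTrie() : nodes_(1) {}  // node 0 is the root

    void insert(const std::string& word)
    {
        std::size_t cur = 0;
        for (char c : word) {
            auto it = nodes_[cur].next.find(c);
            if (it == nodes_[cur].next.end()) {
                nodes_.emplace_back();  // new child node
                it = nodes_[cur].next.emplace(c, nodes_.size() - 1).first;
            }
            cur = it->second;
        }
        nodes_[cur].terminal = true;
    }

    std::vector<std::pair<std::size_t, std::string>> find_all(const std::string& text) const
    {
        std::vector<std::pair<std::size_t, std::string>> hits;
        for (std::size_t start = 0; start < text.size(); ++start) {
            std::size_t cur = 0;
            // Walk down the trie for as long as the text keeps matching.
            for (std::size_t i = start; i < text.size(); ++i) {
                auto it = nodes_[cur].next.find(text[i]);
                if (it == nodes_[cur].next.end()) break;
                cur = it->second;
                if (nodes_[cur].terminal)
                    hits.emplace_back(start, text.substr(start, i - start + 1));
            }
        }
        return hits;
    }

private:
    struct Node {
        std::map<char, std::size_t> next;  // child index per character
        bool terminal = false;             // true if a word ends here
    };
    std::vector<Node> nodes_;
};
```

Each text offset costs at most the length of the longest word, independent of how many words the list holds, which is the property that makes this attractive for a list of this size.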
It looks to me like Perl is smart enough to figure out that you're abusing regular expressions to do an ordinary linear search, a straightforward lookup. You are looking up straight text, and none of your search patterns appear to be, well, a pattern. Based on your description, all your search patterns look like ordinary strings, so Perl is likely optimizing it down to a linear string search.
I am not familiar with Boost's internal implementation of regular expression matching, but it's likely that it's compiling each search string into a state machine, and then executing the state machine for each search. That's the usual approach used with generic regular expression implementations. And that's a lot of work. A lot of completely needless work, in this specific case.
What you should do is as follows:
1) You are reading wordthree.txt into an array of strings. Read it into a std::set<std::string> instead.
2) You are reading the entire text to search into a single book container. It's not clear, based on your code, whether book is a single std::string, or a std::vector<char>. But whatever the case, don't do that. Read the text to search iteratively, one word at a time. For each word, look it up in the std::set, and go from there.
This is, after all, what you're trying to do directly, and you should do that instead of taking a grand detour through the wonders of regular expressions, which accomplishes very little other than wasting a lot of time.
If you implement this correctly, you'll likely see C++ being just as fast, if not faster, than Perl.
I could also think of several other, more aggressively optimized approaches which also leverage std::set, but with custom classes and comparators, that seek to avoid all the heap allocations inherent with using a bunch of std::string, but it probably won't be necessary. A basic approach using a std::set-based lookup should be fast enough.
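A minimal sketch of the std::set lookup described above might look like the following. The function name find_dictionary_words is hypothetical, and splitting tokens on non-letters is an assumption. Unlike the regex loop in the question, it only reports whole words, not words embedded inside longer ones.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Returns (start offset, word) for every run of letters in `text`
// that is present in `dictionary`.
inline std::vector<std::pair<std::size_t, std::string>>
find_dictionary_words(const std::string& text, const std::set<std::string>& dictionary)
{
    std::vector<std::pair<std::size_t, std::string>> hits;
    std::size_t i = 0;
    while (i < text.size()) {
        // Skip everything that is not a letter.
        while (i < text.size() && !std::isalpha(static_cast<unsigned char>(text[i]))) ++i;
        const std::size_t start = i;
        // Consume one run of letters.
        while (i < text.size() && std::isalpha(static_cast<unsigned char>(text[i]))) ++i;
        if (i > start) {
            std::string token = text.substr(start, i - start);
            if (dictionary.count(token) != 0)
                hits.emplace_back(start, std::move(token));
        }
    }
    return hits;
}
```

Each lookup is logarithmic in the size of the word list, so the text is scanned once regardless of how many words the list contains.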

Boost R-tree : counting elements satisfying a query

So far, when I want to count how many elements in my R-tree satisfy a specific spatial query, it boils down to running the query, collecting the matches and then counting them, roughly as follows:
std::vector<my_type> results;
rtree_ptr->query(bgi::intersects(query_box), std::back_inserter(results));
int nbElements = results.size();
Is there a better way, i.e. a way to directly count without retrieving the actual elements? I haven't found anything to do that but who knows. (I'm building my tree with the packing algorithm, in case it has any relevance.)
My motivation is that I noticed that the speed of my queries depends on the number of matches. If there are 0 matches, the query is more or less instantaneous; if there are 10 000 matches, it takes several seconds. Since it's possible to determine very fast whether there are any matches, it seems that traversing the tree is extremely fast (at least in the index I made); it is collecting all the results that makes the queries slower in case of many matches. Since I'm not interested in collecting but simply counting (at least for some queries), it would be awesome if I could just skip the collecting.
I had a late brainwave. Even better than using function_output_iterator could be using the boost::geometry::index query_iterators.
In principle, it will lead to exactly the same behaviour with slightly simpler code:
box query_box;
auto r = boost::make_iterator_range(bgi::qbegin(tree, bgi::intersects(query_box)), {});
// in c++03, spell out the end iterator: bgi::qend(tree)
size_t nbElements = boost::distance(r);
NOTE: size() is not available because the query_const_iterators are not of the random-access category.
But it may be slightly more comfortable to combine. Say, if you wanted an additional check per item, you'd use standard library algorithms like:
size_t matching = std::count_if(r.begin(), r.end(), some_predicate);
I think the range-based solution is somewhat more flexible (the same code can be used to achieve other algorithms like partial_sort_copy or std::transform which would be hard to fit into the output-iterator idiom from my earlier answer).
You can use a function output iterator:
size_t cardinality = 0; // number of matches in set
auto count_only = boost::make_function_output_iterator([&cardinality] (Tree::value_type const&) { ++cardinality; });
Use it like this:
C++11 using a lambda
#include <boost/function_output_iterator.hpp>
#include <boost/geometry/geometries/box.hpp>
#include <boost/geometry/geometries/point_xy.hpp>
#include <boost/geometry/core/cs.hpp>
#include <boost/geometry/index/rtree.hpp>
namespace bgi = boost::geometry::index;
using point = boost::geometry::model::d2::point_xy<int, boost::geometry::cs::cartesian>;
using box = boost::geometry::model::box<point>;
int main()
{
using Tree = bgi::rtree<box, bgi::rstar<32> >;
Tree tree;
size_t cardinality = 0; // number of matches in set
auto count_only = boost::make_function_output_iterator([&cardinality] (Tree::value_type const&) { ++cardinality; });
box query_box;
tree.query(bgi::intersects(query_box), count_only);
int nbElements = cardinality;
return nbElements;
}
C++03 using a function object
For C++03 you can replace the lambda with a (polymorphic!) function object:
struct count_only_f {
count_only_f(size_t& card) : _cardinality(&card) { }
template <typename X>
void operator()(X) const {
++(*_cardinality);
}
private:
size_t *_cardinality;
};
// .... later:
boost::function_output_iterator<count_only_f> count_only(cardinality);
C++03 using Boost Phoenix
I would consider this a good place to use Boost Phoenix:
#include <boost/phoenix.hpp>
// ...
size_t cardinality = 0; // number of matches in set
tree.query(bgi::intersects(query_box), boost::make_function_output_iterator(++boost::phoenix::ref(cardinality)));
Or, more typically with namespace aliases:
#include <boost/phoenix.hpp>
namespace phx = boost::phoenix;
using boost::make_function_output_iterator;
// ...
size_t cardinality = 0; // number of matches in set
tree.query(bgi::intersects(query_box), make_function_output_iterator(++phx::ref(cardinality)));
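To show what the function output iterator does in these snippets, here is a hand-rolled equivalent that needs no Boost. It is a sketch only: counting_iterator and count_via_output_iterator are hypothetical names, and std::copy_if stands in for the R-tree query, which likewise just writes matches to an output iterator.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical output iterator: discards every value written to it and
// only increments an external counter.
class counting_iterator {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit counting_iterator(std::size_t& count) : count_(&count) {}

    // "*it = value" lands here: count the write, drop the value.
    template <typename T>
    counting_iterator& operator=(const T&) { ++*count_; return *this; }

    counting_iterator& operator*() { return *this; }
    counting_iterator& operator++() { return *this; }
    counting_iterator operator++(int) { return *this; }

private:
    std::size_t* count_;
};

// Counts the elements of [first, last) that satisfy pred by "copying"
// them into the counting iterator.
template <typename It, typename Pred>
std::size_t count_via_output_iterator(It first, It last, Pred pred)
{
    std::size_t n = 0;
    std::copy_if(first, last, counting_iterator(n), pred);
    return n;
}
```

No element is ever stored, which is the point of the idiom: the cost of collecting results into a vector disappears, while the traversal itself is unchanged.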

How to parse a mathematical expression with boost::spirit and bind it to a function

I would like to define a function taking 2 arguments
double func(double t, double x);
where the actual implementation is read from an external text file.
For example, specifying in the text file
function = x*t;
the function should implement the multiplication between x and t, so that it could be called at a later stage.
I'm trying to parse the function using boost::spirit. But I do not know how to actually achieve it.
Below, I created a simple function that implements the multiplication. I bind it to a boost function and I can use it. I also created a simple grammar, which parses the multiplication between two doubles.
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include "boost/function.hpp"
#include "boost/bind.hpp"
#include <boost/spirit/include/qi_symbols.hpp>
#include <iostream>
#include <string>
namespace qi = boost::spirit::qi;
namespace ascii=boost::spirit::ascii;
using boost::spirit::ascii::space;
using boost::spirit::qi::symbols;
template< typename Iterator >
struct MyGrammar : public virtual qi::grammar< Iterator, ascii::space_type >
{
MyGrammar() : MyGrammar::base_type(expression)
{
using qi::double_;
//The expression should take x and t as symbolic expressions
expression = (double_ >> '*' >> double_)[std::cout << "Parse multiplication: " << (qi::_1 * qi::_2)];
}
qi::rule<Iterator, ascii::space_type> expression;
};
double func(const double a, const double b)
{
return a*b; //This is the operation to perform
}
int main()
{
typedef std::string::const_iterator iterator_Type;
typedef MyGrammar<iterator_Type> grammar_Type;
grammar_Type calc;
std::string str = "1.*2."; // This should be changed to x*t
iterator_Type iter = str.begin();
iterator_Type end = str.end();
bool r = phrase_parse(iter, end, calc, space);
typedef boost::function < double ( const double t,
const double x) > function_Type;
function_Type multiplication = boost::bind(&func, _1, _2);
std::cout << "\nResult: " << multiplication( 2.0, 3.0) << std::endl;
return 0;
}
If I modify the above code setting
std::string str = "x*t";
how can I parse such an expression and bind it to the function multiplication such that, if I call multiplication(1.0, 2.0), it associates t to 1.0, x to 2.0 and it returns the result of the operation?
You're going to learn Spirit. Great!
It seems you're biting off more than you can chew here, though.
Firstly, your grammar doesn't actually parse an expression yet. And it certainly doesn't result in a function that you can then bind.
In fact you're parsing the input using a grammar that is not producing any result. It only creates a side effect (which is to print the result of the simple binary expression with immediate simple operands to the console). This /resembles/ interpreted languages, although it would soon break down when
you try to parse an expression like 2*8 + 9
you would have input that backtracks (oops, the side effect already fired)
Next up you're binding func (which is redundant by the way; you're not binding any arguments so you could just say function_Type multiplication(func); here), and calling it. While cool, this has literally nothing to do with the parsing input.
Finally, your question is about a third thing, that wasn't even touched upon anywhere in the above. This question is about symbol tables and identifier lookup.
This would imply you should parse the source for actual identifiers (x or t, e.g.)
you'd need to store these into a symbol table so they could be mapped to a value (and perhaps a scope/lifetime)
There is a gaping logic hole in the question where you don't define the source of the "formal parameter list" (you mention it in the text here: function = x*t; but the parser doesn't deal with it, neither did you hardcode any such metadata); so there is no way we could even start to map the x and t things to the formal argument list (because it doesn't exist).
Let's assume for the moment that arguments are in fact positional (as they are, and you seem to want this, since you call the bound function with positional arguments anyway), so we don't have to worry about names because no one will ever see one.
the caller should pass in a context to the functions, so that values can be looked up by identifier name during evaluation.
So, while I could try to sit you down and talk you through all the nuts and bolts that need to be created first before you can even dream of glueing it together in fantastic ways like you are asking for, let's not.
It would take me too much time and you would likely be overwhelmed.
Suggestions
I can only suggest to look at simpler resources. Start with the tutorials
The calculator series of tutorials is nice. In this answer I list the calculator samples with short descriptions of what techniques they demonstrate: What is the most efficient way to recalculate attributes of a Boost Spirit parse with a different symbol table?
The compiler tutorials actually do everything you're trying for, but they're a bit advanced
Ask freely if you have any questions along the way and you're at risk of getting stuck. But at least then we have a question that is answerable and answers that genuinely help you.
For now, look at some other answers of mine where I actually implemented grammars like this (somewhat ordered by increasing complexity):
Nice for comparison: This answer to Boost::spirit how to parse and call c++ function-like expressions interprets the parsed expressions on-the-fly (this mimics the approach with [std::cout << "Parse multiplication: " << (qi::_1 * qi::_2)] in your own parser)
The other answer there (Boost::spirit how to parse and call c++ function-like expressions) achieves the goal but using a dedicated AST representation, and a separate interpretation phase.
The benefits of each approach are described in these answers. These parsers do not have a symbol table nor an evaluation context.
More examples:
a simple boolean expression grammar evaluator (which supports only literals, not variables)
Building a Custom Expression Tree in Spirit:Qi (Without Utree or Boost::Variant)
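To make the moving parts concrete (parsing identifiers, mapping them to positional arguments, and deferring evaluation), here is a small hand-written recursive-descent sketch in plain C++. It is not Spirit and not taken from any of the linked answers. All names (Func, ExprParser, compile_expression) are hypothetical, and it assumes the only identifiers are t and x, bound to the first and second argument.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

using Func = std::function<double(double t, double x)>;

class ExprParser {
public:
    explicit ExprParser(std::string src) : src_(std::move(src)) {}

    // Parses the whole input and returns a callable f(t, x).
    Func parse()
    {
        Func f = expr();
        skip_ws();
        if (pos_ != src_.size()) throw std::runtime_error("unexpected trailing input");
        return f;
    }

private:
    void skip_ws()
    {
        while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }

    // Consumes c (after optional whitespace) if it is next in the input.
    bool accept(char c)
    {
        skip_ws();
        if (pos_ < src_.size() && src_[pos_] == c) { ++pos_; return true; }
        return false;
    }

    // expr := term (('+' | '-') term)*
    Func expr()
    {
        Func lhs = term();
        for (;;) {
            if (accept('+')) {
                Func rhs = term();
                lhs = [lhs, rhs](double t, double x) { return lhs(t, x) + rhs(t, x); };
            } else if (accept('-')) {
                Func rhs = term();
                lhs = [lhs, rhs](double t, double x) { return lhs(t, x) - rhs(t, x); };
            } else {
                return lhs;
            }
        }
    }

    // term := factor (('*' | '/') factor)*
    Func term()
    {
        Func lhs = factor();
        for (;;) {
            if (accept('*')) {
                Func rhs = factor();
                lhs = [lhs, rhs](double t, double x) { return lhs(t, x) * rhs(t, x); };
            } else if (accept('/')) {
                Func rhs = factor();
                lhs = [lhs, rhs](double t, double x) { return lhs(t, x) / rhs(t, x); };
            } else {
                return lhs;
            }
        }
    }

    // factor := '(' expr ')' | 't' | 'x' | number
    Func factor()
    {
        if (accept('(')) {
            Func inner = expr();
            if (!accept(')')) throw std::runtime_error("expected ')'");
            return inner;
        }
        if (accept('t')) return [](double t, double) { return t; };
        if (accept('x')) return [](double, double x) { return x; };

        // Numeric literal, delegated to strtod.
        const char* begin = src_.c_str() + pos_;
        char* end = nullptr;
        const double value = std::strtod(begin, &end);
        if (end == begin) throw std::runtime_error("expected operand");
        pos_ += static_cast<std::size_t>(end - begin);
        return [value](double, double) { return value; };
    }

    std::string src_;
    std::size_t pos_ = 0;
};

inline Func compile_expression(const std::string& text)
{
    return ExprParser(text).parse();
}
```

Calling compile_expression("x*t") yields a function object that can be stored and invoked later with (t, x). The Spirit-based answers linked above reach the same separation between parsing and evaluation through an AST rather than nested closures.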

C++: Vector bounds

I am coming from Java and learning C++ at the moment. I am using Stroustrup's Programming: Principles and Practice Using C++. I am working with vectors now. On page 117 he says that accessing a non-existent element of a vector will cause a runtime error (same in Java, index out of bounds). I am using the MinGW compiler and when I compile and run this code:
#include <iostream>
#include <cstdio>
#include <vector>
int main()
{
std::vector<int> v(6);
v[8] = 10;
std::cout << v[8];
return 0;
}
It gives me as output 10. Even more interesting is that if I do not modify the non-existent vector element (I just print it expecting a runtime error or at least a default value) it prints some large integers. So... is Stroustrup wrong, or does GCC have some strange ways of compiling C++?
The book is a bit vague. It's not so much a "runtime error" as it is undefined behaviour which manifests at runtime. This means that anything could happen. But the error is strictly with you, not with the program execution, and it is in fact impossible and not sensible to even talk about the execution of a program with undefined behaviour.
There is nothing in C++ that protects you against programming errors, quite unlike in Java.
As @sftrabbit says, std::vector has an alternative interface, .at(), which always gives a correct program (though it may throw exceptions), and consequently one which one can reason about.
Let me repeat the point with an example, because I believe this is an important fundamental aspect of C++. Suppose we're reading an integer from the user:
int read_int()
{
std::cout << "Please enter a number: ";
int n;
return (std::cin >> n) ? n : 18;
}
Now consider the following three programs:
The dangerous one: The correctness of this program depends on the user input! It is not necessarily incorrect, but it is unsafe (to the point where I would call it broken).
int main()
{
int n = read_int();
int k = read_int();
std::vector<int> v(n);
return v[k];
}
Unconditionally correct: No matter what the user enters, we know how this program behaves.
int main() try
{
int n = read_int();
int k = read_int();
std::vector<int> v(n);
return v.at(k);
}
catch (...)
{
return 0;
}
The sane one: The above version with .at() is awkward. Better to check and provide feedback. Because we perform dynamic checking, the unchecked vector access is actually guaranteed to be fine.
int main()
{
int n = read_int();
if (n <= 0) { std::cout << "Bad container size!\n"; return 0; }
int k = read_int();
if (k < 0 || k >= n) { std::cout << "Bad index!\n"; return 0; }
std::vector<int> v(n);
return v[k];
}
(We're ignoring the possibility that the vector construction might throw an exception of its own.)
The moral is that many operations in C++ are unsafe and only conditionally correct, but it is expected of the programmer that you make the necessary checks ahead of time. The language doesn't do it for you, and so you don't pay for it, but you have to remember to do it. The idea is that you need to handle the error conditions anyway, and so rather than enforcing an expensive, non-specific operation at the library or language level, the responsibility is left to the programmer, who is in a better position to integrate the checking into the code that needs to be written anyway.
If I wanted to be facetious, I would contrast this approach to Python, which allows you to write incredibly short and correct programs, without any user-written error handling at all. The flip side is that any attempt to use such a program that deviates only slightly from what the programmer intended leaves you with a non-specific, hard-to-read exception and stack trace and little guidance on what you should have done better. You're not forced to write any error handling, and often no error handling ends up being written. (I can't quite contrast C++ with Java, because while Java is generally safe, I have yet to see a short Java program.)</rantmode>
This is a valuable comment by @Evgeny Sergeev that I promote to the answer:
For GCC, you can pass -D_GLIBCXX_DEBUG to replace standard containers with safe implementations. More recently, this also seems to work with std::array. More info here: gcc.gnu.org/onlinedocs/libstdc++/manual/debug_mode.html
I would add that it is also possible to use individual "safe" versions of vector and other utility classes by using the __gnu_debug:: namespace prefix rather than std::.
In other words, do not re-invent the wheel, array checks are available at least with GCC.
C and C++ do not always do bounds checks. An out-of-range access MAY cause a runtime error. And if you were to overshoot the index by enough, say 10000 or so, it's almost certain to cause a problem.
You can also use vector.at(10), which definitely should give you an exception.
see:
http://www.cplusplus.com/reference/vector/vector/at/
compared with:
http://www.cplusplus.com/reference/vector/vector/operator%5B%5D/
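The difference between the two accessors can be observed directly. In this small sketch the helper name at_throws is hypothetical. It relies only on the documented behaviour that std::vector::at throws std::out_of_range for an invalid index.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if v.at(index) throws std::out_of_range, false otherwise.
inline bool at_throws(const std::vector<int>& v, std::size_t index)
{
    try {
        (void)v.at(index);  // bounds-checked access
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

For a vector of six elements, index 5 is accepted and index 8 is rejected, whereas v[8] would be undefined behaviour with no diagnostic required.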
I hoped that vector's operator[] would check bounds as at() does, because I'm not so careful. :-)
One way is to inherit from the vector class and redefine operator[] to call at(), so that you can keep the more readable [] and don't need to replace every [] with at(). You can also define the inherited vector (e.g. safer_vector) as the normal vector.
The code looks like this (in C++11, LLVM 3.5 of Xcode 5).
#include <vector>
using namespace std;
template <class _Tp, class _Allocator = allocator<_Tp> >
class safer_vector:public vector<_Tp, _Allocator>{
private:
typedef __vector_base<_Tp, _Allocator> __base;
public:
typedef _Tp value_type;
typedef _Allocator allocator_type;
typedef typename __base::reference reference;
typedef typename __base::const_reference const_reference;
typedef typename __base::size_type size_type;
public:
reference operator[](size_type __n){
return this->at(__n);
};
safer_vector(_Tp val):vector<_Tp, _Allocator>(val){;};
safer_vector(_Tp val, const_reference __x):vector<_Tp, _Allocator>(val,__x){;};
safer_vector(initializer_list<value_type> __il):vector<_Tp, _Allocator>(__il){;}
template <class _Iterator>
safer_vector(_Iterator __first, _Iterator __last):vector<_Tp,_Allocator>(__first, __last){;};
// If C++11 Constructor inheritence is supported
// using vector<_Tp, _Allocator>::vector;
};
#define safer_vector vector
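The class above depends on libc++ internals such as __vector_base, so it may not compile with other standard libraries. A more portable sketch of the same idea, assuming C++11 constructor inheritance is available, could look like this. The name checked_vector is hypothetical, and note that operator[] is hidden rather than overridden, so the check only applies when the object is used through the derived type.

```cpp
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical wrapper: same interface as std::vector, but operator[]
// forwards to at() and therefore throws std::out_of_range on a bad index.
template <class T, class Alloc = std::allocator<T>>
class checked_vector : public std::vector<T, Alloc> {
    using base = std::vector<T, Alloc>;

public:
    // Inherit all of std::vector's constructors.
    using std::vector<T, Alloc>::vector;

    typename base::reference operator[](typename base::size_type n)
    {
        return this->at(n);
    }

    typename base::const_reference operator[](typename base::size_type n) const
    {
        return this->at(n);
    }
};
```

Code that takes a std::vector<T>& and indexes through it still gets the unchecked operator[], which is one reason the debug-mode containers mentioned earlier are usually the safer choice.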

C++ replace multiple strings in a string in a single pass

Given the following string, "Hi ~+ and ^*. Is ^* still flying around ~+?"
I want to replace all occurrences of "~+" and "^*" with "Bobby" and "Danny", so the string becomes:
"Hi Bobby and Danny. Is Danny still flying around Bobby?"
I would prefer not to have to call Boost replace function twice to replace the occurrences of the two different values.
I managed to implement the required replacement function using Boost.Iostreams. Specifically, the method I used was a filtering stream using regular expression to match what to replace. I am not sure about the performance on gigabyte sized files. You will need to test it of course. Anyway, here's the code:
#include <boost/regex.hpp>
#include <boost/iostreams/filter/regex.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <iostream>
int main()
{
using namespace boost::iostreams;
regex_filter filter1(boost::regex("~\\+"), "Bobby");
regex_filter filter2(boost::regex("\\^\\*"), "Danny");
filtering_ostream out;
out.push(filter1);
out.push(filter2);
out.push(std::cout);
out << "Hi ~+ and ^*. Is ^* still flying around ~+?" << std::endl;
// for file conversion, use this line instead:
//out << std::cin.rdbuf();
}
The above prints "Hi Bobby and Danny. Is Danny still flying around Bobby?" when run, just like expected.
It would be interesting to see the performance results, if you decide to measure it.
Daniel
Edit: I just realized that regex_filter needs to read the entire character sequence into memory, making it pretty useless for gigabyte-sized inputs. Oh well...
I did notice it's been a year since this was active, but for what it's worth. I came across an article on CodeProject today that claims to solve this problem - maybe you can use ideas from there:
I can't vouch for its correctness, but might be worth taking a look at. :)
The implementation surely requires holding the entire string in memory, but you can easily work around that (as with any other implementation that performs the replacements) as long as you can split the input into blocks and guarantee that you never split at a position that is inside a symbol to be replaced. (One easy way to do that in your case is to split at a position where the next char isn't any of the chars used in a symbol.)
--
There is a reason beyond performance (though that is a sufficient reason in my book) to add a "ReplaceMultiple" method to one's string library: Simply doing the replace operation N times is NOT correct in general.
If the values that are substituted for the symbols are not constrained, values can end up being treated as symbols in subsequent replace operations. (There could be situations where you'd actually want this, but there are definitely cases where you don't. Using strange-looking symbols reduces the severity of the problem, but doesn't solve it, and "is ugly" because the strings to be formatted may be user-definable, and so should not require exotic characters.)
However, I suspect there is a good reason why I can't easily find a general multi-replace implementation. A "ReplaceMultiple" operation simply isn't (obviously) well-defined in general.
To see this, consider what it might mean to "replace 'aa' with '!' and 'baa' with '?' in the string 'abaa'"? Is the result 'ab!' or 'a?' - or is such a replacement illegal?
One could require symbols to be "prefix-free", but in many cases that'd be unacceptable. Say I want to use this to format some template text. And say my template is for code. I want to replace "§table" with a database table name known only at runtime. It'd be annoying if I now couldn't use "§t" in the same template. The templated script could be something completely generic, and lo-and-behold, one day I encounter the client that actually made use of "§" in his table names... potentially making my template library rather less useful.
A perhaps better solution would be to use a recursive-descent parser instead of simply replacing literals. :)
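One way to pin down the semantics discussed above is to scan left to right and, at each position, prefer the longest matching symbol, never rescanning substituted text. The sketch below assumes exactly that policy, which is one choice among several. The function name replace_multiple is hypothetical, and the inner loop over all keys is kept simple rather than fast.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Single left-to-right pass. At each position the longest key that matches
// wins; replacement text is appended to the output and never re-examined.
inline std::string replace_multiple(const std::string& input,
                                    const std::map<std::string, std::string>& subs)
{
    std::string out;
    out.reserve(input.size());
    std::size_t i = 0;
    while (i < input.size()) {
        const std::string* best_key = nullptr;
        const std::string* best_val = nullptr;
        for (const auto& kv : subs) {
            const std::string& key = kv.first;
            const bool matches = !key.empty() && input.compare(i, key.size(), key) == 0;
            if (matches && (best_key == nullptr || key.size() > best_key->size())) {
                best_key = &key;
                best_val = &kv.second;
            }
        }
        if (best_key != nullptr) {
            out += *best_val;          // emit the replacement
            i += best_key->size();     // and skip the matched symbol
        } else {
            out += input[i];           // no symbol starts here: copy one char
            ++i;
        }
    }
    return out;
}
```

Under this policy, replacing 'aa' with '!' and 'baa' with '?' in 'abaa' gives 'a?', because no symbol starts at offset 0 and 'baa' matches at offset 1 before 'aa' is ever considered.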
Very late answer, but none of the answers so far gives a solution.
With a bit of Boost Spirit Qi you can do this substitution in one pass, with extremely high efficiency.
#include <iostream>
#include <string>
#include <string_view>
#include <map>
#include <stdexcept>
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>
namespace bsq = boost::spirit::qi;
using SUBSTITUTION_MAP = std::map<std::string, std::string>;//,std::string>;
template <typename InputIterator>
struct replace_grammar
: bsq::grammar<InputIterator, std::string()>
{
replace_grammar(const SUBSTITUTION_MAP& substitution_items)
: replace_grammar::base_type(main_rule)
{
for(const auto& [key, value] : substitution_items) {
replace_items.add(key,value);
}
main_rule = *( replace_items [( [](const auto &val, auto& context) {
auto& res = boost::fusion::at_c<0>(context.attributes);
res += val; })]
|
bsq::char_
[( [](const auto &val, auto& context) {
auto& res = boost::fusion::at_c<0>(context.attributes);
res += val; })] );
}
private :
bsq::symbols<char, std::string> replace_items;
bsq::rule<InputIterator, std::string()> main_rule;
};
std::string replace_items(std::string_view input, const SUBSTITUTION_MAP& substitution_items)
{
std::string result;
result.reserve(input.size());
using iterator_type = std::string_view::const_iterator;
const replace_grammar<iterator_type> p(substitution_items);
if (!bsq::parse(input.begin(), input.end(), p, result))
throw std::logic_error("should not happen");
return result;
}
int main()
{
std::cout << replace_items("Hi ~+ and ^*. Is ^* still flying around ~+?",{{"~+", "Bobby"} , { "^*", "Danny"}});
}
qi::symbols is essentially doing the job you ask for, i.e. searching for the given keys and replacing them with the given values.
https://www.boost.org/doc/libs/1_79_0/libs/spirit/doc/html/spirit/qi/reference/string/symbols.html
As said in the docs, it builds a Ternary Search Tree behind the scenes, which means that it is more efficient than searching the string n times, once for each key.
Boost string_algo does have a replace_all function. You could use that.
I suggest using the Boost Format library. Instead of ~+ and ^* you then use %1% and %2% and so on, a bit more systematically.
Example from the docs:
cout << boost::format("writing %1%, x=%2% : %3%-th try") % "toto" % 40.23 % 50;
// prints "writing toto, x=40.230 : 50-th try"
Cheers & hth.,
– Alf
I would suggest using std::map. So you have a set of replacements, so do:
std::map<std::string,std::string> replace;
replace["~+"]="Bobby";
replace["^*"]="Danny";
Then you could put the string into a vector of strings and check to see if each string occurs in the map and if it does replace it, you'd also need to take off any punctuation marks from the end. Or add those to the replacements. You could then do it in one loop. I'm not sure if this is really more efficient or useful than boost though.