std::regex escape special characters for use in regex - c++

I'm trying to create a std::regex(__FILE__) as part of a unit test which checks some exception output that prints the file name.
On Windows it fails with:
regex_error(error_escape): The expression contained an invalid escaped character, or a trailing escape.
because the __FILE__ macro expansion contains un-escaped backslashes.
Is there a more elegant way to escape the backslashes than to loop through the resulting string (i.e. with a std algorithm or some std::string function)?

File paths can contain many characters that have special meaning in regular expression patterns. Escaping just the backslashes is not enough for robust checking in the general case.
Even a simple path, like C:\Program Files (x86)\Vendor\Product\app.exe, contains several special characters. If you want to turn that into a regular expression (or part of a regular expression), you would need to escape not only the backslashes but also the parentheses and the period (dot).
Fortunately, we can solve our regular expression problem with more regular expressions:
std::string EscapeForRegularExpression(const std::string &s) {
    // character class of every regex metacharacter, including the backslash itself
    static const std::regex metacharacters(R"([\\\.\^\$\-\+\(\)\[\]\{\}\|\?\*])");
    return std::regex_replace(s, metacharacters, "\\$&");
}
(File paths can't contain * or ?, but I've included them to keep the function general.)
If you don't abide by the "no raw loops" guideline, a probably faster implementation would avoid regular expressions:
std::string EscapeForRegularExpression(const std::string &s) {
    static const char metacharacters[] = R"(\.^$-+()[]{}|?*)";
    std::string out;
    out.reserve(s.size());
    for (auto ch : s) {
        if (std::strchr(metacharacters, ch))
            out.push_back('\\');
        out.push_back(ch);
    }
    return out;
}
Although the loop adds some clutter, this approach allows us to drop a level of escaping on the definition of metacharacters, which is a readability win over the regex version.
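Either version can then feed the escaped path straight into the regex used by the test. A minimal usage sketch (the "thrown from" message format here is invented purely for illustration):
#include <regex>
#include <string>

// Uses EscapeForRegularExpression from above; the expected-message format is hypothetical.
bool MessageMentionsThisFile(const std::string& message) {
    const std::regex pattern("thrown from " + EscapeForRegularExpression(__FILE__));
    return std::regex_search(message, pattern);
}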

Here is polymapper.
It takes an operation that takes an element and returns a range, the "map operation".
It produces a function object that takes a container and applies the "map operation" to each element. It returns the same type as the container, where each element has been expanded or contracted by the "map operation".
template<class Op>
auto polymapper( Op&& op ) {
    return [op = std::forward<Op>(op)](auto&& r) {
        using std::begin;
        using R = std::decay_t<decltype(r)>;
        using iterator = decltype( begin(r) );
        using T = typename std::iterator_traits<iterator>::value_type;

        std::vector<T> data;
        for (auto&& e : decltype(r)(r)) {
            for (auto&& out : op(e)) {
                data.push_back(out);
            }
        }
        return R{ data.begin(), data.end() };
    };
}
Here is escape_stuff:
auto escape_stuff = polymapper([](char c) -> std::vector<char> {
    if (c != '\\') return {c};
    else return {c, c};
});
live example.
int main() {
    std::cout << escape_stuff(std::string(__FILE__)) << "\n";
}
The advantage of this approach is that the action of messing with the guts of the container is factored out. You write code that messes with the characters or elements, and the overall logic is not your problem.
The disadvantage is polymapper is a bit strange, and needless memory allocations are done. (Those could be optimized out, but that makes the code more convoluted).

EDIT
In the end, I switched to @AdrianMcCarthy's more robust approach.
Here's the inelegant method in which I solved the problem in case someone stumbles on this actually looking for a workaround:
std::string escapeBackslashes(const std::string& s)
{
    std::string out;
    for (auto c : s)
    {
        out += c;
        if (c == '\\')
            out += c;
    }
    return out;
}
and then
std::regex(escapeBackslashes(__FILE__));
It's O(N) which is probably as good as you can do here, but involves a lot of string copying which I'd like to think isn't strictly necessary.
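If the repeated reallocation from += is the main source of copying, a slight variation of the same function (a sketch, not from the original post) counts the backslashes first and reserves once:
#include <algorithm>
#include <string>

std::string escapeBackslashes(const std::string& s)
{
    std::string out;
    // one pass to count, one reserve, one pass to build
    out.reserve(s.size() + std::count(s.begin(), s.end(), '\\'));
    for (char c : s)
    {
        out += c;
        if (c == '\\')
            out += c;
    }
    return out;
}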

Related

Is it possible to construct a modifiable view of portion in a string?

I have a match table with start and end indices of portions, passed as an array (in a callback). I wrap that array into a vector of strings. Recently I've had the need to modify the original portions of the string through it.
struct regexcontext {
    std::vector<std::optional<std::string>> matches;
    std::string subject;
};

int buildmatchvector(size_t (*offset_vector)[2], int max, regexcontext* pcontext) {
    pcontext->matches.clear();
    ranges::transform(ranges::span{ offset_vector, max }, std::back_inserter(pcontext->matches),
        [&](const auto& refarr) {
            return refarr[0] == -1
                ? std::optional<std::string>{}
                : std::optional<std::string>{ pcontext->subject.substr(refarr[0], refarr[1] - refarr[0]) };
        });
    return 0;
}
Is it possible to change the above definition in such a way that modifying the match vector modifies the subject string as well?
I've heard of string view, but I've also heard it can't be used to modify the string or to handle variable-sized replacements.
Note I'm using ranges-v3 which is the only library that implements standard ranges at the moment plus the nonstandard ranges::span which allows me to compile on msvc (since std::span doesn't work there for some reason).
As long as you only need to change characters to others, but not add or remove characters, then you could use a vector of span. Supporting addition or removal would be much more complicated and I don't think there's any simple solution in the standard library. Example:
return refarr[0] == -1
    ? span<char>{}
    : span<char>{
          &pcontext->subject[refarr[0]],
          refarr[1] - refarr[0]
      };
Note that any invalidating operation on the pointed-to string would also invalidate these spans, so it would be a good idea to make the string member private.
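For illustration, here is a rough sketch (not from the answer above) of what buildmatchvector might look like with spans; std::span is used here, and ranges::span should drop in the same way. The matches member is assumed to become a vector of optional spans:
#include <optional>
#include <span>
#include <string>
#include <vector>

struct regexcontext {
    std::vector<std::optional<std::span<char>>> matches;
    std::string subject;
};

int buildmatchvector(size_t (*offset_vector)[2], int max, regexcontext* pcontext) {
    pcontext->matches.clear();
    for (int i = 0; i < max; ++i) {
        const auto& refarr = offset_vector[i];
        if (refarr[0] == static_cast<size_t>(-1))
            pcontext->matches.emplace_back(std::nullopt);
        else
            pcontext->matches.emplace_back(std::span<char>{
                &pcontext->subject[refarr[0]], refarr[1] - refarr[0] });
    }
    return 0;
}

// Writing through a span now edits subject in place, e.g.:
// if (ctx.matches[0]) std::fill(ctx.matches[0]->begin(), ctx.matches[0]->end(), 'X');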

perl regex faster than c++/boost

I wrote a CGI script for my website which reads through blocks of text and matches all occurrences of English words. I've been making some fundamental changes to the site's code recently which have necessitated rewriting most of it in C++. As I'd hoped, almost everything has become much faster in C++ than perl, with the exception of this function.
I know that regexes are a relatively recent addition to C++ and not necessarily its strongest suit. It may simply be the case that it is slower than perl in this instance. But I wanted to share my code in the hopes that someone might be able to find a way of speeding up what I am doing in C++.
Here is the perl code:
open(WORD, "</file/path/wordthree.txt") || die "opening";
while (<WORD>) {
    chomp;
    push @wordlist, $_;
}
close(WORD) || die "closing";

foreach (@wordlist) {
    while ($bloc =~ m/$_/g) {
        $location = pos($bloc) - length($_);
        $match = $location . ";" . pos($bloc) . ";" . $_;
        push(@hits, $match);
    }
}
wordthree.txt is a list of ~270,000 English words separated by new lines, and $bloc is 3200 characters of text. Perl performs these searches in about one second. You can see it in play here if you like: http://libraryofbabel.info/anglishize.cgi?05y-w1-s3-v20:1
With C++ I have tried the following:
typedef std::map<std::string::difference_type, std::string> hitmap;
hitmap hits;

void regres(const boost::match_results<std::string::const_iterator>& what) {
    hits[what.position()] = what[0].str();
}

words.open("/file/path/wordthree.txt");

std::string wordlist[274784];
unsigned i = 0;
while (words >> wordlist[i]) { i++; }
words.close();

for (unsigned i = 0; i < 274783; i++) {
    boost::regex word(wordlist[i]);
    boost::sregex_iterator lex(book.begin(), book.end(), word);
    boost::sregex_iterator end;
    std::for_each(lex, end, &regres);
}
The C++ version takes about 12 seconds to read the same amount of text the same number of times. Any advice on how to make it competitive with the perl script is greatly appreciated.
Firstly I'd cut down on the number of allocations:
use string_ref instead of std::string where possible
use mapped files instead of reading it all in memory ahead of time
use const char* instead of std::string::const_iterator to navigate the book
Here is a sample that uses Boost Spirit Qi to parse the wordlist (I don't have yours, so I assume line-separated words).
std::vector<sref> wordlist;
io::mapped_file_source mapped("/etc/dictionaries-common/words");
qi::parse(mapped.begin(), mapped.end(), qi::raw[+(qi::char_ - qi::eol)] % qi::eol, wordlist);
In full Live On Coliru¹
#include <boost/regex.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>

namespace qi = boost::spirit::qi;
namespace io = boost::iostreams;

using sref = boost::string_ref;
using regex = boost::regex;

namespace boost { namespace spirit { namespace traits {
    template <typename It>
    struct assign_to_attribute_from_iterators<sref, It, void> {
        static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f, l)) }; }
    };
} } }

typedef std::map<std::string::difference_type, sref> hitmap;
hitmap hits;

void regres(const boost::match_results<const char*>& what) {
    hits[what.position()] = sref(what[0].first, what[0].length());
}

int main() {
    std::vector<sref> wordlist;
    io::mapped_file_source mapped("/etc/dictionaries-common/words");
    qi::parse(mapped.begin(), mapped.end(), qi::raw[+(qi::char_ - qi::eol)] % qi::eol, wordlist);
    std::cout << "Wordlist contains " << wordlist.size() << " entries\n";

    io::mapped_file_source book("/etc/dictionaries-common/words");

    for (auto const& s : wordlist) {
        regex word(s.to_string());
        boost::cregex_iterator lex(book.begin(), book.end(), word), end;
        std::for_each(lex, end, &regres);
    }
}
Next step
This still creates a regex each iteration. I have a suspicion it will be a lot more efficient if you combine it all into a single pattern. You'll spend more memory/CPU creating the regex, but you'll cut the number of passes over the book by a factor of the number of entries in the word list.
Because the regex library might not have been designed for this scale, you could have better results with a custom search and a trie implementation.
Here's a simple attempt (that is indeed much faster for my /etc/dictionaries-common/words file of 99171 lines):
Live On Coliru
#include <boost/regex.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/iostreams/device/mapped_file.hpp>

namespace io = boost::iostreams;

using sref = boost::string_ref;
using regex = boost::regex;

typedef std::map<std::string::difference_type, sref> hitmap;
hitmap hits;

void regres(const boost::match_results<const char*>& what) {
    hits[what.position()] = sref(what[0].first, what[0].length());
}

int main() {
    io::mapped_file_params params("/etc/dictionaries-common/words");
    params.flags = io::mapped_file::mapmode::priv;
    io::mapped_file mapped(params);

    // turn the newline-separated word list into one big alternation pattern
    std::replace(mapped.data(), mapped.end(), '\n', '|');
    regex const wordlist(mapped.begin(), mapped.end() - 1);

    io::mapped_file_source book("/etc/dictionaries-common/words");

    boost::cregex_iterator lex(book.begin(), book.end(), wordlist), end;
    std::for_each(lex, end, &regres);
}
¹ of course coliru doesn't have a suitable wordlist
It looks to me like Perl is smart enough to figure out that you're abusing regular expressions to do an ordinary linear search, a straightforward lookup. You are looking up straight text, and none of your search patterns appear to be, well, a pattern. Based on your description, all your search patterns look like ordinary strings, so Perl is likely optimizing it down to a linear string search.
I am not familiar with Boost's internal implementation of regular expression matching, but it's likely that it's compiling each search string into a state machine, and then executing the state machine for each search. That's the usual approach used with generic regular expression implementations. And that's a lot of work. A lot of completely needless work, in this specific case.
What you should do is as follows:
1) You are reading wordthree.txt into an array of strings. Don't; read it into a std::set<std::string> instead.
2) You are reading the entire text to search into a single book container. It's not clear, based on your code, whether book is a single std::string, or a std::vector<char>. But whatever the case, don't do that. Read the text to search iteratively, one word at a time. For each word, look it up in the std::set, and go from there.
This is, after all, what you're trying to do directly, and you should do that instead of taking a grand detour through the wonders of regular expressions, which accomplishes very little other than wasting a lot of time.
If you implement this correctly, you'll likely see C++ being just as fast, if not faster, than Perl.
I could also think of several other, more aggressively optimized approaches that also leverage std::set, but with custom classes and comparators that seek to avoid all the heap allocations inherent in using a bunch of std::strings; it probably won't be necessary, though. A basic approach using a std::set-based lookup should be fast enough.
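A bare-bones sketch of that std::set approach (file names follow the question, the book path is hypothetical, and punctuation and match positions are deliberately ignored here):
#include <fstream>
#include <iostream>
#include <iterator>
#include <set>
#include <string>

int main() {
    std::ifstream words("/file/path/wordthree.txt");
    const std::set<std::string> wordlist{std::istream_iterator<std::string>(words),
                                         std::istream_iterator<std::string>()};

    std::ifstream book("/file/path/book.txt");   // hypothetical path for the text block
    std::string token;
    std::size_t found = 0;
    while (book >> token)
        if (wordlist.count(token))
            ++found;

    std::cout << found << " words from the list were found\n";
}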

Sorting a string with std::sort so that capital letters come after lower case

I'd like to sort a vector so that the capital letters follow the lower case letter. If I have something like
This is a test
this is a test
Cats
cats
this thing
I would like the output to be
cats
Cats
this is a test
This is a test
this thing
The standard library sort will output
Cats
This is a test
cats
this is a test
this thing
I want to pass a predicate to std::sort so that it compares the lowercase version of the strings that I pass as arguments.
bool compare(std::string x, std::string y)
{
    return lowercase(x) < lowercase(y);
}
I tried lowering each character within the function and then making the comparison but it didn't work. I would like to test this approach by converting the string to lowercase by some other method. How do I convert strings into lowercase?
EDIT::
Actually I figured out the problem. This works. When I first wrote the function, instead of ref = tolower(ref) I had tolower(ref) without reassigning to ref so it wasn't doing anything.
bool compare(std::string x, std::string y)
{
    for (auto &ref : x)
        ref = tolower(ref);
    for (auto &ref : y)
        ref = tolower(ref);
    return x < y;
}
EDIT::
This code actually sorts with the capital letter first sometimes and second other times, so it doesn't solve the problem completely.
The usual way to do this would be to build a collation table. That's just a table giving the relative ordering of every character. In your case, you want each upper-case letter immediately following the corresponding lower-case letter.
We can do that something like this:
class comp_char {
    std::vector<int> collation_table;
public:
    comp_char() : collation_table(std::numeric_limits<unsigned char>::max() + 1) {
        std::iota(collation_table.begin(), collation_table.end(), 0);
        for (int i = 0; i < 26; i++) {
            // place the letters after everything else, each capital
            // immediately after its lower-case counterpart
            collation_table['a' + i] = 256 + i * 2;
            collation_table['A' + i] = 256 + i * 2 + 1;
        }
    }

    bool operator()(unsigned char a, unsigned char b) const {
        return collation_table[a] < collation_table[b];
    }
};
For the moment, I've ignored the (possibly knotty) problem of the relative ordering of letters to other characters. As it's written, everything else sorts before letters, but it would be pretty easy to change that so (for example) letters sorted before anything else instead. It probably doesn't make a huge difference either way though -- most people don't have strong expectations about whether 'a' < ';' or not.
In any case, once the collation table is built and usable, you want to use it to compare strings:
struct cmp_str {
    bool operator()(std::string const &a, std::string const &b) const {
        comp_char cmp;

        size_t i = 0;
        while (i < a.size() && i < b.size() && a[i] == b[i])
            ++i;
        // operator[] at size() yields '\0', which collates before any letter,
        // so a string that is a prefix of the other sorts first
        return cmp(a[i], b[i]);
    }
};
...which we can use to do sorting, something like this:
int main() {
    std::vector<std::string> inputs {
        "This is a test",
        "this is a test",
        "Cats",
        "cats",
        "this thing"
    };

    std::sort(inputs.begin(), inputs.end(), cmp_str());

    std::copy(inputs.begin(), inputs.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}
For the moment, I've only written the collation table to handle the basic US-ASCII letters. For real use, you'd typically want to have things like letters with accents and such sort next to their corresponding un-accented equivalents. For that, you typically end up pre-building the table to (partially) match things like the Unicode specification for how things should be ordered.
Note that this output doesn't quite match what the original question says is desired, but I think in this case the question has a mistake. I can't see any way it would be even marginally reasonable to produce an order like:
this is a test
This is a test
this thing
This has "T" sorting both after and before "t", which doesn't seem to make sense (or at least doesn't fit with a lexical sort, which is what people nearly always want for strings).
The simplest solution is to use the collation-aware sorting provided by the standard locale object.
A locale's operator()(std::string, std::string) is its collation-aware comparison, so you can pass the locale object directly as the predicate in your call to std::sort:
// Adjust to the locale you actually want to use
std::sort(strings.begin(), strings.end(), std::locale("en_US.UTF-8"));
Example on ideone
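For reference, a minimal self-contained version of that call (the named locale must be available on the system, or the std::locale constructor throws):
#include <algorithm>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> strings{
        "This is a test", "this is a test", "Cats", "cats", "this thing"};

    // the locale object itself serves as the comparison predicate
    std::sort(strings.begin(), strings.end(), std::locale("en_US.UTF-8"));

    for (const auto& s : strings)
        std::cout << s << "\n";
}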
Your solution is almost there, you just need to make a special case if the lower case version of the strings are equal:
std::string to_lower(std::string s)
{
for (auto & c : s)
c = std::tolower(c);
return s;
}
bool string_comp(std::string const & lhs, std::string const & rhs)
{
auto lhs_lower = to_lower(lhs);
auto rhs_lower = to_lower(rhs);
if (lhs_lower == rhs_lower)
return rhs < lhs;
return lhs_lower < rhs_lower;
}
This could use some optimization. Copying the strings is not necessary; you can, of course, do a case-insensitive comparison in place. But that feature is not conveniently available in the standard library, so I'll leave that exercise up to you.
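For what it's worth, one possible sketch of that in-place, copy-free comparison (not part of the answer above, and assuming plain single-byte characters):
#include <algorithm>
#include <cctype>
#include <string>

bool string_comp_in_place(const std::string& lhs, const std::string& rhs)
{
    auto lower = [](unsigned char c) { return std::tolower(c); };
    const std::size_t n = std::min(lhs.size(), rhs.size());
    for (std::size_t i = 0; i < n; ++i) {
        const int l = lower(lhs[i]), r = lower(rhs[i]);
        if (l != r)
            return l < r;               // the case-insensitive order decides
    }
    if (lhs.size() != rhs.size())
        return lhs.size() < rhs.size(); // the shorter string comes first
    return rhs < lhs;                   // equal ignoring case: capitals last
}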
To be clear, I was aiming at the usual lexicographic comparison, but with uppercase following lowercase when the strings are otherwise identical.
This requires a two-step comparison then:
compare the strings in case-insensitive mode
if two strings are equal in case-insensitive mode, we want the reverse result of a case sensitive comparison (which puts upper-case first)
So, the comparator gives:
class Comparator {
public:
    bool operator()(std::string const& left, std::string const& right) const {
        size_t const size = std::min(left.size(), right.size());

        // case-insensitive comparison
        for (size_t i = 0; i != size; ++i) {
            int const l = std::tolower(static_cast<unsigned char>(left[i]));
            int const r = std::tolower(static_cast<unsigned char>(right[i]));
            if (l != r) { return l < r; }
        }

        if (left.size() != right.size()) { return size == left.size(); }

        // and now, case-sensitive (reversed), so upper case goes last
        return right < left;
    }
}; // class Comparator
You need to do the comparison one char at a time, stopping at the first different char and then returning the result depending on the case conversion first, and on the original char (reversed, so capitals sort after) otherwise:
bool mylt(const std::string& a, const std::string& b) {
    int i = 0, na = a.size(), nb = b.size();
    while (i < na && i < nb && a[i] == b[i]) i++;
    if (i == na || i == nb) return i < nb;
    char la = std::tolower(a[i]), lb = std::tolower(b[i]);
    // tie-break on the original characters, reversed so capitals come after
    return la < lb || (la == lb && a[i] > b[i]);
}
Warning: untested breakfast code
Either use locales that already have the ordering you want, or write a character-by-character comparison function and then use std::lexicographical_compare to turn it into a string comparison function.
I would try locales first, but if that proves frustrating, the lexicographic approach is not horrible.
To compare characters, create two tuples or pairs of lower_case_letter, unchanged_letter, and call < on them. This will order first by the lower-case form, then, if that ties, by the unchanged character. I forget which order upper vs lower will sort in: but if the order is backwards, just swap which lower-case letter gets paired with which upper-case letter, and you'll reverse the order!
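A possible sketch of that pair idea with std::lexicographical_compare, assuming plain ASCII and choosing the pairing so that capitals come after lower case, as the question asks:
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>

bool char_less(unsigned char a, unsigned char b)
{
    auto key = [](unsigned char c) {
        // rank by lower-case form first; 0 for lower case, 1 for upper case,
        // so each capital follows its lower-case counterpart
        return std::make_pair(std::tolower(c), std::isupper(c) ? 1 : 0);
    };
    return key(a) < key(b);
}

bool string_less(const std::string& a, const std::string& b)
{
    return std::lexicographical_compare(a.begin(), a.end(),
                                        b.begin(), b.end(), char_less);
}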

C++ replace multiple strings in a string in a single pass

Given the following string, "Hi ~+ and ^*. Is ^* still flying around ~+?"
I want to replace all occurrences of "~+" and "^*" with "Bobby" and "Danny", so the string becomes:
"Hi Bobby and Danny. Is Danny still flying around Bobby?"
I would prefer not to have to call Boost replace function twice to replace the occurrences of the two different values.
I managed to implement the required replacement function using Boost.Iostreams. Specifically, the method I used was a filtering stream using regular expression to match what to replace. I am not sure about the performance on gigabyte sized files. You will need to test it of course. Anyway, here's the code:
#include <boost/regex.hpp>
#include <boost/iostreams/filter/regex.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <iostream>
int main()
{
    using namespace boost::iostreams;

    regex_filter filter1(boost::regex("~\\+"), "Bobby");
    regex_filter filter2(boost::regex("\\^\\*"), "Danny");

    filtering_ostream out;
    out.push(filter1);
    out.push(filter2);
    out.push(std::cout);

    out << "Hi ~+ and ^*. Is ^* still flying around ~+?" << std::endl;

    // for file conversion, use this line instead:
    //out << std::cin.rdbuf();
}
The above prints "Hi Bobby and Danny. Is Danny still flying around Bobby?" when run, just like expected.
It would be interesting to see the performance results, if you decide to measure it.
Daniel
Edit: I just realized that regex_filter needs to read the entire character sequence into memory, making it pretty useless for gigabyte-sized inputs. Oh well...
I did notice it's been a year since this was active, but for what it's worth: I came across an article on CodeProject today that claims to solve this problem - maybe you can use ideas from there.
I can't vouch for its correctness, but it might be worth taking a look at. :)
The implementation surely requires holding the entire string in memory, but you can easily work around that (as with any other implementation that performs the replacements) as long as you can split the input into blocks and guarantee that you never split at a position that is inside a symbol to be replaced. (One easy way to do that in your case is to split at a position where the next char isn't any of the chars used in a symbol.)
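As a rough illustration of that splitting rule (a purely hypothetical helper, assuming the symbols are built only from the characters ~ + ^ *):
#include <cstddef>
#include <string>

// Back a proposed block boundary up until the character right after the split
// cannot be part of a symbol, so no symbol is ever cut in half.
std::size_t safe_split_point(const std::string& buf, std::size_t proposed) {
    static const std::string symbol_chars = "~+^*";
    while (proposed > 0 && symbol_chars.find(buf[proposed]) != std::string::npos)
        --proposed;
    return proposed;
}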
--
There is a reason beyond performance (though that is a sufficient reason in my book) to add a "ReplaceMultiple" method to one's string library: Simply doing the replace operation N times is NOT correct in general.
If the values that are substituted for the symbols are not constrained, values can end up being treated as symbols in subsequent replace operations. (There could be situations where you'd actually want this, but there are definitely cases where you don't. Using strange-looking symbols reduces the severity of the problem, but doesn't solve it, and "is ugly" because the strings to be formatted may be user-definable - and so should not require exotic characters.)
However, I suspect there is a good reason why I can't easily find a general multi-replace implementation. A "ReplaceMultiple" operation simply isn't (obviously) well-defined in general.
To see this, consider what it might mean to "replace 'aa' with '!' and 'baa' with '?' in the string 'abaa'"? Is the result 'ab!' or 'a?' - or is such a replacement illegal?
One could require symbols to be "prefix-free", but in many cases that'd be unacceptable. Say I want to use this to format some template text. And say my template is for code. I want to replace "§table" with a database table name known only at runtime. It'd be annoying if I now couldn't use "§t" in the same template. The templated script could be something completely generic, and lo and behold, one day I encounter the client that actually made use of "§" in their table names... potentially making my template library rather less useful.
A perhaps better solution would be to use a recursive-descent parser instead of simply replacing literals. :)
Very late answer, but none of the answers so far gives a complete solution.
With a bit of Boost Spirit Qi you can do this substitution in one pass, with extremely high efficiency.
#include <iostream>
#include <stdexcept>
#include <string>
#include <string_view>
#include <map>
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>

namespace bsq = boost::spirit::qi;

using SUBSTITUTION_MAP = std::map<std::string, std::string>;

template <typename InputIterator>
struct replace_grammar
    : bsq::grammar<InputIterator, std::string()>
{
    replace_grammar(const SUBSTITUTION_MAP& substitution_items)
        : replace_grammar::base_type(main_rule)
    {
        for (const auto& [key, value] : substitution_items) {
            replace_items.add(key, value);
        }

        main_rule = *( replace_items [( [](const auto& val, auto& context) {
                           auto& res = boost::fusion::at_c<0>(context.attributes);
                           res += val; })]
                     |
                       bsq::char_ [( [](const auto& val, auto& context) {
                           auto& res = boost::fusion::at_c<0>(context.attributes);
                           res += val; })] );
    }

private:
    bsq::symbols<char, std::string> replace_items;
    bsq::rule<InputIterator, std::string()> main_rule;
};

std::string replace_items(std::string_view input, const SUBSTITUTION_MAP& substitution_items)
{
    std::string result;
    result.reserve(input.size());

    using iterator_type = std::string_view::const_iterator;
    const replace_grammar<iterator_type> p(substitution_items);

    if (!bsq::parse(input.begin(), input.end(), p, result))
        throw std::logic_error("should not happen");

    return result;
}

int main()
{
    std::cout << replace_items("Hi ~+ and ^*. Is ^* still flying around ~+?",
                               {{"~+", "Bobby"}, {"^*", "Danny"}});
}
The qi::symbols parser is essentially doing the job you ask for, i.e. searching for the given keys and replacing them with the given values.
https://www.boost.org/doc/libs/1_79_0/libs/spirit/doc/html/spirit/qi/reference/string/symbols.html
As said in the doc, it builds a Ternary Search Tree behind the scenes, which means it is more efficient than searching the string n times, once for each key.
Boost string_algo does have a replace_all function. You could use that.
I suggest using the Boost Format library. Instead of ~+ and ^* you then use %1% and %2% and so on, a bit more systematically.
Example from the docs:
cout << boost::format("writing %1%, x=%2% : %3%-th try") % "toto" % 40.23 % 50;
// prints "writing toto, x=40.230 : 50-th try"
Cheers & hth.,
– Alf
I would suggest using std::map. So you have a set of replacements, so do:
std::map<std::string, std::string> replace;
replace["~+"] = "Bobby";
replace["^*"] = "Danny";
Then you could split the string into a vector of words and check whether each word occurs in the map; if it does, replace it. You'd also need to strip any punctuation marks from the end of the words, or add those to the replacements. You could then do it in one loop. I'm not sure if this is really more efficient or useful than Boost, though.
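A rough sketch of that map-plus-single-loop idea (the sample string here is pre-tokenized with spaces around the punctuation, since the approach as described doesn't strip it):
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main() {
    const std::map<std::string, std::string> replacements{{"~+", "Bobby"}, {"^*", "Danny"}};

    std::istringstream in("Hi ~+ and ^* . Is ^* still flying around ~+ ?");
    std::ostringstream out;
    std::string word;
    bool first = true;
    while (in >> word) {
        const auto it = replacements.find(word);
        out << (first ? "" : " ") << (it != replacements.end() ? it->second : word);
        first = false;
    }
    std::cout << out.str() << "\n";   // Hi Bobby and Danny . Is Danny still flying around Bobby ?
}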

std::string comparison (check whether string begins with another string)

I need to check whether a std::string begins with "xyz". How do I do it without searching through the whole string or creating temporary strings with substr()?
I would use the compare method:
std::string s("xyzblahblah");
std::string t("xyz");

if (s.compare(0, t.length(), t) == 0)
{
    // ok
}
An approach that might be more in keeping with the spirit of the Standard Library would be to define your own begins_with algorithm.
#include <algorithm>
using namespace std;

template<class TContainer>
bool begins_with(const TContainer& input, const TContainer& match)
{
    return input.size() >= match.size()
        && equal(match.begin(), match.end(), input.begin());
}
This provides a simpler interface to client code and is compatible with most Standard Library containers.
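Usage then looks much like the compare-based answer above (note both arguments must be the same container type for the template argument to deduce):
std::string s("xyzblahblah");
std::string t("xyz");

if (begins_with(s, t))
{
    // ok
}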
Look at Boost's String Algo library, which has a number of useful functions, such as starts_with, istarts_with (case-insensitive), etc. If you want to use only part of the Boost libraries in your project, you can use the bcp utility to copy only the needed files.
It seems that std::string::starts_with is in C++20; in the meantime, std::string::find can be used:
std::string s1("xyzblahblah");
std::string s2("xyz");

if (s1.find(s2) == 0)
{
    // ok, s1 starts with s2
}
I feel I'm not fully understanding your question. It looks as though it should be trivial:
s[0]=='x' && s[1]=='y' && s[2]=='z'
This only looks at (at most) the first three characters. The generalisation for a string which is unknown at compile time would require you to replace the above with a loop:
// look for t at the start of s
if (s.length() < t.length())
    return false;
for (std::string::size_type i = 0; i < t.length(); i++)
{
    if (s[i] != t[i])
        return false;
}
return true;