convert std::string to std::regex and escape special regex symbols [duplicate]

convert std::string to std::regex and escape special regex symbols [duplicate] - c++

I'm string to create a std::regex(__FILE__) as part of a unit test which checks some exception output that prints the file name.
On Windows it fails with:
regex_error(error_escape): The expression contained an invalid escaped character, or a trailing escape.
because the __FILE__ macro expansion contains un-escaped backslashes.
Is there a more elegant way to escape the backslashes than to loop through the resulting string (i.e. with a std algorithm or some std::string function)?

File paths can contain many characters that have special meaning in regular expression patterns. Escaping just the backslashes is not enough for robust checking in the general case.
Even a simple path, like C:\Program Files (x86)\Vendor\Product\app.exe, contains several special characters. If you want to turn that into a regular expression (or part of a regular expression), you would need to escape not only the backslashes but also the parentheses and the period (dot).
Fortunately, we can solve our regular expression problem with more regular expressions:
std::string EscapeForRegularExpression(const std::string &s) {
static const std::regex metacharacters(R"([\.\^\$\-\+\(\)\[\]\{\}\|\?\*)");
return std::regex_replace(s, metacharacters, "\\$&");
}
(File paths can't contain * or ?, but I've included them to keep the function general.)
If you don't abide by the "no raw loops" guideline, a probably faster implementation would avoid regular expressions:
std::string EscapeForRegularExpression(const std::string &s) {
static const char metacharacters[] = R"(\.^$-+()[]{}|?*)";
std::string out;
out.reserve(s.size());
for (auto ch : s) {
if (std::strchr(metacharacters, ch))
out.push_back('\\');
out.push_back(ch);
}
return out;
}
Although the loop adds some clutter, this approach allows us to drop a level of escaping on the definition of metacharacters, which is a readability win over the regex version.

Here is polymapper.
It takes an operation that takes and element and returns a range, the "map operation".
It produces a function object that takes a container, and applies the "map operation" to each element. It returns the same type as the container, where each element has been expanded/contracted by the "map operation".
template<class Op>
auto polymapper( Op&& op ) {
return [op=std::forward<Op>(op)](auto&& r) {
using std::begin;
using R=std::decay_t<decltype(r)>;
using iterator = decltype( begin(r) );
using T = typename std::iterator_traits<iterator>::value_type;
std::vector<T> data;
for (auto&& e:decltype(r)(r)) {
for (auto&& out:op(e)) {
data.push_back(out);
}
}
return R{ data.begin(), data.end() };
};
}
Here is escape_stuff:
auto escape_stuff = polymapper([](char c)->std::vector<char> {
if (c != '\\') return {c};
else return {c,c};
});
live example.
int main() {
std::cout << escape_stuff(std::string(__FILE__)) << "\n";
}
The advantage of this approach is that the action of messing with the guts of the container is factored out. You write code that messes with the characters or elements, and the overall logic is not your problem.
The disadvantage is polymapper is a bit strange, and needless memory allocations are done. (Those could be optimized out, but that makes the code more convoluted).

EDIT
In the end, I switched to #AdrianMcCarthy 's more robust approach.
Here's the inelegant method in which I solved the problem in case someone stumbles on this actually looking for a workaround:
std::string escapeBackslashes(const std::string& s)
{
std::string out;
for (auto c : s)
{
out += c;
if (c == '\\')
out += c;
}
return out;
}
and then
std::regex(escapeBackslashes(__FILE__));
It's O(N) which is probably as good as you can do here, but involves a lot of string copying which I'd like to think isn't strictly necessary.

Related

Why does std::views::split() compile but not split with an unnamed string literal as a pattern?

When std::views::split() gets an unnamed string literal as a pattern, it will not split the string but works just fine with an unnamed character literal.
#include <iomanip>
#include <iostream>
#include <ranges>
#include <string>
#include <string_view>
int main(void)
{
using namespace std::literals;
// returns the original string (not splitted)
auto splittedWords1 = std::views::split("one:.:two:.:three", ":.:");
for (const auto word : splittedWords1)
std::cout << std::quoted(std::string_view(word));
std::cout << std::endl;
// returns the splitted string
auto splittedWords2 = std::views::split("one:.:two:.:three", ":.:"sv);
for (const auto word : splittedWords2)
std::cout << std::quoted(std::string_view(word));
std::cout << std::endl;
// returns the splitted string
auto splittedWords3 = std::views::split("one:two:three", ':');
for (const auto word : splittedWords3)
std::cout << std::quoted(std::string_view(word));
std::cout << std::endl;
// returns the original string (not splitted)
auto splittedWords4 = std::views::split("one:two:three", ":");
for (const auto word : splittedWords4)
std::cout << std::quoted(std::string_view(word));
std::cout << std::endl;
return 0;
}
See live # godbolt.org.
I understand that string literals are always lvalues. But even though, I am missing some important piece of information that connects everything together. Why can I pass the string that I want splitted as an unnamed string literal whereas it fails (as-in: returns a range of ranges with the original string) when I do the same with the pattern?

String literals always end with a null-terminator, so ":.:" is actually a range with the last element of \0 and a size of 4.
Since the original string does not contain such a pattern, it is not split.
When dealing with C++20 ranges, I strongly recommend using string_view instead of raw string literals, which works well with <ranges> and can avoid the error-prone null-terminator issue.

This answer is completely correct, I'd just like to add a couple additional notes that might be interesting.
First, if you use {fmt} for printing, it's a lot easier to see what's going on, since you also don't have to write your own loop. You can just write this:
fmt::print("{}\n", rv::split("one:.:two:.:three", ":.:"));
Which will output (this is the default output for a range of range of char):
[[o, n, e, :, ., :, t, w, o, :, ., :, t, h, r, e, e, ]]
In C++23, there will be a way to directly specify that this print as a range of strings, but that hasn't been added to {fmt} yet. In the meantime, because split preserves the initial range category, you can add:
auto to_string_views = std::views::transform([](auto sr){
return std::string_view(sr.data(), sr.size());
});
And then:
fmt::print("{}\n", std::views::split("one:.:two:.:three", ":.:") | to_string_views);
prints:
["one:.:two:.:three\x00"]
Note the visibly trailing zero. Likewise, the next three attempts format as:
["one", "two", "three\x00"]
["one", "two", "three\x00"]
["one:two:three\x00"]
The fact that we can clearly see the \x00 helps track down the issue.
Next, consider the difference between:
std::views::split("one:.:two:.:three", ":.:")
and
"one:.:two:.:three" | std::views::split(":.:")
We typically consider these to be equivalent, but they're... not entirely. In the latter case, the library has to capture and stash these values - which involves decaying them. In this case, because ":.:" decays into char const*, that's no longer a valid pattern for the incoming string literal. So the above doesn't actually compile.
Now, it'd be great if it both compiled and also worked correctly. Unfortunately, it's impossible to tell in the language between a string literal (where you don't want to include the null terminator) and an array of char (where you want to include the whole array). So at least, with this latter formulation, you can get the wrong thing to not compile. And at least - "doesn't compile" is better than "compiles and does something wildly different from what I expected"?
Demo.

Split a mathematical expression using regex

I want to split the following mathematical expression -1+33+4.4+sin(3)-2-x^2 into tokens using regex. I use the following site to test my regex expression link, this says that nothing wrong. When I implement the regex into my C++, throwing the following error Invalid special open parenthesis I looked for the solution and I find the following stackoverflow site link but it do not helped me solve my problem.
My regex code is (?<=[-+*\/^()])|(?=[-+*\/^()]). In the C++ code I do not use \.
The other problem is that I do not know how to determine the minus sign is an unary operator or a binary operator, if the minus is an unary operator I want to look like this {-1}
I want the tokens looks like this : {-1,+,33,+4.4,+,sin,(,3,),-,2,-,x,^,2}
The unary minus can be anywhere in the string.
If I do not use ^ it still wrong.
code:
std::vector<std::string> split(const std::string& s, std::string rgx_str) {
std::vector<std::string> elems;
std::regex rgx (rgx_str);
std::sregex_token_iterator iter(s.begin(), s.end(), rgx);
std::sregex_token_iterator end;
while (iter != end) {
elems.push_back(*iter);
++iter;
}
return elems;
}
int main() {
std::string str = "-1+33+4.4+sin(3)-2-x^2";
std::string reg = "(?<=[-+*/()^])|(?=[-+*/()^])";
std::vector<std::string> s = split(str,reg);
for(auto& a : s)
cout << a << endl;
return 0;
}

C++ uses a modified ECMAScript regular expression grammar for its std::regex by default. It does support lookaheads (?=) and (?!), but not lookbehinds. So, the (?<=) is not a valid std::regex syntax.
There is a proposal to add this in C++23, but it is not currently implemented.

Stringize operator on argument with spaces

I have
#define ARG(TEXT , REPLACEMENT) replace(#TEXT, REPLACEMENT)
so
QString str= QString("%ONE, %TWO").ARG(%ONE, "1").ARG(%TWO, "2");
becomes
str= QString("%ONE, %TWO").replace("%ONE", "1").replace("%TWO", "2");
//str = "1, 2"
The problem is that VS2019, when formatting the code (Edit.FormatSelection) interprets that % sign as an operator and adds a whitespace
QString str= QString("%ONE, %TWO").ARG(% ONE, "1").ARG(% TWO, "2");
(I think it's a bug in VS). The code compiles without warnings.
As I am dealing with some ancient code that has this "feature" spread, I'm worried to auto-format text containing this and break functionality.
Is there a way at compile time to detect such arguments to a macro having space(s)?

Is there a way at compile time to detect such arguments to a macro having space(s)?
Here's what I would do:
#define ARG(TEXT, REPLACEMENT) \
replace([]{ \
static constexpr char x[] = #TEXT; \
static_assert(x[0] == '%' && x[1] != ' '); \
return x; \
}(), REPLACEMENT)

Apparently some time in the next decade C++ will provide a better solution, and indeed there might be a much less clunky solution than the one I provide below, but it's maybe a place to start.
This version uses the Boost Preprocessor library to do a repetition which would have been straight-forward to write with a template if C++ allowed string literals as template arguments, a feature which has not yet gotten into the standard for motivations I can only guess at. So it doesn't actually test whether the argument has no spaces; rather it tests that there are no spaces in the first 64 characters (where 64 is an almost entirely arbitrary number which can be changed as your needs dictate). I used the Boost Preprocessor library; you could do this with your own special purpose macros if for some reason you don't want to use Boost.
#include <boost/preprocessor/repetition/repeat.hpp>
#define NO_SPACE_AT_N(z, N, s) && (N >= sizeof(s) || s[N] != ' ')
#define NO_SPACE(s) true BOOST_PP_REPEAT(64, NO_SPACE_AT_N, s)
// Arbitrary constant, change as needed---^
// Produce a compile time error if there's a space.
template<bool ok> struct NoSpace {
const char* operator()(const char* s) {
static_assert(ok, "Unexpected space");
return s;
}
};
#define ARG(TEXT, REPL) replace(NoSpace<NO_SPACE(#TEXT)>()(#TEXT), REPL)
(Test on gcc.godbolt.)

If the question is to produce a compilation error when the first argument of ARG contains a space, I managed to get this to work:
#include <cstdlib>
template<size_t N>
constexpr int string_validate( const char (&s)[N] )
{
for (int i = 0; i < N; ++i)
if ( s[i] == ' ' )
return 0;
return 1;
}
template<int N> void assert_const() { static_assert(N, "string validation failed"); }
int replace(char const *, char const *) { return 0; } // dummy for example
#define ARG(TEXT , REPLACEMENT) replace((assert_const<string_validate(#TEXT)>(), #TEXT), REPLACEMENT)
int main()
{
auto b = ARG(%TWO, "2");
auto a = ARG(% ONE, "1"); // causes assertion failure
}
Undoubtedly there is a shorter way. Prior to C++20 you can't use a string literal in a template parameter, hence the constexpr function to produce an integer from the string literal and then we can check the integer at compile-time by using it as a template parameter.

It's unlikely.
Visual Studio works on source code, without running the preprocessor first and without performing what would be quite a difficult computation to work out whether the preprocessor would fundamentally alter the line it's formatting.
Besides, people don't really use macros in this way any more, or shouldn't (we have cheap functions!).
So this isn't really what the formatting feature expects.
If you can modify the code, make the user write .ARG("%ONE", "1"), then not only does the problem go away but also the would be more consistent.
Otherwise, you'll have to stick with formatting the code by hand.

How could I speed up comparison of std::string against string literals?

I have a bunch of code where objects of type std::string are compared for equality against string literals. Something like this:
//const std:string someString = //blahblahblah;
if( someString == "(" ) {
//do something
} else if( someString == ")" ) {
//do something else
} else if// this chain can be very long
The comparison time accumulates to a serious amount (yes, I profiled) and so it'd be nice to speed it up.
The code compares the string against numerous short string literals and this comparison can hardly be avoided. Leaving the string declared as std::string is most likely inevitable - there're thousands lines of code like that. Leaving string literals and comparison with == is also likely inevitable - rewriting the whole code would be a pain.
The problem is the STL implementation that comes with Visual C++11 uses somewhat strange approach. == is mapped onto std::operator==(const basic_string&, const char*) which calls basic_string::compare( const char* ) which in turn calls std::char_traits<char>( const char* ) which calls strlen() to compute the length of the string literal. Then the comparison runs for the two strings and lengths of both strings are passed into that comparison.
The compiler has a hard time analyzing all this and emits code that traverses the string literal twice. With short literals that's not much time but every comparison involves traversing the literal twice instead of once. Simply calling strcmp() would most likely be faster.
Is there anything I could do like perhaps writing a custom comparator class that would help avoid traversing the string literals twice in this scenario?

Similar to Dietmar's solution, but with slightly less editing: you can wrap the string (once) instead of each literal
#include <string>
#include <cstring>
struct FastLiteralWrapper {
std::string const &s;
explicit FastLiteralWrapper(std::string const &s_) : s(s_) {}
template <std::size_t ArrayLength>
bool operator== (char const (&other)[ArrayLength]) {
std::size_t const StringLength = ArrayLength - 1;
return StringLength == s.size()
&& std::memcmp(s.data(), other, StringLength) == 0;
}
};
and your code becomes:
const std:string someStdString = "blahblahblah";
// just for the context of the comparison:
FastLiteralWrapper someString(someStdString);
if( someString == "(" ) {
//do something
} else if( someString == ")" ) {
//do something else
} else if// this chain can be very long
NB. the fastest solution - at the cost of more editing - is probably to build a (perfect) hash or trie mapping string literals to enumerated constants, and then just switch on the looked-up value. Long if/else if chains usually smell bad IMO.

Well, aside from C++14's string_literal, you could easily code up a solution:
For comparison with a single character, use a character literal and:
bool operator==(const std::string& s, char c)
{
return s.size() == 1 && s[0] == c;
}
For comparison with a string literal, you can use something like this:
template<std::size_t N>
bool operator==(const std::string& s, char const (&literal)[N])
{
return s.size() == N && std::memcmp(s.data(), literal, N-1) == 0;
}
Disclaimer:
The first might even be superfluous,
Only do this if you measure an improvement over what you had.

If you have long chain of string literals to compare to there is likely some potential to deal with comparing prefixes to group common processing. Especially when comparing a known set of strings for equality with an input string, there is also the option to use a perfect hash and key the operations off an integer produced by those.
Since the use of a perfect hash will probably have the best performance but also requires major changes of the code layout, an alternative could be to determine the size of the string literals at compile time and use this size while comparing. For example:
class Literal {
char const* d_base;
std::size_t d_length;
public:
template <std::size_t Length>
Literal(char const (&base)[Length]): d_base(base), d_length(Length - 1) {}
bool operator== (std::string const& other) const {
return other.size() == this->d_length
&& !other.memcmp(this->d_base, other.c_str(), this->d_length);
}
bool operator!=(std::string const& other) const { return !(*this == other); }
};
bool operator== (std::string const& str, Literal const& literal) {
return literal == str;
}
bool operator!= (std::string const& str, Literal const& literal) {
return !(str == literal);
}
Obviously, this assumes that your literals don't embed null characters ('\0') other than the implicitly added terminating null character as the static length would otherwise be distorted. Using C++11 constexpr it would be possible to guard against that possibility but the code gets somewhat more complicated without any good reason. You'd then compare your strings using something like
if (someString == Literal("(")) {
...
}
else if (someString == Literal(")")) {
...
}

The fastest string comparison you can get is by interning the strings: Build a large hash table that contains all strings that are ever created. Ensure that whenever a string object is created, it is first looked up from the hash table, only creating a new object if no preexisting object is found. Naturally, this functionality should be encapsulated in your own string class.
Once you have done this, string comparison is equivalent to comparing their addresses.
This is actually quite an old technique first popularized with the LISP language.
The point, why this is faster, is that every string only has to be created once. If you are careful, you'll never generate the same string twice from the same input bytes, so string creation overhead is controlled by the amount of input data you work through. And hashing all your input data once is not a big deal.
The comparisons, on the other hand, tend to involve the same strings over and over again (like your comparing to literal strings) when you write some kind of a parser or interpreter. And these comparisons are reduced to a single machine instruction.

2 other ideas :
A) Build a FSA using a lexical analyser tool like flex, so the string is converted to an integer token value, depending what it matches.
B) Use length, to break up long elseif chains, possibly partly table driven
Why not get the length of the string something, at the top then just compare against the literals it could possibly match.
If there's a lot of them, it may be worth making it table driven and use a map and function pointers. You could just special case the single character literals, for example perhaps using a function lookup table.
Finding non-matches fast and the common lengths may suffice, and not require too much code restructuring, but be more maintainable as well as faster.
int len = strlen (something);
if ( ! knownliterallength[ len]) {
// not match
...
} else {
// First char may be used to index search, or literals are stored in map with *func()
switch (len)
{
case 1: // Could use a look table index by char and *func()
processchar( something[0]);
break;
case 2: // Short strings
case 3:
case 4:
processrunts( something);
break
default:
// First char used to index search, or literals are stored in map with *func()
processlong( something);
break
}
}

This is not the prettiest solution but it has proved quite fast when there is a lot of short strings to be compared (like operators and control characters/keywords in a script parser?).
Create a search tree based on string length and only compare characters. Try to represent known strings as an enumeration if this makes it cleaner in the particular implementation.
Short example:
enum StrE {
UNKNOWN = 0 ,
RIGHT_PAR ,
LEFT_PAR ,
NOT_EQUAL ,
EQUAL
};
StrE strCmp(std::string str)
{
size_t l = str.length();
switch(l)
{
case 1:
{
if(str[0] == ')') return RIGHT_PAR;
if(str[0] == '(') return LEFT_PAR;
// ...
break;
}
case 2:
{
if(str[0] == '!' && str[1] == '=') return NOT_EQUAL;
if(str[0] == '=' && str[1] == '=') return EQUAL;
// ...
break;
}
// ...
}
return UNKNOWN;
}
int main()
{
std::string input = "==";
switch(strCmp(input))
{
case RIGHT_PAR:
printf("right par");
break;
case LEFT_PAR:
printf("left par");
break;
case NOT_EQUAL:
printf("not equal");
break;
case EQUAL:
printf("equal");
break;
case UNKNOWN:
printf("unknown");
break;
}
}

Using templates for implementing a generic string parser

I am trying to come up with a generic solution for parsing strings (with a given format). For instance, I would like to be able to parse a string containing a list of numeric values (integers or floats) and return a std::vector. This is what I have so far:
template<typename T, typename U>
T parse_value(const U& u) {
throw std::runtime_error("no parser available");
}
template<typename T>
std::vector<T> parse_value(const std::string& s) {
std::vector<std::string> parts;
boost::split(parts, s, boost::is_any_of(","));
std::vector<T> res;
std::transform(parts.begin(), parts.end(), std::back_inserter(res),
[](const std::string& s) { return boost::lexical_cast<T>(s); });
return res;
}
Additionally, I would like to be able to parse strings containing other type of values. For instance:
struct Foo { /* ... */ };
template<>
Foo parse_value(const std::string& s) {
/* parse string and return a Foo object */
}
The reason to maintain a single "hierarchy" of parse_value functions is because, sometimes, I want to parse an optional value (which may exist or not), using boost::optional. Ideally, I would like to have just a single parse_optional_value function that would delegate on the corresponding parse_value function:
template<typename T>
boost::optional<T> parse_optional_value(const boost::optional<std::string>& s) {
if (!s) return boost::optional<T>();
return boost::optional<T>(parse_value<T>(*s));
}
So far, my current solution does not work (the compiler cannot deduce the exact function to use). I guess the problem is that my solution relies on deducing the template value based on the return type of parse_value functions. I am not really sure how to fix this (or even whether it is possible to fix it, since the design approach could just be totally flawed). Does anyone know a way to solve what I am trying to do? I would really appreciate if you could just point me to a possible way to address the issues that I am having with my current implementation. BTW, I am definitely open to completely different ideas for solving this problem too.

You cannot overload functions based on return value [1]. This is precisely why the standard IO library uses the construct:
std::cin >> a >> b;
which may not be your piece of cake -- many people don't like it, and it is truly not without its problems -- but it does a nice job of providing a target type to the parser. It also has the advantage over a static parse<X>(const std::string&) prototype that it allows for chaining and streaming, as above. Sometimes that's not needed, but in many parsing contexts it is essential, and the use of operator>> is actually a pretty cool syntax. [2]
The standard library doesn't do what would be far and away the coolest thing, which is to skip string constants scanf style and allow interleaved reading.
vector<int> integers;
std::cin >> "[" >> interleave(integers, ",") >> "]";
However, that could be defined. (Possibly it would be better to use an explicit wrapper around the string literals, but actually I prefer it like that; but if you were passing a variable you'd want to use a wrapper).
[1] With the new auto declaration, the reason for this becomes even clearer.
[2] IO manipulators, on the other hand, are a cruel joke. And error handling is pathetic. But you can't have everything.

Here is an example of libsass parser:
const char* interpolant(const char* src) {
return recursive_scopes< exactly<hash_lbrace>, exactly<rbrace> >(src);
}
// Match a single character literal.
// Regex equivalent: /(?:x)/
template <char chr>
const char* exactly(const char* src) {
return *src == chr ? src + 1 : 0;
}
where rules could be passed into the lex method.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

convert std::string to std::regex and escape special regex symbols [duplicate] - c++

Related

Why does std::views::split() compile but not split with an unnamed string literal as a pattern?

Split a mathematical expression using regex

Stringize operator on argument with spaces

How could I speed up comparison of std::string against string literals?

Using templates for implementing a generic string parser

Categories

Resources