first time regex (in c++ that is)
I have a hard time writing
(?<=name=")(?:[^\\"]+|\\.)*(?=")
that matches, for example, name="blabla" xyz as blabla, as C++ code...
How do I
std::regex TheName("(?<=name=")(?:[^\\"]+|\\.)*(?=")");
correctly please?
C++ std::regex does not support lookbehind, so you need to use a capturing group rather than a positive lookbehind. It is also advisable to apply the unroll-the-loop principle to your ([^"\\]|\\.)* subpattern to make the regex as fast as it can be: [^\"\\]*(?:\\.[^\"\\]*)*. Finally, it is advisable to use raw string literals (see R"(<PATTERN>)") when defining regex patterns in order to avoid overescaping.
See the C++ demo:
#include <iostream>
#include <regex>
#include <string>
int main() {
    std::string s = "name=\"bla \\\"bla\\\"\"";
    // Group 1 captures the value between the quotes, escaped quotes included.
    std::regex TheName(R"(name=\"([^\"\\]*(?:\\.[^\"\\]*)*)\")");
    std::smatch m;
    if (std::regex_search(s, m, TheName)) {
        std::cout << m[1].str() << std::endl;
    }
    return 0;
}
Result: bla \"bla\"
While working on a solution to this question, I came up with the following c++ regex:
#include <regex>
#include <string>
#include <iostream>
std::string remove_password(std::string const& input)
{
// I think this should work for skipping escaped quotes in the password.
// It works in javascript, but not in the standard library implementation.
// anyone have any ideas?
// (.*password\(("|'))(?:\\\2|[^\2])*?(\2.*)
// const char prog[] = R"__regex((.*password\(')([^']*)('.*)))__regex";
const char prog[] = R"__regex((.*password\(("|'))(?:\\\2|[^\2])*?(\2.*))__regex";
auto reg = std::regex(prog, std::regex_constants::syntax_option_type::ECMAScript);
std::smatch match;
std::regex_match(input, match, reg);
// match[0] is the entire string
// match[1] is pre-password
// match[2] is the password
// match[3] is post-password
return match[1].str() + "********" + match[3].str();
}
int main()
{
using namespace std::literals;
auto test_string = R"__(select * from run_on_hive(server('hdp230m2.labs.teradata.com'),username('vijay'),password('vijay'),dbname('default'),query('analyze table default.test01 compute statistics'));)__";
std::cout << remove_password(test_string);
}
I wanted to capture passwords, even if they contained an escaped quote or double-quote.
However the regex does not compile in clang or gcc.
It compiles correctly in regex101.com when using the javascript syntax.
Am I wrong, or is the implementation incorrect?
Note that ECMAScript is the default flavor in C++ std::regex, you do not have to specify it explicitly. At any rate, std::regex_constants::syntax_option_type::ECMAScript causes one error here since the compiler expects a std::regex_constants value here, and the simplest fix is to remove it or use std::regex(prog, std::regex_constants::ECMAScript).
The [^\2] pattern causes the second issue, Unexpected character in bracket expression. You cannot use backreferences inside bracket expressions, but you may use a negative lookahead to restrict a . / [^] pattern to match anything but what Group 2 holds.
Use
const char prog[] = R"((.*password\((["']))(?:\\\2|(?!\2)[^])*?(\2.*))";
See your fixed C++ demo.
However, it seems you may use a "cleaner" approach using std::regex_replace:
std::string remove_password(std::string const& input)
{
    const char prog[] = R"((.*password\((["']))(?:\\\2|(?!\2)[^])*?(\2.*))";
    auto reg = std::regex(prog);
    return std::regex_replace(input, reg, "$1********$3");
}
See another C++ demo. The $1 and $3 are the placeholders for Group 1 and 3 values.
The code below doesn't give the same result in Visual Studio 2015 and on IDEOne.com (C++14). Stranger still, the results are incorrect in both cases!
#include <iostream>
#include <regex>
int main()
{
const char* pszTestString = "ENDRESS+HAUSER*ST-DELL!HP||BESTMATCH&&ABCD\\ABCD";
const char* pszExpectedString = "ENDRESS\\+HAUSER\\*ST\\-DELL\\!HP\\||BESTMATCH\\&&ABCD\\\\ABCD";
std::cout << std::regex_replace(pszTestString, std::regex("[-+!\"\\[\\](){}^~*?:]|&&|\\|\\|"), "\\$0") << std::endl;
std::cout << pszExpectedString << std::endl;
return 0;
}
Under Visual Studio 2015 I got this strange result (for both compilers, the second line is the expected result):
ENDRESS\$0HAUSER\$0ST\$0DELL\$0HP\$0BESTMATCH\$0ABCD\ABCD
ENDRESS\+HAUSER\*ST\-DELL\!HP\||BESTMATCH\&&ABCD\\ABCD
With IDEOne (C++14 compiler) :
ENDRESS\+HAUSER\*ST\-DELL\!HP\||BESTMATCH\&&ABCD\ABCD
ENDRESS\+HAUSER\*ST\-DELL\!HP\||BESTMATCH\&&ABCD\\ABCD
We can see in the latter that there is a mistake: before the last "ABCD" there must be two backslashes, not a single one.
What the heck is going on? I wrote a manual parser instead of using std::regex_replace for the moment, but I really want to make it work under VS2015 (and any other IDE, ideally) and run a benchmark before choosing the manual parsing solution.
VS2015 default compiler does not treat $0 as a zeroth backreference. You need to use the "native" ECMAScript $& backreference to refer to the whole match from inside the replacement pattern.
Also, revo is right, in order to match \ you need to add it to the character class.
And note that in VS2015 you can use raw string literals. It is best practice to define regex patterns with raw string literals, as they help avoid overescaping (also known as backslash hell).
Solution:
std::cout << std::regex_replace(pszTestString,
    std::regex(R"([-+!\\\"\[\](){}^~*?:]|&&|\|\|)"), "\\$&") << std::endl;
The two changes are the \\ added to the character class and $& used in place of $0.
I need to match the text '\0' with the same regex that I would match 'a' or 'b'. (a regex for a character constant in C++). I've tried a bunch of different regexes, but haven't gotten a successful one yet. My latest attempt:
^['].|\\0[']
Most of the other things I've tried have given seg faults, so this is really the closest I've gotten.
This works pretty nicely with what I've tested ('a','b','\0').
If you don't have std::regex or boost::regex I guess what you can get out of it is the fact that the regex I used is ('.'|'\\0').
#include <boost/regex.hpp>
#include <string>
#include <iostream>
#include <vector>
int main() {
std::vector<std::string> strings;
strings.push_back(R"('a')");
strings.push_back(R"('b')");
strings.push_back(R"('\0')");
boost::regex rgx(R"(('.'|'\\0'))");
boost::smatch match;
for(auto& i : strings) {
if(boost::regex_match(i,match, rgx)) {
boost::ssub_match submatch = match[1];
std::cout << submatch.str() << '\n';
}
}
}
There's nothing magic about '\0'; it's just a character like any other, and there's almost nothing special you have to do to use it in a regular expression. The only problem you might run into is if you use it in the middle of a string literal that you pass to a function that treats it as the end of a string. To avoid that, force it into a std::string with an explicit length:
const char s[] = "a\0b";
std::string not_my_str(s); // not_my_str holds "a"
std::string str(s, 3); // str holds "a\0b"
Once you've constructed the string object, the embedded '\0' gets no special treatment. Except, of course, if you copy the contents with a function that treats it specially.
The regex that works (in this instance, using the C header ) is:
^('(.|([\\]0))')
Thanks to #WhozCraig for the help!
I'm writing a small command-line program that asks the user for polynomials in the form ax^2+bx^1+cx^0. I'm going to parse the data later, but for now I'm just trying to see if I can match the polynomial with the regular expression (\+|-|^)(\d*)x\^([0-9*]*). My problem is that it doesn't match multiple terms in the user-entered polynomial unless I change it to ((\+|-|^)(\d*)x\^([0-9*]*))* (the difference is that the entire expression is grouped and has an asterisk at the end). The first expression works if I type something such as "4x^2" but not "4x^2+3x^1+2x^0", since it doesn't check multiple times.
My question is, why won't Boost.Regex's regex_match() find multiple matches within the same string? It does in the regular expression editor I used (Expresso) but not in the actual C++ code. Is it supposed to be like that?
Let me know if something doesn't make sense and I'll try to clarify. Thanks for the help.
Edit1: Here's my code (I'm following the tutorial here: http://onlamp.com/pub/a/onlamp/2006/04/06/boostregex.html?page=3)
int main()
{
string polynomial;
cmatch matches; // matches
regex re("((\\+|-|^)(\\d*)x\\^([0-9*]*))*");
cout << "Please enter your polynomials in the form ax^2+bx^1+cx^0." << endl;
cout << "Polynomial:";
getline(cin, polynomial);
if(regex_match(polynomial.c_str(), matches, re))
{
for(int i = 0; i < matches.size(); i++)
{
string match(matches[i].first, matches[i].second);
cout << "\tmatches[" << i << "] = " << match << endl;
}
}
system("PAUSE");
return 0;
}
You're using the wrong thing -- regex_match is intended to check whether a (single) regex matches the entirety of a sequence of characters. As such, you need to either specify a regex that matches the whole input, or use something else. For your situation, it probably makes the most sense to just modify the regex as you've already done (group it and add a Kleene star). If you wanted to iterate over the individual terms of the polynomial, you'd probably want to use something like a regex_token_iterator.
Edit: Of course, since you're embedding this into C++, you also have to double all your backslashes. Looking at it, I'm also a little confused about the regex you're using; it doesn't look to me like it should really work quite right. Just for example, it seems to require a "+", "-" or "^" at the beginning of a term, but the first term won't normally have that. I'm also somewhat uncertain why there would be a "^" at the beginning of a term. Since the exponent is normally omitted when it's one, it's probably better to allow it to be omitted. Taking those into account, I get something like: "[-+]?(\d*)x(\^[0-9])*".
Incorporating that into some code, we can get something like this:
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <iostream>
int main() {
    std::string poly = "4x^2+3x^1+2x";
    // std::regex is the standardized (C++11) form of std::tr1::regex.
    std::regex term("[-+]?(\\d*)x(\\^[0-9])*");
    // With no submatch index given, the iterator yields each whole match.
    std::copy(std::sregex_token_iterator(poly.begin(), poly.end(), term),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}
At least for me, that prints out each term individually:
4x^2
+3x^1
+2x
Note that for the moment, I've just printed out each complete term, and modified your input to show off the ability to recognize a term that doesn't include a power (explicitly, anyway).
Edit: to collect the results into a vector instead of sending them to std::cout, you'd do something like this:
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>
#include <iostream>
int main() {
    std::string poly = "4x^2+3x^1+2x";
    // std::regex is the standardized (C++11) form of std::tr1::regex.
    std::regex term("[-+]?(\\d*)x(\\^[0-9])*");
    std::vector<std::string> terms;
    std::copy(std::sregex_token_iterator(poly.begin(), poly.end(), term),
              std::sregex_token_iterator(),
              std::back_inserter(terms));
    // Now terms[0] is the first term, terms[1] the second, and so on.
    return 0;
}
I'm familiar with Regex itself, but whenever I try to find any examples or documentation to use regex with Unix computers, I just get tutorials on how to write regex or how to use the .NET specific libraries available for Windows. I've been searching for a while and I can't find any good tutorials on C++ regex on Unix machines.
What I'm trying to do:
Parse a string using regex by breaking it up and then reading the different subgroups. To make a PHP analogy, something like preg_match that returns all $matches.
Consider using Boost.Regex.
An example (from the website):
bool validate_card_format(const std::string& s)
{
static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
return regex_match(s, e);
}
Another example:
// match any format with the regular expression:
const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
const std::string machine_format("\\1\\2\\3\\4");
const std::string human_format("\\1-\\2-\\3-\\4");
std::string machine_readable_card_number(const std::string s)
{
return regex_replace(s, e, machine_format, boost::match_default | boost::format_sed);
}
std::string human_readable_card_number(const std::string s)
{
return regex_replace(s, e, human_format, boost::match_default | boost::format_sed);
}
Look up the documentation for TR1 regexes or (almost equivalently) Boost.Regex. Both work quite nicely on various Unix systems. The TR1 regex classes were adopted into C++11, so with any current compiler they are available as std::regex in <regex>.
Edit: To break a string into subgroups, you can use an sregex_token_iterator. You can specify either what you want matched as tokens, or what you want matched as separators. Here's a quickie demo of both:
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <iostream>
int main() {
    std::string line;
    std::cout << "Please enter some words: " << std::flush;
    std::getline(std::cin, line);
    // std::regex is the standardized (C++11) form of std::tr1::regex.
    std::regex r("[ .,:;\\t\\n]+");
    std::regex w("[A-Za-z]+");
    std::cout << "Matching words:\n";
    // Default submatch (0): each match of w is a token.
    std::copy(std::sregex_token_iterator(line.begin(), line.end(), w),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    std::cout << "\nMatching separators:\n";
    // Submatch -1: matches of r are separators; the text between them is returned.
    std::copy(std::sregex_token_iterator(line.begin(), line.end(), r, -1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}
If you give it input like this: "This is some 999 text", the result is like this:
Matching words:
This
is
some
text
Matching separators:
This
is
some
999
text
You are looking for regcomp, regexec and regfree.
One thing to be careful about is that the POSIX regular expressions actually implement two different languages, basic (the default) and extended (include the flag REG_EXTENDED in the call to regcomp). If you are coming from the PHP world, the extended language is closer to what you are used to.
For perl-compatible regular expressions (pcre/preg), I'd suggest boost.regex.
My best bet would be boost::regex.
Try pcre. And pcrepp.
Feel free to have a look at this small color grep tool I wrote, on GitHub.
It uses regcomp, regexec and regfree that R Samuel Klatchko refers to.
I use "GNU regex": http://www.gnu.org/s/libc/manual/html_node/Regular-Expressions.html
It works well, but I can't find a clear solution for UTF-8 regular expressions.
Regards