std::regex - lookahead assertion not always working

I'm writing a module that makes some string substitutions in text that is then handed to a scripting language. The language's syntax is vaguely lisp-y, so expressions are bounded by parentheses and symbols are separated by spaces, most of them starting with '$'. A regular expression like this seems like it should give matches at the appropriate symbol boundaries:
auto re_match_abc = std::regex{ "(?=.*[[:space:]()])\\$abc(?=[()[:space:]].*)" };
But in my environment (Visual C++ 2017, 15.9.19, targeting C++17) it can match strings without a suitable boundary in front of them:
std::cout << " $abc -> " << std::regex_replace(" $abc ", re_match_abc, "***") << std::endl;
std::cout << " ($abc) -> " << std::regex_replace("($abc)", re_match_abc, "***") << std::endl;
std::cout << "xyz$abc -> " << std::regex_replace("xyz$abc ", re_match_abc, "***") << std::endl;
std::cout << " $abcdef -> " << std::regex_replace(" $abcdef", re_match_abc, "***") << std::endl;
// Result from VC++ 2017:
//
// $abc -> ***
// ($abc) -> (***)
// xyz$abc -> xyz*** <= What's going wrong here?
// $abcdef -> $abcdef
Why is that regex ignoring the positive-lookahead requirement to have at least one space or parenthesis before the matching text?
[I realize that there are other ways to do this job and to do it really robustly maybe I should use something to turn the string into a token stream, but for the immediate job I have (and because the person authoring the strings that get processed is sitting next to me, so we can coordinate) I thought that regex replacements would do for now.]

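You need to use a positive lookbehind instead. A lookahead only inspects the text that follows the current position, so (?=.*[[:space:]()]) is satisfied by the space after $abc rather than requiring one before it. What you really want is this: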
auto re_match_abc = std::regex{ "(?<=[[:space:]()])\\$abc(?=[()[:space:]])" };
You can try it out on a website like https://regex101.com/ (just remove the escaped backslash that's required for the C++ string). It explains what each piece of the regex is doing and shows you everything that matches.
Keep in mind that this will also match things like )$abc)
Edit: std::regex (with its default ECMAScript grammar) does not support lookbehind. For your specific case you might try something like this:
auto re_match_abc = std::regex{ "([[:space:]()])\\$abc(?=[()[:space:]])" };
std::cout << " $abc -> " << std::regex_replace(" $abc ", re_match_abc, "$1***") << std::endl;
std::cout << " ($abc) -> " << std::regex_replace("($abc)", re_match_abc, "$1***") << std::endl;
std::cout << "xyz$abc -> " << std::regex_replace("xyz$abc ", re_match_abc, "$1***") << std::endl;
std::cout << " $abcdef -> " << std::regex_replace(" $abcdef", re_match_abc, "$1***") << std::endl;
output:
$abc -> ***
($abc) -> (***)
xyz$abc -> xyz$abc
$abcdef -> $abcdef
Here instead of a lookbehind we have a normal capture group. In the replacement we're emitting whatever we captured (a parenthesis or space) followed by the actual string we want to replace $abc with.
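As a minimal sketch, the capture-group workaround can be wrapped in a small helper. The function name replace_abc and its parameters are hypothetical and not part of the original code. The helper assumes the replacement value contains no '$' and does not start with a digit, since std::regex_replace would otherwise read those characters as part of a format specifier.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): replaces the symbol $abc
// with `value` when it is preceded by whitespace or a parenthesis and
// followed by whitespace or a parenthesis.
// Assumption: `value` contains no '$' and does not start with a digit,
// because regex_replace would treat those as part of a format specifier.
std::string replace_abc(const std::string& text, const std::string& value) {
    // Group 1 captures the boundary character in front of $abc, because
    // std::regex has no lookbehind. The trailing boundary is only looked at,
    // not consumed.
    static const std::regex re{"([[:space:]()])\\$abc(?=[()[:space:]])"};
    // "$1" re-emits the captured boundary character before the new value.
    return std::regex_replace(text, re, "$1" + value);
}
```

Note that a $abc at the very start of the string is not replaced by this sketch, because there is no preceding character for group 1 to capture.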

Related

Problem with special characters with RegEx in C++

I have a problem with special characters in a string (from IIS SharePoint log files). The string contains a domain name followed by a backslash and a user name that starts with t, n, or r, so the backslash and the first letter are read as an escape sequence. My code is as follows:
std::setlocale(LC_ALL, ".ACP"); //Sets the locale to the ANSI code page obtained from the operating system. FR characters
std::string subject("2018-08-26 11:38:20 172.20.1.148 GET /BaseDocumentaire/Documents+de+la+page+Notes+de+services/Rappel+du+dispositif+de+Sécurité+relatif+aux+Moyens+de+paiement+et+d’épargne+en+agence.pdf - 80 0#.w|domainname\tonzaro 10.12.105.24 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:61.0)+Gecko/20100101+Firefox/61.0 200 0 0 29984");
std::string result;
std::string g1, g2, g5, g9, g10; //str groups in regex
try {
std::regex re("(\\d{4}-\\d{2}-\\d{2})( \\d{2}:\\d{2}:\\d{2})( 172.20.1.148)( GET | POST | HEAD )((/.*){1,4}/.*.(pdf|aspx))( -.*)(domainname.[a-zA-Z0-9]*)( \\d+.\\d+.\\d+.\\d+)");
std::sregex_iterator next(subject.begin(), subject.end(), re);
std::sregex_iterator end;
while (next != end) {
std::smatch match = *next;
std::cout << match.str() << "\n";
std::cout << "-------------------------------------------" << "\n";
g1 = match.str(1);
g2 = match.str(2);
g5 = match.str(5);
g9 = match.str(9);
g10 = match.str(10);
next++;
}
std::cout << "Date: " + g1 << "\n";
std::cout << "Time: " + g2 << "\n";
std::replace(g5.begin(), g5.end(), '+', ' ');
std::cout << "Link Document : " + g5 << "\n";
std::cout << "User: " + g9 << "\n";
std::cout << "IP: " + g10 << "\n";
}
catch (std::regex_error& e) {
std::cout << "Syntax error in the regular expression" << "\n";
}
My output for domain name is: domainname onzaro
Any help please for this problem with \, \t, \n or \r ?

I'd urge you to use raw string literals. They are designed for cases where the literal should not be processed in any way, such as yours.
The syntax is R"delimiter(raw_characters)delimiter", so in your case it could be:
std::string subject(R"raw(2018-08-26 11:38:20 172.20.1.148 GET /BaseDocumentaire/Documents+de+la+page+Notes+de+services/Rappel+du+dispositif+de+Sécurité+relatif+aux+Moyens+de+paiement+et+d’épargne+en+agence.pdf - 80 0#.w|domainname\tonzaro 10.12.105.24 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64;+rv:61.0)+Gecko/20100101+Firefox/61.0 200 0 0 29984)raw");
std::regex re( R"raw((\d{4}-\d{2}-\d{2})( \d{2}:\d{2}:\d{2})( 172.20.1.148)( GET | POST | HEAD )((/.*){1,4}/.*.(pdf|aspx))( -.*)(domainname.[a-zA-Z0-9]*)( \d+.\d+.\d+.\d+))raw");
(I might have missed some superfluous \ above.)
Those special character combinations are called escape sequences, and they are processed in ordinary string literals at compile time (in translation phase 5, to be precise). For raw string literals this transformation is suppressed.
With a raw literal you no longer have to think about escape handling. You only need to make sure that the closing sequence )delimiter" does not appear inside the literal, which could plausibly happen in a regex.
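The difference can be seen in a minimal sketch with a shortened version of the user name from the question. The variable names are made up for illustration only.

```cpp
#include <string>

// Ordinary literal: the compiler turns \t into a single tab character.
const std::string escaped = "domainname\tonzaro";

// Ordinary literal with the backslash escaped: backslash followed by 't'.
const std::string doubled = "domainname\\tonzaro";

// Raw literal: no escape processing, so the backslash stays as written.
const std::string raw = R"(domainname\tonzaro)";
```

Here escaped is one character shorter than the other two strings, while doubled and raw hold exactly the same characters.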

'\t' is one character, a horizontal tab. If you want the characters \ and t, you need to escape the backslash: "\\t".

What does the left shift operator (<<) do when it appears before std::cout?

These lines of code:
std::cout << "observerIndex : " <<
std::cout << pobserverIndex -> observerInt() ;
Generate the compiler error below:
file.C:2917:37: error: no match for 'operator<<' (operand types are 'std::basic_ostream<char>' and 'std::basic_ostream<char>')
std::cout << "observerIndex : " <<
^
Could anyone please tell me what the left shift operator (<<) is doing there (before std::cout << pobserverIndex -> observerInt())?

You appear to be missing a semicolon at the end of your first statement, plus you are repeating std::cout.
You need to use
std::cout << "observerIndex: " << pobserverIndex -> observerInt();
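Before C++11, a variant like
std::cout << "a" << std::cout << "b";
compiled and printed a pointer value (obtained through the stream's implicit conversion to void*), formatted as hexadecimal, between the strings "a" and "b". Since C++11 that implicit conversion no longer exists, so the expression fails to compile with a "no match for operator<<" error like the one you are seeing.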
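To illustrate the corrected statement, here is a minimal sketch that writes to a std::ostringstream so the result can be inspected. The Observer type and the describe function are hypothetical stand-ins for the asker's code, which is not shown in full.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the asker's observer class.
struct Observer {
    int observerInt() const { return 42; }
};

// Each operator<< returns a reference to the same stream, so the pieces are
// chained in one statement and the stream object is named only once.
std::string describe(const Observer* pobserverIndex) {
    std::ostringstream out;
    out << "observerIndex: " << pobserverIndex->observerInt();
    return out.str();
}
```

The same chaining applies to std::cout: name the stream once at the start of the statement and end the statement with a semicolon.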

PCRE does not match

I suppose it's something very stupid, however this does not match, and I have no idea why.
It compiles successfully and everything, but it just doesn't match.
I've already tried RE(".*"), but that doesn't work either.
The system is OS X (installed pcre using brew).
std::string s;
if (pcrecpp::RE("h.*o").FullMatch("hello", &s))
{
std::cout << "Successful match " << s << std::endl;
}

You are trying to extract one subpattern (in &s), but have not included any parentheses to capture that subpattern. Try this (untested, note parentheses).
std::string s;
if (pcrecpp::RE("(h.*o)").FullMatch("hello", &s))
{
std::cout << "Successful match " << s << std::endl;
}
The documentation at http://www.pcre.org/original/doc/html/pcrecpp.html has a similar example, stating:
Example: fails because there aren't enough sub-patterns:
!pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

Boost regex don't match tabs

I'm using boost regex_match and I have a problem with matching only text that contains no tab characters.
My test application looks as follows:
#include <iostream>
#include <string>
#include <boost/spirit/include/classic_regex.hpp>
int
main(int args, char** argv)
{
boost::match_results<std::string::const_iterator> what;
if(args == 3) {
std::string text(argv[1]);
boost::regex expression(argv[2]);
std::cout << "Text : " << text << std::endl;
std::cout << "Regex: " << expression << std::endl;
if(boost::regex_match(text, what, expression, boost::match_default) != 0) {
int i = 0;
std::cout << text;
if(what[0].matched)
std::cout << " matches with regex pattern!" << std::endl;
else
std::cout << " does not match with regex pattern!" << std::endl;
for(boost::match_results<std::string::const_iterator>::const_iterator it=what.begin(); it!=what.end(); ++it) {
std::cout << "[" << (i++) << "] " << it->str() << std::endl;
}
} else {
std::cout << "Expression does not match!" << std::endl;
}
} else {
std::cout << "Usage: $> ./boost-regex <text> <regex>" << std::endl;
}
return 0;
}
If I run the program with these arguments, I don't get the expected result:
$> ./boost-regex "`cat file`" "(?=.*[^\t]).*"
Text : This text includes some tabulators
Regex: (?=.*[^\t]).*
This text includes some tabulators matches with regex pattern!
[0] This text includes some tabulators
In this case I would have expected that what[0].matched is false, but it's not.
Is there any mistake in my regular expression?
Or do I have to use other format/match flag?
Thank you in advance!

I am not sure what you want to do. My understanding is, you want the regex to fail as soon as there is a tab in the text.
Your positive lookahead assertion (?=.*[^\t]) is true as soon as it finds a non-tab character, and there are a lot of non-tab characters in your text.
If you want the match to fail when there is a tab, go the other way round and use a negative lookahead assertion.
(?!.*\t).*
This assertion fails as soon as it finds a tab.
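A minimal sketch of the negative lookahead follows. It uses std::regex from the standard library as a stand-in for boost::regex, and the function name has_no_tab is made up for illustration. It assumes single-line input, because '.' does not match line terminators in the ECMAScript grammar that std::regex uses by default.

```cpp
#include <regex>
#include <string>

// Returns true when `text` contains no tab character.
// std::regex is used here instead of boost::regex (an assumption made so the
// sketch needs only the standard library).
// Assumption: `text` is a single line, since '.' stops at line terminators.
bool has_no_tab(const std::string& text) {
    // (?!.*\t) rejects the match if a tab can be found anywhere ahead of the
    // start position; the trailing .* then consumes the whole string.
    static const std::regex re{R"((?!.*\t).*)"};
    return std::regex_match(text, re);
}
```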

Constructing boost regex

I want to match every single number in the following string:
-0.237522264173E+01 0.110011117918E+01 0.563118085683E-01 0.540571836345E-01 -0.237680494785E+01 0.109394729137E+01 -0.237680494785E+01 0.109394729137E+01 0.392277532367E+02 0.478587433035E+02
However, for some reason the following boost::regex doesn't work:
(.*)(-?\\d+\\.\\d+E\\+\\d+ *){10}(.*)
What's wrong with it?
EDIT: posting relevant code:
std::ifstream plik("chains/peak-summary.txt");
std::string mystr((std::istreambuf_iterator<char>(plik)), std::istreambuf_iterator<char>());
plik.close();
boost::cmatch what;
boost::regex expression("(.*)(-?\\d+\\.\\d+E\\+\\d+ *){10}(.*)");
std::cout << "String to match against: \"" << mystr << "\"" << std::endl;
if(regex_match(mystr.c_str(), what, expression))
{
std::cout << "Match!";
std::cout << std::endl << what[0] << std::endl << what[1] << std::endl;
} else {
std::cout << "No match." << std::endl;
}
output:
String to match against: " -0.237555275450E+01 0.109397523269E+01 0.560420828508E-01 0.556732715285E-01 -0.237472295761E+01 0.110192835331E+01 -0.237472295761E+01 0.110192835331E+01 0.393040553508E+02 0.478540190640E+02
"
No match.
Also posting the contents of file read into the string:
[dare2be#schroedinger multinest-peak]$ cat chains/peak-summary.txt
-0.237555275450E+01 0.109397523269E+01 0.560420828508E-01 0.556732715285E-01 -0.237472295761E+01 0.110192835331E+01 -0.237472295761E+01 0.110192835331E+01 0.393040553508E+02 0.478540190640E+02

The (.*) around your regex match and consume all text at the start and end of the string, so if there are more than ten numbers, the first ones won't be matched.
Also, you're not allowing for negative exponents.
(-?\\d\\.\\d+E[+-]\\d+ *){10,}
should work.
This will match all of the numbers in a single string; if you want to match each number separately, you have to use (-?\\d\\.\\d+E[+-]\\d+) iteratively.
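Iterating over the individual numbers could look roughly like the sketch below. It uses std::regex and std::sregex_iterator from the standard library rather than Boost, and the helper name extract_numbers is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every number written as d.ddddE+dd or
// -d.ddddE-dd from `text`, in order of appearance.
// std::regex stands in for boost::regex in this sketch.
std::vector<std::string> extract_numbers(const std::string& text) {
    static const std::regex number{R"(-?\d\.\d+E[+-]\d+)"};
    std::vector<std::string> result;
    // A default-constructed sregex_iterator marks the end of the matches.
    for (std::sregex_iterator it(text.begin(), text.end(), number), end;
         it != end; ++it) {
        result.push_back(it->str());
    }
    return result;
}
```

Each element can then be converted with std::stod if numeric values are needed.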

Try with:
(-?[0-9]+\\.[0-9]+E[+-][0-9]+)
The (.*) at the beginning of your regex is greedy and matches the whole string.
