std::regex and ignoring flags - c++

After learning basic c++ rules,I specialized my focus on std::regex, creating two console apps: 1.renrem and 2.bfind.
And I decided to create some convenient functions to deal with regex in c++ as easy as possible plus all with std; named RFC ( = regex function collection )
There are several strange things that always make me surprise, but this one ruined all my attempt and those two console apps.
One of the important functions is count_match that counts number of match inside a string. Here is the full code:
unsigned int count_match( const std::string& user_string, const std::string& user_pattern, const std::string& flags = "o" ){
const bool flags_has_i = flags.find( "i" ) < flags.size();
const bool flags_has_g = flags.find( "g" ) < flags.size();
std::regex::flag_type regex_flag = flags_has_i ? std::regex_constants::icase : std::regex_constants::ECMAScript;
// std::regex_constants::match_flag_type search_flag = flags_has_g ? std::regex_constants::match_default : std::regex_constants::format_first_only;
std::regex rx( user_pattern, regex_flag );
std::match_results< std::string::const_iterator > mr;
unsigned int counter = 0;
std::string temp = user_string;
while( std::regex_search( temp, mr, rx ) ){
temp = mr.suffix().str();
++counter;
}
if( flags_has_g ){
return counter;
} else {
if( counter >= 1 ) return 1;
else return 0;
}
}
First of all, as you can see, the line for search_flag was commented because it is ignored by std::regex_search and I do not know why? since -- the exact flag is accepted for std::regex_repalce. So std::regex_search ignores the format_first_only but std::regex_replace accepts it. Let's it goes.
The main problem is here that the icase flag is also ignored when the pattern is character class -> []. In fact when the pattern is only capital letter or small letter: [A-Z] or [a-z]
Supposing this string s = "ONE TWO THREE four five six seven"
the output for c++ std
std::cout << count_match( s, "[A-Z]+" ) << '\n'; // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n'; // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n'; // 3 => Global match plus insensitive
whereas for the exact perl and d laugauge and c++ with boost the output is:
std::cout << count_match( s, "[A-Z]+" ) << '\n'; // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n'; // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n'; // 7 => Global match plus insensitive
I know about regex flavors PCRE; or ECMAScript 262 that c++ uses it, But I have no ides why a simple flag, is ignored for the only search function that c++ has? Since std::regex_iterator and std::regex_token_iterator are also use this function internally.
And shortly, I can not use those two my apps and RFC with std library because if this!
So if someone knows according to which rule it is maybe a valid rude in ECMAScript 262 or perhaps if I am wrong anywhere please tell me. Thanks.
tested with
gcc version 6.3.0 20170519 (Ubuntu/Linaro 6.3.0-18ubuntu2~16.04)
clang version 3.8.0-2ubuntu4
perl code:
perl -le '++$c while $ARGV[0] =~ m/[A-Z]+/g; print $c ;' "ONE TWO THREE four five six seven" // 3
perl -le '++$c while $ARGV[0] =~ m/[A-Z]+/gi; print $c ;' "ONE TWO THREE four five six seven" // 7
d code:
uint count_match( ref const (char[]) user_string, const (char[]) user_pattern, const (char[]) flags ){
const bool flag_has_g = flags.indexOf( "g" ) != -1;
Regex!( char ) rx = regex( user_pattern, flags );
uint counter = 0;
foreach( mr; matchAll( user_string, rx ) ){
++counter;
}
if( flag_has_g ){
return counter;
} else {
if( counter >= 1 ) return 1;
else return 0;
}
}
the output:
writeln( count_match( s, "[A-Z]+", "g" ) ); // 3
writeln( count_match( s, "[A-Z]+", "gi" ) ); // 7
js code:
var s = "ONE TWO THREE four five six seven";
var rx1 = new RegExp( "[A-Z]+" , "g" );
var rx2 = new RegExp( "[A-Z]+" , "gi" );
var counter = 0;
while( rx1.exec( s ) ){
++counter;
}
document.write( counter + "<br>" ); // 3
counter = 0;
while( rx2.exec( s ) ){
++counter;
}
document.write( counter ); // 7
Okay. After testing with gcc 7.1.0 it turned out that with version below 6.3.0 the output is: 1 3 3 and but with 7.1.0 the output is 1 3 7
here is the link.
Also with this version of clang the output is correct. Here is the link. thanks to igor-tandetnik user

First of all I thought may this is a rule for ECMAScript, but after testing js code and seeing Igor Tandetnik commend I test the code with gcc 7.1.0 and it outputs the correct result.
For test the regex library, I use:
std::cout << ( rx.flags() & std::regex_constants::icase == std::regex_constants::icase ? "yes" : "no" ) << '\n';
So when the icase is set it returns true otherwise returns false. So I think there is no library fault.
Here is the test with gcc 7.1.0
Therefore all versions below gcc 7.1.0 has incorrect output.
For clang I have no ideas since I have clang 3.8.0 and it has incorrect output. But the online version even 3.7.1 output is correct.
screenshot with clang 3.8.0 for this code:
std::cout << count_match( s, "[A-Z]+" ) << '\n'; // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n'; // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n'; // 7 => Global match plus insensitive
So with online compiler the output is incorrect for clang 3.2 and below. But higher version outputs the correct result.
Please correct me if I am wrong

First of all, as you can see, the line for search_flag was commented because it is ignored by std::regex_search and I do not know why? since -- the exact flag is accepted for std::regex_repalce.
The flag in question is format_first_only. This flag makes sense only for a "replace" operation. In regex_replace, the default is "replace all" but if you pass this flag it becomes "replace first only."
In regex_match and regex_search, there is no replacement going on at all; both of those functions just find the first match (and in the case of regex_match, that match must consume the entire string). Since the flag is meaningless in that case, I would expect the implementation to ignore it; but I wouldn't fault the implementation for throwing an exception, either, if it chose to be noisy about it.
The main problem is here that the icase flag is also ignored when the pattern is character class -> []. In fact when the pattern is only capital letter or small letter: [A-Z] or [a-z]
icase working wrong for character classes is definitely a bug in your vendor's library.
Looks like libstdc++'s bug was fixed between GCC 6.3 (Dec 2016) and GCC 7.1 (May 2017).
Looks like libc++'s bug was fixed between Clang 3.2 (Dec 2012) and Clang 3.3 (Jun 2013).

Related

Regex search for Text then match elements

I would like to install a condition. I have already tried a lot but without success. Would someone be so kind and would help me?
My goal: if SharedKeys true then match elements(A B C D).
SharedKeys = "A","B","C","D"
Regex: (?!\")[A-Z]*?(?=\")
Match elements: A B C D
Update: (SharedKeys)?(?(1)|(?!\")[A-Z]*?(?=\"))
Match elements: SharedKeys A B C D
Link: https://regex101.com/r/WFxZh4/1
I think i have now what i needed, maybe helps others.
SharedKeys = "A","B","C","D"
BlaBla = "B", "B", "C"
Result: (SharedKeys|BlaBla)?(?(1)|(?!\")[A-Z]*?(?=\"))
Match elements: SharedKeys A B C D BlaBla B B C
Result for c++: [A-Za-z]+|(?!\")[A-Z]*?(?=\") (std::regex)
with std::regex:
std::string s1( R"(SharedKeys = "A","B","C","D")" );
std::regex rx( "(SharedKeys)?(?(1)|[A-Z](?=\\\"))" );
std::match_results< std::string::const_iterator > mr;
while( std::regex_search( s1, mr, rx ) ){
std::cout << mr.str() << '\n';
s1 = mr.suffix().str();
}
the output:
terminate called after throwing an instance of 'std::regex_error'
what(): Invalid special open parenthesis. Aborted (core dumped)
with boost::regex:
std::string s1( R"(SharedKeys = "A","B","C","D")" );
boost::regex rx( "(SharedKeys)?(?(1)|[A-Z](?=\\\"))" );
boost::match_results< std::string::const_iterator > mr;
while( boost::regex_search( s1, mr, rx ) ){
std::cout << mr.str() << '\n';
s1 = mr.suffix().str();
}
the output:
SharedKeys
A
B
C
D
here you can see the difference between this two flavors:
source of the screenshot
Note. The std::regex is buggy especially with gcc version 6.3.0 and lower versions

C++11 regex matching capturing group multiple times

Could someone please help me to extract the text between the : and the ^ symbols using a JavaScript (ECMAScript) regular expression in C++11. I do not need to capture the hw-descriptor itself - but it does have to be present in the line in order for the rest of the line to be considered for a match. Also the :p....^, :m....^ and :u....^ can arrive in any order and there has to be at least 1 present.
I tried using the following regular expression:
static const std::regex gRegex("(?:hw-descriptor)(:[pmu](.*?)\\^)+", std::regex::icase);
against the following text line:
"hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^"
Here is the code which posted on a live coliru. It shows how I attempted to solve this problem, however I am only getting 1 match. I need to see how to extract each of the potential 3 matches corresponding to the p m or u characters described earlier.
#include <iostream>
#include <string>
#include <vector>
#include <regex>
int main()
{
static const std::regex gRegex("(?:hw-descriptor)(:[pmu](.*?)\\^)+", std::regex::icase);
std::string foo = "hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^";
// I seem to only get 1 match here, I was expecting
// to loop through each of the matches, looks like I need something like
// a pcre global option but I don't know how.
std::for_each(std::sregex_iterator(foo.cbegin(), foo.cend(), gRegex), std::sregex_iterator(),
[&](const auto& rMatch) {
for (int i=0; i< static_cast<int>(rMatch.size()); ++i) {
std::cout << rMatch[i] << std::endl;
}
});
}
The above program gives the following output:
g++ -std=c++14 -O2 -Wall -pedantic -pthread main.cpp && ./a.out
hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^
:uTEXT3^
TEXT3
With std::regex, you cannot keep mutliple repeated captures when matching a certain string with consecutive repeated patterns.
What you may do is to match the overall texts containing the prefix and the repeated chunks, capture the latter into a separate group, and then use a second smaller regex to grab all the occurrences of the substrings you want separately.
The first regex here may be
hw-descriptor((?::[pmu][^^]*\\^)+)
See the online demo. It will match hw-descriptor and ((?::[pmu][^^]*\\^)+) will capture into Group 1 one or more repetitions of :[pmu][^^]*\^ pattern: :, p/m/u, 0 or more chars other than ^ and then ^. Upon finding a match, use :[pmu][^^]*\^ regex to return all the real "matches".
C++ demo:
static const std::regex gRegex("hw-descriptor((?::[pmu][^^]*\\^)+)", std::regex::icase);
static const std::regex lRegex(":[pmu][^^]*\\^", std::regex::icase);
std::string foo = "hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^ hw-descriptor:pTEXT8^:mTEXT8^:uTEXT83^";
std::smatch smtch;
for(std::sregex_iterator i = std::sregex_iterator(foo.begin(), foo.end(), gRegex);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << "Match value: " << m.str() << std::endl;
std::string x = m.str(1);
for(std::sregex_iterator j = std::sregex_iterator(x.begin(), x.end(), lRegex);
j != std::sregex_iterator();
++j)
{
std::cout << "Element value: " << (*j).str() << std::endl;
}
}
Output:
Match value: hw-descriptor:pTEXT1^:mTEXT2^:uTEXT3^
Element value: :pTEXT1^
Element value: :mTEXT2^
Element value: :uTEXT3^
Match value: hw-descriptor:pTEXT8^:mTEXT8^:uTEXT83^
Element value: :pTEXT8^
Element value: :mTEXT8^
Element value: :uTEXT83^

C++ regex_match not working

Here is part of my code
bool CSettings::bParseLine ( const char* input )
{
//_asm INT 3
std::string line ( input );
std::size_t position = std::string::npos, comment;
regex cvarPattern ( "\\.([a-zA-Z_]+)" );
regex parentPattern ( "^([a-zA-Z0-9_]+)\\." );
regex cvarValue ( "\\.[a-zA-Z0-9_]+[ ]*=[ ]*(\\d+\\.*\\d*)" );
std::cmatch matchedParent, matchedCvar;
if ( line.empty ( ) )
return false;
if ( !std::regex_match ( line.c_str ( ), matchedParent, parentPattern ) )
return false;
if ( !std::regex_match ( line.c_str ( ), matchedCvar, cvarPattern ) )
return false;
...
}
I try to separate with it lines which I read from file - lines look like:
foo.bar = 15
baz.asd = 13
ddd.dgh = 66
and I want to extract parts from it - e.g. for 1st line foo.bar = 15, I want to end up with something like:
a = foo
b = bar
c = 15
but now, regex is returning always false, I tested it on many online regex checkers, and even in visual studio, and it's working great, do I need some different syntax for C++ regex_match? I'm using visual studio 2013 community
The problem is that std::regex_match must match the entire string but you are trying to match only part of it.
You need to either use std::regex_search or alter your regular expression to match all three parts at once:
#include <regex>
#include <string>
#include <iostream>
const auto test =
{
"foo.bar = 15"
, "baz.asd = 13"
, "ddd.dgh = 66"
};
int main()
{
const std::regex r(R"~(([^.]+)\.([^\s]+)[^0-9]+(\d+))~");
// ( 1 ) ( 2 ) ( 3 ) <- capture groups
std::cmatch m;
for(const auto& line: test)
{
if(std::regex_match(line, m, r))
{
// m.str(0) is the entire matched string
// m.str(1) is the 1st capture group
// etc...
std::cout << "a = " << m.str(1) << '\n';
std::cout << "b = " << m.str(2) << '\n';
std::cout << "c = " << m.str(3) << '\n';
std::cout << '\n';
}
}
}
Regular expression: https://regex101.com/r/kB2cX3/2
Output:
a = foo
b = bar
c = 15
a = baz
b = asd
c = 13
a = ddd
b = dgh
c = 66
To focus on regex patterns I'd prefer to use raw string literals in c++:
regex cvarPattern ( R"rgx(\.([a-zA-Z_]+))rgx" );
regex parentPattern ( R"rgx(^([a-zA-Z0-9_]+)\.)rgx" );
regex cvarValue ( R"rgx(\.[a-zA-Z0-9_]+[ ]*=[ ]*(\d+\.*\d*))rgx" );
Everything between the rgx( )rgx delimiters doesn't need any extra escaping for c++ char literal characters.
Actually what you have written in your question resembles to those regular expressions I've been writing as raw string literals.
You probably simply meant something like
regex cvarPattern ( R"rgx(.([a-zA-Z_]+))rgx" );
regex parentPattern ( R"rgx(^([a-zA-Z0-9_]+).)rgx" );
regex cvarValue ( R"rgx(.[a-zA-Z0-9_]+[ ]*=[ ]*(\d+(\.\d*)?))rgx" );
I didn't dig in deeper, but I'm not getting all of these escaped characters in your regular expression patterns now.
As for your question in the comment, you can use a choice of matching sub-pattern groups, and check for which of them was applied in the matches structure:
regex cvarValue (
R"rgx(.[a-zA-Z0-9_]+[ ]*=[ ]*((\d+)|(\d+\.\d?)|([a-zA-Z]+)){1})rgx" );
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You probably don't need these cvarPattern and parentPattern regular expressions to inspect other (more detailed) views about the matching pattern.

Bug in std::regex?

Here is code :
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string pattern("[^c]ei");
pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
std::regex r(pattern);
std::smatch results;
std::string test_str = "cei";
if (std::regex_search(test_str, results, r))
std::cout << results.str() << std::endl;
return 0;
}
Output :
cei
The compiler used is gcc 4.9.1.
I'm a newbie learning regular expression.I expected nothing should be output,since "cei" doesn't match the pattern here. Am I doing it right? What's the problem?
Update:
This one has been reported and confirmed as a bug, for detail please visit here :
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63497
It's a bug in the implementation. Not only do a couple other tools I tried agree that your pattern does not match your input, but I tried this:
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string pattern("([a-z]*)([a-z])(e)(i)([a-z]*)");
std::regex r(pattern);
std::smatch results;
std::string test_str = "cei";
if (std::regex_search(test_str, results, r))
{
std::cout << results.str() << std::endl;
for (size_t i = 0; i < results.size(); ++i) {
std::ssub_match sub_match = results[i];
std::string sub_match_str = sub_match.str();
std::cout << i << ": " << sub_match_str << '\n';
}
}
}
This is basically similar to what you had, but I replaced [:alpha:] with [a-z] for simplicity, and I also temporarily replaced [^c] with [a-z] because that seems to make it work correctly. Here's what it prints (GCC 4.9.0 on Linux x86-64):
cei
0: cei
1:
2: c
3: e
4: i
5:
If I replace [a-z] where you had [^c] and just put f there instead, it correctly says the pattern doesn't match. But if I use [^c] like you did:
std::string pattern("([a-z]*)([^c])(e)(i)([a-z]*)");
Then I get this output:
cei
0: cei
1: cei
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_S_create
Aborted (core dumped)
So it claims to match successfully, and results[0] is "cei" which is expected. Then, results[1] is "cei" also, which I guess might be OK. But then results[2] crashes, because it tries to construct a std::string of length 18446744073709551614 with begin=nullptr. And that giant number is exactly 2^64 - 2, aka std::string::npos - 1 (on my system).
So I think there is an off-by-one error somewhere, and the impact can be much more than just a spurious regex match--it can crash at runtime.
The regex is correct and should not match the string "cei".
The regex can be tested and explained best in Perl:
my $regex = qr{ # start regular expression
[[:alpha:]]* # 0 or any number of alpha chars
[^c] # followed by NOT-c character
ei # followed by e and i characters
[[:alpha:]]* # followed by 0 or any number of alpha chars
}x; # end + declare 'x' mode (ignore whitespace)
print "xei" =~ /$regex/ ? "match\n" : "no match\n";
print "cei" =~ /$regex/ ? "match\n" : "no match\n";
The regex will first consume all chars to the end of the string ([[:alpha:]]*), then backtrack to find the NON-c char [^c] and proceed with the e and i matches (by backtracking another time).
Result:
"xei" --> match
"cei" --> no match
for obvious reasons. Any discrepancies to this in various C++ libraries and testing tools are the problem of the implementation there, imho.

boost regex whitespace followed by one or more asterisk

why does the following boost regex not return the results I am looking for (starts with 0 ore more whitespace followed by one or more asterisk)?
boost::regex tmpCommentRegex("(^\\s*)\\*+");
for (std::vector<std::string>::iterator vect_it =
tmpInputStringLines.begin(); vect_it != tmpInputStringLines.end();
++vect_it) {
boost::match_results<std::string::const_iterator> tmpMatch;
if (boost::regex_match((*vect_it), tmpMatch, tmpCommentRegex,
boost::match_default) == 0) {
std::cout << "Found comment " << (*vect_it) << std::endl;
} else {
std::cout << "No comment" << std::endl;
}
}
On the following input:
* Script 7
[P]%OMO * change
[P]%QMS * change
[T]%OMO * change
[T]%QMM * change
[S]%G1 * Resume
[]
This should read
Found comment * Script 7
No comment
No comment
No comment
No comment
No comment
No comment
Quoting from the documentation for regex_match:
Note that the result is true only if the expression matches the whole of the input sequence. If you want to search for an expression somewhere within the sequence then use regex_search. If you want to match a prefix of the character string then use regex_search with the flag match_continuous set.
None of your input lines are matched by your regular expression as a whole, so the program works as expected. You should use regex_search to get the desired behavior.
Besides, regex_match and regex_search both return bool and not int, so testing for == 0 is wrong.