Regex search for Text then match elements - c++

I would like to install a condition. I have already tried a lot but without success. Would someone be so kind and would help me?
My goal: if SharedKeys true then match elements(A B C D).
SharedKeys = "A","B","C","D"
Regex: (?!\")[A-Z]*?(?=\")
Match elements: A B C D
Update: (SharedKeys)?(?(1)|(?!\")[A-Z]*?(?=\"))
Match elements: SharedKeys A B C D
Link: https://regex101.com/r/WFxZh4/1
I think i have now what i needed, maybe helps others.
SharedKeys = "A","B","C","D"
BlaBla = "B", "B", "C"
Result: (SharedKeys|BlaBla)?(?(1)|(?!\")[A-Z]*?(?=\"))
Match elements: SharedKeys A B C D BlaBla B B C
Result for c++: [A-Za-z]+|(?!\")[A-Z]*?(?=\") (std::regex)

with std::regex:
std::string s1( R"(SharedKeys = "A","B","C","D")" );
std::regex rx( "(SharedKeys)?(?(1)|[A-Z](?=\\\"))" );
std::match_results< std::string::const_iterator > mr;
while( std::regex_search( s1, mr, rx ) ){
std::cout << mr.str() << '\n';
s1 = mr.suffix().str();
}
the output:
terminate called after throwing an instance of 'std::regex_error'
what(): Invalid special open parenthesis. Aborted (core dumped)
with boost::regex:
std::string s1( R"(SharedKeys = "A","B","C","D")" );
boost::regex rx( "(SharedKeys)?(?(1)|[A-Z](?=\\\"))" );
boost::match_results< std::string::const_iterator > mr;
while( boost::regex_search( s1, mr, rx ) ){
std::cout << mr.str() << '\n';
s1 = mr.suffix().str();
}
the output:
SharedKeys
A
B
C
D
here you can see the difference between this two flavors:
source of the screenshot
Note. The std::regex is buggy especially with gcc version 6.3.0 and lower versions

Related

C++ regex replace with a callback function

I have a map that stores id to value mapping, an input string can contain a bunch of ids. I need to replace those ids with their corresponding values. For example:
string = "I am in #1 city, it is now #2 time" // (#1 and #2 are ids here)
id_to_val_map = {1 => "New York", 2 => "summer"}
Desired output:
"I am in New York city, it is now summer time"
Is there a way I can have a callback function (that takes in the matched string and returns the string to be used as replacement) ? std::regex_replace doesn't seem to support that.
The alternative is to find all the matches, then compute their replacement values, and then perform the actual replacement. Which won't be that efficient.
You might do:
const std::map<int, std::string> m = {{1, "New York"}, {2, "summer"}};
std::string s = "I am in #1 city, it is now #2 time";
for (const auto& [id, value] : m) {
s = std::regex_replace(s, std::regex("#" + std::to_string(id)), value);
}
std::cout << s << std::endl;
Demo
A homegrown way is to use a while loop with regex_search() then
build the output string as you go.
This is essentially what regex_replace() does in a single pass.
No need to do a separate regex for each map item which has overhead of
reassignment on every item ( s=regex_replace() ) as well as covering the same
real estate with every pass.
Something like this regex
(?s)
( .*? ) # (1)
(?:
\#
( \d+ ) # (2)
| $
)
with this code
typedef std::string::const_iterator SITR;
typedef std::smatch X_smatch;
#define REGEX_SEARCH std::regex_search
std::regex _Rx = std::regex( "(?s)(.*?)(?:\\#(\\d+)|$)" );
SITR start = oldstr.begin();
SITR end = oldstr.end();
X_smatch m;
std::string newstr = "";
while ( REGEX_SEARCH( start, end, m, _Rx ) )
{
newstr.append( m[1].str() );
if ( m[2].matched ) {
{
// append the map keys value here, do error checking etc..
// std::string key = m[2].str();
int ndx = std::atoi( m[2].str() );
newstr.append( mymap[ ndx ] );
}
start = m[0].second;
}
// assign the old string with new string if need be
oldstr = newstr;

std::regex and ignoring flags

After learning basic c++ rules,I specialized my focus on std::regex, creating two console apps: 1.renrem and 2.bfind.
And I decided to create some convenient functions to deal with regex in c++ as easy as possible plus all with std; named RFC ( = regex function collection )
There are several strange things that always make me surprise, but this one ruined all my attempt and those two console apps.
One of the important functions is count_match that counts number of match inside a string. Here is the full code:
unsigned int count_match( const std::string& user_string, const std::string& user_pattern, const std::string& flags = "o" ){
const bool flags_has_i = flags.find( "i" ) < flags.size();
const bool flags_has_g = flags.find( "g" ) < flags.size();
std::regex::flag_type regex_flag = flags_has_i ? std::regex_constants::icase : std::regex_constants::ECMAScript;
// std::regex_constants::match_flag_type search_flag = flags_has_g ? std::regex_constants::match_default : std::regex_constants::format_first_only;
std::regex rx( user_pattern, regex_flag );
std::match_results< std::string::const_iterator > mr;
unsigned int counter = 0;
std::string temp = user_string;
while( std::regex_search( temp, mr, rx ) ){
temp = mr.suffix().str();
++counter;
}
if( flags_has_g ){
return counter;
} else {
if( counter >= 1 ) return 1;
else return 0;
}
}
First of all, as you can see, the line for search_flag was commented because it is ignored by std::regex_search and I do not know why? since -- the exact flag is accepted for std::regex_repalce. So std::regex_search ignores the format_first_only but std::regex_replace accepts it. Let's it goes.
The main problem is here that the icase flag is also ignored when the pattern is character class -> []. In fact when the pattern is only capital letter or small letter: [A-Z] or [a-z]
Supposing this string s = "ONE TWO THREE four five six seven"
the output for c++ std
std::cout << count_match( s, "[A-Z]+" ) << '\n'; // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n'; // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n'; // 3 => Global match plus insensitive
whereas for the exact perl and d laugauge and c++ with boost the output is:
std::cout << count_match( s, "[A-Z]+" ) << '\n'; // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n'; // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n'; // 7 => Global match plus insensitive
I know about regex flavors PCRE; or ECMAScript 262 that c++ uses it, But I have no ides why a simple flag, is ignored for the only search function that c++ has? Since std::regex_iterator and std::regex_token_iterator are also use this function internally.
And shortly, I can not use those two my apps and RFC with std library because if this!
So if someone knows according to which rule it is maybe a valid rude in ECMAScript 262 or perhaps if I am wrong anywhere please tell me. Thanks.
tested with
gcc version 6.3.0 20170519 (Ubuntu/Linaro 6.3.0-18ubuntu2~16.04)
clang version 3.8.0-2ubuntu4
perl code:
perl -le '++$c while $ARGV[0] =~ m/[A-Z]+/g; print $c ;' "ONE TWO THREE four five six seven" // 3
perl -le '++$c while $ARGV[0] =~ m/[A-Z]+/gi; print $c ;' "ONE TWO THREE four five six seven" // 7
d code:
uint count_match( ref const (char[]) user_string, const (char[]) user_pattern, const (char[]) flags ){
const bool flag_has_g = flags.indexOf( "g" ) != -1;
Regex!( char ) rx = regex( user_pattern, flags );
uint counter = 0;
foreach( mr; matchAll( user_string, rx ) ){
++counter;
}
if( flag_has_g ){
return counter;
} else {
if( counter >= 1 ) return 1;
else return 0;
}
}
the output:
writeln( count_match( s, "[A-Z]+", "g" ) ); // 3
writeln( count_match( s, "[A-Z]+", "gi" ) ); // 7
js code:
var s = "ONE TWO THREE four five six seven";
var rx1 = new RegExp( "[A-Z]+" , "g" );
var rx2 = new RegExp( "[A-Z]+" , "gi" );
var counter = 0;
while( rx1.exec( s ) ){
++counter;
}
document.write( counter + "<br>" ); // 3
counter = 0;
while( rx2.exec( s ) ){
++counter;
}
document.write( counter ); // 7
Okay. After testing with gcc 7.1.0 it turned out that with version below 6.3.0 the output is: 1 3 3 and but with 7.1.0 the output is 1 3 7
here is the link.
Also with this version of clang the output is correct. Here is the link. thanks to igor-tandetnik user
First of all I thought may this is a rule for ECMAScript, but after testing js code and seeing Igor Tandetnik commend I test the code with gcc 7.1.0 and it outputs the correct result.
For test the regex library, I use:
std::cout << ( rx.flags() & std::regex_constants::icase == std::regex_constants::icase ? "yes" : "no" ) << '\n';
So when the icase is set it returns true otherwise returns false. So I think there is no library fault.
Here is the test with gcc 7.1.0
Therefore all versions below gcc 7.1.0 has incorrect output.
For clang I have no ideas since I have clang 3.8.0 and it has incorrect output. But the online version even 3.7.1 output is correct.
screenshot with clang 3.8.0 for this code:
std::cout << count_match( s, "[A-Z]+" ) << '\n'; // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n'; // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n'; // 7 => Global match plus insensitive
So with online compiler the output is incorrect for clang 3.2 and below. But higher version outputs the correct result.
Please correct me if I am wrong
First of all, as you can see, the line for search_flag was commented because it is ignored by std::regex_search and I do not know why? since -- the exact flag is accepted for std::regex_repalce.
The flag in question is format_first_only. This flag makes sense only for a "replace" operation. In regex_replace, the default is "replace all" but if you pass this flag it becomes "replace first only."
In regex_match and regex_search, there is no replacement going on at all; both of those functions just find the first match (and in the case of regex_match, that match must consume the entire string). Since the flag is meaningless in that case, I would expect the implementation to ignore it; but I wouldn't fault the implementation for throwing an exception, either, if it chose to be noisy about it.
The main problem is here that the icase flag is also ignored when the pattern is character class -> []. In fact when the pattern is only capital letter or small letter: [A-Z] or [a-z]
icase working wrong for character classes is definitely a bug in your vendor's library.
Looks like libstdc++'s bug was fixed between GCC 6.3 (Dec 2016) and GCC 7.1 (May 2017).
Looks like libc++'s bug was fixed between Clang 3.2 (Dec 2012) and Clang 3.3 (Jun 2013).

Fail to match with negative look-ahead assertion when tracking down on a delimiter

As long as I know a little about RegExp, the two below patterns are the same:
`^(?:[ab]|ab)(.)(?:(?!\1).)+\1$`
`^(?:ab|[ab])(.)(?:(?!\1).)+\1$`
but the number [ 1 ] is false and the [ 2 ] is true, whereas it should be true for both:
code:
void main( immutable string[] args ){
immutable string str = "ab some-word ";
Regex!( char ) rx = regex( `^(?:[ab]|ab)(.)(?:(?!\1).)+\1$` );
immutable bool b1 = !matchFirst( str, rx ).empty();
writeln( b1 ); // false ( should be true )
rx = regex( `^(?:ab|[ab])(.)(?:(?!\1).)+\1$` );
immutable bool b2 = !matchFirst( str, rx ).empty();
writeln( b2 ); // true
}
Demo on regex101.com
the main problem is not related to character class [], since the following is true for both:
^(?:ab|[ab])(.)-\1$
^(?:[ab]|ab)(.)-\1$
but with: (.)(?:(?!\1).) it fails if a character-class appears at the beginning.
I am not sure but may it is the same bug that GCC below the version 5.3.0 have had.
here is my question and found out this bug:
the same regex but different results on Linux and Windows only C++
Please correct me if I am wrong

C++ regex_match not working

Here is part of my code
bool CSettings::bParseLine ( const char* input )
{
//_asm INT 3
std::string line ( input );
std::size_t position = std::string::npos, comment;
regex cvarPattern ( "\\.([a-zA-Z_]+)" );
regex parentPattern ( "^([a-zA-Z0-9_]+)\\." );
regex cvarValue ( "\\.[a-zA-Z0-9_]+[ ]*=[ ]*(\\d+\\.*\\d*)" );
std::cmatch matchedParent, matchedCvar;
if ( line.empty ( ) )
return false;
if ( !std::regex_match ( line.c_str ( ), matchedParent, parentPattern ) )
return false;
if ( !std::regex_match ( line.c_str ( ), matchedCvar, cvarPattern ) )
return false;
...
}
I try to separate with it lines which I read from file - lines look like:
foo.bar = 15
baz.asd = 13
ddd.dgh = 66
and I want to extract parts from it - e.g. for 1st line foo.bar = 15, I want to end up with something like:
a = foo
b = bar
c = 15
but now, regex is returning always false, I tested it on many online regex checkers, and even in visual studio, and it's working great, do I need some different syntax for C++ regex_match? I'm using visual studio 2013 community
The problem is that std::regex_match must match the entire string but you are trying to match only part of it.
You need to either use std::regex_search or alter your regular expression to match all three parts at once:
#include <regex>
#include <string>
#include <iostream>
const auto test =
{
"foo.bar = 15"
, "baz.asd = 13"
, "ddd.dgh = 66"
};
int main()
{
const std::regex r(R"~(([^.]+)\.([^\s]+)[^0-9]+(\d+))~");
// ( 1 ) ( 2 ) ( 3 ) <- capture groups
std::cmatch m;
for(const auto& line: test)
{
if(std::regex_match(line, m, r))
{
// m.str(0) is the entire matched string
// m.str(1) is the 1st capture group
// etc...
std::cout << "a = " << m.str(1) << '\n';
std::cout << "b = " << m.str(2) << '\n';
std::cout << "c = " << m.str(3) << '\n';
std::cout << '\n';
}
}
}
Regular expression: https://regex101.com/r/kB2cX3/2
Output:
a = foo
b = bar
c = 15
a = baz
b = asd
c = 13
a = ddd
b = dgh
c = 66
To focus on regex patterns I'd prefer to use raw string literals in c++:
regex cvarPattern ( R"rgx(\.([a-zA-Z_]+))rgx" );
regex parentPattern ( R"rgx(^([a-zA-Z0-9_]+)\.)rgx" );
regex cvarValue ( R"rgx(\.[a-zA-Z0-9_]+[ ]*=[ ]*(\d+\.*\d*))rgx" );
Everything between the rgx( )rgx delimiters doesn't need any extra escaping for c++ char literal characters.
Actually what you have written in your question resembles to those regular expressions I've been writing as raw string literals.
You probably simply meant something like
regex cvarPattern ( R"rgx(.([a-zA-Z_]+))rgx" );
regex parentPattern ( R"rgx(^([a-zA-Z0-9_]+).)rgx" );
regex cvarValue ( R"rgx(.[a-zA-Z0-9_]+[ ]*=[ ]*(\d+(\.\d*)?))rgx" );
I didn't dig in deeper, but I'm not getting all of these escaped characters in your regular expression patterns now.
As for your question in the comment, you can use a choice of matching sub-pattern groups, and check for which of them was applied in the matches structure:
regex cvarValue (
R"rgx(.[a-zA-Z0-9_]+[ ]*=[ ]*((\d+)|(\d+\.\d?)|([a-zA-Z]+)){1})rgx" );
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You probably don't need these cvarPattern and parentPattern regular expressions to inspect other (more detailed) views about the matching pattern.

Bug in std::regex?

Here is code :
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string pattern("[^c]ei");
pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
std::regex r(pattern);
std::smatch results;
std::string test_str = "cei";
if (std::regex_search(test_str, results, r))
std::cout << results.str() << std::endl;
return 0;
}
Output :
cei
The compiler used is gcc 4.9.1.
I'm a newbie learning regular expression.I expected nothing should be output,since "cei" doesn't match the pattern here. Am I doing it right? What's the problem?
Update:
This one has been reported and confirmed as a bug, for detail please visit here :
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63497
It's a bug in the implementation. Not only do a couple other tools I tried agree that your pattern does not match your input, but I tried this:
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string pattern("([a-z]*)([a-z])(e)(i)([a-z]*)");
std::regex r(pattern);
std::smatch results;
std::string test_str = "cei";
if (std::regex_search(test_str, results, r))
{
std::cout << results.str() << std::endl;
for (size_t i = 0; i < results.size(); ++i) {
std::ssub_match sub_match = results[i];
std::string sub_match_str = sub_match.str();
std::cout << i << ": " << sub_match_str << '\n';
}
}
}
This is basically similar to what you had, but I replaced [:alpha:] with [a-z] for simplicity, and I also temporarily replaced [^c] with [a-z] because that seems to make it work correctly. Here's what it prints (GCC 4.9.0 on Linux x86-64):
cei
0: cei
1:
2: c
3: e
4: i
5:
If I replace [a-z] where you had [^c] and just put f there instead, it correctly says the pattern doesn't match. But if I use [^c] like you did:
std::string pattern("([a-z]*)([^c])(e)(i)([a-z]*)");
Then I get this output:
cei
0: cei
1: cei
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_S_create
Aborted (core dumped)
So it claims to match successfully, and results[0] is "cei" which is expected. Then, results[1] is "cei" also, which I guess might be OK. But then results[2] crashes, because it tries to construct a std::string of length 18446744073709551614 with begin=nullptr. And that giant number is exactly 2^64 - 2, aka std::string::npos - 1 (on my system).
So I think there is an off-by-one error somewhere, and the impact can be much more than just a spurious regex match--it can crash at runtime.
The regex is correct and should not match the string "cei".
The regex can be tested and explained best in Perl:
my $regex = qr{ # start regular expression
[[:alpha:]]* # 0 or any number of alpha chars
[^c] # followed by NOT-c character
ei # followed by e and i characters
[[:alpha:]]* # followed by 0 or any number of alpha chars
}x; # end + declare 'x' mode (ignore whitespace)
print "xei" =~ /$regex/ ? "match\n" : "no match\n";
print "cei" =~ /$regex/ ? "match\n" : "no match\n";
The regex will first consume all chars to the end of the string ([[:alpha:]]*), then backtrack to find the NON-c char [^c] and proceed with the e and i matches (by backtracking another time).
Result:
"xei" --> match
"cei" --> no match
for obvious reasons. Any discrepancies to this in various C++ libraries and testing tools are the problem of the implementation there, imho.