C++ regex_match not working - c++

Here is part of my code:
bool CSettings::bParseLine ( const char* input )
{
    //_asm INT 3
    std::string line ( input );
    std::size_t position = std::string::npos, comment;
    regex cvarPattern ( "\\.([a-zA-Z_]+)" );
    regex parentPattern ( "^([a-zA-Z0-9_]+)\\." );
    regex cvarValue ( "\\.[a-zA-Z0-9_]+[ ]*=[ ]*(\\d+\\.*\\d*)" );
    std::cmatch matchedParent, matchedCvar;
    if ( line.empty ( ) )
        return false;
    if ( !std::regex_match ( line.c_str ( ), matchedParent, parentPattern ) )
        return false;
    if ( !std::regex_match ( line.c_str ( ), matchedCvar, cvarPattern ) )
        return false;
    ...
}
I use it to split lines that I read from a file. The lines look like:
foo.bar = 15
baz.asd = 13
ddd.dgh = 66
and I want to extract the parts of each line. E.g. for the 1st line, foo.bar = 15, I want to end up with something like:
a = foo
b = bar
c = 15
but right now regex_match always returns false. I tested the patterns on many online regex checkers, and even in Visual Studio, and they work fine. Do I need some different syntax for C++ regex_match? I'm using Visual Studio 2013 Community.

The problem is that std::regex_match must match the entire string but you are trying to match only part of it.
You need to either use std::regex_search or alter your regular expression to match all three parts at once:
#include <regex>
#include <string>
#include <iostream>
const auto test =
{
      "foo.bar = 15"
    , "baz.asd = 13"
    , "ddd.dgh = 66"
};
int main()
{
    const std::regex r(R"~(([^.]+)\.([^\s]+)[^0-9]+(\d+))~");
    //                     (  1  )  (   2  )       ( 3 ) <- capture groups
    std::cmatch m;
    for(const auto& line: test)
    {
        if(std::regex_match(line, m, r))
        {
            // m.str(0) is the entire matched string
            // m.str(1) is the 1st capture group
            // etc...
            std::cout << "a = " << m.str(1) << '\n';
            std::cout << "b = " << m.str(2) << '\n';
            std::cout << "c = " << m.str(3) << '\n';
            std::cout << '\n';
        }
    }
}
Regular expression: https://regex101.com/r/kB2cX3/2
Output:
a = foo
b = bar
c = 15
a = baz
b = asd
c = 13
a = ddd
b = dgh
c = 66
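The other option mentioned above, std::regex_search, can be sketched as follows. It keeps the three patterns from the question and only swaps the matching function; the ParsedLine struct and the parse_line name are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for one parsed "parent.cvar = value" line.
struct ParsedLine {
    std::string parent;
    std::string cvar;
    std::string value;
};

// Returns true and fills `out` when all three partial patterns are found.
bool parse_line(const std::string& line, ParsedLine& out)
{
    // The patterns from the question, rewritten as raw string literals.
    static const std::regex parentPattern(R"(^([a-zA-Z0-9_]+)\.)");
    static const std::regex cvarPattern(R"(\.([a-zA-Z_]+))");
    static const std::regex cvarValue(R"(\.[a-zA-Z0-9_]+[ ]*=[ ]*(\d+\.*\d*))");

    std::smatch parent, cvar, value;
    // regex_search succeeds if the pattern matches anywhere in the line,
    // whereas regex_match requires the pattern to cover the whole line.
    if (!std::regex_search(line, parent, parentPattern)) return false;
    if (!std::regex_search(line, cvar, cvarPattern)) return false;
    if (!std::regex_search(line, value, cvarValue)) return false;

    out = {parent.str(1), cvar.str(1), value.str(1)};
    return true;
}
```

Calling parse_line("foo.bar = 15", p) should leave p.parent, p.cvar and p.value holding foo, bar and 15.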

To focus on regex patterns I'd prefer to use raw string literals in c++:
regex cvarPattern ( R"rgx(\.([a-zA-Z_]+))rgx" );
regex parentPattern ( R"rgx(^([a-zA-Z0-9_]+)\.)rgx" );
regex cvarValue ( R"rgx(\.[a-zA-Z0-9_]+[ ]*=[ ]*(\d+\.*\d*))rgx" );
Nothing between the rgx( and )rgx delimiters needs any extra escaping for C++ string literals.
Actually, what you wrote in your question is equivalent to the regular expressions I've written above as raw string literals.
You probably simply meant something like
regex cvarPattern ( R"rgx(.([a-zA-Z_]+))rgx" );
regex parentPattern ( R"rgx(^([a-zA-Z0-9_]+).)rgx" );
regex cvarValue ( R"rgx(.[a-zA-Z0-9_]+[ ]*=[ ]*(\d+(\.\d*)?))rgx" );
I didn't dig deeper, but I don't see why all of these characters are escaped in your regular expression patterns.
As for your question in the comment, you can use a choice of matching sub-pattern groups, and check for which of them was applied in the matches structure:
regex cvarValue (
R"rgx(.[a-zA-Z0-9_]+[ ]*=[ ]*((\d+)|(\d+\.\d?)|([a-zA-Z]+)){1})rgx" );
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You probably don't need the separate cvarPattern and parentPattern regular expressions just to inspect other (more detailed) parts of the match.
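As a rough sketch of checking which alternative was applied: each alternative gets its own capture group, and the matched member of the corresponding sub_match tells you whether that group took part. The pattern below is a variation on the one above (the float alternative is tried first and the whole line is matched), and classify_value is a hypothetical helper name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: classify the value of a "parent.cvar = value" line
// by checking which alternative's capture group took part in the match.
std::string classify_value(const std::string& line)
{
    // Groups: 1 = parent, 2 = cvar, 3 = float, 4 = integer, 5 = word.
    static const std::regex pattern(
        R"rgx(([a-zA-Z0-9_]+)\.([a-zA-Z_]+)[ ]*=[ ]*(?:(\d+\.\d*)|(\d+)|([a-zA-Z]+)))rgx");

    std::smatch m;
    if (!std::regex_match(line, m, pattern)) return "no match";
    if (m[3].matched) return "float";
    if (m[4].matched) return "integer";
    if (m[5].matched) return "word";
    return "no match"; // not expected: a successful match uses one alternative
}
```

With this sketch, "foo.bar = 15" is classified as an integer, "foo.bar = 1.5" as a float, and "foo.bar = on" as a word.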

Related

IntelliJ: Regular expression to join multiple lines into single CSV line?

Occasionally I need to join multiple lines of data into a single line, and in this case, specifically as comma-separated values on a single line:
input: (lines pasted into some Android Studio editor tab)
Rush
IQ
Saga
Yes
desired output:
'Rush','IQ','Saga','Yes'
Using Edit > Find > Replace, I got close with this regex pattern, which matches the newline character (\n) with the goal of eliminating it:
search: ^(.*)$\n
replace: '$1',
[x] Regex
but produces this undesired output:
'Rush',IQ
'Saga',Yes
because after a newline is eliminated, the following line is already joined to the current one, so it's skipped, and we get this "every other line" behavior.
The fastest and easiest way I could think of is to replace \n by ',' and then manually wrap the whole line in quotes:
The result of the first replacement would be:
Rush','IQ','Saga','Yes
And then just manually add first and last quote.
Step 1: Concatenate the lines, use
(.+)(?:\R|\z)
Replace with '$1',.
The (.+)(?:\R|\z) pattern matches any 1+ chars other than line break chars as many as possible (.+) and captures this into Group 1 and (?:\R|\z) matches either a line break sequence (\R) or (|) the very end of the string (\z).
Step 2: Post-process by replacing ,$ with an empty string. This pattern matches , at the end of the line.
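For readers who want to script the same two steps outside the IDE, here is a minimal C++ sketch. It is an adaptation, not the IntelliJ syntax: std::regex's ECMAScript grammar has no \R or \z, so the line break is written as an optional \r?\n, and join_lines_quoted is a made-up name. It assumes the input has no blank lines.

```cpp
#include <regex>
#include <string>

// Joins the lines of `text` into one line of single-quoted, comma-separated values.
std::string join_lines_quoted(const std::string& text)
{
    // Step 1: wrap each non-empty line in quotes, append a comma, and swallow
    // the line break that follows it (if any).
    static const std::regex line(R"(([^\r\n]+)(?:\r?\n)?)");
    const std::string joined = std::regex_replace(text, line, "'$1',");

    // Step 2: drop the single trailing comma left over from the last line.
    static const std::regex trailing_comma(",$");
    return std::regex_replace(joined, trailing_comma, "");
}
```

For the input Rush, IQ, Saga, Yes (one per line) this yields 'Rush','IQ','Saga','Yes'.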
Occasionally I need to join multiple lines of data into a single line, and in this case, specifically as comma-separated values on a single line:
Regex may not be the best solution for this.
CSV library
There are several comma-separated values (CSV) libraries available to make quick work of this.
The libraries will handle a particular problem you may overlook in writing your own code: some of your lines of input may contain the single-quote mark within their content. Such cases need to be escaped. Quoting RFC 4180 section 2.7:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
Here is an example of using Apache Commons CSV library.
We use lambda syntax with a Scanner to get an Iterable of the lines of text from your input.
We specify using a single-quote, as you desire, rather than default of double-quote in standard CSV.
We use try-with-resources syntax to automatically close the CSVPrinter object, whether our code runs successfully or throws an exception.
String input = "Rush\n" +
        "IQ\n" +
        "Saga\n" +
        "Yes";
Iterable < String > iterable = ( ) -> new Scanner( input ).useDelimiter( "\n" );  // Lambda syntax to get an `Iterable` of lines from a `String`.
CSVFormat format =
        CSVFormat
                .RFC4180
                .withQuoteMode( QuoteMode.ALL )
                .withQuote( '\'' );
StringBuilder stringBuilder = new StringBuilder();
try (
        CSVPrinter printer = new CSVPrinter( stringBuilder , format ) ;
)
{
    printer.printRecord( iterable );
}
catch ( IOException e )
{
    e.printStackTrace();
}
String output = stringBuilder.toString();
System.out.println( "output: " + output );
When run:
output: 'Rush','IQ','Saga','Yes'
We can shorten that code.
try (
        CSVPrinter printer = new CSVPrinter( new StringBuilder() , CSVFormat.RFC4180.withQuoteMode( QuoteMode.ALL ).withQuote( '\'' ) ) ;
)
{
    printer.printRecord( ( Iterable < String > ) ( ) -> new Scanner( input ).useDelimiter( "\n" ) );
    System.out.println( printer.getOut().toString() );  // Or: `return printer.getOut()` returning an `Appendable` object.
}
catch ( IOException e )
{
    e.printStackTrace();
}
Not that the shortened version is particularly better. Personally, I would use the longer version wrapped in a method in a utility class. Like this:
public String enquoteLines( String input ) {
    String output = "";
    Iterable < String > iterable = ( ) -> new Scanner( input ).useDelimiter( "\n" );  // Lambda syntax to get an `Iterable` of lines from a `String`.
    CSVFormat format =
            CSVFormat
                    .RFC4180
                    .withQuoteMode( QuoteMode.ALL )
                    .withQuote( '\'' );
    StringBuilder stringBuilder = new StringBuilder();
    try (
            CSVPrinter printer = new CSVPrinter( stringBuilder , format ) ;
    )
    {
        printer.printRecord( iterable );
        output = printer.getOut().toString();
    }
    catch ( IOException e )
    {
        e.printStackTrace();
    }
    return output;
}
Calling it:
String input = "Rush\n" +
"IQ\n" +
"Saga\n" +
"Oui";
String output = this.enquoteLines( input );
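To make the escaping rule quoted from RFC 4180 concrete, here is a small C++ sketch of what such a library does for each field. The quote_field helper is hypothetical; passing a single quote as the quote character mirrors the variation used in this question and is not itself part of RFC 4180.

```cpp
#include <string>

// Hypothetical helper: wraps `field` in `quote` and doubles every embedded
// occurrence of `quote`, following the escaping rule quoted from RFC 4180.
std::string quote_field(const std::string& field, char quote = '"')
{
    std::string out(1, quote);
    for (char c : field) {
        if (c == quote) out += quote; // escape by doubling
        out += c;
    }
    out += quote;
    return out;
}
```

With the default quote character, the field b"bb becomes "b""bb", matching the RFC's own example.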

C++ regex replace with a callback function

I have a map that stores id to value mapping, an input string can contain a bunch of ids. I need to replace those ids with their corresponding values. For example:
string = "I am in #1 city, it is now #2 time" // (#1 and #2 are ids here)
id_to_val_map = {1 => "New York", 2 => "summer"}
Desired output:
"I am in New York city, it is now summer time"
Is there a way I can have a callback function (that takes in the matched string and returns the string to be used as replacement) ? std::regex_replace doesn't seem to support that.
The alternative is to find all the matches, then compute their replacement values, and then perform the actual replacement, which won't be that efficient.
You might do:
const std::map<int, std::string> m = {{1, "New York"}, {2, "summer"}};
std::string s = "I am in #1 city, it is now #2 time";
for (const auto& [id, value] : m) {
s = std::regex_replace(s, std::regex("#" + std::to_string(id)), value);
}
std::cout << s << std::endl;
A homegrown way is to use a while loop with regex_search() and
build the output string as you go.
This is essentially what regex_replace() does in a single pass.
There is no need for a separate regex per map item, which carries the overhead of a
reassignment for every item ( s=regex_replace() ) and of covering the same
real estate on every pass.
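Something like this regex (written in free-spacing form for readability; std::regex's default ECMAScript grammar has neither a free-spacing mode nor the (?s) modifier, so it has to be collapsed and adapted before use)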
(?s)
( .*? ) # (1)
(?:
\#
( \d+ ) # (2)
| $
)
with this code
// Assumes `oldstr` (std::string) and `mymap` (std::map<int, std::string>) already exist.
// Needs <regex>, <string> and <map>.
// std::regex has no (?s) modifier, so [\s\S] stands in for "any character".
const std::regex rx( R"(([\s\S]*?)(?:#(\d+)|$))" );
std::string::const_iterator start = oldstr.begin();
const std::string::const_iterator end = oldstr.end();
std::smatch m;
std::string newstr;
while ( std::regex_search( start, end, m, rx ) )
{
    newstr.append( m[1].str() );
    if ( m[2].matched )
    {
        // append the map key's value here, do error checking etc.
        const int ndx = std::stoi( m[2].str() );
        newstr.append( mymap[ ndx ] );
    }
    start = m[0].second;
    if ( start == end )  // input consumed; stop before matching the empty tail forever
        break;
}
// assign the new string to the old string if need be
oldstr = newstr;
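The same single-pass idea can also be packaged as a reusable function that takes a callback, which is what the question asks for. This is only a sketch: regex_replace_cb and expand_ids are made-up names, not standard library functions, and std::stoi would throw on an id too large for an int.

```cpp
#include <functional>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper (not part of the standard library): like
// std::regex_replace, but each replacement is computed by a callback.
std::string regex_replace_cb(const std::string& input, const std::regex& rx,
                             const std::function<std::string(const std::smatch&)>& callback)
{
    std::string output;
    auto last = input.cbegin(); // end of the previous match
    for (std::sregex_iterator it(input.cbegin(), input.cend(), rx), end; it != end; ++it) {
        const std::smatch& m = *it;
        output.append(last, m[0].first); // text between matches
        output.append(callback(m));      // computed replacement
        last = m[0].second;
    }
    output.append(last, input.cend());   // text after the last match
    return output;
}

// Example use: replaces "#<id>" with the mapped value and leaves unknown ids unchanged.
std::string expand_ids(const std::string& input, const std::map<int, std::string>& values)
{
    static const std::regex id_pattern(R"(#(\d+))");
    return regex_replace_cb(input, id_pattern,
        [&values](const std::smatch& m) -> std::string {
            const auto found = values.find(std::stoi(m[1].str()));
            return found != values.end() ? found->second : m[0].str();
        });
}
```

The input is scanned once, and the map is consulted only for ids that actually occur in the string.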

Find regex matches & remove outer part of the match

I have a string
content = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
and I want to find all instances of func(...) and remove the function call, so that I would get
content = "std::cout << some_val << std::endl; auto i = some_other_val;"
So I've tried this:
import re
content = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
c = re.compile(r'func\([a-zA-Z0-9_]+\)')
print(c.sub('', content)) # gives "std::cout << << std::endl; auto i = ;"
but this removes the entire match, not just the func( and ).
Basically, how do I keep whatever matched with [a-zA-Z0-9_]+?
You can use re.sub to replace every func(...) with just the value inside it, like below. Here I've used [\w]+; change it if you need a different character set.
import re
regex = r"func\(([\w]+)\)"
test_str = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
subst = "\\1"
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
    print (result)
Demo: https://rextester.com/QZJLF65281
Output:
std::cout << some_val << std::endl; auto i = some_other_val;
You should capture the part of the match that you want to keep into a group:
re.compile(r'func\(([a-zA-Z0-9_]+)\)')
Here I captured it into group 1.
And then you can refer to group 1 with \1:
print(c.sub(r'\1', content))
Note that in general, you should not use regex to parse source code of a non-regular language (such as C in this case). It might work in a few very specific cases, where the input is very limited, but you should still use a C parser to parse C code. I have found libraries such as this and this.
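Since the content being edited is C++ source, the same capture-group technique is sketched here with std::regex as well. In the ECMAScript format syntax that std::regex_replace uses by default, group 1 is written $1 rather than \1. The strip_func_calls name is made up for this example.

```cpp
#include <regex>
#include <string>

// Replaces every func(identifier) call with just the identifier.
std::string strip_func_calls(const std::string& content)
{
    // Group 1 captures the argument; "$1" in the format string refers back to it.
    static const std::regex call(R"(func\(([a-zA-Z0-9_]+)\))");
    return std::regex_replace(content, call, "$1");
}
```

The caveat about parsing source code with regular expressions applies equally here.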

std::regex and ignoring flags

After learning the basic C++ rules, I focused on std::regex and created two console apps: 1. renrem and 2. bfind.
I also decided to create some convenience functions that make regex in C++ as easy as possible using only std, named RFC ( = regex function collection ).
There are several strange things that keep surprising me, but this one ruined all my attempts and those two console apps.
One of the important functions is count_match, which counts the number of matches inside a string. Here is the full code:
unsigned int count_match( const std::string& user_string, const std::string& user_pattern, const std::string& flags = "o" ){
    const bool flags_has_i = flags.find( "i" ) < flags.size();
    const bool flags_has_g = flags.find( "g" ) < flags.size();
    std::regex::flag_type regex_flag = flags_has_i ? std::regex_constants::icase : std::regex_constants::ECMAScript;
    // std::regex_constants::match_flag_type search_flag = flags_has_g ? std::regex_constants::match_default : std::regex_constants::format_first_only;
    std::regex rx( user_pattern, regex_flag );
    std::match_results< std::string::const_iterator > mr;
    unsigned int counter = 0;
    std::string temp = user_string;
    while( std::regex_search( temp, mr, rx ) ){
        temp = mr.suffix().str();
        ++counter;
    }
    if( flags_has_g ){
        return counter;
    } else {
        if( counter >= 1 ) return 1;
        else return 0;
    }
}
First of all, as you can see, the line for search_flag is commented out because the flag is ignored by std::regex_search, and I do not know why, since the exact same flag is accepted by std::regex_replace. So std::regex_search ignores format_first_only but std::regex_replace accepts it. Let that go.
The main problem is that the icase flag is also ignored when the pattern is a character class -> []. In fact, it happens when the pattern contains only capital letters or only small letters: [A-Z] or [a-z].
Suppose this string: s = "ONE TWO THREE four five six seven"
The output with the C++ standard library is:
std::cout << count_match( s, "[A-Z]+" ) << '\n'; // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n'; // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n'; // 3 => Global match plus insensitive
whereas for the same thing in Perl, the D language, and C++ with Boost the output is:
std::cout << count_match( s, "[A-Z]+" ) << '\n'; // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n'; // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n'; // 7 => Global match plus insensitive
I know about the regex flavors, PCRE and the ECMAScript 262 one that C++ uses, but I have no idea why a simple flag is ignored by the only search function that C++ has, since std::regex_iterator and std::regex_token_iterator also use this function internally.
In short, I cannot use those two apps and RFC with the std library because of this!
So if someone knows which rule this follows (maybe it is a valid rule in ECMAScript 262), or if I am wrong anywhere, please tell me. Thanks.
tested with
gcc version 6.3.0 20170519 (Ubuntu/Linaro 6.3.0-18ubuntu2~16.04)
clang version 3.8.0-2ubuntu4
perl code:
perl -le '++$c while $ARGV[0] =~ m/[A-Z]+/g; print $c ;' "ONE TWO THREE four five six seven" // 3
perl -le '++$c while $ARGV[0] =~ m/[A-Z]+/gi; print $c ;' "ONE TWO THREE four five six seven" // 7
d code:
uint count_match( ref const (char[]) user_string, const (char[]) user_pattern, const (char[]) flags ){
    const bool flag_has_g = flags.indexOf( "g" ) != -1;
    Regex!( char ) rx = regex( user_pattern, flags );
    uint counter = 0;
    foreach( mr; matchAll( user_string, rx ) ){
        ++counter;
    }
    if( flag_has_g ){
        return counter;
    } else {
        if( counter >= 1 ) return 1;
        else return 0;
    }
}
the output:
writeln( count_match( s, "[A-Z]+", "g" ) ); // 3
writeln( count_match( s, "[A-Z]+", "gi" ) ); // 7
js code:
var s = "ONE TWO THREE four five six seven";
var rx1 = new RegExp( "[A-Z]+" , "g" );
var rx2 = new RegExp( "[A-Z]+" , "gi" );
var counter = 0;
while( rx1.exec( s ) ){
    ++counter;
}
document.write( counter + "<br>" ); // 3
counter = 0;
while( rx2.exec( s ) ){
    ++counter;
}
document.write( counter ); // 7
Okay. After testing with gcc 7.1.0, it turned out that with versions up to 6.3.0 the output is 1 3 3, but with 7.1.0 the output is 1 3 7.
Here is the link.
With this version of clang the output is also correct. Here is the link. Thanks to user igor-tandetnik.
At first I thought this might be a rule of ECMAScript, but after testing the js code and seeing Igor Tandetnik's comment, I tested the code with gcc 7.1.0 and it outputs the correct result.
To test the regex library, I use:
std::cout << ( ( rx.flags() & std::regex_constants::icase ) == std::regex_constants::icase ? "yes" : "no" ) << '\n';
So when icase is set it prints yes, otherwise it prints no. So I think there is no library fault.
Here is the test with gcc 7.1.0
Therefore all versions below gcc 7.1.0 have incorrect output.
For clang I have no idea, since I have clang 3.8.0 and it gives incorrect output, but with the online version even 3.7.1 gives correct output.
screenshot with clang 3.8.0 for this code:
std::cout << count_match( s, "[A-Z]+" ) << '\n'; // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n'; // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n'; // 7 => Global match plus insensitive
So with the online compiler the output is incorrect for clang 3.2 and below, but higher versions output the correct result.
Please correct me if I am wrong.
First of all, as you can see, the line for search_flag was commented because it is ignored by std::regex_search and I do not know why, since the exact flag is accepted for std::regex_replace.
The flag in question is format_first_only. This flag makes sense only for a "replace" operation. In regex_replace, the default is "replace all" but if you pass this flag it becomes "replace first only."
In regex_match and regex_search, there is no replacement going on at all; both of those functions just find the first match (and in the case of regex_match, that match must consume the entire string). Since the flag is meaningless in that case, I would expect the implementation to ignore it; but I wouldn't fault the implementation for throwing an exception, either, if it chose to be noisy about it.
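A short sketch of that difference, using std::regex_replace with and without format_first_only. The replace_words helper and the replacement text "x" are only illustrative.

```cpp
#include <regex>
#include <string>

// Replaces every match by default, or only the first one when `first_only` is set.
std::string replace_words(const std::string& s, bool first_only)
{
    static const std::regex word("[A-Z]+");
    const auto flags = first_only ? std::regex_constants::format_first_only
                                  : std::regex_constants::format_default;
    return std::regex_replace(s, word, "x", flags);
}
```

For "ONE TWO THREE", the default gives "x x x" and format_first_only gives "x TWO THREE".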
The main problem is here that the icase flag is also ignored when the pattern is character class -> []. In fact when the pattern is only capital letter or small letter: [A-Z] or [a-z]
icase working wrong for character classes is definitely a bug in your vendor's library.
Looks like libstdc++'s bug was fixed between GCC 6.3 (Dec 2016) and GCC 7.1 (May 2017).
Looks like libc++'s bug was fixed between Clang 3.2 (Dec 2012) and Clang 3.3 (Jun 2013).
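For counting matches, one possible alternative to the suffix-copying loop in the question is std::sregex_iterator, sketched below. The count_matches name and the bool parameter are assumptions made for this example. On a library that still has the icase bug described above, the case-insensitive count would presumably come out wrong here too.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Counts non-overlapping matches of `pattern` in `text`.
std::size_t count_matches(const std::string& text, const std::string& pattern,
                          bool ignore_case = false)
{
    std::regex::flag_type flags = std::regex::ECMAScript;
    if (ignore_case) flags |= std::regex::icase;
    const std::regex rx(pattern, flags);
    // Iterating avoids copying the suffix into a new string after every match.
    return static_cast<std::size_t>(
        std::distance(std::sregex_iterator(text.begin(), text.end(), rx),
                      std::sregex_iterator()));
}
```

With the string from the question, a library containing the fix should report 3 matches for [A-Z]+ and 7 with ignore_case set.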

A regex for extracting " ; " or "=" symbols from source code?

For example
int val = 13;
Serial.begin(9600);
val = DigitalWrite(900,HIGH);
I really want to extract special symbols like = and ;.
I've been able to extract symbols that appear adjacent in the code, but I need all occurrences.
I tried [^ "//"A-Za-z\t\n0-9]* and [\;\=\{\}\,]+. Neither worked.
What's wrong?
I had made rules for my scanner like the ones below (they have since been changed):
semicolon [;]([\n]|[^ "//"])
assignment (.)?[=]+
brace ([{]|[}])([\n]|[^ "//"])
roundbarcket ("()")" "
The problem occurs in situations like these:
int val= 13; // "=" is not recognized because "val" and "=" are adjacent. I want to recognize them whether they are adjacent or not.
serial.read(); // "()" and ";" are not recognized individually. If I add the semicolon rule and the roundbarcket rule, "();" is recognized.
How can I solve them?
You want to break "DigitalWrite(900,HIGH);" into "DigitalWrite" "(" "900" "," "HIGH" ")" ";". I think looping each substring is the fastest way.
string text = "val = DigitalWrite(900,HIGH);";
string[] symbols = new string[] { "(", ")", ",", "=", ";" };
List<string> tokens = new List<string>();
string word = "";
for( int i = 0; i < text.Length; i++ )
{
    string letter = text.Substring( i, 1 );
    if( !letter.Equals( " " ) )
    {
        // Look the character up in the symbol table (not in the tokens collected so far).
        if( Array.IndexOf( symbols, letter ) >= 0 )
        {
            // A symbol ends the word collected so far.
            if( word.Length > 0 )
            {
                tokens.Add( word );
                word = "";
            }
            tokens.Add( letter );
        }
        else
        {
            word += letter;
            if( i == text.Length - 1 )
                tokens.Add( word );
        }
    }
}
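A regex-based variant of the same splitting idea, sketched in C++ with std::sregex_iterator. The token pattern is an assumption chosen for the sample lines in the question (identifiers, integers, and the symbols = ; ( ) { } ,), and tokenize is a made-up name. Whitespace and any other characters are skipped.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits a line of source text into identifiers, numbers and single symbols.
std::vector<std::string> tokenize(const std::string& text)
{
    // One alternative per token kind: identifier | integer | single symbol.
    static const std::regex token(R"([A-Za-z_][A-Za-z0-9_]*|[0-9]+|[;=(){},])");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(text.begin(), text.end(), token), end; it != end; ++it)
        tokens.push_back(it->str());
    return tokens;
}
```

Because each symbol is matched on its own, "val= 13;" and "val = 13;" produce the same tokens, and "();" comes out as three separate tokens.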
So finding ";" and "=" is the ultimate goal?
In that case, why don't you just use something like a .find() function?
Or you can split the string by ";" first and then search for "=".
If you want to grab the text between "=" and ";", try =([^;]*); or =(.*?);