regex.h matching differences between OSX and Linux - regex

I need to match the following line with multiple capturing groups:
0.625846 29Si 29 [4934.39 0] [0.84 100000000000000.0]
I use the regex:
^(0+\.[0-9]?e?[+-]?[0-9]+)\s+([0-9]+\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)\s+([0-9][0-9]?[0-9]?)\s+(\[.*\])\s+(\[.*\])$
I have a regex101 workspace for it. However, I find that matching with regex.h behaves differently on OSX and Linux, specifically:
Fails on:
OSX: 10.14.6
LLVM: 10.0.1 (clang-1001.0.46.4)
Works on:
linux: Ubuntu 18.04
g++: 7.5.0
I worked up a brief program that reproduces the problem, compiled with g++ regex.cpp -o regex:
#include <cstdio>
#include <string>
//regex
#include <regex.h>
using namespace std;
int main(int argc, char** argv) {
    //define a buffer for keeping results of regex matching
    char buffer[100];
    //regex object to use
    regex_t regex;
    //*****regex match and input file line*******
    string iline = "0.625846 29Si 29 [4934.39 0] [0.84 100000000000000.0]";
    string matchfile="^(0+\\.[0-9]?e?[+-]?[0-9]+)\\s+([0-9]+\\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)\\s+([0-9][0-9]?[0-9]?)\\s+(\\[.*\\])\\s+(\\[.*\\])$";
    //compile the regex
    int reti = regcomp(&regex,matchfile.c_str(),REG_EXTENDED);
    regerror(reti, &regex, buffer, 100);
    if(reti==0)
        printf("regex compile success!\n");
    else
        printf("regcomp() failed with '%s'\n", buffer);
    //match the input line (whole match plus five capturing groups)
    regmatch_t input_matchptr[6];
    reti = regexec(&regex,iline.c_str(),6,input_matchptr,0);
    regerror(reti, &regex, buffer, 100);
    if(reti==0)
        printf("regex match success!\n");
    else
        printf("regexec() failed with '%s'\n", buffer);
    //******************************************
    return 0;
}
I have also modified my regex to comply with POSIX (I think?) by removing the +? and *? operators I used previously, as per this post, which I understand LLVM requires, but I may have missed something else that is incompatible with POSIX. The regex now compiles correctly, which makes me think it is valid, but I still don't understand why no match is obtained.
How can I modify my regex to correctly match?

To answer the immediate question, you need to use
string matchfile="^(0+\\.[0-9]?e?[+-]?[0-9]+)[[:space:]]+([0-9]+\\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)[[:space:]]+([0-9][0-9]?[0-9]?)[[:space:]]+(\\[.*\\])[[:space:]]+(\\[.*\\])$";
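That is, instead of the Perl-like \s, you can use the [:space:] POSIX character class inside a bracket expression. \s is not part of POSIX extended regular expressions; glibc accepts it as a GNU extension, which is why the original pattern matches on Linux but not on OSX.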
You mention that you tried [:space:] outside of a bracket expression, and it did not work - that is expected. As per Character Classes,
[:digit:] is a POSIX character class, used inside a bracket expression like [x-z[:digit:]].
This means that POSIX character classes are only parsed as such when used inside bracket expressions.
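As a minimal sketch of the point above, the helper below wraps regcomp/regexec so the bracketed and unbracketed forms can be compared. The name posix_matches is made up for illustration and is not from the original post; REG_NOSUB is used only because a yes/no answer is enough here.

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper (not from the original post): returns true if `text`
// matches the POSIX extended regex `pattern`, and false if there is no match
// or the pattern fails to compile.
bool posix_matches(const std::string& pattern, const std::string& text) {
    regex_t re;
    // REG_NOSUB: we only need a yes/no answer, not the capture offsets.
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
        return false;  // pattern did not compile
    int rc = regexec(&re, text.c_str(), 0, nullptr, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, "^a[[:space:]]+b$" should match "a \t b", while "^a[:space:]+b$" should not match "a  b": in the second pattern the outer brackets are the bracket expression itself, so it lists the literal characters ':', 's', 'p', 'a', 'c' and 'e' rather than naming a class.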

Related

regex_error being thrown when trying to do simple things like [:digit:] or \d

Every time I put [:digit:] in a regex like so: regex r("[:digit:]"), it throws an exception and .what() just returns regex_error instead of a descriptive, meaningful explanation of the error. The same thing happens when I try regex r("\\d"). And when I try regex r("\d"), my compiler says that \d is an unfamiliar character escape sequence. I'm in Code::Blocks, by the way. Here's my code:
#include <regex>
#include <iostream>
using namespace std;
int main()
{
    regex r("\d"); //and or r("[:digit:]")
    string i = "5";
    if(regex_match(i,r))
    {
        cout << "Integer";
    }
    return 0;
}
After getting a newer version of Code::Blocks and the MinGW GCC compiler suite it worked.
P.S. I kept having an error when trying to set the compiler after downloading Code::Blocks. I had to go to Global compiler settings and click Reset defaults for it to auto-detect my compiler.
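For reference, here is a small sketch of the two spellings that a complete <regex> implementation should accept (for libstdc++, that means GCC 4.9 or later). The function name is_single_digit is hypothetical and only there for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` is exactly one decimal digit.
// Assumes a standard library with a complete <regex> implementation.
bool is_single_digit(const std::string& s) {
    // The class name goes inside a bracket expression: [[:digit:]], not [:digit:].
    static const std::regex posix_class("[[:digit:]]");
    // "\\d" in C++ source is the two-character regex \d; a bare "\d" is an
    // unknown escape sequence for the compiler.
    static const std::regex shorthand("\\d");
    return std::regex_match(s, posix_class) && std::regex_match(s, shorthand);
}
```

Both patterns are checked here only to show that they are interchangeable for this input; in real code one of them is enough.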

The expression contained mismatched brackets

My program gives me the error message: libc++abi.dylib: terminating with uncaught exception of type std::__1::regex_error: The expression contained mismatched ( and ), and I can't find how to fix it. Which brackets does it mean?
#include <regex>
#include <iostream>
#include <string>
using namespace std;
using namespace regex_constants;
int main() {
    // here can be url, or hashtag or split string:
    // www.example.com or #followback or currentratesoughttogodown and output must be example or followback or currentratesoughttogodown.
    string input = "www.example.com";
    smatch m;
    regex_match ( input, m, regex ("(?<=www.|\\#)(\\w+)(?=\\.)?") );
    return 0;
}
I compile with gcc:
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix
C++11 uses ECMAScript's (ECMA-262) regex syntax by default, so it does not have look-behind (the other regex flavors that C++11 supports don't have look-behind either).
Lookahead, i.e. (?=...), is supported though.
You can try (?:www\\.|\\#)(\\w+)(?=\\.)
and grab group 1. Alternatively, look into Boost.Regex or Boost.Xpressive, which support lookbehind.
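A minimal sketch of that suggestion follows, assuming a complete <regex> implementation. The helper names label_before_dot and compiles are invented for this example, and the pattern uses a plain # instead of \# for simplicity.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the word that follows "www." or "#", provided
// it is followed by a dot; returns an empty string otherwise.
std::string label_before_dot(const std::string& input) {
    // Non-capturing group replaces the unsupported look-behind; group 1 is the label.
    static const std::regex re("(?:www\\.|#)(\\w+)(?=\\.)");
    std::smatch m;
    if (std::regex_search(input, m, re))
        return m[1].str();
    return "";
}

// Hypothetical helper: true if std::regex accepts `pattern` with the default
// ECMAScript grammar, false if construction throws std::regex_error.
bool compiles(const std::string& pattern) {
    try {
        std::regex re(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

Note that regex_search is used rather than regex_match, since the prefix and the trailing ".com" are not part of the pattern. Because the lookahead requires a dot, an input such as "#followback" yields an empty string; dropping the (?=\\.) part would let it match as well.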

What's wrong with this regex? MSVC <regex> accepts but gcc <regex> and <regex.h> reject

Here's the code that runs fine with MSVC
std::regex r;
//r = "setdata\\(\\\"([^\\\"]*)\\\",[^\\\"]*\\\"([^\\\"]*)";
r = "setdata\\(";
and it is also parsed fine by online validators (of course when removing \ escapes).
But with g++ 4.6 it throws regex_error. I know that its regex support is not finished so I tried <regex.h>:
char buf[1024];
regex_t regex;
//int reti = regcomp(&regex, "setdata\\(\\\"([^\\\"]*)\\\",[^\\\"]*\\\"([^\\\"]*)", 0);
int reti = regcomp(&regex, "setdata\\(", 0);
regerror(reti, &regex, buf, 1024);
it reports "Unmatched ( or \("
UPDATE: here's what I found:
Unfortunately the native platforms provide Posix regular expressions support that contains bugs and/or violates the specification. This is especially true for the GNU C library (GLIBC) used by Linux distributions (c) http://www.haskell.org/haskellwiki/Regex_Posix#regex-posix_Bugs
Can this be one of those POSIX regex library bugs?
UPDATE: Thanks to the comment, I simplified the pattern. It seems gcc looks for a match for the escaped \(, which doesn't seem to be correct.
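One likely explanation for the <regex.h> result, offered as a sketch rather than a definitive diagnosis: passing 0 as the flags to regcomp selects POSIX basic regular expressions, where \( opens a group and a bare ( is a literal. With REG_EXTENDED the roles are swapped. The helper compile_status below is hypothetical and only reports what regcomp returns.

```cpp
#include <regex.h>

// Hypothetical helper: compiles `pattern` with the given cflags and returns
// regcomp's status code (0 means the pattern compiled).
int compile_status(const char* pattern, int cflags) {
    regex_t re;
    int rc = regcomp(&re, pattern, cflags | REG_NOSUB);
    if (rc == 0)
        regfree(&re);  // only free a successfully compiled pattern
    return rc;
}
```

Under these semantics, compile_status("setdata\\(", 0) should be non-zero (an unclosed group in basic syntax), while compile_status("setdata\\(", REG_EXTENDED) and compile_status("setdata(", 0) should both be 0. This says nothing about the std::regex failure in g++ 4.6, which is a separate issue.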

Regex library not working correctly in c++

I have been looking for resources on working with regex in C++, as I want to learn regular expressions in C++ (please share a step-by-step link if you have one). I am using g++ to compile my programs and working in Ubuntu.
Earlier my programs were not compiling, but then I read this post, which said to compile the program with
"g++ -std=c++0x sample.cpp"
to use the regex header.
My first program works correctly; I tried implementing regex_match:
#include<stdio.h>
#include<iostream>
#include<regex>
using namespace std;
int main()
{
    string str = "Hello world";
    regex rx ("ello");
    if(regex_match(str.begin(), str.end(), rx))
    {
        cout<<"True"<<endl;
    }
    else
        cout<<"False"<<endl;
    return(0);
}
for which my program returned false (as the expression does not match the whole string).
I also rechecked it by making it match, and it works.
Now I am writing another program to implement regex_replace and regex_search, neither of which works (for regex_search, just replace regex_match in the above program with regex_search). Kindly help. I don't know where I am going wrong.
The <regex> header is not fully supported by GCC (libstdc++) before version 4.9.
The libstdc++ implementation status page lists what each GCC release supports.
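On a toolchain where <regex> is complete (for libstdc++, GCC 4.9 or later), the three functions from the question should behave as in the sketch below. The wrapper names are hypothetical and exist only to make the difference between the calls explicit.

```cpp
#include <regex>
#include <string>

// regex_match: the whole string must match the pattern.
bool matches_whole(const std::string& s, const std::string& pattern) {
    return std::regex_match(s, std::regex(pattern));
}

// regex_search: it is enough for some substring to match the pattern.
bool matches_somewhere(const std::string& s, const std::string& pattern) {
    return std::regex_search(s, std::regex(pattern));
}

// regex_replace: every match of the pattern is replaced by `replacement`.
std::string replace_matches(const std::string& s, const std::string& pattern,
                            const std::string& replacement) {
    return std::regex_replace(s, std::regex(pattern), replacement);
}
```

For "Hello world" and the pattern "ello", matches_whole is false and matches_somewhere is true, which mirrors the behavior described in the question for regex_match.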

How to strip C++ style single line comments (`// ...`)

For a small DSL I'm writing, I'm looking for a regex to match a comment string at the end of the line, like the // syntax of C++.
The simple case:
someVariable = 12345; // assignment
Is trivial to match but the problem starts when I have a string in the same line:
someFunctionCall("Hello // world"); // call with a string
The // in the string should not match as a comment
EDIT - The thing that compiles the DSL is not mine. It's a black box as far as I'm concerned, which I don't want to change, and it doesn't support comments. I just want to add a thin wrapper to make it support comments.
EDIT
Since you are effectively preprocessing a sourcefile, why don't you use an existing preprocessor? If the language is sufficiently similar to C/C++ (especially regarding quoting and string literals), you will be able to just use cpp -P:
echo 'int main() { char* sz="Hello//world"; /*profit*/ } // comment' | cpp -P
Output: int main() { char* sz="Hello//world"; }
Other ideas:
Use a proper lexer/parser instead
Have a look at
CoCo/R (available for Java, C++, C#, etc.)
ANTLR (idem)
Boost Spirit (with Spirit Lex to make it even easier to strip the comments)
All suites come with sample grammars that parse C, C++ or a subset thereof
shoosh wrote:
EDIT - The thing that compiles the DSL is not mine. It's a black box as far as I'm concerned, which I don't want to change, and it doesn't support comments. I just want to add a thin wrapper to make it support comments.
In that case, create a very simple lexer that matches one of three tokens:
1. // ... comments
2. string literals: " ... "
3. or, if none of the above matches, any single character
Now, while you iterate over these 3 different types of tokens, simply print tokens (2) and (3) to stdout (or to a file) to get the uncommented version of your source file.
A demo with GNU Flex:
example input file, in.txt:
someVariable = 12345; // assignment
// only a comment
someFunctionCall("Hello // world"); // call with a string
someOtherFunctionCall("Hello // \" world"); // call with a string and
// an escaped quote
The lexer grammar file, demo.l:
%%
"//"[^\r\n]* { /* skip comments */ }
"\""([^"]|[\\].)*"\"" {printf("%s", yytext);}
. {printf("%s", yytext);}
%%
int main(int argc, char **argv)
{
    while(yylex() != 0);
    return 0;
}
And to run the demo, do:
flex demo.l
cc lex.yy.c -lfl
./a.out < in.txt
which will print the following to the console:
someVariable = 12345;
someFunctionCall("Hello // world");
someOtherFunctionCall("Hello // \" world");
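If Flex is not available, the same three-token idea can be hand-written. The following is a rough C++ sketch under the same assumptions as the Flex demo (double-quoted strings with backslash escapes, no character literals or block comments); the function name strip_line_comments is made up for this example.

```cpp
#include <string>

// Hypothetical helper: removes // comments from `src` while leaving the
// contents of double-quoted string literals untouched.
std::string strip_line_comments(const std::string& src) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == '"') {
            // Token 2: string literal. Copy up to the closing quote,
            // treating a backslash plus the next character as one unit.
            out += src[i++];
            while (i < src.size() && src[i] != '"') {
                if (src[i] == '\\' && i + 1 < src.size())
                    out += src[i++];
                out += src[i++];
            }
            if (i < src.size())
                out += src[i++];  // closing quote
        } else if (src[i] == '/' && i + 1 < src.size() && src[i + 1] == '/') {
            // Token 1: comment. Skip to the end of the line, keeping the newline.
            while (i < src.size() && src[i] != '\n' && src[i] != '\r')
                ++i;
        } else {
            // Token 3: any other single character is copied as-is.
            out += src[i++];
        }
    }
    return out;
}
```

Like the Flex version, this leaves any whitespace before a removed comment in place, and a line that holds only a comment becomes an empty line.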
EDIT
I'm not really familiar with C/C++, and just saw @sehe's recommendation of using a pre-processor. That looks to be a far better option than creating your own (small) lexer. But I think I'll leave this answer since it shows how to handle this kind of stuff if no pre-processor is available (for whatever reason: perhaps cpp doesn't recognise certain parts of the DSL?).