Regex for matching C++ string constant - c++

I'm currently working on a C++ preprocessor and I need to match string constants with more than 0 letters, like "hey I'm a string".
I'm currently working with this one here \"([^\\\"]+|\\.)+\" but it fails on one of my test cases.
Test cases:
std::cout << "hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
Expected output:
std::cout << String("hello") << String(" world");
std::cout << String("He said: \"bananas\"") << String("...");
std::cout << "";
std::cout << String("\x12\23\x34");
On the second one I instead get
std::cout << String("He said: \")bananas\"String(" << ")...";
Short repro code (using the regex by AR.3):
std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex r("\"([^\"]+|\\.|(?<=\\\\)\")+\"");
in_line = std::regex_replace(in_line, r, "String($&)");

Lexing a source file is a good job for regexes. But for such a task, let's use a better regex engine than std::regex. Let's use PCRE (or boost::regex) at first. At the end of this post, I'll show what you can do with a less feature-packed engine.
We only need to do partial lexing, ignoring all unrecognized tokens that won't affect string literals. What we need to handle is:
Singleline comments
Multiline comments
Character literals
String literals
We'll be using the extended (x) option, which ignores whitespace in the pattern.
Comments
Here's what [lex.comment] says:
The characters /* start a comment, which terminates with the characters */. These comments do not nest.
The characters // start a comment, which terminates immediately before the next new-line character. If
there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear
between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment
characters //, /*, and */ have no special meaning within a // comment and are treated just like other
characters. Similarly, the comment characters // and /* have no special meaning within a /* comment.
— end note ]
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
Easy peasy. If you match anything there, just (*SKIP)(*FAIL) - meaning that you throw away the match. The (?s: .*? ) applies the s (singleline) modifier to the . metacharacter, meaning it's allowed to match newlines.
Character literals
Here's the grammar from [lex.ccon]:
character-literal:
encoding-prefix(opt) ’ c-char-sequence ’
encoding-prefix:
one of u8 u U L
c-char-sequence:
c-char
c-char-sequence c-char
c-char:
any member of the source character set except the single-quote ’, backslash \, or new-line character
escape-sequence
universal-character-name
escape-sequence:
simple-escape-sequence
octal-escape-sequence
hexadecimal-escape-sequence
simple-escape-sequence: one of \’ \" \? \\ \a \b \f \n \r \t \v
octal-escape-sequence:
\ octal-digit
\ octal-digit octal-digit
\ octal-digit octal-digit octal-digit
hexadecimal-escape-sequence:
\x hexadecimal-digit
hexadecimal-escape-sequence hexadecimal-digit
Let's define a few things first, which we'll need later on:
(?(DEFINE)
(?<prefix> (?:u8?|U|L)? )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
prefix is defined as an optional u8, u, U or L
escape is defined as per the standard, except that I've merged universal-character-name into it for the sake of simplicity
Once we have these, a character literal is pretty simple:
(?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
We throw it away with (*SKIP)(*FAIL)
Simple strings
They're defined in almost the same way as character literals. Here's a part of [lex.string]:
string-literal:
encoding-prefix(opt) " s-char-sequence(opt) "
encoding-prefix(opt) R raw-string
s-char-sequence:
s-char
s-char-sequence s-char
s-char:
any member of the source character set except the double-quote ", backslash \, or new-line character
escape-sequence
universal-character-name
This will mirror the character literals:
(?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
The differences are:
The character sequence is optional this time (* instead of +)
The double quote is disallowed when unescaped instead of the single quote
We actually don't throw it away :)
Raw strings
Here's the raw string part:
raw-string:
" d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
r-char-sequence:
r-char
r-char-sequence r-char
r-char:
any member of the source character set, except a right parenthesis )
followed by the initial d-char-sequence (which may be empty) followed by a double quote ".
d-char-sequence:
d-char
d-char-sequence d-char
d-char:
any member of the basic source character set except:
space, the left parenthesis (, the right parenthesis ), the backslash \,
and the control characters representing horizontal tab,
vertical tab, form feed, and newline.
The regex for this is:
(?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
[^ ()\\\t\x0B\r\n]* is the set of characters that are allowed in delimiters (d-char)
\k<delimiter> refers to the previously matched delimiter
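For illustration, the delimiter backreference can also be tried out with std::regex, since its ECMAScript grammar supports numbered backreferences. This is only a minimal sketch under a few assumptions: the helper name find_raw_string is hypothetical, the encoding prefix is left out, and \1 stands in for the named \k<delimiter>, which std::regex does not offer.

```cpp
#include <regex>
#include <string>

// Returns the first raw string literal found in src (without its encoding
// prefix), or an empty string if there is none.
// Hypothetical helper, for illustration only.
std::string find_raw_string(const std::string& src) {
    // Group 1 captures the delimiter; \1 requires the same text again
    // before the closing quote. [\s\S] is used because '.' does not
    // match line breaks.
    static const std::regex re(
        R"rx(R"([^ ()\\\t\v\r\n]*)\([\s\S]*?\)\1")rx");
    std::smatch m;
    return std::regex_search(src, m, re) ? m.str(0) : std::string();
}
```

The lazy [\s\S]*? keeps extending until a ) is followed by the captured delimiter and a quote, so a )" inside the literal does not end it early.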
The full pattern
The full pattern is:
(?(DEFINE)
(?<prefix> (?:u8?|U|L)? )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
# character literal
| (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
# standard string
| (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
# raw string
| (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
See the demo here.
boost::regex
Here's a simple demo program using boost::regex:
#include <string>
#include <iostream>
#include <boost/regex.hpp>
static void test()
{
boost::regex re(R"regex(
(?(DEFINE)
(?<prefix> (?:u8?|U|L) )
(?<escape> \\ (?:
['"?\\abfnrtv] # simple escape
| [0-7]{1,3} # octal escape
| x [0-9a-fA-F]{1,2} # hex escape
| u [0-9a-fA-F]{4} # universal character name
| U [0-9a-fA-F]{8} # universal character name
))
)
# singleline comment
// .* (*SKIP)(*FAIL)
# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
# character literal
| (?&prefix)? ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
# standard string
| (?&prefix)? " (?> (?&escape) | [^"\\\r\n]+ )* "
# raw string
| (?&prefix)? R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
)regex", boost::regex::perl | boost::regex::no_mod_s | boost::regex::mod_x | boost::regex::optimize);
std::string subject(R"subject(
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";
"" // empty string
'"' // character literal
// this is "a string literal" in a comment
/* this is
"also inside"
//a comment */
// and this /*
"is not in a comment"
// */
"this is a /* string */ with nested // comments"
)subject");
std::cout << boost::regex_replace(subject, re, "String\\($&\\)", boost::format_all) << std::endl;
}
int main(int argc, char **argv)
{
try
{
test();
}
catch(const std::exception& ex)
{
std::cerr << ex.what() << std::endl;
}
return 0;
}
(I left syntax highlighting disabled because it goes nuts on this code)
For some reason, I had to take the ? quantifier out of prefix (change (?<prefix> (?:u8?|U|L)? ) to (?<prefix> (?:u8?|U|L) ) and (?&prefix) to (?&prefix)?) to make the pattern work. I believe it's a bug in boost::regex, as both PCRE and Perl work just fine with the original pattern.
What if we don't have a fancy regex engine at hand?
Note that while this pattern technically uses recursion, it never nests recursive calls. Recursion could be avoided by inlining the relevant reusable parts into the main pattern.
A couple of other constructs can be avoided at the price of reduced performance. We can safely replace the atomic groups (?>...) with normal groups (?:...) as long as we don't nest quantifiers, which avoids catastrophic backtracking.
We can also avoid (*SKIP)(*FAIL) if we add one line of logic into the replacement function: All the alternatives to skip are grouped in a capturing group. If the capturing group matched, just ignore the match. If not, then it's a string literal.
All of this means we can implement this in JavaScript, which has one of the simplest regex engines you can find, at the price of breaking the DRY rule and making the pattern illegible. The regex becomes this monstrosity once converted:
(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"
And here's an interactive demo you can play with:
function run() {
var re = /(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"/g;
var input = document.getElementById("input").value;
var output = input.replace(re, function(m, ignore) {
return ignore ? m : "String(" + m + ")";
});
document.getElementById("output").innerText = output;
}
document.getElementById("input").addEventListener("input", run);
run();
<h2>Input:</h2>
<textarea id="input" style="width: 100%; height: 50px;">
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";
"" // empty string
'"' // character literal
// this is "a string literal" in a comment
/* this is
"also inside"
//a comment */
// and this /*
"is not in a comment"
// */
"this is a /* string */ with nested // comments"
</textarea>
<h2>Output:</h2>
<pre id="output"></pre>
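The capture-group trick carries over to std::regex too. std::regex_replace only accepts a format string, not a callback, so the sketch below walks the matches with std::sregex_iterator instead. It is a simplified illustration rather than the full pattern: the function name wrap_string_literals is hypothetical, escapes are reduced to "a backslash followed by any character", and encoding prefixes and raw strings are left out.

```cpp
#include <regex>
#include <string>

// Wraps ordinary string literals in String(...), leaving comments and
// character literals untouched. Hypothetical helper, simplified grammar.
std::string wrap_string_literals(const std::string& src) {
    // Group 1 collects everything that must be skipped: // comments,
    // /* */ comments and character literals. The last alternative is a
    // plain string literal and is deliberately outside group 1.
    static const std::regex re(
        R"re((//[^\r\n]*|/\*[\s\S]*?\*/|'(?:\\[\s\S]|[^'\\\r\n])+')|"(?:\\[\s\S]|[^"\\\r\n])*")re");

    std::string out;
    auto last = src.cbegin();
    for (std::sregex_iterator it(src.cbegin(), src.cend(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);       // text between matches
        if (m[1].matched)
            out += m.str(0);                // skipped token: copy as-is
        else
            out += "String(" + m.str(0) + ")";
        last = m[0].second;
    }
    out.append(last, src.cend());           // trailing text
    return out;
}
```

As in the JavaScript version, the only logic outside the pattern is the check on group 1.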

Regular expressions can be tricky for beginners, but once you understand the basics and a well-tested divide-and-conquer strategy, they will be your go-to tool.
What you need is to find an opening quote (") and read all characters up to the next quote that is not preceded by a backslash (\).
The regex I came up with is (".*?[^\\]"). See the code snippet below.
std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex re(R"((".*?[^\\]"))");
in_line = std::regex_replace(in_line, re, "String($1)");
std::cout << in_line << std::endl;
Output:
std::cout << String("He said: \"bananas\"") << String("...");
Regex Explanation:
(".*?[^\\]")
Options: Case sensitive; Numbered capture; Allow zero-length matches; Regex syntax only
Match the regex below and capture its match into backreference number 1 (".*?[^\\]")
Match the character “"” literally "
Match any single character that is NOT a line break character (line feed, carriage return) .*?
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
Match any character that is NOT the backslash character [^\\]
Match the character “"” literally "
String($1)
Insert the character string “String” literally String
Insert an opening parenthesis (
Insert the text that was last matched by capturing group number 1 $1
Insert a closing parenthesis )
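One caveat is worth checking before relying on this pattern: because [^\\] has to consume a character, an empty literal "" cannot match on its own, and the match runs on into the next literal. The sketch below applies the pattern unchanged to show both behaviours; the function name wrap_lazy is hypothetical.

```cpp
#include <regex>
#include <string>

// Applies the pattern from this answer as-is.
// Hypothetical helper, for illustration only.
std::string wrap_lazy(const std::string& line) {
    static const std::regex re(R"((".*?[^\\]"))");
    return std::regex_replace(line, re, "String($1)");
}

// wrap_lazy("x = \"\" + \"a\";") yields:  x = String("" + ")a";
// The empty literal and the opening quote of "a" are merged into one match.
```

The example from the question still comes out as shown above, so the limitation only appears once empty literals are present in the input.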

Read the relevant sections of the C++ standard; they are called lex.ccon and lex.string.
Then convert each rule you find there into a regular expression (if you really want to use regular expressions; it might turn out that they are not capable of doing this job).
Then, build more complicated regular expressions out of them. Be sure to name your regular expressions exactly as the rules from the C++ standard, so that you can recheck them later.
If, instead of using regular expressions, you want to use an existing tool, here is one: http://clang.llvm.org/doxygen/Lexer_8cpp_source.html. Have a look at the LexStringLiteral function.

Related

regex_replace is returning empty string

I am trying to remove all characters that are not a digit, a dot (.), or a plus/minus sign (+/-) by replacing them with an empty string, ahead of a float conversion.
When I pass my string through the regex_replace function, I get back an empty string.
I believe something is wrong with my regular expression std::regex reg_exp("\\D|[^+-.]").
Code
#include <iostream>
#include <regex>
int main()
{
std::string temporary_recieve_data = " S S +456.789 tg\r\n";
std::string::size_type sz;
const std::regex reg_exp("\\D|[^+-.]"); // matches not digit, decimal point (.), plus sign, minus sign
std::string numeric_string = std::regex_replace(temporary_recieve_data, reg_exp, ""); //replace the character that are not digit, dot (.), plus-minus sign (+,-) with empty character/string for float conversion
std::cout << "Numeric String : " << numeric_string << std::endl;
if (numeric_string.empty())
{
return 0;
}
float data_value = std::stof(numeric_string, &sz);
std::cout << "Float Value : " << data_value << std::endl;
return 0;
}
I have been trying to evaluate my regular expression on regex101.com for the past 2 days, but I am unable to figure out where it goes wrong. When I just put \D, the editor substitutes non-digit characters properly, but as soon as I add the or condition | for not dot . or plus + or minus - sign, the editor returns an empty string.
The string is empty because your regex matches every character.
\D already matches every character that is not a digit, so the plus, hyphen and period are consumed by it.
The digits in turn get consumed by the negated class [^+-.].
Furthermore, a hyphen between two characters indicates a range inside a character class.
Either escape it or put it at the start or end of the character class.
(Funnily enough, the range +-. used here, code points 43-46, happens to contain the hyphen anyway.)
Remove the alternation with \D and put \d into the negated class:
[^\d.+-]+
See this demo at regex101 (appending + to match one or more characters at once is more efficient)
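Applied to the code from the question, the corrected class could look like the sketch below. The function name keep_numeric_chars is hypothetical; the pattern is the one suggested above.

```cpp
#include <regex>
#include <string>

// Removes every character that is not a digit, '.', '+' or '-'.
// The hyphen is placed last in the class so it is taken literally.
std::string keep_numeric_chars(const std::string& raw) {
    static const std::regex not_numeric(R"([^\d.+-]+)");
    return std::regex_replace(raw, not_numeric, "");
}
```

The result can then be passed to std::stof as in the original program, after the same check for an empty string.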

Using Regex to remove leading/trailing whitespaces, except for quotes

I am trying to write a regular expression which recognises whitespaces from a user input string, except for between quotation marks ("..."). For example, if the user enters
#load "my folder/my files/ program.prog" ;
I want my regex substitution to transform this into
#load "my folder/my files/ program.prog" ;
So far I've implemented the following (you can run it here).
#include <iostream>
#include <string>
#include <regex>
int main(){
// Variables for user input
std::string input_line;
std::string program;
// User prompt
std::cout << ">>> ";
std::getline(std::cin, input_line);
// Remove leading/trailing whitespaces
input_line = std::regex_replace(input_line, std::regex("^ +| +$|( ) +"), "$1");
// Check result
std::cout << input_line << std::endl;
return 0;
}
But this removes whitespaces between quotes too. Is there any way I can use regex to ignore spaces between quotes?
You may add another alternative to match and capture double quoted string literals and re-insert it into the result with another backreference:
input_line = std::regex_replace(
input_line,
std::regex(R"(^ +| +$|(\"[^\"\\]*(?:\\[\s\S][^\"\\]*)*\")|( ) +)"),
"$1$2");
See the C++ demo.
The "[^"\\]*(?:\\[\s\S][^"\\]*)*" part matches a ", then 0+ chars other than \ and ", then 0 or more occurrences of an escaped char (\ followed by any char, matched with [\s\S]) followed by 0+ chars other than \ and ", and finally the closing ".
Note I used a raw string literal R"(...)" to avoid having to escape regex escape backslashes (R"([\s\S])" = "[\\s\\S]").
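Wrapped into a function, the approach might look like this. It is a sketch only: the name squeeze_spaces is hypothetical, and a custom raw-string delimiter (rx) is used so the quotes in the pattern need no escaping.

```cpp
#include <regex>
#include <string>

// Trims leading/trailing spaces and collapses runs of spaces, except
// inside double-quoted strings, which are captured in group 1 and
// re-inserted unchanged. Group 2 keeps one space of each collapsed run.
std::string squeeze_spaces(const std::string& line) {
    static const std::regex re(
        R"rx(^ +| +$|("[^"\\]*(?:\\[\s\S][^"\\]*)*")|( ) +)rx");
    return std::regex_replace(line, re, "$1$2");
}
```

Groups that did not take part in a match expand to nothing in the replacement, which is why "$1$2" works for all four alternatives.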

How to match a string with an opening brace { in C++ regex

I have a question about writing regexes in C++. I have 2 regexes which work fine in Java, but in C++ they throw an error, namely
one of * + was not preceded by a valid regular expression C++
These regexes are as follows:
regex r1("^[\s]*{[\s]*\n"); //Space followed by '{' then followed by spaces and '\n'
regex r2("^[\s]*{[\s]*\/\/.*\n"); // Space followed by '{' then by '//' and '\n'
Can someone help me how to fix this error or re-write these regex in C++?
See basic_regex reference:
By default, regex patterns follow the ECMAScript syntax.
ECMAScript syntax reference states:
characters:
\character
description: character
matches: the character character as it is, without interpreting its special meaning within a regex expression.
Any character can be escaped except those which form any of the special character sequences above.
Needed for: ^ $ \ . * + ? ( ) [ ] { } |
So, you need to escape { to get the code working:
std::string s("\r\n { \r\nSome text here");
regex r1(R"(^\s*\{\s*\n)");
regex r2(R"(^\s*\{\s*//.*\n)");
std::string newtext = std::regex_replace( s, r1, "" );
std::cout << newtext << std::endl;
See IDEONE demo
Also, note how the R"(pattern_here_with_single_escaping_backslashes)" raw string literal syntax simplifies a regex declaration.

How to match line-break in c++ regex?

I tried the following regex:
const static char * regex_string = "([a-zA-Z0-9]+).*";
void find_first(const std::string str);
int main(int argc, char ** argv)
{
find_first("0s7fg9078dfg09d78fg097dsfg7sdg\r\nfdfgdfg");
}
void find_first(const std::string str)
{
std::cout << str << std::endl;
std::regex rgx(regex_string);
std::smatch matcher;
if(std::regex_match(str, matcher, rgx))
{
std::cout << "Found : " << matcher.str(0) << std::endl;
} else {
std::cout << "Not found" << std::endl;
}
}
I expected the regex will be completely correct and the group will be found. But it wasn't. Why? How can I match the line-break in c++ regex? In Java it works fine.
In the ECMAScript syntax that std::regex uses by default, the dot matches any character other than a line terminator:
.   not newline   any character except line terminators (LF, CR, LS, PS).
0s7fg9078dfg09d78fg097dsfg7sdg\r\nfdfgdfg
[a-zA-Z0-9]+ matches until \r ↑___↑ .* would match from here
In many regex flavors there is a dotall flag available to make the dot also match newlines.
If not, there are workarounds in different languages, such as [^] (not nothing) or [\S\s] (any whitespace or non-whitespace character together in one class), which matches any character including \n:
regex_string = "([a-zA-Z0-9]+)[\\S\\s]*";
Or use optional line breaks: ([a-zA-Z0-9]+).*(?:\\r?\\n.*)* or ([a-zA-Z0-9]+)(?:.|\\r?\\n)*
See your updated demo
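To make the difference concrete, here is a small sketch that runs both patterns through std::regex_match. The function names are hypothetical and the subject strings are up to the caller.

```cpp
#include <regex>
#include <string>

// Original pattern: '.' stops at \r and \n, so a subject containing a
// line break cannot be matched as a whole.
bool matches_with_dot(const std::string& subject) {
    static const std::regex re("([a-zA-Z0-9]+).*");
    return std::regex_match(subject, re);
}

// Workaround: [\s\S]* also consumes line breaks. Returns the leading
// alphanumeric run (group 1), or an empty string if there is no match.
std::string first_token_multiline(const std::string& subject) {
    static const std::regex re(R"(([a-zA-Z0-9]+)[\s\S]*)");
    std::smatch m;
    return std::regex_match(subject, m, re) ? m.str(1) : std::string();
}
```

Note that std::regex_match must consume the entire subject, which is why the unmatched \r\n makes the original pattern fail outright instead of producing a partial match.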
Update - Another idea worth mentioning: std::regex::extended
A <period> ( '.' ), when used outside a bracket expression, is an ERE that shall match any character in the supported character set except NUL.
std::regex rgx(regex_string, std::regex::extended);
See this demo at tio.run
You may try const static char * regex_string = "((.|\r\n)*)";
I hope this helps.
I am using CTest and PROPERTIES PASS_REGULAR_EXPRESSION.
[\S\s]* did not work, but (.|\r|\n)* did.
This regex:
Function registered for ID 2 was called(.|\r|\n)*PASS
Matches:
Running test function: RegisterThreeDiffItemsTest04
ID 2 registered for callback
ID 4 registered for callback
ID 11 registered for callback
Function registered for ID 2 was called
ID 2 callback deregistered
ID 4 callback deregistered
ID 11 callback deregistered
Setup: PASS
Note: CMakeLists.txt needs to escape the backslashes:
SET (ANDPASS "(.|\\r|\\n)*PASS")

QRegExp not finding expected string pattern

I am working in Qt 5.2, and I have a piece of code that takes in a string and enters one of several if statements based on its format. One of the formats searched for is the letters "RCV", followed by a variable number of digits, a decimal point, and then one more digit. There can be more than one of these values in the line, separated by "|"; for example, it could be one value like "RCV0123456.1" or multiple values like "RCV12345.1|RCV678.9". Right now I am using the QRegExp class to find this, like this:
QString value = "RCV000030249.2|RCV000035360.2"; //Note: real test value from my code
if(QRegExp("^[RCV\d+\.\d\|?]+$").exactMatch(value))
std::cout << ":D" << std::endl;
else
std::cout << ":(" << std::endl;
I want it to use the if statement, but it keeps going into the else statement. Is there something I'm doing wrong with the regular expression?
Your expression should be, as @vahancho mentioned in a comment:
if(QRegExp("^[RCV\\d+\\.\\d\\|?]+$").exactMatch(value))
If you use C++11, then you can use its raw strings feature:
if(QRegExp(R"(^[RCV\d+\.\d\|?]+$)").exactMatch(value))
Aside from escaping the backslashes, which others have mentioned in answers and comments,
There can be more than one of these values in the line, separated by "|"; for example, it could be one value like "RCV0123456.1" or multiple values like "RCV12345.1|RCV678.9".
[RCV\d+\.\d\|?] may not be doing what you expect. Perhaps you want () instead of []:
/^
[RCV\d+\.\d\|?]+ # More than one of characters from the list:
# R, C, V, a digit, a +, a dot, a digit, a |, a ?
$/x
/^
(
RCV\d+\.\d # RCV, some digits, a dot, followed by a digit
\|? # Optional: a |
)+ # Quantifier of one or more
$/x
Also, maybe you could revise the regex such that the optional | requires the group to be matched *again*:
/^
(RCV\d+\.\d) # RCV, some digits, a dot, followed by a digit
(
\|(?1) # A |, then match subpattern 1 (Above)
)* # Quantifier of zero or more, so a single value still matches
$/x
To check that the line contains only valid occurrences, while also requiring a | before every occurrence after the first (your implementation would not require the | between values):
QString value = "RCV000030249.2|RCV000035360.2"; //Note: real test value from my code
if(QRegExp("^RCV\\d+\\.\\d(\\|RCV\\d+\\.\\d)*$").exactMatch(value))
std::cout << ":D" << std::endl;
else
std::cout << ":(" << std::endl;
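The difference between the character-class form and the grouped form can be checked without Qt. The sketch below swaps in std::regex for QRegExp purely to stay self-contained; std::regex_match plays the role of exactMatch here, and both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Strict form: one RCV value, then zero or more "|"-separated values.
bool is_rcv_list(const std::string& value) {
    static const std::regex re(R"(RCV\d+\.\d(\|RCV\d+\.\d)*)");
    return std::regex_match(value, re);
}

// Character-class form from the question (with backslashes intact):
// accepts any mix of R, C, V, digits, '+', '.', '|' and '?'.
bool is_rcv_chars(const std::string& value) {
    static const std::regex re(R"([RCV\d+\.\d\|?]+)");
    return std::regex_match(value, re);
}
```

Both accept "RCV000030249.2|RCV000035360.2", but only the character-class form also accepts input such as "V?|+.", which is why the grouped pattern is the safer choice.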